Multi-Agent Financial Analysis System

Overview

A full-stack financial research platform for automating due diligence on publicly traded US equities. Structured financial data, SEC filings and macro indicators are ingested, processed by a reproducible deterministic layer, then evaluated by a four-agent LLM pipeline that produces explainable investment reports.

Core principle

Deterministic evidence first, LLM reasoning second. Every agent claim must be traceable to a specific data point in the supplied context block — no external knowledge is permitted.

Evaluation result

~500 runs across 36 tickers (Jan 2020 – Dec 2023) at a total cost of $15.10. Kimi-K2's top-4 portfolio returned +102.4%, beating the S&P 500 benchmark of +80.6% over the same period.

Two-phase backtesting flow

Use Ticker Onboarding to validate and ingest a new symbol across all data pipelines. The Deterministic layer then computes reproducible quantitative signals (DCF, options, sentiment, macro) before any LLM inference begins. Finally, the Agent layer evaluates those signals using a four-agent AutoGen pipeline and generates a structured investment report.

Data Sources

All data is stored in Cloudflare R2, partitioned by domain so each pipeline can evolve and fail independently. Financial statements are parsed via XBRL for machine-readable extraction. Sentiment scores are pre-computed by AlphaVantage to avoid duplicating NLP infrastructure inside the pipeline.

prices

Historical OHLCV candles, partitioned by ticker and date. Used directly by charting, backtest return calculation and Monte Carlo market-price anchoring.

Key pattern: {TICKER}/{YEAR}/{DATE}.jsonl

options

Historical options chains with strikes, expiries, Greeks and implied volatility mappings. Each trading day is a separate .jsonl file (~231 MB per ticker over 3 years).

Key pattern: {TICKER}/{YEAR}/{DATE}.jsonl

sentiment-data

Pre-computed NLP sentiment scores sourced from AlphaVantage, normalised per ticker/time window. Decay-weighted averages favour recent articles for near-term signal quality.

Key pattern: {TICKER}/...

sec-filings

Raw SEC 10-K annual, 10-Q quarterly and 8-K current reports scraped from EDGAR. XBRL extraction provides machine-readable, directly queryable financial statements.

Key pattern: {TICKER}/{TICKER}-{FORM}-{YEAR}/

financials

Structured financial statement data (income statement, balance sheet, cash flow) parsed via XBRL. Sector-aware branching applies equity-earnings valuation to financial-sector companies.

Key pattern: {TICKER}/...

macro

15 FRED macroeconomic time-series including GDP, CPI, Federal Funds Rate, unemployment and VIX. Used for macro regime classification and WACC risk-free rate anchoring at each quarterly snapshot.

Key pattern: {slug}/{slug}.json

reports

Generated investment reports produced by the agent layer, stored as PDFs. Each report includes an executive summary, score breakdown, claim citations, grounding rate and run metadata.

Key pattern: reports/...

File Browser

Direct inspection tool for stored R2 objects. Useful for confirming that a newly onboarded ticker has data in all expected buckets before triggering a backtest run.

Use /filebrowser and /dataviewer to trace a ticker across buckets and verify expected objects exist.

After onboarding, confirm files are present across prices/, options/, financials/ and sentiment-data/ before triggering the deterministic or agent layers. The sentiment viewer also provides a manual inspection interface for individual article scores and relevance classifications.

Chart Viewer

Visualises single-ticker OHLCV candles or multi-ticker comparisons over selected windows. Pricing data also anchors the DCF Monte Carlo to observed market conditions at each quarterly snapshot.

Route: /chartviewer. Data is served from /v1/prices/candles, /v1/prices/compare and /v1/prices/manifest endpoints in the system worker. A 5-day date fallback in the manifest adaptor resolves prices on market holidays and weekends.
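The 5-day fallback can be sketched as a simple backwards walk over candidate dates. The function name and the manifest-as-set shape below are illustrative assumptions, not the worker's actual interface:

```python
from datetime import date, timedelta

def resolve_price_date(manifest, requested, max_back=5):
    """Walk backwards up to `max_back` days to find the nearest trading
    day present in the manifest (handles weekends and market holidays)."""
    for offset in range(max_back + 1):
        candidate = (requested - timedelta(days=offset)).isoformat()
        if candidate in manifest:
            return candidate
    return None  # no trading day within the fallback window

# Example: 2023-07-04 (a market holiday) resolves to the prior session
manifest = {"2023-06-30", "2023-07-03", "2023-07-05"}
print(resolve_price_date(manifest, date(2023, 7, 4)))  # → 2023-07-03
```

Five days is enough to bridge any weekend adjoined by a holiday, which is why a request landing on a Saturday or a long weekend still resolves to the last traded candle.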

Pipelines

Eight independent ETL pipelines, each responsible for a specific data category. Pipeline failures are isolated — a rate limit on one source cannot cascade to others. All pipelines stream live logs to the UI over SSE and can be re-run independently without reprocessing the full dataset.

ticker_onboarding

Master orchestrator for initial data population. Validates the ticker via Finnhub, registers it in the database, then sequentially triggers the Pricing, Options, SEC Filings, Financials, Sentiment and Profiles pipelines. Takes ~25–30 minutes per ticker.

Output: fully onboarded ticker across all data buckets, immediately eligible for deterministic backtesting.

pricing

Acquires and aggregates OHLCV candlestick data across a user-specified time range. Builds a manifest.json for date-based price resolution with a 5-day fallback for holidays and weekends.

Output: queryable candles for charting, true return calculation and Monte Carlo market-price anchoring.

sec_filings

Fetches and parses 10-K, 10-Q and 8-K filings from SEC EDGAR. Preprocesses documents to remove legal boilerplate, reducing token count by an estimated 50–70% per filing.

Output: raw filing corpus (XBRL + PDF) for financial extraction and future forward-looking statement analysis.

financials

Extracts structured statement data from XBRL-formatted filings. Applies sector-aware branching: equity-earnings valuation for financial companies, standard DCF inputs for all others.

Output: balance sheet / income / cash flow data stored in Postgres for direct agent consumption.

options

Ingests historical options chain data including Greeks (Delta, Gamma, Theta, Vega) and constructs an implied volatility mapping per ticker. Most data-intensive pipeline — 252 files per year.

Output: contract-level options datasets for IV surface analysis and put/call imbalance signal extraction.

sentiment

Processes AlphaVantage pre-computed NLP sentiment scores for news articles associated with each ticker. Only articles meeting a relevance threshold are included, to suppress noise.

Output: decay-weighted sentiment scores normalised to [-1, 1] for consumption by the deterministic aggregator.

macro

Fetches 15 FRED macroeconomic indicators including GDP, CPI, unemployment and VIX. Non-ticker-specific — shared across all tickers at each quarterly snapshot.

Output: macro time-series used for regime classification and WACC risk-free rate derivation.

profiles

Builds aggregated company profile views from financial statement and filing data.

Output: unified profile snapshots and overview metrics for dashboard display.

Start with ticker_onboarding for any new symbol. It validates the identifier, registers it in the corpus and chains all ingestion pipelines automatically — a ticker is fully eligible for backtesting in a single ~25–30 minute run. Rate limiting is handled via exponential-backoff retry logic; data is aggressively cached to suppress duplicate requests.
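The retry behaviour can be sketched as exponential backoff with jitter. The function name and the retried exception type are illustrative, not the pipeline's actual code:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry a rate-limited call, doubling the delay on each attempt
    and adding jitter so parallel workers don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

Combined with aggressive caching, this keeps a transient rate limit on one upstream source from stalling the whole onboarding chain.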

Deterministic Layer

The reproducible analytical core. Operates across all corpus tickers on a quarterly snapshot basis — naturally aligned with public company reporting cycles. Each run is assigned a UUID and written to the run_registry table, allowing any run to be fully reconstructed and compared. All computations use fixed random seeds.

Macro Regime Classification

Classifies each quarterly snapshot into a regime label (e.g. expansionary, stagflation) with a confidence score from 15 FRED series. Regime confidence re-weights all downstream signals — a bullish DCF signal in a high-inflation regime receives a lower weight than under an expansionary backdrop.
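A minimal sketch of confidence-weighted dampening; the regime labels, dampening factors and blending rule here are assumptions for illustration, not the system's actual values:

```python
def reweight_signal(signal, regime, confidence):
    """Dampen a signal under adverse regimes, in proportion to how
    confident the regime classification is. At confidence 0 the signal
    passes through unchanged."""
    dampening = {"expansionary": 1.0, "stagflation": 0.6, "contraction": 0.7}
    factor = dampening.get(regime, 0.85)  # default for unlisted regimes
    # blend toward full weight as classification confidence drops
    return signal * (confidence * factor + (1.0 - confidence))
```

Under this scheme a bullish DCF signal of 0.8 in a high-confidence stagflation call is scaled well below the same signal under an expansionary backdrop.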

DCF Agent

Runs a 1,000-path Monte Carlo DCF simulation per ticker-snapshot pair, producing intrinsic value estimates at the 10th, 50th and 90th percentile. WACC is anchored to the prevailing FRED 10-year treasury yield at each snapshot date to avoid systemic bias during rate-movement periods like 2022.
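A simplified sketch of the per-pair simulation. The growth and risk-premium distributions and the terminal-growth rate below are illustrative assumptions; the real model's inputs come from the financials pipeline:

```python
import random

def dcf_monte_carlo(fcf0, rf_rate, n_paths=1000, years=5, seed=42):
    """Simplified Monte Carlo DCF: sample FCF growth and a risk premium
    per path, discount projected cash flows plus a terminal value.
    A fixed seed keeps runs reproducible, as in the deterministic layer."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_paths):
        growth = rng.gauss(0.05, 0.02)          # sampled FCF growth (assumption)
        wacc = rf_rate + rng.gauss(0.05, 0.01)  # WACC anchored to the risk-free rate
        pv = sum(fcf0 * (1 + growth) ** t / (1 + wacc) ** t
                 for t in range(1, years + 1))
        terminal = fcf0 * (1 + growth) ** years * 1.02 / (wacc - 0.02)
        values.append(pv + terminal / (1 + wacc) ** years)
    values.sort()
    # 10th / 50th / 90th percentile intrinsic-value estimates
    return [values[int(n_paths * q)] for q in (0.10, 0.50, 0.90)]
```

Because WACC is built from the snapshot's risk-free rate, a rising-rate regime like 2022 mechanically compresses the intrinsic-value distribution instead of being ignored.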

Options Agent

Loads a 63-trading-day window of options data and extracts two primary signals: aggregate implied volatility level (market uncertainty) and put/call imbalance (directional positioning of sophisticated market participants). Complements DCF with forward-looking market sentiment absent from historical filings.

Sentiment Agent

Computes a decay-weighted composite of AlphaVantage NLP sentiment scores over the snapshot window, applying higher weight to recent articles. Normalised to [-1, 1] for consistent aggregation with DCF and options signals.
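The decay weighting can be sketched as an exponential half-life average; the half-life value and the (age, score) input shape are illustrative assumptions:

```python
import math

def decay_weighted_sentiment(articles, half_life_days=14.0):
    """Exponential-decay composite of per-article scores in [-1, 1].
    `articles` is a list of (age_in_days, score) pairs; recent articles
    carry more weight."""
    if not articles:
        return 0.0
    decay = math.log(2) / half_life_days
    weights = [math.exp(-decay * age) for age, _ in articles]
    total = sum(w * s for w, (_, s) in zip(weights, articles))
    return total / sum(weights)  # weighted mean stays inside [-1, 1]
```

A fresh strongly positive article therefore dominates a stale negative one, which is exactly the near-term signal-quality property described above.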

Signal Normalisation

All agent outputs are mapped to a common [-1, 1] scale via a scaled hyperbolic tangent transformation (signal = tanh(upside / 0.3)). This preserves proportionality across the full range of upside values while preventing extreme valuations from dominating the aggregation step.
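The transformation above in code form:

```python
import math

def normalise_signal(upside, scale=0.3):
    """signal = tanh(upside / 0.3): near-linear for modest upsides,
    saturating toward ±1 for extreme valuations."""
    return math.tanh(upside / scale)
```

A 30% upside maps to tanh(1) ≈ 0.76, while a 500% upside saturates near 1.0 rather than swamping the weighted aggregation.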

Deterministic Aggregator

Combines DCF (0–30 pts), Options (0–20 pts), Sentiment (0–20 pts) and Macro Regime Fit (0–25 pts) into a single pre-score per ticker-snapshot pair. Directional conflicts between agents are flagged explicitly. The resulting base score is stored in Postgres and acts as a quantitative anchor for the agent layer.
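A sketch of the weighting, assuming each component arrives normalised to [-1, 1] and is mapped linearly onto its point budget; the linear mapping and the conflict threshold are assumptions, only the point budgets come from the description above:

```python
def aggregate_pre_score(dcf, options, sentiment, macro_fit):
    """Combine normalised [-1, 1] signals into a pre-score using the
    component budgets above (DCF 30, options 20, sentiment 20, macro 25).
    Also flags directional conflicts between components."""
    weights = {"dcf": 30, "options": 20, "sentiment": 20, "macro": 25}
    signals = {"dcf": dcf, "options": options,
               "sentiment": sentiment, "macro": macro_fit}
    # map each signal from [-1, 1] onto [0, budget] and sum
    score = sum(weights[k] * (signals[k] + 1) / 2 for k in weights)
    # conflict: at least one clearly bullish and one clearly bearish component
    conflict = max(signals.values()) > 0.25 and min(signals.values()) < -0.25
    return round(score, 1), conflict
```

Note the budgets total 95 points, so even a unanimously bullish deterministic pass leaves headroom below 100 for agent-layer adjustments.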

The DCF step is parallelised across ticker-snapshot pairs via a configurable ThreadPoolExecutor (default 4 workers) as it is the highest-latency step — 1,000 Monte Carlo paths per pair. All other agents execute sequentially to prevent race conditions on shared Postgres tables. Estimated token cost is surfaced to the user before any LLM inference begins.
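The fan-out can be sketched as follows; function names and the pair shape are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_dcf_batch(pairs, dcf_fn, max_workers=4):
    """Run the highest-latency step concurrently across ticker-snapshot
    pairs, mirroring the configurable worker pool described above.
    `pool.map` preserves input order, so results zip back onto pairs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(pairs, pool.map(dcf_fn, pairs)))
```

Each worker handles one independent pair, so no shared Postgres state is touched concurrently; the sequential agents downstream keep that guarantee for the rest of the run.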

Agent Layer (LLM)

A four-agent AutoGen pipeline operating in single-turn sequential mode. Each agent builds on its predecessor's output. No agent has access to information beyond the supplied TickerContext JSON block (~30,000 tokens on average), ensuring every claim is traceable to a specific data point.

DCF Analyst (Agent 1)

Evaluates intrinsic value and fundamental analysis from the Monte Carlo output. Explicitly prohibited from using any knowledge not present in the supplied context block — preventing the model from regurgitating training data. Outputs a structured thesis with cited bull and bear cases, each referencing a specific snapshot and source file.

Sentiment & Risk Analyst (Agent 2)

Independently assesses options flow, sentiment trajectory and macro regime signal without access to the DCF Analyst's conclusions. Independence is a deliberate design constraint: because this second opinion is uncontaminated by the DCF thesis, explicit divergences can be surfaced downstream, for example when DCF implies undervaluation but options positioning is defensively skewed.

Research Referee (Agent 3)

Verifies every factual claim made by the preceding agents against the raw context block, assigning each a status of VERIFIED, ADJUSTED, UNSUPPORTED or CONTRADICTED. Rule-based scoring (bounded to [-5, +5] per component) is applied based on verification outcomes. The grounding rate — fraction of claims verified or adjusted — is computed here and surfaced in the final report.
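The grounding-rate computation is straightforward given per-claim statuses:

```python
from collections import Counter

def grounding_rate(claim_statuses):
    """Fraction of claims the referee marked VERIFIED or ADJUSTED,
    as surfaced in the final report."""
    if not claim_statuses:
        return 0.0
    counts = Counter(claim_statuses)
    return (counts["VERIFIED"] + counts["ADJUSTED"]) / len(claim_statuses)

statuses = ["VERIFIED", "VERIFIED", "ADJUSTED", "UNSUPPORTED", "CONTRADICTED"]
print(grounding_rate(statuses))  # → 0.6
```

UNSUPPORTED and CONTRADICTED claims drag the rate down, so a low grounding rate in a report is a direct signal that the narrative outran the evidence.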

Aggregator (Agent 4)

Synthesises referee-verified claims into the final investment report. Strictly prohibited from introducing new claims. Applies referee adjustments to the base score and computes a cross-signal alignment adjustment (bounded to [-15, +15]). Final score is clamped to [0, 100] and mapped to a recommendation band: Strong Buy (≥78), Buy (≥62), Neutral (≥45), Reduce (≥30) or Avoid (<30).
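The clamp and band mapping described above, as code:

```python
def to_recommendation(score):
    """Clamp the final score to [0, 100] and map it onto the
    recommendation bands: Strong Buy (>=78), Buy (>=62),
    Neutral (>=45), Reduce (>=30), Avoid (<30)."""
    score = max(0.0, min(100.0, score))
    if score >= 78:
        return "Strong Buy"
    if score >= 62:
        return "Buy"
    if score >= 45:
        return "Neutral"
    if score >= 30:
        return "Reduce"
    return "Avoid"
```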

Multi-Model Support

All agents are model-agnostic. Supported backends: GPT-4o, GPT-4.5-nano (OpenAI) and Qwen3 235B, Kimi-K2, Llama 3.3 70B, MiniMax M2 (via OpenRouter). All agents operate at temperature 0 to minimise output variance across runs. Model identifier, token consumption, latency and cost are stored per run for cross-model comparison.

Claim Verification & Grounding Rate

Every agent claim carries a direct citation to the underlying data point (snapshot date and source file). The grounding rate is computed as the fraction of VERIFIED or ADJUSTED claims and is included in the generated report as a transparent measure of evidence quality.

The intended operating model is: deterministic base score → referee-verified claim adjustments → aggregator synthesis. The base score is fully independent of the LLM used, which isolates each model's contribution and enables direct cross-model comparison on identical inputs. All results are written to the llm_results Postgres table alongside model identifier, token consumption, latency, cost and grounding rate.

Evaluation & Results

Evaluated over a three-year backtesting window (January 2020 – December 2023) across 36 tickers selected from major US indices for sector, volatility and return-profile diversity. The primary benchmark is the S&P 500 (SPY, +80.6% over the period).

Total runs

~500

Across GPT-4o, Llama 3.3 70B, Qwen3 235B and Kimi-K2

Total cost

$15.10

~$0.031 average per run across all model configurations

Directional accuracy

63%

Agent-layer score adjustment matched subsequent return direction (vs 50% random)

Best top-4 return

+102.4%

Kimi-K2 top-4 portfolio vs S&P 500 benchmark of +80.6% over same period

Best spread

+38.5%

Kimi-K2 top-4 avg minus bottom-4 avg — the primary model quality metric

Models with positive spread

3 of 4

GPT-4o was the only model producing a negative top/bottom spread

Score adjustment analysis

When the agent layer raised the base score and the stock then rose, the median 3-year return was +111.5% (n=37). When the score was raised but the stock fell, the median was -25.0% (n=19) — consistent with DCF's structural bias against high-growth, low free cash flow companies. Directional accuracy of 63% is meaningfully above random chance and supports the portfolio-level spreads observed across models.

Cross-model behaviour

GPT-4o and Kimi-K2 produced more dispersed score distributions (full 0–100 range). Llama 3.3 70B and Qwen3 235B clustered conservatively in the 40–75 range. The same tickers appeared consistently in the high-scoring cluster across all models (LLY, RTX, JPM, XOM, NEM, ORCL) and the low-scoring cluster (AMT, NKE, ALB, PYPL), indicating the deterministic base is the primary driver of rank ordering regardless of which model performs final synthesis.

Known limitations

DCF's structural bias against high-growth, negative-FCF companies (e.g. early-stage META) is the most consequential limitation — a negative DCF signal is mathematically difficult for the options and sentiment components to overcome given DCF's 30-point weighting. Forward-looking statements from SEC filings (earnings guidance, management growth outlook) are not currently processed. The evaluation universe of 36 tickers and the COVID-dominated 2020–2023 window limit statistical robustness relative to broader market conditions.

Full System Architecture

High-level deployment topology and end-to-end data flow through both backtesting phases.

┌─────────────────────────────────────────────────────┐
│                  Next.js Frontend                   │
│              http://localhost:3000                  │
└────────────┬────────────────────────────────────────┘
             │
             ▼
┌────────────────────────────────────────────────────────────────────┐
│                    Cloudflare Workers (Deployed)                  │
│  dataset.markcallan101.workers.dev  — R2 data access/downloads    │
│  pipeline.markcallan101.workers.dev — pipeline orchestration      │
│  system.markcallan101.workers.dev   — SSE proxy + report gen      │
└────────────┬───────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────┐
│          FastAPI Backend (Python / VPS)             │
│   localhost:8000 (dev) · 46.62.228.197 (hosted)     │
└────────┬───────────────────┬────────────────────────┘
         │                   │
         ▼                   ▼
┌─────────────────┐  ┌───────────────────────┐
│  Supabase       │  │  Cloudflare R2        │
│  Auth + Meta    │  │  Datasets + Reports   │
└─────────────────┘  └───────────────────────┘

Backtesting pipeline flow
─────────────────────────
R2 Buckets → Deterministic Layer → Postgres (agent_signals)
           → Agent Layer (AutoGen) → Postgres (llm_results)
           → Report Generator → R2 (reports/) → Dashboard

Workers separate concerns cleanly: the dataset worker handles R2 object access and downloads; the pipeline worker orchestrates ingestion and exposes report utilities; the system worker proxies deterministic and agent layer SSE streams and hosts the report generation endpoint at /v1/reports/generate. The FastAPI backend runs on a Hetzner VPS and contains all core business logic. Supabase manages authentication and metadata; Cloudflare R2 stores all datasets and generated PDF reports.