Every year, thousands of publicly traded companies file 10-Ks with the SEC. Each one is a dense, multi-hundred-page document packed with financial statements, risk disclosures, segment breakdowns, footnotes, and management commentary. For investment analysts and quantitative researchers, these filings are raw material, serving as the foundation for models, screens, and investment theses.
And yet, the path from a 10-K filing to a usable, structured dataset remains remarkably manual at most firms.
The gap between filing and insight
Most financial research teams interact with SEC filings through one of two channels: terminal-based data providers that have already extracted and normalized the numbers, or raw EDGAR downloads that require significant processing before they're useful.

The first path is convenient but limiting. Pre-extracted data covers standard line items like revenue, net income, and EPS, but strips away the context, footnotes, and segment-level detail that differentiate a surface-level screen from a genuine research edge. When an analyst needs non-standard fields like customer concentration percentages, lease obligation breakdowns, or segment-level capital expenditure, the terminal often comes up empty.
The second path offers completeness but demands enormous effort. A raw 10-K is filed as HTML, XHTML, or increasingly iXBRL, but the visual presentation varies wildly across filers. Tables that look clean to a human reader, complete with nested headers, merged cells, and hierarchical line items, are structurally complex in markup. Financial statements routinely span multiple periods with interleaved columns, use parenthetical negatives, and embed footnote references mid-cell.
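To make the cell-level messiness concrete, here is a minimal Python sketch of the kind of normalization a single value needs before it becomes a number. The helper name and the exact rules are illustrative, not drawn from any particular library:

```python
import re

def parse_financial_cell(raw: str) -> float | None:
    """Best-effort conversion of a 10-K table cell into a signed number."""
    text = raw.strip()
    if text in {"", "-", "—", "–"}:                  # dashes and blanks mean "no value"
        return None
    # Drop trailing footnote markers such as "12,345(1)" -> "12,345"
    text = re.sub(r"(?<=\d)\s*\(\d{1,2}\)$", "", text).strip()
    # Parenthesized amounts are negatives: "(1,234)" -> -1234
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace("$", "").replace(",", "").strip()
    try:
        value = float(text)
    except ValueError:
        return None                                   # narrative text, not a number
    return -value if negative else value

assert parse_financial_cell("(1,234)") == -1234.0
assert parse_financial_cell("$12,345(1)") == 12345.0
assert parse_financial_cell("—") is None
```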
This is where most automation attempts stall, and where the real complexity of financial document extraction begins to surface.
Why traditional extraction breaks on financial filings
The instinct is to reach for a general-purpose parser: pull the HTML, find the tables, map headers to values. In practice, this approach hits a wall almost immediately.
Consider a typical consolidated balance sheet. The document presents assets and liabilities in a hierarchical tree, with current assets nested under total assets and individual line items like "accounts receivable, net" sitting two or three levels deep. The visual indentation communicates structure, but that hierarchy is rarely encoded in the markup. A parser that reads the table as a flat grid will mash "Cash and cash equivalents" into the same structural level as "Total stockholders' equity."
Income statements introduce a different problem. Multi-period layouts present three years of data side by side, often with percentage-change columns interspersed. Merged header cells span fiscal years, and subtotals appear at irregular intervals. If the parser doesn't recognize that "Cost of revenue" belongs under the revenue section rather than standing alone, the extracted data loses its meaning.
Then there are the footnotes. A significant portion of the analytical value in a 10-K lives outside the primary financial statements, in the notes to the financials, MD&A, and risk factor disclosures. These sections contain tables embedded in narrative text, cross-references to other sections, and data presented in formats that resist simple tabular extraction. Lease schedules, debt maturity profiles, and segment revenue breakdowns often appear only in footnotes, and they rarely conform to a standard template.
Building an automated research pipeline
An effective pipeline from 10-K to dataset needs to solve several problems in sequence, and each one matters.
Document acquisition and normalization. EDGAR provides filings in multiple formats, and the same company might file differently year over year. The pipeline needs to handle HTML, XHTML, and iXBRL inputs and normalize them into a consistent representation before any extraction begins. This includes resolving embedded images, handling Unicode edge cases, and reconstructing reading order from markup that doesn't always flow linearly.
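A rough sketch of the acquisition step, assuming EDGAR's public submissions endpoint keeps its current JSON shape (the field names and URL patterns below reflect that assumption and may change):

```python
import requests

# SEC asks for a descriptive User-Agent with contact details on automated requests
HEADERS = {"User-Agent": "research-pipeline demo contact@example.com"}

def latest_10k_url(cik: int) -> str:
    """Return the primary-document URL for a company's most recent 10-K."""
    padded = f"{cik:010d}"
    sub = requests.get(
        f"https://data.sec.gov/submissions/CIK{padded}.json",
        headers=HEADERS, timeout=30,
    ).json()
    recent = sub["filings"]["recent"]
    for form, accession, doc in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"]
    ):
        if form == "10-K":
            return (
                "https://www.sec.gov/Archives/edgar/data/"
                f"{cik}/{accession.replace('-', '')}/{doc}"
            )
    raise ValueError(f"No 10-K found for CIK {cik}")

# Example: Apple Inc. files under CIK 320193
# print(latest_10k_url(320193))
```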
Layout-aware parsing. Financial tables can't be extracted as flat grids. The system needs to understand visual hierarchy, including indentation levels, bolded subtotals, and merged headers that span columns, and map those visual cues to structural relationships. A "Total current assets" row isn't just another number; it's a parent node in an implicit tree, and every line item above it up to the previous subtotal is a child.
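One deliberately simplified way to recover that implicit tree, keying only on the "Total ..." convention described above and ignoring the indentation and styling cues a real parser would also use:

```python
from dataclasses import dataclass, field

@dataclass
class LineItem:
    label: str
    value: float | None
    children: list["LineItem"] = field(default_factory=list)

def attach_subtotals(rows: list[LineItem]) -> list[LineItem]:
    """Group each run of line items under the subtotal row that follows it.

    Every item since the previous subtotal becomes a child of the next
    "Total ..." row, turning a flat grid into one level of hierarchy.
    """
    result: list[LineItem] = []
    pending: list[LineItem] = []
    for row in rows:
        if row.label.lower().startswith("total"):
            row.children = pending
            pending = []
            result.append(row)
        else:
            pending.append(row)
    result.extend(pending)          # trailing items with no closing subtotal
    return result
```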

Schema-guided extraction. Research workflows require specific fields, not generic table dumps. A quant team building a working capital model needs current assets, current liabilities, accounts receivable, inventory, and accounts payable, all extracted consistently across hundreds of filers with different formatting conventions. This demands a schema-driven approach where the extraction target is defined upfront and the system maps source content to the target structure, handling the fact that "Inventories" at one company and "Merchandise inventories, net" at another refer to the same concept.
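A toy version of that mapping, with a hand-written alias table standing in for whatever concept-matching a production system would actually use:

```python
# Illustrative schema: canonical field names mapped to the label variants
# different filers use for the same concept.
WORKING_CAPITAL_SCHEMA = {
    "accounts_receivable": {"accounts receivable, net", "receivables, net", "trade receivables"},
    "inventories": {"inventories", "merchandise inventories, net", "inventory, net"},
    "accounts_payable": {"accounts payable", "trade payables"},
}

def map_to_schema(extracted_rows: dict[str, float]) -> dict[str, float]:
    """Map filer-specific line-item labels onto canonical schema fields."""
    normalized = {label.strip().lower(): value for label, value in extracted_rows.items()}
    mapped: dict[str, float] = {}
    for canonical, aliases in WORKING_CAPITAL_SCHEMA.items():
        for alias in aliases:
            if alias in normalized:
                mapped[canonical] = normalized[alias]
                break
    return mapped
```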
Cross-section linking. A single data point in a 10-K often requires context from multiple sections. Revenue figures in the income statement connect to segment disclosures in the footnotes, which connect to risk factors in the MD&A. An automated pipeline that only extracts tables in isolation misses these connections. Effective systems preserve document-level context so that downstream consumers, whether human analysts or LLMs, can trace any value back to its source and surrounding narrative.
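One lightweight way to preserve that linkage is to never pass a bare number downstream: every extracted value carries a pointer back to its source. The field names here are illustrative, not a fixed standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    """Pointer back to where a value came from in the filing."""
    accession_number: str      # identifies the filing
    section: str               # e.g. "Item 8 - Financial Statements"
    table_id: str
    row_label: str
    column_header: str

@dataclass(frozen=True)
class ExtractedValue:
    field: str                 # canonical schema field, e.g. "segment_revenue_cloud"
    value: float
    unit: str                  # e.g. "USD thousands"
    citation: Citation

# Downstream consumers (spreadsheets, models, LLM prompts) carry the citation
# along with the number, so any figure can be traced back to its source cell.
```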
Temporal consistency. Financial research is inherently longitudinal. The same pipeline must produce consistent output across five, ten, or twenty years of filings from the same company, even as that company changes its reporting format, adopts new accounting standards, or restructures its segments. This is where extraction systems that rely on rigid templates fail: the template that works for a 2024 filing silently breaks on a 2019 filing that used a different table layout.
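A small consistency check along these lines can catch that silent degradation: run the same schema across every year of a company's filings and flag the years where required fields failed to resolve. A sketch, assuming a panel keyed by fiscal year:

```python
def coverage_report(
    panel: dict[int, dict[str, float]], required: set[str]
) -> dict[int, set[str]]:
    """Flag fiscal years where required schema fields failed to resolve.

    `panel` maps fiscal year -> canonical fields extracted from that year's
    10-K for one company. A gap in an otherwise continuous series usually
    means a presentation change quietly broke the extraction.
    """
    return {
        year: missing
        for year, fields in sorted(panel.items())
        if (missing := required - fields.keys())
    }

# An empty report means every required field resolved in every year:
# coverage_report(panel_for_company, {"inventories", "accounts_receivable"})
```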
What changes when extraction actually works
When the pipeline from filing to dataset is reliable, the downstream impact compounds quickly.
Analysts stop spending hours copying numbers from PDFs into spreadsheets and start spending that time on analysis. Quant teams can backtest signals across thousands of filers without manually auditing each extraction. Risk models can incorporate non-standard data like contingent liabilities, related-party transactions, and off-balance-sheet exposures that were previously too expensive to extract at scale.
Perhaps most importantly, the data becomes auditable. When every extracted value carries a citation back to the exact location in the source document, the research process gains a layer of traceability that manual workflows never had. An analyst can click from a number in a model to the specific cell in the specific table on the specific page of the specific filing it came from. In regulated environments, this level of traceability is not a nice-to-have but a hard requirement.

The LLM dimension
The rise of LLM-powered research tools makes reliable extraction even more critical. Large language models are increasingly used to summarize filings, compare disclosures across companies, and answer natural-language questions about financial data, but these models are only as good as the data they receive.
Feed an LLM a raw 10-K PDF and ask it to extract the debt maturity schedule, and you'll get plausible-looking numbers that may or may not match the source. The model will confidently produce output regardless of whether it correctly parsed the table structure. In financial research, "mostly right" is often worse than "no answer" because a hallucinated number that looks reasonable can flow into a model undetected.
The solution is not to avoid LLMs but to feed them structured, validated data extracted by systems that are purpose-built for document understanding. When an LLM operates on accurately extracted, schema-conformant data with source citations intact, it shifts from a liability to a genuine accelerator.
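As a sketch of what that looks like in practice, reusing the illustrative ExtractedValue type from earlier: instead of handing the model raw filing text, render the validated values, with their citations, into the context it answers from.

```python
def build_grounded_context(values: list["ExtractedValue"]) -> str:
    """Render validated, cited extractions into a context block for an LLM.

    The model answers only from structured values it can quote, and every
    number in its answer can be checked against the citation beside it.
    """
    lines = []
    for v in values:
        lines.append(
            f"{v.field} = {v.value:,.0f} {v.unit} "
            f"[source: {v.citation.accession_number}, {v.citation.section}, "
            f"table {v.citation.table_id}, row '{v.citation.row_label}']"
        )
    return "Answer using only the figures below and cite their sources.\n" + "\n".join(lines)
```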
Where this is going
The volume of financial documents is only increasing, with EDGAR alone adding tens of thousands of filings each quarter. Private markets generate even more, including pitch decks, CIMs, and quarterly reports from portfolio companies, with no standardized format at all. The firms that build reliable, automated pipelines from document to dataset now will have a structural advantage as the surface area of available information continues to expand.
The technology to do this at production quality exists today, and the question for research teams is whether they build it in-house, investing in layout models, table parsers, schema engines, and citation systems, or partner with a platform that has already solved these problems across billions of pages.
Either way, the era of copying numbers out of PDFs is ending, and the firms that recognize it first will move faster, cover more ground, and make better decisions with the same team.