
The Silent Failures in Financial Document Extraction

Sid and Ritvik
March 17, 2026

Every document extraction system works on clean inputs. The real question is what happens when the input stops being clean, and in financial documents, that happens constantly. Not because the documents are poorly made, but because financial reporting encodes an enormous amount of meaning through conventions that were designed for human readers and have never been formalized for machines.

Most writing about document AI focuses on the headline problems: table detection, OCR accuracy, multi-page continuity. Those matter, but they are well understood. The failures that actually persist in production tend to be subtler: they live in the formatting choices that carry meaning, in the inconsistencies across documents that describe the same underlying data, and in the long tail of filing-level weirdness that no benchmark dataset has ever captured.

Formatting That Carries Meaning

Financial documents use visual formatting to communicate information that never appears in the text itself. This is the single largest category of silent extraction failure, because a system can extract every character perfectly and still produce structurally wrong output if it doesn't interpret what the formatting means.

Layout-level bounding boxes on the cover page of KKR's 10-Q filing, showing structural segmentation of headers, tables, and text blocks.

Parenthetical negatives are the most common example. In financial statements, $(1,234) means negative 1,234, but the parentheses are a typographic convention, not a mathematical operator. Systems that pass the raw text downstream without conversion will produce strings that parsers may read as positive numbers or formatting artifacts, and the problem compounds when the same document uses parentheses for negatives in some tables and minus signs in others.

Bold and indentation hierarchies encode the structure of financial statements in ways that are invisible to text-only extraction. A bold row typically indicates a subtotal, an indented line item indicates a sub-category, and a double-indented item indicates a component of that sub-category. These relationships define what the numbers actually represent, and an extraction system that treats every row as a flat list has lost the financial logic of the statement entirely.

Superscript footnote markers modify the meaning of a specific cell without being part of the cell's value. A revenue figure with a superscript "1" might mean that the number excludes a discontinued operation, reflects a restatement, or is denominated in a different currency than the rest of the table. Stripping the superscript or failing to link it to the corresponding footnote silently discards information that changes the interpretation of the number.

Unit and currency placement varies across filers in ways that create ambiguity. Some companies place the dollar sign in every cell, others only in the header. Some tables specify units ("in thousands" or "in millions") in a subtitle above the table, others embed the unit in the column header. An extraction system that doesn't propagate unit context from wherever it appears to every cell it governs will produce numbers that are off by orders of magnitude.

Cross-Document Inconsistencies

Financial data rarely lives in a single document. The same company's revenue might appear in a 10-K, a quarterly earnings press release, an investor presentation, and a proxy statement, each reporting the same underlying figures in ways that make automated reconciliation genuinely difficult.

Rounding conventions differ across document types. A 10-K might report revenue as $4,827,341 thousand while the corresponding press release rounds to $4.8 billion. Both are correct, but an extraction system that treats them as independent data points without normalizing will produce apparent discrepancies that don't actually exist.
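
Reconciliation across rounding conventions is tractable once each figure carries its own granularity: two figures are consistent when their rounding intervals overlap. A sketch (the function is illustrative, not from any library):

```python
def consistent_under_rounding(a: float, grain_a: float, b: float, grain_b: float) -> bool:
    """True when two figures could be roundings of the same underlying number.

    a and b are in the same absolute units; grain_* is each figure's
    rounding granularity (e.g. 1_000 for a statement in thousands,
    100_000_000 for a press release quoted to the nearest $0.1 billion).
    Each true value lies within half a grain of its reported figure,
    so the intervals overlap iff |a - b| <= (grain_a + grain_b) / 2.
    """
    return abs(a - b) <= (grain_a + grain_b) / 2
```

Under this check, $4,827,341 thousand and "$4.8 billion" reconcile, while "$4.9 billion" would correctly flag as a genuine discrepancy.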

Restatements create real discrepancies that do exist but are easy to miss. A company might restate prior-year revenue in its current 10-K to reflect a change in accounting policy, which means the FY2023 figure in the FY2024 filing differs from the FY2023 figure in the original FY2023 filing, even though both were correct at the time they were published. An extraction pipeline that pulls from both without tracking the restatement will have two conflicting values and no way to know which is current.
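
The usual remedy is to key every observation by both the fiscal period it describes and the filing that reported it, rather than by period alone. A minimal sketch of that "as reported" storage pattern (class and method names are ours):

```python
from datetime import date

class AsReportedStore:
    """Keep every observation of a metric keyed by the filing that reported it.

    Storing (fiscal_period, filing_date) -> value preserves both the
    original figure and any later restatement; 'latest' answers the most
    common question without discarding history.
    """
    def __init__(self) -> None:
        self._obs: dict[str, list[tuple[date, float]]] = {}

    def record(self, fiscal_period: str, filed: date, value: float) -> None:
        self._obs.setdefault(fiscal_period, []).append((filed, value))

    def latest(self, fiscal_period: str) -> float:
        # Most recently filed value wins: restatements supersede originals
        return max(self._obs[fiscal_period])[1]

    def history(self, fiscal_period: str) -> list[tuple[date, float]]:
        return sorted(self._obs[fiscal_period])
```

With this shape, the FY2023 figure from the original filing and the restated figure from the FY2024 10-K coexist, and which one is "current" is a query, not an overwrite.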

Temporal misalignment adds another layer. A press release issued in January 2025 reports preliminary Q4 2024 results, but the 10-K filed in February reports the full fiscal year with slightly different Q4 numbers because of year-end adjustments. Both documents exist in the corpus, and an extraction pipeline needs to understand which takes precedence for which purpose.

The Long Tail of Filing Weirdness

Beyond formatting and cross-document issues, there is a long tail of filing-level edge cases that no benchmark captures because they are too varied to generalize, but they appear constantly in production.

Tables rendered as images. Some filers embed tables as images within PDFs rather than as structured text, particularly in older filings or international submissions. The system needs to detect the image, run OCR, and reconstruct table structure from the output, which introduces a completely different set of accuracy challenges than extracting from native text.

XBRL tags that contradict the visual layout. In iXBRL filings, each figure is tagged with a machine-readable label. In theory, this should make extraction trivial. In practice, filers tag inconsistently, reuse templates from prior years without updating, and create divergence between the visual structure and the tagged structure that forces the extraction system to decide which source of truth to follow.

Null value ambiguity. A blank cell might mean zero, not applicable, or not reported. A dash might mean zero or not applicable depending on the filer. "N/M," "NM," "n/a," "N/A," and "—" all carry different connotations in different contexts, and the extraction system needs to either normalize them consistently or preserve enough context for downstream systems to interpret correctly.
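
One defensible approach is to classify non-numeric cells into explicit categories instead of coercing them to 0 or NaN. A sketch, with the filer-dependent dash convention surfaced as a parameter rather than baked in (the category names are our own):

```python
from enum import Enum

class CellMeaning(Enum):
    VALUE = "value"
    ZERO = "zero"
    NOT_APPLICABLE = "not_applicable"
    NOT_MEANINGFUL = "not_meaningful"
    NOT_REPORTED = "not_reported"

def classify_cell(raw: str, dash_means_zero: bool = True) -> CellMeaning:
    """Classify a cell instead of silently coercing blanks and dashes.

    Whether a dash means zero or not-applicable varies by filer, so that
    convention is a parameter, not a constant.
    """
    text = raw.strip().lower()
    if text == "":
        return CellMeaning.NOT_REPORTED
    if text in {"—", "–", "-", "--"}:
        return CellMeaning.ZERO if dash_means_zero else CellMeaning.NOT_APPLICABLE
    if text in {"n/a", "na"}:
        return CellMeaning.NOT_APPLICABLE
    if text in {"n/m", "nm"}:
        return CellMeaning.NOT_MEANINGFUL
    return CellMeaning.VALUE
```

Preserving the category lets downstream consumers decide whether a blank should become a zero in a sum or a gap in a time series, a decision the extractor is not positioned to make.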

Multi-currency tables. Some multinationals report segment results in local currencies within the same table that reports consolidated results in the reporting currency. The currency context might be indicated by a column header, a footnote, or a parenthetical in the segment name, and missing it means every number in that column is interpreted in the wrong denomination.

Non-standard exhibit formatting. The financial statements in a 10-K follow relatively predictable conventions, but exhibits like credit agreements, indentures, and lease schedules have no standard format at all and can contain multiple differently formatted tables within a single document, each requiring correct handling without any template to match against.

Why This Matters

Public benchmarks like FUNSD and DocVQA systematically exclude the edge cases described above. That is why a system that achieves state-of-the-art accuracy on those datasets can still fail on a real 10-K full of parenthetical negatives, footnoted adjustments, and multi-level spanning headers. Financial data firms understand this, which is why they build their own evaluation datasets rather than relying on public benchmarks, and why extraction quality is ultimately determined at the edges rather than on the easy cases that were never the bottleneck.

As more industries begin feeding extracted document data into workflows where accuracy has real consequences, the standards that financial services have always enforced will stop looking like an outlier and start looking like the floor.