Merged Cells, Broken Data: How Span Tables Defeat Document Extraction

Sid and Ritvik

January 16, 2026

A span table is any table where cells merge across rows or columns: the header "Year Ended December 31" spanning three columns, or a segment name like "North America" spanning four product rows. They're everywhere in financial documents, and they break nearly every general-purpose extraction tool on the market.

The core problem is that span tables encode hierarchy visually rather than explicitly. The spanning cell says "everything below me belongs to this group," but that relationship exists only in the geometry, not in the underlying data structure. Most extraction systems process cells independently. They detect "Year Ended December 31" as text, and they detect the revenue numbers below it, but the binding between them (which numbers belong to which period) gets lost in translation.

The failure is silent. You get a table with correct numbers and plausible headers, but the values are associated with the wrong columns. Downstream analysis is wrong in ways that don't trigger obvious errors. You won't get an exception or a warning. You'll just make decisions based on Q3 revenue that's actually Q2 revenue, and you won't know until someone manually reconciles the data.
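To make the silent part concrete, here is a minimal sketch in Python. The header layout is made up for illustration, and `naive_headers` and `span_aware_headers` are hypothetical helpers, not functions from any particular extraction tool. Both run without raising anything; only one of them binds each column to the right header.

```python
# Each header cell is (text, colspan). The layout is illustrative,
# not taken from a real filing.
header_row = [("", 1), ("Three Months Ended", 2), ("Nine Months Ended", 2)]
n_cols = 5

def naive_headers(cells, n_cols):
    """Treat every header cell as exactly one column, ignoring colspan."""
    texts = [text for text, _ in cells]
    return texts + [""] * (n_cols - len(texts))

def span_aware_headers(cells):
    """Repeat each header text across every column it spans."""
    out = []
    for text, colspan in cells:
        out.extend([text] * colspan)
    return out

naive = naive_headers(header_row, n_cols)
aware = span_aware_headers(header_row)

# Column 2 holds a three-month figure, but the naive pass labels it
# "Nine Months Ended". No exception, no warning.
print(naive[2])  # Nine Months Ended
print(aware[2])  # Three Months Ended
```

The numbers in the body of the table would be identical in both outputs. Only the labels differ, which is why this class of error tends to survive a casual spot check.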

Why Multi-Level Spans Compound the Problem

Simple span tables are manageable. The real complexity emerges when spans nest within spans.

A table might have "Consolidated Results" spanning the full width, then "Operating Segments" and "Corporate" as second-level spans, then individual segment names below that. That's three levels of hierarchy encoded purely in visual layout. Each level adds another dimension that the extraction system needs to track, and each level is another opportunity for context to get lost.

When extraction fails on a single-level span, you get column misalignment. When it fails on multi-level spans, you get complete structural collapse. The relationships between a revenue number and its segment, its geography, and its time period all depend on correctly parsing every level of the hierarchy above it. Miss any one of them and the number becomes meaningless.

SEC Filings as a Case Study

10-Ks and 10-Qs are span table minefields. Period comparisons (FY2024 vs FY2023 vs FY2022), segment breakdowns, and geographic splits all use column and row spans to organize dense financial data into something humans can read.

The challenge is that the same logical disclosure (say, revenue by segment) renders completely differently across filers. One company uses a clean three-column span. Another nests segments within geographies within business units. There's no standard layout, just standard data requirements. An extraction system that works perfectly on Apple's 10-K might fail completely on Microsoft's, even though both are disclosing the same underlying information.

Column spans for period headers are the most common failure point. "Three Months Ended" spanning Q1, Q2, and Q3 columns means every value below inherits that temporal context. Miss the span, and you might attribute Q3 revenue to Q2. The numbers are right; they're just in the wrong buckets.

Row spans appear in segment or subsidiary breakdowns where a parent entity name governs multiple line items. These are rarer but more dangerous. Misreading the boundary scrambles which revenues belong to which business unit, and that's the kind of error that cascades through every downstream calculation.

Proxy statements (DEF 14A) add executive compensation tables with particularly nasty spans: award types spanning multiple years, performance periods spanning multiple metrics, all packed into a single dense table. If you've ever tried to extract comp data programmatically, you know exactly how painful this gets.

Why the Obvious Solutions Don't Work

Parsing the underlying HTML or XML from EDGAR sounds like a shortcut. The markup should tell you exactly which cells span which columns. In practice, filing formats are wildly inconsistent. Older filings are essentially PDFs converted to HTML with table structures that don't reflect the visual layout at all. The colspan attribute might say one thing while the rendered table shows something completely different.
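For reference, this is roughly what reading spans straight from markup looks like, using only Python's standard-library `html.parser`. The table below is invented for illustration, and `SpanTableParser` and `expand_grid` are hypothetical names. Note what the sketch does and does not tell you: it recovers what the `colspan` and `rowspan` attributes claim, which is only as good as the filer's markup.

```python
from html.parser import HTMLParser

class SpanTableParser(HTMLParser):
    """Collect (text, rowspan, colspan) for each table cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rs, cs = self._cell
            self.rows[-1].append((" ".join("".join(parts).split()), rs, cs))
            self._cell = None

def expand_grid(rows):
    """Copy each cell's text into every grid position its spans cover."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rs, cs in row:
            while (r, c) in grid:  # slot already taken by a rowspan from above
                c += 1
            for dr in range(rs):
                for dc in range(cs):
                    grid[(r + dr, c + dc)] = text
            c += cs
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

# Made-up table with one rowspan and one colspan.
html = """
<table>
  <tr><th rowspan="2">Segment</th><th colspan="2">Year Ended December 31</th></tr>
  <tr><th>2024</th><th>2023</th></tr>
  <tr><td>North America</td><td>120</td><td>110</td></tr>
</table>
"""
parser = SpanTableParser()
parser.feed(html)
grid = expand_grid(parser.rows)
for row in grid:
    print(row)
```

On clean markup this yields a fully populated grid. On a filing where layout is faked with empty spacer cells or nested tables, the same code runs just as happily and produces a grid that does not match what a reader sees on the page.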

Newer iXBRL filings have explicit tags that should solve the problem. Each number is labeled with what it represents (revenue, net income, whatever) along with its context and period. In practice, filers tag inconsistently. The same line item might be tagged as "Revenue" by one company and "NetSales" by another, or tagged at different hierarchy levels. When visual structure and tagged structure diverge, which one do you trust?

General-purpose vision-language models can identify that spans exist, but they struggle to propagate the context correctly. They'll note "this header spans three columns" without reliably binding each value below to the right column. The visual understanding is there; the structural reasoning isn't.

What Actually Works

The fix starts with detecting span regions geometrically before extracting content. The merged cell's bounding box tells you exactly which columns or rows it governs. That geometric relationship is the source of truth, not the text content or any markup that might be present.
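A minimal sketch of the geometric idea, assuming you already have horizontal extents for the header cell and for each column (from a layout model, PDF coordinates, or similar). The `Box` type, the `governed_columns` function, the 0.5 overlap threshold, and all coordinates are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    x0: float  # left edge
    x1: float  # right edge

def governed_columns(header: Box, columns: list[Box], min_overlap: float = 0.5) -> list[int]:
    """Indices of columns whose width lies mostly under the header's box."""
    hits = []
    for i, col in enumerate(columns):
        overlap = min(header.x1, col.x1) - max(header.x0, col.x0)
        width = col.x1 - col.x0
        if width > 0 and overlap / width >= min_overlap:
            hits.append(i)
    return hits

# Illustrative coordinates: one label column, three value columns.
columns = [
    Box("label", 0, 100),
    Box("col_a", 100, 160),
    Box("col_b", 160, 220),
    Box("col_c", 220, 280),
]
header = Box("Year Ended December 31", 105, 275)

print(governed_columns(header, columns))  # [1, 2, 3]
```

The same overlap test applied to vertical extents handles row spans. Using a ratio rather than exact edge matching leaves room for the small misalignments that detected boxes usually have.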

From there, you need to build a header tree rather than a flat list. Each extracted cell should carry its full header stack: Period: FY2024, Segment: North America, Line Item: Revenue. The stack is what makes the number meaningful. Without it, you just have a floating value with no context.
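One way to picture the header stack, as a sketch rather than a definitive design: assume the header rows have already been expanded so each spanning header repeats over the columns it governs, and the segment name has been filled down its row span. The function names, field names, and values below are all illustrative.

```python
def column_paths(header_rows):
    """Stack the headers above each column, top level first.
    Blank cells and consecutive repeats are skipped."""
    paths = []
    for c in range(len(header_rows[0])):
        path = []
        for row in header_rows:
            text = row[c]
            if text and (not path or path[-1] != text):
                path.append(text)
        paths.append(tuple(path))
    return paths

def to_records(header_rows, body_rows, n_label_cols=2):
    """Emit one record per value cell, carrying its full header stack.
    Each body row is (segment, line_item, value, value, ...)."""
    paths = column_paths(header_rows)[n_label_cols:]
    records = []
    for segment, line_item, *values in body_rows:
        for path, value in zip(paths, values):
            records.append({
                "period": path,
                "segment": segment,
                "line_item": line_item,
                "value": value,
            })
    return records

# Made-up table: two header levels, two label columns, two value columns.
header_rows = [
    ["", "", "Year Ended December 31", "Year Ended December 31"],
    ["", "", "2024", "2023"],
]
body_rows = [
    ("North America", "Revenue", 120, 110),
    ("North America", "Operating income", 30, 25),
]

records = to_records(header_rows, body_rows)
print(records[0])
```

Each record is self-describing, so nothing downstream has to know where the value sat in the original grid. It also gives you a filer-independent shape to normalize into.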

At Pulse, we validate span detection against any available structured data (XBRL tags, HTML table markup), but we don't trust that data blindly. When visual structure and tags conflict, we flag the table for review rather than guessing. The goal is accurate extraction, not just fast extraction.
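The flag-rather-than-guess policy can be sketched in a few lines. This is an illustration of the idea only, not Pulse's implementation: `reconcile_spans` is a hypothetical function, and it assumes both sources have already been reduced to a mapping from header text to the column indices it governs.

```python
def reconcile_spans(visual, markup):
    """Compare header -> column-index mappings from two sources.
    Returns (agreed, conflicts). Conflicts are left for human review;
    nothing here tries to pick a winner."""
    agreed, conflicts = {}, []
    for header in sorted(set(visual) | set(markup)):
        v, m = visual.get(header), markup.get(header)
        if v == m:
            agreed[header] = v
        else:
            conflicts.append({"header": header, "visual": v, "markup": m})
    return agreed, conflicts

# Illustrative inputs: the markup claims a narrower span than the geometry.
visual = {"Three Months Ended": (1, 2), "Nine Months Ended": (3, 4)}
markup = {"Three Months Ended": (1, 2), "Nine Months Ended": (3,)}

agreed, conflicts = reconcile_spans(visual, markup)
print(agreed)     # {'Three Months Ended': (1, 2)}
print(conflicts)  # [{'header': 'Nine Months Ended', 'visual': (3, 4), 'markup': (3,)}]
```

A header that appears in only one source shows up as a conflict with `None` on the other side, so missing spans get surfaced the same way as mismatched ones.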

For historical analysis across multiple years of filings, normalization matters as much as extraction. The fact that Company A and Company B format their segment tables differently shouldn't mean your downstream models need custom logic for each filer. A good extraction system produces consistent output schemas regardless of how the source document happens to be formatted.

One edge case that catches people off guard: cross-page spans. Tables that start on page 47 and continue on page 48 require row-continuity tracking. The first row of page 48 inherits the column headers from page 47. Lose that link and half your table is orphaned data with no column associations at all.
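A minimal sketch of the carry-forward step, under a big assumption: something upstream has already decided which page fragments belong to the same table and whether each one repeats its headers. That decision is the hard part and is not shown here. `stitch_pages` and the page dictionaries are hypothetical, and the values are made up.

```python
def stitch_pages(pages):
    """Merge table fragments from consecutive pages into keyed rows.
    Each page is {"headers": list or None, "rows": list of lists}.
    A fragment with headers=None inherits the most recent headers."""
    headers = None
    stitched = []
    for page in pages:
        if page["headers"] is not None:
            headers = page["headers"]
        if headers is None:
            raise ValueError("continuation fragment with no preceding headers")
        for row in page["rows"]:
            if len(row) != len(headers):
                # Fail loudly instead of silently misaligning columns.
                raise ValueError(
                    f"row width {len(row)} does not match {len(headers)} headers"
                )
            stitched.append(dict(zip(headers, row)))
    return stitched

pages = [
    {"headers": ["Line item", "2024", "2023"], "rows": [["Revenue", 120, 110]]},
    {"headers": None, "rows": [["Cost of sales", 70, 65]]},  # continued page
]

table = stitch_pages(pages)
print(table[1])  # {'Line item': 'Cost of sales', '2024': 70, '2023': 65}
```

The width check is deliberate. A continuation row that no longer matches the inherited headers is a sign the link was guessed wrong, and raising there beats producing orphaned values.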

The Questions Worth Asking

If you're evaluating extraction tools for SEC filings or other financial documents, span table handling should be near the top of your checklist. Does the system preserve spanning relationships in its output, or does it flatten everything to a simple grid? Can it handle multi-level spans where headers nest within headers? Does it track context across page breaks?

Run test cases with real 10-K filings that have complex table structures. Look at segment disclosures, period comparisons, and anything with geographic breakdowns. If the extracted data doesn't maintain the relationships that make those tables meaningful, you'll spend more time fixing extraction errors than you saved by automating in the first place.

Working with complex financial documents? Talk to us about how Pulse handles span tables and other structural challenges in SEC filings.
