Advanced Spreadsheet Parsing: General Availability

Sid and Ritvik

January 8, 2026

In November, we introduced our spreadsheet encoder to early access customers working with complex XLSX files. That release addressed the core challenge we'd been wrestling with: how to process multi-tab workbooks without losing the structural relationships that make spreadsheet data meaningful.

Since then, we've processed millions of spreadsheet pages across production deployments with enterprise customers in finance, real estate, private equity, and more. The feedback loop from these deployments, combined with thousands of PhD-annotated training examples we built specifically for irregular layouts, merged cell regions, and cross-sheet references, allowed us to refine the model substantially.

Today we're releasing our spreadsheet parsing capabilities to general availability.

What's improved

The GA release includes several updates over the preview version.

We reduced median latency by 11% through optimizations in our compression pipeline. For large workbooks, processing time scales more predictably, and we've eliminated edge cases that caused timeouts on files exceeding 100 tabs.

The encoder now handles workbooks that previously exceeded context limits. Our updated chunking strategy processes them in segments, maintaining structural fidelity across segment boundaries without requiring manual splitting.
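
As a rough illustration of structure-preserving chunking (a minimal sketch, not Pulse's actual implementation), one simple approach is to split a sheet's rows under a token budget and repeat the header row at the top of every chunk so each segment stays interpretable on its own. The names `estimate_tokens` and `chunk_rows`, and the one-token-per-cell cost model, are assumptions made for this example.

```python
from typing import List, Sequence

Row = Sequence[str]


def estimate_tokens(row: Row) -> int:
    # Hypothetical cost model: one token per non-empty cell, plus one for the row break.
    return sum(1 for cell in row if cell) + 1


def chunk_rows(header: Row, rows: List[Row], budget: int) -> List[List[Row]]:
    """Split rows into chunks that each start with the header and fit the budget."""
    header_cost = estimate_tokens(header)
    chunks: List[List[Row]] = []
    current: List[Row] = [header]
    used = header_cost
    for row in rows:
        cost = estimate_tokens(row)
        if header_cost + cost > budget:
            raise ValueError("a single row plus the header exceeds the budget")
        if used + cost > budget:
            # Close the current chunk and start a new one that repeats the header.
            chunks.append(current)
            current = [header]
            used = header_cost
        current.append(row)
        used += cost
    if len(current) > 1:
        chunks.append(current)
    return chunks
```

Repeating the header costs a few tokens per chunk, but it means no segment has to be read alongside its predecessor to know what each column means.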

Layout segmentation accuracy improved on edge cases involving nested merged regions and inconsistent header patterns. The model better distinguishes data regions from formatting artifacts, particularly in financial models where presentation elements intermix with actual values.

How it works

Our spreadsheet encoder compresses XLSX structural information while preserving the relationships that matter for extraction. Rather than treating spreadsheets as linear sequences or naive grid serializations, the encoder maintains cell adjacency, merge spans, and cross-sheet linkage in a representation optimized for our vision-language models (VLMs). It enforces structural invariants over adjacency, merge continuity, and reference linkage, so downstream models can reason over a spreadsheet as a constrained relational object rather than an unconstrained token sequence.
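
To make the idea of structural invariants concrete, here is a minimal sketch of a workbook representation that keeps adjacency, merge spans, and cross-sheet references explicit, along with a checker for two invariants. All of the types and functions (`MergeSpan`, `CrossRef`, `Sheet`, `neighbors`, `check_invariants`) are hypothetical and chosen for illustration; the encoder's real internal representation is not described in this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, column), 1-indexed


@dataclass(frozen=True)
class MergeSpan:
    top: int
    left: int
    bottom: int
    right: int

    def contains(self, coord: Coord) -> bool:
        row, col = coord
        return self.top <= row <= self.bottom and self.left <= col <= self.right

    def overlaps(self, other: "MergeSpan") -> bool:
        return not (
            self.right < other.left or other.right < self.left
            or self.bottom < other.top or other.bottom < self.top
        )


@dataclass(frozen=True)
class CrossRef:
    source: Coord
    target_sheet: str
    target: Coord


@dataclass
class Sheet:
    name: str
    cells: Dict[Coord, str] = field(default_factory=dict)
    merges: List[MergeSpan] = field(default_factory=list)
    refs: List[CrossRef] = field(default_factory=list)


def neighbors(sheet: Sheet, coord: Coord) -> List[Coord]:
    """Adjacency: populated cells directly above, below, left, and right."""
    row, col = coord
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [c for c in candidates if c in sheet.cells]


def check_invariants(workbook: Dict[str, Sheet]) -> List[str]:
    problems: List[str] = []
    for sheet in workbook.values():
        for i, span in enumerate(sheet.merges):
            # Merge continuity: spans must not overlap each other...
            for other in sheet.merges[i + 1:]:
                if span.overlaps(other):
                    problems.append(f"{sheet.name}: overlapping merge spans")
            # ...and only the top-left anchor of a span may hold a value.
            for coord in sheet.cells:
                if span.contains(coord) and coord != (span.top, span.left):
                    problems.append(f"{sheet.name}: value inside merge at {coord}")
        # Reference linkage: every cross-sheet reference must resolve to a populated cell.
        for ref in sheet.refs:
            target = workbook.get(ref.target_sheet)
            if target is None or ref.target not in target.cells:
                problems.append(f"{sheet.name}: dangling reference to {ref.target_sheet}")
    return problems
```

An empty list from `check_invariants` means the workbook satisfies both checks; each string otherwise names the sheet and the violated constraint.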

This approach solves the token efficiency problem that plagues standard extraction methods. A 50-tab financial model serialized naively can exceed context windows before processing even begins. Merged cells create coordinate ambiguity, cross-sheet references lose their graph structure when flattened, and whitespace or formatting cells pollute the token stream with non-semantic data. Our encoder addresses each of these by treating spreadsheet structure as a first-class object throughout the pipeline.
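
The cost of naive serialization is easy to see on a toy example. The sketch below compares a dense grid dump, which emits every position in the bounding box, with a sparse listing that emits only populated cells tagged with their A1 coordinates. The helper names (`col_letter`, `naive_grid`, `sparse_cells`) are assumptions for this illustration, and the sparse format is just one simple alternative, not the encoder's actual output.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (row, column), 1-indexed


def col_letter(col: int) -> str:
    """Convert a 1-indexed column number to A1-style letters (1 -> A, 27 -> AA)."""
    letters = ""
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def naive_grid(cells: Dict[Coord, str], n_rows: int, n_cols: int) -> str:
    """Serialize every position in the bounding grid, empty or not."""
    lines = []
    for r in range(1, n_rows + 1):
        lines.append(",".join(cells.get((r, c), "") for c in range(1, n_cols + 1)))
    return "\n".join(lines)


def sparse_cells(cells: Dict[Coord, str]) -> str:
    """Serialize only populated cells, each tagged with its A1 coordinate."""
    return "\n".join(
        f"{col_letter(c)}{r}={value}"
        for (r, c), value in sorted(cells.items())
        if value.strip()
    )


# Four populated cells at the corners of a 40 x 50 region.
cells = {(1, 1): "Revenue", (1, 50): "FY2025", (40, 1): "Total", (40, 50): "1200"}
dense = naive_grid(cells, n_rows=40, n_cols=50)
sparse = sparse_cells(cells)
print(len(dense), len(sparse))
```

The dense dump spends nearly all of its characters on separators for empty positions, while the sparse listing keeps exact coordinates and skips the whitespace entirely.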

Figure: Example of a valuation spreadsheet where meaning depends on structure. Inputs, assumptions, and downstream logic are distributed across cells and sections, so preserving layout and relationships is critical for correct extraction and analysis.

Many financial spreadsheets go further, embedding decision logic and lookup tables that translate inputs into ratings, spreads, or actions.

When Pulse processes a multi-tab P&L, it understands which cells are headers, which are data, and how they relate, even when formatting is inconsistent or layout spans dozens of tabs. Cell coordinates, merge boundaries, and cross-sheet relationships are preserved rather than flattened, which means extracted data retains the structural fidelity required for reliable downstream use.
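
One way to picture cross-sheet relationships being preserved rather than flattened is as a dependency graph between tabs. The sketch below builds such a graph by scanning formula strings for A1-style references to other sheets. The function name `sheet_dependencies`, the input shape (sheet name to cell address to formula text), and the regular expression are assumptions for this example; the pattern covers common reference syntax only and is not a full formula parser.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Matches 'Quoted Sheet'!A1 or BareSheet!$A$1 style references.
SHEET_REF = re.compile(
    r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_.]*))!\$?[A-Z]{1,3}\$?\d+"
)


def sheet_dependencies(formulas: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Map each sheet to the set of other sheets its formulas reference."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for sheet, cells in formulas.items():
        for formula in cells.values():
            for quoted, bare in SHEET_REF.findall(formula):
                target = quoted or bare
                if target != sheet:
                    graph[sheet].add(target)
    return dict(graph)


workbook_formulas = {
    "Summary": {"B2": "=SUM('P&L 2025'!B2:B10)+Assumptions!$C$4"},
    "P&L 2025": {"B2": "=Assumptions!C4*1.1"},
    "Assumptions": {"C4": "0.05"},
}
deps = sheet_dependencies(workbook_formulas)
```

Here `deps` records that `Summary` depends on both other tabs and that `P&L 2025` depends on `Assumptions`, which is the kind of graph structure that disappears when each tab is serialized in isolation.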

We are actively exploring these problems and are interested in working with researchers who care about representation, structure, and failure modes in real-world systems.

Access

Spreadsheet parsing is now generally available to all Pulse customers through our standard API. Due to high demand, we're onboarding new customers in batches to ensure consistent processing quality and support. Documentation is available here. For teams with specific requirements around file size limits or processing guarantees, reach out to hello@trypulse.ai.
