When we started building Pulse, we assumed that structured outputs had solved the document extraction problem. Define a JSON schema, point an LLM at your documents, and get clean data back. Every major API provider now supports it. OpenAI, Anthropic, Google, and a growing ecosystem of open source tools like Outlines and XGrammar have made constrained generation accessible to anyone with an API key.
So our research team ran some internal evals: we took relatively simple document layouts (bank statements and invoices), defined a schema, and watched it work. Then we tried it on ten thousand more complex documents with alternate currency markers, faxed documents with unruled tables, nested line items, conditional fields, and variable-length arrays. That's when things got interesting.
The short version: enforcing structured outputs during generation is not free, and for the kinds of complex, nested schemas that real document extraction requires, naive approaches either blow up in compute or silently degrade extraction quality.
How Constrained Decoding Actually Works
At each generation step, the model produces a probability distribution over its entire vocabulary (128,000+ tokens in OSS models like Llama). Constrained decoding works by masking out tokens that would violate the schema, setting their probabilities to zero before sampling. The model can only generate tokens that keep the output on a valid path through the grammar.
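The masking step can be sketched in a few lines. This is a toy, character-free illustration (a five-token vocabulary, hand-picked logits), not any particular library's implementation:

```python
import math
import random

def constrained_sample(logits, valid_token_ids, rng=None):
    """One step of constrained decoding: mask tokens the grammar
    forbids, renormalize over what remains, then sample."""
    rng = rng or random.Random(0)
    valid = set(valid_token_ids)
    # Softmax over the allowed tokens only; everything else gets zero mass.
    exps = [math.exp(l) if i in valid else 0.0 for i, l in enumerate(logits)]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the renormalized distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return max(valid)  # floating-point edge case

# Toy vocabulary of 5 tokens; suppose the grammar only allows tokens 1 and 3.
logits = [2.0, 0.5, 1.5, -1.0, 0.0]
token = constrained_sample(logits, valid_token_ids=[1, 3])
assert token in (1, 3)  # the sample always stays on a grammar-valid path
```

Note that the model still assigned most of its probability to token 0; the mask simply makes that choice impossible, which is exactly the distribution shift discussed below.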
For simple constraints like regular expressions, this is efficient. The Outlines library showed you can precompute a finite state machine (FSM) from a regex and use it to generate token masks in O(1) time per step. For regular languages, the set of valid next tokens depends only on the current FSM state, not on the history of how you got there.
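The FSM idea can be shown with a hand-written character-level automaton for the pattern `-?[0-9]+` (a real implementation compiles the states and transitions from the regex; this toy and its tiny vocabulary are illustrative only):

```python
# Toy vocabulary, deliberately including characters the pattern forbids.
VOCAB = list("-0123456789abc")

# DFA for `-?[0-9]+`: state 0 = start, 1 = saw '-', 2 = saw >= 1 digit.
TRANSITIONS = {
    (0, "-"): 1,
    **{(s, d): 2 for s in (0, 1, 2) for d in "0123456789"},
}

# Precompute one mask per state: which vocab entries are valid next.
MASKS = {
    state: [(state, tok) in TRANSITIONS for tok in VOCAB]
    for state in (0, 1, 2)
}

def valid_tokens(state):
    """O(1) per step: one dict lookup, independent of generation history."""
    return [tok for tok, ok in zip(VOCAB, MASKS[state]) if ok]

assert valid_tokens(1) == list("0123456789")  # after '-', only digits
assert "-" in valid_tokens(0) and "a" not in valid_tokens(0)
```

Because the mask depends only on the current state, the whole table is computed once before generation starts; the per-step cost does not grow with output length.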
A specific JSON schema with bounded nesting can in principle be flattened into a regular language, but the recursive structure of nested objects and arrays means it is handled in practice like a context-free grammar: the parser is a pushdown automaton (PDA) that tracks open objects and arrays on a stack.
The State Explosion Problem
The XGrammar paper introduced a clever optimization. You can't precompute masks for every parser configuration (the stack makes the space unbounded), but you can categorize tokens into two sets:
- Context-independent tokens: validity can be determined by the current position in the PDA, ignoring the stack. For most grammars, this covers the vast majority of the vocabulary.
- Context-dependent tokens: validity depends on the full stack state. These require runtime verification.
By precomputing masks for context-independent tokens and only doing runtime checks for the small context-dependent set, XGrammar achieves up to 100x speedup over naive approaches.
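The split can be sketched as follows. This is an assumed simplification of the idea, not XGrammar's actual data structures: most of the mask is a precomputed lookup, and only a handful of tokens need a runtime check against the stack:

```python
def build_masks(vocab, ctx_independent_valid):
    """Precompute: token -> valid/invalid for context-independent tokens."""
    return {tok: tok in ctx_independent_valid for tok in vocab}

def step_mask(precomputed, ctx_dependent, stack, check):
    """Per step: reuse the precomputed mask, verify only the hard cases."""
    mask = dict(precomputed)
    for tok in ctx_dependent:  # runtime checks on a small set of tokens
        mask[tok] = check(tok, stack)
    return mask

vocab = ["{", "}", "[", "]", '"a"', ":", ","]
# Suppose offline analysis already settled these for the current position:
pre = build_masks(vocab, ctx_independent_valid={'"a"', "{", "["})
# Closing brackets are context-dependent: they must match the open stack.
closers = {"}": "{", "]": "["}
check = lambda tok, stack: bool(stack) and stack[-1] == closers[tok]

mask = step_mask(pre, ["}", "]"], stack=["{", "["], check=check)
assert mask["]"] and not mask["}"]  # only the matching closer is allowed
```

The key point is the asymmetry: the expensive per-step work scales with the small context-dependent set, not with the full 128K vocabulary.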
But even with this optimization, complex grammars create problems. When the grammar is non-deterministic (multiple valid ways to continue parsing at a given point), the parser must maintain multiple parallel stack states. Each ambiguous choice can double the number of states being tracked.
The classic result from automata theory is that converting a nondeterministic finite automaton (NFA) to a deterministic one (DFA) can produce 2^n states from an n-state NFA. For context-free grammars, the situation is worse because the stack adds another dimension of complexity.
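The 2^n blowup is easy to reproduce on the textbook worst case: strings over {a, b} whose n-th symbol from the end is 'a'. The NFA needs only n+1 states, but subset construction, where each reachable *set* of NFA states becomes one DFA state, produces 2^n of them:

```python
def nfa_for_nth_from_end(n):
    """NFA with n+1 states accepting strings whose n-th symbol from
    the end is 'a'. State 0 loops; on 'a' it may also guess that the
    final n-symbol suffix has started."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, frozenset({0})

def count_dfa_states(n):
    """Subset construction: count the reachable sets of NFA states."""
    delta, start = nfa_for_nth_from_end(n)
    seen, frontier = {start}, [start]
    while frontier:
        states = frontier.pop()
        for sym in "ab":
            nxt = frozenset().union(*(delta.get((s, sym), set()) for s in states))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

for n in range(1, 6):
    assert count_dfa_states(n) == 2 ** n  # exponential blowup, as advertised
```

Intuitively, the DFA has to remember the last n symbols exactly, because any of them might turn out to be the n-th from the end.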
What does this mean for document extraction? The schemas we care about (invoices, contracts, medical records, financial statements) tend to have exactly the properties that make constrained decoding expensive: deep nesting, optional fields that create ambiguity, arrays of complex objects, and union types.
The Quality-Constraint Tradeoff
Here's the part that surprised us most. Constrained decoding doesn't just cost compute. It can hurt extraction accuracy.
The paper "Let Me Speak Freely?" systematically studied this effect. They found that stricter format constraints lead to greater performance degradation on reasoning tasks. JSON-mode significantly reduced accuracy on tasks like GSM8K math problems compared to free-form generation.
Why? The model isn't "aware" of the constraints during the forward pass. It computes logits as usual, and then the constrained decoder masks out invalid tokens after the fact. When the model wants to output one token but is forced to choose another, you get a distribution shift. The decoder may force a low-probability token, introducing noise into the generation process.
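That distribution shift is measurable. A minimal sketch (the signal definitions here are our framing, not from the paper): at each step, compute how much probability mass the mask discards and how likely the best surviving token was:

```python
import math

def mask_shift_stats(logits, valid_token_ids):
    """How hard is the constraint fighting the model at this step?

    Returns (lost_mass, best_valid_logprob): the probability mass the
    mask throws away, and the log-probability of the best token the
    grammar still allows. Both are cheap signals of distribution shift.
    """
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    valid = set(valid_token_ids)
    lost = sum(p for i, p in enumerate(probs) if i not in valid)
    best_valid = max(probs[i] for i in valid)
    return lost, math.log(best_valid)

# The model strongly prefers token 0, but the grammar only allows 3 and 4.
logits = [5.0, 1.0, 0.5, -1.0, -2.0]
lost, lp = mask_shift_stats(logits, valid_token_ids=[3, 4])
assert lost > 0.95  # almost all of the model's mass was masked away
```

When `lost` is near zero the constraint is essentially free; when it is large, the decoder is steering the model somewhere it did not want to go.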
For document extraction, this creates a paradox:
- Tighter schemas give you more reliable parsing and cleaner downstream data pipelines
- But tighter schemas can degrade the model's ability to correctly extract information in the first place
What We're Exploring at Pulse
Document extraction at scale requires solving three interconnected problems: schema design that captures document structure without over-constraining the model, efficient enforcement that doesn't kill throughput when processing millions of pages, and maintaining extraction accuracy despite the overhead of constrained generation.
Schema Complexity Analysis. Can we predict extraction difficulty from schema structure before running inference? We're building tools to analyze schemas and estimate expected compilation time, likelihood of state explosion, and risk of quality degradation based on constraint density. The goal is feedback at schema design time, before you burn GPU hours on a suboptimal configuration.
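As a flavor of what such analysis looks like, here is a rough heuristic (an illustrative sketch with made-up weights, not our actual tooling) that scores a JSON Schema by the features that tend to stress constrained decoders:

```python
def schema_complexity(schema, depth=0):
    """Heuristic score: nesting depth, optional fields, arrays, and
    unions all add ambiguity and parser state. Weights are arbitrary."""
    score = depth
    if not isinstance(schema, dict):
        return score
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    score += 2 * len(set(props) - required)  # optional fields add ambiguity
    unions = schema.get("anyOf", []) + schema.get("oneOf", [])
    score += 3 * len(unions)                 # union branches multiply paths
    if schema.get("type") == "array":
        score += 2 + schema_complexity(schema.get("items", {}), depth + 1)
    for sub in list(props.values()) + unions:
        score += schema_complexity(sub, depth + 1)
    return score

flat = {"type": "object", "properties": {"total": {"type": "number"}},
        "required": ["total"]}
nested = {"type": "object", "properties": {
    "line_items": {"type": "array", "items": {
        "type": "object",
        "properties": {"desc": {"type": "string"}, "tax": {"type": "number"}},
    }}}}
assert schema_complexity(nested) > schema_complexity(flat)
```

Even a crude score like this lets you rank candidate schemas before spending any inference time on them.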
Adaptive Constraint Strategies. Not all documents need the same level of constraint strictness. A highly templated form might benefit from tight constraints. A free-form contract paragraph might need looser constraints to let the model reason about what information is present. We're exploring hybrid approaches: generate with minimal constraints, then use a second pass to restructure into the target schema. This "NL-to-Format" strategy often outperforms direct constrained generation on reasoning-heavy tasks.
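The two-pass control flow is simple to sketch. Here `llm` is any prompt-to-text callable, and both prompts are illustrative placeholders rather than a tested production template:

```python
import json

def two_pass_extract(document, schema_hint, llm):
    """NL-to-Format sketch: pass 1 reasons freely about the document,
    pass 2 restructures the free-form notes into the target schema."""
    notes = llm(f"Extract all relevant facts from this document:\n{document}")
    raw = llm(
        f"Restructure these notes as JSON matching {schema_hint}. "
        f"Output JSON only.\n{notes}"
    )
    return json.loads(raw)

# Usage with a stubbed model, just to show the control flow:
def fake_llm(prompt):
    return ('{"vendor": "Acme", "total": 41.5}'
            if "JSON" in prompt else "Vendor is Acme, total is 41.50.")

result = two_pass_extract("...", '{"vendor": str, "total": float}', fake_llm)
assert result == {"vendor": "Acme", "total": 41.5}
```

In practice you would apply tight constrained decoding only to the second, restructuring pass, where the reasoning is already done and the format penalty is cheap.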
Grammar Compilation Optimization. When multiple schemas share common substructures (dates, currencies, addresses), you can compute token masks once and reuse them. We're building a schema registry that identifies shared substructures across document types and pre-compiles them into reusable grammar fragments.
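The caching layer amounts to keying compiled fragments by a canonical hash of the sub-schema. A minimal sketch (illustrative, not our registry; `compile_fn` stands in for whatever grammar compiler you use):

```python
import hashlib
import json

class FragmentRegistry:
    """Compile each distinct sub-schema once, however many document
    types reference it."""
    def __init__(self, compile_fn):
        self._compile = compile_fn  # sub-schema dict -> compiled grammar
        self._cache = {}
        self.compiles = 0           # instrumentation for the demo

    def get(self, sub_schema):
        # sort_keys gives a canonical form, so equal schemas hash equal.
        key = hashlib.sha256(
            json.dumps(sub_schema, sort_keys=True).encode()
        ).hexdigest()
        if key not in self._cache:
            self.compiles += 1
            self._cache[key] = self._compile(sub_schema)
        return self._cache[key]

DATE = {"type": "string", "pattern": r"\d{4}-\d{2}-\d{2}"}
reg = FragmentRegistry(compile_fn=lambda s: ("compiled", json.dumps(s)))
reg.get(DATE)  # invoice schema needs a date fragment
reg.get(DATE)  # so does the contract schema -- cache hit
assert reg.compiles == 1
```

With hundreds of document types sharing a few dozen primitives (dates, currencies, addresses), the compilation cost amortizes quickly.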
Confidence-Aware Extraction. When constrained decoding forces a low-probability token, that's a signal. We're experimenting with tracking perplexity during generation and flagging extractions where the model was "fighting" the constraints. These high-perplexity extractions can be routed to human review or processed with alternative strategies.
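A minimal version of that flagging logic, with an arbitrary threshold chosen for illustration (the real routing policy would be tuned on labeled data):

```python
import math

def flag_fighting(step_logprobs, threshold=-4.0):
    """`step_logprobs` holds the model's log-probability of each token
    the constrained decoder actually emitted. A very low value means
    the decoder forced a token the model disliked."""
    worst = min(step_logprobs)
    perplexity = math.exp(-sum(step_logprobs) / len(step_logprobs))
    return worst < threshold, perplexity

# A mostly-confident extraction with one forced, very unlikely token:
logprobs = [-0.1, -0.2, -7.5, -0.1]
needs_review, ppl = flag_fighting(logprobs)
assert needs_review  # route this extraction to human review
```

The per-step worst case and the sequence-level perplexity catch different failure modes: a single forced token in an otherwise clean extraction, versus a whole field the model guessed its way through.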
Schema-guided extraction sits at the intersection of formal language theory, LLM inference optimization, and practical document understanding. The problems are hard. But the payoff for solving them, reliable structured data extraction from the messy reality of business documents, is substantial.
If this is the kind of problem that excites you, we're hiring. Reach out.
Related: Why LLMs Suck at OCR • Advanced Spreadsheet Parsing: General Availability
