When you're processing a few thousand documents a month, almost any architecture works. You can run a single model on a single GPU, process documents one at a time, and still meet your SLAs with room to spare. The pipeline you built for a proof-of-concept handles production just fine.
Then volume grows, and a few thousand becomes a few hundred thousand. Latency that was acceptable becomes a bottleneck, costs that were rounding errors become line items, and the architectural decisions that worked at small scale start to crack.
By the time you're processing hundreds of millions of pages, everything needs to change. Batching strategies, caching layers, model selection, GPU versus CPU allocation, horizontal scaling, memory management. The system that worked at 10,000 pages per month won't survive 10 million, and what works at 10 million won't scale to a billion.
This is the story of what we learned building Pulse to handle document processing at enterprise scale, and the architectural patterns that make the difference between a system that buckles under load and one that scales smoothly.
The Burst Problem in Financial Services
Financial document processing doesn't arrive in a steady stream; it comes in bursts tied to business cycles, and those bursts can be massive.
Consider quarterly earnings season. Public companies file 10-Qs and 10-Ks within tight windows, and financial data providers need to process those filings immediately. A firm tracking 5,000 companies might see 3,000 filings land within a 48-hour window at quarter end. Each filing averages 80-120 pages, some exceeding 500. That's 300,000+ pages that need processing in two days, with customers expecting extracted data within hours of each filing hitting EDGAR.
Or consider month-end reconciliation at a large asset manager. Portfolio statements, trade confirmations, and NAV reports from hundreds of counterparties arrive in the final days of the month. A firm with 50 fund administrators might receive 10,000 statements totaling 2 million pages, all needing processing before the monthly close, and the window might be just 24 hours. Miss it, and downstream reporting deadlines cascade into failures.
Year-end is even more intense, as annual reports, K-1s for partnership allocations, and tax documents create a surge that dwarfs normal volume. A fund administrator processing K-1s for a large private equity complex might handle 50,000 documents in January alone, each requiring precise extraction for investor tax reporting. Errors propagate to thousands of investors and their accountants.
These aren't hypothetical scenarios; they're the workloads our financial services customers face, and they drove the architecture decisions we'll describe.
The Burst Problem in Insurance
Insurance has its own burst patterns, often triggered by events rather than calendars.
Catastrophic events create sudden processing surges, and when a hurricane makes landfall, claims volume in affected regions can spike 100x within days. An insurer covering Florida homeowners might process 500 claims per day normally, then face 50,000 claims in the week following a major storm. Each claim includes loss reports, adjuster assessments, contractor estimates, and supporting documentation, often totaling 30-50 pages per claim. That's over a million pages requiring processing while policyholders wait for settlements.
Open enrollment periods create similar bursts in health insurance. A large employer's benefits administrator might process 100,000 enrollment forms in a three-week window as employees select coverage for the coming year. Each form requires extraction of plan selections, dependent information, and eligibility verification. The deadline is fixed by regulatory requirements, and missing it means employees without coverage.
Policy renewal cycles compound the challenge. Commercial insurers renewing large books of business might process 10,000 policy applications in a quarter, each with loss runs, financial statements, and supplementary questionnaires. Underwriters need extracted data to quote renewals before policies lapse.
The common thread is that volume isn't constant, and systems need to handle 10x or 100x normal load during peak periods before scaling back down when the surge passes. Building for peak capacity that sits idle most of the year is economically wasteful, but building for average capacity means failing during the moments that matter most.
Model Routing: Right-Size the Compute
The naive approach runs every document through your most capable model, which becomes economically and operationally unsustainable as volume grows.
Most documents don't need your heaviest models, and a standard W-2 form doesn't require the same processing as a 200-page credit agreement with nested tables and handwritten annotations. Routing documents to appropriately-sized models based on complexity is the single biggest lever for handling burst volume without proportionally scaling infrastructure.
At Pulse, we use lightweight vision models for initial layout and structure detection. These models classify document type, assess quality, identify regions of interest, and make routing decisions. Only documents that actually need heavy semantic extraction get routed to larger models. The lightweight classifier runs at roughly 50ms per page on CPU; the full extraction pipeline might take 2-3 seconds on GPU. Running classification first means the majority of documents never touch the expensive path.
The routing decision itself needs to be fast and cheap. We use a multi-stage classifier:
Stage 1: Document type classification (CPU, ~50ms)
Stage 2: Complexity scoring based on detected regions (CPU, ~20ms)
Stage 3: Route to the appropriate extraction pipeline

Simple forms go to template-based extraction, standard tables go to our table-specific model, and only genuinely complex documents (nested structures, mixed handwriting, unusual layouts) get the full treatment.
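For illustration, here's a minimal sketch of what such a multi-stage router could look like. The thresholds, helper callables, and pipeline names are hypothetical stand-ins, not our production implementation:

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    TEMPLATE = "template-based extraction"   # simple forms (W-2s, standard claim forms)
    TABLE = "table-specific model"           # standard tabular documents
    FULL = "full semantic extraction"        # nested structures, handwriting, unusual layouts


@dataclass
class RoutingDecision:
    doc_type: str
    complexity: float
    pipeline: Pipeline


def route(page_images, classify_type, score_complexity,
          full_threshold=0.7, table_threshold=0.4):
    """Stages 1 and 2 run on CPU; only the resulting decision determines
    whether a document ever reaches a GPU-backed pipeline."""
    doc_type = classify_type(page_images)        # stage 1: document type (~50ms)
    complexity = score_complexity(page_images)   # stage 2: region-based complexity score (~20ms)

    if complexity >= full_threshold:
        pipeline = Pipeline.FULL
    elif complexity >= table_threshold:
        pipeline = Pipeline.TABLE
    else:
        pipeline = Pipeline.TEMPLATE

    return RoutingDecision(doc_type, complexity, pipeline)
```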
During a catastrophic event surge, this routing means that the flood of standardized claim forms routes to fast template-based processing, while the occasional complex document with handwritten adjuster notes gets the attention it needs. The system handles both without choking on volume or compromising on difficult cases.
Batch Processing: Amortize Fixed Costs
Model loading is expensive, and loading a large vision model into GPU memory takes seconds. If you're processing documents one at a time, that overhead dominates your throughput and makes burst handling impossible.
Batching amortizes these fixed costs across many documents, but naive batching (just group N documents together) leaves performance on the table. Smart batching groups similar documents together so you can keep the same model resident in memory across the batch.
Our batching strategy considers:
- Document type: Process all W-2s together, then all 1099s, then all bank statements
- Page count: Group similarly-sized documents to optimize memory allocation
- Complexity tier: Keep simple documents separate from complex ones to avoid blocking fast jobs behind slow ones
- Customer priority: High-priority customers get dedicated batch queues with guaranteed latency even during surges
The improvement is substantial: random batching might achieve 60% GPU utilization, while similarity-based batching pushes this above 90%, which translates directly to handling higher burst volume on the same infrastructure.
During earnings season, this means grouping 10-K filings by approximate page count and complexity tier. A batch of 50-page filings from smaller companies processes together, while 300-page filings from large multinationals form their own batch. Both complete faster than if they were interleaved.
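As a rough sketch of this kind of similarity-based grouping, documents can be keyed by priority, type, complexity tier, and a page-count bucket before being packed into batches. The field names, bucket size, and batch limit below are illustrative assumptions, not our exact values:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    doc_type: str         # e.g. "10-K", "W-2", "claim_form"
    page_count: int
    complexity_tier: int  # output of the routing classifier
    priority: int         # customer priority (0 = highest)


def batch_key(doc, page_bucket=50):
    # Documents sharing a key run back-to-back with the same model resident
    # and with similar memory footprints.
    return (doc.priority, doc.doc_type, doc.complexity_tier,
            doc.page_count // page_bucket)


def build_batches(docs, max_batch_pages=2000):
    groups = defaultdict(list)
    for doc in docs:
        groups[batch_key(doc)].append(doc)

    batches = []
    for _, group in sorted(groups.items()):   # higher-priority groups drain first
        current, pages = [], 0
        for doc in group:
            if current and pages + doc.page_count > max_batch_pages:
                batches.append(current)
                current, pages = [], 0
            current.append(doc)
            pages += doc.page_count
        if current:
            batches.append(current)
    return batches
```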
GPU Memory Management
GPU memory is precious, and a single large vision model might consume 8-16GB of VRAM. You can't keep every model loaded simultaneously, but swapping models in and out kills throughput, especially during high-volume periods.
Our approach is to partition models into tiers based on usage frequency.
Resident models (always loaded): Core OCR, basic layout detection, document classifier. These handle the first stage of every document and never get evicted.
Warm models (loaded on-demand, kept for batch duration): Table extraction, form field detection, specialized models for high-volume document types. Loaded when a batch needs them, kept resident until the batch completes.
Cold models (loaded per-document): Rare specializations like handwriting recognition for specific form types, language-specific models for non-English documents. Loaded only when needed, evicted immediately after.
This tiering keeps GPU utilization high while ensuring we can handle the full range of document types. Monitoring tracks model load times and usage patterns to optimize tier assignments over time, and the tiers adjust based on observed traffic patterns during burst periods.
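Here's a minimal sketch of the tiering idea, assuming a `load_model()` helper that moves weights onto the GPU; real eviction policy and VRAM accounting are more involved than this:

```python
class TieredModelCache:
    """Keeps resident models pinned, warm models for the life of a batch,
    and cold models only for the document that needs them."""

    def __init__(self, load_model, resident, warm, cold):
        self._load = load_model   # callable: model name -> model loaded on GPU
        self.resident = {name: load_model(name) for name in resident}
        self.warm_names = set(warm)
        self.cold_names = set(cold)
        self._warm_loaded = {}

    def get(self, name):
        if name in self.resident:
            return self.resident[name]
        if name in self.warm_names:
            if name not in self._warm_loaded:
                self._warm_loaded[name] = self._load(name)
            return self._warm_loaded[name]
        # Cold tier: loaded per document; the caller releases it right after use.
        assert name in self.cold_names, f"unknown model: {name}"
        return self._load(name)

    def end_of_batch(self):
        # Warm models are evicted once the batch that needed them completes.
        self._warm_loaded.clear()
```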
Horizontal Scaling and Work Distribution
Single-node processing hits a ceiling, and scaling horizontally introduces coordination challenges around distributing work efficiently across nodes.
Naive round-robin distribution ignores document characteristics. A node that gets unlucky with a batch of complex documents falls behind while other nodes sit idle. We use characteristic-based work distribution:
1. Incoming documents go to a classification queue
2. Lightweight classifier assigns complexity score and document type
3. Router assigns documents to node pools based on characteristics
4. Within each pool, work-stealing balances load across nodes

This keeps all nodes busy with appropriate work: simple-document nodes maintain high throughput on easy cases, while complex-document nodes have headroom for slow jobs without blocking the overall pipeline.
During burst periods, we can scale node pools independently based on the composition of incoming work. A hurricane claims surge might be 90% simple forms, so we scale the simple-document pool while keeping the complex-document pool stable. An earnings season surge has different composition, so scaling adjusts accordingly.
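A simplified sketch of pool-based dispatch is below. A single shared queue per pool stands in for per-node queues with work-stealing, and the pool names and autoscaling signal are illustrative assumptions:

```python
import queue


class NodePool:
    """One pool per document characteristic; each pool scales independently.
    A shared queue that idle workers pull from approximates work-stealing."""

    def __init__(self, name):
        self.name = name
        self.jobs = queue.Queue()

    def submit(self, doc):
        self.jobs.put(doc)

    def depth(self):
        return self.jobs.qsize()


POOLS = {tier: NodePool(tier) for tier in ("simple", "table", "complex")}


def dispatch(doc, complexity_tier):
    """Route a classified document to the pool matching its characteristics."""
    POOLS[complexity_tier].submit(doc)


def scale_signals():
    # Per-pool queue depth is the kind of signal an autoscaler would watch:
    # a hurricane claims burst inflates "simple", earnings season inflates "complex".
    return {name: pool.depth() for name, pool in POOLS.items()}
```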
Pre-Processing Pipeline
Before any document touches GPU, it passes through CPU-based pre-processing:
Quality assessment: Detect skew, blur, low resolution, truncation. Documents below quality thresholds get flagged for remediation rather than wasting GPU cycles on guaranteed failures.
Page classification: Identify blank pages, cover pages, table of contents, and other low-value pages that can be skipped or processed minimally.
Document splitting: Detect document boundaries in batched scans. A single PDF containing 50 concatenated invoices needs to be split before processing.
Deduplication: Hash-based detection of exact duplicates. Near-duplicate detection for documents that differ only in timestamps or reference numbers.
This pre-processing runs on CPU at roughly 200 pages/second per core and scales horizontally with ease. It filters out 10-15% of pages that would otherwise waste GPU resources, and during burst periods, that filtering translates directly to handling more real work on the same infrastructure.
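Here's a rough sketch of what that CPU-side filter might look like. The `is_blank` and `quality_score` callables stand in for lightweight classical checks (ink coverage, blur, skew) and are assumptions, not our exact heuristics:

```python
import hashlib


def page_hash(page_bytes):
    # Exact-duplicate detection: identical page bytes produce the same digest.
    return hashlib.sha256(page_bytes).hexdigest()


def prefilter(pages, is_blank, quality_score, min_quality=0.5):
    """CPU-only pass that decides which pages deserve GPU time.

    `pages` yields (page_id, page_bytes) tuples."""
    seen = set()
    keep, flagged = [], []
    for page_id, page_bytes in pages:
        digest = page_hash(page_bytes)
        if digest in seen:
            continue                      # exact duplicate, skip entirely
        seen.add(digest)
        if is_blank(page_bytes):
            continue                      # blank or cover page, skip
        if quality_score(page_bytes) < min_quality:
            flagged.append(page_id)       # remediation queue, not the GPU
            continue
        keep.append(page_id)
    return keep, flagged
```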
Memory Management for Large Documents
A 500-page prospectus doesn't fit comfortably in memory as a single object. Loading entire PDFs into memory causes allocation spikes that destabilize the system under load, exactly when stability matters most.
We process large documents as streams of page chunks. Each chunk (typically 10-20 pages) is processed independently, with cross-page context maintained through lightweight state objects rather than full page representations. Results are assembled after all chunks complete.
This streaming approach caps memory usage per document regardless of page count. A 10-page form and a 1,000-page contract use similar peak memory, just different processing duration. During burst periods with mixed document sizes, this prevents large documents from causing memory pressure that affects processing of smaller documents in the same batch.
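A minimal sketch of the chunked streaming pattern, assuming a page iterator and an `extract_chunk` function that carries a small context object between chunks; the chunk size and context contents are illustrative:

```python
def iter_chunks(page_iter, chunk_size=16):
    """Yield fixed-size lists of pages so peak memory is bounded by chunk_size,
    not by the document's total page count."""
    chunk = []
    for page in page_iter:
        chunk.append(page)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def process_document(page_iter, extract_chunk, chunk_size=16):
    """`extract_chunk(chunk, context)` returns (results, new_context), where
    `context` is a lightweight state object (e.g. open table headers, running
    section titles) rather than full page representations."""
    results, context = [], {}
    for chunk in iter_chunks(page_iter, chunk_size):
        chunk_results, context = extract_chunk(chunk, context)
        results.extend(chunk_results)
    return results   # assembled only after all chunks complete
```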
Building for the Surge
The difference between handling normal volume and handling burst volume is the difference between a system that works and a system that works when it matters.
Every architectural decision we've described serves the same goal: maintaining quality and latency during periods of extreme load. Model routing ensures compute matches document complexity, smart batching maximizes GPU utilization, tiered memory management keeps models ready, horizontal scaling matches capacity to demand, pre-processing filters waste, and streaming handles large documents without memory pressure.
None of these techniques are revolutionary in isolation, but the insight is that document processing at scale requires all of them working together, tuned for the specific burst patterns of the industries being served.
When the next hurricane hits, or earnings season peaks, or year-end tax documents flood in, the system needs to absorb the surge without degradation. That's what processing at scale actually means.
--
Processing documents at scale? Talk to our team about how Pulse handles high-volume extraction for enterprise customers.