Pulse's Approach to Document Intelligence

Document intelligence remains a critical challenge in enterprise automation, with millions of business-critical documents needing to be digitized for legacy industries. While LLMs have revolutionized many aspects of process automation, they fundamentally struggle with precise document ingestion.
We released a blog post last week detailing this phenomenon, and we’re now here to discuss our technical solution.
I. Our Architecture Approach
Our ingestion pipeline can be split into multiple stages:
- Layout understanding w/ component detection models
- Low-latency OCR models for individual extraction
- Reading order algorithms for various document types
- Table structure recognition + parsing models
- Fine-tuned VLMs for chart, table, figure conversion
We've trained a suite of specialized computer vision models for our internal layout segmentation. These models handle diverse document types, from construction blueprints to financial statements, and work in concert to segment documents into their component parts - tables, charts, lists, headers, and more.

Rather than treating all document elements uniformly, we deploy specialized low-latency OCR and object detection models optimized for each component type. This targeted approach dramatically improves accuracy in our testing while maintaining high processing speeds.
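The per-component routing described above can be pictured as a small dispatch table. This is a minimal sketch, not Pulse's implementation: the `Region` type, the component labels, and the handler functions are all hypothetical placeholders standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


@dataclass
class Region:
    """A detected layout component (hypothetical type for illustration)."""
    kind: str  # assumed labels: "text", "table", "chart", ...
    bbox: BBox


# Placeholder handlers; a real system would call a specialized model here.
def handle_text(region: Region) -> str:
    return f"text@{region.bbox}"


def handle_table(region: Region) -> str:
    return f"table@{region.bbox}"


def handle_chart(region: Region) -> str:
    return f"chart@{region.bbox}"


HANDLERS: Dict[str, Callable[[Region], str]] = {
    "text": handle_text,
    "table": handle_table,
    "chart": handle_chart,
}


def route(regions: List[Region]) -> List[str]:
    """Send each region to the handler for its kind, defaulting to text."""
    return [HANDLERS.get(r.kind, handle_text)(r) for r in regions]
```

Keeping the mapping in one dictionary means a new component type only needs a new handler and one new entry; unknown kinds fall back to plain text extraction rather than failing.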
The most important piece of the puzzle is reading order, and this is where traditional OCR and other providers fail on multicolumn documents and unstructured layouts. Imagine reading a newspaper – as human readers, we know exactly how our eyes will move through each section in turn. We invested heavily in this problem, and our final reading order algorithm correctly processes complex layouts that other systems often misinterpret. For a retrieval system especially, having the information in the right order keeps contextual understanding and semantics intact.
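To make the multicolumn problem concrete, here is a minimal sketch of a column-aware ordering. It is a simple heuristic chosen for illustration, not Pulse's reading order algorithm: the `Block` type and the 50% horizontal-overlap threshold are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A text block with its bounding box (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks: List[Block]) -> List[Block]:
    """Group blocks into columns by horizontal overlap, then read each
    column top to bottom, columns left to right."""
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b.x0):
        for col in columns:
            col_x0 = min(b.x0 for b in col)
            col_x1 = max(b.x1 for b in col)
            overlap = min(block.x1, col_x1) - max(block.x0, col_x0)
            # Assumed rule: join a column if over half the block lies inside it.
            if overlap > 0.5 * (block.x1 - block.x0):
                col.append(block)
                break
        else:
            columns.append([block])
    columns.sort(key=lambda col: min(b.x0 for b in col))
    return [b for col in columns for b in sorted(col, key=lambda b: b.y0)]


page = [
    Block("A", 0, 0, 100, 50),
    Block("C", 120, 0, 220, 50),
    Block("B", 0, 60, 100, 110),
    Block("D", 120, 60, 220, 110),
]
naive = [b.text for b in sorted(page, key=lambda b: (b.y0, b.x0))]
ordered = [b.text for b in reading_order(page)]
```

On this two-column page, the naive top-to-bottom sort interleaves the columns (A, C, B, D), while the column-aware version reads A, B, C, D. A heuristic this simple breaks down on full-width headings, sidebars, and irregular layouts, which is where a more careful algorithm is needed.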
Table extraction presents perhaps the most formidable challenge in document processing. We've invested significant resources in developing proprietary table structure models that can accurately interpret even the most complex, nested tables. This has been a very active research field as of late, and we look forward to continuing to build on the latest research!
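One reason tables are hard is that merged cells have to be resolved before the data can be used downstream. The sketch below shows only that last step, assuming a structure model has already produced cells with row and column spans; the `Cell` type and its fields are hypothetical, and this is not Pulse's table model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """A recognized table cell (hypothetical output of a structure model)."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def to_grid(cells: List[Cell]) -> List[List[str]]:
    """Expand spanning cells into a dense row-by-column grid."""
    if not cells:
        return []
    n_rows = max(c.row + c.rowspan for c in cells)
    n_cols = max(c.col + c.colspan for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Repeat the cell's text in every grid position it covers.
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                grid[r][k] = c.text
    return grid


cells = [
    Cell(0, 0, "Revenue", colspan=2),
    Cell(1, 0, "Q1"),
    Cell(1, 1, "Q2"),
]
grid = to_grid(cells)
```

Repeating the header text across the columns it spans keeps each column self-describing, which helps when rows are later chunked or embedded individually.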
II. Processing Pipeline
Ingestion is just the first step in building any data pipeline. It’s important to consider where the data originates, where it is going, and every intermediate step in between.
Pulse offers the following post-processing steps that prepare the data for downstream applications:
- Chunking
- Deduplication
- Embedding
- Vector storage
Our customers have seen tremendous success with Pulse’s processing pipeline for retrieval systems – this is a core use case we see continuing in the future.
We offer multiple chunking strategies to optimize the extracted content for different use cases. The deduplication system eliminates redundant information while preserving context, ensuring clean, reliable data output. We also have an exciting partnership coming out soon that will let us directly create and manage vector stores for our customers once embeddings are complete, so customers can move from documents to deployed systems with minimal friction.
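As a rough illustration of the chunking and deduplication steps, the sketch below uses fixed-size word windows with overlap and exact-match deduplication on normalized text. These are common baseline techniques picked for the example, not a description of Pulse's strategies; the function names and default sizes are assumptions.

```python
import hashlib
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def deduplicate(chunks: List[str]) -> List[str]:
    """Drop chunks whose case- and whitespace-normalized text was seen before."""
    seen = set()
    unique: List[str] = []
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)
unique = deduplicate(["Hello  World", "hello world", "other"])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks. Exact hashing only catches verbatim repeats such as headers and footers; near-duplicate detection would need a similarity measure instead.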
III. Performance Metrics
Our initial benchmarking demonstrates significant performance advantages over existing solutions. In a comprehensive evaluation across 12,000 diverse business documents, Pulse substantially outperformed Unstructured, Amazon Textract, and OpenAI's o1 model. These results encompass both standard OCR metrics and complex table extraction tasks. Our team is releasing our complete evaluation methodology in an upcoming benchmark. The performance gap widens further on documents containing complex financial and technical data, where precision in character and structure recognition is crucial.
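For readers unfamiliar with standard OCR metrics, character error rate (CER) is one of the most common: the edit distance between the reference text and the OCR output, divided by the reference length. The sketch below shows how it is typically computed; it is offered as background and is not a statement of which metrics the benchmark above used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


cer = character_error_rate("invoice", "lnvoice")
```

Here a single misread character in a seven-character word gives a CER of 1/7. On financial documents even one wrong digit can matter, which is why character-level precision is tracked separately from overall layout quality.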
__
The limitations of current LLM-based document processing systems led us to develop this hybrid approach, combining classical computer vision algorithms with modern transformer architectures. The result is Pulse, a reliable system that handles the complexities of real-world documents (of all formats) while maintaining the speed and scalability needed for enterprise deployments.
Want to learn more about Pulse? Book a demo here.