One of the hardest problems in document extraction isn't generating output. It's knowing whether that output is actually correct as systems evolve.
Schemas change. Prompts get updated. New document types get added to the pipeline. And somewhere along the way, accuracy drifts. Most teams don't catch it until something breaks downstream or a customer flags an issue.
The standard approach is manual spot checks, maybe some delayed feedback loops. That works when you're processing hundreds of documents. It doesn't scale when you're processing millions.
Today we're launching Accuracy Scorer to bring systematic evaluation directly into the Pulse platform.
How It Works
Accuracy Scorer lets you upload ground truth datasets (CSV or JSON files containing known-correct extractions) and automatically score your extraction outputs against them.
The system measures:
- Field-level precision and recall: not just "did we get the document right," but "which specific fields are we nailing and which need attention."
- Semantic and JSON similarity: scoring accounts for equivalent values and structural alignment, not just exact string matching.
- Accuracy over time: because Accuracy Scorer integrates with the Extraction Library, you can track how accuracy changes across schema versions. When someone updates a prompt or modifies a schema, you see the impact immediately.
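To make field-level scoring concrete, here is a minimal Python sketch of how per-field precision and recall could be computed against a ground truth set. It is an illustration, not Pulse's implementation: the function names (`normalize`, `field_scores`), the document IDs, and the field names are all hypothetical, and the simple normalization step only stands in for the richer semantic matching described above.

```python
import math
from collections import defaultdict


def normalize(value):
    """Hypothetical normalization so equivalent values compare equal.

    Ignores case and extra whitespace, and treats '1,200.00' and '1200'
    as the same number. Empty values become None (meaning "missing").
    """
    if value is None:
        return None
    text = " ".join(str(value).split()).casefold()
    if text == "":
        return None
    try:
        number = float(text.replace(",", ""))
    except ValueError:
        return text
    # Keep strings like "nan" or "inf" as text so they still compare equal.
    return number if math.isfinite(number) else text


def field_scores(ground_truth, predictions):
    """Compute precision and recall per field across all documents.

    Both arguments map a document ID to a {field: value} dict.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for doc_id, truth in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for field in set(truth) | set(predicted):
            expected = normalize(truth.get(field))
            actual = normalize(predicted.get(field))
            c = counts[field]
            if actual is not None and actual == expected:
                c["tp"] += 1  # extracted and correct
            else:
                if actual is not None:
                    c["fp"] += 1  # extracted, but wrong or unexpected
                if expected is not None:
                    c["fn"] += 1  # known-correct value was missed

    scores = {}
    for field, c in counts.items():
        extracted = c["tp"] + c["fp"]
        expected_total = c["tp"] + c["fn"]
        scores[field] = {
            "precision": c["tp"] / extracted if extracted else 0.0,
            "recall": c["tp"] / expected_total if expected_total else 0.0,
        }
    return scores


# Illustrative data only; these invoices and fields are made up.
ground_truth = {
    "inv-1": {"vendor": "Acme Corp", "total": "1,200.00"},
    "inv-2": {"vendor": "Globex", "total": "75.50"},
}
predictions = {
    "inv-1": {"vendor": "acme corp", "total": "1200"},
    "inv-2": {"vendor": "Initech", "total": None},
}

for field, score in sorted(field_scores(ground_truth, predictions).items()):
    print(field, score)
```

In this toy run, `vendor` scores 0.5 precision and 0.5 recall (one wrong value out of two), while `total` scores 1.0 precision and 0.5 recall: the one value that was extracted is correct, but the second was missed entirely. That split is what a document-level pass/fail score hides.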
Why This Matters
Extraction pipelines are living systems. They evolve constantly: new requirements, new edge cases, new document formats. Without systematic measurement, you're flying blind.
Accuracy Scorer gives you:
- Objective benchmarking before pushing changes to production
- Regression detection when updates cause unexpected drift
- Version-level traceability so you understand exactly what changed and when
- Confidence in quality that doesn't depend on manual review bandwidth
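As a rough illustration of regression detection, the sketch below compares per-field accuracy between two schema versions and flags any field that dropped by more than a tolerance. The function name `find_regressions`, the 0.02 default tolerance, and the sample scores are assumptions made for this example, not Accuracy Scorer's actual behavior or API.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {field: drop} for fields whose score fell by more than `tolerance`.

    `baseline` and `candidate` map field names to accuracy scores in [0, 1].
    A field missing from `candidate` is treated as scoring 0.0.
    """
    regressions = {}
    for field, old_score in baseline.items():
        drop = old_score - candidate.get(field, 0.0)
        if drop > tolerance:
            regressions[field] = round(drop, 4)
    return regressions


# Made-up scores for two hypothetical schema versions.
v1_scores = {"vendor": 0.98, "total": 0.95, "due_date": 0.91}
v2_scores = {"vendor": 0.98, "total": 0.88, "due_date": 0.90}

print(find_regressions(v1_scores, v2_scores))  # {'total': 0.07}
```

A check like this could run before a prompt or schema change ships, blocking the release when the returned dict is not empty.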
If you're treating extraction as a production system (and you should be), this is the evaluation layer that's been missing.
Accuracy Scorer is available now in the Pulse platform.
Want to see Accuracy Scorer in action? Schedule a demo.
