Document Intelligence in Production: Automating Back-Office Paperwork Safely
TL;DR: A real document-intelligence system is not a demo that reads one clean PDF. It is a pipeline that survives messy scans, untrusted attachments, and edge cases without ever corrupting your data. We were brought in as a task force to finish and productionize exactly that for an EU flight-compensation firm: read every incoming letter, work out what kind of document it is, pull the key fields, and write them back into the CRM, or flag a human when anything is uncertain. The hard part was not the AI. It was making it safe.
Most businesses that drown in paperwork already know which documents they receive. The pain is that a person has to open each one, recognise it, copy the important fields somewhere, and file it. It is slow, it is boring, and it scales linearly with volume: every new case means more manual reading.
That was the situation for an EU flight-compensation firm handling claims under EU Regulation 261/2004. A steady stream of PDFs arrived (objection notices, court-cost invoices, handover letters, booking confirmations, customer claim forms), and each one had to be identified and keyed into the CRM by hand. A previous team had built most of the surrounding system but could never get it to run end to end, because the engine that read the documents never worked reliably.
//The pipeline, in plain terms
Every incoming document goes through one async pass:
- 1. Read it (OCR). Pull the text off the scan or PDF.
- 2. Classify it. Decide which of the known document types it is.
- 3. Extract the fields. Pull out the data the business actually needs: references, dates, amounts, names, addresses, flight details.
- 4. Write it back, or escalate. Apply the right labels and case data in the CRM, or route it to a human if anything is uncertain.
Simple to describe. The value is entirely in how robustly each step behaves when the input is not clean.
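The four steps above can be sketched as a single async function. This is an illustrative outline, not the production code: the stage callables (`read_text`, `classify`, `extract`) and the `Outcome` type are hypothetical names, and the real stages would call OCR and an LLM.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional


@dataclass
class Outcome:
    action: str                      # "write_back" or "manual_review"
    doc_type: Optional[str] = None
    fields: dict = field(default_factory=dict)
    reason: str = ""


async def process_document(
    pdf_bytes: bytes,
    read_text: Callable[[bytes], Awaitable[str]],
    classify: Callable[[str], Awaitable[Optional[str]]],
    extract: Callable[[str, str], Awaitable[dict]],
) -> Outcome:
    """Run the four steps in one pass; stop and escalate on any uncertainty."""
    text = await read_text(pdf_bytes)                  # step 1: read
    if not text.strip():
        return Outcome("manual_review", reason="no readable text")

    doc_type = await classify(text)                    # step 2: classify
    if doc_type is None:
        return Outcome("manual_review", reason="unrecognised document")

    fields = await extract(text, doc_type)             # step 3: extract
    if not fields:
        return Outcome("manual_review", doc_type, reason="no fields extracted")

    return Outcome("write_back", doc_type, fields)     # step 4: write back
```

Passing the stages in as arguments keeps the control flow testable with stubs, which is one way to exercise the escalation paths without touching a real OCR engine or CRM.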
//Trick one: do not OCR what you do not have to
Not every PDF is a scanned image. Roughly half of the incoming documents were born-digital: they already contained a real text layer. Running heavy vision OCR on those is slow and can even be less accurate than just reading the text that is already there.
So the first thing the pipeline does is triage: if a document already has a usable text layer, it reads it directly and skips OCR entirely. Only true scans go to the vision model. That one decision is a large throughput and accuracy win before any AI even runs. The lesson generalises: the fastest, most accurate step is often the one you can avoid.
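A minimal sketch of that triage decision follows. It assumes the per-page text has already been pulled out by whatever PDF library is in use (not shown here), and the thresholds are illustrative guesses rather than the values used in the project.

```python
def has_usable_text_layer(page_texts, min_chars_per_page=50, min_printable_ratio=0.9):
    """Heuristic check: is there enough clean embedded text to skip OCR?

    page_texts: list of strings, one per page, as returned by a PDF text extractor.
    Both thresholds are assumptions chosen for illustration.
    """
    if not page_texts:
        return False
    # Ignore whitespace so blank padding does not count as content.
    stripped = "".join("".join(page_texts).split())
    if not stripped:
        return False
    if len(stripped) < min_chars_per_page * len(page_texts):
        return False
    # A broken text layer often shows up as control characters or junk glyphs.
    printable = sum(ch.isprintable() for ch in stripped)
    return printable / len(stripped) >= min_printable_ratio


def choose_reader(page_texts):
    """Return which path the document should take: 'text_layer' or 'ocr'."""
    return "text_layer" if has_usable_text_layer(page_texts) else "ocr"
```

For example, `choose_reader(["A" * 200])` returns `"text_layer"`, while `choose_reader(["", ""])` returns `"ocr"` because a scan with no embedded text has nothing to read directly.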
//Trick two: deterministic rules that prime, never override
Pure LLM classification is powerful but occasionally overconfident, especially on near-empty or ambiguous documents. So before the model runs, a deterministic keyword pre-filter looks for strong, unambiguous signals and uses them to prime the classifier, pointing it in the right direction, without ever hard-coding the final answer.
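One way to picture the priming step: the pre-filter produces hints, and the hints go into the classifier prompt as suggestions rather than as the answer. The signal phrases, type names, and prompt wording below are all hypothetical; the actual keyword lists are not given here.

```python
import re

# Hypothetical signal phrases per document type (illustrative only).
STRONG_SIGNALS = {
    "objection_notice": [r"\bnotice of objection\b"],
    "court_cost_invoice": [r"\bcourt costs?\b", r"\binvoice\b"],
    "booking_confirmation": [r"\bbooking confirmation\b"],
}


def keyword_hints(text):
    """Return document types whose signal phrases appear, strongest first."""
    hits = {}
    for doc_type, patterns in STRONG_SIGNALS.items():
        matched = [p for p in patterns if re.search(p, text, re.IGNORECASE)]
        if matched:
            hits[doc_type] = len(matched)
    return sorted(hits, key=hits.get, reverse=True)


def build_classifier_prompt(text, hints):
    """Build the prompt; hints are offered to the model, never forced on it."""
    known = ", ".join(STRONG_SIGNALS)
    lines = [f"Classify the document as one of: {known}, or unknown."]
    if hints:
        lines.append(
            "Keyword pre-filter suggests (hint only, may be wrong): " + ", ".join(hints)
        )
    lines.append("Document text:")
    lines.append(text)
    return "\n".join(lines)
```

The model still makes the final call, so a misleading keyword in an unrelated letter cannot force a wrong label on its own.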
The same pattern applies to extraction. A rule-based pre-filter fills in fields it can prove from the text and leaves everything else to the model, with each value tagged by how it was derived. You get the reliability of rules where rules are safe, and the flexibility of an LLM everywhere else. Rules and models are not rivals; the rules make the model more trustworthy. This is the same bounded-signal philosophy behind our probability engine work.
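The extraction side can be sketched the same way. Here a rule accepts a value only when the text contains exactly one candidate, and every field records whether it came from a rule or from the model. The field names, the simplified regular expressions, and the choice to keep the rule value when both sources supply a field are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: str  # "rule" or "model"


# Simplified patterns for illustration; real flight numbers and dates vary more.
FLIGHT_NO = re.compile(r"\b([A-Z]{2})\s?(\d{2,4})\b")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def rule_extract(text):
    """Fill only the fields that the text determines unambiguously."""
    found = {}
    flights = {m.group(1) + m.group(2) for m in FLIGHT_NO.finditer(text)}
    if len(flights) == 1:  # two different candidates means the rule cannot decide
        found["flight_number"] = FieldValue(flights.pop(), "rule")
    dates = set(ISO_DATE.findall(text))
    if len(dates) == 1:
        found["date"] = FieldValue(dates.pop(), "rule")
    return found


def merge_fields(rule_fields, model_fields):
    """Model fills the gaps; fields already proven by a rule keep their value and tag."""
    merged = {name: FieldValue(value, "model") for name, value in model_fields.items()}
    merged.update(rule_fields)
    return merged
```

Because each value carries its `source`, a reviewer or an audit log can later tell which fields were matched deterministically and which were inferred.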
//Trick three: keep the brain away from the credentials
This system opens untrusted attachments for a living. That is precisely the kind of component you do not want holding the keys to your CRM.
So the architecture splits cleanly. The part that reads and understands documents has no CRM access at all. When it finishes, it sends its result over an HMAC-signed callback to a separate, single workflow that is the only thing allowed to write to the CRM. The signing means the receiver can verify the result genuinely came from the pipeline and was not tampered with in transit.
The benefit is structural, not cosmetic: the box that handles risky input is permanently isolated from the system of record. Even a worst-case compromise of the document reader cannot reach the database directly.
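A signed callback of this kind can be sketched with Python's standard `hmac` module. The payload shape, the choice of SHA-256, and the function names are assumptions for illustration; how the shared secret is stored and how the signature travels (for example in a request header) are not covered.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret: bytes) -> tuple[bytes, str]:
    """Pipeline side: serialise the result and compute its HMAC-SHA256 signature."""
    # Canonical JSON so the exact signed bytes are what gets sent.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_callback(body: bytes, signature: str, secret: bytes) -> bool:
    """Receiver side: recompute the signature over the raw body and compare."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking match position.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The receiver verifies the raw bytes before parsing them, and only a verified body would be handed to the workflow that writes to the CRM. The document reader needs the signing secret but never a CRM credential.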
//Trick four: when in doubt, ask a human
The single most important behaviour in the whole system is what it does when it is not sure.
If a document is unrecognised, or a required field is missing, or the text is too thin to trust, the pipeline does not guess and write something plausible. It routes the item to manual review and writes nothing. We proved this with deliberately broken inputs (garbage files and malformed documents) and watched them land safely in the review queue instead of polluting real case data.
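The three conditions in that paragraph map naturally onto a small gate that runs before any write. The required-field table and the minimum text length below are hypothetical placeholders, not the project's actual configuration.

```python
from dataclasses import dataclass, field

# Hypothetical required fields per document type (illustrative only).
REQUIRED_FIELDS = {
    "court_cost_invoice": {"case_reference", "amount", "invoice_date"},
    "booking_confirmation": {"booking_reference", "flight_number"},
}
MIN_TEXT_CHARS = 200  # assumed threshold for "too thin to trust"


@dataclass
class Decision:
    action: str                       # "write_back" or "manual_review"
    reasons: list = field(default_factory=list)


def decide(doc_type, fields, text):
    """Allow a CRM write only when no doubt condition applies."""
    reasons = []
    if doc_type not in REQUIRED_FIELDS:
        reasons.append("unrecognised document type")
    else:
        missing = sorted(f for f in REQUIRED_FIELDS[doc_type] if not fields.get(f))
        if missing:
            reasons.append("missing required fields: " + ", ".join(missing))
    if len(text.strip()) < MIN_TEXT_CHARS:
        reasons.append("text too thin to trust")
    return Decision("manual_review" if reasons else "write_back", reasons)
```

Collecting every reason, rather than stopping at the first, gives the human reviewer the full picture of why an item was held back.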
This is the difference between automation people trust and automation people quietly turn off. A system that is 95% automated and never corrupts data is far more valuable than one that is 100% automated and occasionally writes nonsense into your CRM. Safe by design beats fully autonomous.
//What "done" actually looks like
The result is a live, end-to-end flow: a new PDF lands, gets read, classified, and extracted in one pass, and a structured result is written back to the CRM under proper labels, with anything uncertain handed to a person. Born-digital documents skip OCR, untrusted input never touches credentials, and every output is auditable.
If your team spends hours moving data from documents into a system by hand (invoices, contracts, claims, leases, intake forms), that is exactly the kind of workflow this approach removes. See the full flight-compensation case study, our broader take on extracting data from contracts and invoices, or tell us about your paperwork and we will tell you honestly whether it is worth automating.