PDF to Structured JSON Without ML Training
Every team that ships a PDF processing feature hits the same wall: OCR returns a string of words, but the user wants { "invoice_number": "INV-1234", "total": 4582.00, "line_items": [...] }. LLMs closed that gap in 2024-2025; this post covers the patterns that actually work in production in 2026.
The four eras of PDF extraction
- Text PDFs (1995-2010) — pdftotext gives you the text. Parsing structure is custom regex.
- Scanned PDFs (2010-2018) — Tesseract OCR. Layout is gone. You parse word streams.
- Layout-aware OCR (2018-2023) — Google Document AI, AWS Textract, Azure Form Recognizer. Tables sort of work. Custom models cost $$$.
- LLM extraction (2023-2026) — GPT-4V, Claude, Gemini handle complex layouts directly. Schema in, JSON out.
If you're starting fresh in 2026, you start at era 4. Era 3 is still useful for high-volume regulated workflows.
The "send the page to an LLM" pattern
The pattern is straightforward:
// Pseudocode — the shape carries across OpenAI, Anthropic, Google, and Mistral,
// but exact parameter names differ per provider
const pageImage = await pdfPageToImage(pdfBuffer, pageIndex); // one page rendered to a base64 PNG
const response = await llm.chat.completions.create({
  model: "claude-3-5-sonnet-20241022",
  messages: [
    { role: "user", content: [
      { type: "image", source: { type: "base64", media_type: "image/png", data: pageImage } },
      { type: "text", text: extractionPrompt }
    ]}
  ],
  // Schema enforcement: response_format on OpenAI/Mistral, forced tool use on Anthropic
  response_format: { type: "json_schema", schema: invoiceSchema }
});
Where invoiceSchema is your strongly-typed JSON Schema:
const invoiceSchema = {
  type: "object",
  required: ["invoice_number", "issue_date", "total_amount", "line_items"],
  properties: {
    invoice_number: { type: "string" },
    issue_date: { type: "string", format: "date" },
    vendor: { type: "object", properties: { name: { type: "string" }, vat_id: { type: "string" } } },
    customer: { type: "object", properties: { name: { type: "string" } } },
    total_amount: { type: "number" },
    currency: { type: "string", enum: ["EUR", "USD", "GBP", "CHF"] },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" },
          total: { type: "number" }
        }
      }
    }
  }
};
Three lessons from running this on 50,000+ documents:
1. Page-by-page beats whole-document
LLMs degrade on long contexts. Split PDFs into pages, extract per page, then merge by document. You can parallelize.
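Sketched in TypeScript, the fan-out looks like this. `extractPage` stands in for your per-page LLM call (a hypothetical wrapper, not a real SDK method), and a production version would cap concurrency to respect provider rate limits:

```typescript
type PageResult = { page: number; fields: Record<string, unknown> };

// Fan out one extraction call per page, keeping the page index
// so results can be merged back together in document order.
async function extractDocument(
  pageImages: string[],
  extractPage: (image: string, page: number) => Promise<Record<string, unknown>>
): Promise<PageResult[]> {
  return Promise.all(
    pageImages.map(async (img, i) => ({ page: i, fields: await extractPage(img, i) }))
  );
}
```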
2. Image vs text mode
For text PDFs (text layer present): you can pass the extracted text. Cheaper, but loses layout.
For scanned PDFs: image mode is mandatory. ~3-5x cost.
For mixed PDFs (forms with stamps, signatures): always image mode.
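Whether a page has a usable text layer can be decided with a simple heuristic: extract the text layer first, and fall back to image mode when it is nearly empty. The 50-character threshold below is an assumption to tune against your own corpus:

```typescript
// Choose text mode only when the extracted text layer looks substantive;
// scanned pages typically yield an empty or near-empty text layer.
function chooseMode(extractedText: string): "text" | "image" {
  const chars = extractedText.replace(/\s/g, "").length; // ignore whitespace
  return chars >= 50 ? "text" : "image";
}
```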
3. JSON schema enforcement is everything
Without schema enforcement, you get hallucinated fields, missing fields, wrong types. With it, you get deterministic outputs you can validate. OpenAI's response_format, Anthropic's tool use forcing, and Mistral's structured output all do this.
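Even with provider-side enforcement, it pays to re-validate the model's JSON before trusting it. Below is a minimal hand-rolled check mirroring the required subset of the schema above; a real system would run the full JSON Schema through a validator such as Ajv:

```typescript
// Verify that required fields exist with the right primitive types
// before the extraction result enters downstream processing.
function validateInvoice(doc: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const required: Array<[string, string]> = [
    ["invoice_number", "string"],
    ["issue_date", "string"],
    ["total_amount", "number"],
  ];
  for (const [field, type] of required) {
    if (typeof doc[field] !== type) errors.push(`${field}: expected ${type}`);
  }
  if (!Array.isArray(doc["line_items"])) errors.push("line_items: expected array");
  return errors;
}
```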
What goes wrong (and how to handle it)
Hallucinations
LLMs invent invoice numbers when they can't read them. Always include in the prompt: "If a field cannot be confidently determined from the document, set it to null. Do not guess." Then validate downstream — if invoice_number is null, flag for human review.
Confidence scores
Add a _confidence field per extracted entity:
{
  "invoice_number": "INV-1234",
  "_confidence": { "invoice_number": 0.97, "total_amount": 0.99 }
}
Use these to route low-confidence docs to human review. Don't auto-process below 0.85.
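The routing rule itself is a one-liner; the default 0.85 cutoff below just mirrors the threshold recommended above:

```typescript
// A document needs human review if any extracted field
// falls below the confidence threshold.
function needsHumanReview(
  confidence: Record<string, number>,
  threshold = 0.85
): boolean {
  return Object.values(confidence).some((c) => c < threshold);
}
```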
Multi-page documents
Extract JSON per page, then make a final LLM call to merge:
const merged = await llm.chat({
  messages: [{ role: "user", content:
    `Merge these per-page extractions into a single invoice JSON. Reconcile line items across pages. Pages: ${JSON.stringify(perPageResults)}`
  }]
});
Cost optimization
Image-mode extraction is ~$0.005-0.02 per page. For 100K pages/month that's $500-2000. Optimizations:
- Pre-filter empty pages (blank, separator, cover) — saves 15-25%
- Use cheaper models for triage (Haiku, Flash, Mistral Small) — route only complex pages to Sonnet/Opus
- Cache by document hash — same PDF processed twice = no extra cost
- Batch with prompt caching (Anthropic, OpenAI) — 50% off on repeated system prompts
Compliance
For regulated workflows (invoices, contracts, healthcare):
- Don't use models that train on your inputs (no-training is the default on OpenAI and Anthropic enterprise tiers; elsewhere you may need to opt out)
- EU data residency: Mistral, Anthropic Frankfurt region, Azure OpenAI EU
- Audit trail: log model + version + prompt hash + input hash + output
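An audit record can hash the prompt and the raw input so the log itself stays free of document content. The shape below is illustrative, not a standard; the field names are assumptions:

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  model: string;
  modelVersion: string;
  promptHash: string;  // sha256 of the prompt text
  inputHash: string;   // sha256 of the raw PDF bytes
  output: unknown;
  timestamp: string;
}

// Build one immutable record per extraction call.
function auditRecord(
  model: string,
  modelVersion: string,
  prompt: string,
  input: Buffer,
  output: unknown
): AuditRecord {
  const sha = (b: string | Buffer) =>
    createHash("sha256").update(b).digest("hex");
  return {
    model,
    modelVersion,
    promptHash: sha(prompt),
    inputHash: sha(input),
    output,
    timestamp: new Date().toISOString(),
  };
}
```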
When NOT to use LLMs
- Standardized forms with strict layout (W2, 1099) — Document AI templates are cheaper and more reliable
- Tables only (price lists, schedules) — Camelot/Tabula are deterministic
- High-volume, low-margin (>1M pages/month, <$0.001/page budget) — train a custom model
What we built
We needed this for our portfolio's contract review workflow, so we wrapped the pattern above into an API: send a PDF, get JSON back, with schema enforcement, confidence scoring, and EU data residency. There's a free tier (100 pages/month, no card required) at parseflow.dev. Same pattern, packaged.
If you build it yourself, the patterns above are the difference between a 60% accurate prototype and a 95%+ accurate production system. The cost is mostly API tokens — engineering time is one weekend.
Antonio Altomonte builds developer APIs at DevToolsmith. PDF to JSON API: parseflow.dev.
This article was originally published by DEV Community and written by DevToolsmith.