PDF to Structured JSON Without ML Training
Every team that ships a PDF processing feature hits the same wall: OCR returns a string of words, but the user wants { "invoice_number": "INV-1234", "total": 4582.00, "line_items": [...] }. LLMs closed that gap in 2024-2025; this post covers the patterns that actually work in production in 2026.
The four eras of PDF extraction
- Text PDFs (1995-2010) — pdftotext gives you the text. Parsing structure is custom regex.
- Scanned PDFs (2010-2018) — Tesseract OCR. Layout is gone. You parse word streams.
- Layout-aware OCR (2018-2023) — Google Document AI, AWS Textract, Azure Form Recognizer. Tables sort of work. Custom models cost $$$.
- LLM extraction (2023-2026) — GPT-4V, Claude, Gemini handle complex layouts directly. Schema in, JSON out.
If you're starting fresh in 2026, you start at era 4. Era 3 is still useful for high-volume regulated workflows.
The "send the page to an LLM" pattern
The pattern is straightforward:
// Pseudocode — the shape carries across OpenAI, Anthropic, Google, and Mistral,
// but exact parameter names differ per provider
const pageImage = await pdfPageToImage(pdfBuffer, pageIndex); // one page rendered to a base64 PNG
const response = await llm.chat.completions.create({
  model: "claude-3-5-sonnet-20241022",
  messages: [
    { role: "user", content: [
      { type: "image", source: { type: "base64", media_type: "image/png", data: pageImage } },
      { type: "text", text: extractionPrompt }
    ]}
  ],
  // Schema enforcement: response_format on OpenAI/Mistral, forced tool use on Anthropic
  response_format: { type: "json_schema", schema: invoiceSchema }
});
Where invoiceSchema is your strongly-typed JSON Schema:
const invoiceSchema = {
  type: "object",
  required: ["invoice_number", "issue_date", "total_amount", "line_items"],
  properties: {
    invoice_number: { type: "string" },
    issue_date: { type: "string", format: "date" },
    vendor: { type: "object", properties: { name: { type: "string" }, vat_id: { type: "string" } } },
    customer: { type: "object", properties: { name: { type: "string" } } },
    total_amount: { type: "number" },
    currency: { type: "string", enum: ["EUR", "USD", "GBP", "CHF"] },
    line_items: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unit_price: { type: "number" },
          total: { type: "number" }
        }
      }
    }
  }
};
Three lessons from running this on 50,000+ documents:
1. Page-by-page beats whole-document
LLMs degrade on long contexts. Split PDFs into pages, extract per page, then merge by document. You can parallelize.
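Sketched in TypeScript, the fan-out looks like this. `extractPage` stands in for your per-page LLM call (a hypothetical wrapper, not a real SDK method), and a production version would cap concurrency to respect provider rate limits:

```typescript
type PageResult = { page: number; fields: Record<string, unknown> };

// Fan out one extraction call per page, keeping the page index
// so results can be merged back together in document order.
async function extractDocument(
  pageImages: string[],
  extractPage: (image: string, page: number) => Promise<Record<string, unknown>>
): Promise<PageResult[]> {
  return Promise.all(
    pageImages.map(async (img, i) => ({ page: i, fields: await extractPage(img, i) }))
  );
}
```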
2. Image vs text mode
For text PDFs (text layer present): you can pass the extracted text. Cheaper, but loses layout.
For scanned PDFs: image mode is mandatory. ~3-5x cost.
For mixed PDFs (forms with stamps, signatures): always image mode.
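Whether a page has a usable text layer can be decided with a simple heuristic: extract the text layer first, and fall back to image mode when it is nearly empty. The 50-character threshold below is an assumption to tune against your own corpus:

```typescript
// Choose text mode only when the extracted text layer looks substantive;
// scanned pages typically yield an empty or near-empty text layer.
function chooseMode(extractedText: string): "text" | "image" {
  const chars = extractedText.replace(/\s/g, "").length; // ignore whitespace
  return chars >= 50 ? "text" : "image";
}
```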
3. JSON schema enforcement is everything
Without schema enforcement, you get hallucinated fields, missing fields, wrong types. With it, you get deterministic outputs you can validate. OpenAI's response_format, Anthropic's tool use forcing, and Mistral's structured output all do this.
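Even with provider-side enforcement, it pays to re-validate the model's JSON before trusting it. Below is a minimal hand-rolled check mirroring the required subset of the schema above; a real system would run the full JSON Schema through a validator such as Ajv:

```typescript
// Verify that required fields exist with the right primitive types
// before the extraction result enters downstream processing.
function validateInvoice(doc: Record<string, unknown>): string[] {
  const errors: string[] = [];
  const required: Array<[string, string]> = [
    ["invoice_number", "string"],
    ["issue_date", "string"],
    ["total_amount", "number"],
  ];
  for (const [field, type] of required) {
    if (typeof doc[field] !== type) errors.push(`${field}: expected ${type}`);
  }
  if (!Array.isArray(doc["line_items"])) errors.push("line_items: expected array");
  return errors;
}
```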
What goes wrong (and how to handle it)
Hallucinations
LLMs invent invoice numbers when they can't read them. Always include in the prompt: "If a field cannot be confidently determined from the document, set it to null. Do not guess." Then validate downstream — if invoice_number is null, flag for human review.
Confidence scores
Add a _confidence field per extracted entity:
{
  "invoice_number": "INV-1234",
  "_confidence": { "invoice_number": 0.97, "total_amount": 0.99 }
}
Use these to route low-confidence docs to human review. Don't auto-process below 0.85.
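The routing rule itself is a one-liner; the default 0.85 cutoff below just mirrors the threshold recommended above:

```typescript
// A document needs human review if any extracted field
// falls below the confidence threshold.
function needsHumanReview(
  confidence: Record<string, number>,
  threshold = 0.85
): boolean {
  return Object.values(confidence).some((c) => c < threshold);
}
```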
Multi-page documents
Extract JSON per page, then make a final LLM call to merge:
const merged = await llm.chat({
  messages: [{ role: "user", content:
    `Merge these per-page extractions into a single invoice JSON. Reconcile line items across pages. Pages: ${JSON.stringify(perPageResults)}`
  }]
});
Cost optimization
Image-mode extraction is ~$0.005-0.02 per page. For 100K pages/month that's $500-2000. Optimizations:
- Pre-filter empty pages (blank, separator, cover) — saves 15-25%
- Use cheaper models for triage (Haiku, Flash, Mistral Small) — route only complex pages to Sonnet/Opus
- Cache by document hash — same PDF processed twice = no extra cost
- Batch with prompt caching (Anthropic, OpenAI) — 50% off on repeated system prompts
Compliance
For regulated workflows (invoices, contracts, healthcare):
- Don't use models that train on your inputs (no-training is the default on OpenAI and Anthropic enterprise tiers; elsewhere you may need to opt out)
- EU data residency: Mistral, Anthropic Frankfurt region, Azure OpenAI EU
- Audit trail: log model + version + prompt hash + input hash + output
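An audit record can hash the prompt and the raw input so the log itself stays free of document content. The shape below is illustrative, not a standard; the field names are assumptions:

```typescript
import { createHash } from "node:crypto";

interface AuditRecord {
  model: string;
  modelVersion: string;
  promptHash: string;  // sha256 of the prompt text
  inputHash: string;   // sha256 of the raw PDF bytes
  output: unknown;
  timestamp: string;
}

// Build one immutable record per extraction call.
function auditRecord(
  model: string,
  modelVersion: string,
  prompt: string,
  input: Buffer,
  output: unknown
): AuditRecord {
  const sha = (b: string | Buffer) =>
    createHash("sha256").update(b).digest("hex");
  return {
    model,
    modelVersion,
    promptHash: sha(prompt),
    inputHash: sha(input),
    output,
    timestamp: new Date().toISOString(),
  };
}
```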
When NOT to use LLMs
- Standardized forms with strict layout (W2, 1099) — Document AI templates are cheaper and more reliable
- Tables only (price lists, schedules) — Camelot/Tabula are deterministic
- High-volume, low-margin (>1M pages/month, <$0.001/page budget) — train a custom model
What we built
We needed this for our portfolio's contract review workflow, so we wrapped the pattern above into an API: send a PDF, get JSON back, with schema enforcement, confidence scoring, and EU data residency. There's a free tier (100 pages/month, no card required) at parseflow.dev. Same pattern, packaged.
If you build it yourself, the patterns above are the difference between a 60% accurate prototype and a 95%+ accurate production system. The cost is mostly API tokens — engineering time is one weekend.
Antonio Altomonte builds developer APIs at DevToolsmith. PDF to JSON API: parseflow.dev.
This article was originally published by DEV Community and written by DevToolsmith.