The Stack That Seemed Like a Good Idea
Every developer who processes content has built some version of the same stack. Puppeteer for PDF rendering and screenshots. Sharp or ImageMagick for image transformation. Tesseract for OCR. Maybe wkhtmltopdf or LibreOffice thrown in for good measure.
Each tool solves a real problem. Each tool works in isolation. And each tool becomes a maintenance liability the moment you deploy it to production and start handling real-world input.
The pain isn't any single tool. It's the combination. It's the Dockerfile that takes eight minutes to build because it needs Chromium, libvips, Leptonica, and Tesseract with language packs. It's the glue code that converts Tesseract's output into something Sharp can process. It's the 3 AM page because Puppeteer leaked memory again and your Kubernetes pod hit its limit.
If you've shipped this stack, you know exactly what I'm talking about. If you're about to build it, this post might save you months.
Puppeteer: The Headless Browser That Isn't Headless Enough
Puppeteer is the default choice for generating PDFs and screenshots from HTML. It launches a real Chromium instance, navigates to your content, and captures the output. Simple concept. Nightmarish in production.
Memory leaks are structural, not incidental. Every browser.newPage() allocates memory. If your code throws before page.close(), that memory never comes back. Process 500 documents and your Node.js process is sitting at 2 GB of RAM. The fix is always the same — restart the process periodically, add aggressive timeouts, wrap everything in try/finally blocks. You're not writing a PDF generator anymore. You're babysitting a browser.
Chrome dependency hell is real. Puppeteer bundles a specific Chromium version. That Chromium version needs specific system libraries — libx11, libnss, libatk, libcups, libdrm, libgbm, and dozens more. Miss one and you get a cryptic error: error while loading shared libraries: libnss3.so. Your Dockerfile grows to handle every library, every architecture.
Font rendering is environment-dependent. The PDF you generate on macOS looks different from the one your Linux server produces. Missing fonts fall back to system defaults. CJK text renders as boxes. Custom web fonts need explicit waitForSelector calls or they load as Times New Roman.
Timeouts and hanging pages. A page that loads fine in your browser can hang indefinitely in Puppeteer. An external stylesheet that takes 30 seconds. A JavaScript error that prevents the load event. A redirect loop. Each failure mode needs its own timeout configuration, and the default behavior is to wait forever.
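Because the default is to wait forever, every Puppeteer call in production ends up wrapped in some variant of the same guard. A minimal sketch of that pattern — a generic race against a timer, not tied to Puppeteer specifically:

```javascript
// Race an operation against a timer so a hanging page can never stall
// the worker indefinitely. Generic pattern sketch; the label is only
// used to make the timeout error readable in logs.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Clear the timer in both outcomes so the process can exit cleanly.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Wrap `page.goto`, `page.setContent`, and `page.pdf` individually — a single outer timeout hides which step actually hung.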
Here's what a typical Puppeteer PDF generation looks like, with the error handling you actually need:
const puppeteer = require("puppeteer");

async function generatePdf(html) {
  let browser;
  try {
    browser = await puppeteer.launch({
      headless: "new",
      args: [
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--single-process",
      ],
    });
    const page = await browser.newPage();
    await page.setContent(html, {
      waitUntil: "networkidle0",
      timeout: 30000,
    });
    const pdfBuffer = await page.pdf({ format: "A4" });
    await page.close();
    return pdfBuffer;
  } catch (error) {
    console.error("PDF generation failed:", error);
    throw error;
  } finally {
    if (browser) {
      await browser.close();
    }
  }
}
Eleven lines of configuration and error handling before you even get to the PDF. And this is the simplified version — production code adds process pooling, memory monitoring, and restart logic on top.
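The "restart logic" mentioned above usually looks something like this: recycle the browser after a fixed number of jobs so leaked memory gets reclaimed. This is a hypothetical sketch — the launch function is injected, so it works with any browser-like object that exposes `close()`:

```javascript
// Sketch of browser restart logic: throw the instance away after maxJobs
// uses. `launch` is an injected async factory (e.g. () => puppeteer.launch()),
// so nothing here depends on Puppeteer itself. Hypothetical helper.
class BrowserRecycler {
  constructor(launch, maxJobs = 50) {
    this.launch = launch;   // async () => browser
    this.maxJobs = maxJobs; // jobs before a forced restart
    this.browser = null;
    this.jobs = 0;
  }

  async acquire() {
    if (this.browser && this.jobs >= this.maxJobs) {
      await this.browser.close(); // discard the leaky instance
      this.browser = null;
      this.jobs = 0;
    }
    if (!this.browser) this.browser = await this.launch();
    this.jobs += 1;
    return this.browser;
  }
}
```

Production versions add memory polling and crash detection on top — the counter alone only bounds the leak, it doesn't eliminate it.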
The API equivalent:
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({
  apiKey: "YOUR_API_KEY",
});

const result = await client.generateDocument({
  format: "pdf",
  document: {
    metadata: {
      title: "Monthly Report",
    },
    content: [
      {
        type: "headline",
        level: "h1",
        text: "Monthly Report",
      },
      {
        type: "paragraph",
        markdown: "Generated on 2026-04-15. All figures are preliminary.",
      },
    ],
  },
});
The response:

{
  "success": true,
  "data": {
    "buffer": "JVBERi0xLjQKJ...",
    "mime_type": "application/pdf"
  }
}
No Chromium. No system libraries. No memory leaks. No font rendering inconsistencies. Define the document structure as JSON, get a PDF back. The rendering happens on infrastructure you don't maintain.
Sharp: Fast Until It Isn't
Sharp is the best Node.js image processing library. That's not debatable. It's built on libvips, it's fast, and for simple resize-and-convert pipelines, it works well.
The problems start at the edges.
CMYK images silently produce wrong colors. Most web images are RGB. But product photography, print materials, and professional assets often arrive in CMYK. Sharp processes them, but the color conversion isn't always correct. A deep red becomes orange. A rich black becomes dark gray. You don't notice until a customer complains about their product photos looking wrong.
HEIF/HEIC support requires a specific libvips build. iPhones shoot in HEIF by default. If your users upload photos from their phone, you need HEIF support. But the precompiled libvips that ships with Sharp doesn't include it. You need to compile libvips from source with the HEIF plugin, which means adding build dependencies to your Docker image and managing the compilation.
Memory spikes on large images. Sharp is fast because it streams pixels through a pipeline. But some operations — upscaling, certain rotation angles, compositing with transparency — require buffering the full image in memory. A 50 MP camera image at full resolution can consume over a gigabyte of RAM during processing.
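One cheap defense is to estimate the decoded size from the image dimensions before running expensive operations, and downscale first when it's too large. A rough sketch — the threshold is an arbitrary assumption, not a libvips limit:

```javascript
// Estimate how much RAM a fully decoded image will need: width x height x
// bytes-per-pixel. Sharp reports dimensions via metadata() without decoding
// the pixels, so this check is cheap. The 256 MB limit is arbitrary.
function decodedSizeMb(width, height, channels = 4) {
  return (width * height * channels) / (1024 * 1024);
}

function needsPreScale(width, height, limitMb = 256) {
  return decodedSizeMb(width, height) > limitMb;
}
```

When the guard trips, do a streaming resize down to a working resolution first, then apply the expensive operations to the smaller intermediate.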
Platform-specific build failures. Sharp is a native addon. npm install compiles C++ code. That works fine on your MacBook. It fails on Alpine Linux because the musl libc is missing a function. It fails on ARM because the prebuilt binary doesn't exist for your exact architecture and Node version combination.
A typical Sharp pipeline for processing user-uploaded product images:
const sharp = require("sharp");

async function processProductImage(inputBuffer) {
  // Handle CMYK conversion explicitly
  const metadata = await sharp(inputBuffer).metadata();

  let pipeline = sharp(inputBuffer);
  if (metadata.space === "cmyk") {
    pipeline = pipeline.toColorspace("srgb");
  }

  return pipeline
    .resize(800, 600, { fit: "cover", position: "attention" })
    .sharpen({ sigma: 0.5 })
    .webp({ quality: 85 })
    .toBuffer();
}
This handles the CMYK case. It doesn't handle HEIF input, animated GIFs, images with embedded ICC profiles that conflict with the color space conversion, or the memory spike when position: "attention" loads the full image to detect the focal point.
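If your Sharp build lacks HEIF support, the least you can do is detect HEIF uploads up front and fail with a clear error instead of a cryptic decode failure. HEIF files are ISO BMFF containers, so a magic-byte check is enough for a first-pass sniff — a sketch, not a full format validator:

```javascript
// Detect HEIF/HEIC uploads before they reach Sharp. In an ISO BMFF
// container, bytes 4-7 spell "ftyp" and bytes 8-11 carry the major brand
// (e.g. "heic", "mif1"). This is a quick sniff, not exhaustive: other
// brands exist, so treat a false negative as possible.
function isHeif(buffer) {
  if (buffer.length < 12) return false;
  if (buffer.toString("ascii", 4, 8) !== "ftyp") return false;
  const brand = buffer.toString("ascii", 8, 12);
  return ["heic", "heix", "hevc", "heif", "mif1", "msf1"].includes(brand);
}
```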
The API equivalent:
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({
  apiKey: "YOUR_API_KEY",
});

const result = await client.transform({
  file: {
    type: "url",
    name: "product.jpg",
    url: "https://cdn.example.com/images/product.jpg",
  },
  operations: [
    {
      type: "smart_crop",
      width_in_px: 800,
      height_in_px: 600,
    },
    {
      type: "sharpen",
      sigma: 0.5,
    },
    {
      type: "convert",
      format: "webp",
      quality: 85,
    },
  ],
});
The response:

{
  "success": true,
  "data": {
    "buffer": "iVBORw0KGgoAAAANSUhEUg...",
    "mime_type": "image/webp"
  }
}
CMYK, HEIF, ICC profiles — all handled on the server side. The smart_crop operation uses AI object detection to find the subject, which is what you actually want when cropping product photos. No attention-based heuristics that sometimes focus on the wrong thing.
Tesseract: Good Enough Until It Matters
Tesseract is the default open-source OCR engine. Its development began at HP in 1985. It works well on clean, high-resolution scans with standard fonts. Production documents are none of those things.
Accuracy drops on real-world documents. Scanned receipts with thermal printer smudging. Invoices photographed at an angle on a phone. Forms with handwritten annotations next to printed text. Multi-column layouts where Tesseract reads across columns instead of down them. These are normal documents in any business workflow, and Tesseract struggles with all of them.
No structured output. Tesseract gives you raw text. If you need to extract an invoice number, a total amount, and a date from a PDF, you get a wall of text and it's your job to parse it. That means regex, heuristics, and fragile position-based extraction that breaks every time a vendor changes their invoice template.
Language pack management. Tesseract needs trained data files for each language. A German invoice with English product descriptions needs both language packs. Each pack is 15-40 MB. Your Docker image grows, your deployment slows, and you need to know in advance which languages your documents will contain.
Pre-processing requirements. To get decent results from Tesseract, you need to pre-process images: deskew, denoise, binarize, adjust contrast. That means adding another image processing library to your stack — usually ImageMagick or Sharp — just to prepare input for OCR. The pre-processing pipeline becomes its own maintenance burden.
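To make the pre-processing concrete, here is the simplest of those steps — global binarization — sketched on a flat grayscale pixel array. Real pipelines use adaptive thresholding (e.g. Otsu's method) plus deskew and denoise; this fixed-threshold version only illustrates the shape of the work:

```javascript
// Minimal binarization sketch: map every grayscale pixel (0-255) to pure
// black or white around a fixed threshold. Real OCR pre-processing picks
// the threshold adaptively per image; 128 here is an arbitrary default.
function binarize(gray, threshold = 128) {
  return Uint8Array.from(gray, (v) => (v >= threshold ? 255 : 0));
}
```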
A typical Tesseract-based extraction pipeline:
const Tesseract = require("tesseract.js");

async function extractInvoiceData(imageBuffer) {
  const {
    data: { text },
  } = await Tesseract.recognize(imageBuffer, "eng+deu");

  // Now parse the raw text with regex
  const invoiceNumber = text.match(/Invoice\s*#?\s*(\w+)/i)?.[1];
  const totalMatch = text.match(/Total[:\s]*[\$€]?\s*([\d,.]+)/i)?.[1];
  const dateMatch = text.match(/Date[:\s]*(\d{1,2}[\/.-]\d{1,2}[\/.-]\d{2,4})/i)?.[1];

  return {
    invoiceNumber: invoiceNumber || null,
    // Strip every thousands separator, not just the first one
    total: totalMatch ? parseFloat(totalMatch.replace(/,/g, "")) : null,
    date: dateMatch || null,
  };
}
Three regex patterns. Three failure modes. And this only works on English invoices with "Invoice", "Total", and "Date" as labels. A German invoice with "Rechnungsnummer", "Gesamtbetrag", and "Datum" needs entirely different patterns. A French invoice needs a third set. You're not building an extraction system — you're building a regex collection that grows with every new vendor and every new language.
The API equivalent:
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({
  apiKey: "YOUR_API_KEY",
});

const result = await client.extract({
  files: [
    {
      type: "url",
      name: "invoice.pdf",
      url: "https://cdn.example.com/docs/invoice.pdf",
    },
  ],
  schema: {
    fields: [
      {
        type: "TEXT",
        name: "invoice_number",
        description: "The invoice or document number",
      },
      {
        type: "CURRENCY_AMOUNT",
        name: "total_amount",
        description: "The total amount due",
      },
      {
        type: "DATE",
        name: "invoice_date",
        description: "The date the invoice was issued",
      },
    ],
  },
});
The response:

{
  "success": true,
  "data": {
    "invoice_number": {
      "value": "INV-2026-0847",
      "confidence": 0.97,
      "citations": ["Invoice #INV-2026-0847"],
      "source": "invoice.pdf",
      "type": "TEXT"
    },
    "total_amount": {
      "value": 1250.00,
      "confidence": 0.95,
      "citations": ["Total: €1,250.00"],
      "source": "invoice.pdf",
      "type": "CURRENCY_AMOUNT"
    },
    "invoice_date": {
      "value": "2026-04-10",
      "confidence": 0.98,
      "citations": ["Date: 10/04/2026"],
      "source": "invoice.pdf",
      "type": "DATE"
    }
  }
}
No regex. No language-specific patterns. You describe what you want — a text field, a currency amount, a date — and the API returns structured data with confidence scores. The same schema works on English invoices, German Rechnungen, and French factures. When confidence is low, you know to flag it for review instead of silently accepting garbage data.
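The flag-for-review step is a few lines of routing on your side. A sketch, assuming the field shape shown in the response above; the 0.9 threshold is an arbitrary choice, not an API default:

```javascript
// Route extracted fields by confidence: accept high-confidence values,
// send the rest to human review. Field objects are assumed to carry
// { value, confidence } as in the response above. Threshold is arbitrary.
function triage(fields, threshold = 0.9) {
  const accepted = {};
  const review = {};
  for (const [name, field] of Object.entries(fields)) {
    (field.confidence >= threshold ? accepted : review)[name] = field.value;
  }
  return { accepted, review };
}
```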
The Docker Problem
Each of these tools has its own system dependencies. Puppeteer needs Chromium and its library ecosystem. Sharp needs libvips (and optionally libheif, libimagequant, and more). Tesseract needs Leptonica, trained data files, and image format libraries.
Combine them in one Dockerfile and you get something like this:
FROM node:20-slim

RUN apt-get update && apt-get install -y \
    # Puppeteer/Chromium dependencies
    chromium \
    fonts-liberation \
    libappindicator3-1 \
    libasound2 \
    libatk-bridge2.0-0 \
    libatk1.0-0 \
    libcups2 \
    libdbus-1-3 \
    libdrm2 \
    libgbm1 \
    libnspr4 \
    libnss3 \
    libx11-xcb1 \
    libxcomposite1 \
    libxdamage1 \
    libxrandr2 \
    xdg-utils \
    # Sharp/libvips dependencies
    libvips-dev \
    # Tesseract dependencies
    tesseract-ocr \
    tesseract-ocr-eng \
    tesseract-ocr-deu \
    tesseract-ocr-fra \
    libleptonica-dev \
    && rm -rf /var/lib/apt/lists/*

ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
More than twenty explicit system packages, each dragging in its own transitive dependencies. A Docker image that's over a gigabyte. A build that takes minutes. And every package is a potential security vulnerability that needs patching.
ARM support? Multi-arch builds that triple your CI time. Alpine Linux? Half these packages don't exist — you need to find the musl equivalents. Distroless? Not an option — too many system dependencies.
The Iteration Layer equivalent is an HTTP client. Your Dockerfile goes back to the slim base image you started with. No system packages, no native addons, no build dependencies. Your image stays small, your builds stay fast, and your attack surface shrinks to an HTTPS endpoint.
Where One API Doesn't Win
This is the honest part.
If you need pixel-perfect rendering of arbitrary web pages, Puppeteer is still the right tool. Iteration Layer's Document Generation API works from structured content blocks — headlines, paragraphs, tables, images. It doesn't render arbitrary HTML with external CSS frameworks and JavaScript.
If you're processing millions of images per day at the lowest possible cost, self-hosted Sharp on dedicated hardware will be cheaper per image. The API's per-request pricing is competitive for moderate volumes, but at massive scale, the economics of self-hosted processing can win.
If you need to train a custom OCR model for highly specialized documents — handwritten medical forms, historical manuscripts, unusual scripts — Tesseract's trainable architecture lets you build domain-specific models. The extraction API uses general-purpose models that handle most business documents well but can't be fine-tuned for niche use cases.
The API wins when you're processing varied content at moderate scale and don't want infrastructure to maintain. It wins when your Docker image is already too large, when your on-call rotation includes "Puppeteer crashed again," and when you're spending more time on glue code than on your actual product.
The Composability Argument
The individual tool replacements are useful on their own. But the real value shows up when you chain operations across APIs — something the DIY stack makes painful and the API makes trivial.
Consider a real workflow: a supplier sends you an invoice PDF. You need to extract the data, generate a summary report, and create a thumbnail of the first page for your dashboard.
With the DIY stack, that's three tools, three sets of error handling, and custom code to convert Tesseract's text output into a format Puppeteer can render as a PDF, then feed that PDF to Sharp for thumbnail generation. Each handoff point is a potential failure.
With the API, you make three calls with the same API key, the same authentication, and the same error format. The extraction returns structured JSON. You feed that JSON into document generation. You feed the generated document into image transformation for a thumbnail. Same credit pool, same response structure, no format conversion between steps.
That composability is the fundamental difference. The DIY stack is a collection of disconnected tools that you wire together. The API is a set of operations designed to chain together. One approach scales with glue code. The other scales with API calls.
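The chain described above can be sketched as one function. The client is injected so the flow is testable; the method and field names follow the examples earlier in this post, and the `type: "buffer"` file input is an assumption — the real SDK's handoff format may differ:

```javascript
// Extract -> generate -> thumbnail, as described above. `client` is
// injected (the real SDK client, or a stub in tests). Method names mirror
// this post's examples; the buffer handoff format is an assumption.
async function invoiceToDashboard(client, invoiceUrl) {
  // 1. Structured extraction from the supplier's invoice PDF.
  const extracted = await client.extract({
    files: [{ type: "url", name: "invoice.pdf", url: invoiceUrl }],
  });

  // 2. Feed the extracted JSON straight into document generation.
  const report = await client.generateDocument({
    format: "pdf",
    document: {
      content: [
        {
          type: "paragraph",
          markdown: `Invoice ${extracted.data.invoice_number.value}`,
        },
      ],
    },
  });

  // 3. Feed the generated PDF into image transformation for a thumbnail.
  const thumbnail = await client.transform({
    file: { type: "buffer", name: "report.pdf", buffer: report.data.buffer },
    operations: [{ type: "smart_crop", width_in_px: 200, height_in_px: 200 }],
  });

  return { extracted, report, thumbnail };
}
```

Note there is no format conversion between the steps — each call's output is the next call's input.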
Migration: Start with the Biggest Pain
You don't have to replace everything at once. Start with the tool that causes the most operational pain.
If Puppeteer is your biggest headache — memory leaks, Chrome crashes, flaky rendering — start by moving PDF generation to the Document Generation API. Your Dockerfile loses Chromium and its dependencies. Your memory usage becomes predictable.
If Sharp edge cases are eating your time — CMYK color shifts, HEIF support, memory spikes on large images — move image processing to the Iteration Layer's Image Transformation API. Your native addon compilation problems disappear.
If Tesseract accuracy is holding you back — wrong extractions, missing fields, language limitations — move document parsing to the Document Extraction API. Your regex collection retires.
If the Dockerfile is the problem — build times, image size, security patches — moving any of the three tools to API calls immediately reduces your dependency footprint.
Each migration is independent. Replace one tool, leave the others. Replace two, leave one. Replace all three when you're ready. The API doesn't require an all-or-nothing commitment.
Get Started
Pick your most painful tool. Check the docs for the API that replaces it: Document Generation for Puppeteer workflows, Image Transformation for Sharp pipelines, Document Extraction for Tesseract parsing.
Sign up for a free account — no credit card required. Take your most complex pipeline, the one that breaks most often, and see what it looks like as an API call. The TypeScript, Python, and Go SDKs handle authentication and response parsing, so the integration takes minutes.
One API key. One credit pool. No system packages. No glue code.
This article was originally published by DEV Community and written by Iteration Layer.