Process PDFs on Vercel: Reliable Serverless Guide (2026)
Run pdf-parse v2 + @napi-rs/canvas on Vercel: set Next.js externals, import worker first, pass CanvasFactory.

I was building a document management system in Next.js when the PDF processing pipeline that worked perfectly on my machine started throwing cryptic errors on Vercel. `Cannot find module 'pdfjs-dist/legacy/build/pdf.mjs'`. Then `DOMMatrix is not defined`. Then silence — jobs just hanging with no output.
If you're trying to extract text from PDFs or render pages to images inside a Vercel serverless function, you've probably hit the same wall. The problem isn't your code — it's that most PDF libraries assume a long-running Node.js process with full filesystem access, and Vercel's bundler aggressively tree-shakes and isolates modules in ways that break those assumptions.
This guide walks through the approach I landed on after trying several alternatives: using pdf-parse v2, which was built specifically to work in serverless environments including Vercel, Netlify, and AWS Lambda. By the end, you'll have text extraction, content detection, and page-to-image rendering working reliably in production.
Why Most PDF Libraries Break on Vercel
Before jumping to the solution, it helps to understand what goes wrong. Vercel uses Turbopack (or Webpack) to bundle your server code into self-contained serverless functions. This bundling process does static analysis on your imports to determine what to include.
Libraries like pdfjs-dist use patterns that defeat this analysis. A common workaround you'll find in blog posts looks something like this:
```typescript
// Inside some loader function — note the missing imports most posts omit:
import { createRequire } from "node:module";
import { pathToFileURL } from "node:url";

const require = createRequire(`${process.cwd()}/`);
const modPath = require.resolve("pdfjs-dist/legacy/build/pdf.mjs");
const modUrl = pathToFileURL(modPath).href;
// new Function hides the import() from the bundler's static analysis
const runtimeImport = new Function("u", "return import(u);");
return runtimeImport(modUrl);
```
This dynamically resolves the module path at runtime and uses new Function to bypass static import analysis. It works locally because Node.js has full access to node_modules. On Vercel, those files simply aren't in the deployment bundle — the bundler never saw a static import, so it never included them.
The pdf-to-img package (which wraps pdfjs-dist for rendering) has the same fundamental issue. And even if you manage to get pdfjs-dist loaded, you'll hit the next problem: it expects browser globals like DOMMatrix and ImageData to exist, which they don't in a bare Node.js runtime.
Setting Up pdf-parse v2
pdf-parse v2 is a rewrite of the original pdf-parse package. It wraps pdfjs-dist internally but handles all the worker setup, canvas polyfilling, and module resolution that breaks in serverless environments. It ships both CJS and ESM builds and runs on Node.js 20+.
Install pdf-parse and @napi-rs/canvas — the latter provides the native DOMMatrix, ImageData, and canvas implementations that pdfjs-dist needs under the hood:
```bash
pnpm add pdf-parse @napi-rs/canvas
```
If you were previously using pdfjs-dist or pdf-to-img directly, remove them:
```bash
pnpm remove pdfjs-dist pdf-to-img
```
Configuring Next.js for Serverless
This is the step most people miss. Both pdf-parse and @napi-rs/canvas need to be excluded from Next.js bundling so that Node.js resolves them at runtime from node_modules. Without this, the bundler will try to inline native binaries and pdfjs-dist worker files, which fails.
```typescript
// next.config.ts
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  serverExternalPackages: ["pdf-parse", "@napi-rs/canvas"],
};

export default nextConfig;
```
The serverExternalPackages array tells Next.js to treat these packages as external — they'll be loaded from node_modules at runtime instead of being bundled into the serverless function. Vercel includes node_modules in the deployment when packages are listed here.
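If you're still on Next.js 14 or earlier, the same setting exists under a different key — it was named experimental.serverComponentsExternalPackages before being stabilized as serverExternalPackages in Next.js 15. A version of the config for those older releases:

```typescript
// next.config.ts — Next.js 13/14 variant of the same setting
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  experimental: {
    serverComponentsExternalPackages: ["pdf-parse", "@napi-rs/canvas"],
  },
};

export default nextConfig;
```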
The Worker Import Gotcha
This is the gotcha that cost me the most debugging time. When you import pdf-parse, it internally loads pdfjs-dist, which immediately tries to access DOMMatrix at the module level — not when you call a function, but the moment the module evaluates. If @napi-rs/canvas hasn't been set up yet, you get:
```
ReferenceError: DOMMatrix is not defined
```
The fix is to import pdf-parse/worker before importing pdf-parse itself. The worker module sets up the canvas factory and polyfills the globals that pdfjs-dist expects. Import order matters here — this is one of those rare cases where the sequence of your import statements has a real effect:
```typescript
// File: src/lib/pdf/index.ts
// This MUST come before the pdf-parse import
import { CanvasFactory } from "pdf-parse/worker";
import { PDFParse } from "pdf-parse";
```
Then pass CanvasFactory to every PDFParse constructor call:
```typescript
const parser = new PDFParse({ data: new Uint8Array(pdfBuffer), CanvasFactory });
```
Without this, text extraction might work on some Node.js versions (where pdfjs-dist gracefully degrades) but screenshot rendering will always fail. Including CanvasFactory explicitly makes the behavior consistent across environments.
Extracting Text from PDFs
With the setup in place, text extraction is straightforward. Create a parser instance, call getText(), and destroy the parser when done. The try/finally pattern ensures you don't leak memory in a serverless function where the process may handle multiple requests:
```typescript
// File: src/lib/pdf/index.ts
export async function extractPdfText(pdfBuffer: Buffer): Promise<string> {
  const parser = new PDFParse({ data: new Uint8Array(pdfBuffer), CanvasFactory });
  try {
    const result = await parser.getText({ pageJoiner: "\n\n" });
    return result.text.trim();
  } finally {
    await parser.destroy();
  }
}
```
The pageJoiner option controls how text from different pages is concatenated. Using "\n\n" gives you a clean double newline between pages, which works well for downstream processing like sending the text to an LLM.
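Downstream consumers like an LLM prompt or a search index often benefit from one more normalization pass over the joined text. A minimal sketch — normalizePdfText is a hypothetical helper, not part of pdf-parse — that trims per-line whitespace and collapses runs of blank lines back down to the double-newline page separator:

```typescript
// Hypothetical post-processing helper (not part of pdf-parse):
// trims whitespace on each line, then collapses runs of 3+ newlines
// down to the "\n\n" page separator used by extractPdfText.
export function normalizePdfText(raw: string): string {
  return raw
    .split("\n")
    .map((line) => line.trim())
    .join("\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```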
Detecting Whether a PDF Contains Text or Scanned Images
Not all PDFs have selectable text. Scanned documents are essentially images wrapped in a PDF container, and calling getText() on them returns little to no content. If your pipeline needs to handle both — extracting text directly when available, or running OCR on scanned pages — you need a way to detect which kind of PDF you're dealing with.
The approach is to sample the first few pages, extract text from each, and count characters. If most pages have meaningful text content, it's a text PDF. If most pages are nearly empty, it's scanned images. Anything in between is mixed:
```typescript
// File: src/lib/pdf/index.ts
export type PdfContentKind = "text" | "image" | "mixed";

export interface DetectPdfContentKindOptions {
  samplePages?: number;
  minCharsPerPage?: number;
}

export async function detectPdfContentKind(
  pdfBuffer: Buffer,
  options?: DetectPdfContentKindOptions,
): Promise<PdfContentKind> {
  const samplePages = Math.max(1, options?.samplePages ?? 5);
  const minCharsPerPage = Math.max(1, options?.minCharsPerPage ?? 30);
  const parser = new PDFParse({ data: new Uint8Array(pdfBuffer), CanvasFactory });
  try {
    const info = await parser.getInfo();
    const pagesToCheck = Math.max(1, Math.min(info.total, samplePages));
    const result = await parser.getText({ first: pagesToCheck, pageJoiner: "" });
    let textPages = 0;
    for (const page of result.pages) {
      if (page.text.trim().length >= minCharsPerPage) {
        textPages++;
      }
    }
    const ratio = textPages / pagesToCheck;
    if (ratio >= 0.8) return "text";
    if (ratio <= 0.2) return "image";
    return "mixed";
  } finally {
    await parser.destroy();
  }
}
```
The first option passed to getText() limits extraction to only the pages you need, which matters for performance on large documents. There's no reason to parse 200 pages when the first 5 tell you what kind of PDF it is.
The threshold of 30 characters per page works well in practice. Even a page with just a header and page number will typically have more than 30 characters of selectable text, while a scanned image page returns zero or a handful of OCR artifacts.
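Because the classification step is pure arithmetic, it can also be pulled out and unit-tested without any PDF fixtures. A sketch of that step in isolation — classifyPages is a hypothetical helper mirroring the thresholds above, taking per-page character counts from the sampled pages:

```typescript
type PdfContentKind = "text" | "image" | "mixed";

// Hypothetical helper isolating the ratio logic from detectPdfContentKind:
// a page counts as "text" if its character count meets the minimum.
function classifyPages(
  charCounts: number[],
  minCharsPerPage = 30,
): PdfContentKind {
  const textPages = charCounts.filter((n) => n >= minCharsPerPage).length;
  const ratio = textPages / charCounts.length;
  if (ratio >= 0.8) return "text";
  if (ratio <= 0.2) return "image";
  return "mixed";
}
```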
Rendering PDF Pages to Images
When you detect a scanned PDF, you need to render each page to an image for OCR processing. pdf-parse's getScreenshot() method handles this, returning PNG buffers for each page:
```typescript
// File: src/lib/pdf/index.ts
export interface PdfPage {
  pageIndex: number;
  buffer: Buffer;
  mimeType: "image/png";
}

export async function pdfToImages(
  pdfBuffer: Buffer,
  options?: { scale?: number },
): Promise<PdfPage[]> {
  const scale = options?.scale ?? 2;
  const parser = new PDFParse({ data: new Uint8Array(pdfBuffer), CanvasFactory });
  try {
    const result = await parser.getScreenshot({
      scale,
      imageBuffer: true,
      imageDataUrl: false,
    });
    return result.pages.map((page, index) => ({
      pageIndex: index,
      buffer: Buffer.from(page.data),
      mimeType: "image/png" as const,
    }));
  } finally {
    await parser.destroy();
  }
}
```
A scale of 2 produces images at twice the PDF's natural resolution, which gives OCR engines enough detail to work with. You can go higher for better accuracy at the cost of larger buffers and more memory usage — something to be mindful of in serverless where memory limits are real.
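To sanity-check a scale value against a serverless memory limit, you can estimate the uncompressed bitmap each page produces during rendering. A back-of-the-envelope sketch — estimatePageBitmapBytes is a hypothetical helper, and it assumes page dimensions given in PDF points (a US Letter page is 612 × 792 points):

```typescript
// Hypothetical helper: uncompressed RGBA bitmap size for one page
// rendered at the given scale. The PNG written out will be smaller,
// but this is roughly what the canvas holds in memory while rendering.
function estimatePageBitmapBytes(
  widthPts: number,
  heightPts: number,
  scale: number,
): number {
  const widthPx = Math.ceil(widthPts * scale);
  const heightPx = Math.ceil(heightPts * scale);
  return widthPx * heightPx * 4; // 4 bytes per RGBA pixel
}
```

At scale 2, a US Letter page works out to roughly 7.8 MB of raw pixels, so a long scanned document at a high scale can exhaust a function's memory if every page buffer is held at once — process or upload pages incrementally when documents get large.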
Setting imageBuffer: true and imageDataUrl: false ensures you get raw PNG buffers rather than base64 data URLs. Buffers are what you want for uploading to storage or sending to an OCR service. This is also where CanvasFactory from the worker import becomes critical — without it, the rendering simply cannot work because there's no canvas implementation to draw on.
Summary of Common Gotchas
Three things will trip you up when processing PDFs on Vercel, and they're all configuration issues rather than code problems.
The first is forgetting serverExternalPackages. Without it, Next.js tries to bundle pdf-parse and its native dependencies into the serverless function, which fails because @napi-rs/canvas includes platform-specific native binaries that can't be inlined.
The second is import ordering. The pdf-parse/worker import must evaluate before pdf-parse itself. In practice this means putting it on the line above. If you see DOMMatrix is not defined at runtime, this is almost certainly the cause.
The third is not passing CanvasFactory to the constructor. Text extraction might appear to work without it on some Node versions, but rendering will fail silently or throw. Always pass it explicitly to every PDFParse instance.
Once these three pieces are in place, PDF processing on Vercel works the same as it does locally. The exported API surface stays clean — consumers of your PDF module don't need to know anything about worker setup or canvas factories. They just call extractPdfText() or pdfToImages() and get results back.
Let me know in the comments if you have questions, and subscribe for more practical development guides.
Thanks, Matija


