Running PDF.js in Self-Hosted Convex
I wanted to build a RAG pipeline over a pile of PDF manuals, and I was already on Convex. The architecture looked boring on paper: upload the PDF to object storage, extract text in a "use node" action, chunk it, and feed each chunk to the Convex RAG component via rag.add(...). The RAG component would handle embedding and vector storage. Every piece was a known good except one: the text extraction itself. I reached for pdfjs-dist because it’s Mozilla’s own reference implementation. This should have been a one-afternoon job.
It was not a one-afternoon job.
TL;DR: to run PDF.js in self-hosted Convex actions, skip to the code in Integration.
The cascade
I hit these errors in order. Each one felt like progress. Each one was another missing built-in.
Setting up fake worker failed: "Cannot find module
'.../modules/_deps/node/pdf.worker.mjs' imported from
.../modules/_deps/node/DYC5ZLU7.js"
PDF.js runs its parsing inside a worker. When no real worker is available it falls back to a “fake worker” that loads the worker module via await import(this.workerSrc), a dynamic import whose target is computed at runtime from a property access. Convex’s bundler is esbuild, and esbuild can only follow literal-string imports; anything computed stays a runtime call. The Convex docs call this out. So pdf.worker.mjs was never emitted alongside the main bundle, and at runtime the import resolved to a sibling file that didn’t exist.
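To make the literal-versus-computed distinction concrete, here is a minimal sketch. It uses a Node built-in as the import target so it runs anywhere; the function names loadLiteral and loadComputed are illustrative and not part of PDF.js or Convex.

```typescript
// Literal specifier: the target is visible in the source text, so a bundler
// such as esbuild can resolve it at build time and include it in the output.
export const loadLiteral = (): Promise<unknown> => import("node:path");

// Computed specifier: the target is only known once the code runs, so the
// bundler leaves the call untouched and resolution happens at runtime.
// PDF.js's import(this.workerSrc) falls into this category.
export const loadComputed = (specifier: string): Promise<unknown> =>
  import(specifier);
```

Both calls behave the same when the target exists on disk. The difference only shows up after bundling, when the computed target was never copied into the output.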
I spent a while chasing this before finding the escape hatch described in Side-stepping the fake worker. Once the worker was unblocked, the errors kept coming:
Uncaught TypeError: Promise.withResolvers is not a function
This was the first real hint that the runtime was an older V8 than I’d assumed. Promise.withResolvers arrived with Node 22 (V8 12.4). I upgraded the self-hosted Convex backend image, retried, and got:
Uncaught unhandledRejection: Promise.try is not a function
I was inside PDF.js’s own message_handler.js. I hand-polyfilled Promise.try with a small spec-shaped wrapper. Retried:
Uncaught UnknownErrorException: a.toHex is not a function
Uint8Array.prototype.toHex landed in V8 14.1 and is not in Node 22.22.2; you need Node 25. PDF.js was using it in its document-fingerprinting path. I hand-polyfilled that too — a loop over bytes, padStart(2, "0"), the usual.
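For illustration only, shims of this kind might look like the sketch below. The post does not show the originals, so the names promiseTry and toHex and their exact shapes are assumptions. They are written as standalone functions rather than patches on the built-ins.

```typescript
// Promise.try-like helper (hypothetical name): runs fn and returns a promise.
// A synchronous throw inside fn becomes a rejection instead of escaping.
export function promiseTry<T>(fn: () => T | PromiseLike<T>): Promise<T> {
  return new Promise<T>((resolve) => resolve(fn()));
}

// Uint8Array.prototype.toHex-like helper (hypothetical name): emits two
// lowercase hex digits per byte, in order.
export function toHex(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += bytes[i].toString(16).padStart(2, "0");
  }
  return out;
}
```

Installing them would mean assigning to Promise.try and Uint8Array.prototype.toHex only when those are missing.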
None of that survives in the final setup: legacy/build already ships the polyfills. I am keeping this sequence in the post because it is the order the failures surfaced, not because I recommend copying the shims.
Somewhere around this point I stopped and asked myself what I was doing.
Side-stepping the fake worker
The fake-worker loader has an escape hatch I’d missed at first read. Before attempting its broken dynamic import, PDFWorker._setupFakeWorkerGlobal checks globalThis.pdfjsWorker.WorkerMessageHandler. If that global is set, it uses the preloaded handler directly and skips await import(this.workerSrc) entirely.
The worker module’s own body ends with globalThis.pdfjsWorker = { WorkerMessageHandler } as a side effect. So if I import the worker module myself — via a literal-string dynamic import that esbuild can statically follow — two things happen in one go:
- esbuild emits the pdf.worker.mjs chunk into the action bundle, so it actually exists at runtime.
- Evaluating that chunk sets globalThis.pdfjsWorker, which trips the short-circuit the next time PDF.js looks for a worker.
One literal-string import does both:
await import("pdfjs-dist/legacy/build/pdf.worker.mjs");
That has to happen before the first call to getDocument, so in the final code I drop it into my deferred PDF.js loader right before importing the main entry point.
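When debugging the ordering, it can help to confirm that the preload actually set the global before getDocument runs. The sketch below checks only the property named above; the helper name hasPreloadedPdfWorker is hypothetical and not a PDF.js API.

```typescript
// Shape of the global that the worker module installs as a side effect.
// Only the presence of WorkerMessageHandler matters for this check.
type PdfJsWorkerGlobal = { WorkerMessageHandler?: unknown };

// Returns true when a preloaded handler is present on the given scope,
// which is the condition under which the fake-worker path skips its
// runtime import. Defaults to the real global object.
export function hasPreloadedPdfWorker(scope: object = globalThis): boolean {
  const worker = (scope as { pdfjsWorker?: PdfJsWorkerGlobal }).pdfjsWorker;
  return Boolean(worker?.WorkerMessageHandler);
}
```

Logging hasPreloadedPdfWorker() right after the literal-string import should print true. If it prints false, the worker chunk was not evaluated.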
Self-hosted Convex bakes Node into the image
This section and the next one, The legacy build, are not in the order I figured things out. I am grouping them because they are the two checks that would have shortened the hunt.
I had .nvmrc pinning Node 22 on my laptop and I’d assumed the node.nodeVersion field in convex.json was doing something. It wasn’t. That field is cloud-only; the self-hosted backend ignores it completely. The Node version is whatever got installed into the backend’s Docker image at docker build time via the .nvmrc in get-convex/convex-backend.
As of April 2026 this change was recent: the Convex backend bumped its .nvmrc from 20.19.5 to 22.22.2 on 2026-04-06 (commit 8ece97b16). The first precompiled image that picked up Node 22 was precompiled-2026-04-06-c6f2ffd. Any backend image you pulled before that, including basically every :latest tag from before April 2026, was running Node 20.19.5, which does not have Promise.withResolvers, Promise.try, or the other modern features PDF.js reaches for. Tags and commits move; treat the dates as context, not a permanent guarantee.
Before touching polyfills, check you’re on precompiled-2026-04-06-c6f2ffd or a newer backend image. Pin the image to an explicit commit SHA (don’t trust :latest), and run docker exec <convex-backend-container> node --version to see what’s actually inside. That one command would have saved me hours.
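The same question can be asked from inside the runtime. The sketch below reports the Node and V8 versions of the executing process and whether the three built-ins from the error cascade are callable. The function names are hypothetical, and nothing here is Convex-specific.

```typescript
// Node and V8 versions of the process that is actually executing the code.
export function runtimeVersions(): { node: string; v8: string } {
  return { node: process.versions.node, v8: process.versions.v8 };
}

// True for each built-in that is callable in the current runtime, whether
// natively or through a polyfill that has already been loaded.
export function runtimeFeatureReport(): Record<string, boolean> {
  const promiseStatics = Promise as unknown as Record<string, unknown>;
  const u8Proto = Uint8Array.prototype as unknown as Record<string, unknown>;
  return {
    "Promise.withResolvers": typeof promiseStatics["withResolvers"] === "function",
    "Promise.try": typeof promiseStatics["try"] === "function",
    "Uint8Array.prototype.toHex": typeof u8Proto["toHex"] === "function",
  };
}
```

Returning both objects from a throwaway "use node" action gives the same answer as docker exec, without needing shell access to the container.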
The legacy build
The pdfjs-dist package ships two entry points: the default build/pdf.mjs, and legacy/build/pdf.mjs. Mozilla’s PDF.js FAQ lists “Node.js 20+” as the target environment for the legacy build. The legacy bundle ships core-js polyfills baked in: Promise.try, Promise.withResolvers, Uint8Array.prototype.toHex, and whatever recent TC39 proposal comes next. I had been hand-writing polyfills for something Mozilla already ships.
I’d dismissed the legacy build early on because of a stale code comment claiming it had top-level await that Convex rejects. It does not — it just uses a webpack chunk layout the default build doesn’t. The comment was from a previous PDF.js major version and I never re-checked.
Integration
The file has to be imported only from a "use node" module because it relies on Node built-ins through core-js.
// convex/lib/pdfParser.ts
import type { getDocument as PdfGetDocument } from "pdfjs-dist";

type PdfJsModule = { getDocument: typeof PdfGetDocument };

let pdfJsModulePromise: Promise<PdfJsModule> | null = null;

const loadPdfJs = async (): Promise<PdfJsModule> => {
  if (!pdfJsModulePromise) {
    installStructuredCloneWrapper();
    pdfJsModulePromise = (async () => {
      // Side-effect import: the worker bundle's body sets
      // globalThis.pdfjsWorker so the main entry short-circuits to it.
      await import("pdfjs-dist/legacy/build/pdf.worker.mjs" as string);
      return (await import("pdfjs-dist/legacy/build/pdf.mjs" as string))
        as PdfJsModule;
    })();
  }
  return pdfJsModulePromise;
};

export const parsePdfBytes = async (bytes: Uint8Array) => {
  const PdfJs = await loadPdfJs();
  const pdf = await PdfJs.getDocument({
    data: bytes,
    useWorkerFetch: false,
    isEvalSupported: false,
  }).promise;
  // ... iterate pages, call page.getTextContent(), normalize, return
};
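The elided page loop in parsePdfBytes could look roughly like the sketch below. It is not the author's implementation: the function name extractPageTexts, the structural types, and the whitespace normalization are assumptions. It relies only on numPages, getPage, and getTextContent from the PDF.js document API.

```typescript
// Minimal structural types for the parts of the PDF.js API used here.
// Text items carry `str`; marked-content items have no `str` and add no text.
type TextContentLike = { items: Array<{ str?: string; hasEOL?: boolean }> };
type PageLike = { getTextContent(): Promise<TextContentLike> };
type DocumentLike = { numPages: number; getPage(n: number): Promise<PageLike> };

// Returns one normalized string per page, in page order.
export async function extractPageTexts(pdf: DocumentLike): Promise<string[]> {
  const pages: string[] = [];
  // PDF.js page numbers are 1-based.
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    const text = content.items
      // Keep line breaks that PDF.js flagged, otherwise separate by a space.
      .map((item) => (item.str ?? "") + (item.hasEOL ? "\n" : " "))
      .join("")
      // Collapse runs of spaces and tabs, but leave newlines alone.
      .replace(/[ \t]+/g, " ")
      .trim();
    pages.push(text);
  }
  return pages;
}
```

Keeping the text per page makes it easy to attach page numbers to chunks later. Joining the array gives the whole document if that is not needed.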
legacy/build bundles core-js, which fills in the Promise and Uint8Array methods PDF.js needs — including the Promise.try and Uint8Array.prototype.toHex polyfills I’d just written by hand. With the legacy build I deleted both and never looked back.
The worker preload has to come first: it populates globalThis.pdfjsWorker so the fake-worker path short-circuits to the preloaded handler instead of trying the broken import(this.workerSrc) fallback. Order matters — the worker bundle must be imported before the main entry, because the main entry resolves its worker handler on first use.
The structuredClone wrapper exists because core-js runs a load-time probe: structuredClone(buffer, { transfer: [buffer] }). Convex’s Node runtime rejects that with “structuredClone with transfer not supported”. That runs while modules are evaluated during convex dev push, not when your action first runs. The deploy fails with InvalidModules before you get a useful stack trace from the action itself.
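The probe can be reproduced in isolation to see how a given runtime behaves. This is a sketch modeled on the call quoted above, not core-js's actual code, which may check more than this; the function name is hypothetical.

```typescript
// Returns true when structuredClone accepts a transfer list and detaches the
// source buffer. Returns false when structuredClone is missing, when the call
// throws, or when the source buffer is left intact.
export function supportsStructuredCloneTransfer(): boolean {
  if (typeof structuredClone !== "function") return false;
  const buffer = new ArrayBuffer(8);
  try {
    const clone = structuredClone(buffer, { transfer: [buffer] });
    // A transferred buffer is detached, so its byteLength drops to 0.
    return buffer.byteLength === 0 && clone.byteLength === 8;
  } catch {
    return false;
  }
}
```

Because this version catches the error, it is safe to call from an action to confirm which behavior a given backend image has.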
The fix is a small wrapper on globalThis.structuredClone that respects the transfer option via ArrayBuffer.prototype.transfer():
const installStructuredCloneWrapper = () => {
  const original = globalThis.structuredClone;
  if (!original) return;
  globalThis.structuredClone = ((value, options) => {
    if (!options || !("transfer" in options)) {
      return original(value);
    }
    const clone = original(value);
    for (const item of options.transfer ?? []) {
      // Detach the source buffer the way the spec would have, by calling
      // ArrayBuffer.prototype.transfer() directly. core-js's probe only
      // checks that the source buffer was detached, so this satisfies it.
      if (item instanceof ArrayBuffer && typeof item.transfer === "function") {
        try { item.transfer(); } catch { /* best effort */ }
      }
    }
    return clone;
  }) as typeof globalThis.structuredClone;
};
The dynamic import of PDF.js has to be deferred until after this wrapper is installed — which is why the whole thing is wrapped in loadPdfJs() instead of static imports at the top of the file. Static imports get hoisted above any imperative code, so the wrapper would never run first.
Once text extraction was working, the rest of the RAG pipeline was a short loop:
for (const chunk of chunks) {
  await rag.add(ctx, {
    namespace,
    key: chunk.chunkKey,
    text: chunk.text,
    metadata: { /* ... */ },
  });
}
The wiring above is PDF.js-specific. What generalizes is the pair of checks in the two closing sections, in this order: the library’s build and bundling story first, then the container’s Node version. Run both on the next library that blows up in an action.
Check the library’s server story first
Scan for a Node, server, or legacy entry before you paste polyfills from Stack Overflow. If the symptom is a missing module path at runtime rather than a missing built-in, treat the Convex bundling / dynamic import angle as a first-class hypothesis alongside “wrong export.”
Check the container’s Node version
On self-hosted Convex, run docker exec <convex-backend-container> node --version as soon as errors smell like V8 version skew. Everything else — including whether convex.json is lying to you about Node — is cheaper after you know what is actually in the container.
If this was useful and you want hands-on help with a Convex or Elixir project, feel free to contact me.