How to Migrate PDF Documents to Sanity CMS

If you've got a backlog of PDFs that need to live in Sanity CMS, you've already discovered the problem: PDFs don't map cleanly to Sanity's document model. You can't just paste text into a Portable Text field and call it done. References need to be references. Dates need to be dates. Body content needs structure — not a blob of text.

The naive approach

Most teams start by opening PDFs and manually copying content into Sanity Studio. This works for five documents. It breaks down for fifty, and it's completely impractical for five hundred.

Manual extraction also introduces inconsistency. Different operators interpret ambiguous fields in different ways. Date formats vary. Category values drift. At scale, you end up with a CMS full of inconsistent data that's harder to query than the original PDFs.

The extraction template approach

The right approach is to define what you want to extract before you process a single document. In Pith, this is an extraction template: a list of fields with names, types, and plain-English instructions for the AI.

A field might look like: "effectiveDate — date — The date this order takes effect. Usually appears near the top of the document as 'Effective Date:' or 'Ordered this [day] of [month], [year]'."

The more specific your instructions, the more consistently the AI extracts the field across different document formats. Courts, for example, often have several different order templates. A good extraction template handles all of them.
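To make the idea concrete, here is a minimal sketch of a template as plain data. The `TemplateField` class, the `ORDER_TEMPLATE` list, and the `title` field are hypothetical names made up for illustration; Pith's actual template format may look different. Only the `effectiveDate` field comes from the example above.

```python
from dataclasses import dataclass


# Hypothetical shape for illustration only; not Pith's real template format.
@dataclass(frozen=True)
class TemplateField:
    name: str          # field name in the target schema
    type: str          # e.g. "date", "string", "reference"
    instructions: str  # plain-English guidance for the AI


ORDER_TEMPLATE = [
    TemplateField(
        name="effectiveDate",
        type="date",
        instructions=(
            "The date this order takes effect. Usually appears near the top "
            "of the document as 'Effective Date:' or "
            "'Ordered this [day] of [month], [year]'."
        ),
    ),
    # Assumed second field, added only to show a template with more than one entry.
    TemplateField(
        name="title",
        type="string",
        instructions="The order's title as printed at the top of the first page.",
    ),
]


def field_names(template):
    """Return field names in template order, rejecting duplicates."""
    names = [f.name for f in template]
    if len(names) != len(set(names)):
        raise ValueError("duplicate field name in template")
    return names
```

Keeping the template as data, rather than burying it in prompts, means the same definition can be reviewed, versioned, and reused across every document in a batch.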

What 'CMS-native' actually means

When Pith pushes to Sanity, body content becomes Portable Text — not pasted HTML. Images become asset references. Cross-references to other documents in your Sanity dataset resolve to typed references.

This matters because Sanity is designed to hold structured data, not HTML. If you push HTML blobs, the content becomes hard to search and hard to reuse outside the page it was written for. Proper Portable Text lets you render content however you want, now and in the future.
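For a sense of what that structure looks like, here is a rough sketch that turns plain paragraphs into Portable Text blocks and builds a reference object. The helper names (`paragraph_to_block`, `text_to_portable_text`, `reference_to`) and the document ID are invented for this example, and this is not how Pith does it internally. The block, span, and reference shapes follow Sanity's JSON conventions.

```python
import uuid


def _key():
    # Sanity expects a unique _key on each item in an array of objects.
    return uuid.uuid4().hex[:12]


def paragraph_to_block(text, style="normal"):
    """Wrap one paragraph of plain text in a Portable Text block."""
    return {
        "_type": "block",
        "_key": _key(),
        "style": style,
        "markDefs": [],
        "children": [
            {"_type": "span", "_key": _key(), "text": text, "marks": []}
        ],
    }


def text_to_portable_text(raw):
    """Split on blank lines and emit one block per non-empty paragraph."""
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    return [paragraph_to_block(p) for p in paragraphs]


def reference_to(document_id):
    """Build a reference to another document in the same dataset."""
    return {"_type": "reference", "_ref": document_id}


body = text_to_portable_text("It is hereby ordered.\n\nThis order takes effect immediately.")
related = reference_to("order-2023-001")  # hypothetical document ID
```

Each paragraph is now a typed block that a front end can render as HTML, plain text, or anything else, and `related` points at another document rather than embedding a copy of it.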

The review step is where quality comes from

AI extraction isn't perfect. Every field comes back with a confidence score. High-confidence fields (say, 90+) are usually right and can ship automatically. Low-confidence fields need a human to verify.
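The routing logic is simple enough to sketch. This example assumes a 0 to 100 confidence scale and a result shape of `{"value": ..., "confidence": ...}` per field; both are assumptions for illustration, as is the function name `split_by_confidence`. The threshold of 90 is the "say, 90+" figure from above and should be tuned to your own documents.

```python
AUTO_ACCEPT_THRESHOLD = 90  # assumed 0-100 scale; tune for your own data


def split_by_confidence(fields, threshold=AUTO_ACCEPT_THRESHOLD):
    """Split extracted fields into auto-accepted and needs-review groups.

    `fields` maps a field name to a dict with "value" and "confidence" keys
    (a hypothetical shape, not a documented Pith response format).
    """
    accepted, needs_review = {}, {}
    for name, result in fields.items():
        target = accepted if result["confidence"] >= threshold else needs_review
        target[name] = result
    return accepted, needs_review


# Made-up extraction result for one document.
extracted = {
    "effectiveDate": {"value": "2023-04-01", "confidence": 97},
    "judge": {"value": "Hon. A. Example", "confidence": 62},
}
accepted, needs_review = split_by_confidence(extracted)
```

With the sample data, `effectiveDate` lands in `accepted` and `judge` lands in `needs_review`.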

In practice, the review interface in Pith shows you all the extracted fields side-by-side with the source document. A reviewer can spot-check high-confidence values and fix low-confidence ones. For a 50-page document, this typically takes a few minutes — not a few hours.

Scaling to hundreds of documents

The batch upload feature lets you drop in a folder of PDFs and let Pith process them in sequence. Progress tracking shows where each document is in the pipeline. When a batch finishes, reviewers work through a queue sorted by confidence score — lowest confidence first.
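The queue ordering can be sketched in a few lines. How a document-level score is derived from field-level scores isn't specified here, so this example assumes a document is only as trustworthy as its weakest field and uses the minimum. The function names and the sample data are hypothetical.

```python
def document_confidence(doc):
    """Score a document by its least confident field (an assumed rule)."""
    return min(field["confidence"] for field in doc["fields"].values())


def build_review_queue(documents):
    """Order documents so the least confident ones are reviewed first."""
    return sorted(documents, key=document_confidence)


# Made-up batch results.
batch = [
    {"id": "order-a.pdf", "fields": {"effectiveDate": {"confidence": 98},
                                     "judge": {"confidence": 91}}},
    {"id": "order-b.pdf", "fields": {"effectiveDate": {"confidence": 55},
                                     "judge": {"confidence": 99}}},
    {"id": "order-c.pdf", "fields": {"effectiveDate": {"confidence": 80},
                                     "judge": {"confidence": 85}}},
]
queue = [doc["id"] for doc in build_review_queue(batch)]
```

Here `order-b.pdf` comes first because one of its fields scored 55, even though its other field scored 99. Using the average instead would bury that weak field, which is why the minimum is a reasonable default for review.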

This approach scales linearly: total effort grows in proportion to the number of documents, because the work per document stays roughly constant regardless of batch size. That's not true of manual extraction.