PDF Ingestion for RAG: Document-Aware Chunking for Tables, Headings, and OCR
Treating PDFs as raw text leads to chunks that cut tables mid-row, lose section headers, and split lists in half. Document-aware ingestion preserves structure so your RAG system retrieves coherent, contextual chunks—not fragments.
The Problem with Naive Chunking
Many RAG pipelines chunk PDFs by character or token count. A 32,000-character limit might split a critical table across two chunks, leaving the LLM with incomplete data in each. A paragraph about "refund policy" loses its section heading, so a retrieved chunk no longer shows that it belongs under "Billing & Support."
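To make the failure concrete, here is a minimal sketch of fixed-size character chunking. The function name split_by_chars, the sample table, and the tiny 40-character limit are all illustrative assumptions (a real pipeline would use a much larger limit), but the effect is the same at any size: the cut lands wherever the count runs out.

```python
def split_by_chars(text: str, limit: int) -> list[str]:
    """Naive chunking: cut every `limit` characters, ignoring document structure."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


# A small pipe-delimited table standing in for text extracted from a PDF.
table = (
    "Plan | Price | Refund window\n"
    "Basic | $10 | 14 days\n"
    "Pro | $30 | 30 days\n"
)

# The limit is deliberately tiny so the split is visible in a short example.
chunks = split_by_chars(table, 40)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends partway through the "Basic" row, so neither chunk contains that row's price and refund window together.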
Structure-Aware Chunking
Document-aware ingestion recognizes:
- Tables: Kept as single chunks. No mid-row splits.
- Headings: Section hierarchy (e.g., Introduction › Overview) is prepended to each chunk for context.
- Paragraphs: Grouped by logical sections, with heading prefixes so retrieval returns "Billing › Refund Policy › ..." not just isolated text.
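The three rules above can be sketched as a small chunker over already-parsed elements. This is a simplified illustration, not ShinRAG's actual implementation: the Element class, its kind and level fields, and the chunk_elements function are hypothetical, and it assumes the parser has already labeled each element and assigned heading depths.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str       # "Title", "Table", or any other label (treated as body text)
    text: str
    level: int = 1  # heading depth; only used when kind == "Title"


def chunk_elements(elements: list[Element], sep: str = " › ") -> list[str]:
    path: list[str] = []    # current heading hierarchy, outermost first
    buffer: list[str] = []  # body text collected under the current heading
    chunks: list[str] = []

    def emit(body: str) -> None:
        # Prepend the heading path so the chunk carries its own context.
        prefix = sep.join(path)
        chunks.append(f"{prefix}\n{body}" if prefix else body)

    def flush() -> None:
        if buffer:
            emit("\n".join(buffer))
            buffer.clear()

    for el in elements:
        if el.kind == "Title":
            flush()                   # close the previous section first
            del path[el.level - 1:]   # drop headings at this depth or deeper
            path.append(el.text)
        elif el.kind == "Table":
            flush()
            emit(el.text)             # a table is always its own chunk
        else:
            buffer.append(el.text)
    flush()
    return chunks


elements = [
    Element("Title", "Billing", level=1),
    Element("Title", "Refund Policy", level=2),
    Element("NarrativeText", "Refunds are issued within 14 days."),
    Element("Table", "Plan | Window\nBasic | 14 days"),
    Element("Title", "Support", level=1),
    Element("NarrativeText", "Email us."),
]
chunks = chunk_elements(elements)
```

Here the refund paragraph and the table each become a separate chunk prefixed with "Billing › Refund Policy", and the table text is never split. A production chunker would also need a size cap for very long sections, which this sketch omits.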
Two Parsing Paths
An implementation can use one of two parsing strategies:
- Unstructured.io (preferred): Uses high-resolution element typing (Title, Table, NarrativeText, ListItem) and handles complex layouts. Includes OCR for scanned PDFs.
- pdf2json (fallback): Font-size heuristics for headings, HLine/VLine detection for tables. No OCR—scanned PDFs return empty content.
When Unstructured is configured, it runs first. If the API fails or isn't available, the system falls back to pdf2json automatically.
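The try-first, fall-back-on-failure behavior might look roughly like the sketch below. The parser functions are stand-ins: flaky_primary and simple_fallback are hypothetical stubs, not real Unstructured or pdf2json calls, and parse_with_fallback is an assumed name for illustration.

```python
import logging
from typing import Callable, Optional

# A parser takes raw PDF bytes and returns a list of element dicts.
Parser = Callable[[bytes], list[dict]]


def parse_with_fallback(
    pdf: bytes, primary: Optional[Parser], fallback: Parser
) -> tuple[list[dict], str]:
    """Run the primary parser if configured; on any error, use the fallback."""
    if primary is not None:
        try:
            return primary(pdf), "primary"
        except Exception as exc:
            logging.warning("primary parser failed, falling back: %s", exc)
    return fallback(pdf), "fallback"


# Hypothetical stubs standing in for the two real parsing paths.
def flaky_primary(pdf: bytes) -> list[dict]:
    raise RuntimeError("API unavailable")


def simple_fallback(pdf: bytes) -> list[dict]:
    return [{"type": "NarrativeText", "text": "parsed by fallback"}]


elements, source = parse_with_fallback(b"%PDF-", flaky_primary, simple_fallback)
unconfigured, unconfigured_source = parse_with_fallback(b"%PDF-", None, simple_fallback)
```

Returning which path produced the elements is a design choice worth keeping: it lets you flag documents that went through the fallback and may therefore be missing OCR text.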
OCR for Scanned Documents
Scanned PDFs are page images with no selectable text layer. Unstructured's hi_res strategy includes OCR (Tesseract) to extract text from those images. If your corpus includes legacy scanned documents, Unstructured is essential; pdf2json cannot handle them.
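Because a text-only parser returns empty content for scanned files, one simple way to spot documents that need OCR is to check how much text came back. The heuristic below is an assumption for illustration (the looks_scanned name and the 20-character threshold are arbitrary), not a documented feature of either parser.

```python
def looks_scanned(page_texts: list[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a PDF as scanned if no page yields meaningful text.

    `page_texts` holds the text extracted from each page by a text-only parser.
    An empty list (no pages parsed) is also reported as scanned.
    """
    return all(len(text.strip()) < min_chars for text in page_texts)


scanned = looks_scanned(["", "   \n"])
digital = looks_scanned(["Refunds are issued within 14 days of purchase.", ""])
```

A check like this can route only the documents that need it to the slower OCR path, though mixed PDFs with some scanned and some digital pages would need a per-page decision instead.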
Metadata for Retrieval
Chunks include pageNumber and sectionPath in metadata. You can filter by section or page, and provenance is clear when the LLM cites a chunk.
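As a rough sketch of how that metadata can be used, the example below filters chunks by section and page and builds a citation string. The pageNumber and sectionPath field names come from the text above; storing sectionPath as a " › "-joined string, the chunk dict layout, and the filter_chunks and cite helpers are assumptions for illustration.

```python
from typing import Optional

SEP = " › "  # assumed separator between levels of the section path


def filter_chunks(
    chunks: list[dict],
    section: Optional[str] = None,
    page: Optional[int] = None,
) -> list[dict]:
    """Keep chunks inside `section` (or a subsection of it) and/or on `page`."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        path = meta["sectionPath"]
        if section is not None and not (path == section or path.startswith(section + SEP)):
            continue
        if page is not None and meta["pageNumber"] != page:
            continue
        kept.append(chunk)
    return kept


def cite(chunk: dict) -> str:
    """Format a human-readable provenance string for a chunk."""
    meta = chunk["metadata"]
    return f'{meta["sectionPath"]} (p. {meta["pageNumber"]})'


chunks = [
    {"text": "Refunds are issued within 14 days.",
     "metadata": {"pageNumber": 3, "sectionPath": "Billing › Refund Policy"}},
    {"text": "Invoices are sent monthly.",
     "metadata": {"pageNumber": 4, "sectionPath": "Billing › Invoices"}},
    {"text": "Email us.",
     "metadata": {"pageNumber": 7, "sectionPath": "Support"}},
]
billing = filter_chunks(chunks, section="Billing")
page_three = filter_chunks(chunks, page=3)
citation = cite(billing[0])
```

In practice the same filter would usually be pushed down into the vector store's metadata query rather than applied in application code.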
ShinRAG: PDF Ingestion Built In
ShinRAG ingests PDFs with structure-aware chunking. Upload PDFs to your index; the ingestion pipeline parses with Unstructured (or pdf2json fallback), preserves tables, prepends headings, and produces chunks optimized for retrieval.
Ingest PDFs with Document-Aware Chunking
Upload PDFs with tables, headings, and complex layouts. ShinRAG preserves structure and improves RAG retrieval quality. Try it with your documentation today.