Extracting Knowledge from PDFs: Advanced Techniques and Local Solutions

technology artificial-intelligence document-processing links

The challenge of extracting structured information from PDF documents for use with Large Language Models (LLMs) is a significant topic in AI. Andrew Ng, through his newsletter, has shed light on both the complexities and emerging solutions in this area. This post examines the principles behind advanced document extraction, exemplified by concepts from LandingAI’s Agentic Document Extraction (ADE), and contrasts them with a local PDF-to-Markdown conversion tool I developed, inspired by Simon Willison’s work.

The Challenge of PDF Data Extraction

PDFs frequently contain what is termed “dark data”—information that exists but remains largely inaccessible for automated analysis. Historically, the static and visually oriented nature of PDFs made it difficult for software to interpret their content beyond basic text recognition. With the advent of LLMs, the potential to derive meaning from vast archives of PDF documents has grown, yet accurate extraction remains a hurdle.

Andrew Ng highlights a critical aspect of this challenge:

“Some of the most disconcerting mistakes I’ve seen a computer make have been when a system incorrectly extracted figures from a large table of numbers or complex form and output a confident-sounding but incorrect financial figure. Because our intuition tells us that computers are good at numbers (after all, computers are supposed to be good at computing!), I’ve seen users find silent failures in the form of incorrect numerical outputs particularly hard to catch.”

This observation underscores that simple text extraction is often insufficient; a deeper understanding of document structure and context is required to prevent subtle, yet impactful, errors.

Principles of Advanced Document Extraction (e.g., Agentic Approaches)

Advanced document extraction systems, such as those employing “agentic” principles, aim to overcome the limitations of traditional OCR by mimicking human-like understanding. The core idea is to process documents not as flat images or simple text streams, but as structured entities with visual and semantic relationships.

Key principles often include:

  • Intelligent Document Understanding: Moving beyond basic Optical Character Recognition (OCR) to interpret visual cues, such as the layout of form fields, tables, and checkboxes. This involves understanding the purpose of elements based on their visual presentation.
  • Complex Layout Extraction: Decomposing documents into semantic units rather than just text blocks. This allows for parsing diverse formats (e.g., PDFs, scanned images, complex tables) without extensive pre-training for each specific layout. The goal is to identify intricate relationships between elements.
  • Accurate Extraction of Structured Data: Systems are designed to precisely extract data from tables, charts, and other complex visual layouts. This minimises errors that arise from text-only analysis, ensuring comprehensive data capture.
  • Visual Grounding: The ability to pinpoint the exact location of visual elements and text within a document. This feature is crucial for verifying extracted information by linking it back to its source, thereby building trust in AI-generated insights.
  • Field-Specific Extraction: The capability to extract only specific, relevant fields from documents (e.g., invoice numbers, medical record entries). This automates large-scale data capture, reduces manual errors, and ensures consistency. A sketch of what a grounded, field-specific result might look like follows this list.
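
To make the last two points more concrete, a grounded, field-specific extraction result can be thought of as a small record that carries a value together with its source location on the page. The sketch below is illustrative only; the ExtractedField type and its field names are assumptions, not part of ADE or any other product's API.

```typescript
// Illustrative shape for a grounded, field-specific extraction result.
// These type and field names are assumptions for explanation, not a real API.
interface BoundingBox {
  page: number;   // 1-based page index
  x: number;      // left edge, in page coordinates
  y: number;      // top edge, in page coordinates
  width: number;
  height: number;
}

interface ExtractedField {
  name: string;         // e.g. "invoice_number"
  value: string;        // e.g. "INV-2024-0042"
  confidence: number;   // model confidence between 0 and 1
  source: BoundingBox;  // visual grounding: where the value was read from
}

// Verification becomes possible because every value points back to its source.
function describeSource(field: ExtractedField): string {
  const { page, x, y } = field.source;
  return `${field.name} = "${field.value}" (page ${page}, near x=${x}, y=${y})`;
}
```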

The “agentic workflow” concept suggests that, much like a human, an AI system can iteratively examine different parts of a document to gather information piece by piece. This might involve using specialised models, such as a Document Pre-trained Transformer (DPT), to break down complex documents into smaller, more manageable sub-problems, leading to more accurate results.
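
As a rough illustration of that workflow, an agentic pipeline can be imagined as a loop that detects regions of a document, extracts each one independently, and merges the results. The sketch below is hypothetical: the region and extractor types stand in for whatever layout-detection and extraction models such a system actually uses.

```typescript
// Hypothetical sketch of an agentic extraction loop: break the document into
// regions (sub-problems), extract each one separately, then merge the results.
interface Region {
  page: number;
  kind: 'table' | 'form' | 'paragraph';
}

type RegionDetector = (pdf: ArrayBuffer) => Promise<Region[]>;
type RegionExtractor = (region: Region) => Promise<Record<string, string>>;

async function agenticExtract(
  pdf: ArrayBuffer,
  detectRegions: RegionDetector,       // e.g. a layout model such as a DPT
  extractFromRegion: RegionExtractor,  // e.g. an LLM prompted per region
): Promise<Record<string, string>> {
  const merged: Record<string, string> = {};
  // Examine the document piece by piece, solving each sub-problem separately.
  for (const region of await detectRegions(pdf)) {
    Object.assign(merged, await extractFromRegion(region));
  }
  return merged;
}
```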

My Local PDF-to-Markdown Tool: The Sacred Scroll Transcriber

In contrast to comprehensive, cloud-based solutions, I developed “The Sacred Scroll Transcriber” as a local, privacy-focused tool for converting PDFs to Markdown. Inspired by Simon Willison’s work on client-side OCR, this static site operates entirely within the user’s browser.

Core Capabilities:

  • Local Processing: All operations, from PDF rendering to OCR, are performed client-side, ensuring no data leaves the user’s device.
  • OCR-Based Extraction: It leverages Tesseract.js for optical character recognition. Each PDF page is first rendered as an image, and then OCR is applied to extract text.
  • Markdown Output: The tool converts the extracted text into Markdown format, aiming to preserve the document’s structure where possible.
  • Image Handling: It can extract and display images embedded within the PDF.
  • User-Centric Design: The tool is designed for straightforward use, allowing users to upload a PDF and download the transcribed content; a minimal sketch of the download step follows this list.
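
For illustration, the download half of that flow needs nothing beyond standard browser APIs. The helper below is a minimal sketch under that assumption; saveMarkdown is an invented name, not the tool's actual code.

```typescript
// Minimal sketch of the client-side "download the transcribed content" step,
// using only standard browser APIs. saveMarkdown is an assumed name.
function saveMarkdown(markdown: string, filename = 'transcript.md'): void {
  const blob = new Blob([markdown], { type: 'text/markdown' });
  const url = URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = url;
  link.download = filename;
  link.click();              // prompts the browser to save the file
  URL.revokeObjectURL(url);  // release the object URL afterwards
}
```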

Technical Approach (Inspired by Simon Willison):[1]

  1. PDF Page Rendering: PDF.js is used to render each page of the PDF to a high-resolution canvas.
  2. Image Conversion: The content of the canvas is then converted to a JPEG image.
  3. Tesseract OCR: Tesseract.js processes these rendered images to perform text recognition.
  4. Structured Output: The recognised text is organised by page and formatted into Markdown (see the sketch after this list).
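
Assuming pdfjs-dist and the Tesseract.js v5 worker API are available in the page, those four steps look roughly like the sketch below; names such as transcribePdf are illustrative rather than the tool's actual identifiers.

```typescript
import * as pdfjsLib from 'pdfjs-dist';
import { createWorker } from 'tesseract.js';

// In a real page, pdfjsLib.GlobalWorkerOptions.workerSrc must also be set.
async function transcribePdf(data: ArrayBuffer): Promise<string> {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const worker = await createWorker('eng');
  const pages: string[] = [];

  for (let i = 1; i <= pdf.numPages; i++) {
    // 1. Render the page to a high-resolution canvas.
    const page = await pdf.getPage(i);
    const viewport = page.getViewport({ scale: 2 });
    const canvas = document.createElement('canvas');
    canvas.width = Math.floor(viewport.width);
    canvas.height = Math.floor(viewport.height);
    await page.render({ canvasContext: canvas.getContext('2d')!, viewport }).promise;

    // 2. Convert the canvas content to a JPEG image.
    const jpeg = canvas.toDataURL('image/jpeg', 0.95);

    // 3. Run Tesseract OCR on the rendered image.
    const { data: { text } } = await worker.recognize(jpeg);

    // 4. Organise the recognised text by page, formatted as Markdown.
    pages.push(`## Page ${i}\n\n${text.trim()}`);
  }

  await worker.terminate();
  return pages.join('\n\n');
}
```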

While my tool offers a practical, privacy-conscious solution for converting PDFs to a structured text format, it primarily relies on image-based OCR. Advanced systems, by contrast, integrate a deeper understanding of visual context and semantic relationships, making them suitable for highly complex documents where precise data extraction and verification are paramount for applications like financial analysis or medical record processing.


  1. This approach is directly inspired by Simon Willison’s OCR tool.
