olmOCR - An open-source tool for extracting structured content from PDF documents

## Primary Purpose of olmOCR

olmOCR is an open-source PDF parsing tool designed to extract structured content such as chapters, tables, lists, and formulas. It uses a vision language model (VLM) fine-tuned on a large dataset, together with document anchoring, to improve accuracy and processing efficiency.

## Technology Behind olmOCR

olmOCR combines a vision language model (VLM) with document anchoring techniques. It fine-tunes a 7B-parameter VLM on a large dataset and uses the SGLang and vLLM frameworks for efficient large-scale data processing and hardware optimization.

## Document Types Supported by olmOCR

olmOCR handles a variety of document types, including documents that contain graphics, handwritten text, and low-quality scans, which makes it suitable for diverse real-world scenarios.

## Large-Scale Batch Processing in olmOCR

olmOCR is optimized for large-scale batch processing and can convert one million PDF pages for about $190. It achieves this by optimizing hardware utilization and inference efficiency.

## Hardware Requirements for olmOCR

olmOCR requires a recent NVIDIA GPU (e.g., RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM, plus 30 GB of free disk space.

## Key Features of olmOCR

Key features of olmOCR include:

- A fine-tuned 7B-parameter VLM trained on a diverse dataset.
- Support for various document types, including graphics and handwritten text.
- Optimization for large-scale batch processing.
- High cost-efficiency for large data processing.
- Open-source resources, including VLM weights, training code, and datasets.

## Installation and Local Usage of olmOCR

To install and use olmOCR locally:

1. Install the system dependencies: `poppler-utils`, `ttf-mscorefonts-installer`, `msttcorefonts`, `fonts-crosextra-caladea`, `fonts-crosextra-carlito`, `gsfonts`, `lcdf-typetools`.
2. Create and activate a Conda environment: `conda create -n olmocr python=3.11`, then `conda activate olmocr`.
3. Clone the repository and install it: `git clone https://github.com/allenai/olmocr.git`, `cd olmocr`, `pip install -e .[gpu]`.
4. Process PDFs with a command such as `python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf`.

## Main Functionalities of olmOCR

olmOCR's main functionalities include:

- Extracting structured content such as chapters, tables, lists, and formulas.
- Supporting multiple languages and handwritten scripts.
- Handling complex layouts and low-quality images.
- Built-in error correction that automatically fixes recognition mistakes.
- Ensuring privacy by automatically deleting documents after processing.

## Resources for olmOCR

More information about olmOCR is available on its [official website](https://olmocr.allenai.org/), its [GitHub repository](https://github.com/allenai/olmocr), and the related [arXiv paper](https://arxiv.org/abs/2502.18443).

### Citation sources:

- [olmOCR](https://olmocr.allenai.org) - Official URL

Updated: 2025-03-28
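The local-usage command, the disk-space requirement, and the batch-cost figure described above can be tied together in a small pre-flight helper. The following is only a minimal sketch, not part of olmOCR itself: the function names are hypothetical, the cost estimate assumes the quoted $190 applies to one million pages and scales linearly, and passing more than one path after `--pdfs` is an assumption.

```python
import shutil
import sys

# Figures taken from the text above; treating $190 as the price of
# one million pages is an assumption of this sketch.
COST_PER_MILLION_PAGES_USD = 190.0
MIN_FREE_DISK_GB = 30


def estimate_cost_usd(num_pages: int) -> float:
    """Linearly extrapolate the quoted batch cost to num_pages."""
    return num_pages / 1_000_000 * COST_PER_MILLION_PAGES_USD


def has_enough_disk(path: str = ".", min_free_gb: float = MIN_FREE_DISK_GB) -> bool:
    """Check the free-disk requirement on the filesystem holding path."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= min_free_gb


def build_pipeline_command(workspace: str, pdfs: list[str]) -> list[str]:
    """Build the `python -m olmocr.pipeline` invocation shown in step 4.

    Hypothetical helper: it only assembles the argument list and does
    not run anything.
    """
    return [sys.executable, "-m", "olmocr.pipeline", workspace, "--pdfs", *pdfs]


if __name__ == "__main__":
    cmd = build_pipeline_command(
        "./localworkspace", ["tests/gnarly_pdfs/horribleocr.pdf"]
    )
    print("Command:", " ".join(cmd))
    print(f"Estimated cost for 250,000 pages: ${estimate_cost_usd(250_000):.2f}")
    print("Enough free disk:", has_enough_disk())
```

The returned list can be handed to `subprocess.run(cmd, check=True)` inside the activated `olmocr` environment. The GPU memory requirement is not checked here, since doing so would need an extra dependency or a call to an external tool.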
