# olmOCR - An open-source tool for extracting structured content from PDF documents
## Primary Purpose of olmOCR
olmOCR is an open-source PDF parsing tool designed to extract structured content such as chapters, tables, lists, and formulas. It pairs a vision language model (VLM), fine-tuned on a large dataset, with a document anchoring technique to improve accuracy and processing efficiency.
## Core Technology of olmOCR
olmOCR combines a vision language model (VLM) with a document anchoring technique. It fine-tunes a 7B-parameter VLM on a large dataset and uses the SGLang and vLLM inference frameworks for efficient large-scale processing and better hardware utilization.
## Document Types Supported by olmOCR
olmOCR handles a variety of document types, including documents with graphics, handwritten text, and low-quality scans, which makes it suitable for diverse real-world scenarios.
## Large-Scale Batch Processing in olmOCR
olmOCR is optimized for large-scale batch processing: it can convert one million PDF pages for about $190. It achieves this by optimizing hardware utilization and inference efficiency.
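
The cost claim can be turned into a quick back-of-the-envelope estimate. The sketch below assumes the $190 figure is the price per one million pages and that cost scales linearly with page count; the function name `estimate_cost_usd` is hypothetical and not part of olmOCR.

```python
# Assumption: $190 is the cost per one million pages, and cost scales linearly.
COST_PER_MILLION_PAGES_USD = 190.0


def estimate_cost_usd(num_pages: int) -> float:
    """Scale the quoted batch price to an arbitrary page count."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * COST_PER_MILLION_PAGES_USD / 1_000_000


if __name__ == "__main__":
    # Print the estimate for a few illustrative corpus sizes.
    for pages in (10_000, 1_000_000, 25_000_000):
        print(f"{pages:>12,} pages -> ${estimate_cost_usd(pages):,.2f}")
```

Actual cost depends on the GPU type, its rental price, and achieved throughput, so treat the result as a rough planning number only.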
## Hardware Requirements for olmOCR
olmOCR requires a recent NVIDIA GPU (e.g., RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM and 30 GB of free disk space.
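
Before installing, it can help to check a machine against these thresholds. The following is a minimal sketch, not part of olmOCR: it assumes `nvidia-smi` is on the `PATH` and reports total memory in MiB, and the helper names (`meets_requirements`, `largest_gpu_mem_gib`, `free_disk_gib`) are hypothetical.

```python
import shutil
import subprocess

MIN_GPU_MEM_GB = 20    # GPU RAM threshold stated above
MIN_FREE_DISK_GB = 30  # free disk threshold stated above


def meets_requirements(gpu_mem_gb: float, free_disk_gb: float) -> bool:
    """Return True if both stated thresholds are satisfied."""
    return gpu_mem_gb >= MIN_GPU_MEM_GB and free_disk_gb >= MIN_FREE_DISK_GB


def free_disk_gib(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def parse_mib_lines(output: str) -> list[float]:
    """Parse one numeric MiB value per line, skipping unparsable lines."""
    values = []
    for line in output.splitlines():
        try:
            values.append(float(line.strip()))
        except ValueError:
            continue
    return values


def largest_gpu_mem_gib() -> float:
    """Largest GPU memory in GiB via nvidia-smi, or 0.0 if it is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0.0
    return max(parse_mib_lines(out), default=0.0) / 1024  # MiB -> GiB


if __name__ == "__main__":
    gpu, disk = largest_gpu_mem_gib(), free_disk_gib()
    print(f"GPU memory: {gpu:.1f} GiB, free disk: {disk:.1f} GiB")
    print("OK" if meets_requirements(gpu, disk) else "Below stated requirements")
```

The check treats GB and GiB as interchangeable, which is close enough for a sanity check; a GPU that sits right at the 20 GB boundary deserves a closer look.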
## Key Features of olmOCR
Key features of olmOCR include:
- A fine-tuned 7B-parameter VLM trained on a diverse dataset.
- Support for various document types, including graphics and handwritten text.
- Optimization for large-scale batch processing.
- High cost-efficiency for large data processing.
- Open-source resources, including VLM weights, training code, and datasets.
## Installation and Local Usage of olmOCR
To install and use olmOCR locally:
1. Install system dependencies (Debian/Ubuntu package names): `poppler-utils`, `ttf-mscorefonts-installer`, `msttcorefonts`, `fonts-crosextra-caladea`, `fonts-crosextra-carlito`, `gsfonts`, `lcdf-typetools`.
2. Create and activate a Conda environment: `conda create -n olmocr python=3.11` and `conda activate olmocr`.
3. Clone the repository and install: `git clone https://github.com/allenai/olmocr.git`, `cd olmocr`, `pip install -e .[gpu]`.
4. Process PDFs using commands like `python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf`.
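
The pipeline command from step 4 can also be driven from a script. This is a minimal sketch that only wraps the command-line invocation shown above; it does not use any olmOCR Python API. The helper names `build_pipeline_command` and `run_pipeline` are hypothetical, and passing several paths after `--pdfs` is an assumption based on how a shell glob would expand.

```python
import shlex
import subprocess
import sys


def build_pipeline_command(workspace: str, pdfs: list[str]) -> list[str]:
    """Build the argv for `python -m olmocr.pipeline <workspace> --pdfs ...`."""
    if not pdfs:
        raise ValueError("at least one PDF path is required")
    # sys.executable keeps the call inside the active (e.g. Conda) environment.
    return [sys.executable, "-m", "olmocr.pipeline", workspace, "--pdfs", *pdfs]


def run_pipeline(workspace: str, pdfs: list[str]) -> subprocess.CompletedProcess:
    """Run the pipeline and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_pipeline_command(workspace, pdfs), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_pipeline(...) to actually execute it.
    cmd = build_pipeline_command(
        "./localworkspace", ["tests/gnarly_pdfs/horribleocr.pdf"]
    )
    print(shlex.join(cmd))
```

Building the argument list separately from running it makes the command easy to log or inspect before a long GPU job starts.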
## Main Functionalities of olmOCR
olmOCR's main functionalities include:
- Extracting structured content like chapters, tables, lists, and formulas.
- Supporting multiple languages and handwritten scripts.
- Handling complex layouts and low-quality images.
- Correcting recognition errors automatically through built-in error correction.
- Ensuring privacy by automatically deleting documents after processing.
## Resources for olmOCR
More information about olmOCR is available on its [official website](https://olmocr.allenai.org/), in its [GitHub repository](https://github.com/allenai/olmocr), and in the related [arXiv paper](https://arxiv.org/abs/2502.18443).
### Citation sources:
- [olmOCR](https://olmocr.allenai.org) - Official URL
Updated: 2025-03-28