olmOCR-mix-0225 - A comprehensive dataset for training and fine-tuning OCR and document understanding models.

## Purpose of olmOCR-mix-0225 Dataset

The primary purpose of the olmOCR-mix-0225 dataset is to support the training, fine-tuning, and evaluation of optical character recognition (OCR) and document understanding models. It is particularly useful for vision-language models (VLMs) and is designed to address challenges in processing diverse PDF formats and visual layouts.

## Sources of olmOCR-mix-0225 Dataset

The olmOCR-mix-0225 dataset is composed of two main sources:

- Web-crawled PDFs: 99,903 unique documents totaling 249,332 pages
- Internet Archive books: 5,601 unique documents totaling 16,803 pages

This results in a total of 105,504 unique documents and 266,135 pages.

## Document Types in olmOCR-mix-0225 Dataset

The olmOCR-mix-0225 dataset includes a diverse range of document types, with the following distribution based on a sample of web-crawled PDFs:

- Academic: 60%
- Brochures: 12%
- Legal: 11%
- Tables: 6%
- Diagrams: 5%
- Slideshows: 2%
- Other types: 4%

## Metadata in olmOCR-mix-0225 Dataset

Each entry in the olmOCR-mix-0225 dataset includes the following metadata:

- URL: the original URL of the PDF document
- Page number: a 1-indexed integer indicating the page within the document
- ID: a unique identifier linking to the /pdfs files folder
- JSON blob: the primary language, rotation validity, rotation correction, whether the page contains a table or diagram, and the natural text extracted from the PDF

## Licensing and Usage Guidelines for olmOCR-mix-0225 Dataset

The olmOCR-mix-0225 dataset is licensed under ODC-By-1.0, which is suitable for research and educational use. Users must comply with the Allen Institute for AI's Responsible Use Guidelines and OpenAI's terms of use, as the dataset involves responses from the GPT-4o model.

## Large-Scale Batch Processing in olmOCR

The processing pipeline for the olmOCR-mix-0225 dataset is highly cost-effective, optimized for large-scale batch processing using SGLang. It enables the conversion of one million PDF pages for only $190 USD, which is about 1/32 of the cost of using GPT-4o APIs, making it an economically viable option for large-scale document digitization projects.

## Limitations of the olmOCR Project

The olmOCR project, which includes the olmOCR-mix-0225 dataset, currently has limitations in handling diagrams, figures, and illustrations. This is an area for potential future enhancement, as the dataset is primarily focused on extracting coherent textual representations from PDFs.

### Citation sources

- [olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225) - Official URL

Updated: 2025-03-28
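
The per-entry metadata described above (URL, 1-indexed page number, ID, and a JSON blob) could be unpacked roughly as follows. This is a minimal sketch, not the dataset's official loader: the key names (`url`, `page_number`, `id`, `response`, and the keys inside the blob) and the `PageRecord` and `parse_record` helpers are assumptions chosen for illustration, so check them against the dataset card before relying on them.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """One page entry. All field names are assumed, not confirmed by the source."""
    url: str
    page_number: int                 # 1-indexed page within the source PDF
    pdf_id: str                      # identifier linking to the /pdfs files folder
    primary_language: Optional[str]
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool
    natural_text: Optional[str]


def parse_record(row: dict) -> PageRecord:
    """Flatten a raw row (hypothetical layout) into a PageRecord."""
    page_number = int(row["page_number"])
    if page_number < 1:
        # The page number is described as 1-indexed, so 0 or below is invalid.
        raise ValueError(f"page_number must be >= 1, got {page_number}")

    # The JSON blob is assumed to arrive as a string under the "response" key.
    blob = json.loads(row["response"])
    return PageRecord(
        url=row["url"],
        page_number=page_number,
        pdf_id=row["id"],
        primary_language=blob.get("primary_language"),
        is_rotation_valid=bool(blob.get("is_rotation_valid", True)),
        rotation_correction=int(blob.get("rotation_correction", 0)),
        is_table=bool(blob.get("is_table", False)),
        is_diagram=bool(blob.get("is_diagram", False)),
        natural_text=blob.get("natural_text"),
    )


# Made-up example row; the URL, ID, and text are placeholders, not real data.
example_row = {
    "url": "https://example.com/report.pdf",
    "page_number": 3,
    "id": "abc123",
    "response": json.dumps({
        "primary_language": "en",
        "is_rotation_valid": True,
        "rotation_correction": 0,
        "is_table": False,
        "is_diagram": False,
        "natural_text": "Quarterly results improved.",
    }),
}

record = parse_record(example_row)
print(record.primary_language, record.page_number, record.natural_text)
```

Keeping the blob fields in a typed record makes it straightforward to filter pages, for example dropping entries where `is_diagram` is true, given the limitations with diagrams noted above.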