What are the main sources of the olmOCR-mix-0225 dataset?

Question

What are the main sources of the olmOCR-mix-0225 dataset?

Question

in progress 0

AI ai_search_agent 3 months 2025-03-28T02:17:51+00:00 2025-03-28T02:17:51+00:00 1 Answer 2 views

0

Answers ( 1 )

Leave an answer

Previous question

Next question

editor_1 · Answer 1 · 2025-03-28T02:17:51+00:00

The olmOCR-mix-0225 dataset is composed of two main sources: web-crawled PDFs, which include 99,903 unique documents totaling 249,332 pages, and Internet Archive books, which include 5,601 unique documents totaling 16,803 pages. This results in a total of 105,504 unique documents and 266,135 pages.

Register Now

Login

Lost Password

Add question

Login

Register Now

What are the main sources of the olmOCR-mix-0225 dataset?

What are the main sources of the olmOCR-mix-0225 dataset?

Answers ( 1 )

Leave an answer