What are the main sources of the olmOCR-mix-0225 dataset?

Answers (1)

    2025-03-28T02:17:51+00:00

    The olmOCR-mix-0225 dataset is composed of two main sources: web-crawled PDFs, which include 99,903 unique documents totaling 249,332 pages, and Internet Archive books, which include 5,601 unique documents totaling 16,803 pages. This results in a total of 105,504 unique documents and 266,135 pages.
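As a quick sanity check on the figures quoted above, here is a minimal Python sketch that re-adds the per-source counts and prints each source's share of the total. The dictionary layout and variable names are illustrative assumptions, not part of the dataset or its tooling; only the four counts come from the answer.

```python
# Counts quoted in the answer above: (unique documents, pages) per source.
# The structure and names here are illustrative, not from the dataset itself.
sources = {
    "web-crawled PDFs": (99_903, 249_332),
    "Internet Archive books": (5_601, 16_803),
}

# Sum documents and pages across both sources.
total_docs = sum(docs for docs, _ in sources.values())
total_pages = sum(pages for _, pages in sources.values())

# Show how much of the dataset each source contributes.
for name, (docs, pages) in sources.items():
    print(
        f"{name}: {docs / total_docs:.1%} of documents, "
        f"{pages / total_pages:.1%} of pages"
    )

print(f"total: {total_docs:,} documents, {total_pages:,} pages")
```

The computed totals match the stated 105,504 documents and 266,135 pages, so the breakdown in the answer is internally consistent.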
