What are the main sources of the olmOCR-mix-0225 dataset?
Question
Answers ( 1 )
The olmOCR-mix-0225 dataset draws on two main sources: web-crawled PDFs (99,903 unique documents, 249,332 pages) and Internet Archive books (5,601 unique documents, 16,803 pages). Together, these give a total of 105,504 unique documents and 266,135 pages.
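As a quick sanity check on the arithmetic, the totals can be recomputed from the per-source figures quoted in this answer. This is only a minimal sketch: the `sources` dictionary and variable names are illustrative and hard-code the numbers from the answer. Nothing here loads or inspects the dataset itself.

```python
# Figures as quoted in the answer: (unique documents, total pages).
# These are hard-coded for illustration, not read from the dataset.
sources = {
    "Web-crawled PDFs": (99_903, 249_332),
    "Internet Archive books": (5_601, 16_803),
}

# Sum each column across the two sources.
total_docs = sum(docs for docs, _ in sources.values())
total_pages = sum(pages for _, pages in sources.values())

for name, (docs, pages) in sources.items():
    share = 100 * pages / total_pages  # percentage of all pages from this source
    print(f"{name}: {docs:,} documents, {pages:,} pages ({share:.1f}% of pages)")

print(f"Total: {total_docs:,} documents, {total_pages:,} pages")
# Last line printed: Total: 105,504 documents, 266,135 pages
```

The computed totals match the ones stated in the answer, so the per-source counts and the overall counts are consistent with each other.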