What metadata is included with each entry in the olmOCR-mix-0225 dataset?
Question
Answers (1)
Each entry in the olmOCR-mix-0225 dataset includes the following metadata: the URL (the original URL of the PDF document), the page number (a 1-indexed integer giving the page's position within the document), an ID (a unique identifier that links the entry to its file in the /pdfs folder), and a JSON blob. The JSON blob contains the page's primary language, whether its rotation is valid, the rotation correction to apply, whether the page contains a table or a diagram, and the natural text extracted from the PDF.
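To make the layout concrete, here is a minimal sketch of how one such entry could be unpacked in Python. The key names (`url`, `page_number`, `id`, `response`, and the keys inside the JSON blob) are assumptions chosen to mirror the description above, and the sample row is made up for illustration. Check the dataset card for the exact schema before relying on them.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageEntry:
    url: str                        # original URL of the PDF document
    page_number: int                # 1-indexed page within the document
    id: str                         # identifier linking to the /pdfs folder
    primary_language: Optional[str]
    is_rotation_valid: bool
    rotation_correction: int
    is_table: bool
    is_diagram: bool
    natural_text: Optional[str]


def parse_entry(row: dict) -> PageEntry:
    """Flatten one row, decoding the JSON blob (assumed key: 'response')."""
    if row["page_number"] < 1:
        raise ValueError("page_number is expected to be 1-indexed")
    blob = json.loads(row["response"])
    return PageEntry(
        url=row["url"],
        page_number=row["page_number"],
        id=row["id"],
        primary_language=blob.get("primary_language"),
        is_rotation_valid=blob["is_rotation_valid"],
        rotation_correction=blob["rotation_correction"],
        is_table=blob["is_table"],
        is_diagram=blob["is_diagram"],
        natural_text=blob.get("natural_text"),
    )


# Hypothetical row, invented for this example (not taken from the dataset).
sample_row = {
    "url": "https://example.com/report.pdf",
    "page_number": 3,
    "id": "abc123",
    "response": json.dumps({
        "primary_language": "en",
        "is_rotation_valid": True,
        "rotation_correction": 0,
        "is_table": False,
        "is_diagram": False,
        "natural_text": "Quarterly results were in line with expectations.",
    }),
}

entry = parse_entry(sample_row)
print(entry.url, entry.page_number, entry.primary_language)
```

Decoding the blob into a typed record up front keeps later filtering simple, for example skipping pages where `is_table` is true or where `natural_text` is empty.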