What metadata is included with each entry in the olmOCR-mix-0225 dataset?


Answers ( 1 )

    2025-03-28T02:18:01+00:00

    Each entry in the olmOCR-mix-0225 dataset includes the following metadata: the URL of the original PDF document; the page number, a 1-indexed integer giving the page within that document; an ID, a unique identifier that links to the files in the /pdfs folder; and a JSON blob containing the primary language, whether the rotation is valid, the rotation correction, whether the page contains a table or diagram, and the natural text extracted from the PDF.
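    To make that layout concrete, here is a minimal sketch of reading one entry in Python. The field names (`url`, `page_number`, `id`, `response`, and the keys inside the JSON blob) and all sample values are assumptions for illustration, not confirmed by the answer above, so check them against the actual dataset files before relying on them.

```python
import json

# Hypothetical entry: key names and values are illustrative assumptions,
# chosen to mirror the metadata described in the answer.
entry = {
    "url": "https://example.com/report.pdf",  # original URL of the PDF
    "page_number": 3,                         # 1-indexed page within the PDF
    "id": "example-id",                       # identifier linking to the /pdfs folder
    "response": json.dumps({                  # JSON blob stored as a string
        "primary_language": "en",
        "is_rotation_valid": True,
        "rotation_correction": 0,
        "is_table": False,
        "is_diagram": False,
        "natural_text": "Sample page text.",
    }),
}


def parse_entry(entry):
    """Decode the JSON blob and merge it with the top-level metadata."""
    page = json.loads(entry["response"])
    return {
        "url": entry["url"],
        "page_number": entry["page_number"],
        "id": entry["id"],
        **page,
    }


parsed = parse_entry(entry)
print(parsed["page_number"], parsed["primary_language"])
```

    Flattening the blob into the same dictionary as the top-level fields keeps filtering simple, for example selecting only pages where the (assumed) `is_table` flag is true.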
