Surya - A powerful open-source OCR tool for multilingual document processing.
## Overview of Surya
Surya is an open-source Optical Character Recognition (OCR) tool that processes PDFs and images. It supports over 90 languages and performs text detection, layout analysis, reading order detection, and table recognition. It is aimed at multilingual and structurally complex documents, and it also provides LaTeX OCR for mathematical content and Streamlit-based graphical apps for interactive use.
## Key Features of Surya
Surya offers several key features, including:
- Support for over 90 languages, making it suitable for global document processing needs.
- Line-level text detection, which is highly accurate and works with any language.
- Layout analysis, including the detection of tables, images, and headings.
- Reading order detection to ensure logical extraction of content.
- Table recognition, which accurately detects rows and columns.
- LaTeX OCR, specifically designed for handling mathematical and scientific documents.
## Installation and Usage of Surya
Surya can be installed and used as follows:
- **Installation**: Requires Python 3.10 or higher and PyTorch. Install with `pip install surya-ocr`. On a machine that is neither a Mac nor equipped with a GPU, you may need to install the CPU version of PyTorch first.
- **Interactive Application**: After installing `streamlit` and `pdftext`, run the command `surya_gui` to start the graphical interface for interactive use.
- **LaTeX OCR Application**: For LaTeX OCR, install a pinned version of `streamlit` together with the drawable-canvas component (`pip install streamlit==1.40 streamlit-drawable-canvas-jsretry`), then run the command `texify_gui`.
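The setup steps above can be gathered into a small helper. This is a minimal sketch, not part of Surya: the names `python_is_supported`, `setup_commands`, and the `"ocr"`/`"latex"` labels are hypothetical and chosen for illustration. The package names and launch commands are the ones listed above. The helper only builds the command strings and does not run them.

```python
import sys

# Minimum Python version stated in the installation notes above.
MIN_PYTHON = (3, 10)

# Extra packages and launch command for each interactive app, as listed above.
# The keys "ocr" and "latex" are labels invented for this sketch.
APPS = {
    "ocr": (["streamlit", "pdftext"], "surya_gui"),
    "latex": (["streamlit==1.40", "streamlit-drawable-canvas-jsretry"], "texify_gui"),
}


def python_is_supported(version=None):
    """Return True if the given (or current) interpreter is Python 3.10+."""
    major_minor = tuple((version or sys.version_info)[:2])
    return major_minor >= MIN_PYTHON


def setup_commands(app):
    """Return the shell commands that install Surya and launch one app."""
    extras, launcher = APPS[app]
    return [
        "pip install surya-ocr",
        "pip install " + " ".join(extras),
        launcher,
    ]
```

Calling `setup_commands("ocr")` yields the three commands for the general interactive app in the order they should be run. Checking `python_is_supported()` first avoids a failed install on an older interpreter.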
## Commercial Use Restrictions for Surya
Commercial use of Surya is restricted. It is permitted only when all of the following conditions hold:
- The user's revenue over the past 12 months is less than $5 million.
- The user's total lifetime VC/angel investment is less than $5 million.
- The user does not compete with the Datalab API.
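The conditions above can be expressed as a single boolean check. The following is a rough sketch that assumes all three conditions must hold at the same time and that "less than" is strict. The `Organization` class and `meets_commercial_conditions` function are hypothetical names used for illustration, and the project's license text remains the authoritative source.

```python
from dataclasses import dataclass

# Thresholds taken from the conditions listed above (US dollars).
REVENUE_LIMIT_USD = 5_000_000
FUNDING_LIMIT_USD = 5_000_000


@dataclass
class Organization:
    """Hypothetical record of the facts the conditions depend on."""
    revenue_last_12_months_usd: float
    lifetime_funding_usd: float
    competes_with_datalab_api: bool


def meets_commercial_conditions(org):
    """Return True only if every listed condition is satisfied."""
    return (
        org.revenue_last_12_months_usd < REVENUE_LIMIT_USD
        and org.lifetime_funding_usd < FUNDING_LIMIT_USD
        and not org.competes_with_datalab_api
    )
```

For example, an organization with $1.2 million in revenue, no outside funding, and no competing product passes the check, while one with exactly $5 million in revenue does not under the strict reading assumed here.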
## Resources and Community for Surya
Users can find more information about Surya through the following resources:
- **Hugging Face Page**: [https://huggingface.co/vikp](https://huggingface.co/vikp)
- **GitHub Repository**: [https://github.com/VikParuchuri/surya](https://github.com/VikParuchuri/surya)
- **Datalab Hosted API**: [https://www.datalab.to/](https://www.datalab.to/)
- **Discord Community**: [https://discord.gg/KuZwXNGnfH](https://discord.gg/KuZwXNGnfH)
- **Datasets**: DocLayNet benchmark ([https://huggingface.co/datasets/vikp/doclaynet_bench](https://huggingface.co/datasets/vikp/doclaynet_bench)) and Tapuscorpus ([https://github.com/HTR-United/tapuscorpus](https://github.com/HTR-United/tapuscorpus)).
### Citation sources:
- [Surya](https://github.com/VikParuchuri/surya) - Official URL
Updated: 2025-03-28