Janus-Pro-7B - A unified multimodal AI model for understanding and generating text and images.
## Overview of Janus-Pro-7B
Janus-Pro-7B is a multimodal AI model developed by deepseek-ai, designed to unify tasks involving both understanding and generating text and images. It supports tasks such as image captioning, location recognition, context reasoning, OCR text recognition, and text-to-image generation. The model is built on DeepSeek-LLM-1.5b-base and DeepSeek-LLM-7b-base, using SigLIP-Large-Patch16-384 as the visual encoder, and supports image inputs of up to 384 x 384 resolution.
## Key Features of Janus-Pro-7B
The key features of Janus-Pro-7B include:
- **Unified Framework**: It integrates multimodal understanding and generation, reducing redundancy and supporting diverse applications.
- **Decoupled Visual Encoding**: This design separates visual encoding paths to mitigate conflicts between understanding and generation tasks, enhancing flexibility.
- **Task Support**: The model supports image captioning, location recognition, context reasoning, OCR text recognition, and text-to-image generation, covering core functionalities of visual language models.
## Training Process of Janus-Pro-7B
Janus-Pro-7B is trained in three stages:
1. **Adapter and Image Head Training**: Optimizes the model's ability to process multimodal inputs.
2. **Text-to-Image Pretraining**: Focuses on initial training for generation tasks.
3. **Supervised Fine-Tuning**: Enhances performance using annotated data.
This staged approach improves the model's stability and adaptability, particularly in text-to-image generation tasks.
## Supported Tasks of Janus-Pro-7B
Janus-Pro-7B supports the following tasks:
- **Image Captioning**: Generating textual descriptions of images.
- **Location Recognition**: Identifying geographical locations in images.
- **Context Reasoning**: Inferring contextual information from images.
- **OCR Text Recognition**: Extracting text from images.
- **Text-to-Image Generation**: Creating images based on textual prompts.
## Usage of Janus-Pro-7B
To use Janus-Pro-7B, follow these steps:
1. **Installation**: Ensure Python 3.8 or higher is installed. Clone the GitHub repository [deepseek-ai/Janus](https://github.com/deepseek-ai/Janus) and install dependencies using `pip install -e .`. For Gradio support, use `pip install -e .[gradio]`.
2. **Inference**: The model path is "deepseek-ai/Janus-Pro-7B". Parameters for text-to-image generation include temperature, parallel size, and CFG weight, as detailed in the repository documentation.
3. **Demo Tools**: Run Gradio demos (e.g., `python demo/app_januspro.py`) or FastAPI demos (e.g., `python demo/fastapi_app.py`). Online demos are also available on Hugging Face Spaces.
Note: The model is not compatible with Hugging Face's inference API due to architectural differences.
## Performance Benchmarks of Janus-Pro-7B
Janus-Pro-7B has been evaluated using benchmarks such as POPE, MME-Perception, GQA, and MMMU for multimodal understanding tasks, and GenEval and DPG-Bench for text-to-image generation. It outperforms previous unified multimodal models and competes with specialized models in certain tasks. Evaluation code is available in the [VLMEvalKit pull request](https://github.com/open-compass/VLMEvalKit/pull/541).
## Licenses for Janus-Pro-7B
The code for Janus-Pro-7B is licensed under MIT, as detailed in the [DeepSeek-LLM Code License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-CODE). Model usage is governed by the [DeepSeek-LLM Model License](https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL). For inquiries, contact via GitHub issues or email at [email protected].
### Citation sources:
- [Janus-Pro-7B](https://hf-mirror.com/deepseek-ai/Janus-Pro-7B) - Official URL
Updated: 2025-03-28