# LLaVA-OneVision

A comprehensive project combining a large dataset and a family of open-source multimodal models for visual understanding tasks.
## Focus of LLaVA-OneVision
The primary focus of the LLaVA-OneVision project is to advance research in multimodal AI, particularly in visual understanding tasks. It combines a large dataset with a series of open-source large multimodal models (LMMs) designed for single-image, multi-image, and video tasks.
## Task Capabilities of LLaVA-OneVision
LLaVA-OneVision models can handle a variety of tasks, including image understanding (e.g., answering image-related questions or describing image content), multi-image understanding (e.g., comparing or sorting multiple images), and video understanding (e.g., processing and answering questions about video content).
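To make the single-image case concrete, the sketch below runs one image-question turn through the Hugging Face Transformers integration of LLaVA-OneVision. It is a minimal illustration under stated assumptions: the `llava-hf/llava-onevision-qwen2-7b-ov-hf` checkpoint name (a converted variant of the `lmms-lab` weights) and the COCO sample URL are assumptions for illustration, not values taken from the project documentation; the original `lmms-lab` checkpoints are normally loaded through the LLaVA-NeXT codebase.

```python
# Minimal single-image inference sketch using the Transformers integration.
# The checkpoint name is an assumed converted variant of the lmms-lab weights.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a chat-style prompt with one image placeholder followed by a question.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what is happening in this image."},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Example image (COCO val2017 sample URL, used here purely for illustration).
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```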
## Key Features of LLaVA-OneVision
The key features of the LLaVA-OneVision project include:
- A single model that performs well in single-image, multi-image, and video scenarios (a multi-image sketch follows this list).
- Open-source models with varying parameter sizes (0.5B, 7B, 72B).
- Diverse training data, including high-quality synthetic data and real-world images and videos.
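As a sketch of that cross-scenario behavior, the example below feeds two images to the same checkpoint in a single prompt. The checkpoint name and the local image paths are assumptions/placeholders for illustration only.

```python
# Multi-image sketch: one prompt containing two images and a comparison question.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical local files; substitute any two images you want to compare.
images = [Image.open("image_a.jpg"), Image.open("image_b.jpg")]

# One image placeholder per input image, followed by the question.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text", "text": "What are the main differences between these two images?"},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```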
## Datasets in LLaVA-OneVision
The LLaVA-OneVision project uses a large dataset that includes 3.2M single-image samples, 1.6M multi-image and video samples, and high-quality synthetic data (e.g., 4M high-quality knowledge samples). The dataset draws on sources such as COCO118K, BLIP558K, and CC3M, and includes 92K Chinese caption samples and 143K Evol-Instruct samples.
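A hedged sketch for browsing the released data with the Hugging Face `datasets` library is shown below. The dataset is published as many named subsets, so the code queries the subset names rather than assuming a particular one; the field names in the final comment are typical rather than guaranteed.

```python
# Sketch: inspect LLaVA-OneVision-Data with the Hugging Face `datasets` library.
from datasets import get_dataset_config_names, load_dataset

# The data is split into many named subsets (configs); list them first
# instead of assuming a specific subset name.
configs = get_dataset_config_names("lmms-lab/LLaVA-OneVision-Data")
print(f"{len(configs)} subsets available, e.g.: {configs[:5]}")

# Stream one subset to avoid downloading the full collection up front.
subset = load_dataset(
    "lmms-lab/LLaVA-OneVision-Data", configs[0], split="train", streaming=True
)
sample = next(iter(subset))
print(sample.keys())  # typically an id, an image, and a conversations field
```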
## Model Sizes in LLaVA-OneVision
LLaVA-OneVision offers models with three different parameter sizes: 0.5B, 7B, and 72B. These models are designed to cater to different memory and inference latency needs, and all are open-source.
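The sketch below illustrates one way to trade memory for quality across the three sizes by loading a checkpoint in 4-bit via bitsandbytes. The `llava-hf/*` identifiers are assumed Transformers-converted checkpoint names, and the memory-footprint call is only a rough gauge.

```python
# Sketch: choosing among the released sizes and shrinking memory with 4-bit loading.
import torch
from transformers import BitsAndBytesConfig, LlavaOnevisionForConditionalGeneration

# Assumed converted checkpoint names; the original lmms-lab weights are
# normally used with the LLaVA-NeXT codebase instead.
CHECKPOINTS = {
    "0.5b": "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    "7b": "llava-hf/llava-onevision-qwen2-7b-ov-hf",
    "72b": "llava-hf/llava-onevision-qwen2-72b-ov-hf",
}

# 4-bit quantization (bitsandbytes) keeps the larger checkpoints within a
# tighter GPU memory budget at some cost in output quality.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    CHECKPOINTS["7b"], quantization_config=quant_config, device_map="auto"
)
print(f"Approx. memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```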
## Academic Research Support in LLaVA-OneVision
LLaVA-OneVision supports academic research by providing open-source models and data designed for visual understanding tasks. The accompanying dataset, LLaVA-OneVision-Data, is intended primarily for academic and educational use and carries restrictions on non-academic applications.
## Performance Benchmarks of LLaVA-OneVision
LLaVA-OneVision models, and the 72B model in particular, have demonstrated strong performance across 47 benchmarks, outperforming GPT-4V on single-image benchmarks such as AI2D and ChartQA and on video benchmarks such as Video-MME and LongVideoBench.
## Accessing LLaVA-OneVision Resources
Resources related to LLaVA-OneVision can be accessed through the following URLs:
- Official website: [LLaVA-OneVision Blog](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
- GitHub repository: [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)
- Hugging Face model page: [llava-onevision-qwen2-7b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov)
- arXiv paper: [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326)
- Dataset page: [LLaVA-OneVision-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)
### Citation sources:
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision) - Official URL
Updated: 2025-03-28