
# LLaVA-OneVision

A comprehensive project combining large datasets and multimodal models for visual understanding tasks.

## Focus of LLaVA-OneVision

The primary focus of the LLaVA-OneVision project is to advance research in multimodal AI, particularly visual understanding. It combines a large dataset with a series of open-source large multimodal models (LMMs) designed for single-image, multi-image, and video tasks.

## Task Capabilities of LLaVA-OneVision

LLaVA-OneVision models handle a variety of tasks, including image understanding (e.g., answering image-related questions or describing image content), multi-image understanding (e.g., comparing or sorting multiple images), and video understanding (e.g., processing and answering questions about video content). A single-image inference sketch is included at the end of this page.

## Key Features of LLaVA-OneVision

The key features of the LLaVA-OneVision project include:

- A single model that performs well across single-image, multi-image, and video scenarios.
- Open-source models at multiple parameter sizes (0.5B, 7B, 72B).
- Diverse training data, including high-quality synthetic data as well as real-world images and videos.

## Datasets in LLaVA-OneVision

The LLaVA-OneVision project uses a large dataset that includes 3.2M single-image samples, 1.6M multi-image and video samples, and high-quality synthetic data (e.g., 4M high-quality knowledge data). The dataset draws on sources such as COCO118K, BLIP558K, and CC3M, and includes 92K Chinese captions and 143K Evo-Instruct data. A sketch of loading the released dataset is included at the end of this page.

## Model Sizes in LLaVA-OneVision

LLaVA-OneVision offers models at three parameter sizes: 0.5B, 7B, and 72B. These cater to different memory and inference-latency budgets, and all are open-source.

## Academic Research Support in LLaVA-OneVision

LLaVA-OneVision supports academic research by providing open-source models and datasets designed for visual understanding tasks. The dataset, LLaVA-OneVision-Data, is intended primarily for academic and educational use, with restrictions on non-academic applications.

## Performance Benchmarks of LLaVA-OneVision

LLaVA-OneVision models, particularly the 72B model, have demonstrated strong performance across 47 benchmarks, outperforming GPT-4V on single-image tasks (e.g., AI2D, ChartQA) and video tasks (e.g., Video-MME, LongVideoBench).

## Accessing LLaVA-OneVision Resources

Resources related to LLaVA-OneVision are available at the following URLs:

- Official website: [LLaVA-OneVision Blog](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
- GitHub repository: [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT)
- Hugging Face model page: [llava-onevision-qwen2-7b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov)
- arXiv paper: [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326)
- Dataset page: [LLaVA-OneVision-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)

### Citation sources

- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision) (official URL)

Updated: 2025-03-28
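
### Example: single-image inference (unofficial sketch)

As a concrete illustration of the single-image capability described above, the snippet below loads a LLaVA-OneVision checkpoint through Hugging Face `transformers` and asks one question about a local image. This is a minimal sketch under stated assumptions, not the official recipe: it assumes the community-converted checkpoint `llava-hf/llava-onevision-qwen2-7b-ov-hf` (the `lmms-lab` checkpoint linked above is normally run through the LLaVA-NeXT codebase), a recent `transformers` release that ships `LlavaOnevisionForConditionalGeneration`, and a placeholder image path `example.jpg`.

```python
# Minimal single-image inference sketch. Assumptions: the transformers-converted
# checkpoint "llava-hf/llava-onevision-qwen2-7b-ov-hf", a transformers version
# providing LlavaOnevisionForConditionalGeneration, and a placeholder image path.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed converted checkpoint
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat-style prompt with one image slot and one text question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Multi-image and video use follow the same general chat-template pattern (additional image entries, or sampled video frames passed to the processor); see the model card and the LLaVA-NeXT repository linked above for the exact prompts.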
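
### Example: loading LLaVA-OneVision-Data (unofficial sketch)

For the LLaVA-OneVision-Data collection described in the dataset section, the snippet below is one hedged way to pull a single subset with the `datasets` library. The subset name is a placeholder (an assumption, not taken from this page): the collection is published as many named configurations, so the real names must be read off the dataset card linked above.

```python
# Sketch of loading one subset of LLaVA-OneVision-Data with the `datasets` library.
# SUBSET is a placeholder: substitute an actual configuration name from the
# dataset card before running.
from datasets import load_dataset

SUBSET = "<subset-name-from-dataset-card>"  # placeholder, not a real config name
ds = load_dataset("lmms-lab/LLaVA-OneVision-Data", SUBSET, split="train")

print(ds)            # size and column names of the chosen subset
print(ds[0].keys())  # inspect one sample's fields
```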