# LLaVA-NeXT

An advanced multimodal model enhancing image resolution and visual reasoning capabilities.

## Overview of LLaVA-NeXT

LLaVA-NeXT is an advanced multimodal model that builds on LLaVA-1.5 (released in October 2023); LLaVA-NeXT itself launched in January 2024. It improves image processing and language understanding, particularly visual reasoning, OCR, and multimodal instruction following. The model supports higher input image resolutions and uses larger language models such as Mistral-7B and Nous-Hermes-2-Yi-34B to improve performance.

## Key Features of LLaVA-NeXT

LLaVA-NeXT's features include:

- Enhanced image-resolution support (e.g., 672x672, 336x1344, 1344x336) using 'AnyRes' technology.
- Improved datasets, including high-quality user instruction data and multimodal document/chart data.
- Support for larger language models such as Vicuna-1.5, Mistral-7B, and Nous-Hermes-2-Yi-34B.
- Zero-shot Chinese-language capability, achieving state-of-the-art results on MMBench-CN.
- Open-source code, data, and models, supported by the A16Z Open Source AI Grants Program.

## Functional Capabilities of LLaVA-NeXT

LLaVA-NeXT's functional capabilities include:

- Visual reasoning: enhanced logical reasoning for complex image scenarios.
- OCR: improved optical character recognition for document and chart analysis.
- Multimodal instruction following: the ability to process combined image and text instructions for multimodal dialogue and tasks.

## Usage of LLaVA-NeXT

To use LLaVA-NeXT:

1. Download the model from the [LLaVA-NeXT GitHub repository](https://github.com/llava-vl/LLaVA-NeXT).
2. Deploy and run inference with SGLang, available in the [SGLang GitHub repository](https://github.com/sgl-project/sglang).
3. Choose the appropriate model variant (7B, 13B, or 34B) based on your requirements.

Example loading and serving sketches are given at the end of this page.

## Training Process of LLaVA-NeXT

LLaVA-NeXT is trained in two stages:

- Stage 1: trains the vision-language connector on 558,000 samples.
- Stage 2: trains the full model on 760,000 samples.

Training is efficient, requiring only about 32 GPUs for roughly one day, with 1.318 million training samples in total.

## Performance Benchmarks of LLaVA-NeXT

LLaVA-NeXT performs strongly on benchmarks, surpassing Gemini Pro on MMMU and MathVista. It achieves state-of-the-art results compared with other open-source LMMs such as CogVLM and Yi-VL, and demonstrates strong zero-shot Chinese-language capability on MMBench-CN.

## Contributors to the LLaVA-NeXT Project

The LLaVA-NeXT project is a collaborative effort by researchers including:

- Haotian Liu (University of Wisconsin-Madison)
- Chunyuan Li (ByteDance/TikTok, partially at Microsoft Research)
- Yuheng Li (University of Wisconsin-Madison)
- Bo Li (Nanyang Technological University, in collaboration with ByteDance/TikTok)
- Yuanhan Zhang (Nanyang Technological University, in collaboration with ByteDance/TikTok)
- Sheng Shen (University of California, Berkeley)
- Yong Jae Lee (University of Wisconsin-Madison)

The project is supported by NSF CAREER IIS2150012, Microsoft Accelerating Foundation Models Research, and IITP grants (2022-0-00871, RS-2022-00187238).

### Citation sources

- [LLaVA-NeXT](https://github.com/llava-vl/LLaVA-NeXT) - Official URL

Updated: 2025-03-28
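
### Example code sketches (unofficial)

Beyond the official repository, LLaVA-NeXT checkpoints are also available as community conversions on Hugging Face (e.g. `llava-hf/llava-v1.6-mistral-7b-hf`). The following is a minimal local-inference sketch using the `transformers` classes `LlavaNextProcessor` and `LlavaNextForConditionalGeneration`; the checkpoint name, image URL, prompt template, and generation settings are illustrative assumptions, not the project's official inference path.

```python
# Minimal LLaVA-NeXT inference sketch (assumes a recent transformers release and
# the community llava-hf/llava-v1.6-mistral-7b-hf checkpoint; adjust to your variant).
import torch
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint name
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; an HTTP fetch keeps the example self-contained
# (https://example.com/chart.png is a placeholder URL).
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Prompt format follows the Mistral-style [INST] template used by this variant.
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```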
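
For serving (step 2 in the usage section), SGLang provides a Python frontend that talks to a locally launched server. The sketch below shows one plausible way to query a LLaVA-NeXT checkpoint through that frontend; the launch command, port, and frontend calls can differ between SGLang releases, so treat the names here as assumptions to verify against the SGLang documentation.

```python
# Sketch of querying LLaVA-NeXT through SGLang's Python frontend.
# Assumes a server was started separately, e.g. (flags may vary by SGLang version):
#   python -m sglang.launch_server --model-path <llava-next-checkpoint> --port 30000
import sglang as sgl

@sgl.function
def image_qa(s, image_path, question):
    # Build a chat turn containing the image and the user question,
    # then ask the model to generate an answer.
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Point the frontend at the locally running server (the port is an assumption).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = image_qa.run(image_path="chart.png", question="Summarize this chart.")
print(state["answer"])
```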