Pixtral-12B-2409 is a multimodal AI model from Mistral AI that supports interleaved text and image processing with a 128k-token context length.
## Core Architecture of Pixtral-12B-2409
Pixtral-12B-2409 consists of two primary components:
- **Decoder**: A 12-billion-parameter transformer model for text processing.
- **Visual Encoder**: A 400-million-parameter module for image understanding.
The model is designed to handle interleaved image-text data natively.
## Key Capabilities of Pixtral-12B-2409
The model demonstrates strong performance in:
- **Multimodal tasks**: Document QA (DocVQA 90.7%), visual question answering (VQAv2 78.6%), and chart analysis.
- **Text-only benchmarks**: Maintains competitive performance in pure text generation and comprehension.
- **Multilingual support**: Processes 24 languages, including Chinese, English, Japanese, and Korean.
## Performance Comparison with Competing Models
- **Vs. GPT-4o Mini**: Pixtral outperforms it on specific multimodal benchmarks (e.g., DocVQA, VQAv2) but may lag on text-only tasks such as MMLU.
- **Vs. Gemma 3**: Direct comparisons are limited due to incomplete benchmark alignment, but Pixtral shows advantages in multimodal reasoning tasks.
Note: GPT-4o Mini scores 82% on MMLU (a text-only benchmark), while Pixtral's strengths lie in multimodal applications.
## Hardware Requirements for Deployment
The model can run on a **single NVIDIA RTX 4090 GPU (24GB VRAM)**. It is optimized for efficiency and supports:
- **Libraries**: `vLLM` (v≥0.6.2) and `mistral-inference` (v≥1.4.1).
- **Deployment options**: Local execution or API integration via platforms like La Plateforme.
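When the model is served locally (for example with vLLM's OpenAI-compatible server), requests follow the OpenAI chat-completions shape, with image parts passed as `image_url` content. The sketch below only builds and serializes such a request payload; the model identifier comes from the model card, but the image URL is a placeholder and the local endpoint mentioned in the comment is an assumption about your deployment.

```python
import json

# Sketch of an OpenAI-style chat request for a locally served Pixtral
# instance. The image URL below is a placeholder, not a real resource.
payload = {
    "model": "mistralai/Pixtral-12B-2409",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
# Once the server is running, POST `body` to its /v1/chat/completions
# endpoint (commonly http://localhost:8000/v1/chat/completions) with
# any HTTP client.
print(body[:40])
```

Building the payload separately from the transport keeps the example runnable without a GPU and makes it easy to swap in `mistral-inference` or a hosted API later.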
## Multilingual Support in Pixtral-12B-2409
The model supports **24 languages**, including but not limited to:
- Chinese (中文)
- English
- Japanese (日本語)
- Korean (한국어)
This broad coverage makes the model suitable for multilingual and cross-lingual use cases.
## Licensing Information
Pixtral-12B-2409 is released under the **Apache 2.0 license**, allowing permissive use, modification, and distribution in both research and commercial applications.
## Image Processing Flexibility
The model supports:
- **Variable image sizes and aspect ratios**.
- **Multi-image processing** within its 128k-token context window.
- **Image-to-code generation** (e.g., converting diagrams to HTML).
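Multi-image prompts are typically built by interleaving `image_url` parts with text in a single user turn; local files can be passed inline as base64 `data:` URLs, a convention many multimodal chat APIs accept. The helper and message layout below are an illustrative sketch, not an official client: the placeholder bytes stand in for a real image file.

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes as a data: URL so a local image can be sent
    as an `image_url` content part without hosting it anywhere."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Interleave several images with text in one user turn. The 128k-token
# context window, not a fixed per-request cap, bounds how many images fit.
fake_png = b"\x89PNG\r\n\x1a\n"  # placeholder bytes for illustration only
content = [
    {"type": "text", "text": "Compare the two diagrams and emit HTML."},
    {"type": "image_url", "image_url": {"url": to_data_url(fake_png)}},
    {"type": "image_url", "image_url": {"url": to_data_url(fake_png)}},
]
print(content[1]["image_url"]["url"][:22])
```

Because the model accepts variable sizes and aspect ratios natively, no resizing or padding step is needed before encoding.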
## Benchmark Results
Key benchmark scores include:
| Task | Metric | Score |
|---------------|------------------|-------|
| MMMU (CoT) | Accuracy | 52.5% |
| MathVista     | Accuracy         | 58.0% |
| DocVQA | ANLS | 90.7% |
| VQAv2 | VQA Match | 78.6% |
These highlight its multimodal proficiency.
### Citation sources:
- [Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409) - Official URL
Updated: 2025-04-01