# PaliGemma 2 Release: a multimodal model collection by Google for vision-language tasks
## Introduction to PaliGemma 2 Release
PaliGemma 2 Release is a collection of vision-language models (VLMs) developed by Google. The collection spans 3B, 10B, and 28B parameter sizes and pairs the Gemma 2 language model with the SigLIP vision encoder. The models support multiple image resolutions and are designed for tasks such as image captioning, visual question answering (VQA), optical character recognition (OCR), table structure recognition, and medical image understanding.
## Key Features
The key features of PaliGemma 2 Release include:
- Multiple model sizes: 3B, 10B, and 28B parameters.
- Support for various image resolutions: 224x224, 448x448, and 896x896.
- Integration of the SigLIP vision model and the Gemma 2 language model.
- High flexibility for fine-tuning on a wide range of vision-language tasks.
## Tasks Supported by PaliGemma 2 Release
PaliGemma 2 Release models can perform the following tasks:
- Image captioning: Generating detailed descriptions of images, including actions, emotions, and scene narratives.
- Visual question answering (VQA): Answering questions related to images.
- Optical character recognition (OCR): Extracting text from images.
- Table structure recognition: Understanding the structure and content of tables, a task that may require fine-tuning.
- Medical image understanding: Generating reports from medical images, such as chest X-rays.
- Specialized recognition: Chemical formula recognition, music score recognition, and spatial reasoning.
## Fine-tuning PaliGemma 2 Release Models
Users can fine-tune PaliGemma 2 Release models using the Transformers library; example code and a fine-tuning notebook are provided to help users get started. The checkpoints are released in bfloat16 format and are intended for research purposes.
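A minimal loading sketch with Transformers is shown below. The checkpoint id `google/paligemma2-3b-pt-224` is an assumption (a plausible name for the 3B, 224x224 variant); since the weights are released in bfloat16, the model is loaded in that dtype.

```python
# Sketch: loading a PaliGemma 2 checkpoint with Transformers for inference
# or as a starting point for fine-tuning. The checkpoint id is an assumption;
# consult the collection page for the exact repo names.
import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

MODEL_ID = "google/paligemma2-3b-pt-224"  # assumed checkpoint id

def load_paligemma(model_id: str = MODEL_ID):
    """Load model and processor, keeping the weights in their native bfloat16."""
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # native precision of the released weights
        device_map="auto",
    )
    processor = PaliGemmaProcessor.from_pretrained(model_id)
    return model, processor

if __name__ == "__main__":
    from PIL import Image

    model, processor = load_paligemma()
    image = Image.open("photo.jpg")
    inputs = processor(
        text="<image>caption en", images=image, return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(processor.decode(out[0], skip_special_tokens=True))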
## Accessing PaliGemma 2 Release Models
Users can access PaliGemma 2 Release models on the Hugging Face platform at the following URL: [PaliGemma 2 Release Collection](https://huggingface.co/collections/google/paligemma-2-release-67500e1e1dbfdd4dee27ba48). Additional resources, including a fine-tuning notebook and technical report, are also available on the platform.
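The models in the collection can also be enumerated programmatically with the `huggingface_hub` client; `get_collection` is a real `huggingface_hub` API, and the slug below is taken from the collection URL above.

```python
# Sketch: listing the model repos in the PaliGemma 2 Release collection
# via huggingface_hub (requires network access when actually run).
from huggingface_hub import get_collection

COLLECTION_SLUG = "google/paligemma-2-release-67500e1e1dbfdd4dee27ba48"

def list_collection_models(slug: str = COLLECTION_SLUG) -> list[str]:
    """Return the repo ids of the model entries in the collection."""
    collection = get_collection(slug)
    return [item.item_id for item in collection.items if item.item_type == "model"]

if __name__ == "__main__":
    for repo_id in list_collection_models():
        print(repo_id)
```

This is a convenient way to discover the exact checkpoint names (sizes and resolutions) before choosing one to download.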
## Architecture and Design
PaliGemma 2 Release is built on the Gemma 2 language model and the SigLIP vision encoder, and is inspired by the PaLI-3 model. The models accept both image and text inputs and generate text outputs, and they support multiple languages across the vision-language tasks described above.
## Hardware and Software Requirements for PaliGemma 2 Release
PaliGemma 2 Release models are trained using TPUv5e hardware. The software stack includes JAX, Flax, TFDS, and big_vision. Fine-tuning and inference code can be found in the big_vision GitHub repository. The models are provided in bfloat16 format and are intended for research purposes, with users required to adhere to Google's Responsible Use Guidelines and Prohibited Use Policy.
### Citation sources:
- [PaliGemma 2 Release](https://huggingface.co/collections/google/paligemma-2-release-67500e1e1dbfdd4dee27ba48) - Official URL
Updated: 2025-03-28