Babillage Dataset - A multimodal benchmark dataset for evaluating vision speech models.
## Purpose of Babillage Dataset
The Babillage Dataset is designed to serve as a benchmark for evaluating vision speech models, specifically their ability to handle spoken visual question-answering tasks in conversational formats.
## Source Datasets of Babillage Dataset
The Babillage Dataset is based on three existing datasets: COCO-Captions 2014, OCR-VQA, and VQAv2, which were transformed into conversational question-answer pairs.
## Subsets of Babillage Dataset
The Babillage Dataset consists of three subsets:
1. Conversational COCO (CoCOCO)
2. Conversational OCR-VQA (CoOCR-VQA)
3. Conversational VQAv2 (CoVQAv2)
## Sample Structure in Babillage Dataset
Each sample in the Babillage Dataset typically includes:
- sample_id (unique identifier)
- image_id (for CoOCR-VQA and CoCOCO)
- Question Audio (duration and content)
- Question Transcript
- Question Alignment (time alignment sequence)
- Answer Audio (duration and content)
- Answer Transcript
- Answer Alignment (time alignment sequence)
## Accessing Babillage Dataset
The Babillage Dataset can be loaded via Hugging Face's `datasets` library using the following commands, where `split` is the name of the split to load:
- CoCOCO: `datasets.load_dataset("kyutai/Babillage", "coco", split=split)`
- CoOCR-VQA: `datasets.load_dataset("kyutai/Babillage", "ocrvqa", split=split)`
- CoVQAv2: `datasets.load_dataset("kyutai/Babillage", "vqav2", split=split)`
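
As a minimal sketch of the commands above, the helper below maps each subset to its configuration name and passes it to `datasets.load_dataset`. The helper name `load_babillage_subset`, the `BABILLAGE_CONFIGS` mapping, and the `"validation"` split in the usage comment are illustrative assumptions; check the dataset page for the splits each subset actually provides.

```python
# Configuration names as listed above; the dictionary keys are the subset names.
BABILLAGE_CONFIGS = {
    "CoCOCO": "coco",
    "CoOCR-VQA": "ocrvqa",
    "CoVQAv2": "vqav2",
}


def load_babillage_subset(subset, split):
    """Load one Babillage subset (hypothetical helper, not part of any library)."""
    if subset not in BABILLAGE_CONFIGS:
        raise ValueError(
            f"unknown subset {subset!r}; expected one of {sorted(BABILLAGE_CONFIGS)}"
        )
    # Imported lazily so the mapping can be inspected without `datasets` installed.
    import datasets

    return datasets.load_dataset(
        "kyutai/Babillage", BABILLAGE_CONFIGS[subset], split=split
    )


# Usage (needs network access; the split name is an assumption):
# ds = load_babillage_subset("CoVQAv2", split="validation")
```

Validating the subset name before calling the library keeps a typo from turning into a less obvious error from the Hub.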
## License of Babillage Dataset
The Babillage Dataset is released under the CC-BY 4.0 license, which allows for sharing and adaptation with proper attribution.
## Supported Tasks by Babillage Dataset
The Babillage Dataset supports evaluation of:
- Image Description
- Visual Question Answering (VQA)
- Optical Character Recognition related QA (OCR-VQA)
- Performance assessment of multimodal dialogue systems
## Babillage Dataset and MoshiVis Connection
The Babillage Dataset was developed by the Kyutai team and is closely associated with MoshiVis, an open-source vision speech model that supports real-time voice conversations with visual understanding capabilities.
## Hosting Location of Babillage Dataset
The Babillage Dataset is officially hosted on Hugging Face at: https://huggingface.co/datasets/kyutai/babillage
## Audio Format in Babillage Dataset
The dataset stores audio in Ogg format; code snippets are provided for converting it to WAV if needed.
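
The conversion can be sketched as follows. This is not the snippet shipped with the dataset. It is a minimal illustration that assumes the third-party `soundfile` package, built against a libsndfile with Ogg support, and that the Ogg audio is available as raw bytes. The function names `wav_path_for` and `ogg_bytes_to_wav` are hypothetical.

```python
import io
from pathlib import Path


def wav_path_for(ogg_path):
    """Return the same path with its suffix replaced by .wav."""
    return Path(ogg_path).with_suffix(".wav")


def ogg_bytes_to_wav(ogg_bytes, wav_path):
    """Decode Ogg audio bytes and write them out as a WAV file."""
    # Lazy import: soundfile is an assumed third-party dependency.
    import soundfile as sf

    # sf.read accepts a file-like object and returns (samples, sample_rate).
    samples, sample_rate = sf.read(io.BytesIO(ogg_bytes))
    sf.write(str(wav_path), samples, sample_rate, format="WAV")
    return wav_path
```

Decoding from an in-memory buffer avoids writing an intermediate `.ogg` file when the audio arrives as bytes rather than as a path on disk.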
### Citation sources:
- [Babillage Dataset](https://huggingface.co/datasets/kyutai/babillage) - Official URL
Updated: 2025-04-01