Babillage Dataset - A multimodal benchmark dataset for evaluating vision speech models.

## Purpose of Babillage Dataset

The Babillage Dataset is designed to serve as a benchmark for evaluating vision speech models, specifically their ability to handle spoken visual question-answering tasks in conversational formats.

## Source Datasets of Babillage Dataset

The Babillage Dataset is based on three existing datasets: COCO-Captions 2014, OCR-VQA, and VQAv2, which were transformed into conversational question-answer pairs.

## Subsets of Babillage Dataset

The Babillage Dataset consists of three subsets:

1. Conversational COCO (CoCOCO)
2. Conversational OCR-VQA (CoOCR-VQA)
3. Conversational VQAv2 (CoVQAv2)

## Sample Structure in Babillage Dataset

Each sample in the Babillage Dataset typically includes:

- sample_id (unique identifier)
- image_id (for CoOCR-VQA and CoCOCO)
- Question Audio (duration and content)
- Question Transcript
- Question Alignment (time alignment sequence)
- Answer Audio (duration and content)
- Answer Transcript
- Answer Alignment (time alignment sequence)

## Accessing Babillage Dataset

The Babillage Dataset can be loaded via Hugging Face's datasets library using the following commands:

- CoCOCO: `datasets.load_dataset("kyutai/Babillage", "coco", split=split)`
- CoOCR-VQA: `datasets.load_dataset("kyutai/Babillage", "ocrvqa", split=split)`
- CoVQAv2: `datasets.load_dataset("kyutai/Babillage", "vqav2", split=split)`

## License of Babillage Dataset

The Babillage Dataset is released under the CC-BY 4.0 license, which allows for sharing and adaptation with proper attribution.
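The three loading commands listed under Accessing Babillage Dataset differ only in the configuration name, so they can be wrapped in one small helper. The following is a minimal sketch, not an official API: the names `SUBSET_CONFIGS`, `resolve_config`, and `load_babillage_subset` are hypothetical, and the caller is assumed to pass a `split` name that actually exists for the chosen subset (the available split names are not stated here).

```python
# Hypothetical helper around the documented load_dataset calls.
# Only the repository id and the three config names come from the text above.

SUBSET_CONFIGS = {
    "CoCOCO": "coco",        # Conversational COCO
    "CoOCR-VQA": "ocrvqa",   # Conversational OCR-VQA
    "CoVQAv2": "vqav2",      # Conversational VQAv2
}


def resolve_config(subset: str) -> str:
    """Map a subset name to its Hugging Face configuration name."""
    try:
        return SUBSET_CONFIGS[subset]
    except KeyError:
        known = ", ".join(sorted(SUBSET_CONFIGS))
        raise ValueError(f"Unknown subset {subset!r}; expected one of: {known}")


def load_babillage_subset(subset: str, split: str):
    """Load one Babillage subset; `split` must be valid for that subset (assumption)."""
    # Imported lazily so the name mapping can be used without the library installed.
    import datasets

    return datasets.load_dataset("kyutai/Babillage", resolve_config(subset), split=split)
```

A call such as `load_babillage_subset("CoVQAv2", split)` is then equivalent to the third command in the list, and a misspelled subset name fails early with a `ValueError` instead of a remote lookup error.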

## Supported Tasks by Babillage Dataset

The Babillage Dataset supports evaluation of:

- Image Description
- Visual Question Answering (VQA)
- Optical Character Recognition related QA (OCR-VQA)
- Performance assessment of multimodal dialogue systems

## Babillage Dataset and MoshiVis Connection

The Babillage Dataset was developed by the Kyutai team and is closely associated with the MoshiVis project, which is an open-source vision speech model supporting real-time voice conversations with visual understanding capabilities.

## Hosting Location of Babillage Dataset

The Babillage Dataset is officially hosted on Hugging Face at: https://huggingface.co/datasets/kyutai/babillage

## Audio Format in Babillage Dataset

The dataset stores audio files in ogg format, but provides code snippets to convert them to wav format if needed.

### Citation sources:

- [Babillage Dataset](https://huggingface.co/datasets/kyutai/babillage) - Official URL

Updated: 2025-04-01
