CosyVoice 2.0 - An advanced streaming text-to-speech model optimized for low-latency and multilingual synthesis.


## Developer of CosyVoice 2.0

CosyVoice 2.0 was developed by the **FunAudioLLM team** under **Alibaba Group's SpeechLab**.

## Performance Metrics of CosyVoice 2.0

- **Latency**: 150 milliseconds for the first synthesized audio packet.
- **Accuracy**: 30-50% reduction in pronunciation errors compared to CosyVoice 1.0, with the lowest character error rate on the Seed-TTS hard test set.
- **MOS Score**: 5.53 (on par with leading commercial models).

## Multilingual and Dialect Support

CosyVoice 2.0 supports:

- **Languages**: Chinese, English, Japanese, Korean, and others.
- **Chinese Dialects**: Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.
- **Accent Adjustment**: Available for non-Chinese languages.

## Emotional Control Features

The model allows fine-grained control over:

- **Emotions**: Laughter, coughing, breathing, and other affective states.
- **Prosody**: Adjustments to rhythm, pitch, and intonation for natural-sounding output.

## Download and Documentation Sources

- **Model Download**:
  - [ModelScope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B)
  - [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
- **Setup Instructions**:
  - [GitHub Repository](https://github.com/FunAudioLLM/CosyVoice) (includes Python environment setup and inference scripts).

## Technical Requirements

- **Python**: Version 3.10 or compatible.
- **Dependencies**: `pynini`, `sox`, and other tools listed in the GitHub repository.
- **Hardware**: GPU recommended for optimal performance (specific requirements depend on use case).

## Comparison with Commercial Models

- **Quality**: MOS score of 5.53 matches leading commercial systems.
- **Latency**: 150 ms first-packet delay is competitive for real-time applications.
- **Flexibility**: Open-source nature allows customization, unlike most proprietary solutions.

## Future Roadmap

Planned enhancements include:

- **Bidirectional streaming inference** (currently marked as TBD).
- Expanded training/fine-tuning recipes for specialized use cases.

### Citation sources:

- [CosyVoice 2.0](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B) - Official URL

Updated: 2025-04-01
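
### Example: measuring first-packet latency

The 150 ms figure quoted under Performance Metrics refers to the delay before the first audio packet arrives from a streaming synthesis call. The sketch below shows one way such a number could be measured. It is a minimal illustration, not the CosyVoice API: `first_packet_latency` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in generator that sleeps and yields silent bytes in place of a real streaming model.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_packet_latency(make_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrives, total number of chunks)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _chunk in make_stream():
        if first is None:
            # Record the elapsed time only once, when the first chunk arrives.
            first = time.perf_counter() - start
        count += 1
    if first is None:
        raise ValueError("stream produced no audio chunks")
    return first, count


def fake_stream(delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: sleeps, then yields silent bytes."""
    for _ in range(chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320


if __name__ == "__main__":
    latency, n = first_packet_latency(fake_stream)
    print(f"first packet after {latency * 1000:.0f} ms, {n} chunks total")
```

To time a real model, pass a zero-argument callable that starts the model's streaming inference and returns its chunk iterator. The measured value will depend on hardware, model loading state, and input length, so it may not match the reported 150 ms.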
