CosyVoice 2.0 - An advanced streaming text-to-speech model optimized for low-latency and multilingual synthesis.
## Developer of CosyVoice 2.0
CosyVoice 2.0 was developed by the **FunAudioLLM team** under **Alibaba Group's SpeechLab**.
## Performance Metrics of CosyVoice 2.0
- **Latency**: 150 milliseconds for the first synthesized audio packet.
- **Accuracy**: 30-50% reduction in pronunciation errors compared to CosyVoice 1.0, with the lowest character error rate on the Seed-TTS hard test set.
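- **MOS**: 5.53 as reported by the developers, described as on par with leading commercial models. The conventional Mean Opinion Score scale runs from 1 to 5, so this figure presumably uses a different scale or an automated predictor.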
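
First-packet latency is the time between requesting synthesis and receiving the first chunk of audio. The sketch below shows one way to measure it for any streaming source; `fake_streaming_tts` is a hypothetical stand-in, not part of CosyVoice, and would be swapped for a real streaming generator in practice.

```python
import time
from typing import Iterable, Iterator, Tuple


def first_packet_latency_ms(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (milliseconds until the first chunk arrives, total chunk count)."""
    start = time.perf_counter()
    first_ms = float("nan")  # stays NaN if the stream yields nothing
    count = 0
    for _chunk in chunks:
        if count == 0:
            first_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    return first_ms, count


def fake_streaming_tts(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical synthesizer: sleeps, then yields dummy audio bytes."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320


latency_ms, n = first_packet_latency_ms(fake_streaming_tts())
print(f"first packet after {latency_ms:.1f} ms, {n} chunks total")
```

The timer starts before the first chunk is requested, so model warm-up inside the generator counts toward the measured latency.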
## Multilingual and Dialect Support
CosyVoice 2.0 supports:
- **Languages**: Chinese, English, Japanese, Korean, and others.
- **Chinese Dialects**: Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.
- **Accent Adjustment**: Available for non-Chinese languages.
## Emotional Control Features
The model allows fine-grained control over:
- **Emotions and vocal events**: Emotional tone plus non-verbal sounds such as laughter, coughing, and breathing.
- **Prosody**: Adjustments to rhythm, pitch, and intonation for natural-sounding output.
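
Fine-grained control of this kind is typically driven by markers embedded in the input text. The sketch below assumes bracketed markers such as `[laughter]` and `[breath]`; the exact marker strings are an assumption and should be checked against the tokenizer in the GitHub repository. It only validates the text before synthesis and does not call the model.

```python
import re

# Assumed marker set; confirm against the CosyVoice repository before relying on it.
ASSUMED_MARKERS = {"[laughter]", "[breath]", "[cough]"}

# Matches lowercase bracketed tokens such as "[laughter]".
MARKER_PATTERN = re.compile(r"\[[a-z_]+\]")


def find_unknown_markers(text: str) -> list[str]:
    """Return bracketed markers in `text` that are not in ASSUMED_MARKERS."""
    return [m for m in MARKER_PATTERN.findall(text) if m not in ASSUMED_MARKERS]


text = "He paused [laughter] and took a deep [breath] before a [giggle]."
print(find_unknown_markers(text))
```

Checking markers up front helps catch typos that a model might otherwise read aloud as literal text.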
## Download and Documentation Sources
- **Model Download**:
  - [ModelScope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B)
  - [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
- **Setup Instructions**:
  - [GitHub Repository](https://github.com/FunAudioLLM/CosyVoice) (includes Python environment setup and inference scripts).
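
As a minimal sketch, the Hugging Face copy could be fetched with the `huggingface_hub` package's `snapshot_download` function. The target directory `pretrained_models/CosyVoice2-0.5B` is an assumption chosen for illustration, and the download itself needs `pip install huggingface_hub` plus network access, so it is wrapped in a function that is not called here.

```python
from urllib.parse import urlparse

HF_URL = "https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B"


def repo_id_from_url(url: str) -> str:
    """Extract '<owner>/<name>' from a Hugging Face model URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a model URL: {url}")
    return "/".join(parts[:2])


def download_model(local_dir: str = "pretrained_models/CosyVoice2-0.5B") -> str:
    """Download the model snapshot and return its local path.

    The default `local_dir` is an illustrative assumption, not a required location.
    """
    # Imported lazily so the rest of the script works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_from_url(HF_URL), local_dir=local_dir)


print(repo_id_from_url(HF_URL))
```

Calling `download_model()` would then pull the files into the chosen directory.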
## Technical Requirements
- **Python**: Version 3.10 or compatible.
- **Dependencies**: `pynini`, `sox`, and other tools listed in the GitHub repository.
- **Hardware**: GPU recommended for optimal performance (specific requirements depend on use case).
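
A quick pre-flight check can confirm these requirements before installing the rest of the stack. This sketch uses only the standard library and makes two assumptions: "3.10 or compatible" is read as Python 3.10 or newer, and `sox` is expected as a command-line tool on the `PATH`. GPU detection is left out because it depends on the deep learning framework in use.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict[str, bool]:
    """Report which of the listed requirements appear to be satisfied."""
    return {
        # Assumption: "3.10 or compatible" means 3.10 or newer.
        "python>=3.10": sys.version_info >= (3, 10),
        # True if the pynini package is importable in this environment.
        "pynini": importlib.util.find_spec("pynini") is not None,
        # Assumption: sox is installed as a system command-line tool.
        "sox": shutil.which("sox") is not None,
    }


for name, ok in check_environment().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```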
## Comparison with Commercial Models
- **Quality**: The reported MOS of 5.53 is on par with leading commercial systems.
- **Latency**: The 150 ms first-packet latency is competitive for real-time applications.
- **Flexibility**: Open-source nature allows customization, unlike most proprietary solutions.
## Future Roadmap
Planned enhancements include:
- **Bidirectional streaming inference** (currently marked as TBD).
- Expanded training/fine-tuning recipes for specialized use cases.
## Citation Sources
- [CosyVoice 2.0](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B) - Official URL
Updated: 2025-04-01