
CosyVoice 2.0 - An advanced streaming text-to-speech model optimized for low-latency and multilingual synthesis.

## Developer of CosyVoice 2.0

CosyVoice 2.0 was developed by the **FunAudioLLM team** under **Alibaba Group's SpeechLab**.

## Performance Metrics of CosyVoice 2.0

- **Latency**: 150 milliseconds for the first synthesized audio packet.
- **Accuracy**: 30-50% reduction in pronunciation errors compared to CosyVoice 1.0, with the lowest character error rate on the Seed-TTS hard test set.
- **MOS Score**: 5.53 (on par with leading commercial models).

## Multilingual and Dialect Support

CosyVoice 2.0 supports:

- **Languages**: Chinese, English, Japanese, Korean, and others.
- **Chinese Dialects**: Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.
- **Accent Adjustment**: Available for non-Chinese languages.

## Emotional Control Features

The model allows fine-grained control over:

- **Emotions**: Laughter, coughing, breathing, and other affective states.
- **Prosody**: Adjustments to rhythm, pitch, and intonation for natural-sounding output.

## Download and Documentation Sources

- **Model Download**:
  - [ModelScope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B)
  - [Hugging Face](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)
- **Setup Instructions**:
  - [GitHub Repository](https://github.com/FunAudioLLM/CosyVoice) (includes Python environment setup and inference scripts).

## Technical Requirements

- **Python**: Version 3.10 or compatible.
- **Dependencies**: `pynini`, `sox`, and other tools listed in the GitHub repository.
- **Hardware**: GPU recommended for optimal performance (specific requirements depend on use case).

## Comparison with Commercial Models

- **Quality**: MOS score of 5.53 matches leading commercial systems.
- **Latency**: 150ms first-packet delay is competitive for real-time applications.
- **Flexibility**: Open-source nature allows customization, unlike most proprietary solutions.

## Future Roadmap

Planned enhancements include:

- **Bidirectional streaming inference** (currently marked as TBD).
- Expanded training/fine-tuning recipes for specialized use cases.

### Citation sources:

- [CosyVoice 2.0](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B) - Official URL

Updated: 2025-04-01
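
The 150 ms first-packet latency quoted under Performance Metrics is the kind of figure you may want to check on your own hardware. The sketch below is one way to time any streaming synthesizer that yields audio chunks. It does not call CosyVoice itself: `fake_tts_stream` is a hypothetical stand-in, and its delays and chunk size are assumptions made only so the example runs without a model or GPU.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def first_packet_latency(make_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrives, total number of chunks)."""
    start = time.perf_counter()
    first_latency = None
    chunk_count = 0
    for _chunk in make_stream():
        if first_latency is None:
            # Time from the request until the first audio chunk is available.
            first_latency = time.perf_counter() - start
        chunk_count += 1
    if first_latency is None:
        raise ValueError("stream produced no audio chunks")
    return first_latency, chunk_count


def fake_tts_stream(text: str, startup_s: float = 0.05, chunk_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer (NOT the CosyVoice API).

    Sleeps for `startup_s` before the first chunk, then yields one dummy
    chunk of silence per whitespace-separated word.
    """
    time.sleep(startup_s)
    for _word in text.split():
        yield b"\x00" * 320  # placeholder audio bytes, assumed size
        time.sleep(chunk_s)


if __name__ == "__main__":
    latency, chunks = first_packet_latency(lambda: fake_tts_stream("hello streaming world"))
    print(f"first packet after {latency * 1000:.1f} ms, {chunks} chunks total")
```

To measure a real model, replace the lambda with a function that starts a streaming inference call from the GitHub repository's inference scripts and returns its chunk iterator. Averaging over several runs and discarding the first (warm-up) run generally gives a steadier number.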