# OpenAI Speech-to-Text

OpenAI's Speech-to-Text API converts audio to text, supports a wide range of languages, and can return results in real time.
## Supported Models in OpenAI Speech-to-Text
OpenAI's Speech-to-Text functionality supports the following models:
- **whisper-1**: The foundational model for transcription and translation.
- **gpt-4o-mini-transcribe**: A lightweight transcription model that accepts a more limited set of request parameters.
- **gpt-4o-transcribe**: A more capable transcription model, also with limited parameter support.

Whisper-1 is currently the only model that supports translation into English.
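A minimal transcription sketch with the official `openai` Python SDK; the filename is a placeholder, and the client assumes `OPENAI_API_KEY` is set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:  # placeholder filename
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # or "gpt-4o-transcribe" / "gpt-4o-mini-transcribe"
        file=audio_file,
    )

print(transcript.text)
```

The same call shape works for all three models; only the `model` string changes.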
## Supported Audio File Formats
OpenAI's Speech-to-Text supports the following audio file formats:
- mp3
- mp4
- mpeg
- mpga
- m4a
- wav
- webm
The maximum file size allowed for upload is 25 MB.
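A small pre-flight check can catch rejected uploads before any API call is made; the helper below is a sketch, and its names are arbitrary:

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # the API's 25 MB upload limit

def check_upload(path: str) -> None:
    """Raise if the file would be rejected for format or size."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported audio format: {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError("File exceeds the 25 MB limit; split it first")
```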
## Output Formats for Transcriptions
The available output formats depend on the model used:
- **For Whisper-1**: `json`, `text`, `srt`, `verbose_json`, and `vtt`.
- **For the newer models (gpt-4o-mini-transcribe and gpt-4o-transcribe)**: `json` and `text` only.
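For example, requesting SRT subtitles from Whisper-1 is just a matter of the `response_format` parameter; a sketch with a placeholder filename:

```python
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:  # placeholder filename
    srt_subtitles = client.audio.transcriptions.create(
        model="whisper-1",  # srt/vtt/verbose_json are whisper-1-only formats
        file=audio_file,
        response_format="srt",
    )

# With "srt" the SDK returns the subtitle document as a plain string.
print(srt_subtitles)
```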
## Multilingual Support in OpenAI Speech-to-Text
OpenAI's Speech-to-Text supports a wide range of languages: Whisper was trained on 98 languages, and OpenAI lists those achieving a Word Error Rate (WER) below 50% as officially supported. The full list of supported languages is available in the [Whisper model documentation](https://github.com/openai/whisper#available-models-and-languages). This breadth makes the API well suited to multilingual settings such as international meetings and translation workflows.
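When the input language is known ahead of time, the optional ISO-639-1 `language` hint can improve accuracy; a sketch (Spanish and the filename are just examples):

```python
from openai import OpenAI

client = OpenAI()

with open("entrevista.m4a", "rb") as audio_file:  # placeholder: Spanish audio
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="es",  # optional ISO-639-1 hint for the spoken language
    )

print(transcript.text)
```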
## Real-Time Audio Processing Support
OpenAI's Speech-to-Text supports real-time audio processing in two ways:
- **Streamed transcription of a completed file**: Pass `stream=True` to receive transcript text as it is generated (supported by the gpt-4o transcription models, not whisper-1); see the sketch below.
- **Realtime API**: For live, ongoing audio streams, over a WebSocket connection (`wss://api.openai.com/v1/realtime?intent=transcription`).
This feature is particularly useful for live translations and real-time meeting transcriptions.
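A minimal streaming sketch for a completed file, under the assumption (from OpenAI's streaming docs) that the SDK yields `transcript.text.delta` events; the filename is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:  # placeholder filename
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # whisper-1 does not support stream=True
        file=audio_file,
        response_format="text",
        stream=True,
    )

    # Print partial transcript text as each delta event arrives.
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
```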
## Improving Transcription Quality with Prompts
Users can enhance transcription quality by providing prompts that:
- Spell out uncommon words, acronyms, or technical terms so they are transcribed correctly.
- Encourage the desired punctuation style.
- Signal that filler words ("umm", "uh") should be kept in the transcript.
Detailed guidance on prompt usage is available in the [OpenAI API reference](https://platform.openai.com/docs/api-reference/audio/createTranscription#audio/createTranscription-prompt).
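A sketch of prompt usage; the product names are made up and stand in for whatever domain terms your audio actually contains:

```python
from openai import OpenAI

client = OpenAI()

with open("standup.mp3", "rb") as audio_file:  # placeholder filename
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # Hypothetical domain terms; the prompt nudges spelling and style,
        # and the trailing fillers hint that "umm"/"uh" should be kept.
        prompt="We discussed AcmeSync, Kubernetes, and gRPC. Umm, uh...",
    )

print(transcript.text)
```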
## Handling Large Audio Files
For audio files exceeding the 25 MB limit, users can:
- Split the file into smaller segments using tools like PyDub, ideally at pauses rather than mid-sentence so no context is lost.
- Process each segment separately and combine the results afterward.
This approach keeps each request within the API's file size limit.
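A sketch of the splitting step with PyDub (which requires ffmpeg); the 10-minute chunk length is an assumption chosen to stay under 25 MB at typical MP3 bitrates:

```python
from pydub import AudioSegment

audio = AudioSegment.from_file("long_recording.mp3")  # placeholder filename
chunk_ms = 10 * 60 * 1000  # assumed 10-minute chunks; tune for your bitrate

for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk = audio[start:start + chunk_ms]  # PyDub slices by milliseconds
    chunk.export(f"chunk_{i:03d}.mp3", format="mp3")
```

Each exported chunk can then be sent through the transcription call shown earlier, with the text segments concatenated in order.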
## Obtaining Word-Level Timestamps
To get word-level timestamps (available with whisper-1), users must:
1. Set `response_format="verbose_json"`.
2. Include `timestamp_granularities=["word"]` in the API request.
This configuration provides precise timing information for each word in the transcription.
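A sketch of the request and of reading the timestamps back; the filename is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

with open("meeting.mp3", "rb") as audio_file:  # placeholder filename
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # verbose_json and word timestamps require whisper-1
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

# Each entry carries the word plus its start/end offsets in seconds.
for word in transcript.words:
    print(f"{word.word}\t{word.start:.2f}s -> {word.end:.2f}s")
```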
## Primary Use Cases for OpenAI Speech-to-Text
The primary use cases for OpenAI's Speech-to-Text include:
- **Meeting notes**: Automatically transcribing discussions.
- **Voice memos**: Converting spoken notes into text.
- **Multilingual translations**: Translating non-English audio into English text (see the sketch after this list).
- **Real-time applications**: Live captioning and translation services.
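For the translation use case, the dedicated translations endpoint returns English text directly; a sketch with a placeholder filename:

```python
from openai import OpenAI

client = OpenAI()

with open("interview_de.mp3", "rb") as audio_file:  # placeholder: German audio
    translation = client.audio.translations.create(
        model="whisper-1",  # the only model that supports translation
        file=audio_file,
    )

print(translation.text)  # English text
```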
### Citation sources:
- [OpenAI Speech-to-Text guide](https://platform.openai.com/docs/guides/speech-to-text) - official documentation
Updated: 2025-04-01