whisper-large-v3-turbo

Whisper is a cutting-edge model designed for automatic speech recognition (ASR) and speech translation, introduced in the paper “Robust Speech Recognition via Large-Scale Weak Supervision” by Alec Radford and colleagues at OpenAI. Trained on over 5 million hours of labeled data, Whisper exhibits excellent generalization capabilities across various datasets and domains in a zero-shot setting.

Whisper large-v3-turbo is a fine-tuned version of the pruned Whisper large-v3. Essentially, it’s the same model but with the number of decoder layers reduced from 32 to 4. This significantly improves the model’s speed, though with a slight reduction in quality. More information can be found in this GitHub discussion.
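
As a minimal sketch of how the checkpoint can be loaded for inference, assuming the Hugging Face transformers ASR pipeline and the datasets library for a sample audio clip:

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# Use a GPU with half precision when available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the turbo checkpoint through the high-level ASR pipeline.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch_dtype,
    device=device,
)

# A short public sample, used here purely for illustration.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```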

Model Details

Whisper is a Transformer-based encoder-decoder model, also known as a sequence-to-sequence model. It comes in two variants: English-only and multilingual. The English-only models are trained specifically for English speech recognition, while the multilingual models are trained for both multilingual speech recognition and speech translation. In the case of speech recognition, the model generates transcriptions in the same language as the input audio. For speech translation, the model outputs transcriptions in a different language from the original audio.
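
Reusing the `pipe` and `sample` objects from the sketch above, one way to switch between the two tasks is through the generation arguments that transformers exposes for Whisper:

```python
# Speech recognition (default): the transcript is in the same language as the audio.
transcription = pipe(sample)

# Speech translation: the multilingual checkpoints can instead emit an English
# translation of the audio; the source language can optionally be pinned,
# e.g. generate_kwargs={"language": "french", "task": "translate"}.
translation = pipe(sample, generate_kwargs={"task": "translate"})

print(transcription["text"])
print(translation["text"])
```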

Whisper models are available in five different configurations, each varying in size. The smallest four configurations offer both English-only and multilingual versions, while the largest models are multilingual-only. All ten pre-trained checkpoints are accessible on the Hugging Face Hub; the model sizes and configurations are summarized in the following table:

| Size           | Parameters | English-only | Multilingual |
|----------------|------------|--------------|--------------|
| tiny           | 39 M       | ✓            | ✓            |
| base           | 74 M       | ✓            | ✓            |
| small          | 244 M      | ✓            | ✓            |
| medium         | 769 M      | ✓            | ✓            |
| large          | 1550 M     | ✗            | ✓            |
| large-v2       | 1550 M     | ✗            | ✓            |
| large-v3       | 1550 M     | ✗            | ✓            |
| large-v3-turbo | 809 M      | ✗            | ✓            |

Fine-Tuning

While the pre-trained Whisper model already performs well across various datasets and domains, fine-tuning can enhance its predictive accuracy for specific languages or tasks. By fine-tuning, you can further optimize the model to handle unique data or target specific domains. The blog post Fine-Tune Whisper with Transformers offers a detailed, step-by-step guide on how to fine-tune the Whisper model using as little as 5 hours of labeled data.
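
As a hedged sketch of how such a run might be configured with transformers: the checkpoint name is real, but the output path and hyperparameters below are illustrative placeholders, and the dataset and data-collator preparation that a `Seq2SeqTrainer` needs is covered in the blog post.

```python
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

# Load the pre-trained checkpoint and its processor (feature extractor + tokenizer).
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3-turbo")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

# Illustrative hyperparameters only, not taken from the blog post; tune these
# to the amount of labeled data and hardware available.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-finetuned",  # hypothetical output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)
```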

