Open Whisper-style Speech Models (OWSM)
Overview
Open Whisper-style Speech Models (OWSM, pronounced as “awesome”) are a series of speech foundation models developed by WAVLab at Carnegie Mellon University. We reproduce Whisper-style training using publicly available data and our open-source toolkit ESPnet. By publicly releasing data preparation scripts, training and inference code, pre-trained model weights, and training logs, we aim to promote transparency and open science in large-scale speech pre-training.
Pre-trained models
We publicly release a series of pre-trained models. The training logs are also available for major models. We recommend using OWSM v3.1 or later versions for better performance and efficiency.
| Name | Data (hours) | Encoder | Parameters | Model Link | ESPnet Recipe |
|---|---|---|---|---|---|
| OWSM v1 | 38k | Transformer | 272M | espnet/owsm_v1 | egs2/owsm_v1/s2t1 |
| OWSM v2 | 129k | Transformer | 712M | espnet/owsm_v2 | egs2/owsm_v2/s2t1 |
| OWSM v2 | 129k | E-Branchformer | 739M | espnet/owsm_v2_ebranchformer | egs2/owsm_v2/s2t1 |
| OWSM v3 | 180k | Transformer | 889M | espnet/owsm_v3 | egs2/owsm_v3/s2t1 |
| OWSM v3.1 base | 180k | E-Branchformer | 101M | espnet/owsm_v3.1_ebf_base | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 small | 180k | E-Branchformer | 367M | espnet/owsm_v3.1_ebf_small | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 medium | 180k | E-Branchformer | 1.02B | espnet/owsm_v3.1_ebf | egs2/owsm_v3.1/s2t1 |
| OWSM v3.1 small low-restriction | 70k | E-Branchformer | 367M | espnet/owsm_v3.1_ebf_small_lowrestriction | egs2/owsm_v3.1/s2t1 |
| OWSM-CTC v3.1 medium | 180k | E-Branchformer | 1.01B | pyf98/owsm_ctc_v3.1_1B | Check model page |
| OWSM v3.2 small | 180k | E-Branchformer | 367M | espnet/owsm_v3.2 | Coming soon |
Data details
The latest OWSM v3.1 models are trained on a diverse combination of public datasets as listed below.
OWSM v3.1 training data mixtures
- AIDATATANG
- AISHELL-1
- AMI
- Babel
- Common Voice
- Googlei18n
- CoVoST2
- Fisher Callhome Spanish
- Fisher (Switchboard)
- FLEURS
- GigaSpeech
- GigaST
- KsponSpeech
- LibriSpeech
- MagicData
- Multilingual LibriSpeech
- MuST-C
- ReazonSpeech
- Russian Open STT
- SPGISpeech
- TEDLIUM3
- VCTK
- VoxForge
- VoxPopuli
- WenetSpeech
The low-restriction model is trained on a subset of the above data with “more flexible licenses”.
OWSM v3.1 low-restriction data
- AMI: CC-BY-4.0
- Common Voice: CC0-1.0
- FLEURS: CC-BY-4.0
- KsponSpeech: MIT
- LibriSpeech: CC-BY-4.0
- Multilingual LibriSpeech: CC-BY-4.0
- VCTK: CC-BY-4.0
Inference
Similar to other ESPnet models, the pre-trained OWSM models can be easily downloaded and used in a Python script. Below are some examples using OWSM v3.1. For earlier versions (v2 and before), the language code should follow the two-letter format (e.g., `<en>`, `<de>`).
Language Identification
We pass the Hugging Face model tag when initializing `Speech2Language`. The model will be automatically downloaded from Hugging Face to a local cache directory.
```python
from espnet2.bin.s2t_inference_language import Speech2Language

s2l = Speech2Language.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    nbest=3,  # return nbest prediction and probability
)

import soundfile as sf

speech, rate = sf.read("audio.wav")

result = s2l(speech)
print(result)
# list of tuples (language, probability)
# [('<eng>', 0.9994348883628845), ('<jpn>', 0.00010286537144565955), ('<rus>', 6.185896199895069e-05)]
```
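If only the top prediction is needed, the first tuple in the returned list can be reused directly, for example as the `lang_sym` of a later recognition call. A minimal sketch, assuming the result format shown above:

```python
# Take the most probable language tag from the nbest list (sorted by probability).
top_lang, top_prob = result[0]
print(f"Detected language: {top_lang} (p={top_prob:.3f})")
```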
Speech Recognition or Translation
We use `Speech2Text` for speech recognition or translation. We also pass the model tag so that the model can be automatically downloaded. When initializing this object, we set the default values for `lang_sym`, `task_sym`, and `predict_time`. These variables can be overwritten later, which provides more flexibility. Note that the language must be known to use this functionality. If it is unknown, one can first perform language identification and then recognition or translation, as shown in the sketch after the following example.
```python
from espnet2.bin.s2t_inference import Speech2Text

s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
    predict_time=False,
)

import soundfile as sf

speech, rate = sf.read("audio.wav")

result = s2t(speech)[0][-2]

# an optional text prompt can be passed
result = s2t(
    speech,
    text_prev="this is an optional prompt",
)[0][-2]

# lang_sym, task_sym, predict_time can be overwritten
result = s2t(
    speech,
    lang_sym="<eng>",
    task_sym="<st_zho>",  # translation into Chinese
    predict_time=True,
)[0][-2]
```
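When the spoken language is unknown, the two objects can be chained: run language identification first and pass the predicted tag as `lang_sym`. A minimal sketch, assuming the `s2l` and `s2t` objects initialized in the examples above and the same `audio.wav` input:

```python
import soundfile as sf

speech, rate = sf.read("audio.wav")

# Step 1: identify the most probable language with the Speech2Language model.
detected_lang = s2l(speech)[0][0]  # e.g., '<eng>'

# Step 2: recognize speech, overriding the default lang_sym with the detected tag.
text = s2t(speech, lang_sym=detected_lang, task_sym="<asr>")[0][-2]
print(detected_lang, text)
```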
Long-form Speech Recognition or Translation
OWSM processes an entire audio recording in a chunk-by-chunk manner. Each chunk has a fixed length of 30s. The chunk is shifted based on the predicted timestamps. We still use `Speech2Text`, but we call its `decode_long` method.
```python
from espnet2.bin.s2t_inference import Speech2Text

s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
)

import soundfile as sf

speech, rate = sf.read("covid.wav")

result = s2t.decode_long(speech)
# list of tuples (start_time, end_time, text)
```
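The returned segments can then be printed or written out as a simple timestamped transcript. A minimal sketch, assuming the `result` list of `(start_time, end_time, text)` tuples shown above:

```python
# Dump the long-form result as one timestamped line per segment.
for start_time, end_time, text in result:
    print(f"[{start_time} - {end_time}] {text}")
```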
Fine-tuning on custom data
Our latest work (accepted to SLT 2024), “ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration”, will provide an easier way to fine-tune pre-trained models. We are preparing demos and notebooks. Please stay tuned!
Papers
Please cite our papers if you find OWSM helpful.
- ACL 2024: OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
- INTERSPEECH 2024: On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
- INTERSPEECH 2024: OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
- ASRU 2023: Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
We also collect other papers related to OWSM. Please contact Yifan Peng (yifanpen@andrew.cmu.edu) if you use OWSM in your work and want to include it here.
OWSM applications
- ASRU 2023 SPARKS Workshop: SLUE-PERB: A Spoken Language Understanding Performance Benchmark and Toolkit
Foundational work used by OWSM
- INTERSPEECH 2023: A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
- SLT 2022: E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
- ICML 2022: Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding