Overview

WAVLab at Carnegie Mellon University presents XEUS, a Cross-lingual Encoder for Universal Speech. XEUS (pronounced “Zeus”) is an open-source speech foundation model trained on nearly 1.1 million hours of unlabeled speech data across 4,057 languages. XEUS sets a new state of the art on the ML-SUPERB multilingual speech recognition benchmark, while also achieving strong results across tasks in the English-only SUPERB evaluations. We open-source XEUS’ checkpoints, along with our training code and speech data covering 4,000+ languages, on this project page.

Project Contacts: William Chen, Shinji Watanabe

Model Info

XEUS is a 19-layer E-Branchformer encoder trained with the HuBERT masked prediction objective. XEUS is trained on 37 public datasets, along with our crawled data detailed below, totaling roughly 1.1 million hours of speech. From these audio files, we generate 180 billion discrete speech tokens as prediction targets using a pre-trained WavLabLM. We augment the training task with the acoustic denoising objective from WavLM, along with a new dereverberation objective that we propose. More details can be found in the paper.
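As a rough illustration of the dereverberation idea, the sketch below (a hypothetical helper, not the actual XEUS training code) convolves clean speech with a room impulse response, so the model sees reverberant input while its prediction targets come from the clean signal:

import numpy as np
from scipy.signal import fftconvolve

def reverberate(clean_wav: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve clean speech with a room impulse response (RIR). Illustrative only."""
    reverbed = fftconvolve(clean_wav, rir)[: len(clean_wav)]
    # rescale so the reverberant input matches the clean signal's energy
    return reverbed * (np.linalg.norm(clean_wav) / (np.linalg.norm(reverbed) + 1e-9))

# During pre-training, the encoder consumes the reverberant audio while the
# discrete targets are computed from the clean audio, so it learns to dereverberate.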

| Name | Data (hours) | Parameters | Model Link | ESPnet Recipe | License |
| ---- | ------------ | ---------- | ---------- | ------------- | ------- |
| XEUS | 1.1 million | 577M | HuggingFace | Coming soon | MIT |

Released Data

| Name | Data (hours) | Languages | Link | License |
| ---- | ------------ | --------- | ---- | ------- |
| MMS ulab v2 | 8,900 | 4,023 | espnet/mms_ulab_v2 | CC BY-NC-SA 4.0 |
| WikiTongues | 70 | ~700 | espnet/wikitongues | CC BY-NC-SA 4.0 |
| JesusDramas | 645 | 430 | espnet/jesus_dramas | CC BY-NC-SA 4.0 |
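The released corpora are hosted as HuggingFace dataset repositories. Assuming they follow the standard HuggingFace datasets layout (check each dataset card for the exact splits and configurations), they can be streamed with the datasets library:

from datasets import load_dataset

# Stream to avoid downloading the full corpus up front; the repo name comes from
# the table above, but the split name here is an assumption.
ds = load_dataset("espnet/wikitongues", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())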

Intermediate Checkpoints and Logs

To encourage research on the training dynamics of large-scale speech models, we will also release the logs and intermediate checkpoints from training XEUS, around 200 checkpoints in total. These will be made available in the near future.

Usage

Like other ESPnet models, the pre-trained XEUS model can be downloaded and used in a Python script. The code for XEUS is still being merged into the main ESPnet repo; in the meantime, it can be used from the following fork:

pip install 'espnet @ git+https://github.com/wanchichen/espnet.git@ssl'
git lfs install
git clone https://huggingface.co/espnet/XEUS

Feature Extraction

import torch
from torch.nn.utils.rnn import pad_sequence
from espnet2.tasks.ssl import SSLTask
import soundfile as sf

device = "cuda" if torch.cuda.is_available() else "cpu"

xeus_model, xeus_train_args = SSLTask.build_model_from_file(
    None,  # config file; defaults to the config.yaml stored next to the checkpoint
    '/path/to/checkpoint/here/checkpoint.pth',
    device,
)

wav, sampling_rate = sf.read('/path/to/audio.wav') # sampling rate should be 16000
wavs = [torch.tensor(wav, dtype=torch.float32)]
wav_lengths = torch.LongTensor([len(w) for w in wavs]).to(device)
wavs = pad_sequence(wavs, batch_first=True).to(device)

# we recommend use_mask=True during fine-tuning
feats = xeus_model.encode(wavs, wav_lengths, use_mask=False, use_final_output=False)[0][-1] # take the output of the last layer -> batch_size x seq_len x hdim
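Note that encode returns the hidden states of every encoder layer, so instead of taking only the last layer you can learn a weighted combination of layers, as is common in SUPERB-style probing. A minimal sketch (the weighting module below is our own illustration, not part of ESPnet):

# All layer outputs: a list of tensors, each batch_size x seq_len x hdim
layer_outputs = xeus_model.encode(wavs, wav_lengths, use_mask=False, use_final_output=False)[0]

class LayerWeightedSum(torch.nn.Module):
    """Learnable softmax-weighted sum over encoder layers (illustrative)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = torch.nn.Parameter(torch.zeros(num_layers))

    def forward(self, layers):
        stacked = torch.stack(layers, dim=0)  # num_layers x batch x seq_len x hdim
        norm_weights = torch.softmax(self.weights, dim=0)
        return (norm_weights[:, None, None, None] * stacked).sum(dim=0)

pooler = LayerWeightedSum(len(layer_outputs)).to(device)
feats = pooler(layer_outputs)  # batch_size x seq_len x hdim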

Flash Attention

XEUS supports Flash Attention for more efficient training.

pip install flash-attn --no-build-isolation
for layer in xeus_model.encoder.encoders:
    layer.use_flash_attn = True

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
  feats = xeus_model.encode(wavs, wav_lengths, use_mask=False, use_final_output=False)[0][-1]
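The flash-attn kernels only run in half precision, hence the bfloat16 autocast above. For feature extraction you will typically also want to disable gradient tracking; a small sketch combining the two:

# Inference-time variant: no gradients, bfloat16 autocast for the flash-attn kernels
with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    feats = xeus_model.encode(wavs, wav_lengths, use_mask=False, use_final_output=False)[0][-1]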

Masking

The masking settings can be tuned to better fit certain downstream tasks during fine-tuning.

xeus_model.masker.mask_prob = 0.65 # default 0.8
xeus_model.masker.mask_length = 20 # default 10
xeus_model.masker.mask_selection = 'static' # default 'uniform'
xeus_model.train()
feats = xeus_model.encode(wavs, wav_lengths, use_mask=True, use_final_output=False)[0][-1]
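As a concrete (hypothetical) fine-tuning setup, you can place a lightweight classifier on top of the masked features and optimize both; the head and labels below are placeholders for illustration, not part of the XEUS release.

num_classes = 10  # placeholder label set size, e.g. for language ID
head = torch.nn.Linear(feats.size(-1), num_classes).to(device)
optimizer = torch.optim.AdamW(
    list(xeus_model.parameters()) + list(head.parameters()), lr=1e-5
)

optimizer.zero_grad()
feats = xeus_model.encode(wavs, wav_lengths, use_mask=True, use_final_output=False)[0][-1]
logits = head(feats.mean(dim=1))  # mean-pool over time -> batch_size x num_classes
labels = torch.zeros(logits.size(0), dtype=torch.long, device=device)  # dummy targets
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()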

Papers

If you use XEUS or our released data in your project, please consider citing the paper.