Speech Lunch
Welcome to the Speech Lunch (formerly Sphinx Lunch) at Carnegie Mellon University! This lunch meeting is a regular venue for discussing any speech-related research. The meeting consists of presentations by CMU faculty, CMU students, and guest speakers. We welcome any research topics, including ordinary presentations, conference presentation rehearsals, preliminary research ideas, research discussions, and so on. Both CMU researchers and external researchers are welcome to join the meeting.
During the semester, we regularly hold meetings in the following slot:
- Date: Thursday 12:30 pm - 1:30 pm
- Room: GHC 6501
The time and room may change, especially if we have a guest speaker. We announce talk information through our mailing list (approval by the admin is required), so please subscribe if you are interested in speech research at CMU!
Please contact Yifan Peng (yifanpen@andrew.cmu.edu) and Shinji Watanabe (shinjiw@ieee.org) if you would like to participate in our Speech Lunch.
Future Talks (tentative schedule)
Previous Talks
- November 14, 2024
- Title: Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
- Speaker: Li-Wei Chen (CMU)
- Abstract: Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech for various downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework can influence performance on downstream tasks. For example, targets that encode prosody are beneficial for speaker-related tasks, while targets that encode phonetics are more suited for content-related tasks. Additionally, prediction targets can vary in the level of detail they encode; targets that encode fine-grained acoustic details are beneficial for denoising tasks, while targets that encode higher-level abstractions are more suited for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work explores the design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose novel approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.
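To make the masked-prediction framework mentioned in this abstract concrete, here is a minimal, illustrative PyTorch sketch of a HuBERT-style objective. It is not the speaker's code: the encoder interface, the span-masking scheme, and the use of discrete cluster ids as targets are assumptions for illustration, and the choice of those targets is precisely the design question the talk examines.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(encoder, frames, targets, mask_prob=0.08, span=10):
    """Illustrative HuBERT-style masked prediction (assumed interfaces).

    frames:  (B, T, D) acoustic features
    targets: (B, T) discrete target ids (e.g. k-means cluster labels); how these
             targets are constructed is the design choice studied in the talk
    encoder: assumed module mapping (B, T, D) -> (B, T, num_targets) logits
    """
    B, T, _ = frames.shape
    # Sample masked spans: each frame starts a span of length `span` with prob mask_prob.
    mask = torch.zeros(B, T, dtype=torch.bool)
    starts = torch.rand(B, T) < mask_prob
    for b in range(B):
        for t in starts[b].nonzero(as_tuple=True)[0].tolist():
            mask[b, t : t + span] = True
    # Corrupt the masked frames (here simply zeroed) and predict from the unmasked context.
    logits = encoder(frames.masked_fill(mask.unsqueeze(-1), 0.0))
    # The loss is computed only on masked positions.
    return F.cross_entropy(logits[mask], targets[mask])
```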
- November 7, 2024
- Title: Towards Building Audio Foundation Models with Real-Time Conversational Capabilities
- Speaker: Siddhant Arora (CMU)
- Abstract: The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. This talk delves into the development of spoken dialogue systems (SDS) capable of real-time interactions with users. I will begin with a review of existing literature on audio FMs that claim to achieve natural, interactive conversation abilities, discussing their design choices, training setups, and current evaluation frameworks for such dialogue systems. I will also provide initial insights into the strengths and limitations of end-to-end (E2E) SDS models versus traditional cascaded pipelines. Next, I will outline our ongoing efforts to develop both cascaded and fully duplex E2E spoken dialogue systems, with a focus on identifying optimal architectural and design strategies. In closing, I will present an action plan for future work, including the creation of an open-source toolkit for SDS model development, an in-house fully duplex spoken dialogue model, and a standalone evaluation platform for broader use in assessing these conversational systems.
- October 31, 2024
- Title: New Tasks in ESPnet: ESPnet-Codec and ESPnet-SpeechLM
- Speaker: Jiatong Shi and Jinchuan Tian (CMU)
- Abstract: Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform, ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 40 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications. Language modeling is a promising scheme for building universal speech and audio intelligence. We introduce ESPnet-SpeechLM, an open-source toolkit for speech and audio language modeling. ESPnet-SpeechLM uniformly formulates speech and audio tasks using task templates, based on which a task-specific speech language model can be trained and evaluated automatically for a newly defined task. In addition, the pre-defined task templates and corresponding datasets can be dynamically combined for massive multi-task training. The toolkit provides comprehensive and flexible support for audio tokenization, model architectures, inference methods, and evaluation protocols. It also provides an easy-to-share interface for task templates, datasets, and models.
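As a purely hypothetical illustration of the task-template idea described above (this is not ESPnet-SpeechLM's actual schema; every field name below is invented), a template can declare the token streams that define a task so that data preparation, training, and evaluation follow from the declaration, and templates can be concatenated for multi-task training.

```python
# Hypothetical sketch only: invented field names, not the real ESPnet-SpeechLM format.
ASR_TEMPLATE = {
    "name": "asr",
    "conditions": [{"modality": "speech", "tokenizer": "codec"}],  # streams the model is conditioned on
    "targets": [{"modality": "text", "tokenizer": "bpe"}],         # streams the model must generate
    "metrics": ["wer"],
}

TTS_TEMPLATE = {
    "name": "tts",
    "conditions": [{"modality": "text", "tokenizer": "bpe"}],
    "targets": [{"modality": "speech", "tokenizer": "codec"}],
    "metrics": ["mcd"],
}

# Because every task is expressed in the same form, templates (and their datasets)
# can simply be combined for massive multi-task training.
MULTITASK = [ASR_TEMPLATE, TTS_TEMPLATE]
```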
- October 10, 2024
- Title: Improving Multilingual Speech Recognition in the Wild
- Speaker: Brian Yan (CMU)
- Abstract: Multilingual Automatic Speech Recognition (ASR) models are typically evaluated in a setting where the ground-truth language identity of the speech utterance is known; however, this is not the case for most practical settings. The first part of this talk examines the impact that imperfect Automatic Spoken Language Identification (SLID) has on downstream ASR quality. I present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. Our results on FLEURS using the MMS and Whisper models show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates which are 3.3% and 2.0% lower on these benchmarks. The second part of this talk delves into the tricky case of code-switched speech, which contains segments from multiple languages. I describe an ongoing effort to create Code-Switched FLEURS: a very challenging code-switched ASR and ST benchmark.
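As a rough illustration of the N-best re-ranking idea mentioned in this abstract (a sketch under assumed interfaces, not the speaker's implementation; the scorer functions and weights are placeholders), the external language-model and text-based language-ID scores can be combined log-linearly with the acoustic model's score before picking a hypothesis:

```python
def rerank_nbest(nbest, lm_logprob, lid_logprob, w_lm=0.3, w_lid=0.5):
    """Pick the best hypothesis from a multilingual ASR N-best list.

    nbest:       list of (text, language, asr_logprob) tuples from the acoustic model
    lm_logprob:  assumed external language-model scorer, text -> log-probability
    lid_logprob: assumed text-based language-ID scorer, (text, language) -> log-probability
    """
    def combined_score(hyp):
        text, language, asr_logprob = hyp
        # Log-linear combination of the acoustic score and the external feature scores.
        return asr_logprob + w_lm * lm_logprob(text) + w_lid * lid_logprob(text, language)

    return max(nbest, key=combined_score)
```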
- October 3, 2024
- Title: Toward Real-Time Simultaneous Translation with Large Language Models
- Speaker: Xi Xu and Siqi Ouyang (CMU)
- Abstract: An ideal real-time simultaneous translation system should deliver high-quality translations at sub-second latency. In this talk, we first discuss how our approach achieved first place in the IWSLT English-German task based on human ratings, using a standard speech LLM and a Hold-N policy. However, while IWSLT allows for up to 2 seconds of algorithmic latency and overlooks computational delays, real-world applications demand far lower latency. To address this, we introduce FASST, a technique designed to minimize computational latency during inference by avoiding redundant recomputation, thereby maintaining translation quality for trainable policies like wait-k. Finally, we present a novel method leveraging LLMs to anticipate upcoming source content, allowing for enhanced translation quality while achieving ultra-low algorithmic latency, moving closer to the goal of real-time simultaneous translation.
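For readers unfamiliar with the read/write policies named in this abstract, here is a schematic wait-k loop (an illustrative sketch with assumed helper functions, not the speakers' system): the model reads k source chunks before emitting its first target token, then alternates one read with one write, and flushes the remaining output once the source ends.

```python
def wait_k_translate(k, source_chunks, next_token, max_len=512):
    """Schematic wait-k simultaneous translation policy.

    source_chunks: iterable of incoming source segments (e.g. speech chunks)
    next_token:    assumed function (source_prefix, target_prefix) -> next target token
    """
    src, hyp = [], []
    for chunk in source_chunks:
        src.append(chunk)                       # READ one source chunk
        if len(src) >= k:                       # after the initial wait of k chunks...
            tok = next_token(src, hyp)          # ...WRITE one target token per read
            if tok != "<eos>":
                hyp.append(tok)
    while len(hyp) < max_len:                   # source finished: flush remaining output
        tok = next_token(src, hyp)
        if tok == "<eos>":
            break
        hyp.append(tok)
    return hyp
```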
- September 26, 2024
- Title: Foundations of Blind Source Separation and Its Advances in Spatial Self-Supervised Learning
- Speaker: Yoshiaki Bando
- Abstract: A key technology in speech and audio analysis is self-supervised learning, which can efficiently train neural models on large-scale unlabeled training data. While existing frameworks such as HuBERT and BEATs have achieved great success for this purpose, they primarily focus on obtaining embeddings for isolated/mixture inputs and would be less suitable for analyzing individual sound events or speech utterances in a mixture recording. In this talk, we introduce our series of studies called spatial self-supervised learning based on blind source separation. This framework trains a neural model to predict embeddings of latent individual sources from a multichannel mixture recording without any manual supervision. We first present the foundations of blind source separation and then describe its neural extension for self-supervised learning, followed by a discussion of future directions for large-scale training using real-world data.
- Bio: Yoshiaki Bando received his Ph.D. degree in informatics from Kyoto University in 2018 and is currently a Senior Researcher at the Artificial Intelligence Research Center of the National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan. He is also a Visiting Researcher at the RIKEN Center for Advanced Intelligence Project (AIP). His research interests include microphone array signal processing, deep Bayesian learning, robot audition, and field robotics.
- September 12, 2024
- Title: Continual learning in speech recognition
- Speaker: Ngoc Quan Pham
- Abstract: Current speech recognition models are typically trained on closed and stationary datasets, and only a few studies have examined expanding already-trained models with new, non-stationary data. In such cases, a neural model can suffer from catastrophic forgetting: the weights of the model are overwritten in subsequent training steps, and it loses its abilities on previously learned tasks or domains. Anticipating how we might train speech recognition models in the future, where models are updated as fast as data is generated, we investigate two different scenarios: expanding a multilingual speech recognition model with more languages, and training a speech recognition model with online continual learning.
- Bio: Quan Pham is currently a postdoc at the Interact lab, Karlsruhe Institute of Technology, Germany, with Professor Alex Waibel (professor at both KIT and CMU). In the last 5 years he has made some tiny contributions to speech recognition research, such as stochastic layers to facilitate training deep models, expanding/finetuning networks with low-rank additional weights (concurrent with LoRA), and learning new languages with continual learning.
- March 21, 2024
- Title: Online Speech Enhancement and Separation: From Discriminative Methods to Generative Methods
- Speaker: Chenda Li (Shanghai Jiao Tong University)
- Abstract: Online speech enhancement/separation has many applications in daily life. Many scenarios require speech processing systems to be low-latency, such as teleconferencing, hearing aids, etc. In this talk, I’ll present my recent research on online speech enhancement/separation. I will first introduce Skipping Memory LSTM (SkiM), a very efficient model for low-latency online processing. Then, I will share some techniques to better balance the performance and ideal latency for online speech separation. Finally, I will show some of our recent research on streamable diffusion-based speech enhancement.
- Bio: I'm a Ph.D. student from Shanghai Jiao Tong University, and I am currently visiting Watanabe's Audio and Voice (WAV) Lab at the LTI at CMU. My research interests include speech separation, speech enhancement, and multi-talker speech recognition.
- February 29, 2024
- Title: Towards robust speech generation
- Speaker: Soumi Maiti (CMU)
- Abstract: In the last decade, the field of speech generation has witnessed remarkable advancements, particularly in the improvement of speech quality and naturalness. Despite these strides, challenges persist, such as noise in speech, the limited availability of high-quality data, and a lack of robustness in speech generation systems. Additionally, evaluating speech remains an obstacle for large-scale assessment of speech generation models. Simultaneously, recent breakthroughs in Large Language Models (LLMs) have transformed text processing and natural language applications. However, spoken language modeling introduces unique challenges due to the intricate nature of speech components, including speaker characteristics and emotional cues. In this presentation, I will delve into recent advances in speech synthesis, spoken language modeling, and speech evaluation for generative systems, shedding light on the ongoing efforts to address these challenges.
- February 8, 2024
- Title: Learning from Unlabeled Speech through Pre-training
- Speaker: Alexander Haojan Liu
- Abstract: In the first part of the talk, I will present DinoSR, a recent self-supervised learning method that can be viewed as a completion of recent speech representation models. Key results on recognition and acoustic unit discovery will be covered. The second part of the talk will cover generative pre-training for speech through flow matching. Similar in spirit to self-supervised learning, I will show that a general-purpose generative model can be pre-trained from unlabeled speech and later applied to different speech tasks such as synthesis, denoising, and separation.
- Bio: Alexander Haojan Liu is a 4th-year Ph.D. student in Computer Science at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He is a member of the Spoken Language Systems (SLS) Group led by Dr. James Glass. His research interests are in machine learning, natural language processing (speech processing in particular), and computer vision. Currently, his work focuses on self-supervised learning of audio and its applications.
- December 7, 2023
- Title: Unifying Speech Processing Applications with Speech Foundation Models
- Speaker: Shinji Watanabe
- Abstract: After the success of large language models in natural language processing, the field of speech processing is currently exploring the possibility of combining speech and language modalities to create a foundation model. This single unified model could perform multiple speech processing applications, such as speech recognition, synthesis, translation, and spoken language processing. Our group is dedicated to achieving this goal through the development of speech foundation models, including speech/text decoder-only models, whisper-style multi-tasking, universal spoken language understanding, and multilingual SUPERB projects. In addition to showcasing the above research outcomes during this talk, we will describe the engineering efforts involved in building such a large foundation model from scratch on an academic computing scale for reproducibility.
- November 9, 2023
- Title: Universal Speech Enhancement: What Can We Do With Real Data?
- Speaker: Wangyou Zhang
- Abstract: Speech enhancement (SE) methods based on deep learning have shown impressive performance on many simulation conditions (TIMIT/WSJ/Librispeech/…+Noise), whereas generalization to a wider range of real conditions has not been addressed. In fact, many high-performing SE methods tend to overfit the simulation condition in training, and their inductive bias may be easily violated in real conditions. In the era of large-scale pre-training, it is natural to ask whether we can make use of large-scale real recording data to train a truly universal SE model that can be used for all speech-as-input tasks in real-world conditions. In this talk, I try to answer the following two questions by summarizing existing work in these directions: 1) what can we do to utilize real data for SE training? 2) what models can be used to achieve universal SE? Finally, I will finish the talk by proposing new problems in related topics.
- November 2, 2023
- Title: Music generation with precise control
- Speakers: Chris Donahue and Shih-Lun Wu
- Abstract: In the first half of the session, Chris will discuss some recent work on generating music with precise control and composable outputs. Music audio generation has seen an explosion of activity - we now have the ability to generate music in broad styles with natural language control. However, despite the impressive breadth of these models, they have not yet had a salient impact on music in the real world. Instead, music AI models with more narrow capabilities have had disproportionate impact (e.g. source separation, voice cloning). In this talk, Chris will argue that current narrow models are more appealing to creators because they offer more creative potential for two reasons: (i) they offer precise and familiar forms of control, and (ii) their outputs are composable and integrate with conventional workflows. Chris will discuss two of his recent papers, SingSong (Donahue+ 23) and the Anticipatory Music Transformer (Thickstun+ 23), which seek to bring more creative potential to broadly-capable music generative models. In the second half of the session, Shih-Lun will introduce his recent work, Music ControlNet (Wu+ 23, unpublished), which imbues diffusion-based text-to-music generation models with precise melody, dynamics, and rhythm controls. Music ControlNet builds upon the ControlNet line of research in image generation and adapts that framework to accept time-varying controls in the audio domain. Shih-Lun will demonstrate that Music ControlNet can respond precisely to any composition of the controls it has been trained on, and can also generalize to out-of-distribution control signals that creators may realistically provide.
- October 12, 2023
- Title: Computational Audition through Imprecise labels
- Speaker: Ankit Shah
- Abstract: In this talk, we delve into computational auditory processing to mimic how humans and animals interpret sounds to interact with their surroundings effectively. The journey begins with the machine's challenge to recognize a vast array of sounds, limited by the known sounds in our datasets. This limitation becomes glaring as current models require large labeled datasets for accuracy, which often isn't feasible in real-world settings due to data scarcity. We then spotlight a core issue: the strength of sound labels within available datasets. The quandary is that even with a fraction of known sounds and limited data, inaccuracies in sound labeling lead to suboptimal models. Our focus shifts to devising strategies for sound modeling amidst inaccurate, weak, or incomplete labels, termed working with imprecisely labeled data. Our exploration includes enhancing existing annotations, understanding the effects of label noise and corruption, and innovating a co-training approach for learning sound events from web data without human intervention. We venture into exploiting additional cues like event counts and durations with negligible extra effort, introducing the concept of semi-weak labels. Lastly, the talk describes a unified framework encapsulating all our approaches, yielding a robust model capable of handling various labeling scenarios and paving a solid foundation for future endeavors in understanding and modeling the world of images (transferable to sounds), irrespective of label availability. Through this, we aspire to bridge the gap between the human brain's natural sound-processing ability and machines, opening doors to a more harmonious interaction with the acoustic world around us.
- Bio: Ankit Shah is a Ph.D. student in the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University. Ankit earned his master's in Language Technologies at Carnegie Mellon University in 2019 and his bachelor's in electronics and communication engineering from the National Institute of Technology Karnataka, Surathkal. He worked in industry for over 4 years as a verification engineer and project lead at ARM and as a deep learning research scientist at ReviveMed before joining the Ph.D. program. His areas of interest are audio understanding, machine learning, and deep learning. His thesis focuses on learning in the presence of weak, uncertain, and incomplete labels, where he has made several key contributions, including setting up DCASE challenges on the topic. He won the Gandhian Young Technological Innovator (GYTI) award in India for his contribution to building a never-ending learner of sound systems. His team recently emerged as a winning team in the NYC AI Hackathon challenge on LLMs (Large Language Models) and generative AI. He enjoys reading several books during the year, listening to music, and traveling. Further, he is keenly interested in economics, startups, entrepreneurship, etc. Website: https://ankitshah009.github.io
- October 5, 2023
- Title: Adaptive Non-Causality for Speech Recognition
- Speaker: Grant Strimel (Amazon)
- Abstract: Streaming speech recognition architectures are employed for low-latency, real-time applications. Such architectures are often characterized by their causality – how much forward context is consumed before making a prediction on an individual frame. In this talk we will review prior approaches to balancing the competing objectives of low latency and the accuracy benefit derived from "look ahead" information. We will then discuss an approach we proposed called the Adaptive Non-Causal Attention Transducer (ANCAT). The architecture is non-causal in the traditional sense, but executes in a low-latency, streaming manner by dynamically choosing when to rely on future context and to what degree within the audio stream. The resulting mechanism, when coupled with novel regularization algorithms (which we will dive into), delivers comparable accuracy to non-causal configurations while improving significantly upon latency, closing the gap with their fully-causal model counterparts.
- Bio: Grant Strimel is a Principal Scientist at Amazon AGI and part of the Alexa Speech Recognition and Deep Learning groups. He joined Alexa Pittsburgh in 2018 where the organization has now grown to over fifty scientists and engineers working on natural language processing experiences through both edge-first and cloud-centric solutions. His primary focus for Amazon has been on low-latency, real-time ML design for speech applications.
- September 28, 2023
- Title: Towards robust speech generation
- Speaker: Soumi Maiti
- August 31, 2023
- Title: Solving problems of a single-modal task with multi-modality
- Speaker: Minsu Kim (KAIST)
- Abstract: Speech processing technologies include diverse tasks with diverse modalities, such as Audio-based Speech Recognition (ASR), Visual Speech Recognition (VSR), Text-to-Speech, Lip-to-Speech, speech-driven talking face generation, etc. People usually utilize only the modalities corresponding to the task at hand when developing a technology. For example, if we develop a VSR model, we usually utilize video and text modalities without considering other modalities (e.g., audio). In this talk, I will show some examples that try to solve a challenge of a single-modal task with multimodal data. They include cases of employing 1) another modality containing rich task knowledge, 2) the correspondence between modalities, and 3) useful characteristics of different modalities. Finally, we can realize that numerous research opportunities emerge upon exploring various multimodal options pertinent to our current task.
- Bio: Minsu Kim is a Ph.D. student in the School of Electrical Engineering at Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. He received the B.S. degree in electrical & electronic engineering from Yonsei University, Seoul, South Korea, in 2019. His research interest is multi-modal language processing, including audio-visual speech processing, image/video analysis & generation, and speech translation.
- June 22, 2023
- Title: How to best leverage unlabeled data for speech recognition?
- Speaker: Dan Berrebbi
- Abstract: For speech recognition (ASR), as for most machine learning fields, labeled data is very expensive to obtain and so is far less abundant than unlabeled data. Leveraging untranscribed audio data is thus critical to building strong ASR models. Self-supervised (SSL) models such as wav2vec 2.0 or HuBERT, pre-trained on large amounts of unlabeled data with pretext tasks, seem to respond perfectly to that need. Indeed, even when fine-tuned on very small amounts of transcribed speech data, they outperform previous supervised work as well as human performance on the LibriSpeech dataset (the reference ASR benchmark) by a large margin. However, those models can perform badly under domain shift. Moreover, we do not really know how to efficiently fine-tune them or use them for inference. We also do not know well how to benchmark, evaluate, or compare them. Finally, their pre-training is very hard to reproduce, so finding alternative ways to leverage unlabeled data is of broad interest. In this talk we will tackle the following questions: 1) How to efficiently use speech SSL models for ASR? 2) How to adapt them in domain-shift scenarios? 3) How to efficiently evaluate and benchmark such models? 4) Can alternative unsupervised training methods such as semi-supervised learning outperform SSL models? This talk gathers our recent publications at Interspeech 2022 & 2023, ICASSP 2023, and ICLR 2023.
- Bio: Dan Berrebbi is an MLT Master's student at the Language Technologies Institute, Carnegie Mellon University. He obtained his Bachelor's and Master's degrees in Mathematics at École Polytechnique in Paris and then joined Carnegie Mellon University in 2021 to complete his Master's. He is working with Professor Shinji Watanabe on speech processing. His main research areas are multilingual and low-resource speech recognition as well as self- and semi-supervised learning for speech recognition.
- Apr 27, 2023
- Title: Audio-Visual Learning for Social Telepresence
- Speaker: Alexander Richard (Meta)
- Abstract: These days, physical distance between people is one of the biggest obstacles to maintaining meaningful social relationships with family, friends, and co-workers. Even with today's technology, remote communication is limited to a two-dimensional audio-visual experience and lacks the availability of a shared, three-dimensional space in which people can interact with each other over the distance. Our mission at Reality Labs Research (RLR) in Pittsburgh is to develop a telepresence system that is indistinguishable from reality, i.e., a system that provides photo- and phono-realistic social interactions in VR. Building such a system requires a leap forward in audio modeling: 3D virtual spaces call for highly realistic 3D spatial audio rendering, and immersive experiences demand stripping a user's input audio of all environmental influences such as noise and reverb. Addressing the first problem, I will talk about building neural renderers for spatial audio, from capture stage design to model development. For the latter problem, I will present an approach to speech enhancement that is based on a codify-and-resynthesize paradigm. In the future, these technologies will help build a realistic virtual environment with lifelike avatars that allow for authentic social interactions, connecting people all over the world, anywhere and at any time.
- Bio: Alexander Richard is a Research Scientist at Reality Labs Research (RLR) in Pittsburgh leading the audio-visual research team. With his team, he concentrates on audio-visual learning to build photo- and phono-realistic immersive experiences in Virtual Reality that enable remote communication indistinguishable from reality. Combining computer vision, machine learning, and audio processing, he develops key technologies for audio-visual lifelike avatars and novel 3D rendering approaches for spatial and binaural audio. Before joining RLR, Alexander was a Speech Scientist at Amazon Alexa in Aachen, Germany. He received his PhD from the University of Bonn for his work on temporal segmentation of human actions in videos.
- Apr 20, 2023
- Title: End-to-End Speech Summarization: Global Acoustic Context
- Speaker: Roshan Sharma
- Abstract: Speech in the real world is verbose, and humans possess the special ability to recognize what is being said, understand it, and consequently summarize speech. Automatic methods for speech summarization are crucial to imbue such capabilities in the artificial intelligence of tomorrow. Current methods for speech summarization involve a cascade of speech recognition and transcript summarization, which suffer from error propagation and larger model sizes. We proposed training speech summarization models end-to-end and demonstrated that such models outperform cascade summarization approaches. Speech summarization becomes more complex as the input length increases to a few minutes or half an hour. In this talk, we address the challenge of efficient and performant training for long audio summarization.
- Bio: Roshan Sharma is a Ph.D. candidate in the Electrical and Computer Engineering Department at Carnegie Mellon University. He earned his B.Tech. with distinction in Electronics and Communication Engineering at Amrita Vishwa Vidyapeetham, India in 2018 and his M.S. in Electrical and Computer Engineering at Carnegie Mellon University in 2020. His research interests lie in speech recognition, spoken language understanding, and multimodal machine learning.
- Apr 6, 2023
- Title: Continual Learning in Speech and Audio Applications
- Speaker: Muqiao Yang
- Abstract: The outstanding performance of deep neural networks typically relies on training on a large, fixed set of data. However, in practical applications, the properties of data streams may vary over time and the model may have limited access to past data, so model performance suffers from the catastrophic forgetting effect. In this talk, we will focus on multiple speech and audio tasks, including Automatic Speech Recognition, Acoustic Scene Classification, and Spoken Language Understanding, and investigate how different continual learning scenarios and methods work under such settings.
- Bio: Muqiao Yang is a 3rd-year PhD student at Carnegie Mellon University, working with Prof. Bhiksha Raj and Prof. Shinji Watanabe. His research interest is mainly in machine learning for speech processing, including speech recognition and speech enhancement. He received his B.E. degree from Hong Kong Polytechnic University and M.S. degree from Carnegie Mellon University.
- Mar 30, 2023
- Title: Reference Free Learning for Speech Enhancement and Speech Assessment
- Speaker: Anurag Kumar (Meta)
- Abstract: Improving the perceptual quality and intelligibility of speech signals is critical for improving communication in the real and virtual worlds. This is needed for people with normal hearing as well as those who have some form of hearing impairment. In this talk, I will present an outline of some of my recent research on both methods to enhance degraded speech and methods for speech assessment. I will go in depth into our recent work on unsupervised and self-supervised approaches for speech enhancement and how speech signals from the wild – for which target signals are not available – might be used for enhancement. These approaches enable easier adaptation to out-of-domain conditions and open up the possibility of on-the-fly adaptation of enhancement systems. I will also present reference-less approaches for speech quality and intelligibility assessment – in particular the NORESQA framework, which introduced a new way of non-intrusive speech assessment by leveraging non-matching references.
- Bio: Anurag Kumar is currently a Research Scientist and Technical Research Lead at Reality Labs Research, Meta. Anurag's primary research interests are in machine learning for audio and speech processing and audio-visual learning. Before joining Meta, Anurag obtained his PhD from the Language Technologies Institute (LTI) in the School of Computer Science, Carnegie Mellon University in 2018. Anurag obtained his undergraduate degree in Electrical Engineering from IIT Kanpur in 2013. Anurag's PhD dissertation, "Acoustic Intelligence in Machines," pioneered weak label learning for sounds, which has since become a key area of research in the field of audio AI. Anurag has been the recipient of several awards and recognitions, including Best Paper Finalist at CVPR 2022 and NCC 2014, finalist in the Qualcomm Innovation Fellowship 2017, winner of the Samsung Innovation Awards 2012, and travel grants from IEEE SPS and EURASIP.
- Mar 23, 2023
- Title: Adversarial robustness of modern speech recognition models: evaluation and implications
- Speaker: Raphael Olivier
- Abstract: Adversarial attacks on machine learning models are small perturbations of inputs which fool models into predicting the wrong outputs. In numerous real-world settings they are known to be the source of potential security liabilities. In this work we study the implications of these adversarial attacks when applied to Automatic Speech Recognition (ASR) models. Our main finding is that the recent progress in ASR performance has led to an increase in adversarial vulnerability, rather than an improvement in robustness. We illustrate two aspects of this phenomenon. First, even models like Whisper with state-of-the-art robustness to random noise and domain change show no significant improvement in adversarial robustness, while their increased adoption makes threat models more realistic. Second, we show that popular ASR training paradigms like Self-Supervised Learning (SSL) have opened the way to new threat models, such as adversarial attacks transferred from a proxy model. We draw conclusions from those results, emphasizing the importance of including adversarial robustness in speech modeling pipelines from a security perspective, but also the value of adversarial evaluation in better understanding these new learning paradigms.
- Bio: Raphaël is a PhD candidate at Carnegie Mellon University working with Professor Bhiksha Raj. His research is at the intersection of speech technologies and AI safety, with a focus on adversarially robust speech recognition.
- Mar 16, 2023
- Title: Everyday Conversation Recognition with End-to-End Neural Networks
- Speaker: Xuankai Chang
- Abstract: Over the past decade, remarkable advancements have been made in automatic speech recognition (ASR) largely due to the rapid development of deep neural networks. However, ASR systems often encounter challenges in complex settings caused by background noise, reverberations, overlapping speech, etc. In this talk, we present our efforts towards recognizing everyday conversations using End-to-End neural networks. Our approach tackles the issue of noise and overlapping speech by leveraging single- and multi-channel signals.
- Bio: Xuankai Chang is a 4th-year PhD student at Carnegie Mellon University's Language Technologies Institute in the School of Computer Science, working with Prof. Shinji Watanabe. His research interests are in the field of speech processing, such as speech recognition / enhancement / separation. He received his B.S. and M.S. degrees from Shanghai Jiao Tong University, China.
- Feb 16, 2023
- Title: Multi-blank Transducers for Speech Recognition
- Speaker: Hainan Xu (NVIDIA)
- Abstract: We propose a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo toolkit.
- Bio: I am currently working in NVIDIA's NeMo Team, supervised by Boris Ginsburg. Before joining NVIDIA, I worked in Google's Speech Team under Bhuvana Ramabhadran from September 2019 to October 2021, after getting my Ph.D. degree in Computer Science from Johns Hopkins University, where I worked in the Center for Language and Speech Processing (CLSP) under former JHU Prof. Daniel Povey and Prof. Sanjeev Khudanpur. I received my B.S. in Software Engineering in 2012 from the School of Software Engineering at Shanghai Jiao Tong University in Shanghai, China. From 2012 to 2013, I worked with Professor Kai Yu in the SJTU Speech Lab.
- Feb 2, 2023
- Title: Towards robust audio-visual learning
- Speaker: Billy Li
- Abstract: Audio-visual event detection has benefited greatly from the advancement of deep learning in the past few years. Various model architectures have been applied to the task in multiple modalities, pushing the performance benchmark and enabling the deployment of such models in many critical tasks such as surveillance and malicious content filtering. However, the research community still lacks: 1) a systematic understanding of how different machine learning models behave given the unique nature of audio signals compared to their image or text counterparts; and 2) a study of the robustness of the models used for audio-visual learning, which remains an under-studied area. To address the first point, we investigate best practices for building audio-only and audio-visual learning systems that perform well. Specifically, we analyze the features, compare different architectures, mainly focusing on convolutional-family and Transformer-family models, and examine the differences in training techniques to provide a comprehensive and thorough understanding. To address the second goal, we study the robustness of each model by gauging its behavior under noise and adversarial perturbation. We first demonstrate the existence of real-world threats in both the visual and audio domains. We then expand our scope of robustness analysis from unimodal audio input to multiple modalities including audio, video, image, and text.
- Bio: Juncheng (Billy) Li is a final-year PhD student at Carnegie Mellon University's Language Technologies Institute in the School of Computer Science, working with Prof. Florian Metze. He worked as a research scientist at the Bosch Center for Artificial Intelligence from 2015 to 2019, where he worked with Prof. Zico Kolter. He has a background in deep learning on acoustic signals and multimodal data, and he is currently working on exploring the adversarial robustness of multimodal machine learning systems. Juncheng also acquired extensive experience in applying AI to industrial problems while at Bosch; specifically, he worked on projects including fault detection, machine simulation, and sensor fusion. He has published at IEEE ICASSP, Interspeech, ICML, and NeurIPS, and won the best student paper award at ICASSP 2022 and the best paper award at ICMR 2018.
- January 19, 2023
- Title: Self-supervised speech restoration for historical audio
- Speaker: Takaaki Saeki
- Abstract: Existing historical audio materials are precious resources that contain various linguistic and cultural information. However, restoring or analyzing such audio data is challenging because paired high-quality and degraded speech data is not available. In this talk, we present our recent work on self-supervised speech restoration for historical audio. Our model is based on an autoencoder, which disentangles distortionless speech features from acoustic distortions and is trained only with degraded speech data by simulating the recording process. We evaluated our method with real historical audio data and demonstrated its effectiveness. I will also discuss ongoing work and future directions, including larger-scale self-supervised learning by collecting various historical audio materials.
- Bio: Takaaki Saeki is a Ph.D. student advised by Prof. Hiroshi Saruwatari at the Graduate School of Information Science and Technology, the University of Tokyo, Japan. He is also a visiting scholar at Watanabe's Audio and Voice (WAV) Lab, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA. He received his B.S. and M.S. degrees from the University of Tokyo, Japan. He has been working on speech and audio processing, including text-to-speech (TTS) synthesis, voice conversion, automatic speech quality assessment, speech restoration, speech representation learning, multilingual speech processing, etc.
- December 8, 2022
- Title: Recent Progresses in All-Neural Contextual Speech Recognition Technologies at Amazon Alexa
- Speaker: Jing Liu (Amazon)
- Abstract: Speech recognition technology is completing a dramatic shift from the conventional disjointly trained neural and non-neural sub-systems to an all-neural end-to-end (E2E) transducer architecture. This unified all-neural E2E architecture improves the state-of-the-art accuracy while also achieving superior compute and memory compression, enabling low-latency streaming ASR on the edge where resources are constrained. One of the major challenges for E2E ASR systems is that they often have difficulty recognizing uncommon words that appear infrequently in the training data (e.g. personalized contact names, device names). In this talk, I will give an overview of the challenges and our recent progress in the all-neural contextual ASR technology that powers Alexa, the virtual voice assistant that serves millions of customers daily.
- Bio: Jing Liu is a Sr. Applied Scientist at Amazon Alexa Speech. He earned his PhD degree in mathematics from Carnegie Mellon University and worked on quantitative research on Wall Street for about 3 years prior to joining Alexa Hybrid Science in Pittsburgh in 2017. As a senior member of the team, he made key contributions to the launch of the first edge ASR for Echo, Auto, and FireTV and to the transition to all-neural ASR technologies. Jing's main focus has been low-latency, far-field ASR systems for both edge and cloud.
- November 17, 2022
- Title: Compositional End-to-End SLU
- Speaker: Siddhant Arora
- Abstract: End-to-end spoken language understanding (SLU) systems are gaining popularity over cascaded approaches due to their simplicity and ability to avoid error propagation. However, these systems model sequence labeling as a sequence prediction task, causing a divergence from its well-established token-level tagging formulation. In work accepted at EMNLP 2022, we build compositional end-to-end SLU systems that explicitly separate the added complexity of recognizing spoken mentions in SLU from the NLU task of sequence labeling. We show that this composition of ASR and NLU formulations in our end-to-end SLU system outperforms both cascaded and direct end-to-end models, offers direct compatibility with pre-trained ASR and NLU systems, allows performance monitoring of individual components, and enables the use of globally normalized losses like CRF, making them attractive in practical scenarios.
- Bio: Siddhant Arora is a Ph.D. student at Carnegie Mellon University's Language Technologies Institute, advised by Prof. Shinji Watanabe. His research interests are in Natural Language Processing (NLP) and Speech Processing, particularly Spoken Language Understanding and Spoken Dialog Systems. His prior research experience includes building compositional models, integrating pretrained models into the SLU framework, as well as interpretability and robust testing of ML systems.
- November 10, 2022
- Title: The Cocktail Party Problem: WER we are and WER we are going?
- Speaker: Samuele Cornell
- Abstract: Multi-talker, distant speech recognition is still an incredibly challenging task due to the noisy/reverberant nature of the speech signal and the spontaneous nature of conversation, which leads to colloquial language and overlapped speech. In this presentation, focusing on the famous CHiME challenges, I will give a brief overview of how this problem has been tackled recently and of the current research trends and promising directions (e.g. the integration of front-end beamforming and back-end speech recognition systems).
- Bio: Samuele Cornell is a doctoral candidate at the UnivPM Department of Information Engineering. His current research interests are in front-end pre-processing techniques for automatic speech recognition applications, such as source separation, speech enhancement, speech segmentation, and beamforming. He also works on sound event detection and was among the organizers of the DCASE Task 4 2021 and 2022 challenges. He is a co-author of the Asteroid source separation, SpeechBrain, and ESPnet++ toolkits and contributes to other open-source toolkits in the speech processing area. In 2021 he interned for three months at Amazon Alexa AI with the wakeword team, and in summer 2022 he was with the Language Technologies Institute at Carnegie Mellon University.
- October 13, 2022
- Title: Robot Concept Learning in Situated Dialogue
- Speaker: Matthew Marge
- Abstract: Intelligent agents that refer to and make use of the physical world, like robots, will be better able to adapt to new situations if they can learn concepts in real time from humans. This process forms an interactive dialogue loop between robots asking questions to learn more about the physical world and humans using natural language to teach them. In this talk, I will present findings from the Human-Robot Dialogue Learning project that explored these topics. Key accomplishments include (1) an improved understanding of how humans teach robots compared to other humans, (2) a first-of-its-kind corpus of questions that robots can use to learn from human teachers, and (3) real-time algorithms that enable robots to generate questions that maximize learning in a cognitive robotic architecture. The end result is the novel capability for intelligent agents to use situated dialogue and one-shot learning to acquire more information about their surroundings with human teammates.
- Bio: Matthew Marge is a Senior Computer Scientist at DEVCOM Army Research Laboratory (ARL). He received the Ph.D. and M.S. degrees in Language and Information Technologies from the School of Computer Science at Carnegie Mellon University, the M.S. degree in Artificial Intelligence from the University of Edinburgh, and the B.S. degrees in Computer Science and Applied Mathematics and Statistics from Stony Brook University. Dr. Marge’s research focuses on how robots and other artificial agents can establish common ground with people through dialogue. His current interests lie at the intersection of computational linguistics, human-machine interaction, and integrative AI systems, specializing in conversational AI. Dr. Marge is a recipient of the 2018 Office of the Secretary of Defense’s Laboratory University Collaboration Initiative award, supporting his research on dialogue with robots. In addition to his position at ARL, he is an Adjunct Professor in the Computer Science and Linguistics Departments at Georgetown University.
- September 29, 2022
- Title: Audio Visual Recognition and Understanding
- Speaker: Karthik Ganesan
- Abstract: Streaming audio-visual speech recognition (SAVSR) introduces an online setting to audio-visual speech recognition (AVSR), removing the full-utterance requirement prior to decoding that traditional speech recognition models are limited to. Streaming audio-visual speech recognition further challenges the model to decide how long it should wait to gather enough information before starting to decode. While transformer-based models such as AV-HuBERT have been successful in AVSR tasks through pretraining and cross-modal interactions, these models struggle to achieve a reasonable Real-Time Factor (RTF), which is necessary for communication agents. We propose ESPnet Multimodal, a multimodal framework integrated into ESPnet, and provide baseline results for the SAVSR task. We also propose a streaming transformer- and multimodal-fusion-based model for SAVSR. Through ESPnet Multimodal, we expect to facilitate research in the field of audio-visual tasks, including SAVSR.
- Bio: Karthik Ganesan is a Masters student advised by Dr. Shinji Watanabe at Watanabe's Audio and Voice (WAV) Lab, Language Technologies Institute (LTI), Carnegie Mellon University (CMU), Pittsburgh, PA. He received his B.E. in computer science from MSRIT, Bangalore, India. He has been conducting research on various aspects of conversational AI systems, including audio-visual streaming ASR, two-pass streaming speech recognition, E2E SLU, and parameter-efficient multilingual ASR.
- September 15, 2022
- Title: End-to-End Unsupervised ASR and Its Application
- Speaker: Jiatong Shi
- Abstract: Unsupervised ASR learns an ASR model without parallel speech/text. Recently, with the help of self-supervised learning, end-to-end unsupervised ASR has become possible and has shown impressive performance. This talk goes through some of our efforts from the pre-training team at JSALT2022, including experiences in end-to-end unsupervised ASR and its extended uses in self-supervised augmentation, acoustic segmentation, and connecting modalities. We will also discuss some ongoing open-source work.
- Bio: Jiatong Shi is a Ph.D. student, advised by Dr. Shinji Watanabe, at Watanabe's Audio and Voice (WAV) Lab, Language Technologies Institute (LTI), Carnegie Mellon University (CMU), Pittsburgh, PA. He received his B.S. in computer science from Renmin University of China (RUC), advised by Dr. Qin Jin, and his M.S. in computer science from Johns Hopkins University (JHU), advised by Dr. Shinji Watanabe. He has been conducting research on various aspects of speech/audio processing, including speech recognition, speech translation, speech synthesis, speaker diarization, and singing voice synthesis. His recent focus is speech-to-speech translation.
- September 1, 2022
- Title: Is everything end-to-end?
- Speaker: Shinji Watanabe (CMU)
- Abstract: This presentation introduces some of our group’s recent attempts at making an end-to-end network that integrates various speech processing modules as a single neural network. I’ll talk about ASR (feature extraction, acoustic modeling, lexicons, language modeling), far-field conversation recognition (ASR + denoising/dereverberation/separation (+ diarization)), and cycle consistency training (ASR + SID + TTS). I will introduce some random thoughts about these attempts and also discuss future integration ideas.
- Bio: Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA from 2012 to 2017. Prior to moving to Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 300 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from IEEE ASRU in 2019. He served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP).
- August 8, 2022
- Title: A Unified Understanding of Voice Conversion and its Medical Application
- Speaker: Wen-Chin Huang (Nagoya University)
- Abstract: Voice conversion (VC) is the task of converting one kind of speech to another without changing the linguistic content, and it is the second most popular research field in speech synthesis. With the rise of deep neural networks, more and more VC methods are being proposed each year, and it can be hard to understand the differences between these methods at first sight. In this talk, I will provide my own, unified understanding of VC and show how most successful VC methods implement the same underlying framework. I will also introduce my recent work on dysarthric VC, as a showcase of an application of VC to medicine.
- Bio: Wen-Chin Huang is currently a Ph.D. candidate at Nagoya University, Nagoya, Japan. He received the B.S. degree from National Taiwan University, Taipei, Taiwan, in 2018 and the M.S. degree from Nagoya University, Nagoya, Japan in 2021. He was the recipient of the Best Student Paper Award in ISCSLP2018, the Best Paper Award in APSIPA ASC 2021, and the research fellowship for young scientists (DC1) from the Japan Society for the Promotion of Science in 2021. He was a co-organizer of the Voice Conversion Challenge 2020 and VoiceMOS Challenge 2022. His research focuses on deep learning applications to speech processing, with a main focus in voice conversion and speech quality assessment.
- June 16, 2022
- Title: Language Technology for Medical Scribing
- Speaker: Thomas Schaaf (3M | M*Modal)
- Abstract: For many reasons, physicians document what they are doing. In the past, they have used handwritten or dictated notes. With the introduction of EHR systems, the complexity of the documentation workflow has increased, leading to frustration and burnout. Medical asynchronous scribes can do the data entry and note-taking for physicians from an audio recording of the conversation between the physician and the patient. Scribes can be supported with Language Technology using a pipeline of speaker diarization, speech recognition, and natural language understanding. This enables them to asynchronously navigate the audio and review extracted dictated sections or abstractive summaries of the conversation.
- Bio: Thomas Schaaf is a Principal Research Scientist at 3M | M*Modal. He received his Dr. Ing. from the Universität Karlsruhe in 2004 and has been working on Automatic Speech Recognition, Speech Translation, and Natural Language Understanding at Sony Europe, Carnegie Mellon University, Toshiba Europe, Amazon, and M*Modal. He has worked on nearly all aspects of speech recognition systems, and his research has contributed, among other things, to the prediction of word confidences, detection and learning of out-of-vocabulary words, and speaker normalization. He joined 3M in 2019 through the acquisition of M*Modal. There, his research focuses on understanding doctor-patient conversations to reduce the burden of the documentation process for doctors and create more time to care. He is also Adjunct Faculty at the Language Technologies Institute of Carnegie Mellon University and a reviewer for numerous conferences and journals.
- May 13, 2022
- Title: Directions of Dialog Research in the Era of Big Pre-training Models
- Speaker: Zhou Yu (Columbia University)
- Abstract: Big pre-training models (such as BERT and GPT-3) have demonstrated excellent performance on various NLP tasks. Instruction tuning and prompting have enabled these models to shine in low-resource settings. The natural question is "Will big models solve dialog tasks?" This talk will first go through big models' impact on several sub-topics within dialog systems (e.g. social chatbots, task-oriented dialog systems, negotiation/persuasion dialog systems, continual learning in dialog systems, multilingual dialog systems, multimodal dialog systems, deployable dialog systems, etc.) and then follow up with the speaker's own interpretation of the remaining challenges and possible future directions.
- Bio: Zhou Yu joined the CS department at Columbia University in Jan 2021 as an Assistant Professor (http://www.cs.columbia.edu/~zhouyu/). Before that, she was an Assistant Professor at UC Davis. She obtained her Ph.D. from Carnegie Mellon University in 2017. Zhou has built various dialog systems that have had a real impact, such as a job interview training system, a depression screening system, and a second language learning system. Her research interests include dialog systems, language understanding and generation, vision and language, human-computer interaction, and social robots. Zhou received an ACL 2019 best paper nomination, was featured in Forbes 2018 30 Under 30 in Science, and won the 2018 Amazon Alexa Prize.