Speech Recognition and Understanding (11-751/18-781)
Course Logistics
- Instructor: Shinji Watanabe
- TAs: Xuankai Chang, Yifan Peng, Jiatong Shi
- Time: MW 4:40PM – 6:00PM
- Location: GHC 4307
- Discussion: Piazza
Grading
- Grading policies
  - Class Participation (30%)
  - Assignments (30%)
  - Mid-term exam (20%)
  - Term Project (20%)
- We will use Gradescope
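As a rough illustration of how the weights above combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and that the final score is a plain weighted sum; the page does not state the exact formula, and the function name and example scores are hypothetical.

```python
# Weights come from the grading policy above.
# Assumption: the final score is a plain weighted sum of 0-100 component scores.
WEIGHTS = {
    "participation": 0.30,
    "assignments": 0.30,
    "midterm": 0.20,
    "project": 0.20,
}


def final_score(scores):
    """Return the weighted sum of the component scores (hypothetical helper)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores, for illustration only.
example = {"participation": 100, "assignments": 90, "midterm": 80, "project": 85}
print(round(final_score(example), 2))
```

Under this assumption, participation and assignments together account for 60% of the final score.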
Syllabus
- This is a tentative schedule.
- The slides will be uploaded right before the lecture.
- The videos will be uploaded irregularly after each lecture because they need to be edited first.
Date | Lecture | Topics | Slides/Videos |
---|---|---|---|
8/29 | Course overview | Course explanation and introduction | https://youtu.be/DsYDmg72K1k |
8/31 | Introduction of speech recognition | - Evaluation metric - How to transcribe speech - Databases | https://youtu.be/HCNqxmOwEH4 |
9/7 | Overview of speech recognition systems | - HMM-based systems vs. End-to-End systems - Output Units - Pronunciation lexicon | No video due to technical issues. |
9/12 | Speech recognition formulations | - Probabilistic rules - From Bayes decision theory to HMM + n-gram, CTC, RNN-T, and attention | https://youtu.be/9QPiMoJJAXg |
9/14 | Feature extraction | - Basic pipeline - Some advances in feature extraction | https://youtu.be/isABhD2ym80 |
9/19 | ESPnet hands-on tutorial I | - Introduction of toolkit - How to make a new recipe | https://youtu.be/YDN8cVjxSik |
9/21 | ESPnet hands-on tutorial II | - How to make a new task | https://youtu.be/Css3XAes7SU |
9/26 | Alignment | - 3-state left-to-right HMM - CTC - Transducer | https://youtu.be/ZFvtCaXs3aA |
9/28 | Hidden Markov model (Part I) | - Emission probability - Single Gaussian model - Gaussian mixture model - Expectation Maximization Algorithm | https://youtu.be/hJi5quunTLY |
10/3 | Hidden Markov model (Part II) | Hidden Markov model with Expectation Maximization Algorithm | https://youtu.be/6k7q9ggIfYI |
10/5 | Hidden Markov model (Part III) | - Baum-Welch algorithm - Viterbi algorithm | https://youtu.be/YmRnIphseyw |
10/10 | Advanced acoustic modeling | - Phonetic decision tree - Adaptation | https://youtu.be/GTaqSmQSHBs |
10/12 | Language model | N-gram language model | https://youtu.be/VqySbRgHlPc |
10/24 | Deep learning for speech recognition | - Introduction - Deep neural networks for acoustic modeling | https://youtu.be/IWvCFd91JPg |
10/26 | Mid-term exam | | |
10/31 | Advanced neural network architectures for acoustic modeling | - Convolutional neural networks - Recurrent neural networks - Self-attention | https://youtu.be/3YTuHQfaLgA |
11/2 | Neural network language model | | https://youtu.be/uRk79NJD1cA |
11/7 | End-to-end ASR: Attention | | https://youtu.be/6955aj5hlwk |
11/9 | End-to-end ASR: CTC | | https://youtu.be/X2Jjx1icXsE |
11/14 | End-to-end ASR: RNN-T | | https://youtu.be/lVc46-aBnzM |
11/16 | Advanced topics on end-to-end ASR | - Data augmentation - Joint CTC/attention - Transformer - Streaming | https://youtu.be/S2rSm11lX80 |
11/21 | Search | - Time-synchronous beam search - Label-synchronous beam search - N-best and lattice - Rescoring | https://youtu.be/GcfIdxj1s8M |
11/28 | Guest lecture, Zhong-Qiu Wang | - Robust training method - Speech enhancement and separation - Multichannel processing | No Video |
11/30 | Guest lecture, Thomas Shaf at 3M \| MModal | | No Video |
12/5 | Project event | | |
12/7 | Project event | | |
Assignments
Will be announced during the course