Adam Coates of Baidu explains how end-to-end deep learning replaced traditional pipelines to build accurate speech recognition engines.

Adam Coates is a researcher at Baidu's Silicon Valley AI Lab who works on the Deep Speech speech recognition engine and presents this tutorial.
This is a technical tutorial on building speech recognition systems with deep learning. Coates first walks through the traditional pipeline (feature extraction, acoustic model, language model, and decoder, built on phonemes and a pronunciation lexicon) and explains why it is brittle. He then shows how deep learning first replaced individual components and now powers end-to-end systems. He explains CTC (Connectionist Temporal Classification) in detail as the method for mapping variable-length audio to transcriptions, along with training techniques such as SortaGrad and batch normalization. The final sections cover scaling up through data augmentation and GPU computation, beam-search decoding with n-gram language models, and production concerns such as latency and batching. He notes that Baidu's Deep Speech engine reached human-competitive accuracy in Mandarin.
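
As a rough illustration of the CTC idea mentioned above, the sketch below shows the collapsing rule that turns a frame-level output path into a transcription: merge consecutive repeated symbols, then drop the blank symbol. This is a minimal sketch and not code from the talk; the function name `ctc_collapse` and the use of `_` as the blank symbol are assumptions made for the example.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_collapse(path, blank=BLANK):
    """Collapse a frame-level CTC path into a transcription.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: remove every blank symbol.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


# Paths of different lengths can collapse to the same transcription,
# which is how variable-length audio maps onto a shorter label sequence.
print(ctc_collapse("hh_e_ll_ll_oo"))
print(ctc_collapse("h_e_l_l_o"))
```

Note that the blank between the two `l` runs is what keeps the double letter: without it, the repeats would merge into a single `l`.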