Sequence to Sequence Deep Learning (Quoc Le, Google)

The guest

Quoc Le — Research scientist at Google who co-developed sequence-to-sequence learning and the Smart Reply feature in Gmail/Inbox.

The gist

This is a technical lecture in which Quoc Le builds up sequence-to-sequence (seq2seq) deep learning from first principles. He starts with a simple yes/no email auto-reply classifier built on bag-of-words features and logistic regression, then introduces recurrent networks to preserve word order, and finally presents the encoder-decoder architecture that maps variable-length inputs to variable-length outputs. He covers training with stochastic gradient descent and automatic differentiation, prediction via greedy and beam search decoding, and the attention mechanism, which removes the bottleneck of squeezing the whole input into a fixed-length representation. He discusses real applications including Google's Smart Reply, machine translation, image captioning, and speech recognition, and closes with active research areas such as memory-augmented and operation-augmented neural networks for question answering and reasoning.
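
To make the attention idea above concrete, here is a minimal sketch in plain Python. It is not the formulation from the talk: it assumes simple dot-product scoring (the original University of Montreal work used a small learned network to score each position), and the names `attend`, `encoder_states`, and `decoder_state` are hypothetical. At each decoder step, the model scores every encoder state, turns the scores into weights with a softmax, and reads a weighted average instead of relying on a single fixed-length vector.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    """Return (context vector, attention weights) for one decoder step.

    Assumption: dot-product scoring, chosen here only for brevity.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is the attention-weighted average of the encoder states.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights

# Hypothetical 2-dimensional encoder states for a three-word input.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
context, weights = attend(decoder_state, encoder_states)
```

In this toy input the decoder state points along the second dimension, so the second and third encoder states receive most of the weight and the first is largely ignored.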

Big reveals

The Smart Reply feature in Google Inbox is already running this seq2seq system in production to suggest email replies.
00:32:43
The three suggested replies shown in Smart Reply are the top beams from beam search, with a heuristic added to make them diverse.
00:33:14
Smart Reply is actually two algorithms combined: one decides whether an email should get a reply, then a second generates the reply.
00:36:54
The fixed-length representation between encoder and decoder is a bottleneck, which the attention mechanism (invented at University of Montreal) solves.
00:38:28
For speech, seq2seq with attention does not yet beat CTC or HMM-DNN hybrid systems in published results, unlike in translation where it is state-of-the-art.
00:52:21
Question answering can be framed as seq2seq with attention reading a book or Wikipedia page, which motivates memory-augmented networks so the model doesn't have to reread the text for every question.
01:05:22
Operation-augmented networks (neural programmers) let a network call functions like addition and subtraction to answer numeric reasoning questions.
01:09:38

Things worth remembering

The whole talk is motivated by Quoc Le returning from vacation to 508 emails, many needing only yes/no answers.
00:00:30
The number of word vectors equals the vocabulary size (about 20,000 rows in the matrix U), while the hidden dimension is a free model-selection choice.
00:11:37
When the vocabulary has 20,000 choices, the binary logistic regression is expanded to a softmax over all 20,000 output tokens.
00:17:58
The first seq2seq attempt failed because the model never knew its own previous prediction; feeding outputs back as inputs (auto-regressive) fixed it.
00:18:30
In practice, RNNs are unrolled to about 400 steps because longer sequences become too expensive to compute and update.
00:21:20
Scheduled sampling feeds the model's own sampled outputs during training so it learns to recover from its bad predictions.
00:30:38
Gradient clipping (e.g. capping the gradient magnitude at 10) is used to combat exploding gradients caused by repeated matrix multiplication.
00:47:35
Sequence-level training optimizes whole-sequence metrics like BLEU, but humans still prefer outputs from the standard next-step prediction model.
00:59:24
Translation models reach state-of-the-art results with only about 3-5 million sentence pairs (e.g. English-German), which is a relatively small dataset.
01:01:09
Skip-thoughts (with Ruslan Salakhutdinov) uses seq2seq to predict neighboring sentences, producing embeddings useful for document classification.
01:16:32
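
The gradient clipping item above can be sketched in a few lines. This is one common reading of "cap magnitude at 10", clipping by the overall gradient norm; the talk summary does not say whether norm clipping or per-element clipping was meant, so treat that choice and the helper name `clip_by_norm` as assumptions.

```python
import math

def clip_by_norm(grad, max_norm=10.0):
    """Rescale grad so its Euclidean norm is at most max_norm.

    The direction of the gradient is preserved; only its length shrinks.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

# A gradient of norm 50 is scaled down to norm 10 (roughly [6, 8]).
clipped = clip_by_norm([30.0, 40.0])
# A gradient of norm 5 is already under the cap and is left alone.
unchanged = clip_by_norm([3.0, 4.0])
```

Clipping would be applied to the gradient just before each stochastic gradient descent update, so a single exploding step cannot throw the parameters far off course.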

Topics

deep learning, sequence-to-sequence, recurrent neural networks, attention mechanism, machine translation, natural language processing, speech recognition, question answering