Lex Fridman · 2016-09-27 · 1h 20m

Sequence to Sequence Deep Learning (Quoc Le, Google)

Quoc Le explains sequence-to-sequence deep learning, from email auto-reply to translation, attention, and memory-augmented networks.

The guest

Quoc Le — Research scientist at Google who co-developed sequence-to-sequence learning and the Smart Reply feature in Gmail/Inbox.

The gist

This is a technical lecture in which Quoc Le builds up sequence-to-sequence (seq2seq) deep learning from first principles. He starts with a simple yes/no email auto-reply classifier using bag-of-words and logistic regression, then introduces recurrent networks to preserve word ordering, and finally the encoder-decoder architecture that maps variable-length inputs to variable-length outputs. He covers training with stochastic gradient descent and automatic differentiation, prediction via greedy and beam search decoding, and the attention mechanism that relieves the bottleneck of a fixed-length representation. He discusses real applications including Google's Smart Reply, machine translation, image captioning, and speech recognition, and closes with active research areas such as memory-augmented and operation-augmented neural networks for question answering and reasoning.
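
The starting point of the talk, a yes/no classifier built from bag-of-words features and logistic regression, can be sketched roughly as follows. This is a minimal illustration, not code from the lecture: the toy vocabulary, the hand-picked weights, and the function names `bag_of_words` and `predict_yes_probability` are all assumptions made for the example.

```python
import math

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            counts[vocab[token]] += 1.0
    return counts

def predict_yes_probability(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the word counts."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical toy vocabulary and hand-picked weights (not learned, not from the talk).
vocab = {"are": 0, "you": 1, "free": 2, "not": 3}
weights = [0.1, 0.1, 1.5, -2.0]
bias = 0.0

statement = bag_of_words("you are not free", vocab)
question = bag_of_words("are you not free", vocab)

# Both emails map to the same feature vector, so the classifier cannot tell them apart.
same_features = statement == question
p_yes = predict_yes_probability(statement, weights, bias)
```

The two example emails differ only in word order yet produce identical features, which is the limitation that motivates moving to recurrent networks.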

Big reveals

  • The Smart Reply feature in Google Inbox is already running this seq2seq system in production to suggest email replies.
  • The three suggested replies shown in Smart Reply are the top beams from beam search, with a heuristic added to make them diverse.
  • Smart Reply is actually two algorithms combined: one decides whether an email should get a reply, then a second generates the reply.
  • The fixed-length representation between encoder and decoder is a bottleneck, which the attention mechanism (invented at the University of Montreal) addresses.
  • For speech, seq2seq with attention does not yet beat CTC or HMM-DNN hybrid systems in published results, unlike in translation where it is state-of-the-art.
  • Question-answering can be framed as seq2seq with attention reading a book or Wikipedia page, motivating memory-augmented networks so the model doesn't reread.
  • Operation-augmented networks (neural programmers) let a network call functions like addition and subtraction to answer numeric reasoning questions.
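
The beam search mentioned above, where the top beams become the suggested replies, can be sketched as follows. This is a generic illustration under stated assumptions: `toy_model` is a hypothetical lookup table standing in for a trained decoder, the probabilities are invented, and the diversity heuristic used in Smart Reply is not shown because the notes do not describe it.

```python
import math

def beam_search(next_token_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))  # hypothesis is complete
            else:
                beams.append((tokens, score))     # keep extending it
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

def toy_model(prefix):
    """Hypothetical next-token distribution; a real system would query the decoder."""
    table = {
        (): {"yes": 0.4, "no": 0.35, "maybe": 0.25},
        ("yes",): {"<eos>": 0.55, "please": 0.45},
        ("no",): {"<eos>": 0.9, "thanks": 0.1},
        ("maybe",): {"<eos>": 1.0},
    }
    return table.get(prefix, {"<eos>": 1.0})

greedy = beam_search(toy_model, beam_width=1, max_len=5)     # greedy decoding
top_three = beam_search(toy_model, beam_width=3, max_len=5)  # three suggestions
```

With a beam width of 1 the search reduces to greedy decoding and commits to "yes" because it is the most likely first token. With a width of 3 the sequence starting with "no" ranks first, since its overall probability is higher.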

Things worth remembering

  • The whole talk is motivated by Quoc Le returning from vacation to 508 emails, many needing only yes/no answers.
  • The number of word vectors equals the vocabulary size (about 20,000 rows in matrix U), while the hidden dimension is a free model-selection choice.
  • When the vocabulary has 20,000 choices, the binary logistic regression is expanded to a softmax over all 20,000 output tokens.
  • The first seq2seq attempt failed because the model never knew its own previous prediction; feeding outputs back as inputs (auto-regressive) fixed it.
  • In practice, RNNs are unrolled to about 400 steps because unrolling further becomes too expensive to compute and update.
  • Scheduled sampling feeds the model's own sampled outputs during training so it learns to recover from its bad predictions.
  • Gradient clipping (e.g. cap magnitude at 10) is used to combat exploding gradients from repeated matrix multiplication.
  • Sequence-level training optimizes whole-sequence metrics like BLEU, but humans still prefer outputs from the standard next-step prediction model.
  • Translation models reach state-of-the-art on only about 3-5 million sentence pairs (e.g. English-German), which is relatively small.
  • Skip-thoughts (with Ruslan Salakhutdinov) uses seq2seq to predict neighboring sentences, producing embeddings useful for document classification.
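
The gradient clipping noted above can be sketched as below. The notes only say the magnitude is capped at a value such as 10, so this example assumes clipping by the overall L2 norm; clipping each component separately is another common variant. The function name `clip_by_norm` is illustrative.

```python
import math

def clip_by_norm(gradient, max_norm=10.0):
    """Rescale the gradient so its L2 norm is at most max_norm.

    Assumption: "cap magnitude at 10" is read here as a cap on the L2 norm.
    """
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]  # direction kept, length reduced

exploding = [30.0, 40.0]            # L2 norm is 50
clipped = clip_by_norm(exploding)   # rescaled to norm 10, roughly [6.0, 8.0]
small = clip_by_norm([3.0, 4.0])    # norm 5, left as is
```

Rescaling preserves the direction of the update while bounding its size, which keeps a single very large gradient from derailing training.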