Lex Fridman · 2016-09-27 · 57m

Torch Tutorial (Alex Wiltschko, Twitter)

Alex Wiltschko gives a practical tutorial on Torch and Twitter's Autograd, explaining automatic differentiation as the core abstraction behind all deep learning libraries.

The guest

Alex Wiltschko: machine learning engineer at Twitter and developer of the torch-autograd package. He gives this Deep Learning School tutorial on the Torch ecosystem.

The gist

This is a technical lecture on the Torch deep learning ecosystem and Lua, the language it is built on. The first half covers practical fundamentals: tensors as views into memory, GPU computation, and the nn, optim, and autograd packages for building and training neural networks. The second half dives into automatic differentiation, explaining why reverse mode (backpropagation) beats forward mode for neural nets, and how Autograd traces program execution just-in-time so it can compute gradients through control flow, loops, and recursion. Wiltschko then situates Torch among other libraries (TensorFlow, Theano, Keras, Caffe, Chainer) by granularity and graph-construction strategy, and closes with ideas the field could import from older AD communities such as weather modeling.
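
To make the tracing idea concrete, here is a minimal sketch in plain Python of reverse-mode differentiation that records operations as they execute. It is not torch-autograd's API or implementation: the names `Var`, `backward`, and `power_loop` are hypothetical, and only scalar addition and multiplication are supported. The point it illustrates is that the loop and the branch are ordinary host-language code, and only the operations that actually ran end up in the recorded graph.

```python
class Var:
    """A scalar that remembers how it was computed (hypothetical, for illustration)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse sweep over the recorded operations, in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # chain rule, accumulated


def power_loop(x, n):
    """Ordinary control flow; only the operations that execute are recorded."""
    result = Var(1.0)
    for _ in range(n):
        if x.value != 0.0:  # data-dependent branch
            result = result * x
    return result


x = Var(3.0)
y = power_loop(x, 4)  # computes x**4 through a loop
backward(y)           # x.grad now holds dy/dx = 4 * x**3
```

A single reverse sweep yields the derivative with respect to every input that took part in the computation, which is the property that makes reverse mode attractive for networks with many parameters.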

Big reveals

  • Twitter runs Torch in production at scale: 'every piece of media that comes in to Twitter goes through a torch model at this point.'
  • Twitter built the torch-autograd package to glue together model pieces as small as addition, multiplication, and subtraction while still getting correct gradients.
  • Machine learning's key abstraction is reverse-mode automatic differentiation, known in the field as backpropagation; it is one special case of the broader family of autodiff techniques.
  • Forward-mode autodiff needs one full evaluation per parameter to produce a complete gradient, so for a million-parameter network it is hopelessly expensive, which is why nobody uses it to train neural nets.
  • In production, Autograd's tracing machinery disappears entirely, leaving plain numerical code with no test-time speed penalty.
  • Twitter serves Torch models by running Lua virtual machines inside Java, communicating over JNI (the Java Native Interface).
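
The cost argument against forward mode can be seen in a small sketch using dual numbers, written here in plain Python. The `Dual` class, the toy `loss` function, and `forward_mode_gradient` are hypothetical names chosen for illustration and do not come from the talk or from any Torch package. Each pass carries the derivative with respect to one seeded input only, so a full gradient takes as many passes as there are parameters.

```python
class Dual:
    """A value plus one directional derivative (hypothetical helper)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule applied alongside the ordinary multiplication.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def loss(params):
    """Toy scalar function of many parameters: the sum of squares."""
    total = Dual(0.0)
    for p in params:
        total = total + p * p
    return total


def forward_mode_gradient(f, values):
    """Run f once per parameter, seeding exactly one input with derivative 1."""
    gradient, passes = [], 0
    for i in range(len(values)):
        seeded = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(values)]
        gradient.append(f(seeded).deriv)
        passes += 1
    return gradient, passes


gradient, passes = forward_mode_gradient(loss, [1.0, 2.0, 3.0])
# Three parameters required three full evaluations of loss.
```

With three parameters this is harmless, but the number of passes grows linearly with the parameter count, whereas reverse mode recovers the whole gradient of a scalar loss from one forward and one backward pass.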

Things worth remembering

  • In the iTorch notebook you can prepend any torch function with a question mark to get its help documentation.
  • LuaJIT for-loops run at basically the same speed as C, so you pay little speed penalty for the convenient scripting layer.
  • The entire Lua language is defined by only about 10,000 lines of C code, small enough to learn in an afternoon.
  • Lua is embedded in unexpected places: World of Warcraft quest scripting, Adobe Lightroom's UI, and the scriptable layers of Redis and nginx.
  • Lua was originally chosen for Torch because it was far easier than Python to run on embedded chips for machine learning.
  • A Torch tensor is just a pointer/view into row-major memory defined by size, stride, and offset, so slices share memory rather than copying.
  • Automatic differentiation has been rediscovered many times; nuclear science, computational fluid dynamics, and weather modeling have more sophisticated AD tools than ML.
  • Autograd can pass gradients through non-differentiable operations like floor, useful for building a differentiable JPEG or MPEG encoder.
  • There is a graph type called 'sea of nodes' from Cliff Click's mid-90s thesis that naturally expresses both control and data flow but hasn't been used in a deep learning library.
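
The note above about tensors being views defined by size, stride, and offset can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Torch's tensor implementation: `TensorView` and its methods are hypothetical, the storage is a flat Python list, and only two dimensions are handled. It shows why a slice is cheap and why writing through a slice changes the original.

```python
class TensorView:
    """2-D view over a flat, row-major buffer (hypothetical, not Torch's API)."""

    def __init__(self, storage, size, stride, offset=0):
        self.storage = storage  # shared flat list; never copied
        self.size = size        # (rows, cols)
        self.stride = stride    # steps in storage per index increment
        self.offset = offset    # where this view starts in storage

    def _index(self, i, j):
        return self.offset + i * self.stride[0] + j * self.stride[1]

    def get(self, i, j):
        return self.storage[self._index(i, j)]

    def set(self, i, j, value):
        self.storage[self._index(i, j)] = value

    def column(self, j):
        """Select column j as a (rows, 1) view over the same storage."""
        return TensorView(self.storage, (self.size[0], 1), self.stride,
                          self.offset + j * self.stride[1])


storage = list(range(6))                      # flat buffer 0..5
matrix = TensorView(storage, (2, 3), (3, 1))  # rows are 0,1,2 and 3,4,5
col = matrix.column(1)                        # views storage elements 1 and 4
col.set(1, 0, 99)                             # write through the slice
```

Taking the column only changes the offset and the size; no element is copied, so the write through `col` is visible through `matrix` and in `storage` as well.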