Foundations of Deep Learning (Hugo Larochelle, Twitter)

The guest

Hugo Larochelle: deep learning researcher (Twitter/Google) known for his widely used online neural network lecture series, here delivering a tutorial on the foundations of deep learning.

The gist

This is a tutorial lecture in which Hugo Larochelle lays out the foundations of deep learning, starting with the notation and forward propagation of multi-layer feedforward neural networks. He walks through activation functions (sigmoid, tanh, ReLU, softmax), loss functions (negative log likelihood / cross entropy), and how networks are trained using stochastic gradient descent and backpropagation via the chain rule. He covers practical tricks of the trade including regularization, weight initialization, hyperparameter search, early stopping, mini-batches, momentum, and adaptive optimizers like AdaGrad, RMSProp, and Adam. He closes by motivating deep architectures and explaining modern techniques such as dropout and batch normalization.
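
As a rough illustration of the forward propagation and loss described above, here is a minimal sketch in plain Python. The layer sizes, parameter values, and function names are hypothetical and not taken from the lecture; it assumes one ReLU hidden layer, a softmax output, and the negative log likelihood of the correct class.

```python
import math

def relu(z):
    return [max(0.0, v) for v in z]

def softmax(z):
    shift = max(z)                       # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, b, x):
    """Pre-activation W x + b, with W stored as one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    """Forward propagation: input -> ReLU hidden layer -> softmax output."""
    hidden = relu(affine(W1, b1, x))
    return softmax(affine(W2, b2, hidden))

def negative_log_likelihood(probs, target):
    """Loss for one example: minus the log probability of the true class."""
    return -math.log(probs[target])

# Hypothetical toy parameters: 2 inputs, 3 hidden units, 2 classes.
W1 = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.6, -0.2, 0.3], [-0.3, 0.5, 0.8]]
b2 = [0.0, 0.0]

probs = forward([1.0, 2.0], W1, b1, W2, b2)
loss = negative_log_likelihood(probs, target=1)
```

Training would then adjust the parameters to lower this loss, averaged over examples, using gradients obtained by backpropagation.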

Big reveals

A single hidden layer neural network with a linear output and enough hidden units can approximate any continuous function arbitrarily well (universal approximation), but this result doesn't tell you how to find the weights.
00:09:45
Backpropagation's cost is essentially the same as a forward pass, but gradients can vanish when activation function derivatives approach zero in saturated units.
00:24:19
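
To make the vanishing-gradient point concrete, the sketch below evaluates the sigmoid derivative at increasingly saturated inputs and multiplies the local derivatives along a five-layer path. It is an assumption-laden toy: the weights that also appear in the chain rule are ignored, and the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (x = 0) and shrinks as the unit saturates.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")

def chained_factor(pre_activations):
    """Product of local sigmoid derivatives along one path through the layers."""
    product = 1.0
    for a in pre_activations:
        product *= sigmoid_grad(a)
    return product

healthy = chained_factor([0.0] * 5)     # 0.25 ** 5, the best case for sigmoid
saturated = chained_factor([5.0] * 5)   # every unit on the path is saturated
```

Even in the best case the factor shrinks with depth, and with saturated units it becomes many orders of magnitude smaller.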
At test time, dropout with probability 0.5 replaces the random mask with its expected value (each hidden unit is scaled by 0.5); in the single-hidden-layer case this is mathematically equivalent to a geometric average over an exponential number of weight-sharing networks.
00:53:22
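
The equivalence can be checked numerically on a toy case. The sketch below assumes a softmax output applied directly to the masked hidden units, enumerates every dropout mask, and compares the renormalized geometric mean of the predictions with a single pass that scales the hidden units by 0.5. All values and names are hypothetical.

```python
import itertools
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def output(W, b, h):
    """Softmax output layer; W has one row per class."""
    return softmax([sum(w * x for w, x in zip(row, h)) + bias
                    for row, bias in zip(W, b)])

# Hypothetical toy values: 3 hidden units, 2 classes.
h = [0.7, -1.2, 0.4]
W = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]
b = [0.05, -0.02]

masks = list(itertools.product([0, 1], repeat=len(h)))  # all 2^3 sub-networks

# Geometric mean of each class probability over all masks (computed in log space).
log_mean = [0.0] * len(b)
for mask in masks:
    probs = output(W, b, [m * x for m, x in zip(mask, h)])
    for k, p in enumerate(probs):
        log_mean[k] += math.log(p) / len(masks)
unnormalized = [math.exp(v) for v in log_mean]
total = sum(unnormalized)
geometric = [u / total for u in unnormalized]

# Test-time rule: keep every unit but scale it by the keep probability 0.5.
scaled = output(W, b, [0.5 * x for x in h])

max_gap = max(abs(g - s) for g, s in zip(geometric, scaled))
```

Here `max_gap` is zero up to floating-point error. With more than one hidden layer the scaling rule is only an approximation of the average.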
Batch normalization both improves optimization (reducing underfitting) and acts as a regularizer, making dropout less necessary when used together.
00:54:24
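
For readers who want to see the mechanics, here is a minimal sketch of the training-time batch normalization transform in plain Python: each feature is standardized using the mini-batch mean and variance, then rescaled by learned parameters. The function name, `eps` value, and toy batch are assumptions; at test time, implementations typically substitute statistics accumulated during training, which this sketch omits.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of examples, each a list of features.
    gamma, beta: learned per-feature scale and shift.
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i, v in enumerate(column):
            out[i][j] = gamma[j] * (v - mean) / math.sqrt(var + eps) + beta[j]
    return out

# Hypothetical batch of 3 examples with 2 features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Because the statistics depend on which examples happen to share a mini-batch, each example's output is slightly noisy, which is one common explanation for the regularizing effect.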
Deep learning's recent success came not from new theory but from GPUs enabling far more training iterations, which is why it works now and didn't decades ago despite backprop existing since the 1980s.
00:48:13

Things worth remembering

Weights cannot be initialized to zero or to identical values, because doing so produces identical gradients and the network can never break symmetry between units.
00:30:34
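
The symmetry problem is easy to reproduce. The sketch below assumes a tiny network with two tanh hidden units, a linear output, and squared error (all hypothetical choices), and computes the gradient for each hidden unit's weights by hand.

```python
import math

def hidden_weight_grads(W, v, x, y):
    """Gradients of 0.5 * (output - y)^2 w.r.t. each hidden unit's weights.

    W: one weight row per tanh hidden unit; v: weights of the linear output.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    output = sum(vj * hj for vj, hj in zip(v, h))
    error = output - y
    grads = []
    for vj, hj in zip(v, h):
        delta = error * vj * (1.0 - hj ** 2)   # backprop through tanh
        grads.append([delta * xi for xi in x])
    return grads

x, y = [0.5, -1.0], 1.0

# Identical rows: both hidden units receive the same gradient and stay clones.
same = hidden_weight_grads([[0.3, 0.3], [0.3, 0.3]], [0.3, 0.3], x, y)

# Distinct rows: the gradients differ, so the units can specialize.
different = hidden_weight_grads([[0.3, -0.2], [-0.1, 0.4]], [0.3, 0.3], x, y)
```

With identical rows, every gradient step applies the same update to both units, so they remain identical however long training runs.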
A common weight initialization recipe samples the weights from a uniform distribution whose range is justified theoretically in a paper by Xavier Glorot and Yoshua Bengio; the derivation assumes tanh activations.
00:31:36
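
A minimal sketch of such a recipe follows. The summary does not state the range, so the bound used here, sqrt(6 / (fan_in + fan_out)), is the one commonly attributed to that paper and should be read as an assumption; the function name and layer sizes are illustrative.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_out x fan_in weight matrix from U(-bound, bound)."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))   # assumed bound, see lead-in
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)            # seeded only to make the sketch repeatable
W = glorot_uniform(fan_in=784, fan_out=256, rng=rng)
bound = math.sqrt(6.0 / (784 + 256))
```

Biases are usually just set to zero, since the random weights already break the symmetry between units.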
Random search over hyperparameters often beats grid search: it does not leave the gaps between tested values that a grid does, and it avoids wasting a whole batch of experiments on a single bad value, such as a learning rate that is too high.
00:34:11
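
Random search amounts to drawing each hyperparameter independently for every run, as in this sketch. The hyperparameter names and ranges are illustrative assumptions, not recommendations from the lecture.

```python
import random

def sample_config(rng):
    """Draw one hyperparameter setting; ranges are illustrative assumptions."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
        "hidden_units": rng.randint(50, 500),
        "l2_penalty": 10 ** rng.uniform(-6, -2),
    }

rng = random.Random(0)            # seeded only to make the sketch repeatable
configs = [sample_config(rng) for _ in range(9)]

# A 3 x 3 grid over two hyperparameters also costs 9 runs but tries only
# 3 distinct learning rates; the random draws try a different one each run.
distinct_learning_rates = len({c["learning_rate"] for c in configs})
```

If one learning rate turns out to be too high, a grid loses every run in that row, whereas random search loses only the single run that drew it.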
Mini-batches (e.g. 64 or 128 examples) speed training because matrix-matrix multiplications are faster than many separate matrix-vector multiplications.
00:39:22
You can verify hand-coded gradients using a finite-difference estimate by perturbing each parameter by a tiny epsilon (e.g. 10^-6).
00:43:00
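
A gradient check of this kind might look like the sketch below. It uses a central difference, which is one common variant (the summary does not say which form the lecture uses), and the loss and analytic gradient are hypothetical stand-ins for a network's loss and its hand-coded backpropagation.

```python
def numerical_gradient(f, params, eps=1e-6):
    """Central finite-difference estimate of df/dparams, one parameter at a time."""
    grads = []
    for i in range(len(params)):
        original = params[i]
        params[i] = original + eps
        f_plus = f(params)
        params[i] = original - eps
        f_minus = f(params)
        params[i] = original              # restore before moving on
        grads.append((f_plus - f_minus) / (2.0 * eps))
    return grads

# Hypothetical stand-in for a loss and its hand-derived gradient.
def loss(p):
    return p[0] ** 2 + 3.0 * p[0] * p[1] + p[1] ** 3

def analytic_gradient(p):
    return [2.0 * p[0] + 3.0 * p[1], 3.0 * p[0] + 3.0 * p[1] ** 2]

params = [0.5, -1.5]
estimate = numerical_gradient(loss, params)
exact = analytic_gradient(params)
max_error = max(abs(e - a) for e, a in zip(estimate, exact))
```

A `max_error` far above the expected floating-point noise points to a bug in the hand-coded gradient. The check needs two loss evaluations per parameter, so it is a debugging tool rather than a way to train.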
A good debugging step is to confirm your network can perfectly overfit a tiny subset of about 50 examples before running on the full dataset.
00:43:31
The motivation for deep layers is partly biological: the visual cortex processes signals through stages (V1 edges, V4 more complex patterns, IT object-specific neurons).
00:46:07
Boolean circuit theory shows that certain functions need an exponential number of units when the number of layers is restricted, but can be represented compactly with more layers.
00:47:09

Topics

deep learning, neural networks, backpropagation, stochastic gradient descent, dropout, batch normalization, hyperparameter tuning, machine learning