MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)

The guest

Lex Fridman, AI researcher and MIT lecturer who teaches the 6.S091 deep reinforcement learning course; the lecture is solo, so the host is also the speaker.

The gist

This is a solo MIT 6.S091 lecture in which Lex Fridman overviews the field of deep reinforcement learning, marrying neural network representations with the ability to act on them. He frames RL as learning by experience through trial and error, contrasts it with supervised learning, and emphasizes that the hardest, most consequential design choices are the environment and the reward structure. He walks through core algorithm families (model-based, value-based, policy-based) and landmark systems like DQN, A2C/A3C, DDPG, PPO/TRPO, and AlphaGo/AlphaZero. He closes with the sim-to-real gap as the field's central unsolved challenge and practical advice for getting into RL research.

Big reveals

Reward misspecification example: in the boat-racing game CoastRunners, an RL agent learns to circle and collect green turbo bonuses instead of finishing the race, showing how an objective function can produce unintended behavior.
00:23:10
For a game of Breakout played from raw pixels, the Q-table would need more entries than there are atoms in the universe, which makes traditional value iteration intractable and motivates neural network approximation.
00:39:32
AlphaZero outperforms Stockfish in chess, Elmo in shogi, and prior AlphaGo versions in Go, while exploring far fewer branches, much as a human grandmaster does.
01:00:34
The sobering truth: most real-world acting agents, including the systems of essentially all autonomous vehicle companies and Boston Dynamics' robots, use no RL for actual action selection.
01:00:34
Pieter Abbeel's idea: rather than pursuing ever higher-fidelity simulation, we could build an arbitrarily large number of simulations so that reality becomes just another sample.
01:03:45
AI safety is framed as critical: DeepMind and OpenAI run dedicated safety groups because reward errors in real-world systems involving human life can be catastrophic.
00:25:46
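
The Breakout Q-table claim above can be checked with back-of-the-envelope arithmetic. The sketch below assumes the commonly used DQN-style input of four stacked 84x84 grayscale frames with 256 intensity levels per pixel, and a rough figure of 10^80 atoms in the observable universe; neither number is stated in these notes, so treat both as assumptions.

```python
import math

# Assumed DQN-style preprocessing (not stated in these notes):
# 4 stacked grayscale frames of 84x84 pixels, 256 intensity levels each.
FRAMES, HEIGHT, WIDTH, LEVELS = 4, 84, 84, 256

# Each state is one particular assignment of an intensity to every pixel.
pixels_per_state = FRAMES * HEIGHT * WIDTH

# The number of distinct states is LEVELS ** pixels_per_state.
# Work in log10 so orders of magnitude can be compared directly.
log10_states = pixels_per_state * math.log10(LEVELS)

# Common rough estimate for the observable universe (assumption).
LOG10_ATOMS = 80

print(f"pixel values per state: {pixels_per_state}")
print(f"distinct states: about 10^{log10_states:.0f}")
print(f"atoms in the universe: about 10^{LOG10_ATOMS}")
```

Under these assumptions the table would need on the order of 10^67,970 rows, before even multiplying by the number of actions, which is why a lookup table gives way to a function approximator.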

Things worth remembering

Fridman argues every type of machine learning is supervised by a loss function; the difference is only the source of supervision, and it is never 'turtles all the way down'.
00:02:38
Human walking ability may draw on 230 million years of bipedal data and 500 million years of vision, framing biological learning as deeply pre-trained.
00:06:47
Value-based methods use epsilon-greedy exploration: with probability epsilon the agent takes a random action, otherwise the greedy one, and epsilon starts high and decays toward zero as the agent learns.
00:37:27
DQN relies on two key tricks: experience replay (a memory bank of past transitions) and a fixed target network updated only every thousand or so steps.
00:42:42
Prioritized experience replay weights memories by temporal-difference error, so the agent most often replays the transitions it has the most to learn from.
00:46:53
Advantage actor-critic (A2C) combines a policy-based actor that samples actions with a value-based critic that evaluates them.
00:50:38
DDPG adapts DQN ideas to continuous action spaces using a deterministic policy, injecting exploration by adding noise to actions or network parameters.
00:53:13
PPO and TRPO use trust-region ideas, fixing the step size first and then choosing the direction, to avoid catastrophically bad policy updates.
00:54:15
AlphaGo Zero used no pre-training on expert games, learning purely through self-play, unlike the original AlphaGo.
00:57:23
Implementing the core RL algorithms from scratch should take only about 200 to 300 lines of code each.
01:04:47
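
Two of the ingredients listed above, epsilon-greedy exploration and the experience replay memory used by DQN, are small enough to sketch in a few lines. This is a minimal illustration and not the lecture's code: the names `epsilon_greedy`, `decayed_epsilon`, and `ReplayBuffer`, the linear decay schedule, and its default parameters are all hypothetical choices.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, start=1.0, end=0.0, decay_steps=10_000):
    """Linear decay from start to end over decay_steps (an assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Toy usage with made-up Q-values and integer stand-ins for states.
buffer = ReplayBuffer(capacity=1000)
for step in range(50):
    eps = decayed_epsilon(step, decay_steps=50)
    action = epsilon_greedy([0.1, 0.5, 0.2], eps)
    buffer.push((step, action, 0.0, step + 1, False))
batch = buffer.sample(8)
```

In a full DQN the sampled batch would feed a gradient step on the online network, with targets computed from the separate, infrequently updated target network; practical schedules also often keep a small nonzero floor on epsilon rather than decaying all the way to zero.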

Recommended in this episode


The works of Friedrich Nietzsche

Friedrich Nietzsche

“There's been a few books, a couple, written throughout the last few centuries, from Socrates to Nietzsche. I recommend the latter especially.” — Lex Fridman 00:04:11


Topics

deep reinforcement learning, neural networks, DQN and Q-learning, policy gradients, AlphaGo and AlphaZero, reward design, AI safety, sim-to-real transfer