MIT 6.S094: Convolutional Neural Networks for End-to-End Learning of the Driving Task

The guest

Lex Fridman, MIT researcher and lecturer for the 6.S094 course on deep learning for self-driving cars. This is a solo educational lecture with no guest.

The gist

This is an MIT 6.S094 lecture delivered by Lex Fridman covering convolutional neural networks (CNNs) for image-based tasks and their application to autonomous driving. He explains image classification fundamentals (pixels, K-nearest neighbors, CIFAR-10, ImageNet), then breaks down CNN architecture: convolutional layers, filters, weight sharing, stride, padding, and pooling. The second half maps these techniques onto the four-step self-driving pipeline (localization, scene understanding, movement planning, driver state), drawing on data collected from 17 instrumented Teslas. He introduces the DeepTesla browser project, where students train a network on real Tesla forward-roadway video to predict steering commands end-to-end.
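
To make the layer mechanics concrete, here is a minimal pure-Python sketch of a single shared filter sliding over an image with stride and zero padding, followed by max pooling. It is illustrative only and is not code from the lecture: the function names are hypothetical, the kernel is assumed to be square, and the filter is applied as cross-correlation (no kernel flip), which is the usual convention in CNN libraries. The output size along one dimension follows the standard formula floor((W - F + 2P) / S) + 1 for input size W, filter size F, padding P, and stride S.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    return span // stride + 1


def conv2d(image, kernel, stride=1, padding=0):
    """Slide one shared filter over a 2D image given as a list of rows."""
    h, w = len(image), len(image[0])
    k = len(kernel)  # assumption: square k x k kernel

    # Zero-pad the borders so the filter can also cover edge pixels.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    out_h = conv_output_size(h, k, stride, padding)
    out_w = conv_output_size(w, k, stride, padding)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (weight sharing).
            total = 0.0
            for a in range(k):
                for b in range(k):
                    total += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fill a window are dropped."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, w - size + 1, size)
        ]
        for i in range(0, h - size + 1, size)
    ]
```

For example, under these assumptions a 32-pixel-wide input with a 5-wide filter, stride 1, and padding 2 keeps its width of 32, while the same filter with no padding shrinks it to 28.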

Big reveals

State-of-the-art CNNs hit 95.4% accuracy on CIFAR-10, recently surpassing human-level performance of about 94%.
00:18:30
As of December 2016, Tesla Autopilot had driven 300 million miles with only one fatality, versus roughly one per 90 million miles for human-controlled vehicles.
00:40:36
MIT instrumented 17 Tesla Model S and Model X vehicles, collecting over 5,000 hours and 70,000 miles of driving data.
00:41:51
SLAM is one of the few areas where deep learning does not yet outperform classical optimization-based approaches.
00:55:18
Self-reported emotion data showed that a stoic face indicated satisfaction while a smiling driver was actually highly frustrated, contradicting the intuition that a smile signals a satisfied driver.
01:03:46
Human drivers continuously generate ground-truth steering data simply by driving to stay alive, which can train self-driving networks end-to-end.
01:06:31

Things worth remembering

A naive pixel-difference nearest-neighbor classifier achieves 38% accuracy on CIFAR-10, far above the 10% random baseline.
00:12:51
The best K value for K-nearest neighbors on this dataset turned out to be 7.
00:15:44
The average American driver covers about 10,000 miles a year.
00:36:28
169 billion texts were sent in the US every month in 2014, and drivers spend about 5 seconds with eyes off the road while texting.
00:38:51
NHTSA describes the '4 D's' of dangerous driving: drunk, drugged, distracted, and drowsy.
00:39:30
The C920 webcam costs about $70 and performs onboard video compression, which makes it practical to store huge amounts of driving data.
00:43:46
Lidar struggles with rain and snow because changing surface textures break landmark-based localization.
00:48:54
DeepTesla outputs a steering wheel value between -20 and 20 degrees, trained on 10 public highway clips (half autopilot, half human).
01:11:50
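
The nearest-neighbor baseline mentioned above can be sketched in a few lines of Python. This is a hedged illustration and not the lecture's code: it assumes images are flattened into lists of pixel values, takes "pixel difference" to mean the L1 distance (sum of absolute differences), and uses hypothetical function names and toy data in place of CIFAR-10.

```python
from collections import Counter


def l1_distance(a, b):
    """Sum of absolute per-pixel differences between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_images, train_labels, query, k=1):
    """Label the query by majority vote among its k closest training images."""
    # Rank training images from closest to farthest from the query.
    order = sorted(
        range(len(train_images)),
        key=lambda i: l1_distance(train_images[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, Counter returns the label that was seen first among the neighbors.
    return votes.most_common(1)[0][0]


# Toy stand-in for a dataset: five 4-pixel "images" with two classes.
train_images = [
    [0, 0, 0, 0],
    [10, 10, 10, 10],
    [255, 255, 255, 255],
    [250, 250, 250, 250],
    [240, 240, 240, 240],
]
train_labels = ["dark", "dark", "bright", "bright", "bright"]

prediction = knn_predict(train_images, train_labels, [5, 5, 5, 5], k=1)
```

With k=1 the query takes the label of its single closest training image. Larger k values, such as the 7 reported for this dataset, smooth the decision by voting, and the right value is normally chosen on held-out validation data. With 10 balanced classes, guessing at random yields the 10% baseline.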

Recommended in this episode

Logitech C920 Webcam

Logitech (inferred)

“Cameras used to record are your regular webcam, the workhorse of the computer vision community. The C920, and we have some special lenses on top of it.” (Lex Fridman, 00:43:05)

Topics

convolutional neural networks, self-driving cars, deep learning, computer vision, Tesla Autopilot, image classification, end-to-end learning, MIT 6.S094