MIT 6.S094: Deep Learning for Human Sensing

The guest

Lex Fridman — MIT researcher and lecturer on deep learning, autonomous vehicles, and human-centered AI; instructor of MIT 6.S094.

The gist

This is a solo MIT 6.S094 lecture by Lex Fridman on applying deep learning to human sensing in the driving context. He argues that data collection and annotation, not algorithms, are the hardest and most important parts of building real-world systems. He walks through human-sensing tasks including pedestrian detection, body pose estimation, glance classification, emotion recognition, and cognitive load estimation, all using convolutional neural networks trained on MIT's large naturalistic driving dataset. The core thesis is that full autonomy is decades away, so AI must perceive and collaborate with the imperfect human driver in a human-centered way.

Big reveals

MIT's naturalistic driving dataset uses 25 vehicles, 21 equipped with Tesla Autopilot, with cameras on the driver's face, body, and the forward scene.
00:16:31
MIT has collected over five billion video frames of driving data, including 1.5 billion of the face.
00:18:39
Tesla drivers are using Autopilot for roughly 33% of their miles, which Fridman reads as a sign that drivers find real value in it and enjoy using it.
00:19:45
Glance classification reframes the classic geometric gaze-estimation problem as a region-based machine learning classification problem (on-road vs off-road).
00:35:34
In a frustrating voice-navigation task, a smile was the strongest facial-expression indicator of driver frustration, contradicting the usual assumption in general emotion recognition that a smile signals a positive emotion.
00:51:32
Under higher cognitive load, the driver's eyes move less and stay more focused on the forward roadway.
00:58:50
Fridman argues full autonomy with steering wheels removed is more than two decades away.
01:01:27
Fridman frames human imperfections, via a Good Will Hunting quote, as central to human-robot trust and relationships.
01:06:49
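
The reframing of gaze estimation as glance classification can be illustrated with a minimal sketch. It assumes some upstream model has already produced one raw score per glance region for a face image. The region names, the scores, and the helper `classify_glance` are hypothetical and are not taken from the lecture's code; only the on-road vs off-road split comes from the summary above. The point is that the output is a discrete region label rather than a geometric gaze vector.

```python
import math

# Hypothetical glance regions (an assumption for illustration only).
REGIONS = ["road", "left", "right", "rearview_mirror", "center_stack", "instrument_cluster"]
# Regions counted as on-road; everything else is treated as off-road.
ON_ROAD = {"road"}


def softmax(scores):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_glance(scores):
    """Pick the most likely region and collapse it to on-road / off-road."""
    if len(scores) != len(REGIONS):
        raise ValueError("expected one score per region")
    probs = softmax(scores)
    best = max(range(len(REGIONS)), key=lambda i: probs[i])
    region = REGIONS[best]
    label = "on-road" if region in ON_ROAD else "off-road"
    return region, label, probs[best]


if __name__ == "__main__":
    # Made-up scores standing in for the output of a trained classifier.
    region, label, confidence = classify_glance([0.2, 0.1, 0.0, 0.3, 2.5, 0.4])
    print(region, label, round(confidence, 2))
```

Treating the task as classification means the model is trained directly on annotated region labels, so it does not need an explicit head-pose or eye-geometry model.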

Things worth remembering

In 2016 over 40,000 people died in US car crashes across 3.2 trillion miles, about one fatality per 80 million miles.
00:09:02
Massachusetts drivers are the least likely to die in a car crash; Montana drivers the most likely.
00:09:38
170 billion text messages are sent in the United States every month (2014 figure).
00:10:43
The average time drivers' eyes are off the road while texting is five seconds, enough to travel the length of a football field at 55 mph.
00:11:13
31% of traffic fatalities involve a drunk driver, and roughly 3% involve a drowsy driver.
00:11:44
There are 42 individual facial muscles that can form expressions.
00:48:54
3D convolutional neural networks convolve across multiple images and channels to learn temporal dynamics, not just spatial features.
00:55:46
The cognitive-load classifier achieved 86% accuracy from real-world data using a 3D CNN on eye-region video.
00:59:22
MIT's dataset spans 400,000 miles while Tesla has about a billion miles of training data.
01:03:38
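
The idea that a 3D convolution captures temporal dynamics, not just spatial features, can be shown with a small sketch. This is a naive, single-channel 3D cross-correlation written in plain Python, not the network from the lecture. The function `conv3d_valid`, the helper `make_clip`, and the hand-picked `TEMPORAL_DIFF` kernel are all assumptions for illustration; a real 3D CNN would learn many kernels and also sum over input channels.

```python
def conv3d_valid(clip, kernel):
    """Naive single-channel 3D cross-correlation with no padding.

    clip and kernel are nested lists indexed as [time][row][col].
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel slides over time as well as over rows and columns.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hand-picked kernel spanning 2 frames and 1x1 pixels: next frame minus current frame.
TEMPORAL_DIFF = [[[-1.0]], [[1.0]]]


def make_clip(columns, width=4):
    """Build one-row frames with a single bright pixel at the given column per frame."""
    return [[[1.0 if x == col else 0.0 for x in range(width)]] for col in columns]


if __name__ == "__main__":
    static_clip = make_clip([1, 1, 1])  # pixel stays in place
    moving_clip = make_clip([0, 1, 2])  # pixel moves one column per frame
    print(conv3d_valid(static_clip, TEMPORAL_DIFF))
    print(conv3d_valid(moving_clip, TEMPORAL_DIFF))
```

Because the kernel extends along the time axis, its response is zero everywhere for the static clip and non-zero where the pixel moves, which a purely per-frame 2D convolution could not distinguish.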

Recommended in this episode

Good Will Hunting

Gus Van Sant (inferred)

“so one of my favorite movies Good Will Hunting we're in Boston Cambridge” — Lex Fridman 01:05:12

Topics

deep learning, autonomous vehicles, computer vision, driver monitoring, human-robot interaction, data annotation, convolutional neural networks, driving safety