MIT 6.S094: Deep Learning for Human-Centered Semi-Autonomous Vehicles

The guest

Lex Fridman — MIT researcher and lecturer teaching the 6.S094 deep learning course, studying the human side of semi-autonomous vehicles using camera data from Teslas.

The gist

This MIT 6.S094 lecture focuses on the understudied human side of semi-autonomous driving, turning the camera inward to perceive the driver rather than the road. Fridman describes collecting billions of video frames from cameras in 17 Teslas driving around Cambridge to study driver behavior. He walks through the deep learning pipeline for detecting body pose, gaze, emotion, drowsiness, and cognitive load from raw pixels using convolutional neural networks. He argues every car should have a driver-facing camera to build trust between human and machine. The lecture closes with the mystery of why deeper networks work, an analogy to Conway's Game of Life for emergent complexity, learning resources, and the winners of the deep learning competition.

Big reveals

MIT has cameras in 17 Teslas driving around Cambridge collecting real-world human-machine interaction data.
00:00:34
Fridman's work involves analyzing billions of video frames of drivers in their semi-autonomous Teslas.
00:01:05
Almost no car on the road has sensors perceiving the human inside, even cars that drive themselves at 70 mph.
00:02:43
Fridman advocates a driver-facing camera in every car despite privacy concerns, citing huge safety and trust benefits.
00:03:15
In a frustration study, smiling turned out to be a strong indication of driver frustration, not happiness.
00:10:12
People often express emotion for an audience; solo drivers may not express emotion, complicating naturalistic data.
00:16:39
The talk is a pitch for automakers and buyers to choose cars with driver-facing cameras to enable data collection.
00:25:12

Things worth remembering

Microsaccades are slight tremors of the eye that happen at a rate of about a thousand times a second.
00:02:10
Body pose is estimated with cascaded CNN regressors predicting XY positions of skeleton points like shoulders and arms.
00:05:24
3D convolutional neural networks treat stacked video frames as channels, like RGB, to estimate pose across all frames at once.
00:06:29
Researchers achieved an 84-fold reduction in human annotation for the gaze classification task while staying accurate.
00:15:02
People sing a lot while driving alone, which is easy to detect by tracking the mouth but absent in passenger data.
00:16:39
Pupil size grows with both high cognitive load and dim light, so it can't reliably measure cognitive load in the car.
00:20:22
Under high cognitive load, blink rate decreases and blink duration shortens.
00:20:55
The cognitive load model takes 90 grayscale frames (about six seconds) into a 3D CNN predicting low, medium, or high load.
00:21:27
The eye model traces 39 points of the eyelids and iris plus four points on the pupil using active appearance models.
00:22:33
In Conway's Game of Life, a live cell with two or three neighbors survives and a dead cell with exactly three neighbors revives.
00:27:19
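The Game of Life rule quoted above is compact enough to write out directly. The following is a minimal sketch in plain Python, not code from the lecture: the function name `step` and the choice to represent the board as a set of (row, column) pairs are assumptions made for illustration. It applies exactly the two rules stated (a live cell with two or three live neighbors survives, a dead cell with exactly three live neighbors comes alive) and treats every other cell as dead in the next generation.

```python
from collections import Counter


def step(live):
    """Advance the board by one generation.

    `live` is a set of (row, col) tuples marking live cells on an
    unbounded grid; the return value is the set of live cells after
    one update.
    """
    # Count, for every cell adjacent to at least one live cell,
    # how many of its eight neighbors are alive.
    neighbor_counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Exactly three neighbors: the cell is alive next generation
    # (birth or survival). Exactly two: alive only if already alive.
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


# A "blinker": three cells in a row that flip between horizontal
# and vertical on every step.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(step(blinker)))        # the vertical phase
print(step(step(blinker)) == blinker)
```

Storing only the live cells keeps the sketch short and avoids fixing a grid size; a fixed 2D array would work equally well for a bounded board.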

Recommended in this episode

Recommended book

Deep Learning

Ian Goodfellow, Yoshua Bengio, Aaron Courville (inferred)

“I encourage you to read the deep learning book, it's available online, deeplearningbook.org.” — Lex Fridman 00:29:09

Topics

deep learning, semi-autonomous vehicles, computer vision, driver monitoring, convolutional neural networks, Tesla, cognitive load, human-centered AI