Jitendra Malik: Computer Vision

The guest

Jitendra Malik — A professor at UC Berkeley and one of the seminal figures in computer vision, both before and after the deep learning revolution. Cited over 180,000 times, he has mentored many world-class AI researchers.

The gist

Lex Fridman interviews Jitendra Malik, a foundational computer vision researcher, about why vision is far harder than it appears, since most human visual processing is subconscious. Malik argues that perception is fundamentally tied to action and that current systems rely too heavily on supervised, feed-forward learning. He advocates for systems that learn like children: multimodally, incrementally, physically, and through exploration. The conversation spans skepticism about autonomous driving, segmentation, 3D understanding, the limits of the Turing test, and the present-day risks of AI in recommender systems.

Big reveals

Malik names the 'fallacy of the successful first step': reaching 50% of a vision solution takes a minute, 99% takes five years, and 99.99% may not happen in your lifetime.
00:06:10
He is a pessimist on fully autonomous driving in the near future because of the 0.01% of cases needing sophisticated cognitive reasoning.
00:08:49
Argues that human driving does not start from a tabula rasa: 16-year-olds are already 'visual geniuses' thanks to visual development from birth to age 2, so driver's ed mostly teaches control, not vision.
00:16:34
Claims active interaction is one way to break the correlation-versus-causation barrier, with children running controlled experiments constantly.
00:42:06
States vision is more fundamental than language, both evolutionarily and developmentally, with language built on a spatial-temporal substrate.
01:08:34
Insists we must worry about AI harms today, not some future AGI moment, citing the Uber self-driving car that killed a pedestrian.
01:29:27
Calls today's recommendation algorithms on YouTube, Facebook and Twitter effectively a form of superintelligence already controlling populations.
01:32:37

Things worth remembering

Seymour Papert's 1966 MIT 'Summer Vision Project' proposed solving computer vision tasks we still work on today over a single summer.
00:03:04
Neurons are slow compared with transistors but have far higher connectivity; the human brain is also vastly more power-efficient than GPUs.
00:21:44
Early computer vision detected edges and threw away the rest of the image, a form of compression forced by limited compute and storage.
00:30:39
Malik estimates video recognition is roughly 10 years behind object recognition, with action classification around 30%.
00:33:17
The restaurant 'schema'/'script'/'frame' is a classic 1970s AI idea that was hand-coded but should instead be learned.
00:35:22
A ResNet-50 has 50 layers, but the visual cortex from retina onward may use only about seven, compensating with feedback connections.
00:49:48
Malik proposes a great visual intelligence test: building an effective aid for a blind person in the real world.
01:15:51
Nobel laureate Peter Medawar described research as 'the art of the soluble'—finding unsolved problems with a soft underbelly.
01:37:53
The episode closes with Prince Myshkin's line from Dostoyevsky's The Idiot: 'beauty will save the world.'
01:40:57

Recommended in this episode

Books, products and media the guest or host endorsed in this episode.

Ubuntu MATE 20.04

Canonical (inferred)

“Shout out to my favorite flavor of Linux, Ubuntu MATE 20.04. Once again, get it at expressvpn.com/lexpod” - Lex Fridman 00:02:34

Topics

computer vision, deep learning, autonomous driving, learning like children, perception and action, 3D scene understanding, AI risk, Turing test