Machine Learning Confronts the Elephant in the Room

By Kevin Hartnett

Score one for the human brain. In a new study, computer scientists found that artificial intelligence systems fail a vision test a child could accomplish with ease.

“It’s a clever and important study that reminds us that ‘deep learning’ isn’t really that deep,” said Gary Marcus, a neuroscientist at New York University who was not affiliated with the work.

The result comes from the field of computer vision, where artificial intelligence systems attempt to detect and categorize objects. They might try to find all the pedestrians in a street scene, or just distinguish a bird from a bicycle (which is a notoriously difficult task). The stakes are high: As computers take over critical tasks like automated surveillance and autonomous driving, we’ll want their visual processing to be at least as good as the human eyes they’re replacing.

It won’t be easy. The new work accentuates the sophistication of human vision — and the challenge of building systems that mimic it. In the study, the researchers presented a computer vision system with a living room scene. The system processed it well. It correctly identified a chair, a person, books on a shelf. Then the researchers introduced an anomalous object into the scene — an image of an elephant. The elephant’s mere presence caused the system to forget itself: Suddenly it started calling a chair a couch and the elephant a chair, while turning completely blind to other objects it had previously seen.

“There are all sorts of weird things happening that show how brittle current object detection systems are,” said Amir Rosenfeld, a researcher at York University in Toronto and co-author of the study along with his York colleague John Tsotsos and Richard Zemel of the University of Toronto.

Researchers are still trying to understand exactly why computer vision systems get tripped up so easily, but they have a good guess. It has to do with an ability humans have that AI lacks: the ability to understand when a scene is confusing and thus go back for a second glance.

The Elephant in the Room

Eyes wide open, we take in staggering amounts of visual information. The human brain processes it in stride. “We open our eyes and everything happens,” said Tsotsos.

Artificial intelligence, by contrast, creates visual impressions laboriously, as if it were reading a description in Braille. It runs its algorithmic fingertips over pixels, which it shapes into increasingly complex representations. The specific type of AI system that performs this process is called a neural network. It sends an image through a series of “layers.” At each layer, the details of the image — the colors and brightnesses of individual pixels — give way to increasingly abstracted descriptions of what the image portrays. At the end of the process, the neural network produces a best-guess prediction about what it’s looking at.

“It’s all moving from one layer to the next by taking the output of the previous layer, processing it and passing it along to the next layer, like a pipeline,” said Tsotsos.
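The pipeline Tsotsos describes can be sketched in a few lines of Python. This is only an illustration and is not the system used in the study. The layer sizes, the random untrained weights, the `LABELS` list and the helper names `make_layer` and `forward` are all assumptions made up for the example. It shows the one feature the quote stresses: each layer sees only the output of the layer before it, and the image passes through exactly once.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Build one fully connected layer with random (untrained) weights."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

    def layer(inputs):
        # Each output is a weighted sum of the inputs, clipped at zero (ReLU).
        return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

    return layer


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pixels, layers):
    """Pass the input through every layer once, in order, with no second look."""
    activation = pixels
    for layer in layers:
        activation = layer(activation)
    return softmax(activation)


rng = random.Random(0)
LABELS = ["chair", "person", "books"]  # hypothetical categories
layers = [
    make_layer(8, 6, rng),
    make_layer(6, 4, rng),
    make_layer(4, len(LABELS), rng),
]
pixels = [rng.random() for _ in range(8)]  # stand-in for pixel brightnesses

probs = forward(pixels, layers)
best_guess = LABELS[probs.index(max(probs))]
print(best_guess, probs)
```

Because the weights here are random, the guess itself is meaningless. A real network would learn its weights from labeled images, but the data would still flow in the same one-way direction.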

Neural networks are adept at specific visual chores. They can outperform humans in narrow tasks like sorting objects into best-fit categories — labeling dogs with their breed, for example. These successes have raised expectations that computer vision systems might soon be good enough to steer a car through crowded city streets.

Those successes have also provoked researchers to probe the networks’ vulnerabilities. In recent years there has been a slew of attempts, known as “adversarial attacks,” in which researchers contrive scenes to make neural networks fail. In one experiment, computer scientists tricked a neural network into mistaking a turtle for a rifle. In another, researchers waylaid a neural network by placing an image of a psychedelically colored toaster alongside ordinary objects like a banana.

This new study has the same spirit. The three researchers fed a neural network a living room scene: A man seated on the edge of a shabby chair leans forward as he plays a video game. After chewing on this scene, the network correctly detected a number of objects with high confidence: a person, a couch, a television, a chair, some books.