Researchers use AI and audio to predict where objects will land

In a new preprint study, researchers at Carnegie Mellon University claim sound can be used to predict an object’s appearance — and its motion. The coauthors created a “sound-action-vision” data set and a family of AI algorithms to investigate the interactions between audio, visuals, and movement. The results, they say, show representations derived from sound can be used to anticipate where objects will move when subjected to physical force.

While vision is foundational to perception, sound is arguably as important. It captures rich information often imperceptible through visual or force data, like the texture of dried leaves or the pressure inside a champagne bottle. Despite this, few systems and algorithms exist that exploit sound as a vehicle to build physical understanding. This motivated the Carnegie Mellon study, which sought to explore the synergy between sound and action and what sort of inferences might be made from the former.

The researchers first created the sound-action-vision data set by building a robot — Tilt-Bot — to tilt objects including screwdrivers, scissors, tennis balls, cubes, and clamps on a tray in random directions. The objects hit the thin walls of the plaster tray and produced sounds, which were recorded and added to the corpus one by one.




Four microphones mounted on the 30-by-30-centimeter tray (one on each side) recorded audio while an overhead camera captured RGB and depth information. Tilt-Bot moved each object around for an hour; every time the object struck a wall of the tray, the system created a log containing the sound, the RGB and depth data, and the tracked location of the object at the moment of collision.
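A log entry of this kind can be sketched as a simple record. The field names and array shapes below are illustrative assumptions, not the researchers' actual schema:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class CollisionLog:
    """One hypothetical entry per tray collision (field names are illustrative)."""
    audio: np.ndarray      # (4, n_samples) — one waveform per microphone
    rgb: np.ndarray        # (H, W, 3) overhead camera frame
    depth: np.ndarray      # (H, W) depth map from the overhead camera
    location: tuple        # tracked (x, y) position of the object on the tray


# Example entry with placeholder data
entry = CollisionLog(
    audio=np.zeros((4, 44100)),
    rgb=np.zeros((480, 640, 3), dtype=np.uint8),
    depth=np.zeros((480, 640)),
    location=(0.12, 0.25),
)
print(entry.audio.shape)
```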

With the audio recordings from the collisions, the team used a method that enabled them to treat the recordings as images. This allowed the models to capture temporal correlations from single audio channels (i.e., recordings by one microphone) as well as correlations among multiple audio channels (recordings from several microphones).

The researchers then used the corpus — which contained sounds from 15,000 collisions between over 60 objects and the tray — to train a model to identify objects from audio. In a second, more challenging exercise, they trained a model to predict what actions were applied to an unseen object. And in a third, they trained a forward prediction model to suss out the location of objects after they’d been pushed by a robotic arm.
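The third task, forward prediction, amounts to regressing a post-push location from an audio-derived embedding plus the applied action. As a stand-in for the learned model, the sketch below fits a least-squares linear map on synthetic data; the embedding size, action format, and data are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: audio embeddings (d=32), push actions (dx, dy),
# and the object's observed post-push (x, y) location.
n, d = 500, 32
audio_emb = rng.standard_normal((n, d))
actions = rng.standard_normal((n, 2))
X = np.hstack([audio_emb, actions])

# Synthetic ground truth: locations depend linearly on embedding and action
true_w = 0.1 * rng.standard_normal((d + 2, 2))
locations = X @ true_w + 0.01 * rng.standard_normal((n, 2))

# Least-squares fit stands in for training the forward model
w, *_ = np.linalg.lstsq(X, locations, rcond=None)
pred = X @ w
mse = np.mean((pred - locations) ** 2)
print(f"forward-model MSE: {mse:.5f}")
```

The real model is trained end to end rather than as a linear fit, but the input/output contract — (audio embedding, action) in, predicted location out — is the same.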


Above: Forward model predictions are visualized here as pairs of images. The left image is the observation before the interaction; the right image is the observation after the interaction. Given the object's ground-truth location before the interaction (green dot), the object's audio embedding, and the action taken by the robot (red arrow), the trained forward model predicts the future object location (red dot).

The object-identifying model learned to predict the right object from sound 79.2% of the time, failing only when the generated sounds were too soft, according to the researchers. Meanwhile, the action prediction model achieved a mean squared error of 0.027 on a set of 30 previously unseen objects, 42% better than a model trained only with images from the camera. And the forward prediction model predicted the pushed objects' locations with lower error than a model without access to sound.

“In some domains like forward model learning, we show that sound in fact provides more information than can be obtained from visual information alone,” the researchers wrote. “We hope that the Tilt-Bot data set, which will be publicly released, along with our findings will inspire future work in the sound-action domain and find widespread applicability in robotics.”
