UT-Austin researchers develop more realistic audio AI models for Meta virtual reality

Madeline Duncan, News Reporter

UT researchers have teamed up with Meta to create new voice and audio AI models.

Last month, Meta announced three audio models to make virtual reality more immersive and allow for accessible audio editing: a visual acoustic matching model, a dereverberation model and a visual voice model. Changan Chen, a UT PhD student and Meta researcher, led the effort to develop the acoustic matching and dereverberation models. Both models use visual cues to alter how audio sounds, which can help Meta develop more realistic virtual reality experiences, Chen said.

“In VR, when you put on a headset, you’ll see a different world,” Chen said. “You want the stuff you see to match with what you hear.”

Background noise from the location where audio is recorded can often affect the quality of the sound. The visual acoustic matching AI addresses this problem by using an image of another location to make the audio sound as if it were recorded in that space, Chen said.

“If we have a picture of any space, we can successfully transfer an audio clip recorded in any other space to match the acoustic signature of the space in the image,” Chen said.

Chen said his team trained the matching AI to correct mismatched visual and audio cues. For example, the researchers used an image of a cathedral to make a pre-recorded audio clip sound louder and more echoey. The AI used visual cues from the cathedral image to adjust the audio, which was originally quiet and free of reverberation, so that it matched the visual surroundings of the church. The technology allows the researchers to create a cohesive audio and visual experience.

“There are some existing software where you can add reverberation to existing audio,” Chen said. “The issue there is if you don’t have experience in acoustics, it’s hard for people to tell what is the reverberation level, how much is the reverberation (and) how long is the reverberation time? This new technology empowers ordinary people to modify the video and the audio recording.”

The second model, the dereverberation AI, uses visual cues to remove reverb that may come from the environment in which a sound is recorded. This can help improve audio clarity for people with hearing impairments, Chen said.

The final model, the visual voice AI, focuses on particular sounds or voices in a noisy environment, said Ruohan Gao, a postdoctoral research fellow at Stanford who worked on the model for Meta in 2021 when he was a PhD student at UT.

“Imagine you’re at a cocktail party — there may be lots of people around you talking and discussing different things, but as humans, we can very easily focus on a single conversation,” Gao said. “The reason is that our visual information and auditory information can parse the information and separate the sound that you need.”

Gao said that since the real world is multisensory, the virtual world should be multimodal.

“We should simultaneously model visual information and audio information because that’s how we humans perceive the world,” Gao said. “So that we can have a VR and AR environment that is realistic from a multimodal perspective, both visually and acoustically.”