DeepMind, Google's AI research lab, announced that it is developing a video-to-audio (V2A) AI model that will allow users to generate soundtracks for videos.
The new technology is intended to create dramatic scores, realistic sound effects, and dialogue that match the on-screen action.
DeepMind Develops V2A Technology to Improve AI-Generated Videos
In a blog post, DeepMind explained that the team is closing a gap in video-generating AI models, which are not yet capable of producing sound to match their footage. As a result, most AI-generated video clips are silent.
The V2A system encodes the video input, and a diffusion model then iteratively refines the audio starting from random noise. A natural-language prompt can guide the model to generate synchronized, realistic audio based on the video.
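To make that description more concrete, the toy Python sketch below mirrors the pipeline as reported: the video is compressed into an embedding, a text prompt is encoded, and audio is iteratively refined from random noise under that conditioning. Every name, dimension, and update rule here is a hypothetical stand-in for illustration, not DeepMind's actual model.

```python
import numpy as np

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in video encoder: average the frames into a fixed-size embedding."""
    return frames.reshape(frames.shape[0], -1).mean(axis=0)[:128]

def encode_prompt(prompt: str, dim: int = 128) -> np.ndarray:
    """Stand-in text encoder: hash the prompt into a deterministic embedding."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def denoise_step(audio: np.ndarray, conditioning: np.ndarray,
                 step: int, total: int) -> np.ndarray:
    """Toy 'denoising' update: nudge the noisy audio toward a waveform
    derived from the conditioning embedding, a little more on each step."""
    target = np.sin(np.linspace(0, 2 * np.pi * conditioning[:8].sum(), audio.size))
    mix = (step + 1) / total
    return (1 - mix) * audio + mix * target

def generate_audio(frames: np.ndarray, prompt: str,
                   num_samples: int = 16_000, steps: int = 50) -> np.ndarray:
    # Combine the video and prompt embeddings into one conditioning signal.
    conditioning = encode_video(frames) + encode_prompt(prompt)
    # Start from pure random noise, as a diffusion model would.
    audio = np.random.default_rng(0).standard_normal(num_samples)
    for step in range(steps):
        audio = denoise_step(audio, conditioning, step, steps)
    return audio

# Example: a fake 10-frame video and a sound-effect prompt.
frames = np.random.default_rng(1).standard_normal((10, 64, 64))
waveform = generate_audio(frames, "cars skidding, car engine throttling")
print(waveform.shape)  # (16000,) -- one second of toy audio at 16 kHz
```

In a real system, the hand-rolled encoders and the sinusoidal "target" would be replaced by learned neural networks, but the overall loop, refining noise into audio under video and text conditioning, is the same shape as the process DeepMind describes.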
"By training on video, audio, and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts," the team explained.
DeepMind Plans to Further Improve V2A Technology
The team shared demo videos showing the current progress of the V2A technology. The AI model can generate soundtracks guided by keywords that describe the intended sound for the video.
For instance, a sample clip of a car speeding through a neon-lit city uses the prompt "cars skidding," "car engine throttling," and "angelic electronic music."
However, since the quality of the audio output depends on the quality of the video input, some tests showed a drop in audio quality. The team also shared that it is working on lip synchronization for videos with speech.
DeepMind also revealed that it is gathering insights and feedback from creators and filmmakers to improve the V2A technology.