Meta has been developing technologies in many areas, and AI is one of them. The social media giant intends to open-source an AI tool called ImageBind, which can accept input in the form of text, images, audio, video, and more.
ImageBind for the Public
The AI tool is said to be different from, and more advanced than, better-known AI art generators such as Midjourney, Stable Diffusion, and DALL-E 2. Aside from the media mentioned above, it can also work with scenes described by 3D depth measurements, temperature readings, and motion data.
In a way, it functions the way a human subconsciously perceives their surroundings. A person can feel a general emotion from a collection of cues: the lighting in a room, the smell of spring, the sound of trees rustling, a cool breeze, and so on.
As humans encounter different stimuli, they become more capable of imagining complex scenes, much like ImageBind learning with every prompt and gradually building on what its users input.
As mentioned by Engadget, this means that the AI tool does not need to be trained on every possible combination, unlike most AI technology in use today. Meta's researchers stated that this opens up opportunities to create animations using only static images and audio prompts.
Meta said that ImageBind shows it is possible to create a joint embedding space across multiple modalities without needing training data in which every modality is paired with every other. Not to mention, it can also be integrated with Meta's VR projects.
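To make the joint embedding idea concrete, the sketch below maps text, image, and audio inputs into one shared vector space where any pair can be compared directly. This is a minimal illustrative sketch, not Meta's released code: the encoders (embed_text, embed_image, embed_audio) are hypothetical stand-ins that return random vectors here in place of trained models.

    import numpy as np

    DIM = 4  # toy embedding size; real models use hundreds of dimensions
    rng = np.random.default_rng(0)

    def normalize(v):
        # Unit-length vectors let a dot product act as cosine similarity.
        return v / np.linalg.norm(v)

    # Hypothetical stand-ins for trained modality encoders. The key idea
    # is that each one maps its input into the SAME shared space.
    def embed_text(prompt): return normalize(rng.standard_normal(DIM))
    def embed_image(path):  return normalize(rng.standard_normal(DIM))
    def embed_audio(path):  return normalize(rng.standard_normal(DIM))

    text_vec = embed_text("a dog barking")
    image_vec = embed_image("dog_photo.jpg")
    audio_vec = embed_audio("barking.wav")

    # Because all three live in one space, any pair can be scored, even
    # modalities that never appeared together during training.
    print("text vs image:", float(text_vec @ image_vec))
    print("text vs audio:", float(text_vec @ audio_vec))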
The social media giant even believes that the capabilities of the generative AI tool may be expanded beyond the six current modalities. As the technology matures, developers could add inputs like touch, speech, smell, brain fMRI signals, and so on.
In a way, the AI tool will become more human as it adopts certain "senses" to create imagery or animations. This holds great potential for Meta, which could use the AI tool to create worlds in its metaverse project, Horizon Worlds.
What Else Can ImageBind Do?
Not only can it produce a finished output from the inputs a user provides across different modalities, but it can also generate or retrieve other media from a given image, audio clip, text prompt, or a mix of them.
For instance, uploading a photo of a lion will prompt the AI tool to generate the animal's sounds. You can also input an audio recording and ImageBind will suggest images associated with it, like the sound of waves surfacing an image of a beach.
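Under the hood, this kind of suggestion is typically a nearest-neighbor lookup: embed the audio clip, then compare it against a precomputed index of image embeddings. A hedged sketch under that assumption, with toy random vectors standing in for real encoder outputs:

    import numpy as np

    DIM = 4
    rng = np.random.default_rng(1)

    def normalize(v):
        return v / np.linalg.norm(v)

    # Pretend these came from an image encoder run over a photo library.
    image_index = {
        "beach.jpg":  normalize(rng.standard_normal(DIM)),
        "forest.jpg": normalize(rng.standard_normal(DIM)),
        "city.jpg":   normalize(rng.standard_normal(DIM)),
    }

    # Pretend this came from an audio encoder applied to a waves recording.
    audio_query = normalize(rng.standard_normal(DIM))

    # Score every indexed image by cosine similarity; the highest wins.
    best = max(image_index, key=lambda name: float(audio_query @ image_index[name]))
    print("closest image:", best)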
Generating images based on audio alone is an entirely different function. When you upload a recording of a train, ImageBind can generate a scene where train tracks and a train appear, whether as the subject or in the background.
When a user provides two inputs, such as an audio clip and an image, ImageBind will retrieve images that combine the two. In the example given by Meta, a recording of a dog barking plus an image of a beach gets you an image of a dog on the beach.
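One plausible way this works, given the shared space, is simple embedding arithmetic: combine the audio and image query vectors (for example, by averaging) and use the result as a single retrieval query. A sketch under that assumption, with deliberately contrived toy vectors so the combined candidate wins:

    import numpy as np

    DIM = 4
    rng = np.random.default_rng(2)

    def normalize(v):
        return v / np.linalg.norm(v)

    # Hypothetical embeddings for the two query inputs.
    audio_dog_bark = normalize(rng.standard_normal(DIM))
    image_beach = normalize(rng.standard_normal(DIM))

    # Merge the two modalities into one query vector.
    combined_query = normalize(audio_dog_bark + image_beach)

    # Contrived candidates: the "dog on a beach" image sits between the
    # two cues, so it scores highest against the combined query.
    candidates = {
        "dog_on_beach.jpg": normalize(audio_dog_bark + image_beach),
        "empty_beach.jpg":  image_beach,
        "dog_in_park.jpg":  audio_dog_bark,
    }
    best = max(candidates, key=lambda name: float(combined_query @ candidates[name]))
    print("retrieved:", best)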