Nvidia wants you to know that your weirdest audio whims are now possible. Following its AI NPCs and in-game chatbot, the company’s latest AI project is a text-to-audio model called Fugatto. Like other AI audio generators, it can create tracks from a simple description, but this program can also create “sounds never heard before,” such as a “saxophone howl,” whatever that means.
In a blog post, Nvidia claimed its “Swiss army knife for sound” AI model can modify existing sounds or create entire soundscapes out of thin air. Fugatto is actually an acronym for the obnoxiously long “Foundational Generative Audio Transformer Opus 1.” It’s capable of processing voices, music, and background noise and combining them all into a single audio track. It can also modify existing sound sources.
It’s silly to call anything “a sound never heard before,” especially if it comes from AI. Whatever the output, the audio is merely the result of an AI algorithm drawing on existing sources in its training data to approximate the prompt. Nvidia said its model is unique since it can combine instructions that were separate during training and “create soundscapes it’s never seen before.” This means it can overlay two distinct audio effects to create something new. In a video, Nvidia showed how it could generate the sound of a train that morphs into an orchestral score. It can also create the sound of a rainstorm that fades into the distance.
These are capabilities we haven’t seen before. Beyond demo prompts like “electronic music with dogs barking in time to the beat,” Nvidia said its tool offers “fine-grained control” over the created soundscapes. Nvidia claims the narrator for the video was an AI version of Nvidia CEO Jensen Huang, though if Fugatto produced the obviously fake voice, the AI model needs more work before anybody uses it for their next deepfake project.
Plenty of AI audio tools already take text prompts and turn them into audio tracks. Adobe has shopped its own Project MusicGenAI Control tool to unscrupulous musicians. Big tech companies like Meta have already promoted their audio models to the movie industry. Last month, Meta debuted Movie Gen, which can generate soundscapes for AI-generated films.
Nvidia quotes AI researcher Rohan Badlani, who said the model “made me feel a little bit like an artist,” though, of course, the AI draws from thousands of gigabytes worth of existing music and audio data. Nvidia did not share exact details about its dataset and only said it contains “millions of audio samples used for training.” The full version of Fugatto is a 2.5 billion-parameter model trained on banks of Nvidia’s own famed H100 AI GPUs.
It’s bad news for Foley artists, who have made that kind of audio fakery into a renowned art form. The company said Fugatto could be a useful tool for ad agencies, video game developers, or musicians who want to sample changes to their work without doing much extra work. Still, the other side of the coin is all those people who would use it to make “new assets,” AKA potentially adding more AI slop to the growing pile.
Fugatto potentially has more utility than merely giving movie production companies an excuse to replace human audio engineers. Nvidia claims it can remove or add instruments to existing music. It can also isolate and modify specific noises from existing sources. Maybe you can get away with adding generated drum rhythms to your blasé synthesizer score, but an entire soundtrack generated with nothing but AI isn’t what most people pay for when buying a movie ticket.