Audio plays an integral role across media forms like film, podcasts, audiobooks, and video games. However, producing professional-grade audio requires extensive sound libraries and deep expertise in areas like sound engineering, voice acting, and music composition, which puts audio content creation out of reach for most people. To change that, Meta has released its newest audio AI, Audiobox, pushing the boundaries of what generative sound models can do. Audiobox is a foundation model that unifies audio generation capabilities ranging from speech to sound effects.
Introducing Meta’s Audiobox
Audiobox is a unified model for high-quality audio generation. It can generate both speech and sound from a text description, audio example, or a combination of vocal style reference and description.
For speech generation, the model offers fine-grained control over vocal style, varying accent, emotion, timbre, and delivery to produce more realistic output than previous models. It also supports voice cloning and restyling by combining an audio sample with text prompts.
This unified model takes four different inputs:
1) Frame-aligned transcript
2) Description
3) Voice prompt
4) Context prompt (masked audio features)
All inputs are embedded and fed into Audiobox, which generates the resulting output audio.
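As a rough sketch, the four conditioning inputs above can be pictured as a single request object. The class and field names below are hypothetical illustrations, not Meta's actual API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AudioboxInput:
    """Hypothetical container for the four conditioning inputs
    described for the unified Audiobox model."""
    transcript: Optional[List[str]] = None   # frame-aligned transcript tokens
    description: Optional[str] = None        # free-text style/sound description
    voice_prompt: Optional[str] = None       # path to a reference voice clip
    context_prompt: Optional[str] = None     # path to masked audio features

    def active_conditions(self) -> List[str]:
        """Return which conditioning signals are set for this request."""
        fields = {
            "transcript": self.transcript,
            "description": self.description,
            "voice_prompt": self.voice_prompt,
            "context_prompt": self.context_prompt,
        }
        return [name for name, value in fields.items() if value]

# Example: restyle a cloned voice using a text description.
request = AudioboxInput(
    transcript=["hel", "lo", "world"],
    description="a calm voice with a British accent",
    voice_prompt="reference_clip.wav",
)
print(request.active_conditions())
# → ['transcript', 'description', 'voice_prompt']
```

Any subset of the inputs can be supplied, which is what lets the same model cover text-to-sound, voice cloning, restyling, and infilling.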
Capabilities of Meta’s Audiobox
1. Text-Based Sound Generation
Audiobox can generate a wide variety of sounds, including dialogue, music, ambient noise, mechanical sounds, and more, based on text prompts. This is more flexible than previous specialized speech/sound models.
2. Freeform Voice Restyling
This model can edit voices to change the emotion, accent, speech rate, sound quality, and background environment. This gives fine control over speech styles. Users can anchor voice quality with an audio sample while tweaking other vocal attributes via text. This voice restyling mechanism has no parallel in past models.
3. Sound Editing with Generative Infilling
It can also insert or replace sections of audio with newly generated sounds. By deleting segments of a recording and then generating replacement sections, it can remove vocal pauses or redo portions to refine them. The new audio infilling training scheme facilitates seamless blending.
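The infilling idea, deleting a segment and then regenerating it in context, can be illustrated with a toy masking step. Audiobox actually operates on learned audio features rather than raw samples, so this sketch only shows the shape of the operation:

```python
import numpy as np

def mask_segment(audio: np.ndarray, start: int, end: int):
    """Zero out a segment of audio and return a boolean mask marking
    the region a generative model would be asked to infill.
    (Illustrative only; Audiobox works on learned audio features.)"""
    masked = audio.copy()
    masked[start:end] = 0.0
    infill_mask = np.zeros(audio.shape[0], dtype=bool)
    infill_mask[start:end] = True
    return masked, infill_mask

# Example: mark samples 100..200 of a short clip for regeneration,
# e.g. to remove a vocal pause.
clip = np.random.randn(1000)
masked_clip, mask = mask_segment(clip, 100, 200)
print(mask.sum())  # → 100 samples to infill
```

The generator then fills the masked region conditioned on the surrounding audio, which is what makes the blend seamless.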
4. Background Noise Reduction
Moreover, the model can selectively reduce background noise in speech recordings while maintaining the voice quality.
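As a rough illustration of suppressing background energy while preserving the main signal, here is a classical spectral-gating sketch. Audiobox's denoising is learned and far more selective; this is not its method, only the general idea:

```python
import numpy as np

def spectral_gate(audio: np.ndarray, noise_floor: float) -> np.ndarray:
    """Toy spectral gating: zero out frequency bins whose magnitude
    falls below a noise floor, keeping the dominant (speech-like)
    components."""
    spectrum = np.fft.rfft(audio)
    keep = np.abs(spectrum) > noise_floor   # keep only strong bins
    return np.fft.irfft(spectrum * keep, n=len(audio))

# Example: a clean tone (stand-in for a voice) buried in low-level hiss.
rng = np.random.default_rng(0)
t = np.arange(1024)
tone = np.sin(2 * np.pi * 10 * t / 1024)          # "voice" component
noisy = tone + 0.01 * rng.standard_normal(1024)    # add background noise
cleaned = spectral_gate(noisy, noise_floor=50.0)   # hiss bins are removed
```

A learned model improves on this by distinguishing speech from noise even when they overlap in frequency, which a fixed threshold cannot do.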
Goals of the Audiobox Model
The researchers had three main goals in developing this model:
1. Unified Model – Create a single model that can generate realistic blends of speech, music, and sound effects, instead of requiring a separate specialized model for each.
2. Enhanced Control – Enable finer guidance over creating new styles through multiple input types, including voice samples, text prompts, or both together.
3. Improved Generalization – Scale up and diversify training data to include unlabeled audio, helping performance across more contexts.
Training and Development of Audiobox
1. Building on Previous Work
The Audiobox models build on previous work like Voicebox and SpeechFlow. Voicebox pioneered a good way to generate speech from text using flow-matching. SpeechFlow developed self-supervised pre-training methods to teach models about speech.
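At a high level, flow matching trains a model to predict the velocity that carries a noise sample toward a data sample along a straight path. Below is a minimal numerical sketch of that objective, with a toy stand-in for the neural network; it illustrates the training signal, not Voicebox's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(data: np.ndarray, predict_velocity) -> float:
    """One step of the (conditional) flow-matching objective:
    interpolate between noise x0 and data x1 at a random time t,
    then regress the model's output toward the target velocity x1 - x0."""
    x1 = data
    x0 = rng.standard_normal(x1.shape)           # Gaussian noise sample
    t = rng.uniform()                            # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                 # point on the straight path
    target = x1 - x0                             # target velocity of the path
    pred = predict_velocity(xt, t)               # model's velocity estimate
    return float(np.mean((pred - target) ** 2))  # squared-error loss

# Toy "model" that always predicts zero velocity; the loss is then
# just the mean squared magnitude of x1 - x0.
loss = flow_matching_loss(rng.standard_normal(16),
                          lambda xt, t: np.zeros_like(xt))
print(round(loss, 3))
```

At generation time, the trained velocity field is integrated from noise to produce audio, which is where solver efficiency (discussed below) matters.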
2. Pre-training of Audiobox SSL
Leveraging this progress, the Audiobox team first pre-trained a unified base model called Audiobox SSL. This model learned from huge datasets of unlabeled speech, music, and sounds. The abundant data and shared learnings enabled scaling up and transfer to new models.
3. Integration of Specialized Models
Next, they customized Audiobox SSL into specialist Audiobox Speech and Audiobox Sound models.
Audiobox Speech
Audiobox Speech produces high-fidelity narrations from input text. Fine-tuning on large transcribed speech datasets like audiobooks instilled superior abilities to generate expressive and accurate read speech.
Audiobox Sound
Audiobox Sound accepts descriptive text prompts to produce sound effects and soundscapes. Training on captioned audio datasets taught it to produce novel sounds from scratch.
Both specialist models beat the previous best models in their categories, proving the adaptability of the base training.
4. Unified Audiobox Model
Finally, they combined the specialist models into a unified Audiobox model. Audiobox can control output voice and sounds using either voice samples or text prompts.
5. Training Data
The model was trained on more than 160,000 hours of speech, primarily in English, spanning a broad range of recordings including audiobooks, podcasts, and in-the-wild captures. The result is Meta's new Audiobox model, which achieves flexibility unmatched by past models: it matches specialist models on narrow tasks while generalizing better across contexts.
Audiobox vs. Previous Models
Beyond its new capabilities, Audiobox also outperforms older models at generating voices and sounds appropriate to a given request, as measured by benchmarks of style and content fidelity.
Meta Audiobox significantly outperforms previous specialized models like AudioLDM, VoiceLDM and TANGO on quality and relevance (faithfulness to text description) in Meta’s internal evaluations. It also exceeds Voicebox by over 30% in accurately generating the requested speech style. By utilizing bespoke solvers, this model is able to generate audio more than 25 times faster than previous models without compromising performance.
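The speedup from better solvers comes down to how many model evaluations are needed to integrate the learned velocity field from noise to audio. Here is a minimal Euler-method sketch of that integration; the actual Audiobox solvers are more sophisticated, and this only illustrates why step count drives generation speed:

```python
import numpy as np

def euler_sample(velocity, x0: np.ndarray, steps: int) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with the Euler method.
    Each step costs one model evaluation, so a solver that needs fewer
    steps generates audio proportionally faster."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

# For a straight-line flow the velocity is constant, so even a few
# Euler steps reach the target exactly.
start = np.zeros(4)
target = np.ones(4)
out = euler_sample(lambda x, t: target - start, start, steps=4)
```

Flow-matching models learn nearly straight paths, which is why carefully chosen solvers can cut the number of steps (and hence generation time) so dramatically.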
Enabling Responsible Innovation
Like any rapidly advancing technology, AI demands prudent governance to address ethical concerns around bias and misuse. Accordingly, Meta built comprehensive responsibility safeguards into Audiobox.
1. Granular Audio Watermarking
Granular audio watermarking makes each output uniquely identifiable down to the frame level. In Meta's testing, the technique proved highly robust to adversarial tampering, helping prevent impersonation and misattribution.
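To illustrate what frame-level granularity means, here is a toy embed-and-detect sketch. Meta's actual watermark is a learned, imperceptible signal; the function names, key-derived pattern, and correlation test below are illustrative assumptions only:

```python
import numpy as np

def _pattern(frame_len: int, key: int) -> np.ndarray:
    """Low-amplitude pseudo-random pattern derived from a secret key."""
    return 1e-4 * np.random.default_rng(key).standard_normal(frame_len)

def embed_frame_watermark(audio: np.ndarray, frame_len: int, key: int) -> np.ndarray:
    """Add the key's pattern to every frame, so each frame individually
    carries the mark (this is what frame-level granularity buys)."""
    marked = audio.copy()
    pattern = _pattern(frame_len, key)
    for i in range(len(audio) // frame_len):
        marked[i * frame_len:(i + 1) * frame_len] += pattern
    return marked

def detect_frame_watermark(audio: np.ndarray, frame_len: int, key: int) -> bool:
    """Correlate each frame with the key's pattern; a positive mean
    correlation indicates the watermark is present."""
    pattern = _pattern(frame_len, key)
    scores = [audio[i * frame_len:(i + 1) * frame_len] @ pattern
              for i in range(len(audio) // frame_len)]
    return bool(np.mean(scores) > 0)

# Example on a silent clip: the marked copy is detected, the clean one is not.
silence = np.zeros(1024)
marked = embed_frame_watermark(silence, frame_len=256, key=42)
print(detect_frame_watermark(marked, 256, 42),
      detect_frame_watermark(silence, 256, 42))
```

Because every frame carries the mark, even a short excerpt clipped from a longer generation can still be attributed.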
2. Stringent Voice Authentication
Stringent voice authentication further guards against spoofing attempts. Randomly prompted phrases must be spoken in the user’s true voice to contribute samples. Collectively, these safeguards uphold accountability in Audiobox applications.
3. Equality Across Speaker Demographics
Moreover, Meta verified that the model exhibits no performance disparities across speaker demographics. Broad representation in the diverse training data supported fairness across all groups.
Looking Ahead
Meta believes models like Audiobox will lower hurdles for audio innovation. Off-the-shelf sound production and customizable vocal synthesis introduce newfound efficiencies for professionals.
With the right collaborative oversight, this technology can broaden human expression. Audiobox marks a milestone in unlocking creativity for all through AI.
Also, check out this amazing innovation by Meta and try its demo.