Breakthrough Latent Diffusion Model Enhances Control and Precision in Audio Creation
Stability AI, a prominent player in the artificial intelligence space, has unveiled its ‘Stable Audio’ model, a notable advance in audio generation. The technology offers far greater control over the content and length of generated audio, even enabling the composition of complete songs.
Precision and Control: Stability AI Opens a New Horizon in Audio Generation
Addressing Historical Limitations
Audio diffusion models have historically been limited to generating audio of fixed durations, often producing abrupt, incomplete musical phrases. The limitation arose because such models were trained on random chunks cropped from longer files and constrained to a specific length, so the system never captured the full audio context. Stable Audio addresses this by enabling the generation of audio of any specified length up to the training window size.
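The cropping step described above can be sketched as follows. This is a minimal illustration, not Stability AI's actual pipeline: the real model records timing metadata in seconds, whereas here sample indices stand in for them, and the function names are hypothetical.

```python
import numpy as np

def random_crop(audio: np.ndarray, window: int, rng: np.random.Generator):
    """Crop a fixed-length training window from a longer file.

    Alongside the chunk, return the timing metadata (start position and
    total length) that a Stable Audio-style model can condition on, so
    the model learns where each chunk sits within the full piece instead
    of seeing it as a complete, context-free clip.
    """
    start = int(rng.integers(0, max(1, len(audio) - window + 1)))
    chunk = audio[start:start + window]
    return chunk, {"seconds_start": start, "seconds_total": len(audio)}

# Example: crop a 30-sample window from a 100-sample file.
rng = np.random.default_rng(0)
chunk, meta = random_crop(np.zeros(100), 30, rng)
```

Without the returned metadata, every training example looks like a self-contained clip, which is why earlier models produced truncated phrases; conditioning on it is what lets Stable Audio generate coherent variable-length output.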
Key Features of Stable Audio
Stable Audio operates on highly compressed audio representations rather than raw audio, which greatly speeds up generation. Running on an NVIDIA A100 GPU, it can rapidly generate up to 95 seconds of stereo audio.
The Core Architecture
Stable Audio combines a variational autoencoder (VAE), a text encoder, and a U-Net-based diffusion model to generate high-fidelity audio. Using an architecture based on the Descript Audio Codec, the VAE compresses stereo audio into a lossy latent encoding, speeding up both generation and training.
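The latent-diffusion idea can be illustrated with a toy sketch. Everything here is a stand-in: the real system uses a learned Descript-Audio-Codec-style VAE and a U-Net denoiser, while this sketch substitutes block-averaging for the encoder, sample repetition for the decoder, a placeholder decay for the denoising step, and an arbitrary compression factor.

```python
import numpy as np

DOWNSAMPLE = 64  # illustrative compression factor, not the real ratio

def encode(audio: np.ndarray) -> np.ndarray:
    """Compress audio into a much shorter latent (here: block-averaging)."""
    n = len(audio) // DOWNSAMPLE * DOWNSAMPLE
    return audio[:n].reshape(-1, DOWNSAMPLE).mean(axis=1)

def decode(latent: np.ndarray) -> np.ndarray:
    """Expand the latent back to audio length; the round trip is lossy."""
    return np.repeat(latent, DOWNSAMPLE)

def generate(length_s: int, sample_rate: int, steps: int = 10) -> np.ndarray:
    """Start from a random latent, iteratively denoise, then decode.

    Diffusion runs entirely in the small latent space, which is why
    generating on compressed representations is so much faster than
    diffusing over raw audio samples.
    """
    rng = np.random.default_rng(0)
    latent = rng.standard_normal(length_s * sample_rate // DOWNSAMPLE)
    for _ in range(steps):
        latent *= 0.9  # placeholder for a learned U-Net denoising step
    return decode(latent)
```

The key design point survives the simplification: a 64x shorter latent means roughly 64x less data for the diffusion model to process per step.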
Harnessing the Power of Text
To further enhance controllability, Stable Audio conditions generation on text prompts. Its text encoder is taken from a CLAP model, so the resulting text features carry information about the relationships between words and sounds. In addition, timing embeddings, computed from a chunk's start time and the source file's total length, are concatenated with the text tokens, which is what lets users specify the desired length of the generated audio.
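The conditioning step above can be sketched roughly as follows. This is an assumption-laden illustration: the real model uses learned timing embeddings, whereas here a sinusoidal encoding stands in, and the embedding width and function names are hypothetical.

```python
import numpy as np

EMB_DIM = 8  # illustrative embedding width

def timing_embedding(seconds: int, dim: int = EMB_DIM) -> np.ndarray:
    """Map a seconds value to a vector; a simple sinusoidal encoding
    stands in for the learned embedding used by the real model."""
    i = np.arange(dim // 2)
    freq = seconds / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(freq), np.cos(freq)])

def condition(text_tokens: np.ndarray,
              seconds_start: int, seconds_total: int) -> np.ndarray:
    """Concatenate timing embeddings with the text features, mirroring
    how the diffusion model is conditioned on both the prompt and the
    desired output length."""
    timing = np.stack([timing_embedding(seconds_start),
                       timing_embedding(seconds_total)])
    return np.concatenate([text_tokens, timing], axis=0)

# Example: 5 text-token features plus 2 timing rows -> 7 conditioning rows.
cond = condition(np.zeros((5, EMB_DIM)), seconds_start=0, seconds_total=95)
```

Because the length information enters through the same conditioning pathway as the text, asking for a shorter or longer piece at inference time needs no architectural change.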
Dataset and Training
In collaboration with AudioSparx, Stability AI curated a vast dataset of more than 800,000 diverse audio files. This collection, equivalent to roughly 19,500 hours of audio, served as the training data for the Stable Audio model.
Future Endeavors and Advancements at Stability AI
Stability AI remains dedicated to advancing model architectures, refining datasets, enhancing training procedures, and improving various facets of audio generation. Their pursuit includes elevating output quality, fine-tuning controllability, optimizing inference speed, and expanding the range of achievable output lengths. They have hinted at forthcoming releases from Harmonai, their generative audio research lab, teasing the possibility of open-source models based on Stable Audio and accessible training code.
Recent Achievements at Stability AI
This remarkable announcement follows Stability AI’s recent commitment to AI safety, as they joined seven other prominent AI companies in signing the White House’s voluntary AI safety pledge during its second round.