Seed-Music
Seed-Music is a unified framework for high-quality, controllable music generation and editing. It can produce vocal and instrumental works from multimodal inputs such as lyrics, style descriptions, sheet music, audio references, or voice prompts, and it supports post-production editing of existing tracks, allowing direct modification of melodies, timbres, lyrics, or instruments. The framework combines autoregressive language modeling with diffusion approaches in a three-stage pipeline: representation learning encodes raw audio into intermediate representations (audio tokens, symbolic music tokens, or vocoder latents); generation transforms multimodal inputs into these music representations; and rendering converts the representations into high-fidelity audio. The system supports lead-sheet-to-song conversion, singing synthesis, voice conversion, audio continuation, style transfer, and fine-grained control over music structure.
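The three-stage pipeline can be sketched as three functions chained together. This is a toy illustration with made-up stand-ins (the function names, the hash-based "tokenizer", and the character-level "model" are all hypothetical), not Seed-Music's actual components:

```python
def representation_stage(raw_audio):
    # Stage 1 sketch: encode raw audio into an intermediate
    # representation -- here, fake "audio tokens" (hypothetical stand-in
    # for Seed-Music's learned tokenizers).
    return [hash(x) % 256 for x in raw_audio]

def generation_stage(lyrics, style):
    # Stage 2 sketch: map multimodal inputs (lyrics plus a style
    # description) to a token sequence with a stubbed "model".
    return [ord(c) % 256 for c in (style + ":" + lyrics)]

def rendering_stage(tokens):
    # Stage 3 sketch: "render" tokens back into an audio-like signal
    # of samples in [0, 1].
    return [t / 255.0 for t in tokens]

tokens = generation_stage("la la la", "acoustic ballad")
audio = rendering_stage(tokens)
# For editing, rendered audio can be re-encoded into tokens (stage 1)
# so the intermediate representation can be modified and re-rendered.
re_tokens = representation_stage(audio)
print(len(audio) == len(tokens))  # one sample per token in this sketch
```

The point of the structure is that editing operates on the intermediate representation rather than on raw audio, which is why stage 1 exists alongside generation.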
Learn more
MuseNet
We’ve created MuseNet, a deep neural network that can generate 4-minute musical compositions with 10 different instruments and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text. Since MuseNet knows many different styles, we can blend generations in novel ways. We’re excited to see how musicians and non-musicians alike will use MuseNet to create new compositions! Choose a composer or style, an optional start of a famous piece, and start generating. This lets you explore the variety of musical styles the model can create.
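The core idea of learning style by next-token prediction can be shown with a toy autoregressive model. Below is a bigram frequency table standing in for the transformer, trained on a made-up "MIDI token" corpus; all names and data here are illustrative, not MuseNet's actual tokenization or architecture:

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    # Count next-token frequencies -- a toy stand-in for the
    # transformer's next-token prediction over MIDI event tokens.
    table = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        table[cur][nxt] += 1
    return table

def generate(table, start, length, seed=0):
    # Sample autoregressively: each new token is drawn from the
    # distribution conditioned on the previous token.
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        counts = table[out[-1]]
        if not counts:
            break  # no continuation observed for this token
        choices, weights = zip(*counts.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

# Toy corpus: note-on events encoded as MIDI pitch numbers.
corpus = [60, 62, 64, 62, 60, 62, 64, 65, 64, 62, 60]
table = train_bigram(corpus)
print(generate(table, start=60, length=8))
```

A real transformer conditions on the entire preceding context rather than one token, which is what lets MuseNet pick up long-range harmonic and stylistic structure instead of just local transitions.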
Learn more
AudioCraft
AudioCraft is a one-stop code base for all your generative audio needs: music, sound effects, and compression after training on raw audio signals. With AudioCraft, we simplify the overall design of generative models for audio compared to prior work. Both MusicGen and AudioGen consist of a single autoregressive Language Model (LM) that operates over streams of compressed discrete music representation, i.e., tokens. We introduce a simple approach to leverage the internal structure of the parallel streams of tokens and show that, with a single model and an elegant token interleaving pattern, our approach efficiently models audio sequences, simultaneously capturing the long-term dependencies in the audio and allowing us to generate high-quality audio. Our models leverage the EnCodec neural audio codec to learn the discrete audio tokens from the raw waveform. EnCodec maps the audio signal to one or several parallel streams of discrete tokens.
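One simple way to interleave parallel codebook streams for a single LM is a delay pattern, where stream k is shifted right by k steps so all codebooks can be predicted in one pass per step. The sketch below illustrates that pattern in isolation; it is not AudioCraft's actual implementation, and the names (`delay_interleave`, `PAD`) are illustrative:

```python
def delay_interleave(streams):
    # Apply a "delay" interleaving pattern to K parallel codebook
    # streams of equal length T: stream k lags by k frames, so at
    # step t the model emits codebook k's token for frame t - k.
    PAD = None  # marks positions where a stream has no token yet
    K, T = len(streams), len(streams[0])
    out = []
    for t in range(T + K - 1):  # total steps after delaying
        step = []
        for k in range(K):
            idx = t - k  # stream k is delayed by k frames
            step.append(streams[k][idx] if 0 <= idx < T else PAD)
        out.append(step)
    return out

streams = [[1, 2, 3], [10, 20, 30]]  # K=2 codebooks, T=3 frames
for step in delay_interleave(streams):
    print(step)
# -> [1, None], [2, 10], [3, 20], [None, 30]
```

The payoff is that the sequence grows by only K - 1 extra steps rather than by a factor of K, which is what makes a single flat autoregressive model over several EnCodec codebooks practical.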
Learn more
MusicGen
Meta's MusicGen is an open-source, deep-learning language model that can generate short pieces of music from text prompts. The model was trained on 20,000 hours of music, including whole tracks and individual instrument samples. The model generates 12 seconds of audio based on the description you provide. You can optionally supply reference audio, from which a broad melody will be extracted; the model will then try to follow both the description and the melody. All samples are generated with the melody model. You can also use your own GPU or a Google Colab by following the instructions in our repo. MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models. MusicGen can generate high-quality samples while being conditioned on a textual description or melodic features, allowing better control over the generated output.
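The "broad melody" extracted from reference audio is chromagram-like: energy folded onto the 12 pitch classes rather than exact notes. The toy function below illustrates that folding by taking the FFT peak of a signal and mapping it to a pitch class; it is a simplified illustration, not MusicGen's actual melody-conditioning code, and `dominant_pitch_class` is a made-up name:

```python
import numpy as np

def dominant_pitch_class(signal, sr):
    # Return the dominant pitch class (0=C ... 11=B) of a mono signal:
    # find the strongest FFT bin, convert its frequency to a MIDI note
    # number, and fold onto the 12 pitch classes. A toy stand-in for
    # chromagram-based melody conditioning.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    peak = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
    midi = 69 + 12 * np.log2(peak / 440.0)     # A4 = 440 Hz = MIDI 69
    return int(round(midi)) % 12

sr = 22050
t = np.arange(sr) / sr
a440 = np.sin(2 * np.pi * 440.0 * t)   # one second of A4
print(dominant_pitch_class(a440, sr))  # A is pitch class 9
```

Because only pitch-class content is kept, the conditioning constrains the contour of the output without forcing it to reproduce the reference recording.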
Learn more