MELODYFLOW Unleashed: Effortless Music Editing and Generation through Text-Guided AI
Introduction
MELODYFLOW is introduced as a high-fidelity, text-controllable model for generating and editing music. Built on continuous latent representations with a 48 kHz stereo variational autoencoder (VAE) codec, MELODYFLOW uses a single-stage Flow Matching (FM) approach, achieving state-of-the-art audio fidelity and text adherence in music editing tasks.
Method
Latent Audio Representation
MELODYFLOW’s audio codec builds on EnCodec with enhancements from the Descript Audio Codec, including a convolutional autoencoder and a multi-scale STFT reconstruction loss for high-quality stereo encoding.
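A multi-scale STFT reconstruction loss compares the magnitude spectrograms of the reconstruction and the original at several FFT resolutions. The sketch below illustrates the idea; the window sizes and weighting are assumptions for illustration, not the codec's exact configuration.

```python
import torch

def multiscale_stft_loss(x, x_hat, fft_sizes=(512, 1024, 2048)):
    """Sum of L1 distances between magnitude spectrograms at several STFT resolutions.

    x, x_hat: (batch, samples) waveforms. FFT sizes here are illustrative only.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=x.device)
        spec = lambda w: torch.stft(w, n_fft=n_fft, hop_length=n_fft // 4,
                                    window=window, return_complex=True).abs()
        loss = loss + (spec(x) - spec(x_hat)).abs().mean()
    return loss
```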
Conditional Flow Matching Model
This section describes the FM approach: MELODYFLOW learns optimal-transport probability paths between noise and data with a Diffusion Transformer conditioned on text descriptions, enabling high-quality text-to-music generation.
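In conditional flow matching, a training step draws a noise sample and a data sample, forms a point on the straight-line path between them, and regresses the model's predicted velocity onto the constant displacement of that path. A minimal sketch of one such step, assuming a hypothetical `model(x_t, t, text_emb)` velocity predictor:

```python
import torch

def flow_matching_step(model, x1, text_emb):
    """One conditional flow-matching training step on a linear (optimal-transport) path.

    x1: (batch, latent_len, dim) codec latents; text_emb: text conditioning.
    The model interface is a hypothetical velocity predictor v(x_t, t, text).
    """
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, 1, 1)
    x_t = (1 - t_) * x0 + t_ * x1                  # point on the straight path
    target_v = x1 - x0                             # constant velocity of that path
    pred_v = model(x_t, t, text_emb)
    return ((pred_v - target_v) ** 2).mean()       # MSE regression on the velocity
```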
Text-Guided Editing through Latent Inversion
MELODYFLOW supports zero-shot, text-guided music editing via inversion of latent audio representations. Given a source recording and a target text prompt, the model inverts the source latents and regenerates audio that follows the prompt while remaining consistent with the source material.
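Conceptually, editing runs the sampling ODE in reverse to map the source latents back toward noise under the source description, then integrates forward again under the target description. The rough sketch below uses plain Euler steps and a hypothetical `model` interface; the actual system relies on a regularized, ReNoise-style inversion rather than this naive procedure.

```python
import torch

@torch.no_grad()
def edit(model, x_src, src_emb, tgt_emb, steps=32):
    """Zero-shot text-guided edit via latent inversion (illustrative only)."""
    dt = 1.0 / steps
    x = x_src
    # Invert: integrate backward from data (t = 1) toward noise (t = 0) with the source prompt.
    for i in reversed(range(steps)):
        t = torch.full((x.shape[0],), (i + 1) * dt, device=x.device)
        x = x - dt * model(x, t, src_emb)
    # Regenerate: integrate forward from the inverted latent with the target prompt.
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t, tgt_emb)
    return x  # edited latents, to be decoded by the codec
```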
Regularized Latent Inversion
MELODYFLOW enhances inversion with a regularized FM approach that stabilizes the editing trajectory and improves text adherence through KL regularization.
Improving Flow Matching for Text-to-Music Generation
Improvements to FM include a KL-regularized codec latent space, which yields better quality and faster inference, and minibatch coupling, which improves the model’s generative accuracy and efficiency.
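Minibatch coupling re-pairs the noise and data samples within each batch so that training paths cross less; one common realization solves a small assignment problem over pairwise distances. The sketch below assumes that interpretation, and the exact coupling used by the model may differ.

```python
import torch
from scipy.optimize import linear_sum_assignment

def minibatch_ot_coupling(x0, x1):
    """Re-pair noise samples x0 with data samples x1 within a minibatch.

    Returns (x0_paired, x1_paired) so that pairs sharing an index form an
    approximate optimal-transport coupling (minimal total squared distance).
    """
    cost = torch.cdist(x0.flatten(1), x1.flatten(1)).pow(2)        # pairwise squared L2 distances
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())  # Hungarian assignment
    row = torch.as_tensor(row, device=x0.device)
    col = torch.as_tensor(col, device=x0.device)
    return x0[row], x1[col]
```

Training then interpolates x_t along the straight line between each coupled pair, which straightens the learned flow and allows fewer solver steps at inference.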
Experimental Setup
Model
MELODYFLOW uses a Diffusion Transformer with 400M or 1B parameters, conditioned on text embeddings from a T5 encoder and trained on music datasets in both stereo and mono configurations for diverse applications.
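Conditioning of this kind is typically obtained by running the prompt through a frozen T5 encoder and feeding the token embeddings to the transformer via cross-attention. A minimal sketch using Hugging Face `transformers`; the `t5-base` checkpoint is an assumption for illustration, not necessarily the one used by MELODYFLOW.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")       # checkpoint name is illustrative
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

@torch.no_grad()
def embed_prompt(prompt: str) -> torch.Tensor:
    """Return (1, num_tokens, hidden_dim) embeddings used as cross-attention conditioning."""
    tokens = tokenizer(prompt, return_tensors="pt")
    return encoder(**tokens).last_hidden_state

cond = embed_prompt("upbeat funk with slap bass and horns")
```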
Generation and Editing
Text-to-music generation integrates the learned flow with an ODE solver, and editing applies ReNoise inversion to the source latents before regenerating audio under the target text prompt.
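Putting the pieces together, generation samples a latent noise tensor, integrates the learned velocity field from t = 0 to t = 1 with a fixed-step solver, and decodes the result with the codec. A sketch using an explicit midpoint solver; the `model` and `codec` interfaces are hypothetical.

```python
import torch

@torch.no_grad()
def generate(model, codec, text_emb, latent_shape, steps=64, device="cuda"):
    """Text-to-music generation: solve the flow ODE from noise to data, then decode."""
    x = torch.randn(latent_shape, device=device)  # start from Gaussian noise (t = 0)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt, device=device)
        # Explicit midpoint (second-order) step of dx/dt = v(x, t, text).
        v_half = model(x + 0.5 * dt * model(x, t, text_emb), t + 0.5 * dt, text_emb)
        x = x + dt * v_half
    return codec.decode(x)  # hypothetical decoder call returning the stereo waveform
```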