MELODYFLOW Unleashed: Effortless Music Editing and Generation through Text-Guided AI
Introduction

MELODYFLOW is a high-fidelity, text-controllable model for generating and editing music. Built on continuous latent representations from a 48 kHz stereo variational autoencoder (VAE) codec, MELODYFLOW uses a single-stage Flow Matching (FM) approach and achieves state-of-the-art audio fidelity and text adherence on music editing tasks.

Method

Latent Audio Representation

MELODYFLOW’s audio codec builds on EnCodec with enhancements from the Descript Audio Codec, including a convolutional autoencoder and a multi-scale STFT reconstruction loss, for high-quality stereo encoding.
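To illustrate the reconstruction objective, here is a minimal sketch of a multi-scale STFT magnitude loss: the reference and the codec's reconstruction are compared at several FFT resolutions and the L1 distances are averaged. The function names, FFT sizes, and hop ratio are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Magnitude spectrogram via a Hann-windowed sliding FFT.
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frames.append(np.abs(np.fft.rfft(window * x[start:start + n_fft])))
    return np.stack(frames)

def multiscale_stft_loss(x, y, fft_sizes=(1024, 512, 256)):
    # Average L1 distance between magnitude spectrograms at
    # several resolutions (hypothetical sizes for illustration).
    losses = []
    for n_fft in fft_sizes:
        X = stft_mag(x, n_fft, n_fft // 4)
        Y = stft_mag(y, n_fft, n_fft // 4)
        losses.append(np.mean(np.abs(X - Y)))
    return float(np.mean(losses))
```

Comparing at multiple resolutions trades off time and frequency precision, which is why codecs like Descript's combine several scales rather than a single spectrogram.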

Conditional Flow Matching Model

With the FM approach, MELODYFLOW learns optimal-transport paths between noise and data using a Diffusion Transformer conditioned on text descriptions, enabling high-quality text-to-music generation.
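The flow-matching objective can be sketched in a few lines: sample noise, pick a random time, interpolate along the straight (optimal-transport) path between noise and data, and regress the model's predicted velocity onto the constant target. The `model(x_t, t)` signature stands in for the text-conditioned Diffusion Transformer and is an assumption for illustration.

```python
import numpy as np

def fm_training_example(x_data, model, rng):
    # One conditional flow-matching training step (sketch).
    x0 = rng.standard_normal(x_data.shape)          # noise sample
    t = rng.uniform()                               # time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x_data               # point on the straight path
    target_velocity = x_data - x0                   # d x_t / d t is constant
    pred = model(x_t, t)                            # e.g. a DiT forward pass
    loss = np.mean((pred - target_velocity) ** 2)   # velocity regression loss
    return loss
```

Because the target velocity along a straight path is constant in time, this regression problem is simpler than predicting diffusion scores, which is part of FM's appeal as a single-stage objective.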

Text-Guided Editing through Latent Inversion

MELODYFLOW supports zero-shot, text-guided music editing via inversion of latent audio representations. Using a text-based prompt, the model modifies audio while maintaining consistency with the source material.
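The inversion-then-edit loop can be sketched as follows: integrate the flow ODE backward (data to noise) under the source prompt, then forward (noise to data) under the target prompt. A simple Euler integrator is used here for clarity; `velocity(x, t, prompt)` is a stand-in for the text-conditioned transformer, and its signature is an assumption.

```python
import numpy as np

def edit_by_inversion(latent, velocity, src_prompt, tgt_prompt, steps=25):
    # Zero-shot editing sketch via latent inversion.
    dt = 1.0 / steps
    x = latent.copy()
    # Inversion: Euler steps from t=1 (data) down to t=0 (noise).
    for i in range(steps, 0, -1):
        t = i * dt
        x = x - dt * velocity(x, t, src_prompt)
    # Regeneration: Euler steps from t=0 back to t=1 with the new prompt.
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t, tgt_prompt)
    return x
```

If the two prompts are identical and the integrator were exact, the round trip would return the original latent; the edit arises from swapping the prompt on the forward pass, which is why the output stays consistent with the source material.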

Regularized Latent Inversion

MELODYFLOW enhances inversion with a regularized FM approach, stabilizing the editing path and improving text adherence through KL regularization.

Improving Flow Matching for Text-to-Music Generation

Improvements to FM include a KL-regularized codec latent space, which yields better quality and faster inference, and minibatch coupling, which improves the model’s generative accuracy and efficiency.
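Minibatch coupling can be sketched concretely: instead of pairing each data sample with an independent noise draw, the noise batch is permuted to minimize the total squared distance to the data batch, which straightens the transport paths the model must learn. This toy version brute-forces the permutation; practical implementations use the Hungarian algorithm, and the function name is illustrative.

```python
import numpy as np
from itertools import permutations

def minibatch_ot_coupling(noise, data):
    # Pair each data sample with the "closest" noise sample under the
    # best batch permutation (minibatch optimal-transport coupling).
    n = len(data)
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)  # [n, n]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[p[j], j] for j in range(n)))
    return noise[list(best)], data  # re-ordered noise paired with data
```

With straighter paths, the velocity field varies less along each trajectory, so the ODE solver needs fewer steps at inference time, which is where the efficiency gain comes from.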

Experimental Setup

Model

MELODYFLOW uses a Diffusion Transformer with 400M or 1B parameters, conditioned on T5 text embeddings, and is trained on music datasets in both stereo and mono configurations for diverse applications.

Generation and Editing

Text-to-music generation uses an ODE solver, and editing applies ReNoise-style latent inversion before re-sampling conditioned on the target text prompt.
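The generation half of the pipeline can be sketched as integrating the learned flow ODE from Gaussian noise (t=0) to a latent (t=1), which the VAE would then decode to audio. A midpoint solver is used here as an example of a fixed-step ODE scheme; `velocity(x, t, prompt)` again stands in for the text-conditioned transformer, and the step count is an assumption.

```python
import numpy as np

def generate(velocity, prompt, shape, steps=32, rng=None):
    # Text-to-music generation sketch: noise -> latent via a midpoint
    # ODE solver; the resulting latent would be decoded by the VAE.
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        k = velocity(x, t, prompt)                           # slope at t
        x_mid = x + 0.5 * dt * k                             # half step
        x = x + dt * velocity(x_mid, t + 0.5 * dt, prompt)   # midpoint update
    return x
```

Higher-order fixed-step solvers like the midpoint rule reach a given accuracy in fewer function evaluations than plain Euler, which matters because each evaluation is a full transformer forward pass.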
