From Prompt to Mix: How Conversational Audio Editing Works

May 15, 2026 · 6 min read

You type what you want. The AI figures out which audio operations to run, in which order, and executes them against your session. Here is exactly how that translation happens.

When you type "boost the vocals 3 dB and add a subtle reverb" into an AI audio editor, a lot happens between that sentence and the changed waveform. Understanding the architecture makes you a better user — you learn what kinds of prompts work well, what the agent cannot do, and how to recover when it misinterprets your intent.

The Tool-Use Model

Modern AI audio editors work by giving a large language model a set of tools — functions it can call to manipulate the audio session. These tools correspond to discrete audio operations: load a file, cut a region, adjust gain, apply a plugin, normalize loudness, run stem separation, render to disk.

When you send a message, the LLM reads your instruction together with the current session state (what tracks exist, what the timeline looks like, what operations have already been applied), then decides which tool calls to make and with which arguments.

edytlab exposes tools such as load_audio, cut_region, set_gain, normalize, stem_separate, transcribe, and render_range. For "remove the silence at the beginning and boost the bass", the LLM's plan might be: cut_region(track=1, start=0, end=1.2) → set_gain(track=1, region=bass_frequency_band, db=+4).
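
As a rough illustration of the tool-use pattern, the sketch below registers two tools and runs a plan shaped like the one above. It is not edytlab's implementation: the registry, the dictionary-based session, and the plan format are assumptions made for this example, and the frequency-band argument to set_gain is left out to keep it short.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory session: track number -> list of applied operations.
Session = Dict[int, List[dict]]

# Registry of callable tools, keyed by the name the model would emit.
TOOLS: Dict[str, Callable[..., None]] = {}


def tool(fn: Callable[..., None]) -> Callable[..., None]:
    """Register a function under its own name so the dispatcher can find it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def cut_region(session: Session, track: int, start: float, end: float) -> None:
    # Record the cut instead of touching audio; this sketch has no real samples.
    session.setdefault(track, []).append(
        {"op": "cut_region", "start": start, "end": end}
    )


@tool
def set_gain(session: Session, track: int, db: float) -> None:
    session.setdefault(track, []).append({"op": "set_gain", "db": db})


def run_plan(session: Session, plan: List[dict]) -> None:
    """Execute tool calls in order; unknown tool names are rejected, not guessed."""
    for call in plan:
        name = call["name"]
        if name not in TOOLS:
            raise ValueError(f"unknown tool: {name}")
        TOOLS[name](session, **call["arguments"])


# A plan in the shape a model might return for the prompt above (assumed format).
plan = [
    {"name": "cut_region", "arguments": {"track": 1, "start": 0.0, "end": 1.2}},
    {"name": "set_gain", "arguments": {"track": 1, "db": 4.0}},
]
session: Session = {}
run_plan(session, plan)
```

Keeping the model's output as plain data (a tool name plus arguments) means the editor, not the model, decides what actually runs, and a misspelled or invented tool name fails loudly.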

Session State as Context

The LLM does not receive only your text. It also receives a structured representation of the current session: which tracks exist, their durations, current gain levels, any applied effects, the playback cursor position, and the undo history. This context lets the model make edits that reference previous operations ("undo the last normalization and try -14 LUFS instead").
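
One way to picture that structured representation is as a small JSON document sent alongside your message. The field names below (tracks, cursor_s, history, and so on) and the sample values are assumptions for illustration only; the article does not specify edytlab's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Track:
    index: int
    duration_s: float
    gain_db: float = 0.0
    effects: List[str] = field(default_factory=list)


@dataclass
class SessionState:
    tracks: List[Track]
    cursor_s: float = 0.0
    history: List[str] = field(default_factory=list)


def to_context(state: SessionState) -> str:
    """Serialize the session so it can be placed in the model's context."""
    return json.dumps(asdict(state), indent=2)


# Illustrative session: one track that has already been normalized.
state = SessionState(
    tracks=[
        Track(index=1, duration_s=184.5, gain_db=-2.0,
              effects=["normalize(-16 LUFS)"])
    ],
    cursor_s=12.0,
    history=["load_audio", "normalize"],
)
context = to_context(state)
```

With the history in context, a follow-up such as "undo the last normalization" has something concrete to refer to: the last entry in the history list.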

The Role of the DAG (Directed Acyclic Graph)

Each operation the agent performs creates a new node in a session graph. Each node points to its parent state. This means every edit is non-destructive: the original audio data is never modified. Asking the agent to "revert to before the reverb" just moves the session pointer back up the graph.

Branch: create a fork of the session to try a different arrangement without losing the current one.
Compare: A/B between two branch nodes to decide which mix sounds better.
Revert: jump to any earlier state by navigating the graph.
Merge: take the best elements of two branches into a new node.
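
The graph operations above can be sketched with a few lines of code. This is a simplified model under stated assumptions, not edytlab's data structure: nodes store only an operation label and a parent ID, branching is simply "apply after revert", and compare and merge are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    node_id: int
    parent: Optional[int]
    op: str


class SessionGraph:
    def __init__(self) -> None:
        # The root node stands for the untouched source audio.
        self.nodes: Dict[int, Node] = {0: Node(0, None, "load_audio")}
        self.head = 0

    def apply(self, op: str) -> int:
        """Add a child of the current head and move the head to it."""
        node_id = len(self.nodes)
        self.nodes[node_id] = Node(node_id, self.head, op)
        self.head = node_id
        return node_id

    def revert(self, node_id: int) -> None:
        """Move the head to an earlier node; nothing is deleted."""
        if node_id not in self.nodes:
            raise KeyError(node_id)
        self.head = node_id

    def history(self) -> List[str]:
        """Operations from the root to the current head, oldest first."""
        ops: List[str] = []
        cur: Optional[int] = self.head
        while cur is not None:
            node = self.nodes[cur]
            ops.append(node.op)
            cur = node.parent
        return list(reversed(ops))


g = SessionGraph()
gain = g.apply("set_gain(+3 dB)")
reverb = g.apply("reverb(subtle)")
g.revert(gain)                      # "revert to before the reverb"
delay = g.apply("delay(1/8 note)")  # applying after a revert forks a branch
```

After these calls the head sits on the delay branch, while the reverb node is still in the graph and can be reached again with another revert.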

Multi-Step Planning

Complex requests like "make this sound like a 1970s soul record" require the LLM to plan a sequence of operations: rolling off the high frequencies (a low-pass filter at about 12 kHz), adding vinyl noise (a noise generator at -40 dB), compressing with a slow attack (for a warm transient feel), and reducing the stereo width. A capable model will decompose this into the correct tool chain and execute each step in order.
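
A multi-step plan is easier to trust if it is checked before anything runs. The sketch below validates every step against a table of known tools and required arguments. The tool names (low_pass, add_noise, compress, set_stereo_width) are hypothetical and do not appear in edytlab's tool list above, and the attack, ratio, and width values are placeholders rather than recommended settings.

```python
from typing import Dict, List, Set

# Hypothetical tool schemas: tool name -> required argument names.
SCHEMAS: Dict[str, Set[str]] = {
    "low_pass": {"track", "cutoff_hz"},
    "add_noise": {"track", "level_db"},
    "compress": {"track", "attack_ms", "ratio"},
    "set_stereo_width": {"track", "width"},
}


def validate_plan(plan: List[dict]) -> List[str]:
    """Return a list of problems; an empty list means the plan can run."""
    errors: List[str] = []
    for i, step in enumerate(plan):
        name = step.get("name")
        if name not in SCHEMAS:
            errors.append(f"step {i}: unknown tool {name!r}")
            continue
        missing = SCHEMAS[name] - set(step.get("arguments", {}))
        if missing:
            errors.append(f"step {i}: {name} missing {sorted(missing)}")
    return errors


# The "1970s soul record" request as an ordered tool chain (illustrative values).
soul_plan = [
    {"name": "low_pass", "arguments": {"track": 1, "cutoff_hz": 12000}},
    {"name": "add_noise", "arguments": {"track": 1, "level_db": -40}},
    {"name": "compress", "arguments": {"track": 1, "attack_ms": 30, "ratio": 3.0}},
    {"name": "set_stereo_width", "arguments": {"track": 1, "width": 0.7}},
]
problems = validate_plan(soul_plan)
```

Validating the whole chain up front catches the failure mode where a model gets the first step right and then drifts, before a half-finished plan has changed the session.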

When Prompts Are Ambiguous

"Boost the bass" is ambiguous: which track? How much? What frequency? The agent will make a reasonable default (the first track with audio, +3 dB, shelf below 200 Hz) and tell you what it did. If that is not what you wanted, you can correct it in natural language: "not that track — the second one, and just +2 dB".

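The default-filling behavior described for "boost the bass" can be sketched as a function that fills in whatever the prompt left out and reports each assumption. The function name, argument names, and return shape are hypothetical; only the default values (first track with audio, +3 dB, shelf below 200 Hz) come from the text above.

```python
from typing import List, Optional, Tuple

DEFAULT_GAIN_DB = 3.0
DEFAULT_SHELF_HZ = 200.0


def resolve_bass_boost(
    track_has_audio: List[bool],
    track: Optional[int] = None,
    db: Optional[float] = None,
    shelf_hz: Optional[float] = None,
) -> Tuple[dict, List[str]]:
    """Fill unspecified arguments with defaults and list what was assumed."""
    assumed: List[str] = []
    if track is None:
        # Tracks are numbered from 1, matching the examples in this article.
        candidates = [i for i, has in enumerate(track_has_audio, start=1) if has]
        if not candidates:
            raise ValueError("no track with audio")
        track = candidates[0]
        assumed.append(f"track={track} (first track with audio)")
    if db is None:
        db = DEFAULT_GAIN_DB
        assumed.append(f"db={db} (default)")
    if shelf_hz is None:
        shelf_hz = DEFAULT_SHELF_HZ
        assumed.append(f"shelf_hz={shelf_hz} (default)")
    return {"track": track, "db": db, "shelf_hz": shelf_hz}, assumed


# "Boost the bass": nothing specified, so everything is a reported default.
args, assumed = resolve_bass_boost([False, True])

# The correction: second track, +2 dB. Only the shelf frequency stays assumed.
fixed_args, fixed_assumed = resolve_bass_boost([True, True], track=2, db=2.0)
```

Returning the list of assumptions next to the arguments is what lets the agent tell you what it did, so your correction only needs to name the parts that were wrong.
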
Choosing Your LLM for Audio Agent Tasks

Not all LLMs perform equally well at multi-step audio planning. Models with strong function-calling support (Claude 3.7 Sonnet, GPT-4o, Mistral Large via OpenRouter) reliably decompose complex audio instructions into correct tool chains. Smaller models may execute the first tool correctly but lose track of the plan on longer chains. edytlab lets you swap providers without reinstalling — you can test which model works best for your workflow.

The Feedback Loop

The most effective conversational editing workflow is iterative. Make a rough cut with a broad prompt, listen back, then refine with specific corrections. The session graph captures every iteration, so you are never locked into a direction. Treat the AI agent like a skilled but literal engineer: it executes exactly what you describe, so precision in language produces precision in the edit.

edytlab is an open-source, local-first AI audio editor. Download the latest release or star it on GitHub.