Could you please explain how to precisely adjust the style transformation of the reference audio? #209
Replies: 4 comments 2 replies
-
ACE-Step 1.5 — Style Transformation Guide

Overview

This guide explains how to precisely adjust the style transformation of reference audio in ACE-Step 1.5. The primary use case is retaining the structural sequence of a provided reference audio while transforming it into a completely different style of music (e.g., country music to heavy metal rock).

Use the Cover Task (Not Repaint)

ACE-Step provides two audio-to-audio tasks that are often confused: repaint, which regenerates a masked region of the audio while leaving the rest intact, and cover, which re-renders the entire song. For full-song style transformation, always use the cover task.

Key Parameter: audio_cover_strength

The `audio_cover_strength` parameter controls how closely the output follows the reference:
| Strength | Effect | When to Use |
|---|---|---|
| 0.8 - 1.0 | Closely follows original structure | Subtle genre shifts (country to folk) |
| 0.5 - 0.7 | Balanced — recognizable but transformed | Moderate changes (pop to jazz) |
| 0.2 - 0.4 | Loose interpretation, major style shift | Radical reimagining (country to heavy metal) |
| 0.0 | Ignores reference entirely | Pure text-to-music generation |
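The strength bands above can be encoded as a simple lookup when scripting batches of covers. `pick_cover_strength` is a hypothetical convenience helper, not part of the ACE-Step API:

```python
def pick_cover_strength(change: str) -> float:
    """Map a desired degree of style change to the midpoint of the
    audio_cover_strength bands from the table above (hypothetical helper)."""
    bands = {
        "subtle": (0.8, 1.0),    # e.g. country to folk
        "moderate": (0.5, 0.7),  # e.g. pop to jazz
        "radical": (0.2, 0.4),   # e.g. country to heavy metal
        "none": (0.0, 0.0),      # pure text-to-music
    }
    lo, hi = bands[change]
    return round((lo + hi) / 2, 2)
```

Using the midpoint is just a starting point; the tuning workflow later in this guide refines it.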
How It Works Internally
The cover task uses a dual text embedding system:
- The source audio is encoded into latent space (`src_latents`) — this preserves the structural skeleton.
- Two text embeddings are generated:
  - Cover embedding: uses the instruction `"Generate audio semantic tokens based on the given conditions:"` — structure-preserving.
  - Non-cover embedding: uses the instruction `"Fill the audio semantic mask based on the given conditions:"` — creative/text-driven.
- During diffusion, `audio_cover_strength` blends between them:

  final = strength * cover_embedding + (1 - strength) * text_embedding
Lower strength gives the model more freedom to follow your text caption and less obligation to the source audio's structure.
Implementation reference: acestep/handler.py lines 1937-1968 (dual text encoding), lines 2257-2323 (inference pipeline).
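In numeric terms the blend is a plain linear interpolation between the two embeddings. A minimal NumPy sketch (the real implementation in `acestep/handler.py` operates on model tensors, not arrays):

```python
import numpy as np

def blend_text_embeddings(cover_emb: np.ndarray,
                          noncover_emb: np.ndarray,
                          strength: float) -> np.ndarray:
    """Linearly interpolate between the structure-preserving (cover)
    and creative (non-cover) text embeddings."""
    return strength * cover_emb + (1.0 - strength) * noncover_emb

# At strength=1.0 the result equals the cover embedding (max structure);
# at strength=0.0 it equals the non-cover embedding (pure text-driven).
```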
Writing the Caption (Text Prompt)
The caption tells the model what the target style should be. For effective style transformation, be specific about four dimensions:
- Target genre: the overarching style label
- Instrumentation: which instruments define the target genre
- Vocal style: how the vocals should sound
- Energy and mood: the emotional character
Example: Country to Heavy Metal
heavy metal rock with heavily distorted electric guitars,
aggressive double bass drumming, powerful screaming vocals,
fast tempo, high energy, intense and dark atmosphere
Example: Pop to Jazz
smooth jazz arrangement with piano trio, upright bass,
brushed drums, warm saxophone, relaxed swing feel,
intimate and sophisticated atmosphere
Example: Acoustic to Electronic
electronic dance music version with deep synthesizer bass,
driving four-on-the-floor kick drum, bright arpeggiated synths,
high energy, euphoric and uplifting atmosphere
Caption Tips
- Describe the target style, not the source. The model already knows the source from the audio.
- You can provide new lyrics in the lyrics field to change vocal content, or leave them empty to let the model adapt the original vocal structure.
- More specific captions produce more predictable results. "rock" is vague; "hard rock with crunchy rhythm guitars and soaring lead guitar solos" is actionable.
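The four dimensions can be assembled mechanically. `build_caption` below is a hypothetical convenience; the model only ever sees the final comma-separated string:

```python
def build_caption(genre: str, instrumentation: list[str],
                  vocals: str, mood: str) -> str:
    """Join the four caption dimensions into one comma-separated prompt."""
    return ", ".join([genre, *instrumentation, vocals, mood])

caption = build_caption(
    genre="heavy metal rock",
    instrumentation=["heavily distorted electric guitars",
                     "aggressive double bass drumming"],
    vocals="powerful screaming vocals",
    mood="fast tempo, high energy, intense and dark atmosphere",
)
```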
Diffusion Parameters for Style Transformation
These parameters control the quality and fidelity of the diffusion process:
| Parameter | Recommended Value | Purpose |
|---|---|---|
| `guidance_scale` | 8.0 - 10.0 | Higher values make the output follow your text caption more strictly. For drastic style changes, push this higher. |
| `inference_steps` | 24 - 32 | More steps produce smoother, higher-quality results. Important for large genre jumps. |
| `shift` | 3.0 | Adjusts the timestep distribution toward later denoising steps for quality. Recommended for the turbo model. |
| `infer_method` | `"ode"` | Deterministic generation. Use `"sde"` for more variation but less consistency. |
| `cfg_interval_start` | 0.0 | Apply text guidance from the very start of diffusion. |
| `cfg_interval_end` | 0.95 | Drop guidance at the very end for more natural finishing. |
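To make `cfg_interval_end=0.95` concrete: with 28 inference steps, guidance covers roughly the first 95% of the denoising trajectory. A quick sanity check, assuming the interval is a simple fraction of the step count (the exact schedule lives in the diffusion code):

```python
def guided_steps(total_steps: int, start: float, end: float) -> int:
    """Approximate number of denoising steps that receive classifier-free
    guidance, assuming the interval is a fraction of the step count."""
    return int(total_steps * end) - int(total_steps * start)

print(guided_steps(28, 0.0, 0.95))  # 26 of 28 steps guided
```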
Parameter Relationships
- Style fidelity to original ∝ `audio_cover_strength`
- Text prompt adherence ∝ `guidance_scale`
- Output quality ∝ `inference_steps`, `shift`
These interact: if you lower audio_cover_strength to allow more style change, you may want to raise guidance_scale to ensure the model follows your target style description closely.
Complete Code Example: Country to Heavy Metal
```python
from acestep.inference import GenerationParams, GenerationConfig, generate_music

params = GenerationParams(
    task_type="cover",
    src_audio="country_song.mp3",
    caption=(
        "heavy metal rock with heavily distorted electric guitars, "
        "aggressive double bass drumming, powerful screaming vocals, "
        "fast tempo, high energy, intense dark atmosphere"
    ),
    audio_cover_strength=0.4,  # Low = more freedom for style change
    inference_steps=28,
    guidance_scale=9.0,
    shift=3.0,
    cfg_interval_start=0.0,
    cfg_interval_end=0.95,
    infer_method="ode",
    seed=42,
)

config = GenerationConfig(batch_size=1, audio_format="wav")
result = generate_music(dit_handler, llm_handler, params, config, save_dir="./output")
```

Using the Gradio UI
- Switch to the Cover tab (or select "cover" as the task type).
- Upload your reference audio file.
- Set LM Codes Strength slider to 0.3 - 0.5 for dramatic genre changes.
- Write a detailed caption describing the target genre.
- Adjust Guidance Scale to 8-10.
- Set Inference Steps to 24-32.
- Click Generate.
Using the API
```bash
curl -X POST http://localhost:8001/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "cover",
    "src_audio": "/path/to/country_song.mp3",
    "caption": "heavy metal rock with distorted guitars and aggressive drums",
    "audio_cover_strength": 0.4,
    "guidance_scale": 9.0,
    "inference_steps": 28
  }'
```

Tuning Strategy
Follow this iterative workflow to dial in the transformation:
Step 1: Baseline
Start with audio_cover_strength=0.5 and guidance_scale=8.0. Listen to the result.
Step 2: Adjust Structure Adherence
- Output still sounds too much like the original genre? Lower `audio_cover_strength` to 0.3.
- Song structure is unrecognizable? Raise `audio_cover_strength` to 0.6.
Step 3: Adjust Text Adherence
- Target genre elements are not strong enough? Raise `guidance_scale` to 10.0+.
- Output sounds artificial or over-processed? Lower `guidance_scale` to 7.0.
Step 4: Explore Seeds
Generate multiple outputs with different seeds (seed=1, 2, 3, ...). Results vary significantly across seeds, and some will capture the transformation better than others.
Step 5: Refine Caption
If specific elements are missing (e.g., "I want more distortion"), add them explicitly to the caption. The model responds well to concrete instrumental and timbral descriptions.
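Steps 2 and 3 are two independent feedback loops. A hypothetical sketch of the adjustment logic, where the deltas are the jumps suggested above (not ACE-Step API behavior):

```python
def adjust(strength: float, guidance: float,
           too_much_original: bool, structure_lost: bool,
           target_too_weak: bool, over_processed: bool):
    """Apply the Step 2/3 tuning rules from the workflow above."""
    # Step 2: structure adherence via audio_cover_strength
    if too_much_original:
        strength = max(0.0, strength - 0.2)   # e.g. 0.5 -> 0.3
    elif structure_lost:
        strength = min(1.0, strength + 0.1)   # e.g. 0.5 -> 0.6
    # Step 3: text adherence via guidance_scale
    if target_too_weak:
        guidance = min(12.0, guidance + 2.0)  # e.g. 8.0 -> 10.0
    elif over_processed:
        guidance = max(5.0, guidance - 1.0)   # e.g. 8.0 -> 7.0
    return round(strength, 2), round(guidance, 1)
```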
Recommended Starting Points by Genre Transformation
| Source Genre | Target Genre | Strength | Guidance Scale | Caption Focus |
|---|---|---|---|---|
| Country | Heavy Metal | 0.3 - 0.4 | 9.0 - 10.0 | Distorted guitars, double bass drums, screaming vocals |
| Pop | Jazz | 0.5 - 0.7 | 7.0 - 8.0 | Piano, upright bass, brushed drums, swing feel |
| Rock | Electronic | 0.3 - 0.5 | 8.0 - 9.0 | Synthesizers, four-on-the-floor kick, arpeggios |
| Classical | Hip-Hop | 0.2 - 0.4 | 9.0 - 10.0 | 808 bass, trap hi-hats, rap vocals, boom-bap |
| Folk | R&B | 0.4 - 0.6 | 8.0 - 9.0 | Smooth vocals, neo-soul keys, warm bass, groove |
| Acoustic | Orchestral | 0.6 - 0.8 | 7.0 - 8.0 | Strings, brass, woodwinds, full orchestra, dynamic |
General rule: The more different the source and target genres are, the lower the audio_cover_strength should be.
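If you script many transformations, the table above can live as data next to your generation code. This is a hypothetical lookup using the midpoints of the recommended ranges:

```python
# Midpoints of the recommended ranges from the table above (hypothetical lookup).
STARTING_POINTS = {
    ("country", "heavy metal"): {"strength": 0.35, "guidance_scale": 9.5},
    ("pop", "jazz"):            {"strength": 0.6,  "guidance_scale": 7.5},
    ("rock", "electronic"):     {"strength": 0.4,  "guidance_scale": 8.5},
    ("classical", "hip-hop"):   {"strength": 0.3,  "guidance_scale": 9.5},
    ("folk", "r&b"):            {"strength": 0.5,  "guidance_scale": 8.5},
    ("acoustic", "orchestral"): {"strength": 0.7,  "guidance_scale": 7.5},
}

def starting_point(source: str, target: str) -> dict:
    """Look up a recommended starting point for a genre transformation."""
    return STARTING_POINTS[(source.lower(), target.lower())]
```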
Technical Notes
- LM is skipped for cover tasks by default. The source audio replaces the LM's structural planning role. You can optionally enable `thinking=True` and `use_cot_metas=True` for LM-assisted metadata detection.
- Source latents (`src_latents`) are cloned directly from the encoded reference audio, preserving its full structure in latent space. The diffusion process then transforms style while these latents anchor the structure.
- The cover task instruction differs from text-to-music: it uses `"Generate audio semantic tokens based on the given conditions:"` rather than `"Fill the audio semantic mask based on the given conditions:"`, which changes how the DiT model interprets the input.
- Audio is encoded at 48 kHz with latent frames at ~25 Hz (1920 audio samples per latent frame).
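The 48 kHz / 1920-samples-per-frame figures give exactly 25 latent frames per second, so audio duration maps linearly to latent length:

```python
SAMPLE_RATE = 48_000       # Hz, per the note above
SAMPLES_PER_FRAME = 1_920  # audio samples per latent frame

frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME  # 25.0

def latent_frames(duration_s: float) -> int:
    """Number of latent frames for a clip of the given duration."""
    return int(duration_s * frames_per_second)

print(latent_frames(180))  # a 3-minute song -> 4500 latent frames
```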
Key Source Files
| File | Relevance |
|---|---|
| `acestep/handler.py` | Core cover task implementation, dual text encoding, diffusion pipeline |
| `acestep/inference.py` | `GenerationParams` definition, task routing, LM skip logic |
| `acestep/constants.py` | Task instruction strings (`TASK_INSTRUCTIONS` dict) |
| `acestep/gradio_ui/interfaces/generation.py` | UI slider definitions for cover parameters |
| `docs/en/INFERENCE.md` | API documentation with cover examples |
| `docs/en/Tutorial.md` | User-facing tutorial with cover usage patterns |
-
With `audio_cover_strength=0.99` the output sounds absolutely nothing like the input (melody-wise) using `cover`. What's the point of the parameter? If it doesn't preserve the melody, what is the point of the whole cover functionality?
-
I've had some success with drastic genre-covers by using an intermediate step.
-
I want to retain the sequence of the provided reference audio and transform it into a completely different style of music. How to adjust the parameters and describe it, for example, changing from country music to heavy metal rock.