Could you please explain how to precisely adjust the style transformation of the reference audio? #209
Replies: 4 comments 2 replies
-
ACE-Step 1.5 — Style Transformation Guide

Overview

This guide explains how to precisely adjust the style transformation of reference audio in ACE-Step 1.5. The primary use case is retaining the structural sequence of a provided reference audio while transforming it into a completely different style of music (e.g., country music to heavy metal rock).

Use the Cover Task (Not Repaint)

ACE-Step provides two audio-to-audio tasks that are often confused: repaint, which regenerates a masked region of the audio while leaving the rest intact, and cover, which re-renders the entire song. For full-song style transformation, always use the cover task.

Key Parameter: audio_cover_strength

The `audio_cover_strength` parameter controls how closely the output follows the reference:
| Strength | Effect | When to Use |
|---|---|---|
| 0.8 - 1.0 | Closely follows original structure | Subtle genre shifts (country to folk) |
| 0.5 - 0.7 | Balanced — recognizable but transformed | Moderate changes (pop to jazz) |
| 0.2 - 0.4 | Loose interpretation, major style shift | Radical reimagining (country to heavy metal) |
| 0.0 | Ignores reference entirely | Pure text-to-music generation |
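The strength bands above can be encoded as a simple lookup when scripting batches of covers. `pick_cover_strength` is a hypothetical convenience helper, not part of the ACE-Step API:

```python
def pick_cover_strength(change: str) -> float:
    """Map a desired degree of style change to the midpoint of the
    audio_cover_strength bands from the table above (hypothetical helper)."""
    bands = {
        "subtle": (0.8, 1.0),    # e.g. country to folk
        "moderate": (0.5, 0.7),  # e.g. pop to jazz
        "radical": (0.2, 0.4),   # e.g. country to heavy metal
        "none": (0.0, 0.0),      # pure text-to-music
    }
    lo, hi = bands[change]
    return round((lo + hi) / 2, 2)
```

Using the midpoint is just a starting point; the tuning workflow later in this guide refines it.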
How It Works Internally
The cover task uses a dual text embedding system:
- The source audio is encoded into latent space (`src_latents`) — this preserves the structural skeleton.
- Two text embeddings are generated:
  - Cover embedding: uses the instruction `"Generate audio semantic tokens based on the given conditions:"` — structure-preserving.
  - Non-cover embedding: uses the instruction `"Fill the audio semantic mask based on the given conditions:"` — creative/text-driven.
- During diffusion, `audio_cover_strength` blends between them:

  final = strength * cover_embedding + (1 - strength) * text_embedding
Lower strength gives the model more freedom to follow your text caption and less obligation to the source audio's structure.
Implementation reference: acestep/handler.py lines 1937-1968 (dual text encoding), lines 2257-2323 (inference pipeline).
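In numeric terms the blend is a plain linear interpolation between the two embeddings. A minimal NumPy sketch (the real implementation in `acestep/handler.py` operates on model tensors, not arrays):

```python
import numpy as np

def blend_text_embeddings(cover_emb: np.ndarray,
                          noncover_emb: np.ndarray,
                          strength: float) -> np.ndarray:
    """Linearly interpolate between the structure-preserving (cover)
    and creative (non-cover) text embeddings."""
    return strength * cover_emb + (1.0 - strength) * noncover_emb

# At strength=1.0 the result equals the cover embedding (max structure);
# at strength=0.0 it equals the non-cover embedding (pure text-driven).
```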
Writing the Caption (Text Prompt)
The caption tells the model what the target style should be. For effective style transformation, be specific about four dimensions:
- Target genre: the overarching style label
- Instrumentation: which instruments define the target genre
- Vocal style: how the vocals should sound
- Energy and mood: the emotional character
Example: Country to Heavy Metal
heavy metal rock with heavily distorted electric guitars,
aggressive double bass drumming, powerful screaming vocals,
fast tempo, high energy, intense and dark atmosphere
Example: Pop to Jazz
smooth jazz arrangement with piano trio, upright bass,
brushed drums, warm saxophone, relaxed swing feel,
intimate and sophisticated atmosphere
Example: Acoustic to Electronic
electronic dance music version with deep synthesizer bass,
driving four-on-the-floor kick drum, bright arpeggiated synths,
high energy, euphoric and uplifting atmosphere
Caption Tips
- Describe the target style, not the source. The model already knows the source from the audio.
- You can provide new lyrics in the lyrics field to change vocal content, or leave them empty to let the model adapt the original vocal structure.
- More specific captions produce more predictable results. "rock" is vague; "hard rock with crunchy rhythm guitars and soaring lead guitar solos" is actionable.
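The four dimensions can be assembled mechanically. `build_caption` below is a hypothetical convenience; the model only ever sees the final comma-separated string:

```python
def build_caption(genre: str, instrumentation: list[str],
                  vocals: str, mood: str) -> str:
    """Join the four caption dimensions into one comma-separated prompt."""
    return ", ".join([genre, *instrumentation, vocals, mood])

caption = build_caption(
    genre="heavy metal rock",
    instrumentation=["heavily distorted electric guitars",
                     "aggressive double bass drumming"],
    vocals="powerful screaming vocals",
    mood="fast tempo, high energy, intense and dark atmosphere",
)
```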
Diffusion Parameters for Style Transformation
These parameters control the quality and fidelity of the diffusion process:
| Parameter | Recommended Value | Purpose |
|---|---|---|
| `guidance_scale` | 8.0 - 10.0 | Higher values make the output follow your text caption more strictly. For drastic style changes, push this higher. |
| `inference_steps` | 24 - 32 | More steps produce smoother, higher-quality results. Important for large genre jumps. |
| `shift` | 3.0 | Adjusts the timestep distribution toward later denoising steps for quality. Recommended for the turbo model. |
| `infer_method` | `"ode"` | Deterministic generation. Use `"sde"` for more variation but less consistency. |
| `cfg_interval_start` | 0.0 | Apply text guidance from the very start of diffusion. |
| `cfg_interval_end` | 0.95 | Drop guidance at the very end for more natural finishing. |
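To make `cfg_interval_end=0.95` concrete: with 28 inference steps, guidance covers roughly the first 95% of the denoising trajectory. A quick sanity check, assuming the interval is a simple fraction of the step count (the exact schedule lives in the diffusion code):

```python
def guided_steps(total_steps: int, start: float, end: float) -> int:
    """Approximate number of denoising steps that receive classifier-free
    guidance, assuming the interval is a fraction of the step count."""
    return int(total_steps * end) - int(total_steps * start)

print(guided_steps(28, 0.0, 0.95))  # 26 of 28 steps guided
```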
Parameter Relationships
- Style fidelity to original ∝ `audio_cover_strength`
- Text prompt adherence ∝ `guidance_scale`
- Output quality ∝ `inference_steps`, `shift`
These interact: if you lower audio_cover_strength to allow more style change, you may want to raise guidance_scale to ensure the model follows your target style description closely.
Complete Code Example: Country to Heavy Metal
```python
from acestep.inference import GenerationParams, GenerationConfig, generate_music

params = GenerationParams(
    task_type="cover",
    src_audio="country_song.mp3",
    caption=(
        "heavy metal rock with heavily distorted electric guitars, "
        "aggressive double bass drumming, powerful screaming vocals, "
        "fast tempo, high energy, intense dark atmosphere"
    ),
    audio_cover_strength=0.4,  # Low = more freedom for style change
    inference_steps=28,
    guidance_scale=9.0,
    shift=3.0,
    cfg_interval_start=0.0,
    cfg_interval_end=0.95,
    infer_method="ode",
    seed=42,
)

config = GenerationConfig(batch_size=1, audio_format="wav")
result = generate_music(dit_handler, llm_handler, params, config, save_dir="./output")
```

Using the Gradio UI
- Switch to the Cover tab (or select "cover" as the task type).
- Upload your reference audio file.
- Set LM Codes Strength slider to 0.3 - 0.5 for dramatic genre changes.
- Write a detailed caption describing the target genre.
- Adjust Guidance Scale to 8-10.
- Set Inference Steps to 24-32.
- Click Generate.
Using the API
```bash
curl -X POST http://localhost:8001/v1/audio/generations \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "cover",
    "src_audio": "/path/to/country_song.mp3",
    "caption": "heavy metal rock with distorted guitars and aggressive drums",
    "audio_cover_strength": 0.4,
    "guidance_scale": 9.0,
    "inference_steps": 28
  }'
```

Tuning Strategy
Follow this iterative workflow to dial in the transformation:
Step 1: Baseline
Start with audio_cover_strength=0.5 and guidance_scale=8.0. Listen to the result.
Step 2: Adjust Structure Adherence
- Output still sounds too much like the original genre? Lower `audio_cover_strength` to 0.3.
- Song structure is unrecognizable? Raise `audio_cover_strength` to 0.6.
Step 3: Adjust Text Adherence
- Target genre elements are not strong enough? Raise `guidance_scale` to 10.0+.
- Output sounds artificial or over-processed? Lower `guidance_scale` to 7.0.
Step 4: Explore Seeds
Generate multiple outputs with different seeds (seed=1, 2, 3, ...). Results vary significantly across seeds, and some will capture the transformation better than others.
Step 5: Refine Caption
If specific elements are missing (e.g., "I want more distortion"), add them explicitly to the caption. The model responds well to concrete instrumental and timbral descriptions.
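Steps 2 and 3 are two independent feedback loops. A hypothetical sketch of the adjustment logic, where the deltas are the jumps suggested above (not ACE-Step API behavior):

```python
def adjust(strength: float, guidance: float,
           too_much_original: bool, structure_lost: bool,
           target_too_weak: bool, over_processed: bool):
    """Apply the Step 2/3 tuning rules from the workflow above."""
    # Step 2: structure adherence via audio_cover_strength
    if too_much_original:
        strength = max(0.0, strength - 0.2)   # e.g. 0.5 -> 0.3
    elif structure_lost:
        strength = min(1.0, strength + 0.1)   # e.g. 0.5 -> 0.6
    # Step 3: text adherence via guidance_scale
    if target_too_weak:
        guidance = min(12.0, guidance + 2.0)  # e.g. 8.0 -> 10.0
    elif over_processed:
        guidance = max(5.0, guidance - 1.0)   # e.g. 8.0 -> 7.0
    return round(strength, 2), round(guidance, 1)
```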
Recommended Starting Points by Genre Transformation
| Source Genre | Target Genre | Strength | Guidance Scale | Caption Focus |
|---|---|---|---|---|
| Country | Heavy Metal | 0.3 - 0.4 | 9.0 - 10.0 | Distorted guitars, double bass drums, screaming vocals |
| Pop | Jazz | 0.5 - 0.7 | 7.0 - 8.0 | Piano, upright bass, brushed drums, swing feel |
| Rock | Electronic | 0.3 - 0.5 | 8.0 - 9.0 | Synthesizers, four-on-the-floor kick, arpeggios |
| Classical | Hip-Hop | 0.2 - 0.4 | 9.0 - 10.0 | 808 bass, trap hi-hats, rap vocals, boom-bap |
| Folk | R&B | 0.4 - 0.6 | 8.0 - 9.0 | Smooth vocals, neo-soul keys, warm bass, groove |
| Acoustic | Orchestral | 0.6 - 0.8 | 7.0 - 8.0 | Strings, brass, woodwinds, full orchestra, dynamic |
General rule: The more different the source and target genres are, the lower the audio_cover_strength should be.
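If you script many transformations, the table above can live as data next to your generation code. This is a hypothetical lookup using the midpoints of the recommended ranges:

```python
# Midpoints of the recommended ranges from the table above (hypothetical lookup).
STARTING_POINTS = {
    ("country", "heavy metal"): {"strength": 0.35, "guidance_scale": 9.5},
    ("pop", "jazz"):            {"strength": 0.6,  "guidance_scale": 7.5},
    ("rock", "electronic"):     {"strength": 0.4,  "guidance_scale": 8.5},
    ("classical", "hip-hop"):   {"strength": 0.3,  "guidance_scale": 9.5},
    ("folk", "r&b"):            {"strength": 0.5,  "guidance_scale": 8.5},
    ("acoustic", "orchestral"): {"strength": 0.7,  "guidance_scale": 7.5},
}

def starting_point(source: str, target: str) -> dict:
    """Look up a recommended starting point for a genre transformation."""
    return STARTING_POINTS[(source.lower(), target.lower())]
```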
Technical Notes
- LM is skipped for cover tasks by default. The source audio replaces the LM's structural planning role. You can optionally enable `thinking=True` and `use_cot_metas=True` for LM-assisted metadata detection.
- Source latents (`src_latents`) are cloned directly from the encoded reference audio, preserving its full structure in latent space. The diffusion process then transforms style while these latents anchor the structure.
- The cover task instruction differs from text-to-music: it uses `"Generate audio semantic tokens based on the given conditions:"` rather than `"Fill the audio semantic mask based on the given conditions:"`, which changes how the DiT model interprets the input.
- Audio is encoded at 48 kHz with latent frames at ~25 Hz (1920 audio samples per latent frame).
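The 48 kHz / 1920-samples-per-frame figures give exactly 25 latent frames per second, so audio duration maps linearly to latent length:

```python
SAMPLE_RATE = 48_000       # Hz, per the note above
SAMPLES_PER_FRAME = 1_920  # audio samples per latent frame

frames_per_second = SAMPLE_RATE / SAMPLES_PER_FRAME  # 25.0

def latent_frames(duration_s: float) -> int:
    """Number of latent frames for a clip of the given duration."""
    return int(duration_s * frames_per_second)

print(latent_frames(180))  # a 3-minute song -> 4500 latent frames
```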
Key Source Files
| File | Relevance |
|---|---|
| `acestep/handler.py` | Core cover task implementation, dual text encoding, diffusion pipeline |
| `acestep/inference.py` | `GenerationParams` definition, task routing, LM skip logic |
| `acestep/constants.py` | Task instruction strings (`TASK_INSTRUCTIONS` dict) |
| `acestep/gradio_ui/interfaces/generation.py` | UI slider definitions for cover parameters |
| `docs/en/INFERENCE.md` | API documentation with cover examples |
| `docs/en/Tutorial.md` | User-facing tutorial with cover usage patterns |
-
With `audio_cover_strength=0.99` the output sounds absolutely nothing like the input (melody-wise) using `cover`. What's the point of the parameter? If it doesn't preserve the melody, what is the point of the whole cover functionality?
-
I've had some success with drastic genre-covers by using an intermediate step.
-
I want to retain the sequence of the provided reference audio and transform it into a completely different style of music. How to adjust the parameters and describe it, for example, changing from country music to heavy metal rock.