An all-in-one node for ComfyUI that implements AceStep 1.5 SFT (Supervised Fine-Tuning), a state-of-the-art music generation model. It starts from the official AceStep workflow and extends it with stronger conditioning control and practical ComfyUI-oriented quality options.
SFT = Supervised Fine-Tuning: A specialized version of AceStep optimized for generating superior quality audio through supervised training.
This package currently provides four nodes under audio/AceStep SFT:
- AceStep 1.5 SFT Generate: all-in-one generation, editing, and decoding
- AceStep 1.5 SFT Music Analyzer: AI-powered audio analysis (tags, BPM, key/scale)
- AceStep 1.5 SFT Lora Loader: chainable LoRA stack builder for AceStep 1.5 SFT
- AceStep 1.5 SFT Turbo Tag Adapter: rewrites Turbo-oriented tags into shorter SFT-friendly prompt tags
The AceStepSFTGenerate node encapsulates the entire music generation workflow:
- Latent Creation - Generates initial latents or loads from the `latent_or_audio` input
- Text Encoding - Processes captions, lyrics, and metadata via multiple CLIP encoders
- Diffusion Sampling - Runs the diffusion model with advanced guidance control
- Audio Decoding - Converts latents to high-quality audio via VAE
The node supports three classifier-free guidance modes, each with unique characteristics:
- APG (Adaptive Projected Guidance) ⭐ Recommended
  - Dynamic adaptation via momentum buffering
  - Gradient clipping with adaptive thresholds
  - Orthogonal projection to eliminate unwanted noise
  - AceStep SFT Default - best quality and stability balance
- ADG (Angle-based Dynamic Guidance)
  - Angle-based guidance between conditions
  - Operates in velocity space (flow matching)
  - Ideal for aggressive style distortion
  - Adaptive clipping based on angle between x0_cond and x0_uncond
- Standard CFG
  - Traditional Classifier-Free Guidance
  - Simple and predictable implementation
  - Useful as a comparison baseline
- Auto-Duration: Automatically estimates music duration by analyzing lyric structure
- LLM Encoding: Use Qwen LLM (0.6B or 1.7B/4B) to generate semantic audio codes
- Auto Values: BPM, Time Signature, and Key/Scale automatic (model decides)
- Multilingual Support: Over 23 languages supported
- Audio Tag Extraction: Uses the native ACE-Step Transcriber to extract lyric, vocal, and song-structure tags from audio
- BPM Detection: Automatic tempo detection via librosa
- Key/Scale Detection: Detects musical key and scale (e.g. "G minor")
- JSON Output: Structured `music_infos` output with all analysis results
- Generation Parameters: Control temperature, top_p, top_k, repetition_penalty, and seed
- Auto Model Download: Models are downloaded on first use (~1-7 GB each)
| Model | Size | Type | Best For |
|---|---|---|---|
| ACE-Step-Transcriber | 22.4 GB download | Audio-to-Text | Native ACE-Step 1.5 transcription for lyrics, singing voice, structure tags, and instrument hints |
This node is now dedicated to the native ACE-Step-Transcriber workflow. It uses the model's native prompt format, structured transcription output, and derives tags from language, lyrics, section markers such as verse/chorus/bridge, and optional instrument annotations.
- Latent-based Refinement: Use `denoise < 1.0` with `latent_or_audio` connected to refine existing audio
- Accepts AUDIO or LATENT: Connect any audio or latent output for img2img-style editing
- Batch Generation: Generate multiple variations in parallel
- Split Text/Lyric Guidance: Independent `guidance_scale_text` and `guidance_scale_lyric`
- Omega Scale: Mean-preserving output reweighting to approximate AceStep scheduler behavior
- ERG Approximation: Node-local prompt energy reweighting via `erg_scale`
- Guidance Interval Decay: Smoothly decay guidance inside the active interval
- Chainable LoRA Loader: Stack one or more AceStep LoRAs before generation
- Separate strengths: Independent `strength_model` and `strength_clip`
- Single Generate input: Final LoRA stack plugs into the `lora` input on Generate
- Local `Loras/` folder: Drop LoRA files directly into the node's `Loras/` folder; they are automatically registered at startup
- Auto PEFT/DoRA conversion: PEFT-format LoRAs (`adapter_config.json` + `adapter_model.safetensors`) placed in `Loras/` are automatically converted to ComfyUI format on first startup
- DoRA support: Full DoRA (Weight-Decomposed Low-Rank Adaptation) support with automatic `dora_scale` dimension fix for ComfyUI compatibility
- Latent Shift: Additive anti-clipping correction
- Latent Rescale: Multiplicative scaling for dynamic control
- ComfyUI installed and functional
- CUDA/GPU or equivalent (modern processors)
- Recommended for better output quality (based on practical testing): use the merged SFT+Turbo model.
- Required model files:
  - Diffusion model (DiT): `acestep_v1.5_sft.safetensors`
  - Text Encoders: `qwen_0.6b_ace15.safetensors`, `qwen_1.7b_ace15.safetensors` (or 4B)
  - VAE: `ace_1.5_vae.safetensors`
Download the required models from HuggingFace:
- Diffusion Model (Recommended: merged SFT+Turbo):
- Alternative Diffusion Model (official SFT):
- Text Encoders (choose any versions):
  - Text Encoders Collection
    - `qwen_0.6b_ace15.safetensors` (caption processing)
    - `qwen_1.7b_ace15.safetensors` or `qwen_4b_ace15.safetensors` (audio code generation)
- VAE (Audio codec):
- Clone the repository to your custom nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/jeankassio/ComfyUI-AceStep_SFT.git
- Place model files in the appropriate directories:
ComfyUI/models/diffusion_models/ # AceStep 1.5 SFT model
ComfyUI/models/text_encoders/ # Qwen encoders
ComfyUI/models/vae/ # VAE
ComfyUI/models/loras/ # Optional AceStep 1.5 LoRAs
- (Optional) Place LoRAs in the local folder:
ComfyUI/custom_nodes/ComfyUI-AceStep_SFT/Loras/ # Local LoRA folder
You can place LoRAs here in any of these formats:
- ComfyUI format: Single `.safetensors` file (ready to use)
- PEFT/DoRA format: A folder containing `adapter_config.json` + `adapter_model.safetensors` (auto-converted on startup)
- Nested zip artifacts: If your zip extracted a folder-inside-folder, the node detects this and fixes it automatically
- Restart ComfyUI - the node will appear under `audio/AceStep SFT`
Main all-in-one node for text-to-music generation, latent-based audio refinement, and VAE decoding.
AI-powered audio analysis node that extracts descriptive tags, BPM, and key/scale from audio input.
Inputs:
- `audio`: Audio input to analyze
- `model`: AI model selection (9 models, auto-downloaded)
- `get_tags` / `get_bpm` / `get_keyscale`: Enable/disable each analysis
- `max_new_tokens`: Maximum tokens for generative models
- `audio_duration`: Max seconds of audio to analyze
- `temperature`, `top_p`, `top_k`, `repetition_penalty`, `seed`: Generation parameters
- `unload_model`: Free VRAM after analysis
- `use_flash_attn`: Enable Flash Attention 2 (if compatible)
Outputs:
- `tags`: Comma-separated descriptive tags (STRING)
- `bpm`: Detected BPM as string, e.g. "129bpm" (STRING)
- `keyscale`: Key and scale, e.g. "G minor" (STRING)
- `music_infos`: JSON with all results (STRING)
Chainable utility node that builds a LoRA stack for AceStep 1.5 SFT.
Inputs:
- `lora_name`: LoRA file from `ComfyUI/models/loras` or the local `Loras/` folder
- `strength_model`: strength applied to the diffusion model
- `strength_clip`: strength applied to the text encoder stack
- `lora` (optional): upstream AceStep LoRA stack
Output:
- `lora`: connect to another Lora Loader or directly into Generate
| Format | What to place in `Loras/` | Action |
|---|---|---|
| ComfyUI `.safetensors` | Single file | Used directly |
| PEFT/DoRA directory | Folder with `adapter_config.json` + `adapter_model.safetensors` | Auto-converted to `*_comfyui.safetensors` on startup |
| Nested zip artifact | Folder containing a `.safetensors` inside | Auto-extracted to root on startup |
The auto-conversion handles:
- Key remapping: `lora_A`/`lora_B` → `lora_down`/`lora_up`
- DoRA support: `lora_magnitude_vector` → `dora_scale` (with correct 2D shape)
- Per-layer alpha injection from `adapter_config.json` (supports `alpha_pattern` and `rank_pattern`)
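The key remapping listed above can be illustrated with a minimal sketch. The function name `remap_peft_key` and the example key layout are hypothetical and not taken from this package; only the three substitutions come from the list. The real converter also reshapes tensors and injects alpha values, which is not shown.

```python
# Hypothetical sketch of the key renaming only; shape fixes and alpha
# injection performed by the actual converter are not reproduced here.
RENAMES = {
    "lora_A": "lora_down",
    "lora_B": "lora_up",
    "lora_magnitude_vector": "dora_scale",
}


def remap_peft_key(key: str) -> str:
    """Rename PEFT-style key segments to their ComfyUI-style equivalents."""
    return ".".join(RENAMES.get(part, part) for part in key.split("."))


example = "layers.0.attn.q_proj.lora_A.weight"  # assumed key layout
print(remap_peft_key(example))
```

Matching whole dot-separated segments rather than substrings avoids accidentally rewriting a module name that merely contains `lora_A`.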
| Parameter | Range | Description |
|---|---|---|
| diffusion_model | - | Path to DiT model (AceStep 1.5 SFT) |
| text_encoder_1 | - | Qwen3 0.6B Encoder (caption processing) |
| text_encoder_2 | - | Qwen3 1.7B/4B Encoder (audio code generation) |
| vae_name | - | AceStep 1.5 VAE |
| caption | - | Text description of music (genre, mood, instruments) |
| lyrics | - | Song lyrics or [Instrumental] |
| instrumental | boolean | Force instrumental mode (overrides lyrics) |
| seed | 0 - 2^64 | Seed for reproducibility |
| steps | 1 - 200 | Diffusion inference steps (default: 50 for ACE-Step 1.5 SFT) |
| cfg | 1.0 - 20.0 | Classifier-free guidance scale (default: 7.0; typical 7.0-9.0 for ACE-Step 1.5) |
| sampler_name | - | Sampler (euler, dpmpp, etc.) |
| scheduler | - | Scheduler (normal, karras, exponential, etc.; default: normal) |
| denoise | 0.0 - 1.0 | Denoising strength (1.0 = fresh generation, < 1.0 = editing) |
| infer_method | ode/sde | ODE keeps the selected sampler behavior; SDE remaps default Euler/Heun choices to a stochastic sampler |
| guidance_mode | apg/adg/standard_cfg | Guidance type (default: apg) |
| duration | 0.0 - 600.0 | Duration in seconds (default: 60.0, 0 = auto) |
| bpm | 0 - 300 | Beats per minute (0 = auto, model decides) |
| timesignature | auto/2/3/4/6 | Time signature numerator |
| language | - | Lyric language (en, ja, zh, es, pt, etc.) |
| keyscale | auto/... | Key and scale (e.g., "C major" or "D minor") |
- batch_size (1-16): Number of audios to generate in parallel
- latent_or_audio: Base input for refinement (img2img). Accepts AUDIO or LATENT. Use `denoise < 1.0` to refine this input. With `duration=0`, duration is derived from the connected input.
- lora: AceStep LoRA stack from one or more `AceStep 1.5 SFT Lora Loader` nodes
- generate_audio_codes (default: True): Enable/disable LLM audio code generation for semantic structure
- lm_cfg_scale (0.0-100.0, default: 2.0): LLM classifier-free guidance scale
- lm_temperature (0.0-2.0, default: 0.85): LLM sampling temperature
- lm_top_p (0.0-2000.0, default: 0.9): Nucleus sampling parameter
- lm_top_k (0-100, default: 0): Top-k sampling
- lm_min_p (0.0-1.0, default: 0.0): Minimum probability threshold
- lm_negative_prompt: Negative prompt for LLM CFG
- latent_shift (-0.2-0.2, default: 0.0): Additive shift (anti-clipping)
- latent_rescale (0.5-1.5, default: 1.0): Multiplicative scaling
- normalize_peak (default: False): Legacy hard normalization to 0 dBFS after VAE decode
- enable_normalization (default: True): Peak-normalize output to a target dBFS level
- normalization_db (-10.0-0.0, default: -1.0): Target peak level when normalization is enabled
- fade_in_duration / fade_out_duration (0.0-10.0, default: 0.0): Optional linear fades after normalization
- use_tiled_vae (default: True): Uses tiled VAE encode/decode for better long-audio and low-VRAM robustness
- voice_boost (-12.0-12.0, default: 0.0): Simple output gain in dB before normalization
- apg_momentum (-1.0-1.0, default: -0.75): Momentum buffer coefficient
- apg_norm_threshold (0.0-10.0, default: 2.5): Norm threshold for gradient clipping
- guidance_interval (-1.0-1.0, default: 0.5): Official centered guidance interval control
- guidance_interval_decay (0.0-1.0, default: 0.0): Linear decay inside the active guidance interval
- min_guidance_scale (0.0-30.0, default: 3.0): Lower bound when interval decay is enabled
- guidance_scale_text (-1.0-30.0, default: -1.0): Text-only guidance scale, `-1` inherits `cfg`
- guidance_scale_lyric (-1.0-30.0, default: -1.0): Lyric-only delta guidance scale, `-1` inherits `cfg`
- omega_scale (-8.0-8.0, default: 0.0): Mean-preserving output reweighting
- erg_scale (-0.9-2.0, default: 0.0): Prompt/lyric conditioning energy reweighting
- cfg_interval_start (0.0-1.0, default: 0.0): Start applying guidance at this schedule fraction
- cfg_interval_end (0.0-1.0, default: 1.0): Stop applying guidance at this schedule fraction
- shift (1.0-5.0, default: 3.0): Schedule shift (3.0 = Gradio default)
- custom_timesteps: Custom comma-separated timesteps (overrides steps, shift, scheduler)
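As a rough illustration of how `voice_boost` and `normalization_db` relate to sample values, decibels map to a linear amplitude factor through 10^(dB/20). The sketch below works on a plain list of samples; the function names are hypothetical and the node's actual processing chain may differ.

```python
def db_to_gain(db: float) -> float:
    """Convert a decibel value to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def peak_normalize(samples: list[float], target_db: float = -1.0) -> list[float]:
    """Scale samples so the largest absolute value sits at target_db dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    factor = db_to_gain(target_db) / peak
    return [s * factor for s in samples]


# Apply a +3 dB gain first, then normalize the peak to -1 dBFS.
boosted = [s * db_to_gain(3.0) for s in [0.1, -0.25, 0.2]]
normalized = peak_normalize(boosted, target_db=-1.0)
print(max(abs(s) for s in normalized))  # about 0.891, i.e. -1 dBFS
```

Because peak normalization rescales the whole signal, a gain applied beforehand changes the final level only if it is applied to part of the mix; on a full mix it is absorbed by the normalization step.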
The node automatically manages latent creation or reuse:
├─ If latent_or_audio provided:
│ ├─ AUDIO: Resamples to VAE SR (48kHz), normalizes channels, encodes via VAE
│ ├─ LATENT: Uses directly as latent_image
│ └─ Duration derived from input when duration=0
│
└─ If no latent_or_audio:
└─ Creates zero latent (pure noise) [batch_size, 64, latent_length]
Automatic Sizing: Duration in seconds is converted to latent length via:
latent_length = max(10, round(duration * vae_sample_rate / 1920))
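A small sketch of this conversion, assuming the 48 kHz VAE sample rate mentioned above, which works out to 25 latent frames per second. The function name is illustrative.

```python
def latent_length(duration: float, vae_sample_rate: int = 48000) -> int:
    """Convert a duration in seconds to a latent frame count (minimum 10)."""
    return max(10, round(duration * vae_sample_rate / 1920))


print(latent_length(60.0))  # 1500 frames for 60 seconds
print(latent_length(0.1))   # clamped to the minimum of 10
```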
When duration <= 0, the node analyzes lyric structure:
[Intro/Outro] = 8 beats (~2 bars 4/4)
[Instrumental/Solo] = 16 beats (~4 bars 4/4)
Verse/Chorus → ~2 beats per 2 words (typical singing rate)
Section transitions = 4 beats
Empty lines = 2 beats (pause)
Result: duration = beats * (60 / bpm)
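The heuristic above can be sketched as follows. This is an illustrative reading of the rules, not the node's code: `estimate_duration` is a hypothetical name, the transition beats are assumed to be added once per section tag, and "~2 beats per 2 words" is taken as one beat per word.

```python
import re

SECTION_BEATS = {"intro": 8, "outro": 8, "instrumental": 16, "solo": 16}
TRANSITION_BEATS = 4    # assumption: added once for every section tag
EMPTY_LINE_BEATS = 2    # pause
BEATS_PER_WORD = 1.0    # "~2 beats per 2 words"


def estimate_duration(lyrics: str, bpm: float = 120.0) -> float:
    """Estimate song length in seconds from lyric structure."""
    beats = 0.0
    for line in lyrics.splitlines():
        text = line.strip()
        if not text:
            beats += EMPTY_LINE_BEATS
            continue
        # Capture the first word of a tag so "[Verse 1]" yields "verse".
        tag = re.fullmatch(r"\[\s*(\w+)[^\]]*\]", text)
        if tag:
            beats += TRANSITION_BEATS
            beats += SECTION_BEATS.get(tag.group(1).lower(), 0)
            continue
        beats += len(text.split()) * BEATS_PER_WORD
    return beats * (60.0 / bpm)


lyrics = "[Intro]\n\n[Verse 1]\nwalking through the city lights tonight\n[Outro]"
print(estimate_duration(lyrics, bpm=120))  # 36 beats at 120 bpm -> 18.0 seconds
```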
Metadata (bpm, duration, key/scale, time sig) are encoded in multiple representations:
- Structured YAML (Chain-of-Thought):
bpm: 120
caption: "upbeat electronic dance"
duration: 120
keyscale: "G major"
language: "en"
timesignature: 4
- LLM Template (for audio code generation via Qwen):
<|im_start|>system
# Instruction
Generate audio semantic tokens...
<|im_end|>
<|im_start|>user
# Caption
upbeat electronic dance
# Lyric
[Verse 1]...
<|im_end|>
<|im_start|>assistant
<think>
{YAML above}
</think>
<|im_end|>
- Qwen3-0.6B Template (direct metadata):
# Instruction
# Caption
upbeat electronic dance
# Metas
- bpm: 120
- timesignature: 4
- keyscale: G major
- duration: 120 seconds
<|endoftext|>
# Phase 1: Compute conditional difference
diff = pred_cond - pred_uncond
# Phase 2: Apply smooth momentum
if momentum_buffer:
    running_avg = momentum * running_avg + diff  # update the buffer
    diff = running_avg                            # use the smoothed difference
# Phase 3: Norm clipping
norm = ||diff||₂
scale = min(1, norm_threshold / norm)
diff = diff * scale
# Phase 4: Orthogonal decomposition
diff_parallel = projection of diff onto pred_cond
diff_orthogonal = diff - diff_parallel
# Phase 5: Final guidance
guidance = pred_cond + (cfg_scale - 1) * (diff_orthogonal + eta * diff_parallel)
Why It Works:
- Orthogonal projection removes collinear components that amplify noise
- Momentum smooths large jumps between timesteps
- Adaptive clipping prevents gradient explosion
- Result: cleaner and more stable audio
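The five phases can be written as a small runnable sketch on flat lists of floats. This is a simplified illustration, not the node's implementation: `apg_step` is a hypothetical name, the buffer update (store the running average, then use it as the difference) follows the published APG formulation, and `eta=0.0` is an assumed default. Real tensors would be normalized and projected per sample over the latent dimensions.

```python
import math


def apg_step(pred_cond, pred_uncond, running_avg, cfg_scale=7.0,
             momentum=-0.75, norm_threshold=2.5, eta=0.0):
    """One APG-style guidance step. Returns (guided, updated_running_avg)."""
    # Phase 1: conditional difference
    diff = [c - u for c, u in zip(pred_cond, pred_uncond)]
    # Phase 2: momentum buffer (running_avg starts as zeros)
    running_avg = [momentum * r + d for r, d in zip(running_avg, diff)]
    diff = running_avg
    # Phase 3: clip the L2 norm of the difference to norm_threshold
    norm = math.sqrt(sum(d * d for d in diff))
    if norm > 0.0:
        scale = min(1.0, norm_threshold / norm)
        diff = [d * scale for d in diff]
    # Phase 4: split into parts parallel and orthogonal to pred_cond
    cond_sq = sum(c * c for c in pred_cond)
    dot = sum(d * c for d, c in zip(diff, pred_cond))
    coeff = dot / cond_sq if cond_sq else 0.0
    parallel = [coeff * c for c in pred_cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    # Phase 5: recombine, down-weighting the parallel part by eta
    guided = [c + (cfg_scale - 1.0) * (o + eta * p)
              for c, o, p in zip(pred_cond, orthogonal, parallel)]
    return guided, running_avg


cond, uncond = [1.0, 0.0], [0.0, 1.0]
guided, buf = apg_step(cond, uncond, running_avg=[0.0, 0.0], cfg_scale=3.0)
print(guided)  # [1.0, -2.0]: only the component orthogonal to cond is pushed
```

With `eta=0.0` the component of the update that points along `pred_cond` is dropped entirely, so guidance changes the direction of the prediction without inflating its magnitude along the existing direction.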
# Based on cosine angles between x0_cond and x0_uncond
# Dynamically adjusts guidance based on alignment
# Uses trigonometry for aggressive style deformation
When latent_or_audio is connected with denoise < 1.0, the node operates in img2img mode:
- The input audio is encoded via VAE (or the latent is used directly)
- A fraction of noise is added based on `denoise` strength
- The diffusion model refines the noisy latent while preserving the original structure
- Use `guidance_mode=apg` with `steps=50` to `64` for best quality
- For img2img refinement, start with `denoise=0.5` to `0.7` to preserve the original character
- Mild vocal hiss is usually a generation artifact; APG and slightly higher step counts generally help more than raw `cfg`
- Simplify overly dense or contradictory tags for cleaner results
| Aspect | APG | ADG | Standard CFG |
|---|---|---|---|
| Quality | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Stability | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
| Dynamics | Natural | Aggressive | Predictable |
| Computation | Normal | Normal | Minimal |
| Recommended | ✅ Yes | For extreme styles | Baseline |
AceStepSFTGenerate:
caption: "upbeat electronic dance music with synthesizers"
lyrics: [Instrumental]
instrumental: True
duration: 60.0
cfg: 7.0
steps: 50
sampler_name: "euler"
scheduler: "normal"
guidance_mode: "apg"
→ Generates a strong 60s ACE-Step 1.5 SFT baseline render
AceStepSFTGenerate:
latent_or_audio: (mixer output)
caption: "make it more orchestral"
denoise: 0.7 (preserves 30% of source)
duration: 0 (uses input duration)
→ Refines audio while preserving original characteristics
AceStepSFTGenerate:
batch_size: 4
seed: 42 (varies automatically)
→ Creates 4 variations with similar characteristics
AceStep 1.5 SFT Lora Loader:
lora_name: "Ace-Step1.5/ace-step15-style1.safetensors"
strength_model: 0.7
strength_clip: 0.0
↓
AceStep 1.5 SFT Lora Loader:
lora_name: "Ace-Step1.5/Ace-Step1.5-TechnoRain.safetensors"
strength_model: 0.35
strength_clip: 0.0
↓
AceStep 1.5 SFT Generate:
lora: (stack output)
Note: AceStep LoRAs are now supported directly by this package. If a specific LoRA produces unstable audio, start by lowering strength_model and compare apg against standard_cfg.
AceStepSFTMusicAnalyzer:
audio: (input audio file)
model: "Qwen2-Audio-7B-Instruct"
→ tags: "dancehall beat, powerful bassline, vocal samples, melancholic"
→ bpm: "129bpm"
→ keyscale: "G minor"
↓
AceStepSFTGenerate:
caption: (tags from analyzer)
bpm: 129
keyscale: "G minor"
→ Generates new music matching the analyzed style
Solution: Use negative latent_shift (e.g., -0.1) to reduce amplitude before VAE decoding
Solution: Increase apg_norm_threshold (e.g., 3.0-4.0) to relax the norm clipping; a higher threshold clips the guidance difference less
Solution:
- Use `guidance_mode: "apg"` (recommended)
- Start from `steps: 50`, `cfg: 7.0`, `sampler_name: "euler"`, `scheduler: "normal"`, `infer_method: "ode"`
- Keep `enable_normalization: True` with `normalization_db: -1.0` for cleaner final level management
Solution:
- Lower `strength_model` first, e.g. `0.2` to `0.6`
- Set `strength_clip` to `0.0` unless the LoRA explicitly targets the text encoders
- Compare `guidance_mode: "standard_cfg"` vs `"apg"` for that LoRA
- Avoid stacking multiple strong LoRAs at full strength
Cause: DoRA LoRAs store dora_scale as a 1D tensor [N]. ComfyUI's weight_decompose divides it by weight_norm [N,1], which causes PyTorch to broadcast [1,N]/[N,1] → [N,N] instead of the expected [N,1].
Solution: This is automatically fixed by the node — all dora_scale tensors are unsqueezed to 2D [N,1] at load time. If you still see this error, ensure you are using the latest version of this node.
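The shape arithmetic behind this error can be checked without loading any model. The sketch below re-implements the right-aligned broadcasting rule that PyTorch and NumPy share; `broadcast_shape` is a helper written for this illustration and is not part of either library or of this node.

```python
from itertools import zip_longest


def broadcast_shape(a: tuple, b: tuple) -> tuple:
    """Return the broadcast result shape, aligning dimensions from the right."""
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        # A size-1 dimension stretches to match the other operand.
        result.append(y if x == 1 else x)
    return tuple(reversed(result))


N = 4
print(broadcast_shape((N,), (N, 1)))    # (4, 4): the unwanted [N, N] result
print(broadcast_shape((N, 1), (N, 1)))  # (4, 1): after unsqueezing dora_scale
```

A 1D `[N]` tensor is treated as `[1, N]` when aligned from the right, which is why dividing it by `[N, 1]` expands both operands to `[N, N]`.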
Solution:
- Place the PEFT folder (containing `adapter_config.json` + `adapter_model.safetensors`) inside `ComfyUI-AceStep_SFT/Loras/`
- Restart ComfyUI; the conversion runs automatically on startup
- Check the console for the `[AceStep SFT] Converted PEFT/DoRA → ComfyUI: ...` message
- The converted file appears as `*_comfyui.safetensors` in the dropdown
Solution: Reduce batch_size, lower steps to ~20, or use "karras" scheduler
- AceStep 1.5: ICML 2024 (Learning Universal Features for Efficient Audio Generation)
- Flow Matching: Lipman et al. 2023 (Flow Matching for Generative Modeling)
- APG/ADG: Techniques aligned with official AceStep paper
- ComfyUI: Modular node graph architecture for batch generation
MIT License - Feel free to use in personal or commercial projects
Issues and PRs are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Recommended maximum duration: 240 seconds (GPU memory)
- Maximum batch size: Depends on your GPU (start with 1-2)
- SFT models: These models are specific to Supervised Fine-Tuning - not tested with non-SFT models
- Rights and attribution: Respect model and dataset usage rights
Built on the AceStep SFT workflow and extended with advanced guidance and quality controls for ComfyUI.
For bugs, questions, or suggestions: open an issue on the repository! 🎵
