Skip to content

Saganaki22/ComfyUI-AudioSR

Repository files navigation

ComfyUI AudioSR (Versatile Audio Super Resolution)

           

Native ComfyUI node for AudioSR (Versatile Audio Super Resolution) - Upscale any audio to 48kHz using state-of-the-art latent diffusion.

Based on the original AudioSR implementation by Haohe Liu et al.

🎯 Key Features

  • 🎧 Audio Super Resolution: Upsample low-quality audio to 48kHz with enhanced high frequencies
  • πŸŽ›οΈ Native ComfyUI Integration: Works seamlessly with Load Audio, Preview Audio, and Save Audio nodes
  • πŸ“Š Built-in Spectrogram Visualization: Before/after comparison with time and frequency axes
  • πŸ”„ Automatic Sample Rate Handling: Accepts any input sample rate (8kHz - 48kHz)
  • 🧩 Stereo Support: Processes both mono and stereo audio with independent channel handling
  • πŸ“ Long Audio Support: Smart chunking with overlap for unlimited audio length
  • ⚑ Model Caching: Model stays in memory for fast subsequent generations
  • πŸš€ torch.compile Optimization: Optional PyTorch compilation for 20-30% speed boost (FP32 models)
  • πŸŽ›οΈ VRAM Management: Optional model unloading to free GPU memory between runs
  • ⏸️ Interruptible: Cancel processing mid-run through ComfyUI's interrupt button
  • πŸ“ˆ Progress Reporting: Real-time progress bar shows chunk processing status

⚑ Performance Optimizations

Attention Backends

The node supports multiple attention computation backends for optimal performance:

Backend Speed Requirements Best For
sdpa Fast PyTorch 2.0+ Default, most compatible
sageattn Fastest fp16/bf16 dtype, pip install sageattention RTX 30/40 series, maximum speed
eager Slowest None Debugging, maximum compatibility

SageAttention: GPU-optimized attention kernels with automatic architecture detection (SM80+ required). Falls back gracefully if not installed or incompatible.

Compute Dtypes

Dtype VRAM Speed Notes
fp32 Higher Baseline Default, most accurate
fp16 Lower Faster Requires GPU with good FP16 support
bf16 Lower Fastest Best on RTX 30/40 series (Ampere+)

Note: SageAttention requires fp16 or bf16 dtype. Selecting fp32 with SageAttention auto-falls back to sdpa.

torch.compile Mode

The node includes an optional use_torch_compile toggle that applies PyTorch's torch.compile() optimization to the model for faster inference.

Speed Boost: After an initial compilation overhead (~10-30 seconds on first run), you'll see:

  • 20-30% faster inference for subsequent generations
  • Best performance with FP32 models (recommended for torch.compile)
  • Grows more valuable with repeated processing (cached compiled model)

When to Use:

  • Processing multiple audio files in a session
  • Longer audio requiring multiple chunks
  • When speed is critical and you can wait for the initial compilation
  • Recommended: Enable if using FP32 models and processing multiple clips

Caveats:

  • Only works with FP32 models (will skip compilation for FP16/FP8 models)
  • First generation takes longer due to compilation overhead
  • Not recommended for one-off processing

Optimized Defaults

The node has been tuned with performance-focused default values:

Parameter Default Previous Performance Impact
chunk_size 15.0s 5.12s Fewer chunks = ~60% faster for long audio
overlap 0.0s 0.04s No overlap = faster (smoother audio with 2.0-3.0s overlap)
attention_backend sdpa sdpa PyTorch native attention (fastest available)

Optimizations Applied:

  • Removed unnecessary tensor conversions (torch.from_numpyβ†’numpy)
  • Smart model caching with automatic recompilation detection
  • Improved dtype detection for quantized models (FP8 β†’ FP16 conversion)
  • Safe division handling to prevent numerical errors
  • Memory-efficient overlap processing with optional crossfade

Security Improvements

  • weights_only=True for safe model loading (prevents arbitrary code execution)
  • Validates tensor dtypes before model conversion

Requirements

All Python dependencies are installed automatically. No external tools required.

Minimum: 6GB VRAM, 12GB RAM recommended


image
ComfyUI_temp_bildo_00002_

Audio Examples

Original AudioSR
speech_up_4.wav speech_audiosr_4.wav
event_up_2.wav event_audiosr_2.wav

πŸ“¦ Installation

Method 1: ComfyUI Manager (Recommended)

  1. Open ComfyUI Manager
  2. Search for "AudioSR"
  3. Click Install
  4. Restart ComfyUI

That's it! All dependencies are installed automatically.

Method 2: Manual Installation

Standard Python:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR.git
cd ComfyUI-AudioSR
pip install -r requirements.txt

ComfyUI Portable (Windows with embedded Python):

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR
cd ComfyUI-AudioSR
..\python_embeded\python.exe -s -m pip install -r requirements.txt

πŸ“₯ Download Models

Important: Models must be placed in ComfyUI/models/AudioSR/

Download from HuggingFace: https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR

Download one or both models and place them in your ComfyUI models directory:

Required Folder Structure:

ComfyUI/
└── models/
    └── AudioSR/
        β”œβ”€β”€ audiosr_basic_fp32.safetensors (for general audio)
        └── audiosr_speech_fp32.safetensors (for voice content)

Available Models (FP32):

  • audiosr_basic_fp32.safetensors - General purpose (music, sound effects, etc.)
  • audiosr_speech_fp32.safetensors - Optimized for voice/speech

πŸ’» VRAM Requirements

Configuration VRAM Usage Recommended For
Standard ~6GB RTX 3060+ (6GB+)
With unload_model enabled ~0.5GB (when idle) Systems with limited VRAM

Minimum: 6GB VRAM required, 8GB+ recommended


🎚️ AudioSR Node

πŸ“– Click to expand: Overview & Parameters

Overview

Upscale audio to 48kHz using the AudioSR latent diffusion model. The model analyzes low-quality audio and generates enhanced high-frequency details for a cleaner, fuller sound.

What it does:

  • Resamples audio to 48kHz (if needed)
  • Enhances high frequencies and adds clarity
  • Reduces artifacts from low-bitrate compression
  • Works on any audio: music, speech, sound effects

Use Cases:

  • Upsample old/low-quality audio recordings
  • Enhance compressed audio (MP3, low-bitrate streams)
  • Improve audio for video production
  • Restore archived audio content

Parameters

Required Inputs

Parameter Type Range Default Description
audio AUDIO - - Audio input from Load Audio node
ddim_steps INT 10-500 50 Number of denoising steps (higher = better quality, slower)
guidance_scale FLOAT 1.0-20.0 3.5 CFG scale - higher = more faithful to input
seed INT 0-4.29B 0 Random seed (0 = random)

Optional Inputs

Parameter Type Range Default Description
model COMBO - basic Model file from ComfyUI/models/AudioSR/ (supports .bin, .safetensors)
chunk_size FLOAT 2.56-30.0 15.0 Chunk duration in seconds (for audio >10.24s)
overlap FLOAT 0.0-5.0 0.0 Overlap in seconds between chunks (helps smooth transitions, 2.0-3.0 recommended for long audio)
unload_model BOOLEAN - False Free VRAM after generation (slower next run)
show_spectrogram BOOLEAN - True Generate before/after spectrogram image
attention_backend COMBO - sdpa Attention: sdpa (fast), sageattn (fastest, requires fp16/bf16), eager (compatible)
dtype COMBO - fp32 Compute dtype: fp32 (default), fp16 (faster), bf16 (RTX 30/40 series)
use_torch_compile BOOLEAN - False Use torch.compile() for 20-30% speed boost (FP32 only, requires warmup)

Outputs

Output Type Description
audio AUDIO Upscaled audio at 48kHz (connect to Preview/Save)
spectrogram IMAGE Before/after spectrogram comparison (optional)

🎨 Workflow Examples

πŸ“– Click to expand: Workflow Examples

Basic Audio Upscaling

Load Audio β†’ AudioSR β†’ Preview Audio / Save Audio
  1. Add Load Audio node and select your audio file
  2. Add AudioSR node
  3. Connect audio output to AudioSR input
  4. Set ddim_steps: 50 (default)
  5. Set guidance_scale: 3.5 (default)
  6. Connect AudioSR audio output to Preview Audio or Save Audio
  7. Queue and generate!

High Quality Upscaling

Settings:
- ddim_steps: 100
- guidance_scale: 5.0
- model: speech (for voice content)
- show_spectrogram: True

Low VRAM Mode

Settings:
- unload_model: True
- ddim_steps: 50 (default)

This frees VRAM after each generation but makes subsequent runs slower.

Long Audio Processing

For audio longer than ~10 seconds, the node automatically:

  • Splits audio into chunks
  • Processes each chunk
  • Crossfades overlap regions
  • Stitches into seamless output

Defaults from main repo (recommended):

- chunk_size: 15.0 (seconds per chunk)
- overlap: 2.0 (seconds overlap between chunks)

For faster processing with more VRAM:

- chunk_size: 20.0-30.0 (fewer chunks = faster)
- overlap: 2.0-3.0 (smoother transitions)

βš™οΈ Parameter Guide

πŸ“– Click to expand: Detailed Parameter Guide

ddim_steps (10-500)

Number of denoising steps during generation.

Value Quality Speed Use Case
10-30 Lower Fast Quick previews
50 Good Medium Default recommendation
100 Better Slow High-quality output
200+ Best Very Slow Maximum quality

guidance_scale (1.0-20.0)

Classifier-free guidance scale. Controls how closely the output follows the input.

Value Effect
1.0-2.0 More creative/variant
3.0-4.0 Balanced (default)
5.0-8.0 More faithful to input
10.0+ Very conservative (may sound artificial)

chunk_size (2.56-30.0)

Chunk duration in seconds for long audio processing.

Value Effect
2.56-5.0 More chunks, slower, less memory per chunk
15.0 Default (from main repo, balanced)
20.0-30.0 Fewer chunks, faster, more memory per chunk

overlap (0.0-5.0)

Overlap duration in seconds between chunks.

Value Effect
0.0 No overlap (may have seams)
2.0 Recommended (smoother transitions)
3.0-5.0 Smoothest stitching, slower processing

use_torch_compile (BOOLEAN)

Enable PyTorch's torch.compile() optimization for faster inference.

Value Effect
False Default - Standard inference
True 20-30% faster after compilation (FP32 models only)

Notes:

  • First run takes ~10-30s longer for compilation
  • Only effective with FP32 models (recommended: audiosr_basic_fp32.safetensors, audiosr_speech_fp32.safetensors)
  • Best for batch processing or repeated use
  • Model auto-recompiles if toggle state changes

πŸ› Troubleshooting

πŸ“– Click to expand: Troubleshooting

Model Not Found Error

Symptom: "Model not found. Please download the AudioSR model..."

Solution:

  1. Download model from: https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR
  2. Place in ComfyUI/models/AudioSR/
  3. Restart ComfyUI

CUDA Out of Memory

Symptom: RuntimeError: CUDA out of memory

Solutions:

  1. Enable unload_model: True
  2. Close other GPU-intensive applications
  3. Reduce chunk_size (try 2.56)
  4. Use a smaller audio file

Poor Audio Quality

Symptom: Output sounds distorted or artificial

Solutions:

  1. Lower guidance_scale (try 2.5-3.0)
  2. Reduce ddim_steps (too high can sound artificial)
  3. Try the other model variant (basic ↔ speech)
  4. Ensure input audio is reasonably clean

No Output Audio

Symptom: Node completes but no audio is produced

Solutions:

  1. Check ComfyUI console for error messages
  2. Verify audio input is properly connected
  3. Ensure output is connected to Preview/Save Audio node
  4. Try a different audio file

Spectrogram Not Showing

Symptom: Spectrogram output is empty or shows noise

Solutions:

  1. Ensure show_spectrogram: True
  2. Install matplotlib: pip install matplotlib
  3. Check ComfyUI console for errors

Slow Processing

Symptom: Node takes very long to process

Solutions:

  1. Disable unload_model (keeps model cached)
  2. Enable use_torch_compile (20-30% faster after warmup, FP32 models only)
  3. Increase chunk_size (fewer chunks = faster)
  4. Ensure GPU is being used (not CPU)
  5. Try ddim_steps: 30 for faster processing

torch.compile Not Working

Symptom: use_torch_compile enabled but no speedup or message says compilation failed

Solutions:

  1. Ensure using FP32 model (*_fp32.safetensors)
  2. Check PyTorch version (2.0+ required for torch.compile)
  3. First run always includes compilation overhead - try second run
  4. Check console for error messages (some operations not supported)

SageAttention Not Working

Symptom: sageattn selected but falling back to sdpa

Solutions:

  1. Install SageAttention: pip install sageattention
  2. Ensure dtype is set to fp16 or bf16 (fp32 auto-falls back)
  3. Check GPU architecture (requires SM80+ = RTX 30 series or newer)
  4. Check console message for specific fallback reason

Model Reloaded Repeatedly

Symptom: Model recompiles every time, defeating performance gains

Causes:

  • Switching use_torch_compile on/off forces model reload
  • Switching between FP32 and FP16 models
  • ComfyUI restart (model cache cleared)

Volume Drops at Chunk Boundaries

Symptom: Periodic volume drops or glitches in long audio (every 10-30 seconds depending on chunk_size)

Causes:

  • The model internally pads audio to 5.12s multiples, causing output length to differ from input
  • Improper chunk positioning when stitching outputs together

Solutions (v1.0.6+):

  • This issue is now fixed automatically
  • If using older version, update to v1.0.6 or later
  • For smoothest transitions, use overlap: 2.0-3.0 when processing long audio

πŸ“Š Spectrogram Visualization

πŸ“– Click to expand: Spectrogram Details

The node generates a side-by-side spectrogram comparison when show_spectrogram: True:

Top panel: Input audio (before) - Shows limited high frequencies Bottom panel: Output audio (after) - Shows enhanced frequency content

The spectrogram uses the magma colormap:

  • Purple/Black: Low energy (silence/quiet)
  • Red/Orange: Medium energy
  • Yellow: High energy (loud frequencies)

Axes:

  • X-axis: Time in seconds
  • Y-axis: Frequency in Hz (0-24kHz visible range)

πŸ”— Links


πŸ’‘ Tips for Best Results

  • Input Quality: The model can't create what isn't there - extremely low-quality audio may still sound artificial
  • Guidance Scale: Start at 3.5 and adjust based on results
  • Steps: 50 steps is usually sufficient; 100+ for critical applications
  • Speech Audio: Use the audiosr_speech_fp32.safetensors model for voice content
  • Music/General: Use the audiosr_basic_fp32.safetensors model for music and sound effects
  • Long Audio: Let the auto-chunking handle files >10 seconds
  • VRAM: Enable unload_model if you need GPU memory for other tasks
  • Speed Optimization: Enable use_torch_compile when processing multiple files or long audio (FP32 models only)
  • Maximum Speed: Use attention_backend: sageattn + dtype: bf16 on RTX 30/40 series (install: pip install sageattention)
  • Smooth Transitions: Use overlap: 2.0 for seamless chunk stitching on long audio
  • Chunk Size: Default 15s is optimized; increase to 20-30s for fewer chunks if VRAM allows

πŸ“š Credits & License

Original Research: AudioSR: Versatile Audio Super-Resolution by Haohe Liu et al.

Original Implementation: versatile_audio_super_resolution by Haohe Liu

ComfyUI Integration: This custom node implementation

License: MIT (same as original AudioSR project)


πŸ“ Changelog

Version 1.1.1

  • βœ… Fixed tensor dimension mismatch: Small chunks (<5.12s) now get padded to minimum size before processing, then trimmed to original length. Fixes "Expected size 64 but got size 63" errors on audio with uneven chunk splits.

Version 1.1.0

  • βœ… SageAttention support: Added sageattn attention backend with GPU-arch auto-detection (SM80+). Falls back to sdpa if not installed or incompatible.
  • βœ… Dtype selector: New dtype dropdown (fp32/fp16/bf16) to control compute precision. SageAttention auto-falls back to sdpa when fp32 is selected.
  • βœ… Added .gitignore for __pycache__ and common Python artifacts

Version 1.0.6

  • βœ… Fixed chunk positioning bug: Model internally pads audio to 5.12s multiples, causing output length to differ from input. Fixed by positioning chunks based on INPUT boundaries rather than model output length. This eliminates volume drops/glitches at chunk boundaries (30s, 60s, etc.) in long audio files.
  • βœ… Fixed overlap-add normalization: Improved chunk stitching with proper weight tracking for overlap regions. Ensures consistent amplitude across chunk boundaries when using overlap > 0.
  • βœ… Improved handling of stereo audio processing with independent channel reconstruction.

Version 1.0.5

  • βœ… Model directory and file scanning now cached (no repeated folder scans on browser refresh/reload)
  • βœ… Removed verbose console messages on tab refresh ("Checking path", "Found model", etc.)
  • βœ… Improved startup performance by avoiding redundant file system operations

Version 1.0.4

  • βœ… Minor fixes and improvements

Version 1.0.3

  • βœ… Added use_torch_compile toggle for 20-30% speed boost (FP32 models)
  • βœ… Optimized default chunk_size to 15.0s (was 5.12s) for faster long audio processing
  • βœ… Optimized default overlap to 0.0s (was 0.04s) with configurable 2-3s recommended for smooth transitions
  • βœ… Removed unnecessary tensor conversions for faster inference
  • βœ… Improved dtype detection for quantized model support (FP8 β†’ FP16 auto-conversion)
  • βœ… Security fix: Added weights_only=True for safe model loading
  • βœ… Smart model caching with automatic recompilation detection
  • βœ… Safe division handling in overlap normalization

Version 1.0.2

  • βœ… Added HuggingFace model link for easy access
  • βœ… Improved documentation and examples

Version 1.0.1

  • βœ… Native ComfyUI AUDIO type support
  • βœ… Automatic sample rate conversion (any input rate β†’ 48kHz)
  • βœ… Stereo audio processing
  • βœ… Longer audio support with smart chunking
  • βœ… Before/after spectrogram visualization
  • βœ… Progress reporting and interrupt support
  • βœ… Model caching and optional VRAM unloading
  • βœ… Time and frequency axes on spectrograms

About

ComfyUI node for AudioSR - Versatile Audio Super Resolution upscales audio to 48kHz using latent diffusion

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages