Native ComfyUI node for AudioSR (Versatile Audio Super Resolution) - Upscale any audio to 48kHz using state-of-the-art latent diffusion.
Based on the original AudioSR implementation by Haohe Liu et al.
- Audio Super Resolution: Upsample low-quality audio to 48kHz with enhanced high frequencies
- Native ComfyUI Integration: Works seamlessly with Load Audio, Preview Audio, and Save Audio nodes
- Built-in Spectrogram Visualization: Before/after comparison with time and frequency axes
- Automatic Sample Rate Handling: Accepts any input sample rate (8kHz - 48kHz)
- Stereo Support: Processes both mono and stereo audio with independent channel handling
- Long Audio Support: Smart chunking with overlap for unlimited audio length
- Model Caching: Model stays in memory for fast subsequent generations
- torch.compile Optimization: Optional PyTorch compilation for 20-30% speed boost (FP32 models)
- VRAM Management: Optional model unloading to free GPU memory between runs
- Interruptible: Cancel processing mid-run through ComfyUI's interrupt button
- Progress Reporting: Real-time progress bar shows chunk processing status
The node supports multiple attention computation backends for optimal performance:
| Backend | Speed | Requirements | Best For |
|---|---|---|---|
| sdpa | Fast | PyTorch 2.0+ | Default, most compatible |
| sageattn | Fastest | fp16/bf16 dtype, pip install sageattention | RTX 30/40 series, maximum speed |
| eager | Slowest | None | Debugging, maximum compatibility |
SageAttention: GPU-optimized attention kernels with automatic architecture detection (SM80+ required). Falls back gracefully if not installed or incompatible.
| Dtype | VRAM | Speed | Notes |
|---|---|---|---|
| fp32 | Higher | Baseline | Default, most accurate |
| fp16 | Lower | Faster | Requires GPU with good FP16 support |
| bf16 | Lower | Fastest | Best on RTX 30/40 series (Ampere+) |
Note: SageAttention requires fp16 or bf16 dtype. Selecting fp32 with SageAttention auto-falls back to sdpa.
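The fallback rules above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the node's actual code: the function name, its arguments, and the reason strings are hypothetical.

```python
def resolve_attention_backend(requested, dtype, sage_installed, compute_capability):
    """Return (backend, reason) following the fallback rules described above.

    compute_capability is a (major, minor) tuple, e.g. (8, 6) for an SM86 GPU.
    """
    if requested != "sageattn":
        return requested, "used as requested"
    if not sage_installed:
        return "sdpa", "sageattention package is not installed"
    if dtype not in ("fp16", "bf16"):
        return "sdpa", "sageattn requires fp16 or bf16"
    if compute_capability < (8, 0):
        return "sdpa", "GPU architecture is below SM80"
    return "sageattn", "all requirements met"


# fp32 with sageattn falls back to sdpa, as noted above
print(resolve_attention_backend("sageattn", "fp32", True, (8, 6)))
print(resolve_attention_backend("sageattn", "bf16", True, (8, 6)))
```

The order of the checks is an assumption; the node may report a different reason when several conditions fail at once.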
The node includes an optional use_torch_compile toggle that applies PyTorch's torch.compile() optimization to the model for faster inference.
Speed Boost: After an initial compilation overhead (~10-30 seconds on first run), you'll see:
- 20-30% faster inference for subsequent generations
- Best performance with FP32 models (recommended for torch.compile)
- Grows more valuable with repeated processing (cached compiled model)
When to Use:
- Processing multiple audio files in a session
- Longer audio requiring multiple chunks
- When speed is critical and you can wait for the initial compilation
- Recommended: Enable if using FP32 models and processing multiple clips
Caveats:
- Only works with FP32 models (will skip compilation for FP16/FP8 models)
- First generation takes longer due to compilation overhead
- Not recommended for one-off processing
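To judge whether the compilation overhead pays off, a rough break-even estimate helps. The sketch below assumes that "20-30% faster" means each run takes 20-30% less time; the per-clip timing used in the example is hypothetical, not a measurement.

```python
import math


def break_even_runs(compile_overhead_s, baseline_run_s, speedup_fraction):
    """Number of runs after which the one-time compile cost has been repaid."""
    saved_per_run = baseline_run_s * speedup_fraction
    if saved_per_run <= 0:
        raise ValueError("baseline_run_s and speedup_fraction must be positive")
    return math.ceil(compile_overhead_s / saved_per_run)


# Hypothetical numbers: 20 s compile time, 40 s per clip uncompiled, 25% faster
print(break_even_runs(20.0, 40.0, 0.25))  # prints 2
```

With short clips the saving per run is small, so more runs are needed before compilation is worthwhile, which matches the advice against enabling it for one-off processing.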
The node has been tuned with performance-focused default values:
| Parameter | Default | Previous | Performance Impact |
|---|---|---|---|
| chunk_size | 15.0s | 5.12s | Fewer chunks = ~60% faster for long audio |
| overlap | 0.0s | 0.04s | No overlap = faster (smoother audio with 2.0-3.0s overlap) |
| attention_backend | sdpa | sdpa | PyTorch native attention (fast and most compatible) |
Optimizations Applied:
- Removed unnecessary tensor conversions (torch.from_numpy ↔ numpy)
- Smart model caching with automatic recompilation detection
- Improved dtype detection for quantized models (FP8 → FP16 conversion)
- Safe division handling to prevent numerical errors
- Memory-efficient overlap processing with optional crossfade
- weights_only=True for safe model loading (prevents arbitrary code execution)
- Validates tensor dtypes before model conversion
All Python dependencies are installed automatically. No external tools required.
Minimum: 6GB VRAM, 12GB RAM recommended
| Original | AudioSR |
|---|---|
| speech_up_4.wav | speech_audiosr_4.wav |
| event_up_2.wav | event_audiosr_2.wav |
- Open ComfyUI Manager
- Search for "AudioSR"
- Click Install
- Restart ComfyUI
That's it! All dependencies are installed automatically.
Standard Python:
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR.git
cd ComfyUI-AudioSR
pip install -r requirements.txt
ComfyUI Portable (Windows with embedded Python):
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR
cd ComfyUI-AudioSR
..\..\..\python_embeded\python.exe -s -m pip install -r requirements.txt
Important: Models must be placed in ComfyUI/models/AudioSR/
Download from HuggingFace: https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR
Download one or both models and place them in your ComfyUI models directory:
Required Folder Structure:
ComfyUI/
└── models/
    └── AudioSR/
        ├── audiosr_basic_fp32.safetensors (for general audio)
        └── audiosr_speech_fp32.safetensors (for voice content)
Available Models (FP32):
- audiosr_basic_fp32.safetensors: General purpose (music, sound effects, etc.)
- audiosr_speech_fp32.safetensors: Optimized for voice/speech
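To confirm the files landed in the right place, a quick check like the one below can help. It is a standalone sketch, not part of the node: the function name is hypothetical, and the ComfyUI root path passed in is whatever your installation uses.

```python
from pathlib import Path

# File names taken from the folder structure described above
EXPECTED_MODELS = (
    "audiosr_basic_fp32.safetensors",
    "audiosr_speech_fp32.safetensors",
)


def find_audiosr_models(comfyui_root):
    """Return (found, missing) lists of model file names under models/AudioSR."""
    model_dir = Path(comfyui_root) / "models" / "AudioSR"
    found = [name for name in EXPECTED_MODELS if (model_dir / name).is_file()]
    missing = [name for name in EXPECTED_MODELS if name not in found]
    return found, missing


if __name__ == "__main__":
    found, missing = find_audiosr_models("ComfyUI")
    print("found:", found)
    print("missing:", missing)
```

Only one of the two models is required, so a single entry under "missing" is fine if you do not need that variant.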
| Configuration | VRAM Usage | Recommended For |
|---|---|---|
| Standard | ~6GB | RTX 3060+ (6GB+) |
| With unload_model enabled | ~0.5GB (when idle) | Systems with limited VRAM |
Minimum: 6GB VRAM required, 8GB+ recommended
Overview & Parameters
Upscale audio to 48kHz using the AudioSR latent diffusion model. The model analyzes low-quality audio and generates enhanced high-frequency details for a cleaner, fuller sound.
What it does:
- Resamples audio to 48kHz (if needed)
- Enhances high frequencies and adds clarity
- Reduces artifacts from low-bitrate compression
- Works on any audio: music, speech, sound effects
Use Cases:
- Upsample old/low-quality audio recordings
- Enhance compressed audio (MP3, low-bitrate streams)
- Improve audio for video production
- Restore archived audio content
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| audio | AUDIO | - | - | Audio input from Load Audio node |
| ddim_steps | INT | 10-500 | 50 | Number of denoising steps (higher = better quality, slower) |
| guidance_scale | FLOAT | 1.0-20.0 | 3.5 | CFG scale - higher = more faithful to input |
| seed | INT | 0-4.29B | 0 | Random seed (0 = random) |
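The "0 = random" convention for seed can be pictured as follows. This is an illustrative sketch only: the helper name is hypothetical, and the upper bound of 2**32 - 1 is an assumption based on the documented "4.29B" limit.

```python
import random

MAX_SEED = 2**32 - 1  # assumed upper bound, roughly 4.29 billion


def resolve_seed(seed):
    """Return the seed to use: a fresh random one when seed is 0, else the given value."""
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    if seed == 0:
        return random.randint(1, MAX_SEED)
    return seed


print(resolve_seed(1234))  # prints 1234
print(resolve_seed(0))     # a different value on each run
```

To reproduce a result, use any fixed non-zero seed together with the same model and settings.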
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| model | COMBO | - | basic | Model file from ComfyUI/models/AudioSR/ (supports .bin, .safetensors) |
| chunk_size | FLOAT | 2.56-30.0 | 15.0 | Chunk duration in seconds (for audio >10.24s) |
| overlap | FLOAT | 0.0-5.0 | 0.0 | Overlap in seconds between chunks (helps smooth transitions, 2.0-3.0 recommended for long audio) |
| unload_model | BOOLEAN | - | False | Free VRAM after generation (slower next run) |
| show_spectrogram | BOOLEAN | - | True | Generate before/after spectrogram image |
| attention_backend | COMBO | - | sdpa | Attention: sdpa (fast), sageattn (fastest, requires fp16/bf16), eager (compatible) |
| dtype | COMBO | - | fp32 | Compute dtype: fp32 (default), fp16 (faster), bf16 (RTX 30/40 series) |
| use_torch_compile | BOOLEAN | - | False | Use torch.compile() for 20-30% speed boost (FP32 only, requires warmup) |
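When driving the node from a script, it can be useful to check settings against the documented ranges before queueing. The sketch below copies ranges and defaults from the tables above; the function name is hypothetical, and the "overlap smaller than chunk_size" rule is an added sanity check, not documented node behavior.

```python
# Ranges and defaults copied from the parameter tables above
RANGES = {
    "ddim_steps": (10, 500),
    "guidance_scale": (1.0, 20.0),
    "chunk_size": (2.56, 30.0),
    "overlap": (0.0, 5.0),
}
DEFAULTS = {"ddim_steps": 50, "guidance_scale": 3.5, "chunk_size": 15.0, "overlap": 0.0}


def check_settings(**overrides):
    """Merge overrides into the defaults and list any out-of-range values."""
    settings = {**DEFAULTS, **overrides}
    problems = []
    for name, (low, high) in RANGES.items():
        value = settings[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    # Assumption: an overlap as long as the chunk itself cannot make progress
    if settings["overlap"] >= settings["chunk_size"]:
        problems.append("overlap must be smaller than chunk_size")
    return settings, problems


settings, problems = check_settings(ddim_steps=100, overlap=2.0)
print(settings)
print(problems)  # prints []
```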
| Output | Type | Description |
|---|---|---|
| audio | AUDIO | Upscaled audio at 48kHz (connect to Preview/Save) |
| spectrogram | IMAGE | Before/after spectrogram comparison (optional) |
Workflow Examples
Load Audio → AudioSR → Preview Audio / Save Audio
- Add Load Audio node and select your audio file
- Add AudioSR node
- Connect audio output to AudioSR input
- Set ddim_steps: 50 (default)
- Set guidance_scale: 3.5 (default)
- Connect AudioSR audio output to Preview Audio or Save Audio
- Queue and generate!
Settings:
- ddim_steps: 100
- guidance_scale: 5.0
- model: speech (for voice content)
- show_spectrogram: True
Settings:
- unload_model: True
- ddim_steps: 50 (default)
This frees VRAM after each generation but makes subsequent runs slower.
For audio longer than ~10 seconds, the node automatically:
- Splits audio into chunks
- Processes each chunk
- Crossfades overlap regions
- Stitches into seamless output
Recommended settings for long audio:
- chunk_size: 15.0 (seconds per chunk, the node default)
- overlap: 2.0 (seconds of overlap between chunks; the node default is 0.0)
For faster processing with more VRAM:
- chunk_size: 20.0-30.0 (fewer chunks = faster)
- overlap: 2.0-3.0 (smoother transitions)
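To see how chunk_size and overlap affect the number of chunks, the splitting step can be sketched as below. This is an illustration under simple assumptions (each chunk starts chunk_size minus overlap seconds after the previous one), not the node's actual splitting code, and the function name is hypothetical.

```python
def chunk_boundaries(total_s, chunk_size=15.0, overlap=0.0):
    """Return (start, end) times in seconds; neighbours share `overlap` seconds."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_size, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


# One minute of audio with the default settings: 4 chunks
print(len(chunk_boundaries(60.0)))             # prints 4
# Adding 2 s of overlap costs one extra chunk
print(len(chunk_boundaries(60.0, 15.0, 2.0)))  # prints 5
```

Larger chunks reduce the chunk count, while overlap increases it slightly in exchange for smoother joins.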
Detailed Parameter Guide
Number of denoising steps during generation.
| Value | Quality | Speed | Use Case |
|---|---|---|---|
| 10-30 | Lower | Fast | Quick previews |
| 50 | Good | Medium | Default recommendation |
| 100 | Better | Slow | High-quality output |
| 200+ | Best | Very Slow | Maximum quality |
Classifier-free guidance scale. Controls how closely the output follows the input.
| Value | Effect |
|---|---|
| 1.0-2.0 | More creative/variant |
| 3.0-4.0 | Balanced (default) |
| 5.0-8.0 | More faithful to input |
| 10.0+ | Very conservative (may sound artificial) |
Chunk duration in seconds for long audio processing.
| Value | Effect |
|---|---|
| 2.56-5.0 | More chunks, slower, less memory per chunk |
| 15.0 | Default (from main repo, balanced) |
| 20.0-30.0 | Fewer chunks, faster, more memory per chunk |
Overlap duration in seconds between chunks.
| Value | Effect |
|---|---|
| 0.0 | No overlap (may have seams) |
| 2.0 | Recommended (smoother transitions) |
| 3.0-5.0 | Smoothest stitching, slower processing |
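The reason overlap smooths transitions is that samples covered by two chunks are blended rather than cut. A minimal overlap-add sketch with weight tracking and safe division is shown below; the linear fade shape and all names are assumptions for illustration, and the node's actual crossfade may differ.

```python
def stitch(chunks, starts, total_len, overlap_len):
    """Overlap-add chunks (lists of samples) using linear fade weights.

    starts[i] is the sample index where chunks[i] begins in the output.
    """
    out = [0.0] * total_len
    weight = [0.0] * total_len
    for chunk, start in zip(chunks, starts):
        n = len(chunk)
        for i, sample in enumerate(chunk):
            pos = start + i
            if pos >= total_len:
                break  # ignore anything past the end of the output
            w = 1.0
            if overlap_len > 0:
                if start > 0 and i < overlap_len:
                    # fade in at the head of every chunk except the first
                    w = (i + 1) / (overlap_len + 1)
                if start + n < total_len and i >= n - overlap_len:
                    # fade out at the tail of every chunk except the last
                    w = min(w, (n - i) / (overlap_len + 1))
            out[pos] += w * sample
            weight[pos] += w
    # Safe division: positions that no chunk covered stay at 0.0
    return [o / w if w > 0 else 0.0 for o, w in zip(out, weight)]


# Two 6-sample chunks sharing 2 samples: the join ramps from 1.0 to 3.0
print(stitch([[1.0] * 6, [3.0] * 6], [0, 4], 10, 2))
```

Dividing by the accumulated weight keeps the amplitude consistent across the join, which is the role of the weight tracking described in the changelog.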
Enable PyTorch's torch.compile() optimization for faster inference.
| Value | Effect |
|---|---|
| False | Default - Standard inference |
| True | 20-30% faster after compilation (FP32 models only) |
Notes:
- First run takes ~10-30s longer for compilation
- Only effective with FP32 models (recommended: audiosr_basic_fp32.safetensors, audiosr_speech_fp32.safetensors)
- Best for batch processing or repeated use
- Model auto-recompiles if toggle state changes
Troubleshooting
Symptom: "Model not found. Please download the AudioSR model..."
Solution:
- Download model from: https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR
- Place in ComfyUI/models/AudioSR/
- Restart ComfyUI
Symptom: RuntimeError: CUDA out of memory
Solutions:
- Enable unload_model: True
- Close other GPU-intensive applications
- Reduce chunk_size (try 2.56)
- Use a smaller audio file
Symptom: Output sounds distorted or artificial
Solutions:
- Lower guidance_scale (try 2.5-3.0)
- Reduce ddim_steps (too high can sound artificial)
- Try the other model variant (basic ↔ speech)
- Ensure input audio is reasonably clean
Symptom: Node completes but no audio is produced
Solutions:
- Check ComfyUI console for error messages
- Verify audio input is properly connected
- Ensure output is connected to Preview/Save Audio node
- Try a different audio file
Symptom: Spectrogram output is empty or shows noise
Solutions:
- Ensure show_spectrogram: True
- Install matplotlib: pip install matplotlib
- Check ComfyUI console for errors
Symptom: Node takes very long to process
Solutions:
- Disable unload_model (keeps model cached)
- Enable use_torch_compile (20-30% faster after warmup, FP32 models only)
- Increase chunk_size (fewer chunks = faster)
- Ensure GPU is being used (not CPU)
- Try ddim_steps: 30 for faster processing
Symptom: use_torch_compile enabled but no speedup or message says compilation failed
Solutions:
- Ensure using FP32 model (*_fp32.safetensors)
- Check PyTorch version (2.0+ required for torch.compile)
- First run always includes compilation overhead, so try a second run
- Check console for error messages (some operations are not supported)
Symptom: sageattn selected but falling back to sdpa
Solutions:
- Install SageAttention: pip install sageattention
- Ensure dtype is set to fp16 or bf16 (fp32 auto-falls back)
- Check GPU architecture (requires SM80+ = RTX 30 series or newer)
- Check console message for specific fallback reason
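The first three checks can be run in one go with a small diagnostic script. This is a sketch rather than something shipped with the node: the function name and report keys are hypothetical, and it only uses importlib plus the standard torch.cuda queries. Run it with the same Python that ComfyUI uses.

```python
import importlib.util


def sageattn_environment_report():
    """Collect the facts that decide whether the sageattn backend can be used."""
    report = {
        "sageattention_installed": importlib.util.find_spec("sageattention") is not None,
        "cuda_available": False,
        "compute_capability": None,
        "sm80_or_newer": False,
    }
    try:
        import torch
    except Exception:
        # torch missing or broken: leave the GPU fields at their defaults
        return report
    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)  # e.g. (8, 6)
        report["cuda_available"] = True
        report["compute_capability"] = tuple(capability)
        report["sm80_or_newer"] = tuple(capability) >= (8, 0)
    return report


if __name__ == "__main__":
    for key, value in sageattn_environment_report().items():
        print(f"{key}: {value}")
```

If sageattention_installed and sm80_or_newer are both True, the remaining cause is usually the dtype setting.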
Symptom: Model recompiles every time, defeating performance gains
Causes:
- Switching use_torch_compile on/off forces model reload
- Switching between FP32 and FP16 models
- ComfyUI restart (model cache cleared)
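These causes follow from how a keyed model cache behaves: any change to the settings that identify the loaded model invalidates it. The class below is a simplified sketch of that idea, not the node's implementation; the class name, the cache key, and the loader callback are all hypothetical.

```python
class ModelCache:
    """Keep one loaded model and reload only when its identifying settings change."""

    def __init__(self, loader):
        self._loader = loader  # callable(model_name, dtype, use_torch_compile)
        self._key = None
        self._model = None
        self.loads = 0         # counts how often the loader actually ran

    def get(self, model_name, dtype, use_torch_compile):
        key = (model_name, dtype, use_torch_compile)
        if key != self._key:
            # A changed model, dtype, or compile toggle forces a fresh load
            self._model = self._loader(*key)
            self._key = key
            self.loads += 1
        return self._model

    def unload(self):
        """Drop the cached model, as unload_model does after a run."""
        self._key = None
        self._model = None


def fake_loader(model_name, dtype, use_torch_compile):
    # Stand-in for real model loading and compilation
    return {"model": model_name, "dtype": dtype, "compiled": use_torch_compile}


cache = ModelCache(fake_loader)
cache.get("basic", "fp32", True)
cache.get("basic", "fp32", True)   # cache hit, no reload
cache.get("basic", "fp32", False)  # toggle changed, reload
print(cache.loads)  # prints 2
```

Keeping model, dtype, and use_torch_compile fixed within a session avoids repeated loads and recompilation.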
Symptom: Periodic volume drops or glitches in long audio (every 10-30 seconds depending on chunk_size)
Causes:
- The model internally pads audio to 5.12s multiples, causing output length to differ from input
- Improper chunk positioning when stitching outputs together
Solutions (v1.0.6+):
- This issue is now fixed automatically
- If using older version, update to v1.0.6 or later
- For smoothest transitions, use overlap: 2.0-3.0 when processing long audio
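The arithmetic behind this bug is easy to reproduce. The sketch below shows how padding to 5.12s multiples makes each output chunk longer than its input, and how placing chunks by output length drifts away from placing them by input boundaries. The helper names are hypothetical and this is an illustration of the described cause, not the actual fix.

```python
import math

MODEL_MULTIPLE_S = 5.12  # the model pads audio to multiples of this length


def padded_duration(duration_s, multiple_s=MODEL_MULTIPLE_S):
    """Length of the model output for an input chunk of duration_s seconds."""
    return math.ceil(duration_s / multiple_s) * multiple_s


def chunk_offsets(chunk_durations):
    """Return start offsets computed from input lengths and from padded output lengths."""
    by_input, by_output = [], []
    t_in = t_out = 0.0
    for duration in chunk_durations:
        by_input.append(t_in)
        by_output.append(t_out)
        t_in += duration
        t_out += padded_duration(duration)
    return by_input, by_output


# A 15 s chunk is padded to 3 x 5.12 = 15.36 s, about 0.36 s of drift per chunk
by_input, by_output = chunk_offsets([15.0, 15.0, 15.0])
print(by_input)
print([round(t, 2) for t in by_output])
```

Positioning by input boundaries and trimming each output back to its input length keeps every chunk aligned with the original timeline.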
Spectrogram Details
The node generates a before/after spectrogram comparison when show_spectrogram: True:
- Top panel: Input audio (before), showing limited high frequencies
- Bottom panel: Output audio (after), showing enhanced frequency content
The spectrogram uses the magma colormap:
- Purple/Black: Low energy (silence/quiet)
- Red/Orange: Medium energy
- Yellow: High energy (loud frequencies)
Axes:
- X-axis: Time in seconds
- Y-axis: Frequency in Hz (0-24kHz visible range)
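The 0-24kHz range on the Y-axis follows from the Nyquist limit: a signal sampled at a given rate can only carry frequencies up to half that rate. A short worked example:

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a signal at this sample rate can represent."""
    return sample_rate_hz / 2


for rate in (8000, 16000, 44100, 48000):
    print(f"{rate} Hz sample rate -> content up to {nyquist_hz(rate):.0f} Hz")
```

So a 16kHz input has nothing above 8kHz in the top panel, while the 48kHz output in the bottom panel can show content up to 24kHz.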
- Input Quality: The model can't create what isn't there; extremely low-quality audio may still sound artificial
- Guidance Scale: Start at 3.5 and adjust based on results
- Steps: 50 steps is usually sufficient; 100+ for critical applications
- Speech Audio: Use the audiosr_speech_fp32.safetensors model for voice content
- Music/General: Use the audiosr_basic_fp32.safetensors model for music and sound effects
- Long Audio: Let the auto-chunking handle files >10 seconds
- VRAM: Enable unload_model if you need GPU memory for other tasks
- Speed Optimization: Enable use_torch_compile when processing multiple files or long audio (FP32 models only)
- Maximum Speed: Use attention_backend: sageattn + dtype: bf16 on RTX 30/40 series (install: pip install sageattention)
- Smooth Transitions: Use overlap: 2.0 for seamless chunk stitching on long audio
- Chunk Size: Default 15s is optimized; increase to 20-30s for fewer chunks if VRAM allows
Original Research: AudioSR: Versatile Audio Super-Resolution by Haohe Liu et al.
Original Implementation: versatile_audio_super_resolution by Haohe Liu
ComfyUI Integration: This custom node implementation
License: MIT (same as original AudioSR project)
- Fixed tensor dimension mismatch: Small chunks (<5.12s) now get padded to minimum size before processing, then trimmed to original length. Fixes "Expected size 64 but got size 63" errors on audio with uneven chunk splits.
- SageAttention support: Added sageattn attention backend with GPU-arch auto-detection (SM80+). Falls back to sdpa if not installed or incompatible.
- Dtype selector: New dtype dropdown (fp32/fp16/bf16) to control compute precision. SageAttention auto-falls back to sdpa when fp32 is selected.
- Added .gitignore for __pycache__ and common Python artifacts
- Fixed chunk positioning bug: Model internally pads audio to 5.12s multiples, causing output length to differ from input. Fixed by positioning chunks based on INPUT boundaries rather than model output length. This eliminates volume drops/glitches at chunk boundaries (30s, 60s, etc.) in long audio files.
- Fixed overlap-add normalization: Improved chunk stitching with proper weight tracking for overlap regions. Ensures consistent amplitude across chunk boundaries when using overlap > 0.
- Improved handling of stereo audio processing with independent channel reconstruction.
- Model directory and file scanning now cached (no repeated folder scans on browser refresh/reload)
- Removed verbose console messages on tab refresh ("Checking path", "Found model", etc.)
- Improved startup performance by avoiding redundant file system operations
- Minor fixes and improvements
- Added use_torch_compile toggle for 20-30% speed boost (FP32 models)
- Optimized default chunk_size to 15.0s (was 5.12s) for faster long audio processing
- Optimized default overlap to 0.0s (was 0.04s) with configurable 2-3s recommended for smooth transitions
- Removed unnecessary tensor conversions for faster inference
- Improved dtype detection for quantized model support (FP8 → FP16 auto-conversion)
- Security fix: Added weights_only=True for safe model loading
- Smart model caching with automatic recompilation detection
- Safe division handling in overlap normalization
- Added HuggingFace model link for easy access
- Improved documentation and examples
- Native ComfyUI AUDIO type support
- Automatic sample rate conversion (any input rate → 48kHz)
- Stereo audio processing
- Longer audio support with smart chunking
- Before/after spectrogram visualization
- Progress reporting and interrupt support
- Model caching and optional VRAM unloading
- Time and frequency axes on spectrograms