ComfyUI AudioSR (Versatile Audio Super Resolution)

Native ComfyUI node for AudioSR (Versatile Audio Super Resolution) - Upscale any audio to 48kHz using state-of-the-art latent diffusion.

Based on the original AudioSR implementation by Haohe Liu et al.

🎯 Key Features

🎧 Audio Super Resolution: Upsample low-quality audio to 48kHz with enhanced high frequencies
🎛️ Native ComfyUI Integration: Works seamlessly with Load Audio, Preview Audio, and Save Audio nodes
📊 Built-in Spectrogram Visualization: Before/after comparison with time and frequency axes
🔄 Automatic Sample Rate Handling: Accepts any input sample rate (8kHz - 48kHz)
🧩 Stereo Support: Processes both mono and stereo audio with independent channel handling
📏 Long Audio Support: Smart chunking with overlap for unlimited audio length
⚡ Model Caching: Model stays in memory for fast subsequent generations
🚀 torch.compile Optimization: Optional PyTorch compilation for 20-30% speed boost (FP32 models)
🎛️ VRAM Management: Optional model unloading to free GPU memory between runs
⏸️ Interruptible: Cancel processing mid-run through ComfyUI's interrupt button
📈 Progress Reporting: Real-time progress bar shows chunk processing status

⚡ Performance Optimizations

Attention Backends

The node supports multiple attention computation backends for optimal performance:

Backend	Speed	Requirements	Best For
sdpa	Fast	PyTorch 2.0+	Default, most compatible
sageattn	Fastest	fp16/bf16 dtype, `pip install sageattention`	RTX 30/40 series, maximum speed
eager	Slowest	None	Debugging, maximum compatibility

SageAttention: GPU-optimized attention kernels with automatic architecture detection (SM80+ required). Falls back gracefully if not installed or incompatible.

Compute Dtypes

Dtype	VRAM	Speed	Notes
fp32	Higher	Baseline	Default, most accurate
fp16	Lower	Faster	Requires GPU with good FP16 support
bf16	Lower	Fastest	Best on RTX 30/40 series (Ampere+)

Note: SageAttention requires fp16 or bf16 dtype. Selecting fp32 with SageAttention auto-falls back to sdpa.
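The fallback rules above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the node's actual code: the function name `resolve_attention_backend` and its arguments are hypothetical, and the GPU check is reduced to a compute-capability major version (SM80 corresponds to major version 8).

```python
from typing import Tuple


def resolve_attention_backend(
    requested: str,
    dtype: str,
    sage_installed: bool,
    sm_major: int,
) -> Tuple[str, str]:
    """Return (backend_to_use, reason). Hypothetical helper for illustration."""
    if requested != "sageattn":
        # sdpa and eager have no extra requirements in this sketch.
        return requested, "as requested"
    if not sage_installed:
        return "sdpa", "sageattention package not installed"
    if dtype not in ("fp16", "bf16"):
        return "sdpa", "sageattn requires fp16 or bf16"
    if sm_major < 8:
        return "sdpa", "GPU architecture below SM80"
    return "sageattn", "all requirements met"


if __name__ == "__main__":
    print(resolve_attention_backend("sageattn", "fp32", True, 8))
    print(resolve_attention_backend("sageattn", "bf16", True, 8))
```

Returning the reason alongside the backend mirrors the console message the node is described as printing when it falls back.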

torch.compile Mode

The node includes an optional use_torch_compile toggle that applies PyTorch's torch.compile() optimization to the model for faster inference.

Speed Boost: After an initial compilation overhead (~10-30 seconds on first run), you'll see:

20-30% faster inference for subsequent generations
Best performance with FP32 models (recommended for torch.compile)
Grows more valuable with repeated processing (cached compiled model)

When to Use:

Processing multiple audio files in a session
Longer audio requiring multiple chunks
When speed is critical and you can wait for the initial compilation
Recommended: Enable if using FP32 models and processing multiple clips

Caveats:

Only works with FP32 models (will skip compilation for FP16/FP8 models)
First generation takes longer due to compilation overhead
Not recommended for one-off processing
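To judge whether the compilation overhead pays off, a rough break-even estimate can help. The sketch below assumes "20-30% faster" means each run takes 20-30% less time, which is an interpretation rather than a measured fact; `break_even_runs` is a hypothetical helper and the numbers in the usage lines are illustrative only.

```python
import math


def break_even_runs(
    compile_overhead_s: float,
    baseline_run_s: float,
    time_saved_fraction: float,
) -> int:
    """Number of runs after which the time saved covers the one-time compile cost."""
    saved_per_run = baseline_run_s * time_saved_fraction
    if saved_per_run <= 0:
        raise ValueError("a compiled run must save some time to ever break even")
    return math.ceil(compile_overhead_s / saved_per_run)


if __name__ == "__main__":
    # Illustrative numbers: 20 s compile overhead, 40 s per uncompiled run,
    # 25% of the run time saved once compiled.
    print(break_even_runs(20.0, 40.0, 0.25))
```

With short clips the per-run saving is small, so many more runs are needed before compilation is worthwhile, which matches the advice against using it for one-off processing.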

Optimized Defaults

The node has been tuned with performance-focused default values:

Parameter	Default	Previous	Performance Impact
chunk_size	15.0s	5.12s	Fewer chunks = ~60% faster for long audio
overlap	0.0s	0.04s	No overlap = faster (smoother audio with 2.0-3.0s overlap)
attention_backend	sdpa	sdpa	PyTorch native attention (fastest option that needs no extra install)
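The effect of chunk_size and overlap on the amount of work can be estimated by counting chunks. This sketch assumes consecutive chunks start every `chunk_size - overlap` seconds, which is a common scheme but not confirmed to be exactly what the node does; `count_chunks` is a hypothetical helper.

```python
import math


def count_chunks(duration_s: float, chunk_size_s: float, overlap_s: float = 0.0) -> int:
    """Estimate how many chunks are needed to cover duration_s seconds of audio."""
    if overlap_s >= chunk_size_s:
        raise ValueError("overlap must be smaller than chunk_size")
    if duration_s <= chunk_size_s:
        return 1
    step = chunk_size_s - overlap_s  # distance between consecutive chunk starts
    return math.ceil((duration_s - overlap_s) / step)


if __name__ == "__main__":
    for chunk, overlap in ((5.12, 0.04), (15.0, 0.0), (15.0, 2.0)):
        n = count_chunks(60.0, chunk, overlap)
        print(f"60 s audio, chunk_size={chunk}, overlap={overlap}: {n} chunks")
```

Fewer chunks means fewer diffusion passes, while a non-zero overlap adds chunks back in exchange for smoother transitions.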

Optimizations Applied:

Removed unnecessary tensor conversions (torch.from_numpy→numpy)
Smart model caching with automatic recompilation detection
Improved dtype detection for quantized models (FP8 → FP16 conversion)
Safe division handling to prevent numerical errors
Memory-efficient overlap processing with optional crossfade

Security Improvements

weights_only=True for safe model loading (prevents arbitrary code execution)
Validates tensor dtypes before model conversion

Requirements

All Python dependencies are installed automatically. No external tools required.

Minimum: 6GB VRAM; 12GB system RAM recommended

Audio Examples

Original	AudioSR
speech_up_4.wav	speech_audiosr_4.wav
event_up_2.wav	event_audiosr_2.wav

📦 Installation

Method 1: ComfyUI Manager (Recommended)

Open ComfyUI Manager
Search for "AudioSR"
Click Install
Restart ComfyUI

That's it! All dependencies are installed automatically.

Method 2: Manual Installation

Standard Python:

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR.git
cd ComfyUI-AudioSR
pip install -r requirements.txt

ComfyUI Portable (Windows with embedded Python):

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-AudioSR
cd ComfyUI-AudioSR
..\..\..\python_embeded\python.exe -s -m pip install -r requirements.txt

📥 Download Models

Important: Models must be placed in ComfyUI/models/AudioSR/

Download from HuggingFace: https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR

Download one or both models and place them in your ComfyUI models directory:

Required Folder Structure:

ComfyUI/
└── models/
    └── AudioSR/
        ├── audiosr_basic_fp32.safetensors (for general audio)
        └── audiosr_speech_fp32.safetensors (for voice content)

Available Models (FP32):

audiosr_basic_fp32.safetensors - General purpose (music, sound effects, etc.)
audiosr_speech_fp32.safetensors - Optimized for voice/speech
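To confirm the files are where the node expects them, a quick check like the following may help. It is a minimal sketch that only uses the folder layout and file extensions documented above; `find_audiosr_models` is a hypothetical helper and is not part of the node.

```python
from pathlib import Path
from typing import List

# Extensions the node is documented to accept.
MODEL_EXTENSIONS = {".safetensors", ".bin"}


def find_audiosr_models(comfyui_root: str) -> List[str]:
    """List model file names found in <comfyui_root>/models/AudioSR/."""
    model_dir = Path(comfyui_root) / "models" / "AudioSR"
    if not model_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in model_dir.iterdir()
        if p.is_file() and p.suffix.lower() in MODEL_EXTENSIONS
    )


if __name__ == "__main__":
    # "ComfyUI" is assumed to be the install folder relative to the current directory.
    print(find_audiosr_models("ComfyUI") or "No AudioSR models found")
```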

💻 VRAM Requirements

Configuration	VRAM Usage	Recommended For
Standard	~6GB	RTX 3060+ (6GB+)
With unload_model enabled	~0.5GB (when idle)	Systems with limited VRAM

Minimum: 6GB VRAM required, 8GB+ recommended

🎚️ AudioSR Node

Overview

Upscale audio to 48kHz using the AudioSR latent diffusion model. The model analyzes low-quality audio and generates enhanced high-frequency details for a cleaner, fuller sound.

What it does:

Resamples audio to 48kHz (if needed)
Enhances high frequencies and adds clarity
Reduces artifacts from low-bitrate compression
Works on any audio: music, speech, sound effects

Use Cases:

Upsample old/low-quality audio recordings
Enhance compressed audio (MP3, low-bitrate streams)
Improve audio for video production
Restore archived audio content

Parameters

Required Inputs

Parameter	Type	Range	Default	Description
audio	AUDIO	-	-	Audio input from Load Audio node
ddim_steps	INT	10-500	50	Number of denoising steps (higher = better quality, slower)
guidance_scale	FLOAT	1.0-20.0	3.5	CFG scale - higher = more faithful to input
seed	INT	0-4.29B	0	Random seed (0 = random)

Optional Inputs

Parameter	Type	Range	Default	Description
model	COMBO	-	basic	Model file from `ComfyUI/models/AudioSR/` (supports .bin, .safetensors)
chunk_size	FLOAT	2.56-30.0	15.0	Chunk duration in seconds (for audio >10.24s)
overlap	FLOAT	0.0-5.0	0.0	Overlap in seconds between chunks (helps smooth transitions, 2.0-3.0 recommended for long audio)
unload_model	BOOLEAN	-	False	Free VRAM after generation (slower next run)
show_spectrogram	BOOLEAN	-	True	Generate before/after spectrogram image
attention_backend	COMBO	-	sdpa	Attention: sdpa (fast), sageattn (fastest, requires fp16/bf16), eager (compatible)
dtype	COMBO	-	fp32	Compute dtype: fp32 (default), fp16 (faster), bf16 (RTX 30/40 series)
use_torch_compile	BOOLEAN	-	False	Use torch.compile() for 20-30% speed boost (FP32 only, requires warmup)

Outputs

Output	Type	Description
audio	AUDIO	Upscaled audio at 48kHz (connect to Preview/Save)
spectrogram	IMAGE	Before/after spectrogram comparison (optional)
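For readers unfamiliar with how ComfyUI nodes declare inputs and outputs, the tables above map onto the usual `INPUT_TYPES` / `RETURN_TYPES` convention roughly as follows. This is an illustrative sketch, not the contents of vasr_node.py: the class name, the `upscale` method name, the category, the combo option lists, and the seed upper bound of 2**32 - 1 are all assumptions.

```python
class AudioSRNodeSketch:
    """Hypothetical declaration mirroring the documented parameters."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "audio": ("AUDIO",),
                "ddim_steps": ("INT", {"default": 50, "min": 10, "max": 500}),
                "guidance_scale": ("FLOAT", {"default": 3.5, "min": 1.0, "max": 20.0}),
                # Upper bound assumed from the documented "0-4.29B" range.
                "seed": ("INT", {"default": 0, "min": 0, "max": 2**32 - 1}),
            },
            "optional": {
                # Option lists below are placeholders for illustration.
                "model": (["basic", "speech"], {"default": "basic"}),
                "chunk_size": ("FLOAT", {"default": 15.0, "min": 2.56, "max": 30.0}),
                "overlap": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 5.0}),
                "unload_model": ("BOOLEAN", {"default": False}),
                "show_spectrogram": ("BOOLEAN", {"default": True}),
                "attention_backend": (["sdpa", "sageattn", "eager"], {"default": "sdpa"}),
                "dtype": (["fp32", "fp16", "bf16"], {"default": "fp32"}),
                "use_torch_compile": ("BOOLEAN", {"default": False}),
            },
        }

    RETURN_TYPES = ("AUDIO", "IMAGE")
    RETURN_NAMES = ("audio", "spectrogram")
    FUNCTION = "upscale"  # hypothetical method name
    CATEGORY = "audio"    # hypothetical category

    def upscale(self, audio, ddim_steps, guidance_scale, seed, **optional):
        raise NotImplementedError("sketch only: the real node runs the AudioSR model here")
```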

🎨 Workflow Examples

Basic Audio Upscaling

Load Audio → AudioSR → Preview Audio / Save Audio

Add Load Audio node and select your audio file
Add AudioSR node
Connect audio output to AudioSR input
Set ddim_steps: 50 (default)
Set guidance_scale: 3.5 (default)
Connect AudioSR audio output to Preview Audio or Save Audio
Queue and generate!

High Quality Upscaling

Settings:
- ddim_steps: 100
- guidance_scale: 5.0
- model: speech (for voice content)
- show_spectrogram: True

Low VRAM Mode

Settings:
- unload_model: True
- ddim_steps: 50 (default)

This frees VRAM after each generation but makes subsequent runs slower.

Long Audio Processing

For audio longer than ~10 seconds, the node automatically:

Splits audio into chunks
Processes each chunk
Crossfades overlap regions (when overlap > 0)
Stitches into seamless output
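The split, crossfade, and stitch steps above can be pictured as a weighted overlap-add. The sketch below is a simplified pure-Python illustration under stated assumptions: linear fades, chunks placed at positions derived from the input, and a guarded division where no chunk contributed. `stitch_chunks` and its arguments are hypothetical names, and the node's actual implementation may differ.

```python
from typing import List


def stitch_chunks(
    chunks: List[List[float]],
    starts: List[int],
    total_len: int,
    overlap: int,
) -> List[float]:
    """Overlap-add chunks (all values in samples) with linear crossfades."""
    out = [0.0] * total_len
    weight = [0.0] * total_len
    last = len(chunks) - 1
    for idx, (chunk, start) in enumerate(zip(chunks, starts)):
        n = len(chunk)
        for i, sample in enumerate(chunk):
            pos = start + i
            if pos >= total_len:
                break  # drop anything past the expected output length
            w = 1.0
            if overlap > 0 and idx > 0 and i < overlap:
                w = (i + 1) / (overlap + 1)  # fade in after a previous chunk
            if overlap > 0 and idx < last and i >= n - overlap:
                w = min(w, (n - i) / (overlap + 1))  # fade out before the next chunk
            out[pos] += w * sample
            weight[pos] += w
    # Safe division: positions that received no weight stay silent.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, weight)]


if __name__ == "__main__":
    a = [1.0] * 6
    b = [1.0] * 6
    print(stitch_chunks([a, b], starts=[0, 4], total_len=10, overlap=2))
```

Dividing by the accumulated weight keeps the amplitude constant across the overlap region even when the fades do not sum to exactly one.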

Recommended settings (chunk_size is already the node default; overlap defaults to 0.0 and must be raised manually):

- chunk_size: 15.0 (seconds per chunk)
- overlap: 2.0 (seconds overlap between chunks)

For faster processing with more VRAM:

- chunk_size: 20.0-30.0 (fewer chunks = faster)
- overlap: 2.0-3.0 (smoother transitions)

⚙️ Parameter Guide

ddim_steps (10-500)

Number of denoising steps during generation.

Value	Quality	Speed	Use Case
10-30	Lower	Fast	Quick previews
50	Good	Medium	Default recommendation
100	Better	Slow	High-quality output
200+	Best	Very Slow	Maximum quality

guidance_scale (1.0-20.0)

Classifier-free guidance scale. Controls how closely the output follows the input.

Value	Effect
1.0-2.0	More creative/variant
3.0-4.0	Balanced (default)
5.0-8.0	More faithful to input
10.0+	Very conservative (may sound artificial)

chunk_size (2.56-30.0)

Chunk duration in seconds for long audio processing.

Value	Effect
2.56-5.0	More chunks, slower, less memory per chunk
15.0	Default (from main repo, balanced)
20.0-30.0	Fewer chunks, faster, more memory per chunk

overlap (0.0-5.0)

Overlap duration in seconds between chunks.

Value	Effect
0.0	No overlap (may have seams)
2.0	Recommended (smoother transitions)
3.0-5.0	Smoothest stitching, slower processing

use_torch_compile (BOOLEAN)

Enable PyTorch's torch.compile() optimization for faster inference.

Value	Effect
False	Default - Standard inference
True	20-30% faster after compilation (FP32 models only)

Notes:

First run takes ~10-30s longer for compilation
Only effective with FP32 models (recommended: audiosr_basic_fp32.safetensors, audiosr_speech_fp32.safetensors)
Best for batch processing or repeated use
Model auto-recompiles if toggle state changes

🐛 Troubleshooting

Model Not Found Error

Symptom: "Model not found. Please download the AudioSR model..."

Solution:

Download model from: https://huggingface.co/drbaph/AudioSR/tree/main/AudioSR
Place in ComfyUI/models/AudioSR/
Restart ComfyUI

CUDA Out of Memory

Symptom: RuntimeError: CUDA out of memory

Solutions:

Enable unload_model: True
Close other GPU-intensive applications
Reduce chunk_size (try 2.56)
Use a smaller audio file

Poor Audio Quality

Symptom: Output sounds distorted or artificial

Solutions:

Lower guidance_scale (try 2.5-3.0)
Reduce ddim_steps (too high can sound artificial)
Try the other model variant (basic ↔ speech)
Ensure input audio is reasonably clean

No Output Audio

Symptom: Node completes but no audio is produced

Solutions:

Check ComfyUI console for error messages
Verify audio input is properly connected
Ensure output is connected to Preview/Save Audio node
Try a different audio file

Spectrogram Not Showing

Symptom: Spectrogram output is empty or shows noise

Solutions:

Ensure show_spectrogram: True
Install matplotlib: pip install matplotlib
Check ComfyUI console for errors

Slow Processing

Symptom: Node takes very long to process

Solutions:

Disable unload_model (keeps model cached)
Enable use_torch_compile (20-30% faster after warmup, FP32 models only)
Increase chunk_size (fewer chunks = faster)
Ensure GPU is being used (not CPU)
Try ddim_steps: 30 for faster processing

torch.compile Not Working

Symptom: use_torch_compile enabled but no speedup or message says compilation failed

Solutions:

Ensure using FP32 model (*_fp32.safetensors)
Check PyTorch version (2.0+ required for torch.compile)
First run always includes compilation overhead - try second run
Check console for error messages (some operations not supported)

SageAttention Not Working

Symptom: sageattn selected but falling back to sdpa

Solutions:

Install SageAttention: pip install sageattention
Ensure dtype is set to fp16 or bf16 (fp32 auto-falls back)
Check GPU architecture (requires SM80+ = RTX 30 series or newer)
Check console message for specific fallback reason

Model Reloaded Repeatedly

Symptom: Model recompiles every time, defeating performance gains

Causes:

Switching use_torch_compile on/off forces model reload
Switching between FP32 and FP16 models
ComfyUI restart (model cache cleared)
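The causes listed above are what one would expect from a cache keyed on the model settings. The following is a minimal sketch of that idea, not the node's actual caching code: `get_model`, the single-entry eviction policy, and the `loader` callback are all assumptions for illustration.

```python
from typing import Any, Callable, Dict, Tuple

CacheKey = Tuple[str, str, bool]
_cache: Dict[CacheKey, Any] = {}


def get_model(
    model_name: str,
    dtype: str,
    use_torch_compile: bool,
    loader: Callable[[str, str, bool], Any],
) -> Any:
    """Return a cached model, reloading only when a setting in the key changes."""
    key = (model_name, dtype, use_torch_compile)
    if key not in _cache:
        _cache.clear()  # assumed policy: keep a single model resident
        _cache[key] = loader(model_name, dtype, use_torch_compile)
    return _cache[key]
```

Under this scheme, toggling use_torch_compile or changing dtype between runs always misses the cache, so keeping those settings fixed during a session avoids repeated reloads.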

Volume Drops at Chunk Boundaries

Symptom: Periodic volume drops or glitches in long audio (every 10-30 seconds depending on chunk_size)

Causes:

The model internally pads audio to 5.12s multiples, causing output length to differ from input
Improper chunk positioning when stitching outputs together

Solutions (v1.0.6+):

This issue is now fixed automatically
If using older version, update to v1.0.6 or later
For smoothest transitions, use overlap: 2.0-3.0 when processing long audio
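The length mismatch described under Causes follows from simple arithmetic: at 48kHz, 5.12 seconds is 245,760 samples, so any chunk that is not a multiple of that length comes back longer than it went in. The sketch below illustrates the pad-then-trim idea under the assumptions that padding is done with zeros at the end and that lengths are counted at 48kHz; `padded_length` and `pad_and_trim` are hypothetical helpers.

```python
from typing import Callable, List

SAMPLE_RATE = 48_000
SEGMENT_SAMPLES = 245_760  # 5.12 s at 48 kHz


def padded_length(num_samples: int) -> int:
    """Round a length up to the next multiple of 5.12 s (at least one segment)."""
    segments = max(1, -(-num_samples // SEGMENT_SAMPLES))  # ceiling division
    return segments * SEGMENT_SAMPLES


def pad_and_trim(
    samples: List[float],
    process: Callable[[List[float]], List[float]],
) -> List[float]:
    """Zero-pad to a full segment multiple, process, then cut back to the input length."""
    original_len = len(samples)
    padded = samples + [0.0] * (padded_length(original_len) - original_len)
    return process(padded)[:original_len]


if __name__ == "__main__":
    chunk = 15 * SAMPLE_RATE  # a 15.0 s chunk
    print(chunk, "->", padded_length(chunk), "samples after padding")
```

A 15.0 s chunk (720,000 samples) is padded to 737,280 samples (15.36 s), so placing chunks by their output length would drift by 0.36 s per chunk, whereas placing them by input boundaries does not.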

📊 Spectrogram Visualization

The node generates a before/after spectrogram comparison when show_spectrogram: True:

Top panel: Input audio (before) - Shows limited high frequencies
Bottom panel: Output audio (after) - Shows enhanced frequency content

The spectrogram uses the magma colormap:

Purple/Black: Low energy (silence/quiet)
Red/Orange: Medium energy
Yellow: High energy (loud frequencies)

Axes:

X-axis: Time in seconds
Y-axis: Frequency in Hz (0-24kHz visible range)
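The 0-24kHz range on the Y-axis is the Nyquist limit of 48kHz audio: a sampled signal can only represent frequencies up to half its sample rate. The short sketch below shows the same arithmetic for a few common input rates, which is why low-rate inputs appear empty in the upper part of the "before" panel. The helper name `nyquist_hz` is just for illustration.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a signal sampled at sample_rate_hz can represent."""
    return sample_rate_hz / 2.0


if __name__ == "__main__":
    for rate in (8_000, 16_000, 44_100, 48_000):
        print(f"{rate} Hz sample rate -> content up to {nyquist_hz(rate):.0f} Hz")
```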

🔗 Links

💡 Tips for Best Results

Input Quality: The model can't create what isn't there - extremely low-quality audio may still sound artificial
Guidance Scale: Start at 3.5 and adjust based on results
Steps: 50 steps is usually sufficient; 100+ for critical applications
Speech Audio: Use the audiosr_speech_fp32.safetensors model for voice content
Music/General: Use the audiosr_basic_fp32.safetensors model for music and sound effects
Long Audio: Let the auto-chunking handle files >10 seconds
VRAM: Enable unload_model if you need GPU memory for other tasks
Speed Optimization: Enable use_torch_compile when processing multiple files or long audio (FP32 models only)
Maximum Speed: Use attention_backend: sageattn + dtype: bf16 on RTX 30/40 series (install: pip install sageattention)
Smooth Transitions: Use overlap: 2.0 for seamless chunk stitching on long audio
Chunk Size: Default 15s is optimized; increase to 20-30s for fewer chunks if VRAM allows

📚 Credits & License

Original Research: AudioSR: Versatile Audio Super-Resolution by Haohe Liu et al.

Original Implementation: versatile_audio_super_resolution by Haohe Liu

ComfyUI Integration: This custom node implementation

License: MIT (same as original AudioSR project)

📝 Changelog

Version 1.1.1

✅ Fixed tensor dimension mismatch: Small chunks (<5.12s) now get padded to minimum size before processing, then trimmed to original length. Fixes "Expected size 64 but got size 63" errors on audio with uneven chunk splits.

Version 1.1.0

✅ SageAttention support: Added sageattn attention backend with GPU-arch auto-detection (SM80+). Falls back to sdpa if not installed or incompatible.
✅ Dtype selector: New dtype dropdown (fp32/fp16/bf16) to control compute precision. SageAttention auto-falls back to sdpa when fp32 is selected.
✅ Added .gitignore for __pycache__ and common Python artifacts

Version 1.0.6

✅ Fixed chunk positioning bug: Model internally pads audio to 5.12s multiples, causing output length to differ from input. Fixed by positioning chunks based on INPUT boundaries rather than model output length. This eliminates volume drops/glitches at chunk boundaries (30s, 60s, etc.) in long audio files.
✅ Fixed overlap-add normalization: Improved chunk stitching with proper weight tracking for overlap regions. Ensures consistent amplitude across chunk boundaries when using overlap > 0.
✅ Improved handling of stereo audio processing with independent channel reconstruction.

Version 1.0.5

✅ Model directory and file scanning now cached (no repeated folder scans on browser refresh/reload)
✅ Removed verbose console messages on tab refresh ("Checking path", "Found model", etc.)
✅ Improved startup performance by avoiding redundant file system operations

Version 1.0.4

✅ Minor fixes and improvements

Version 1.0.3

✅ Added use_torch_compile toggle for 20-30% speed boost (FP32 models)
✅ Optimized default chunk_size to 15.0s (was 5.12s) for faster long audio processing
✅ Optimized default overlap to 0.0s (was 0.04s) with configurable 2-3s recommended for smooth transitions
✅ Removed unnecessary tensor conversions for faster inference
✅ Improved dtype detection for quantized model support (FP8 → FP16 auto-conversion)
✅ Security fix: Added weights_only=True for safe model loading
✅ Smart model caching with automatic recompilation detection
✅ Safe division handling in overlap normalization

Version 1.0.2

✅ Added HuggingFace model link for easy access
✅ Improved documentation and examples

Version 1.0.1

✅ Native ComfyUI AUDIO type support
✅ Automatic sample rate conversion (any input rate → 48kHz)
✅ Stereo audio processing
✅ Longer audio support with smart chunking
✅ Before/after spectrogram visualization
✅ Progress reporting and interrupt support
✅ Model caching and optional VRAM unloading
✅ Time and frequency axes on spectrograms
