SoulX-Singer is a high-fidelity, zero-shot singing voice synthesis model by SoulAI-Lab that enables users to generate realistic singing voices for unseen singers.
This ComfyUI wrapper provides native node-based integration with support for melody-conditioned (F0 contour) and score-conditioned (MIDI notes) control for precise pitch, rhythm, and expression.
Paper: SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis (arXiv:2602.07803)
- 🎤 Zero-Shot Singing – Generate voices for unseen singers with just a reference sample
- 🎵 Dual Control Modes – Melody (F0 contour) and Score (MIDI notes) conditioning
- 🔌 Native ComfyUI Integration – AUDIO noodle inputs, progress bars, interruption support
- ⚡ Optimized Performance – Support for bf16/fp32 dtypes, SDPA and SageAttention
- 📦 Smart Auto-Download – Downloads only what you need from HuggingFace
  - bf16 model + preprocessors by default (~6GB)
  - Optional fp32 model for maximum quality (~10GB total)
- 💾 Smart Caching – Optional model caching with dtype/attention change detection
- 🎹 MIDI Editor Support – Advanced node for manual metadata editing workflow
- 🔧 Improved Compatibility – Uses soundfile + scipy instead of torchaudio for better cross-platform support
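The AUDIO "noodles" mentioned above are ComfyUI's standard audio type: a dict holding a `[batch, channels, samples]` waveform plus a sample rate. A minimal sketch of that shape, with a NumPy array standing in for the torch tensor that real nodes use (the `make_audio_noodle` helper is hypothetical, for illustration only):

```python
import numpy as np

def make_audio_noodle(samples: np.ndarray, sample_rate: int) -> dict:
    """Wrap raw samples in ComfyUI's AUDIO dict convention.

    ComfyUI's AUDIO type carries a [batch, channels, samples] tensor plus a
    sample rate; real nodes use a torch.Tensor, a NumPy array stands in here
    so the sketch has no torch dependency.
    """
    if samples.ndim == 1:                 # mono -> add a channel axis
        samples = samples[np.newaxis, :]
    batched = samples[np.newaxis, :, :]   # add the batch axis
    return {"waveform": batched, "sample_rate": sample_rate}

# One second of A4 (440 Hz) as a stand-in reference clip
sr = 44100
t = np.arange(sr) / sr
noodle = make_audio_noodle(np.sin(2 * np.pi * 440.0 * t).astype(np.float32), sr)
print(noodle["waveform"].shape)  # (1, 1, 44100)
```

Any node that outputs this dict (such as the built-in `Load Audio`) can feed the synthesizer inputs.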
sample-1.mp4
sample-2.mp4
📥 Click to expand installation methods
- Open ComfyUI Manager
- Search for "SoulX-Singer"
- Click Install
- Restart ComfyUI
```
cd ComfyUI/custom_nodes
git clone --recursive https://github.com/Saganaki22/ComfyUI-SoulX-Singer.git
cd ComfyUI-SoulX-Singer
pip install -r requirements.txt
```

Note: The `--recursive` flag is important to clone the SoulX-Singer submodule.
If you cloned without `--recursive`, initialize the submodule manually:

```
cd ComfyUI/custom_nodes/ComfyUI-SoulX-Singer
git submodule init
git submodule update
pip install -r requirements.txt
```
1. **Load Model**
   - Add the `🎤 SoulX-Singer Model Loader` node
   - Select a model:
     - `SoulX-Singer_model_bf16` (default) - fast, good quality, ~2GB
     - `SoulX-Singer_model_fp32` - best quality, ~4GB
     - A "(download)" suffix means the file will be auto-downloaded on first use
   - Choose a dtype: `bf16` (default, recommended) or `fp32` (full precision)
   - Choose an attention type: `sdpa` (default) or `sageattention` (fastest, requires the sageattention package)
   - Enable `keep_loaded` to cache the model between runs
2. **Load Audio**
   - Add `Load Audio` nodes for the prompt and target audio
   - Prompt: 3-10 seconds of reference singing voice
   - Target: audio with the melody/score to synthesize
3. **Synthesize**
   - Add the `🎙️ SoulX-Singer Simple` node
   - Connect the model and audio inputs
   - Configure the languages (Mandarin/English/Cantonese)
   - Set the control mode (`melody` or `score`)
   - Adjust the synthesis parameters
   - Run!
4. **Save/Preview**
   - Connect to a `Save Audio` or `Preview Audio` node
For users who want manual control with the MIDI editor:

1. Run Simple mode once to generate the metadata files (saved in the temp folder)
2. Copy the metadata JSON files from the temp folder
3. Edit the metadata JSON files with the MIDI Editor
4. Use the `🎙️ SoulX-Singer Advanced` node with:
   - Prompt audio file
   - Prompt metadata JSON path
   - Target metadata JSON path (the edited version)
Why no target_audio in Advanced node? The target is defined entirely by metadata (lyrics, notes, timing) - the node synthesizes NEW audio from scratch rather than transforming existing audio.
📂 Click to expand file structure details
On first use, the node will automatically download required files from drbaph/SoulX-Singer:
Default Download (bf16):
- `SoulX-Singer_model_bf16.safetensors` (~1.5GB)
- `config.yaml`
- `preprocessors/` folder (~5GB)
- Total: ~6.5GB
Optional Download (fp32):
- `SoulX-Singer_model_fp32.safetensors` (~2.9GB)
- Plus the bf16 model + config + preprocessors above
- Total: ~9.5GB
Files are saved to:
ComfyUI/models/SoulX-Singer/
If auto-download fails:
```
pip install -U huggingface_hub
huggingface-cli download drbaph/SoulX-Singer --local-dir ComfyUI/models/SoulX-Singer
```

Or download manually from drbaph/SoulX-Singer and place the files in `ComfyUI/models/SoulX-Singer/`.
```
ComfyUI/
├── models/
│   └── SoulX-Singer/
│       ├── SoulX-Singer_model_bf16.safetensors   # bf16 model (~1.5GB)
│       ├── SoulX-Singer_model_fp32.safetensors   # fp32 model (~2.9GB) [optional]
│       ├── config.yaml                           # Model config
│       └── preprocessors/                        # Preprocessing models (~5GB)
│           ├── dereverb_mel_band_roformer/
│           ├── mel-band-roformer-karaoke/
│           ├── parakeet-tdt-0.6b-v2/
│           ├── rmvpe/
│           ├── rosvot/
│           └── speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/
└── custom_nodes/
    └── ComfyUI-SoulX-Singer/
        ├── __init__.py
        ├── nodes/
        │   ├── model_loader.py
        │   ├── simple_synthesizer.py
        │   └── advanced_synthesizer.py
        ├── SoulX-Singer/                         # Git submodule
        ├── requirements.txt
        └── README.md
```
✅ All nodes support symlinks! You can use symbolic links to save disk space:

Windows Example:

```
:: Link the entire models directory
mklink /D "ComfyUI\models\SoulX-Singer" "D:\MyModels\SoulX-Singer"

:: Or link just the preprocessors
mklink /D "ComfyUI\models\SoulX-Singer\preprocessors" "D:\MyModels\preprocessors"
```

Linux/Mac Example:

```
# Link the entire models directory
ln -s /path/to/your/models/SoulX-Singer ComfyUI/models/SoulX-Singer

# Or link just the preprocessors
ln -s /path/to/preprocessors ComfyUI/models/SoulX-Singer/preprocessors
```

The nodes automatically resolve symlinks and load from the actual file location.
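The symlink resolution described above amounts to normalizing each path before loading. A minimal sketch of the idea, with a hypothetical `resolve_model_dir` helper built on `pathlib` (not the wrapper's actual code):

```python
import os
import tempfile
from pathlib import Path

def resolve_model_dir(path) -> Path:
    """Follow any symlinks so files are read from their real location."""
    return Path(path).resolve()

# Demo: a symlinked directory resolves to its real target (POSIX systems)
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "real_models"
    real.mkdir()
    link = Path(tmp) / "SoulX-Singer"
    os.symlink(real, link, target_is_directory=True)
    assert resolve_model_dir(link) == real.resolve()
```

On Windows, creating symlinks (`mklink /D`) may require administrator rights or Developer Mode, but resolving them needs no special privileges.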
Loads the SVS model with configurable precision and attention.
Inputs:
- `model_name`: Model file to load
  - `SoulX-Singer_model_bf16` (default) - bf16 precision, fast, good quality
  - `SoulX-Singer_model_fp32` - fp32 precision, best quality, larger file
  - A `(download)` suffix appears if the model is not yet downloaded
  - Automatically detects all `.safetensors` and `.pt` files in `ComfyUI/models/SoulX-Singer/`
  - Supports symlinks: works with symlinked files/directories
- `dtype`: Precision - `bf16` (default, recommended) or `fp32` (full)
  - Note: fp16 was removed due to a vocoder FFT incompatibility
- `attention_type`: `sdpa` (default) or `sageattention`
  - Note: `auto`, `flash_attention`, and `eager` were removed due to compatibility issues
  - `sageattention` requires `pip install sageattention`
- `keep_loaded`: Cache the model in memory (cleared on dtype/attention change)
Outputs:
- `model`: SOULX_MODEL object
Smart Download Behavior:
- Selecting bf16 model: Downloads bf16 + config + preprocessors (~7GB)
- Selecting fp32 model: Downloads fp32 + bf16 + config + preprocessors (~11GB)
- Resume support: Interrupted downloads will resume on next attempt
Simple synthesizer with auto-preprocessing.
Inputs:
- `model`: SOULX_MODEL from the loader
- `prompt_audio`: Reference singing voice (AUDIO noodle)
- `target_audio`: Target melody/score (AUDIO noodle)
- `prompt_language`: Mandarin/English/Cantonese
- `target_language`: Mandarin/English/Cantonese
- `control_mode`: `melody` (F0 contour) or `score` (MIDI notes)
- `enable_preprocessing`: ⚠️ EXPERIMENTAL - Enable full preprocessing (default: `True`)
  - True: Full pipeline with vocal separation + F0 + transcription (for mixed audio)
  - False: Skip vocal separation; only F0 + transcription (for clean acapellas)
- `vocal_sep_prompt`: Apply vocal separation to the prompt (ignored if preprocessing is disabled)
- `vocal_sep_target`: Apply vocal separation to the target (ignored if preprocessing is disabled)
- `auto_pitch_shift`: Auto-match pitch ranges
- `pitch_shift`: Manual pitch shift (-12 to +12 semitones)
- `n_steps`: Diffusion steps (16-64, default 32)
- `cfg_scale`: CFG guidance (1.0-5.0, default 3.0)
Outputs:
- `audio`: Generated singing voice (AUDIO)
Notes:
- First run will download the preprocessing models to `ComfyUI/models/SoulX-Singer/preprocessors/` if they are not already present
- ⚠️ EXPERIMENTAL: Disabling preprocessing skips vocal separation but still extracts F0 and transcribes lyrics - use only with clean acapella vocals
Advanced synthesizer using pre-processed metadata files for manual editing workflows.
Inputs:
- `model`: SOULX_MODEL from the loader
- `prompt_audio`: Reference audio (AUDIO noodle)
- `prompt_metadata_path`: Path to the prompt JSON metadata file
- `target_metadata_path`: Path to the target JSON metadata file
- `control_mode`: `melody` (F0 contour) or `score` (MIDI notes)
- `auto_pitch_shift`: Auto-match pitch ranges
- `pitch_shift`: Manual pitch shift (-12 to +12 semitones)
- `n_steps`: Diffusion steps (16-64, default 32)
- `cfg_scale`: CFG guidance (1.0-5.0, default 3.0)
Outputs:
- `audio`: Generated singing voice (AUDIO)
📋 Click to expand Metadata JSON Structure
```json
[
  {
    "index": "vocal_0_6900",
    "language": "English",
    "time": [0, 6900],
    "duration": "0.16 0.24 0.32...",
    "text": "<SP> Hello world <SP>...",
    "phoneme": "<SP> en_HH-ER0...",
    "note_pitch": "0 68 67 65...",
    "note_type": "1 2 2 2...",
    "f0": "0.0 0.0 382.7..."
  }
]
```

Key Fields:
- `time`: Segment boundaries `[start_ms, end_ms]`
- `text`: Lyrics with `<SP>` markers for word boundaries
- `phoneme`: ARPAbet phonemes (`en_` prefix for English)
- `note_pitch`: MIDI note numbers (0 = silence, 60 = middle C)
- `note_type`: 1 = rest, 2 = sustain, 3 = attack
- `f0`: Frame-level fundamental frequency in Hz
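Because these fields are plain strings of space-separated values, metadata can also be edited programmatically instead of through the MIDI Editor. A sketch that transposes all non-rest notes (the `shift_note_pitch` helper is hypothetical, not part of the wrapper):

```python
import json

def shift_note_pitch(segments: list, semitones: int) -> list:
    """Shift every non-rest MIDI note in SoulX-Singer metadata segments.

    note_pitch is a space-separated string of MIDI numbers in which 0 marks
    silence, so zeros are left untouched.
    """
    for seg in segments:
        notes = [int(n) for n in seg["note_pitch"].split()]
        seg["note_pitch"] = " ".join(
            str(n + semitones) if n > 0 else str(n) for n in notes
        )
    return segments

# Minimal example segment (truncated fields omitted for brevity)
raw = '[{"index": "vocal_0_6900", "note_pitch": "0 68 67 65"}]'
segments = shift_note_pitch(json.loads(raw), 2)
print(segments[0]["note_pitch"])  # 0 70 69 67
```

A real edit would load the target metadata JSON from disk, transform it, and save it under a new path for the Advanced node's `target_metadata_path`. Note that in `melody` mode the `f0` field drives pitch, so a score-level transpose like this is only meaningful in `score` mode.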
Use Case:
- Run Simple mode to get auto-generated metadata
- Copy metadata files from temp folder (shown in console logs)
- Edit in MIDI Editor
- Use Advanced node with edited target metadata
Why no target_audio input? The target is defined entirely by metadata - the node synthesizes new audio from scratch based on the metadata (lyrics, notes, timing). The prompt_audio provides the voice timbre reference.
| Parameter | Description | Recommended |
|---|---|---|
| model_name | Model variant | SoulX-Singer_model_bf16 (fast), SoulX-Singer_model_fp32 (best quality) |
| dtype | Model precision | bf16 (default, fast + quality), fp32 (best quality) |
| attention_type | Attention mechanism | sdpa (default), sageattention (requires package) |
| keep_loaded | Cache model | True for multiple runs |
| control_mode | Pitch control | melody for natural, score for MIDI |
| auto_pitch_shift | Auto pitch matching | True for different singers |
| n_steps | Quality vs speed | 32 (balanced), 64 (best) |
| cfg_scale | Prompt adherence | 3.0 (balanced) |
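For intuition on `pitch_shift`: each semitone scales frequency by a factor of 2^(1/12), and MIDI note n maps to 440 * 2^((n - 69)/12) Hz. A quick check of this standard math (the helper names are illustrative):

```python
def midi_to_hz(note: int) -> float:
    """Standard MIDI-to-frequency mapping: A4 = MIDI 69 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def shift_hz(f0: float, semitones: int) -> float:
    """Apply a pitch shift to a raw F0 value; each semitone is a 2**(1/12) ratio."""
    return f0 * 2.0 ** (semitones / 12.0)

print(round(midi_to_hz(69), 2))       # 440.0
print(round(midi_to_hz(60), 2))       # 261.63 (middle C)
print(round(shift_hz(440.0, 12), 2))  # 880.0 (one octave up)
```

This is why the -12 to +12 range corresponds to one octave down or up.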
🛠️ Click to expand troubleshooting guide
Manually download from drbaph/SoulX-Singer:
```
pip install -U huggingface_hub
huggingface-cli download drbaph/SoulX-Singer --local-dir ComfyUI/models/SoulX-Singer
```

Install all dependencies:

```
cd ComfyUI/custom_nodes/ComfyUI-SoulX-Singer
pip install -r requirements.txt
```

Common missing packages:
- `wandb` - for preprocessing logging
- `pretty_midi` - for MIDI handling
- `ml-collections` - for config management
- `loralib` - for LoRA model components
- `sageattention` - for optimized attention (optional, `pip install sageattention`)
To reduce memory usage:
- Use the `bf16` dtype instead of `fp32`
- Select `SoulX-Singer_model_bf16` instead of the fp32 model
- Set `keep_loaded=False`
- Reduce `n_steps`
- Close other applications

To speed up synthesis:
- Install SageAttention (`pip install sageattention`), then select the `sageattention` attention type
- Use a GPU with CUDA support
- Enable `keep_loaded=True`
- Use the `bf16` dtype
Check that all preprocessing models are downloaded to:
ComfyUI/models/SoulX-Singer/preprocessors/
Verify the directory structure matches the example above.
Make sure you have the sageattention package installed:
```
pip install sageattention
```

If you get errors with SageAttention, fall back to the `sdpa` attention type.
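That fallback can also be done defensively by probing for the package before selecting the attention type. A small sketch (the `pick_attention` helper is hypothetical, not the wrapper's actual code):

```python
import importlib.util

def pick_attention(requested: str = "sageattention") -> str:
    """Fall back to PyTorch SDPA when the sageattention package is missing."""
    if requested == "sageattention" and importlib.util.find_spec("sageattention") is None:
        return "sdpa"
    return requested

print(pick_attention("sdpa"))           # always available
print(pick_attention("sageattention"))  # "sdpa" unless sageattention is installed
```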
- Models & Preprocessors: drbaph/SoulX-Singer
- Online Demo: Soul-AILab/SoulX-Singer
- Paper: huggingface.co/papers/2602.07803
- arXiv Paper: arxiv.org/abs/2602.07803
- Official Repository: Soul-AILab/SoulX-Singer
- Demo Page: soul-ailab.github.io/soulx-singer
- MIDI Editor: Soul-AILab/SoulX-Singer-Midi-Editor
- ComfyUI Node: Saganaki22/ComfyUI-SoulX-Singer
If you use SoulX-Singer in your research, please cite:
```bibtex
@misc{soulxsinger,
      title={SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis},
      author={Jiale Qian and Hao Meng and Tian Zheng and Pengcheng Zhu and Haopeng Lin and Yuhang Dai and Hanke Xie and Wenxiao Cao and Ruixuan Shang and Jun Wu and Hongmei Liu and Hanlin Wen and Jian Zhao and Zhonglin Jiang and Yong Chen and Shunshun Yin and Ming Tao and Jianguo Wei and Lei Xie and Xinsheng Wang},
      year={2026},
      eprint={2602.07803},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2602.07803},
}
```

Apache 2.0 - See LICENSE for details.
SoulX-Singer is intended for academic research, educational purposes, and legitimate applications. Please use responsibly and ethically.
We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us.
