PyTorch Implementation of Audio-to-Audio Schrodinger Bridges

Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro

Overview

This repo contains the PyTorch implementation of A2SB: Audio-to-Audio Schrodinger Bridges. A2SB is an audio restoration model tailored for high-resolution music at 44.1 kHz. It supports both bandwidth extension (predicting high-frequency components) and inpainting (regenerating missing segments). Critically, A2SB is end-to-end: it predicts waveform outputs without a vocoder, and it can restore hour-long audio inputs. A2SB achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.

We propose A2SB, a state-of-the-art, end-to-end, vocoder-free, and multi-task diffusion Schrodinger Bridge model for 44.1kHz high-res music restoration, using an effective factorized audio representation.
A2SB is the first long-audio restoration model that can restore hour-long audio without boundary artifacts.

Usage

Data preparation

Prepare your data into a DATASET_NAME_manifest.csv file in the following format:

split,file_path,duration
train,PATH/TO/AUDIO.wav,10.0
...
validation,PATH/TO/AUDIO.wav,10.0
...
test,PATH/TO/AUDIO.wav,10.0
...

You can have multiple manifests, one per dataset, and you can use any audio format that SoundFile supports. Once all manifests are ready, list their paths and names in the config files under configs/.
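As a rough illustration of the manifest format above, the sketch below writes a manifest from lists of files grouped by split. It is not part of this repo: the helper names (wav_duration_seconds, write_manifest) are hypothetical, and the default duration reader uses Python's standard wave module, so it only handles PCM WAV files.

```python
import csv
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds (standard library only)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def write_manifest(files_by_split, manifest_path, get_duration=wav_duration_seconds):
    """Write split,file_path,duration rows in the column order shown above.

    files_by_split is a hypothetical dict such as {"train": [...], "test": [...]}.
    """
    with open(manifest_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["split", "file_path", "duration"])
        for split in ("train", "validation", "test"):
            for path in files_by_split.get(split, []):
                writer.writerow([split, str(path), round(get_duration(path), 3)])
```

For formats other than WAV, one option is to pass a different get_duration, for example a function that returns soundfile.info(path).duration.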

We train our models on the permissively licensed subsets of the following datasets: FMA, Medley-Solos-DB, MUSAN, Musical Instrument, MusicNet, Slakh, FreeSound, FSD50K, GTZAN, and NSynth.

Training

For pretraining, run

python main.py fit --config configs/pretrain.yaml

For T-finetuning, first copy the pretrained checkpoint into the T-finetune experiment folder to serve as initialization. T-finetuning then resumes from this checkpoint.

Here is an example of T-finetuning with 2 splits. The 2 models are trained separately. For the first split, run

python main.py fit --config configs/t_finetune_2split_0.0_0.5.yaml

For the second split, copy this config, set model.train_t_min to 0.5 and model.train_t_max to 1.0, set up a different experiment name and path, and run training in the same way.
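The per-split edit can be sketched in Python as follows. This is only an illustration under stated assumptions: the config is represented as a plain nested dict rather than the real YAML file, the helper names (split_intervals, with_overrides) are hypothetical, and the even partition of [0, 1] is a generalization of the 2-split example rather than something the repo prescribes. Only the keys model.train_t_min and model.train_t_max come from the instructions above.

```python
import copy


def split_intervals(n_splits):
    """Evenly partition [0, 1] into n_splits (t_min, t_max) pairs."""
    edges = [i / n_splits for i in range(n_splits + 1)]
    return list(zip(edges[:-1], edges[1:]))


def with_overrides(config, overrides):
    """Return a deep copy of a nested dict with dotted-key overrides applied."""
    new_config = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = new_config
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return new_config


# Stand-in for the first split's config; the real file has many more fields.
base = {"model": {"train_t_min": 0.0, "train_t_max": 0.5}}

t_min, t_max = split_intervals(2)[1]
second = with_overrides(base, {"model.train_t_min": t_min, "model.train_t_max": t_max})
print(second["model"])  # {'train_t_min': 0.5, 'train_t_max': 1.0}
```

The experiment name and path still need to be changed by hand, since their config keys are not stated here.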

Note: you may need to adjust the batch size, number of devices, number of nodes, and gradient accumulation in the configs to match your GPU setup.
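When changing these four settings together, it can help to keep the effective batch size constant. The sketch below shows the arithmetic, assuming data-parallel training in which every device processes its own batch at each step. The function names and all numbers are illustrative and are not taken from the repo's configs.

```python
def effective_batch_size(per_device_batch, devices, nodes, accumulate_grad_batches):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * devices * nodes * accumulate_grad_batches


def accumulation_to_match(target, per_device_batch, devices, nodes):
    """Gradient accumulation steps needed to reach a target effective batch size."""
    per_step = per_device_batch * devices * nodes
    if target % per_step != 0:
        raise ValueError("target is not a multiple of per_device_batch * devices * nodes")
    return target // per_step


# Illustrative numbers: a setup with 2 nodes of 8 GPUs, moved to a single 8-GPU node.
original = effective_batch_size(per_device_batch=4, devices=8, nodes=2, accumulate_grad_batches=1)
print(original)                                   # 64
print(accumulation_to_match(original, 4, 8, 1))   # 2
```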

Inference

To run inference on an entire dataset, use

cd inference/
python A2SB_upsample_dataset.py -dn DATASET_NAME -exp ensemble_2split_sampling -cf 4000
python A2SB_inpaint_dataset.py -dn DATASET_NAME -exp ensemble_2split_sampling -inp_len 0.3 -inp_every 5.0

To run the simple bandwidth extension API on arbitrarily long audio, with automatic rolloff frequency detection, use

cd inference/
python A2SB_upsample_api.py -f DEGRADED.wav -o RESTORED.wav -n N_STEPS
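To apply the bandwidth extension API to a whole folder of files, one option is a small wrapper that issues the command above once per file. The sketch below is not part of this repo: the function names are hypothetical, it relies only on the -f, -o, and -n flags shown above, and it has to be run from inside inference/ so that the script path resolves.

```python
import subprocess
import sys
from pathlib import Path


def build_upsample_command(degraded, restored, n_steps):
    """Build the A2SB_upsample_api.py command line for one file."""
    return [
        sys.executable, "A2SB_upsample_api.py",
        "-f", str(degraded),
        "-o", str(restored),
        "-n", str(n_steps),
    ]


def restore_folder(in_dir, out_dir, n_steps, run=subprocess.run):
    """Run the API once per .wav file in in_dir, writing results to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for wav in sorted(Path(in_dir).glob("*.wav")):
        cmd = build_upsample_command(wav, out_dir / wav.name, n_steps)
        run(cmd, check=True)  # raises CalledProcessError if the script fails
        commands.append(cmd)
    return commands
```

The run argument exists so that the command construction can be checked without launching the model, for instance by passing a function that only records its arguments.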

Requirements

numpy, scipy, matplotlib, jsonargparse, librosa, soundfile, torch, torchaudio, einops, pytorch_lightning, rotary_embedding_torch, ssr_eval

Citation

@article{kong2025a2sb,
  title={A2SB: Audio-to-Audio Schrodinger Bridges},
  author={Kong, Zhifeng and Shih, Kevin J and Nie, Weili and Vahdat, Arash and Lee, Sang-gil and Santos, Joao Felipe and Jukic, Ante and Valle, Rafael and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2501.11311},
  year={2025}
}

License/Terms of Use:

The model is provided under the NVIDIA OneWay NonCommercial License.

The code is under NVIDIA Source Code License - Non Commercial. Some components are adapted from other sources. The training code is adapted from I2SB under the NVIDIA Source Code License - Non Commercial. The model architecture is adapted from Improved Diffusion under the MIT License.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.

NVIDIA/diffusion-audio-restoration
