Replies: 1 comment
Check out the C++ version that just came out recently. It will run on the Jetsons.
Hi,
I've tried porting this code to NVIDIA Jetson machines, without success so far 😬.
Jetson embedded systems have unified memory, and they're almost identical to the NVIDIA DGX, which is based on the Jetson design.
The Thor series (64–128 GB RAM) uses CUDA 13.0, and the Orin series (32–64 GB RAM) is still on CUDA 12.6.
On these systems you can't easily upgrade the CUDA version, because the CUDA release is tightly coupled to the hardware and OS; you can only upgrade cleanly when a new JetPack release ships.
Note that nvidia-smi is not well supported on Jetson: on most systems its report is empty, even when LLM inference is working perfectly fine.
What changes are required to port the code?
Thanks in advance to everyone!
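Since nvidia-smi is often empty on Jetson, a more reliable check is to read the device-tree model string and ask the CUDA runtime directly. Here's a minimal sketch; the `/proc/device-tree/model` path is standard on Jetson L4T images, and the CUDA query assumes PyTorch is installed (swap in your framework of choice if not):

```python
import pathlib

def is_jetson() -> bool:
    """Detect a Jetson board via the device-tree model string
    (present on Jetson/L4T systems; absent on ordinary x86 hosts)."""
    model = pathlib.Path("/proc/device-tree/model")
    if model.exists():
        return "NVIDIA" in model.read_text(errors="ignore")
    return False

def cuda_runtime_version():
    """Report the CUDA runtime version through PyTorch, if available.
    Returns None when PyTorch is missing or no GPU is visible."""
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_available():
        return torch.version.cuda  # e.g. "12.6" on Orin, "13.0" on Thor
    return None

if __name__ == "__main__":
    print("Jetson board:", is_jetson())
    print("CUDA runtime:", cuda_runtime_version())
```

This avoids nvidia-smi entirely, so it behaves the same on Orin and Thor regardless of JetPack version.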
`python scripts/check_gpu.py`
But if I run: `python profile_inference.py --mode tier-test`