Help me fix please: CUDA error: no kernel image is available for execution on the device

I'm trying to install llama.cpp (via llama-cpp-python) to use with the LLM nodes in ComfyUI.
Does it not support the RTX 2060 GPU?

Using:
Python 3.12
CUDA Toolkit 12.8
Not sure why CMD shows CUDA 13.0, though.

<img width="855" height="654" alt="Image" src="https://github.com/user-attachments/assets/1d183875-0bb5-4e88-9ff4-e1b478443746" />

> ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2060) - 5095 MiB free
llama_model_loader: loaded meta data with 50 key-value pairs and 363 tensors from C:\SillyTavern\Kobold Models\LorablatedStock-12B.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = LorablatedStock 12B
llama_model_loader: - kv   3:                           general.basename str              = LorablatedStock
llama_model_loader: - kv   4:                         general.size_label str              = 12B
llama_model_loader: - kv   5:                   general.base_model.count u32              = 3
llama_model_loader: - kv   6:                  general.base_model.0.name str              = HMS Fusion 12B Lorablated
llama_model_loader: - kv   7:          general.base_model.0.organization str              = Yamatazen
llama_model_loader: - kv   8:              general.base_model.0.repo_url str              = https://huggingface.co/yamatazen/HMS-...
llama_model_loader: - kv   9:                  general.base_model.1.name str              = ForgottenMaid 12B Lorablated
llama_model_loader: - kv  10:          general.base_model.1.organization str              = Yamatazen
llama_model_loader: - kv  11:              general.base_model.1.repo_url str              = https://huggingface.co/yamatazen/Forg...
llama_model_loader: - kv  12:                  general.base_model.2.name str              = FusionEngine 12B Lorablated
llama_model_loader: - kv  13:          general.base_model.2.organization str              = Yamatazen
llama_model_loader: - kv  14:              general.base_model.2.repo_url str              = https://huggingface.co/yamatazen/Fusi...
llama_model_loader: - kv  15:                               general.tags arr[str,3]       = ["mergekit", "merge", "lorablated"]
llama_model_loader: - kv  16:                          general.languages arr[str,2]       = ["en", "ja"]
llama_model_loader: - kv  17:                          llama.block_count u32              = 40
llama_model_loader: - kv  18:                       llama.context_length u32              = 131072
llama_model_loader: - kv  19:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  20:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  21:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  22:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  23:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  24:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  25:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  26:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  27:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  28:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  30:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  31:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Exception ignored on calling ctypes callback function: <function llama_log_callback at 0x000001A8DEF7F9C0>
Traceback (most recent call last):
  File "C:\ComfyUI\ComfyUI\venv\Lib\site-packages\llama_cpp\_logger.py", line 39, in llama_log_callback
    print(text.decode("utf-8"), end="", flush=True, file=sys.stderr)
          ^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 128: invalid continuation byte
llama_model_loader: - kv  34:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  36:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  37:            tokenizer.ggml.padding_token_id u32              = 10
llama_model_loader: - kv  38:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  39:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  40:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 15
llama_model_loader: - kv  43:                                general.url str              = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv  44:              mradermacher.quantize_version str              = 2
llama_model_loader: - kv  45:                  mradermacher.quantized_by str              = mradermacher
llama_model_loader: - kv  46:                  mradermacher.quantized_at str              = 2025-06-06T15:43:16+02:00
llama_model_loader: - kv  47:                  mradermacher.quantized_on str              = back
llama_model_loader: - kv  48:                         general.source.url str              = https://huggingface.co/yamatazen/Lora...
llama_model_loader: - kv  49:                  mradermacher.convert_type str              = hf
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  241 tensors
llama_model_loader: - type q6_K:   41 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 6.96 GiB (4.88 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token:    992 '<SPECIAL_992>' is not marked as EOG
load: control token:    993 '<SPECIAL_993>' is not marked as EOG
load: control token:    997 '<SPECIAL_997>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 13B
print_info: model params     = 12.25 B
print_info: general.name     = LorablatedStock 12B
print_info: vocab type       = BPE
print_info: n_vocab          = 131072
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 10 '<pad>'
print_info: LF token         = 1010 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 0
load_tensors: layer  31 assigned to device CUDA0, is_swa = 0
load_tensors: layer  32 assigned to device CUDA0, is_swa = 0
load_tensors: layer  33 assigned to device CUDA0, is_swa = 0
load_tensors: layer  34 assigned to device CUDA0, is_swa = 0
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 0
load_tensors: layer  37 assigned to device CUDA0, is_swa = 0
load_tensors: layer  38 assigned to device CUDA0, is_swa = 0
load_tensors: layer  39 assigned to device CUDA0, is_swa = 0
load_tensors: layer  40 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 128 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloaded 26/41 layers to GPU
load_tensors:        CUDA0 model buffer size =  4035.55 MiB
load_tensors:   CPU_Mapped model buffer size =  3087.75 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.50 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1, padding = 32
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CUDA0
llama_kv_cache_unified: layer  15: dev = CUDA0
llama_kv_cache_unified: layer  16: dev = CUDA0
llama_kv_cache_unified: layer  17: dev = CUDA0
llama_kv_cache_unified: layer  18: dev = CUDA0
llama_kv_cache_unified: layer  19: dev = CUDA0
llama_kv_cache_unified: layer  20: dev = CUDA0
llama_kv_cache_unified: layer  21: dev = CUDA0
llama_kv_cache_unified: layer  22: dev = CUDA0
llama_kv_cache_unified: layer  23: dev = CUDA0
llama_kv_cache_unified: layer  24: dev = CUDA0
llama_kv_cache_unified: layer  25: dev = CUDA0
llama_kv_cache_unified: layer  26: dev = CUDA0
llama_kv_cache_unified: layer  27: dev = CUDA0
llama_kv_cache_unified: layer  28: dev = CUDA0
llama_kv_cache_unified: layer  29: dev = CUDA0
llama_kv_cache_unified: layer  30: dev = CUDA0
llama_kv_cache_unified: layer  31: dev = CUDA0
llama_kv_cache_unified: layer  32: dev = CUDA0
llama_kv_cache_unified: layer  33: dev = CUDA0
llama_kv_cache_unified: layer  34: dev = CUDA0
llama_kv_cache_unified: layer  35: dev = CUDA0
llama_kv_cache_unified: layer  36: dev = CUDA0
llama_kv_cache_unified: layer  37: dev = CUDA0
llama_kv_cache_unified: layer  38: dev = CUDA0
llama_kv_cache_unified: layer  39: dev = CUDA0
llama_kv_cache_unified:      CUDA0 KV buffer size =   416.00 MiB
llama_kv_cache_unified:        CPU KV buffer size =   224.00 MiB
llama_kv_cache_unified: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   791.00 MiB
llama_context:  CUDA_Host compute buffer size =    18.01 MiB
llama_context: graph nodes  = 1366
llama_context: graph splits = 158 (with bs=512), 3 (with bs=1)
CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Model metadata: {'mradermacher.convert_type': 'hf', 'general.name': 'LorablatedStock 12B', 'general.architecture': 'llama', 'general.base_model.0.name': 'HMS Fusion 12B Lorablated', 'general.type': 'model', 'general.basename': 'LorablatedStock', 'general.base_model.0.organization': 'Yamatazen', 'mradermacher.quantized_on': 'back', 'general.size_label': '12B', 'general.base_model.2.name': 'FusionEngine 12B Lorablated', 'general.base_model.count': '3', 'llama.attention.value_length': '128', 'general.base_model.0.repo_url': 'https://huggingface.co/yamatazen/HMS-Fusion-12B-Lorablated', 'general.base_model.1.name': 'ForgottenMaid 12B Lorablated', 'general.base_model.1.organization': 'Yamatazen', 'general.base_model.1.repo_url': 'https://huggingface.co/yamatazen/ForgottenMaid-12B-Lorablated', 'general.base_model.2.organization': 'Yamatazen', 'tokenizer.ggml.add_space_prefix': 'false', 'tokenizer.ggml.pre': 'tekken', 'general.base_model.2.repo_url': 'https://huggingface.co/yamatazen/FusionEngine-12B-Lorablated', 'llama.block_count': '40', 'llama.context_length': '131072', 'llama.embedding_length': '5120', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'general.file_type': '15', 'tokenizer.ggml.eos_token_id': '2', 'llama.attention.head_count_kv': '8', 'llama.rope.freq_base': '1000000.000000', 'mradermacher.quantize_version': '2', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.attention.key_length': '128', 'llama.vocab_size': '131072', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.model': 'gpt2', 'general.source.url': 'https://huggingface.co/yamatazen/LorablatedStock-12B', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'mradermacher.quantized_by': 'mradermacher', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '10', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'general.url': 'https://huggingface.co/mradermacher/LorablatedStock-12B-GGUF', 
'mradermacher.quantized_at': '2025-06-06T15:43:16+02:00'}
Using fallback chat format: llama-2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_compute_forward at C:\Users\bpfit\Downloads\llama_cpp_python-0.3.9\llama_cpp_python-0.3.9\vendor\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2353
  err
C:\Users\bpfit\Downloads\llama_cpp_python-0.3.9\llama_cpp_python-0.3.9\vendor\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error

The paths in the error start with C:\Users\bpfit\Downloads\llama_cpp_python-0.3.9..... The username on my PC is not bpfit, so I'm not sure why it's showing that.
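
One thing that stands out in the log: the device line reports compute capability 7.5, while the build line reports `CUDA : ARCHS = 860,890`. If `ARCHS` lists the GPU architectures the binary was compiled for (that is my reading, not something the log states outright), then this build contains no kernels for a 7.5 card. That would match the "no kernel image" error, and the `bpfit` path would fit a binary built on someone else's machine. Below is a minimal sketch of that comparison. The helper names are made up for illustration, and the compatibility rule is a simplification that assumes code built for an architecture can only run on that architecture or a newer one.

```python
import re

# Two lines copied from the log above.
LOG = """\
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes
CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 |
"""


def compiled_archs(log_text):
    """Return the values on the 'ARCHS = ...' line as ints (hypothetical helper)."""
    match = re.search(r"ARCHS\s*=\s*([\d,]+)", log_text)
    if match is None:
        return []
    return [int(a) for a in match.group(1).split(",") if a]


def device_arch(log_text):
    """Return the reported compute capability in the same form, e.g. 7.5 -> 750."""
    match = re.search(r"compute capability (\d+)\.(\d+)", log_text)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    return major * 100 + minor * 10


def has_usable_kernel(archs, device):
    """Simplified rule (assumption): a build for arch A is usable only if A <= device."""
    return device is not None and any(a <= device for a in archs)


archs = compiled_archs(LOG)
device = device_arch(LOG)
print("compiled for:", archs)
print("device:", device)
print("usable kernel present:", has_usable_kernel(archs, device))
```

If the mismatch really is the cause, rebuilding llama-cpp-python from source with the card's architecture included might be worth trying, for example by setting `CMAKE_ARGS` to `-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=75` before running `pip install`. That is a guess from the log and not something confirmed here.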
