
Please help me fix: CUDA error: no kernel image is available for execution on the device #6

@overallbit

Description


Trying to install llama-cpp-python to use with the LLM nodes in ComfyUI.
Does it not support an RTX 2060 GPU?

Using:
Python 3.12
CUDA Toolkit 12.8
Not sure why it shows CUDA 13.0 in CMD...

[screenshot of the console output]

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2060) - 5095 MiB free
llama_model_loader: loaded meta data with 50 key-value pairs and 363 tensors from C:\SillyTavern\Kobold Models\LorablatedStock-12B.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = LorablatedStock 12B
llama_model_loader: - kv 3: general.basename str = LorablatedStock
llama_model_loader: - kv 4: general.size_label str = 12B
llama_model_loader: - kv 5: general.base_model.count u32 = 3
llama_model_loader: - kv 6: general.base_model.0.name str = HMS Fusion 12B Lorablated
llama_model_loader: - kv 7: general.base_model.0.organization str = Yamatazen
llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/yamatazen/HMS-...
llama_model_loader: - kv 9: general.base_model.1.name str = ForgottenMaid 12B Lorablated
llama_model_loader: - kv 10: general.base_model.1.organization str = Yamatazen
llama_model_loader: - kv 11: general.base_model.1.repo_url str = https://huggingface.co/yamatazen/Forg...
llama_model_loader: - kv 12: general.base_model.2.name str = FusionEngine 12B Lorablated
llama_model_loader: - kv 13: general.base_model.2.organization str = Yamatazen
llama_model_loader: - kv 14: general.base_model.2.repo_url str = https://huggingface.co/yamatazen/Fusi...
llama_model_loader: - kv 15: general.tags arr[str,3] = ["mergekit", "merge", "lorablated"]
llama_model_loader: - kv 16: general.languages arr[str,2] = ["en", "ja"]
llama_model_loader: - kv 17: llama.block_count u32 = 40
llama_model_loader: - kv 18: llama.context_length u32 = 131072
llama_model_loader: - kv 19: llama.embedding_length u32 = 5120
llama_model_loader: - kv 20: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 21: llama.attention.head_count u32 = 32
llama_model_loader: - kv 22: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 23: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 24: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 25: llama.attention.key_length u32 = 128
llama_model_loader: - kv 26: llama.attention.value_length u32 = 128
llama_model_loader: - kv 27: llama.vocab_size u32 = 131072
llama_model_loader: - kv 28: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = tekken
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,131072] = ["", "", "", "[INST]", "[...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Exception ignored on calling ctypes callback function: <function llama_log_callback at 0x000001A8DEF7F9C0>
Traceback (most recent call last):
File "C:\ComfyUI\ComfyUI\venv\Lib\site-packages\llama_cpp_logger.py", line 39, in llama_log_callback
print(text.decode("utf-8"), end="", flush=True, file=sys.stderr)
^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 128: invalid continuation byte
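This UnicodeDecodeError is a separate, cosmetic problem: the logger in `llama_cpp_logger.py` assumes every log chunk it receives is complete, valid UTF-8, but llama.cpp can hand the callback a byte buffer that cuts a multi-byte sequence in half (here the lone lead byte `0xc4`). A minimal sketch of a tolerant callback — the name and print call mirror the traceback above, but this is an illustrative workaround, not the upstream fix:

```python
import sys

def llama_log_callback(level, text, user_data):
    # text arrives as raw bytes from llama.cpp; errors="replace" substitutes
    # U+FFFD for any broken UTF-8 sequence instead of raising mid-log.
    print(text.decode("utf-8", errors="replace"), end="", flush=True, file=sys.stderr)
```

With `errors="replace"` the logging keeps going, and the model-loader lines (such as `kv 33`, swallowed above) would no longer be lost to the exception.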
llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 36: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 10
llama_model_loader: - kv 38: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 39: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 40: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 41: general.quantization_version u32 = 2
llama_model_loader: - kv 42: general.file_type u32 = 15
llama_model_loader: - kv 43: general.url str = https://huggingface.co/mradermacher/L...
llama_model_loader: - kv 44: mradermacher.quantize_version str = 2
llama_model_loader: - kv 45: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 46: mradermacher.quantized_at str = 2025-06-06T15:43:16+02:00
llama_model_loader: - kv 47: mradermacher.quantized_on str = back
llama_model_loader: - kv 48: general.source.url str = https://huggingface.co/yamatazen/Lora...
llama_model_loader: - kv 49: mradermacher.convert_type str = hf
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_K: 241 tensors
llama_model_loader: - type q6_K: 41 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 6.96 GiB (4.88 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 992 '<SPECIAL_992>' is not marked as EOG
load: control token: 993 '<SPECIAL_993>' is not marked as EOG
load: control token: 997 '<SPECIAL_997>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 13B
print_info: model params = 12.25 B
print_info: general.name = LorablatedStock 12B
print_info: vocab type = BPE
print_info: n_vocab = 131072
print_info: n_merges = 269443
print_info: BOS token = 1 ''
print_info: EOS token = 2 ''
print_info: UNK token = 0 ''
print_info: PAD token = 10 ''
print_info: LF token = 1010 'Ċ'
print_info: EOG token = 2 ''
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CPU, is_swa = 0
load_tensors: layer 1 assigned to device CPU, is_swa = 0
load_tensors: layer 2 assigned to device CPU, is_swa = 0
load_tensors: layer 3 assigned to device CPU, is_swa = 0
load_tensors: layer 4 assigned to device CPU, is_swa = 0
load_tensors: layer 5 assigned to device CPU, is_swa = 0
load_tensors: layer 6 assigned to device CPU, is_swa = 0
load_tensors: layer 7 assigned to device CPU, is_swa = 0
load_tensors: layer 8 assigned to device CPU, is_swa = 0
load_tensors: layer 9 assigned to device CPU, is_swa = 0
load_tensors: layer 10 assigned to device CPU, is_swa = 0
load_tensors: layer 11 assigned to device CPU, is_swa = 0
load_tensors: layer 12 assigned to device CPU, is_swa = 0
load_tensors: layer 13 assigned to device CPU, is_swa = 0
load_tensors: layer 14 assigned to device CUDA0, is_swa = 0
load_tensors: layer 15 assigned to device CUDA0, is_swa = 0
load_tensors: layer 16 assigned to device CUDA0, is_swa = 0
load_tensors: layer 17 assigned to device CUDA0, is_swa = 0
load_tensors: layer 18 assigned to device CUDA0, is_swa = 0
load_tensors: layer 19 assigned to device CUDA0, is_swa = 0
load_tensors: layer 20 assigned to device CUDA0, is_swa = 0
load_tensors: layer 21 assigned to device CUDA0, is_swa = 0
load_tensors: layer 22 assigned to device CUDA0, is_swa = 0
load_tensors: layer 23 assigned to device CUDA0, is_swa = 0
load_tensors: layer 24 assigned to device CUDA0, is_swa = 0
load_tensors: layer 25 assigned to device CUDA0, is_swa = 0
load_tensors: layer 26 assigned to device CUDA0, is_swa = 0
load_tensors: layer 27 assigned to device CUDA0, is_swa = 0
load_tensors: layer 28 assigned to device CUDA0, is_swa = 0
load_tensors: layer 29 assigned to device CUDA0, is_swa = 0
load_tensors: layer 30 assigned to device CUDA0, is_swa = 0
load_tensors: layer 31 assigned to device CUDA0, is_swa = 0
load_tensors: layer 32 assigned to device CUDA0, is_swa = 0
load_tensors: layer 33 assigned to device CUDA0, is_swa = 0
load_tensors: layer 34 assigned to device CUDA0, is_swa = 0
load_tensors: layer 35 assigned to device CUDA0, is_swa = 0
load_tensors: layer 36 assigned to device CUDA0, is_swa = 0
load_tensors: layer 37 assigned to device CUDA0, is_swa = 0
load_tensors: layer 38 assigned to device CUDA0, is_swa = 0
load_tensors: layer 39 assigned to device CUDA0, is_swa = 0
load_tensors: layer 40 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 128 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloaded 26/41 layers to GPU
load_tensors: CUDA0 model buffer size = 4035.55 MiB
load_tensors: CPU_Mapped model buffer size = 3087.75 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 0.50 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1, padding = 32
llama_kv_cache_unified: layer 0: dev = CPU
llama_kv_cache_unified: layer 1: dev = CPU
llama_kv_cache_unified: layer 2: dev = CPU
llama_kv_cache_unified: layer 3: dev = CPU
llama_kv_cache_unified: layer 4: dev = CPU
llama_kv_cache_unified: layer 5: dev = CPU
llama_kv_cache_unified: layer 6: dev = CPU
llama_kv_cache_unified: layer 7: dev = CPU
llama_kv_cache_unified: layer 8: dev = CPU
llama_kv_cache_unified: layer 9: dev = CPU
llama_kv_cache_unified: layer 10: dev = CPU
llama_kv_cache_unified: layer 11: dev = CPU
llama_kv_cache_unified: layer 12: dev = CPU
llama_kv_cache_unified: layer 13: dev = CPU
llama_kv_cache_unified: layer 14: dev = CUDA0
llama_kv_cache_unified: layer 15: dev = CUDA0
llama_kv_cache_unified: layer 16: dev = CUDA0
llama_kv_cache_unified: layer 17: dev = CUDA0
llama_kv_cache_unified: layer 18: dev = CUDA0
llama_kv_cache_unified: layer 19: dev = CUDA0
llama_kv_cache_unified: layer 20: dev = CUDA0
llama_kv_cache_unified: layer 21: dev = CUDA0
llama_kv_cache_unified: layer 22: dev = CUDA0
llama_kv_cache_unified: layer 23: dev = CUDA0
llama_kv_cache_unified: layer 24: dev = CUDA0
llama_kv_cache_unified: layer 25: dev = CUDA0
llama_kv_cache_unified: layer 26: dev = CUDA0
llama_kv_cache_unified: layer 27: dev = CUDA0
llama_kv_cache_unified: layer 28: dev = CUDA0
llama_kv_cache_unified: layer 29: dev = CUDA0
llama_kv_cache_unified: layer 30: dev = CUDA0
llama_kv_cache_unified: layer 31: dev = CUDA0
llama_kv_cache_unified: layer 32: dev = CUDA0
llama_kv_cache_unified: layer 33: dev = CUDA0
llama_kv_cache_unified: layer 34: dev = CUDA0
llama_kv_cache_unified: layer 35: dev = CUDA0
llama_kv_cache_unified: layer 36: dev = CUDA0
llama_kv_cache_unified: layer 37: dev = CUDA0
llama_kv_cache_unified: layer 38: dev = CUDA0
llama_kv_cache_unified: layer 39: dev = CUDA0
llama_kv_cache_unified: CUDA0 KV buffer size = 416.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 224.00 MiB
llama_kv_cache_unified: KV self size = 640.00 MiB, K (f16): 320.00 MiB, V (f16): 320.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: CUDA0 compute buffer size = 791.00 MiB
llama_context: CUDA_Host compute buffer size = 18.01 MiB
llama_context: graph nodes = 1366
llama_context: graph splits = 158 (with bs=512), 3 (with bs=1)
CUDA : ARCHS = 860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Model metadata: {'mradermacher.convert_type': 'hf', 'general.name': 'LorablatedStock 12B', 'general.architecture': 'llama', 'general.base_model.0.name': 'HMS Fusion 12B Lorablated', 'general.type': 'model', 'general.basename': 'LorablatedStock', 'general.base_model.0.organization': 'Yamatazen', 'mradermacher.quantized_on': 'back', 'general.size_label': '12B', 'general.base_model.2.name': 'FusionEngine 12B Lorablated', 'general.base_model.count': '3', 'llama.attention.value_length': '128', 'general.base_model.0.repo_url': 'https://huggingface.co/yamatazen/HMS-Fusion-12B-Lorablated', 'general.base_model.1.name': 'ForgottenMaid 12B Lorablated', 'general.base_model.1.organization': 'Yamatazen', 'general.base_model.1.repo_url': 'https://huggingface.co/yamatazen/ForgottenMaid-12B-Lorablated', 'general.base_model.2.organization': 'Yamatazen', 'tokenizer.ggml.add_space_prefix': 'false', 'tokenizer.ggml.pre': 'tekken', 'general.base_model.2.repo_url': 'https://huggingface.co/yamatazen/FusionEngine-12B-Lorablated', 'llama.block_count': '40', 'llama.context_length': '131072', 'llama.embedding_length': '5120', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'general.file_type': '15', 'tokenizer.ggml.eos_token_id': '2', 'llama.attention.head_count_kv': '8', 'llama.rope.freq_base': '1000000.000000', 'mradermacher.quantize_version': '2', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.attention.key_length': '128', 'llama.vocab_size': '131072', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.model': 'gpt2', 'general.source.url': 'https://huggingface.co/yamatazen/LorablatedStock-12B', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'mradermacher.quantized_by': 'mradermacher', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '10', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'general.url': 'https://huggingface.co/mradermacher/LorablatedStock-12B-GGUF', 
'mradermacher.quantized_at': '2025-06-06T15:43:16+02:00'}
Using fallback chat format: llama-2
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: no kernel image is available for execution on the device
current device: 0, in function ggml_cuda_compute_forward at C:\Users\bpfit\Downloads\llama_cpp_python-0.3.9\llama_cpp_python-0.3.9\vendor\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2353
err
C:\Users\bpfit\Downloads\llama_cpp_python-0.3.9\llama_cpp_python-0.3.9\vendor\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error

C:\Users\bpfit\Downloads\llama_cpp_python-0.3.9..... The username on my PC is not bpfit; not sure why it's showing that.
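The crash itself appears to be explained by the `CUDA : ARCHS = 860,890` line near the end of the log: the prebuilt wheel only contains kernels compiled for compute capability 8.6/8.9 (Ampere/Ada), while the RTX 2060 is compute capability 7.5 (as the log's own device line shows), so no kernel image matches the device. The `bpfit` path is likely the build machine's path, baked into the binary at compile time. One way to get a matching binary is to rebuild llama-cpp-python from source targeting 7.5 — a sketch, assuming current llama.cpp CMake option names, which may differ by version:

```shell
# Rebuild llama-cpp-python with CUDA kernels for compute capability 7.5 (RTX 2060).
# On Windows CMD, use `set CMAKE_ARGS=...` on its own line instead of the
# inline environment assignment shown here.
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=75" \
    pip install --no-cache-dir --force-reinstall llama-cpp-python
```

After a successful rebuild, the `CUDA : ARCHS` line in the startup log should list `750`, and the RMS_NORM kernel launch should no longer fail.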
