[feat](moe_kernel): add amd blis support #1600
base: main
Conversation
Summary of Changes

Hello @KMSorSMS, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the project's hardware support by integrating AMD BLIS for optimized Mixture of Experts (MoE) kernel operations. It introduces new INT4 and INT8 quantized MoE kernel implementations and refines the build system to provide better configurability and a more robust developer experience, particularly for CUDA-enabled builds.
Code Review
This pull request introduces support for AMD BLIS within the MoE kernel, adding a new CMake preset for AMD platforms, a Python wrapper for the generic MoE kernel, and several enhancements to the build system. My review has identified a few critical issues and areas for improvement. There's a misconfiguration in the new 'amd' CMake preset that could cause runtime errors. A significant bug was found in experts.py where incorrect classes are being used for the new MoE methods. The new moe_kernel.py file contains several copy-paste errors from the AMX implementation, is missing an implementation for a required abstract method, and has some code quality issues. On the other hand, the build system improvements in setup.py, such as automatic nvcc detection and build directory cleaning, are valuable additions.
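The setup.py improvements called out above (automatic nvcc detection and build-directory cleaning) are easy to picture with a short sketch. The helper names, the environment variables checked, and the `build/` path below are illustrative assumptions, not the PR's actual implementation:

```python
import os
import shutil

def find_nvcc():
    """Locate nvcc on PATH or under CUDA_HOME/CUDA_PATH; return None if CUDA is unavailable."""
    nvcc = shutil.which("nvcc")
    if nvcc:
        return nvcc
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate
    return None

def clean_build_dir(path="build"):
    """Remove a stale build tree so CMake reconfigures from scratch."""
    if os.path.isdir(path):
        shutil.rmtree(path)
```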
kt-kernel/python/experts.py
Outdated
```python
from .utils.amx import AMXMoEWrapper
from .utils.llamafile import LlamafileMoEWrapper

from .utils.moe_kernel import Int8_KERNEL_MOE, Int4_KERNEL_MOE
```
This import is incorrect. It should import GeneralMoEWrapper, which is the Python wrapper for the new MoE kernels. The Int8_KERNEL_MOE and Int4_KERNEL_MOE are C++ extension classes and should be used within the wrapper, not directly in the factory.
```diff
-from .utils.moe_kernel import Int8_KERNEL_MOE, Int4_KERNEL_MOE
+from .utils.moe_kernel import GeneralMoEWrapper
```
kt-kernel/python/experts.py
Outdated
```python
elif method == "MOE_INT8":
    backend_cls = Int8_KERNEL_MOE
elif method == "MOE_INT4":
    backend_cls = Int4_KERNEL_MOE
```
The backend_cls for MOE_INT8 and MOE_INT4 methods should be GeneralMoEWrapper. The current implementation incorrectly assigns the C++ extension classes Int8_KERNEL_MOE and Int4_KERNEL_MOE directly. These C++ classes have a different constructor signature (they expect a MOEConfig object) and do not inherit from BaseMoEWrapper, which will cause a runtime error when backend_cls is instantiated.
```diff
-elif method == "MOE_INT8":
-    backend_cls = Int8_KERNEL_MOE
-elif method == "MOE_INT4":
-    backend_cls = Int4_KERNEL_MOE
+elif method in ["MOE_INT8", "MOE_INT4"]:
+    backend_cls = GeneralMoEWrapper
```
```python
from typing import Optional

class GeneralMoEWrapper(BaseMoEWrapper):
```
The class GeneralMoEWrapper inherits from BaseMoEWrapper, which has an abstract method load_weights_from_tensors. This method is not implemented in GeneralMoEWrapper, which will cause a TypeError when an instance of this class is created. You should implement this method. You can likely adapt the implementation from AMXMoEWrapper.
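To see why instantiation fails, assuming BaseMoEWrapper declares the method with abc.abstractmethod (which the comment above implies), here is a minimal standalone reproduction of the TypeError:

```python
from abc import ABC, abstractmethod

class Base(ABC):
    @abstractmethod
    def load_weights_from_tensors(self, tensors):
        ...

class Wrapper(Base):
    pass  # abstract method not overridden

Wrapper()  # raises TypeError: Can't instantiate abstract class Wrapper ...
```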
| "KTRANSFORMERS_CPU_USE_AMX": "OFF", | ||
| "LLAMA_AVX512": "OFF", | ||
| "LLAMA_AVX2": "ON", | ||
| "KTRANSFORMERS_CPU_USE_AMX_AVX512": "ON", |
The KTRANSFORMERS_CPU_USE_AMX_AVX512 flag is set to "ON" in the amd preset. This flag enables AVX-512 compilation flags, which may not be supported on all AMD CPUs targeted by this preset (which enables AVX2). This could lead to "illegal instruction" errors at runtime. This flag should be set to "OFF" to align with the avx preset and avoid requiring AVX-512.
| "KTRANSFORMERS_CPU_USE_AMX_AVX512": "ON", | |
| "KTRANSFORMERS_CPU_USE_AMX_AVX512": "OFF", |
```python
if not _HAS_INT4_SUPPORT and method == "MOE_INT4":
    raise RuntimeError(
        "AMX backend not available. kt_kernel_ext was not compiled with AMX support.\n"
        "Please recompile with AMX enabled."
    )
if not _HAS_INT8_SUPPORT and method == "MOE_INT8":
    raise RuntimeError(
        "AMX backend not available. kt_kernel_ext was not compiled with AMX support.\n"
        "Please recompile with AMX enabled."
    )
```
The error messages are misleading as they refer to the "AMX backend". These should be updated to refer to the generic MoE kernel backend, as this wrapper is not specific to AMX.
```diff
 if not _HAS_INT4_SUPPORT and method == "MOE_INT4":
     raise RuntimeError(
-        "AMX backend not available. kt_kernel_ext was not compiled with AMX support.\n"
-        "Please recompile with AMX enabled."
+        "INT4 MoE kernel backend not available. kt_kernel_ext was not compiled with MoE kernel support.\n"
+        "Please recompile with KTRANSFORMERS_CPU_MOE_KERNEL=ON."
     )
 if not _HAS_INT8_SUPPORT and method == "MOE_INT8":
     raise RuntimeError(
-        "AMX backend not available. kt_kernel_ext was not compiled with AMX support.\n"
-        "Please recompile with AMX enabled."
+        "INT8 MoE kernel backend not available. kt_kernel_ext was not compiled with MoE kernel support.\n"
+        "Please recompile with KTRANSFORMERS_CPU_MOE_KERNEL=ON."
     )
```
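As background, feature flags such as _HAS_INT4_SUPPORT and _HAS_INT8_SUPPORT are typically derived from a guarded import of the compiled extension. The sketch below shows one common pattern; the import path and class locations are hypothetical and may not match the PR's actual code:

```python
# Hypothetical sketch: probe the compiled extension for the MoE kernel classes.
try:
    from kt_kernel_ext import Int4_KERNEL_MOE  # only present when the MoE kernel was compiled in
    _HAS_INT4_SUPPORT = True
except ImportError:
    _HAS_INT4_SUPPORT = False

try:
    from kt_kernel_ext import Int8_KERNEL_MOE
    _HAS_INT8_SUPPORT = True
except ImportError:
    _HAS_INT8_SUPPORT = False
```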
| """ | ||
| AMX-based MoE wrapper implementation. | ||
| Supports AMXINT4 and AMXINT8 quantization methods. | ||
| """ |
The class docstring appears to be copied from the AMX wrapper. It should be updated to reflect that this is a general MoE kernel wrapper, not specific to AMX.
| """ | |
| AMX-based MoE wrapper implementation. | |
| Supports AMXINT4 and AMXINT8 quantization methods. | |
| """ | |
| """ | |
| General MoE kernel wrapper implementation. | |
| Supports INT4 and INT8 quantization methods. | |
| """ |
| """ | ||
| Initialize AMX MoE Wrapper. | ||
| Args: | ||
| layer_idx: Layer index | ||
| num_experts: Total number of experts | ||
| num_experts_per_tok: Number of experts per token (top-k) | ||
| hidden_size: Hidden dimension size | ||
| moe_intermediate_size: MoE intermediate size | ||
| num_gpu_experts: Number of experts to run on GPU | ||
| cpuinfer_threads: Number of CPU inference threads | ||
| threadpool_count: Number of NUMA subpools | ||
| weight_path: Path to AMX weights (SafeTensor format) | ||
| chunked_prefill_size: Maximum prefill chunk size | ||
| cpu_save: Whether to save weights to CPU memory | ||
| max_deferred_experts_per_token: Number of experts per token to defer. Defaults to 0. | ||
| method: general quantization method ("MOE_INT4" or "MOE_INT8") | ||
| """ |
The docstring for __init__ seems to be copied from the AMX wrapper. It should be updated to remove AMX-specific references and describe the general MoE kernel wrapper.
| """ | |
| Initialize AMX MoE Wrapper. | |
| Args: | |
| layer_idx: Layer index | |
| num_experts: Total number of experts | |
| num_experts_per_tok: Number of experts per token (top-k) | |
| hidden_size: Hidden dimension size | |
| moe_intermediate_size: MoE intermediate size | |
| num_gpu_experts: Number of experts to run on GPU | |
| cpuinfer_threads: Number of CPU inference threads | |
| threadpool_count: Number of NUMA subpools | |
| weight_path: Path to AMX weights (SafeTensor format) | |
| chunked_prefill_size: Maximum prefill chunk size | |
| cpu_save: Whether to save weights to CPU memory | |
| max_deferred_experts_per_token: Number of experts per token to defer. Defaults to 0. | |
| method: general quantization method ("MOE_INT4" or "MOE_INT8") | |
| """ | |
| """ | |
| Initialize General MoE Wrapper. | |
| Args: | |
| layer_idx: Layer index | |
| num_experts: Total number of experts | |
| num_experts_per_tok: Number of experts per token (top-k) | |
| hidden_size: Hidden dimension size | |
| moe_intermediate_size: MoE intermediate size | |
| num_gpu_experts: Number of experts to run on GPU | |
| cpuinfer_threads: Number of CPU inference threads | |
| threadpool_count: Number of NUMA subpools | |
| weight_path: Path to weights (SafeTensor format) | |
| chunked_prefill_size: Maximum prefill chunk size | |
| cpu_save: Whether to save weights to CPU memory | |
| max_deferred_experts_per_token: Number of experts per token to defer. Defaults to 0. | |
| method: general quantization method ("MOE_INT4" or "MOE_INT8") | |
| """ |
```python
# AMX-specific: Check if we should load merged safetensor weights
self.load_merged_weight = False
import glob
```
```python
gate_ptrs = [
    [
        ctypes.addressof(ctypes.cast(et.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents)
        for et in numa_array
    ]
    for numa_array in self.gate_weights
]

up_ptrs = [
    [
        ctypes.addressof(ctypes.cast(et.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents)
        for et in numa_array
    ]
    for numa_array in self.up_weights
]

down_ptrs = [
    [
        ctypes.addressof(ctypes.cast(et.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents)
        for et in numa_array
    ]
    for numa_array in self.down_weights
]

gate_scale_ptrs = [
    [
        ctypes.addressof(ctypes.cast(et.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents)
        for et in numa_array
    ]
    for numa_array in self.gate_scales
]

up_scale_ptrs = [
    [
        ctypes.addressof(ctypes.cast(et.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents)
        for et in numa_array
    ]
    for numa_array in self.up_scales
]

down_scale_ptrs = [
    [
        ctypes.addressof(ctypes.cast(et.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents)
        for et in numa_array
    ]
    for numa_array in self.down_scales
]
```
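As a side note (not a suggestion from the review itself), the six comprehensions above repeat the same pointer-extraction pattern. Assuming the nested entries are NumPy arrays, as the excerpt suggests, a small helper could collapse the repetition; for a NumPy array the cast/addressof round trip should reduce to the raw arr.ctypes.data address:

```python
import ctypes

def _collect_data_ptrs(nested_arrays):
    """Collect raw data addresses for a [numa][expert] nested list of NumPy arrays."""
    return [
        [
            ctypes.addressof(ctypes.cast(arr.ctypes.data, ctypes.POINTER(ctypes.c_uint64)).contents)
            for arr in numa_array
        ]
        for numa_array in nested_arrays
    ]

# Usage inside the method shown above (hypothetical placement):
# gate_ptrs = _collect_data_ptrs(self.gate_weights)
# up_ptrs = _collect_data_ptrs(self.up_weights)
# down_ptrs = _collect_data_ptrs(self.down_weights)
# gate_scale_ptrs = _collect_data_ptrs(self.gate_scales)
# up_scale_ptrs = _collect_data_ptrs(self.up_scales)
# down_scale_ptrs = _collect_data_ptrs(self.down_scales)
```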
#1582 #1601