I am trying to implement a GEMM with FP16 activations and INT4 weights. I want to call the fpA_intB_gemm_fp16_int4 kernel located in FasterTransformer/src/fastertransformer/kernels/cutlass_kernels/fpA_intB_gemm, but all of the examples I can find are full model-inference pipelines. If I only want to run the GEMM kernel standalone (e.g., to benchmark or verify it), what should I do?
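To make the question concrete, here is roughly the standalone driver I have in mind, based only on my reading of fpA_intB_gemm.h. The runner name `CutlassFpAIntBGemmRunner`, its `gemm`/`getWorkspaceSize` signatures, and the need to pre-shuffle the INT4 weights are my assumptions from the header, not something I have working — please correct whatever I got wrong:

```cpp
// Hypothetical standalone driver -- all FasterTransformer names below are
// my reading of fpA_intB_gemm.h and may not match the real interface.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include "cutlass/numeric_types.h"
#include "src/fastertransformer/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm.h"

using namespace fastertransformer;

int main() {
    const int m = 16, n = 4096, k = 4096;

    // Device buffers: FP16 activations A [m, k], packed INT4 weights B [k, n],
    // per-output-channel FP16 dequant scales [n], FP16 output C [m, n].
    half*              d_A;
    cutlass::uint4b_t* d_B;
    half*              d_scales;
    half*              d_C;
    cudaMalloc(&d_A, m * k * sizeof(half));
    cudaMalloc(&d_B, (size_t)k * n / 2);   // two INT4 values per byte
    cudaMalloc(&d_scales, n * sizeof(half));
    cudaMalloc(&d_C, m * n * sizeof(half));
    // ... fill A and the scales, and upload B *after* running it through the
    // weight preprocessor (I believe the kernel expects an interleaved
    // layout, not plain row-major INT4) ...

    // Assumed entry point: the templated runner from fpA_intB_gemm.h,
    // instantiated for FP16 activations and 4-bit weights.
    CutlassFpAIntBGemmRunner<half, cutlass::uint4b_t> runner;

    char*  d_workspace;
    size_t ws_bytes = runner.getWorkspaceSize(m, n, k);
    cudaMalloc(&d_workspace, ws_bytes);

    runner.gemm(d_A, d_B, d_scales, d_C, m, n, k,
                d_workspace, ws_bytes, /*stream=*/0);
    cudaDeviceSynchronize();
    return 0;
}
```

My plan would be to compile this as a new target linked against the FasterTransformer kernel library and CUTLASS, and to preprocess the weights with whatever helper the repo provides (something like `preprocess_weights_for_mixed_gemm` in cutlass_preprocessors, if I am reading it right). Is this the intended way to exercise the kernel in isolation, or is there a simpler test harness already in the tree?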