Commit bccb8da

Merge branch 'master' of github.com:IST-DASLab/FP-Quant
2 parents: 8ed4fa6 + c1dc805

10 files changed: 80 additions & 12 deletions

README.md

Lines changed: 63 additions & 6 deletions
````diff
@@ -1,23 +1,29 @@
-# FP-quantization-harness
+# FP Format Quantization Harness
 
-Repository for the the development of a recipe for efficient and accurate weight + activation quantization for low-bit FP formats (FP4, NVFP4, MXFP**B**).
+This is a harness for efficient and accurate weight-and-activation quantization for low-bit FP/INT formats, with and without microscaling, including FP4, NVFP4, and MXFP. These formats are compatible with the NVIDIA Blackwell GPU architecture.
+
+The goal of the repository is to allow you to produce quantized models in these formats.
+Currently, the repository supports the standard microscaled MXFP4 format, together with standard methods such as RTN and GPTQ quantization for the weights. The main new approach supported (which we found to be particularly effective) is a variant of GPTQ, called GPTQ+Had, in which a block-wise Hadamard transform is applied to the weights and activations before quantization. Key to efficiency is that the Hadamard block size matches the microscaling format group size (16 or 32); in turn, this small Hadamard transform is automatically "fused" into our matmul kernels.
+
+The inference code to run models in the `MXFP` format (with speedups) can be found in the [QuTLASS](https://github.com/IST-DASLab/qutlass) repository.
 
 ### Repository structure
 ---
 
 The repository is structured as follows:
 
-* `model_quant.py` - the main script for quantization of the Llama models
+* `model_quant.py` - the main script for quantization of Llama/Qwen models
 * `src/` - source code with implementation of all necessary functionality \
   ```├── quantization``` - quantization functionality \
   ```├── transforms``` - transform functionality \
   ```├── utils``` - utility functions
 
 
+
 ### Usage
 ---
 
-Below is an example of the qat script usage:
+Below is an example of the model quantization script usage:
 
 ```shell
 MODEL=${MODEL:-"meta-llama/Llama-3.1-8B-Instruct"}
````
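The GPTQ+Had recipe described in the README changes above hinges on a small Hadamard transform whose block size equals the microscaling group size (16 or 32). As a minimal illustrative sketch (not the repository's fused-kernel implementation; `hadamard_matrix` and `blockwise_hadamard` are hypothetical helper names), the transform could look like:

```python
# Sketch of a block-wise Hadamard transform whose block size matches the
# microscaling group size (hypothetical helpers, not the repo's kernels).
import torch

def hadamard_matrix(n: int) -> torch.Tensor:
    """Sylvester construction of an orthonormal Hadamard matrix; n must be a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        # H_{2k} = [[H, H], [H, -H]]
        H = torch.cat([torch.cat([H, H], dim=1), torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5  # scale so that H @ H.T = I

def blockwise_hadamard(x: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Rotate each contiguous group of `group_size` entries along the last dim."""
    H = hadamard_matrix(group_size).to(x.dtype)
    *lead, d = x.shape
    assert d % group_size == 0
    return (x.reshape(*lead, d // group_size, group_size) @ H.T).reshape(*lead, d)

w = torch.randn(4, 32)
w_rot = blockwise_hadamard(w, group_size=16)
# The normalized Sylvester matrix is symmetric and orthogonal, so applying
# the transform twice recovers the original tensor exactly (up to fp error):
assert torch.allclose(blockwise_hadamard(w_rot, 16), w, atol=1e-5)
```

Because the rotation is orthogonal and acts only within a group, it spreads outliers across the group before quantization and can be undone (or folded into a matmul) at inference time.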
```diff
@@ -101,7 +107,7 @@ Above:
 * `--w_observer` - The observer to use for the weights (`mse` or `minmax`).
 * `--a_group_size` - The number of activations to quantize together.
 * `--parametrization` - Transform parameterization.
-* `--gptq` - Whether to use GPTQ quantization.
+* `--gptq` - Whether to use GPTQ quantization for the weights.
 * `--transform_class` - Transform class (`identity` or `hadamard`).
 * `--dataset_name_or_path` - Dataset to use.
 * `--sequence_length` - Sequence length.
```
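The `--w_observer` choice between `mse` and `minmax` in the hunk above can be illustrated with a toy symmetric integer quantizer (a hypothetical sketch; the repository's FP-format observers differ in detail). `minmax` derives the scale from the largest magnitude, while `mse` additionally searches a grid of shrunken (clipped) scales and keeps the one with the lowest squared reconstruction error:

```python
# Toy comparison of `minmax` vs `mse` scale observers (illustrative only;
# the repo quantizes to FP formats, here we use a symmetric integer grid).
import torch

def quantize(x, scale, levels=7):
    # round-to-nearest onto a symmetric grid [-levels, levels] * scale
    return (x / scale).round().clamp(-levels, levels) * scale

def minmax_scale(x, levels=7):
    return x.abs().max() / levels

def mse_scale(x, levels=7, steps=50):
    # Grid search over progressively clipped scales; i = 0 is the minmax scale,
    # so the result is never worse than minmax on this grid.
    best_scale, best_err = minmax_scale(x, levels), float("inf")
    for i in range(steps):
        scale = minmax_scale(x, levels) * (1 - 0.8 * i / steps)
        err = (x - quantize(x, scale, levels)).pow(2).sum().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

x = torch.randn(256)
e_minmax = (x - quantize(x, minmax_scale(x))).pow(2).sum()
e_mse = (x - quantize(x, mse_scale(x))).pow(2).sum()
assert e_mse <= e_minmax
```

The intuition: clipping a few outliers shrinks the scale, which reduces rounding error on the bulk of the values; the MSE search trades those two effects off explicitly.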
```diff
@@ -110,4 +116,55 @@ Above:
 * `--save_path` - Path to save the quantized model.
 * `--real_quant` - Whether to save the model in real quantization format.
 * `--eval_perplexity` - Whether to compute perplexity.
-* `--eval_openllm` - Whether to compute OpenLLMv1 scores.
+* `--eval_openllm` - Whether to compute OpenLLMv1 scores.
+
+The `real_quant` option produces models that are runnable on Blackwell architectures (`sm_120`) via transformers and vLLM (currently using this transformers [fork](https://github.com/huggingface/transformers/pull/38696/)).
+
+### Accuracy Evaluations
+
+The results below report the accuracy of quantized Llama-3 and Qwen-3 models
+on the OpenLLM v1 leaderboard. Specifically, we provide average metrics for the following tasks:
+* `mmlu_cot_llama` (exact_match, strict_match)
+* `arc_challenge_llama` (exact_match, strict_match)
+* `gsm8k_llama` (exact_match, strict_match)
+* `hellaswag` (acc_norm)
+* `winogrande` (acc)
+* `truthfulqa_mc2` (acc)
+
+The results for Qwen3 exclude `arc_challenge_llama`, as it turns out to be very noisy.
+
+Below, the left column corresponds to **weight-only** quantization and the right column to **weight-and-activation** quantization. Results for AWQ were produced via the dedicated [AutoAWQ fork](https://github.com/Godofnothing/AutoAWQ-FP).
+
+**Llama-3.1-8B-Instruct**
+
+<p float="left">
+  <img src="assets/llama-3.1-8b-acc-weight_only.png" width="400" />
+  <img src="assets/llama-3.1-8b-acc-weight_and_activation.png" width="400" />
+</p>
+
+**Qwen-3-8B**
+
+<p float="left">
+  <img src="assets/qwen3-3-8b-acc-weight_only.png" width="400" />
+  <img src="assets/qwen3-3-8b-acc-weight_and_activation.png" width="400" />
+</p>
+
+*Notes*: For the NVFP format without the `hadamard` rotation, GPTQ's average performance is below 0.65.
+By and large, `GPTQ+Had` appears to be the best method for preserving accuracy.
+
+### Inference speedups
+
+Below we provide performance numbers for end-to-end inference with QuTLASS kernels vs. the `bf16` baseline for Qwen3 models, on an RTX 5090 GPU.
+Please see the [QuTLASS](https://github.com/IST-DASLab/qutlass) repository for details on how to reproduce these results.
+
+<p float="left">
+  <img src="assets/inference_speedup_qwen3_8b.png" width="400" />
+  <img src="assets/inference_speedup_qwen3_14b.png" width="400" />
+</p>
+
+### Contributors
+
+This project is still in active development. So far, it has benefited from contributions by Denis Kuznedelev, Andrei Panferov, Vage Egiazarian, and Saleh Ashkboos, as well as Dan Alistarh, Michael Goin, and Eldar Kurtic. The [QuTLASS](https://github.com/IST-DASLab/qutlass) repository is developed primarily by Roberto Lopez Castro, with help from Jiale Chen.
```

model_quant.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -316,7 +316,7 @@ def main():
     model = AutoModelForCausalLM.from_pretrained(
         args.model_name_or_path,
         torch_dtype=args.dtype,
-        device_map=device,  # to avoid errors when model is split on mulitple GPUs
+        device_map=None if args.cpu_offload_modules else device,
         low_cpu_mem_usage=True,
     )
     model.config.use_cache = False
@@ -338,7 +338,7 @@ def main():
             args.num_sequences,
             args.seed
         )
-        quantized_state_dict = gptq_quantization(model, calibration_data, args, device)
+        quantized_state_dict = gptq_quantization(model, calibration_data, args, device=device)
     else:
        quantized_state_dict = rtn_quantization(model, args, device)
```

src/quantization/gptq.py

Lines changed: 14 additions & 3 deletions
```diff
@@ -229,6 +229,7 @@ def gptq_quantization(
 ) -> Optional[dict[str, torch.Tensor]]:
     print("GPTQ quantization...")
     orig_dtype = model.config.torch_dtype if args.dtype == "auto" else args.dtype
+    activation_offload_device = "cpu" if args.cpu_offload_activations else None
     # State dict with quantized weights, scales and hadamards
     quantized_state_dict = {}
     # Define common transform kwargs
@@ -261,7 +262,11 @@ def gptq_quantization(
 
     blocks = model.model.layers
     blocks[0] = blocks[0].to(device)
-    blocks[0] = InputCollector(blocks[0], cpu_offload=False)
+    blocks[0] = InputCollector(blocks[0], cpu_offload=activation_offload_device)
+
+    if args.cpu_offload_modules:
+        model.get_input_embeddings().to(device)
+        blocks[0].to(device)
 
     for sample in calibration_data:
         try:
@@ -274,6 +279,9 @@ def gptq_quantization(
     input_kwargs = blocks[0].input_kwargs
     blocks[0] = blocks[0].module
 
+    if args.cpu_offload_modules:
+        model.get_input_embeddings().cpu()
+
     # Iterate over transformer blocks
     for block_idx, block in enumerate(blocks):
         print(f"Processing block {block_idx}...")
@@ -381,12 +389,15 @@ def _hook(_, inp, out):
         out = maybe_first_element(out)
         # change only first input argument
         if len(inp_args) > 0:
-            inp_args[0].data = out
+            inp_args[0].data = out.to(activation_offload_device)
         elif "hidden_states" in inp_kwargs:
-            inp_kwargs["hidden_states"] = out
+            inp_kwargs["hidden_states"] = out.to(activation_offload_device)
         else:
             raise ValueError("Unsupported block input format.")
 
+        if args.cpu_offload_modules:
+            block = block.cpu()
+
     # 10. Clean-up
     del gptq_handles
     del hooks
```
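The `cpu_offload` argument threaded through `InputCollector` above lets calibration inputs be parked on CPU between blocks (note that `Tensor.to(None)` is a no-op, which is why `activation_offload_device = None` disables offloading). A minimal sketch of such a collector, under the assumption that it wraps block 0 and aborts the forward pass once the input is captured (as the `try:` around each calibration sample suggests):

```python
# Hypothetical sketch of an input collector with optional CPU offload;
# the repo's InputCollector differs in detail.
import torch

class InputCollector(torch.nn.Module):
    def __init__(self, module, cpu_offload=None):
        super().__init__()
        self.module = module
        self.offload_device = cpu_offload  # "cpu" to offload, None to keep in place
        self.input_args = []

    def forward(self, x, **kwargs):
        # .to(None) returns the tensor unchanged, so offload is optional
        self.input_args.append(x.detach().to(self.offload_device))
        # Stop the forward pass: only the input to this block is needed
        raise ValueError("input captured")

block = torch.nn.Linear(4, 4)
collector = InputCollector(block, cpu_offload="cpu")
try:
    collector(torch.randn(1, 4))
except ValueError:
    pass
assert collector.input_args[0].device.type == "cpu"
```

The captured inputs are later replayed block by block, with each block's outputs becoming the next block's (optionally offloaded) inputs, exactly as the hook in the hunk above does.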

src/quantization/qconfig.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,7 +8,7 @@ def prepare_quantization_config(group_size: int, format: str) -> dict[str, Any]:
             "forward_method": "abs_max",
             "hadamard_group_size": group_size,
             "modules_to_not_convert": ["lm_head"],
-            "quant_method": "quartet",
+            "quant_method": "fp_quant",
             "store_master_weights": False
         }
     elif format == "nvfp":
```
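The hunk above renames the serialized `quant_method` from `"quartet"` to `"fp_quant"`. A self-contained sketch of the resulting config for the microscaled branch (the enclosing `if` condition is not shown in the diff, so the `"mxfp"` guard here is an assumption; only the listed keys come from the source):

```python
# Sketch of prepare_quantization_config after this commit; the "mxfp"
# branch condition is assumed, the key/value pairs are from the diff.
from typing import Any

def prepare_quantization_config(group_size: int, format: str) -> dict[str, Any]:
    if format == "mxfp":  # assumed branch condition
        return {
            "forward_method": "abs_max",
            "hadamard_group_size": group_size,
            "modules_to_not_convert": ["lm_head"],
            "quant_method": "fp_quant",  # renamed from "quartet" in this commit
            "store_master_weights": False,
        }
    raise ValueError(f"Unsupported format: {format}")

cfg = prepare_quantization_config(32, "mxfp")
assert cfg["quant_method"] == "fp_quant"
```

Loaders dispatch on `quant_method`, so checkpoints saved with the old `"quartet"` tag would not be recognized by code expecting `"fp_quant"`.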
