# FP Format Quantization Harness
This is a harness for efficient and accurate weight-and-activation quantization for low-bit FP/INT formats, with and without microscaling, including FP4, NVFP4, and MXFP. These formats are compatible with the NVIDIA Blackwell GPU architecture.
The goal of the repository is to allow you to produce quantized models in these formats.
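For intuition, here is a minimal numpy sketch of round-to-nearest (RTN) quantization to an MXFP4-style format, with a shared power-of-two scale per group of 32 elements and FP4 (E2M1) element values. This is an illustration of how the format behaves, not the repository's implementation:

```python
import numpy as np

# Non-negative values representable by the FP4 (E2M1) element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_rtn(x: np.ndarray, group: int = 32) -> np.ndarray:
    """Fake-quantize a 1-D array: each group of `group` values shares one
    power-of-two scale (as in MX microscaling), and each scaled element is
    rounded to the nearest FP4 value. Returns the dequantized array."""
    g = x.reshape(-1, group)
    amax = np.abs(g).max(axis=1, keepdims=True)
    # Shared scale 2^(floor(log2(amax)) - 2), since 2 is the largest E2M1 exponent.
    scale = 2.0 ** (np.floor(np.log2(np.maximum(amax, 2.0**-126))) - 2)
    scaled = g / scale
    # Round-to-nearest onto the FP4 grid (magnitudes beyond 6 clip to 6).
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(x.shape)

x = np.random.randn(128)
xq = quantize_mxfp4_rtn(x)  # at most 16 distinct values per group (signed FP4 grid)
```

The shared scale being a power of two mirrors the E8M0 scale of the MX formats; NVFP4 differs mainly in its group size (16) and FP8 scale.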
Currently, the repository supports the standard microscaled MXFP4 format, together with standard methods such as RTN and GPTQ quantization for the weights. The main new approach supported, which we found to be particularly effective, is a variant of GPTQ (called GPTQ+Had) in which a block-wise Hadamard transform is applied to the weights and activations before quantization. Key to efficiency is that the Hadamard block size matches the microscaling format group size (16 or 32); in turn, this small Hadamard transform is automatically "fused" into our MatMul kernels.
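For intuition, the rotation at the heart of GPTQ+Had can be sketched in a few lines of numpy. This is illustrative only; in the repository the transform is fused into the MatMul kernels rather than materialized like this:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two),
    normalized so that H @ H.T == I."""
    assert n > 0 and n & (n - 1) == 0
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def blockwise_hadamard(t: np.ndarray, block: int = 16) -> np.ndarray:
    """Rotate each contiguous group of `block` trailing-dim features by H."""
    H = hadamard(block)
    return (t.reshape(*t.shape[:-1], -1, block) @ H).reshape(t.shape)

# Rotating weights and activations with the same orthogonal H leaves the
# layer output unchanged: (W H)(H^T x) == W x (this H is symmetric).
W = np.random.randn(4, 64)
x = np.random.randn(64)
y = W @ x
y_rot = blockwise_hadamard(W) @ blockwise_hadamard(x[None, :])[0]
```

Quantizing the rotated weights instead of `W` spreads outliers across each group, which is why matching the Hadamard block size to the microscaling group size matters.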
The inference code to run models in the `MXFP` format (with speedups) can be found in the [QuTLASS](https://github.com/IST-DASLab/qutlass) repository.
### Repository structure
---
The repository is structured as follows:
* `model_quant.py` - the main script for quantizing Llama/Qwen models
* `src/` - source code with the implementation of all necessary functionality
* `--gptq` - Whether to use GPTQ quantization for the weights.
* `--transform_class` - Transform class (`identity` or `hadamard`).
* `--dataset_name_or_path` - Dataset to use.
* `--sequence_length` - Sequence length.
* `--save_path` - Path to save the quantized model.
* `--real_quant` - Whether to save the model in the real quantized format.
* `--eval_perplexity` - Whether to compute perplexity.
* `--eval_openllm` - Whether to compute OpenLLMv1 scores.
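Putting the flags together, a typical invocation might look like the following. The flag names above come from this README, but `--model_name_or_path` and all concrete values here are hypothetical placeholders for illustration:

```shell
# Hypothetical invocation: the model flag and all values are illustrative.
python model_quant.py \
    --model_name_or_path meta-llama/Llama-3.1-8B \
    --gptq \
    --transform_class hadamard \
    --dataset_name_or_path c4 \
    --sequence_length 2048 \
    --save_path ./llama3-mxfp4 \
    --real_quant \
    --eval_perplexity
```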
The `real_quant` option produces models that are runnable on Blackwell architectures (`sm_120`) via transformers and vLLM (currently using the transformers [fork](https://github.com/huggingface/transformers/pull/38696/)).
### Accuracy Evaluations
The tables below provide evaluation results for quantized Llama-3 and Qwen-3 models on the OpenLLM v1 leaderboard. Specifically, we provide average metrics for the following tasks:

The results for Qwen3 exclude `arc_challenge_llama`, as it turned out to be very noisy.
In the tables below, the left column corresponds to **weight-only** quantization and the right column to **weight-and-activation** quantization. Results for AWQ were produced via the dedicated [AutoAWQ fork](https://github.com/Godofnothing/AutoAWQ-FP).

This project is still in active development. So far, it has benefited from contributions by Denis Kuznedelev, Andrei Panferov, Vage Egiazarian, and Saleh Ashkboos, as well as Dan Alistarh, Michael Goin, and Eldar Kurtic. The [QuTLASS](https://github.com/IST-DASLab/qutlass) repository is developed primarily by Roberto Lopez Castro, with help from Jiale Chen.