
Conversation

dsikka (Collaborator) commented Oct 28, 2025

Summary

  • Add the option to define a scale or zero-point (zp) dtype when defining your quantization schemes
  • If defined, the scale dtype is used to round the scale when generating qparams and to cast the scale at compression time
  • This lets us remove the is_fp4 requirement and some of the FP4-specific functionality that was closely tied to global scale generation

  • We are not applying this logic for now, but would like to discuss with the team to gather thoughts:

  1. We set the zp_dtype to None if running symmetric quantization.
  2. We set the scale_dtype to None if running dynamic or local quantization.
  • Clean up calculate_qparam
  • Rename round_to_quantized_type to round_to_quantized_type_args and add clamping functionality to this method
  • Add an additional round_to_quantized_type_dtype, which similarly clamps and rounds when given a dtype as input rather than a set of qargs (a rough sketch of this helper follows this list)
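For illustration, here is a minimal sketch of what the dtype-based round/clamp helper could look like; the name mirrors round_to_quantized_type_dtype from the list above, but the body is an assumption, not the PR's actual implementation:

import torch

def round_to_quantized_type_dtype(tensor: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Sketch only: clamp to the representable range of the target dtype, then
    # cast down and back up so the result keeps its original dtype but only
    # holds values representable in `dtype` (e.g. float8_e4m3fn scales).
    if dtype.is_floating_point:
        info = torch.finfo(dtype)
        clamped = torch.clamp(tensor, min=info.min, max=info.max)
    else:
        info = torch.iinfo(dtype)
        clamped = torch.clamp(torch.round(tensor), min=info.min, max=info.max)
    return clamped.to(dtype).to(tensor.dtype)

# e.g. round a bf16 scale to what float8_e4m3fn can represent (NVFP4-style scales)
scale = torch.rand(16, dtype=torch.bfloat16) * 500
rounded_scale = round_to_quantized_type_dtype(scale, torch.float8_e4m3fn)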

Question:

  • The zp_dtype for int4 is int8, but we pack to int32, which is what gets saved to disk / in the checkpoint. Does it make sense to add specific logic that sets zp_dtype to int32 when the config is saved, since that is what ends up in the checkpoint? I am leaning towards yes, as we want the compressed-tensors config to best reflect what is in the checkpoint. (A toy packing sketch follows below for context.)
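For context on the int4 -> int32 question, a toy sketch of the kind of nibble packing the pack-quantized formats do at save time; the exact layout compressed-tensors uses on disk may differ:

import torch

def pack_int4_to_int32(values: torch.Tensor) -> torch.Tensor:
    # Toy sketch: pack eight signed int4 values (carried in an int8 tensor)
    # into one int32 word, lowest nibble first. The eighth nibble occupies
    # the sign bit of the int32 word.
    assert values.numel() % 8 == 0
    nibbles = (values.to(torch.int32) & 0xF).view(-1, 8)
    packed = torch.zeros(nibbles.shape[0], dtype=torch.int32)
    for i in range(8):
        packed |= nibbles[:, i] << (4 * i)
    return packed

zp = torch.randint(-8, 8, (16,), dtype=torch.int8)  # int4 range, carried as int8
packed_zp = pack_int4_to_int32(zp)                  # int32 is what lands in the checkpoint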

Example Updates:

KV Cache Scheme:

"kv_cache_scheme": {
  "actorder": null,
  "block_structure": null,
  "dynamic": false,
  "group_size": null,
  "num_bits": 8,
  "observer": "minmax",
  "observer_kwargs": {},
  "scale_dtype": "bfloat16",
  "strategy": "tensor",
  "symmetric": true,
  "type": "float",
  "zp_dtype": null
}

NVFP4:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "nvfp4-pack-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": "local",
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": "float8_e4m3fn",
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

FP8 Dynamic:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "float-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": null,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "token",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "bfloat16",
          "strategy": "channel",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

W4A16 + Asym:

 "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "pack-quantized",
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 128,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "torch.bfloat16",
          "strategy": "group",
          "symmetric": false,
          "type": "int",
          "zp_dtype": "torch.int8"
        }
      }
    },
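For reference, a sketch of how the W4A16 asymmetric scheme above might be declared in Python once the new fields land; the field names follow the serialized config, but the exact QuantizationArgs API and accepted dtype values are assumptions, not something confirmed in this thread:

import torch
from compressed_tensors.quantization import QuantizationArgs

# Sketch only: assumes scale_dtype / zp_dtype are exposed on QuantizationArgs
# exactly as they appear in the serialized config above.
w4a16_asym_weights = QuantizationArgs(
    num_bits=4,
    type="int",
    symmetric=False,
    strategy="group",
    group_size=128,
    observer="minmax",
    scale_dtype=torch.bfloat16,
    zp_dtype=torch.int8,
)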

@dsikka dsikka marked this pull request as ready for review October 29, 2025 21:32
dsikka (Collaborator, Author) commented Oct 29, 2025

Dipika TODO: try W4A16 with a zero point to make sure it is saved correctly.

HDCharles (Collaborator) commented:

I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.

I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

dsikka (Collaborator, Author) commented Oct 30, 2025

> I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.
>
> I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

The point is to make it clear in the metadata what is compressed on disk. When doing symmetric quantization the zp is never saved or set in the checkpoint, and when doing dynamic quantization the scale is never saved or set; having them defined in the config would be extremely confusing.

You can also run dynamic quantization with any floating-point dtype, depending on how you load your model, since the scale will just match the dtype of the activations. So having it defined in the config doesn't make a lot of sense.

In the case of zp_dtype, it is ignored when symmetric and set to None in the config.
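To illustrate the point about dynamic quantization, a minimal per-token FP8 sketch (not compressed-tensors code): the scale is recomputed on every forward pass and simply inherits the activation dtype, so there is nothing meaningful to persist in the config:

import torch

def dynamic_per_token_fp8(x: torch.Tensor):
    # Minimal illustration, not the library's implementation: the scale takes
    # whatever dtype the activations have (bf16 here, fp16 if the model was
    # loaded in fp16, etc.) and is never written to the checkpoint.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().amax(dim=-1, keepdim=True) / fp8_max  # same dtype as x
    x_q = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_q, scale

acts = torch.randn(4, 64, dtype=torch.bfloat16)
x_q, scale = dynamic_per_token_fp8(acts)
assert scale.dtype == torch.bfloat16  # nothing to record in the config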

@dsikka dsikka requested a review from kylesayrs November 3, 2025 19:56
@dsikka dsikka requested a review from kylesayrs November 5, 2025 21:34
@dsikka dsikka requested a review from kylesayrs November 5, 2025 23:38
brian-dellabetta (Collaborator) left a comment:

One question on skip scale

kylesayrs previously approved these changes Nov 7, 2025
@dsikka dsikka merged commit 8471264 into main Nov 10, 2025
3 checks passed
@dsikka dsikka deleted the quant_args_dtype branch November 10, 2025 16:05