
Conversation

dsikka (Collaborator) commented Oct 28, 2025

Summary

  • Add the option to define a scale or zero-point (zp) dtype when defining your quantization schemes
  • If defined, the scale dtype is used to round the scale when generating qparams and to cast the scale at compression time
  • This lets us remove the is_fp4 requirement and some of the FP4-specific functionality that was closely tied to global scale generation

  • We are not applying this logic for now, but would like to discuss with the team to gather thoughts:

  1. We set the zp_dtype to None if running symmetric quantization.
  2. We set the scale_dtype to None if running dynamic or local quantization.
  • Clean up calculate_qparam
  • Rename round_to_quantized_type to round_to_quantized_type_args and add clamping functionality to this method
  • Add an additional round_to_quantized_type_dtype, which similarly clamps and rounds when given a dtype as input rather than a set of qargs (a rough sketch of this helper follows this list)
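For illustration, here is a minimal sketch of what the dtype-based round/clamp helper could look like; the name mirrors round_to_quantized_type_dtype from the list above, but the body is an assumption, not the PR's actual implementation:

import torch

def round_to_quantized_type_dtype(tensor: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Sketch only: clamp to the representable range of the target dtype, then
    # cast down and back up so the result keeps its original dtype but only
    # holds values representable in `dtype` (e.g. float8_e4m3fn scales).
    if dtype.is_floating_point:
        info = torch.finfo(dtype)
        clamped = torch.clamp(tensor, min=info.min, max=info.max)
    else:
        info = torch.iinfo(dtype)
        clamped = torch.clamp(torch.round(tensor), min=info.min, max=info.max)
    return clamped.to(dtype).to(tensor.dtype)

# e.g. round a bf16 scale to what float8_e4m3fn can represent (NVFP4-style scales)
scale = torch.rand(16, dtype=torch.bfloat16) * 500
rounded_scale = round_to_quantized_type_dtype(scale, torch.float8_e4m3fn)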

Question:

  • The zp_dtype for int4 is int8, but we pack to int32, which is what gets saved to disk / in the checkpoint. Does it make sense to add specific logic that sets zp_dtype to int32 when the config is saved, since that is what ends up in the checkpoint? I am leaning towards yes, as we want the compressed-tensors config to best reflect what is in the checkpoint. (A toy packing sketch follows below for context.)
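For context on the int4 -> int32 question, a toy sketch of the kind of nibble packing the pack-quantized formats do at save time; the exact layout compressed-tensors uses on disk may differ:

import torch

def pack_int4_to_int32(values: torch.Tensor) -> torch.Tensor:
    # Toy sketch: pack eight signed int4 values (carried in an int8 tensor)
    # into one int32 word, lowest nibble first. The eighth nibble occupies
    # the sign bit of the int32 word.
    assert values.numel() % 8 == 0
    nibbles = (values.to(torch.int32) & 0xF).view(-1, 8)
    packed = torch.zeros(nibbles.shape[0], dtype=torch.int32)
    for i in range(8):
        packed |= nibbles[:, i] << (4 * i)
    return packed

zp = torch.randint(-8, 8, (16,), dtype=torch.int8)  # int4 range, carried as int8
packed_zp = pack_int4_to_int32(zp)                  # int32 is what lands in the checkpoint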

Example Updates:

KV Cache Scheme:

"kv_cache_scheme": {
  "actorder": null,
  "block_structure": null,
  "dynamic": false,
  "group_size": null,
  "num_bits": 8,
  "observer": "minmax",
  "observer_kwargs": {},
  "scale_dtype": "bfloat16",
  "strategy": "tensor",
  "symmetric": true,
  "type": "float",
  "zp_dtype": null
}

NVFP4:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "nvfp4-pack-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": "local",
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": "float8_e4m3fn",
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

FP8 Dynamic:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "float-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": null,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "token",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "bfloat16",
          "strategy": "channel",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

W4A16 + Asym:

 "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "pack-quantized",
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 128,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "torch.bfloat16",
          "strategy": "group",
          "symmetric": false,
          "type": "int",
          "zp_dtype": "torch.int8"
        }
      }
    },
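For reference, a sketch of how the W4A16 asymmetric scheme above might be declared in Python once the new fields land; the field names follow the serialized config, but the exact QuantizationArgs API and accepted dtype values are assumptions, not something confirmed in this thread:

import torch
from compressed_tensors.quantization import QuantizationArgs

# Sketch only: assumes scale_dtype / zp_dtype are exposed on QuantizationArgs
# exactly as they appear in the serialized config above.
w4a16_asym_weights = QuantizationArgs(
    num_bits=4,
    type="int",
    symmetric=False,
    strategy="group",
    group_size=128,
    observer="minmax",
    scale_dtype=torch.bfloat16,
    zp_dtype=torch.int8,
)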

@dsikka dsikka marked this pull request as ready for review October 29, 2025 21:32
dsikka (Collaborator, Author) commented Oct 29, 2025

Dipika TODO: try W4A16 with a zero point to make sure it is saved correctly.

HDCharles (Collaborator) commented:

I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.

I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

dsikka (Collaborator, Author) commented Oct 30, 2025

> I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.
>
> I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

The point is to make it clear in the metadata what is compressed on disk. When doing symmetric quantization the zp is never saved or set in the checkpoint, and when doing dynamic quantization the scale is never saved or set; having them defined in the config would be extremely confusing.

You can also run dynamic quantization with any floating-point dtype, depending on how you load your model, since the scale will just match the dtype of the activations. So having it defined in the config doesn't make a lot of sense.

In the case of zp_dtype, it is ignored when symmetric and set to None in the config.
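To illustrate the point about dynamic quantization, a minimal per-token FP8 sketch (not compressed-tensors code): the scale is recomputed on every forward pass and simply inherits the activation dtype, so there is nothing meaningful to persist in the config:

import torch

def dynamic_per_token_fp8(x: torch.Tensor):
    # Minimal illustration, not the library's implementation: the scale takes
    # whatever dtype the activations have (bf16 here, fp16 if the model was
    # loaded in fp16, etc.) and is never written to the checkpoint.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().amax(dim=-1, keepdim=True) / fp8_max  # same dtype as x
    x_q = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_q, scale

acts = torch.randn(4, 64, dtype=torch.bfloat16)
x_q, scale = dynamic_per_token_fp8(acts)
assert scale.dtype == torch.bfloat16  # nothing to record in the config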

@dsikka dsikka requested a review from kylesayrs November 3, 2025 19:56
@dsikka dsikka requested a review from kylesayrs November 5, 2025 21:34
@dsikka dsikka requested a review from kylesayrs November 5, 2025 23:38
brian-dellabetta (Collaborator) left a comment:

One question on skip scale

kylesayrs previously approved these changes Nov 7, 2025
@dsikka dsikka merged commit 8471264 into main Nov 10, 2025
3 checks passed
@dsikka dsikka deleted the quant_args_dtype branch November 10, 2025 16:05