Description
Hello,
I’m using LLM Compressor (v0.x) to quantize the model meta-llama/Llama-3.2-1B with the following recipe:
```python
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    ignore=["lm_head"],
    scheme="W4A16",
    targets=["Linear"],
)
```
After running `oneshot(...)` and saving with:

```python
model.save_pretrained(
    SAVE_DIR,
    safe_serialization=True,
    save_compressed=True,
    state_dict=state_dict,
)
```
I obtained two output files:

- model.safetensors
- pytorch_model.bin

Both files are roughly 1.44 GB in size.
However, when I quantize the same model with AutoAWQ using the same W4A16 scheme, the resulting file is only about 1 GB.
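For context, here is the rough back-of-envelope estimate that made me expect something closer to 1 GB. The parameter split (~0.97B Linear weights and ~0.26B embedding weights for Llama-3.2-1B) and the group size of 128 are my own assumptions, not values reported by either library:

```python
# Back-of-envelope size estimate for W4A16 (assumed numbers, see above).
linear_params = 0.97e9   # Linear weights targeted by the recipe, quantized to 4 bits
embed_params = 0.26e9    # token embeddings (lm_head is ignored), kept at 16 bits
group_size = 128         # assumed quantization group size

packed_weights = linear_params * 0.5                     # 4 bits = 0.5 bytes per weight
scales_and_zeros = (linear_params / group_size) * 2 * 2  # 16-bit scale + zero point per group
fp16_tensors = embed_params * 2                          # 2 bytes per 16-bit value

total_gb = (packed_weights + scales_and_zeros + fp16_tensors) / 1e9
print(f"expected on-disk size ≈ {total_gb:.2f} GB")      # ≈ 1.0 GB
```

That estimate lines up with the ~1 GB AutoAWQ file, which is why the 1.44 GB output surprised me.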
Main question
Is this file size difference (~1.44 GB with LLM Compressor vs. ~1 GB with AutoAWQ) expected for W4A16 quantization?
Or could it indicate that:

- the weights were not fully packed into 4-bit format,
- additional metadata or full-precision tensors were saved alongside the compressed weights,
- or that save_pretrained reverted to storing redundant copies (e.g., both FP and quantized states)?
Additional notes
The quantized model reloads and generates correctly, but I want to confirm whether the ~1.44 GB file size is normal for a 1B-parameter model at W4A16, or if it suggests that the compression is only partial (e.g., still storing FP16 weights somewhere).
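In case it is useful, this is the kind of check I intend to run to see where the bytes actually go: it simply sums tensor bytes per dtype in the saved safetensors file (the path is a placeholder for the SAVE_DIR used above). If the 4-bit packing worked, I would expect most of the bytes to sit under an integer dtype, with the 16-bit bytes roughly matching the embeddings.

```python
# Sum the on-disk tensor bytes per dtype in the compressed checkpoint.
# A large fp16/bf16 share beyond the embeddings would suggest that
# full-precision copies were saved alongside the packed weights.
import os
from collections import defaultdict
from safetensors import safe_open

path = os.path.join("SAVE_DIR", "model.safetensors")  # adjust to your save directory
bytes_by_dtype = defaultdict(int)

with safe_open(path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        bytes_by_dtype[str(tensor.dtype)] += tensor.numel() * tensor.element_size()

for dtype, nbytes in sorted(bytes_by_dtype.items(), key=lambda kv: -kv[1]):
    print(f"{dtype:>15}: {nbytes / 1e9:.2f} GB")
```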
Thanks a lot for your time and for maintaining this great project! 🙏