Model file size (~1.44 GB) larger than expected after W4A16 quantization (meta-llama/Llama-3.2-1B) #1969

@manfeilong

Description

Hello,

I’m using LLM Compressor (v0.x) to quantize the model meta-llama/Llama-3.2-1B with the following recipe:

from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    ignore=["lm_head"],
    scheme="W4A16",
    targets=["Linear"],
)

After running:

oneshot(...)

and saving with:

model.save_pretrained(
    SAVE_DIR,
    safe_serialization=True,
    save_compressed=True,
    state_dict=state_dict,
)

I obtained two output files:

  • model.safetensors

  • pytorch_model.bin

Both files are roughly 1.44 GB in size.

However, when I quantize the same model with AutoAWQ using the same W4A16 scheme, the resulting file is only about 1 GB.


Main question

Is this file size difference (~1.44 GB with LLM Compressor vs. ~1 GB with AutoAWQ) expected for W4A16 quantization? (My rough size arithmetic is sketched after the list below.)

Or could it indicate that:

  • the weights were not fully packed into 4-bit format,

  • additional metadata or full-precision tensors were saved alongside compressed weights,

  • or that save_pretrained ended up storing redundant copies (e.g., both full-precision and quantized states)?
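For context, here is the rough back-of-the-envelope arithmetic behind my ~1 GB expectation. The parameter count, tied embeddings, and group size of 128 are my assumptions about Llama-3.2-1B and the W4A16 scheme, not something I verified against the checkpoint:

# Rough size estimate for W4A16 on Llama-3.2-1B (all figures are assumptions):
#   ~1.24B total parameters, tied 128256 x 2048 embedding kept in FP16,
#   remaining Linear weights packed to 4 bits with per-group (128) FP16 scales.
embed_params = 128_256 * 2_048
total_params = 1.24e9
linear_params = total_params - embed_params

embed_fp16_gb = embed_params * 2 / 1e9          # ~0.53 GB (embedding / tied lm_head)
packed_4bit_gb = linear_params * 0.5 / 1e9      # 4 bits = 0.5 byte per weight, ~0.49 GB
scales_gb = linear_params / 128 * 2 * 2 / 1e9   # scale + zero point per group, ~0.03 GB

print(f"expected ~{embed_fp16_gb + packed_4bit_gb + scales_gb:.2f} GB")  # roughly 1.05 GB

That estimate lands much closer to the AutoAWQ output than to the ~1.44 GB I am seeing, which is what prompted the question.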

Additional notes

The quantized model reloads and generates correctly, but I want to confirm whether the ~1.44 GB file size is normal for a 1B-parameter model at W4A16, or if it suggests that the compression is only partial (e.g., still storing FP16 weights somewhere).
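In case it helps with diagnosing this, here is a small sketch (assuming the standard safetensors Python API) that breaks the saved checkpoint down by tensor dtype; if the weights really were packed to 4 bits, most of the bytes should show up as packed integer tensors rather than FP16/BF16:

from collections import Counter
from safetensors import safe_open

path = f"{SAVE_DIR}/model.safetensors"
bytes_per_dtype = Counter()

# Sum the stored bytes per dtype across all tensors in the file.
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        bytes_per_dtype[str(t.dtype)] += t.numel() * t.element_size()

for dtype, nbytes in sorted(bytes_per_dtype.items(), key=lambda kv: -kv[1]):
    print(f"{dtype:>15}: {nbytes / 1e9:.2f} GB")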

Thanks a lot for your time and for maintaining this great project! 🙏

