A comprehensive toolkit for quantizing large language models to GGUF format with support for multiple acceleration backends (CUDA, Metal, CPU).
| Feature | Status | Description |
|---|---|---|
| 🖥️ Bare Metal | ✅ | Native installation without Docker |
| 🔧 Auto Setup | ✅ | Automatic environment detection and configuration |
| 🎯 Multi-Backend | ✅ | CUDA, Metal (Apple Silicon), and CPU support |
| 📦 Conda Ready | ✅ | Complete conda environment with all dependencies |
| ⚡ Quick Scripts | ✅ | Convenient scripts for common tasks |
| 📊 Perplexity | ✅ | Automated quality testing of quantized models |
| 🔍 Validation | ✅ | Environment health checks and troubleshooting |
You will need:

| Requirement | Minimum Version | Notes |
|---|---|---|
| Conda | Latest | Miniconda or Anaconda |
| Python | 3.11+ | Installed via conda |
| Git | 2.0+ | For repository operations |
| CMake | 3.14+ | For building llama.cpp |
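A quick way to confirm the prerequisites are on your `PATH` (an illustrative check, not part of the repo's tooling):

```python
import shutil

# Each prerequisite must resolve to an executable on PATH.
for tool in ("conda", "python", "git", "cmake"):
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND'}")
```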
Supported platforms and acceleration:

| Platform | Requirements | Acceleration |
|---|---|---|
| NVIDIA | CUDA 11.8+ | ✅ CUDA acceleration |
| Apple Silicon | macOS + M1/M2/M3 | ✅ Metal acceleration |
| Others | Any CPU | ✅ Optimized CPU processing |
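Backend selection during setup presumably comes down to probing the host machine; here is a minimal sketch of how such detection might work (illustrative only, not the repo's actual logic):

```python
import platform
import shutil

def detect_backend() -> str:
    """Pick an acceleration backend based on the host machine."""
    if shutil.which("nvidia-smi"):  # NVIDIA driver present
        return "CUDA"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "Metal"  # Apple Silicon
    return "CPU"

print(f"Selected backend: {detect_backend()}")
```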
To get started, clone the repository and run the automated setup:

```bash
# Clone the repository
git clone https://github.com/Vikhrmodels/quantization-utils.git
cd quantization-utils
# Run the automated setup script
chmod +x scripts/setup.sh
./scripts/setup.sh
```

Alternatively, set up the environment manually:

```bash
# Create conda environment (OS-specific)
# For Linux:
conda env create -f environment-linux.yml
# For macOS:
conda env create -f environment-macos.yml
# Generic (fallback):
conda env create -f environment.yml
# Activate environment
conda activate quantization-utils
# Run setup to install llama.cpp and prepare directories
python setup.py
# Add to PATH (if needed)
export PATH="$HOME/.local/bin:$PATH"
```

Verify your installation:

```bash
# Check environment health
./scripts/validate.sh
# Quick test
conda activate quantization-utils
cd GGUF
python -c "from shared import validate_environment; validate_environment()"
```
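If the check fails, you can also look for the llama.cpp binaries by hand. A minimal sketch, assuming the standard llama.cpp tool names (the repo's actual `validate_environment` may check more than this):

```python
import shutil

# The pipeline depends on these llama.cpp tools being on PATH.
for tool in ("llama-quantize", "llama-imatrix", "llama-perplexity"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'MISSING -- try running python setup.py'}")
```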
With a healthy environment, quantizing a model takes a single command:

```bash
# Activate environment
conda activate quantization-utils
# Quantize a model with default settings
./scripts/quantize.sh microsoft/DialoGPT-medium
# Custom quantization levels
./scripts/quantize.sh Vikhrmodels/Vikhr-Gemma-2B-instruct -q Q4_K_M,Q5_K_M,Q8_0
# Force re-quantization
./scripts/quantize.sh microsoft/DialoGPT-medium --force
```
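To process several models in one go, you can drive the same script from Python (a sketch using only the documented script interface; the model list is an example):

```python
import subprocess

MODELS = [
    "microsoft/DialoGPT-medium",
    "Vikhrmodels/Vikhr-Gemma-2B-instruct",
]

for model_id in MODELS:
    # Each call downloads, converts, and quantizes one model.
    subprocess.run(
        ["./scripts/quantize.sh", model_id, "-q", "Q4_K_M,Q8_0"],
        check=True,
    )
```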
For finer control, run the pipeline directly:

```bash
cd GGUF

# Full pipeline with all quantization levels
python pipeline.py --model_id microsoft/DialoGPT-medium
# Specific quantization levels only
python pipeline.py --model_id microsoft/DialoGPT-medium -q Q4_K_M -q Q8_0
# With perplexity testing
python pipeline.py --model_id microsoft/DialoGPT-medium --perplexity
# For gated models (requires HF token)
python pipeline.py --model_id meta-llama/Llama-2-7b-hf --hf_token $HF_TOKEN
```

Evaluate the quality of quantized models with perplexity testing:

```bash
# Test all quantized versions
./scripts/perplexity.sh microsoft/DialoGPT-medium
# Force recalculation
./scripts/perplexity.sh microsoft/DialoGPT-medium --force
```
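Perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better; a quantized model is judged by how little its perplexity rises over the full-precision baseline. A toy illustration of the arithmetic (the NLL values are made up, and this is not the llama.cpp implementation):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    # Perplexity = exp(mean negative log-likelihood per token).
    return math.exp(sum(token_nlls) / len(token_nlls))

baseline = perplexity([2.01, 1.87, 2.10, 1.95])   # full-precision model
quantized = perplexity([2.05, 1.92, 2.14, 1.99])  # e.g. Q4_K_M
print(f"degradation: {100 * (quantized / baseline - 1):+.2f}%")
```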
The repository is organized as follows:

```
quantization-utils/
├── 📄 environment.yml        # Conda environment definition
├── 📄 setup.py               # Environment setup script
├── 📄 README.md              # This file
│
├── 🔧 scripts/               # Convenience scripts
│   ├── setup.sh              # Automated setup
│   ├── validate.sh           # Environment validation
│   ├── quantize.sh           # Quick quantization
│   └── perplexity.sh         # Perplexity testing
│
└── 📦 GGUF/                  # Main processing directory
    ├── 📄 pipeline.py        # Main pipeline script
    ├── 📄 shared.py          # Shared utilities
    ├── 📁 models/            # Downloaded models
    ├── 📁 imatrix/           # Importance matrices
    ├── 📁 output/            # Final quantized models
    ├── 📁 resources/         # Calibration data
    │   └── standard_cal_data/
    └── 📁 modules/           # Processing modules
        ├── convert.py
        ├── quantize.py
        ├── imatrix.py
        └── perplexity.py
```
The pipeline can be configured through environment variables:

| Variable | Description | Example |
|---|---|---|
| `HF_TOKEN` | HuggingFace API token | `hf_...` |
| `CUDA_VISIBLE_DEVICES` | GPU selection | `0,1` |
| `OMP_NUM_THREADS` | CPU threads | `8` |
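For example, to pin the run to one GPU and eight threads when launching the pipeline from Python (a sketch; the variable values are examples):

```python
import os
import subprocess

# Overlay example values onto the current environment (Python 3.9+ dict union).
env = os.environ | {
    "CUDA_VISIBLE_DEVICES": "0",  # first GPU only
    "OMP_NUM_THREADS": "8",       # match physical core count
}

subprocess.run(
    ["python", "pipeline.py", "--model_id", "microsoft/DialoGPT-medium"],
    cwd="GGUF",
    env=env,
    check=True,
)
```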
`pipeline.py` accepts the following parameters:

| Parameter | Description | Default |
|---|---|---|
| `--model_id` | HuggingFace model ID | Required |
| `--quants` | Quantization levels | All levels |
| `--force` | Force reprocessing | `False` |
| `--perplexity` | Run quality tests | `False` |
| `--threads` | Processing threads | CPU count |
The supported quantization levels trade size for quality:

| Level | Description | Size | Quality |
|---|---|---|---|
| `Q2_K` | 2-bit quantization | Smallest | Good |
| `Q4_K_M` | 4-bit mixed | Balanced | Very Good |
| `Q5_K_M` | 5-bit mixed | Larger | Excellent |
| `Q6_K` | 6-bit | Large | Near Original |
| `Q8_0` | 8-bit | Largest | Original |
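As a back-of-the-envelope estimate, file size scales with bits per weight: roughly parameters × bits ÷ 8. The sketch below uses approximate effective bit widths for a hypothetical 7B-parameter model; real GGUF files add block scales and metadata, so actual sizes differ:

```python
# Approximate effective bits per weight for each level (rough figures).
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}
PARAMS = 7e9  # hypothetical 7B-parameter model

for level, bpw in BPW.items():
    size_gb = PARAMS * bpw / 8 / 1e9
    print(f"{level}: ~{size_gb:.1f} GB")
```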
Common issues and their fixes:

| Issue | Solution |
|---|---|
| `conda: command not found` | Install Miniconda/Anaconda |
| `llama-quantize: not found` | Run `python setup.py` |
| CUDA out of memory | Reduce batch size or use CPU |
| Permission denied | Check file permissions with `chmod +x` |
| `PackagesNotFoundError` | Use the OS-specific environment file |

If the environment is broken beyond repair, rebuild it from scratch:

```bash
# Reset environment
conda env remove -n quantization-utils
# Recreate with OS-specific file
# Linux:
conda env create -f environment-linux.yml
# macOS:
conda env create -f environment-macos.yml
# Reinstall llama.cpp
rm -rf ~/.local/bin/llama-*
python setup.py
# Check installation
./scripts/validate.sh
```

If the automated install fails, build llama.cpp manually:

```bash
# Manual llama.cpp installation
cd /tmp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/.local
make -j$(nproc)
make install
```

To add a new quantization method:

- Update `shared.py` with new `Quant` enum values (see the sketch after this list)
- Modify `modules/quantize.py` to handle the new methods
- Update the pipeline's default quantization list
- Test with validation scripts
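For the first step, the new level goes into the `Quant` enum in `shared.py`. A hypothetical sketch (the enum's real shape in the repo may differ, and `IQ4_XS` is just an example of a llama.cpp quant type you might add):

```python
from enum import Enum

class Quant(str, Enum):
    # Existing levels (illustrative; mirror what shared.py actually defines).
    Q2_K = "Q2_K"
    Q4_K_M = "Q4_K_M"
    Q5_K_M = "Q5_K_M"
    Q6_K = "Q6_K"
    Q8_0 = "Q8_0"
    # Newly added method:
    IQ4_XS = "IQ4_XS"
```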
To add custom calibration data:

```bash
# Add to GGUF/resources/standard_cal_data/
# Files should be UTF-8 text with one sample per line
```

Performance tips:

| Tip | Description |
|---|---|
| 🚀 GPU Usage | Use CUDA/Metal for 5-10x speedup |
| 💾 Memory | Monitor RAM usage with large models |
| 📊 Batch Size | Adjust based on available memory |
| 🧵 Threads | Set to CPU core count for optimal CPU performance |
Contributions are welcome:

- Fork the repository
- Create a feature branch
- Test with `./scripts/validate.sh`
- Submit a pull request
This project is licensed under the terms specified in the LICENSE file.
- llama.cpp: https://github.com/ggerganov/llama.cpp
- HuggingFace: https://huggingface.co/
- GGUF Format: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
Ready to quantize? Start with `./scripts/setup.sh` 🚀