diff --git a/README.md b/README.md index b5f5350..a0a9926 100644 --- a/README.md +++ b/README.md @@ -1,25 +1,25 @@ -# 🧪 Advanced AI on Arm +# Advanced AI on Arm This course provides a hands-on introduction to *extreme model quantization*, *hardware-aware optimization*, and *on-device deployment* for generative AI models. You'll explore advanced techniques to reduce model size, accelerate inference, and deploy compact LLMs on edge devices like Android smartphones. -## 🧬 Labs Overview +## Labs Overview -### 🔹 Lab 1: **Extreme Quantization** -Train a language model and progressively quantize it from FP32 to 8-bit, 4-bit, 2-bit, and 1-bit precision. Implement and evaluate **Quantization-Aware Training (QAT)** to mitigate accuracy degradation in ultra-low-bit models. +### Lab 1: **Extreme Quantization** +Train a language model and progressively quantize it from FP32 to 8-bit, 4-bit, 2-bit, and 1-bit precision. Implement and evaluate **quantization aware training (QAT)** to mitigate accuracy degradation in ultra-low-bit models. -### 🔹 Lab 2: **Hardware–Software Model Co-Design** -Wrap all `nn.Linear` layers with a custom `QLinear` module and explore **layerwise post-training quantization**. Search for the optimal bit-width configuration to maximize efficiency while maintaining model fidelity in a software-hardware co-design process. +### Lab 2: **Hardware–Software Model Co-Design** +Wrap all `nn.Linear` layers with a custom `QLinear` module and explore **layerwise post-training quantization**. Search for the optimal bit-width configuration to maximize efficiency while maintaining model fidelity in a hardware-software co-design process. -### 🔹 Lab 3: **Running & Quantizing Models on Android** -Use [`llama.cpp`](https://github.com/ggerganov/llama.cpp) to quantize and deploy LLaMA-style LLMs on Android. Learn how to benchmark and run models *offline*, directly on your mobile hardware. 
+### Lab 3: **Running & Quantizing Models on Android** +Use [`llama.cpp`](https://github.com/ggerganov/llama.cpp) to quantize and deploy Llama-style LLMs on Android. Learn how to benchmark and run models *offline*, directly on your mobile hardware. --- -## 🚀 Getting Started +## Getting Started This repository uses a unified `requirements.txt` and Git LFS to manage dependencies and large pretrained models. -### 1️⃣ Clone the Repository and Download Model Weights +### 1. Clone the Repository and Download Model Weights ```bash # Install Git LFS if needed @@ -32,7 +32,7 @@ cd Advanced-AI-on-Arm git lfs pull ``` -### 2️⃣ Set Up the Python Environment +### 2. Set Up the Python Environment ```bash python3 -m venv venv @@ -40,7 +40,7 @@ source venv/bin/activate pip install -r requirements.txt ``` -### 3️⃣ Run the Labs +### 3. Run the Labs ```bash jupyter lab @@ -48,13 +48,13 @@ jupyter lab Open: -- `lab1.ipynb` for **Extreme Quantization** -- `lab2.ipynb` for **Hardware–Software Co-Design** -- Follow `lab3.md` for **Android deployment** with `llama.cpp` +- `lab1.ipynb` for **Extreme Quantization**; +- `lab2.ipynb` for **Hardware–Software Co-Design**; and +- `lab3.md` for **Android deployment** with `llama.cpp`. --- -## 📁 Repository Structure +## Repository Structure ``` Advanced-AI-on-Arm/ @@ -69,19 +69,19 @@ Advanced-AI-on-Arm/ --- -## 📱 Android Deployment Notes +## Android Deployment Notes To complete **Lab 3**, make sure the following are installed: -- Android Studio (Hedgehog or later) -- Android NDK + ADB -- A physical Android 10+ device with ≥6GB RAM +- Android Studio (Hedgehog or later); +- Android NDK + ADB; and +- a physical Android 10+ device with ≥6GB RAM. > Windows users: use **WSL 2** with Ubuntu 22.04 for full compatibility with build tools. --- -## 🧠 Learning Outcomes +## Learning Outcomes - Understand bit-width trade-offs (accuracy vs. 
compression) - Apply QAT to recover performance in quantized models @@ -90,7 +90,7 @@ To complete **Lab 3**, make sure the following are installed: --- -## 📫 Questions? +## Questions? Open an issue or contact `oliver@grainge.me` if you encounter problems during setup or execution. diff --git a/lab1.ipynb b/lab1.ipynb index b096893..c05d756 100644 --- a/lab1.ipynb +++ b/lab1.ipynb @@ -9,35 +9,35 @@ "\n", "## Introduction\n", "\n", - "Deep neural networks are often large and computationally expensive, particularly when trained and stored in full-precision (FP32) floating-point format. While 8-bit quantization has become an industry-standard approach for improving efficiency and reducing model size, recent research has shown that pushing to even [lower bit-widths can offer improved accuracy-to-memory trade-offs](https://arxiv.org/abs/2502.02631#:~:text=The%20optimal%20bit%2Dwidth%20for,1.58%2Dbit%20offers%20superior%20results.).\n", + "Deep neural networks are often large and computationally expensive, particularly when trained and stored in full-precision (FP32) floating-point format. While 8-bit quantization has become an industry-standard approach for improving efficiency and reducing model size, recent research has shown that pushing to even [lower bit widths can offer improved accuracy-to-memory trade-offs](https://arxiv.org/abs/2502.02631#:~:text=The%20optimal%20bit%2Dwidth%20for,1.58%2Dbit%20offers%20superior%20results.).\n", "\n", "This brings us into the realm of **extreme quantization**, where weights (and sometimes activations) are quantized to 4 bits, 2 bits, or even just 1 bit (as in binary networks).\n", "\n", "The benefits can be substantial:\n", "\n", - "- A binary network that uses just **1 bit per weight**, reduces model size by **32×** compared to FP32.\n", - "- Speedups arise from using **bitwise operations** (e.g. 
XNOR + popcount) or **lookup tables (LUTs)** in place of expensive multiply accumulates.\n", + "- A binary network that uses just **1 bit per weight** reduces model size by **32×** compared to FP32.\n", + "- Speedups arise from using **bitwise operations** (e.g. XNOR + popcount) or **lookup tables (LUTs)** in place of expensive multiply-accumulates.\n", "- These operations are especially efficient on **Mobile or Edge platforms**, where memory and compute resources are constrained.\n", "\n", "However, this efficiency gain often comes at a cost: **accuracy often degrades** however special training procedures or architectural modifications can mitigate this. Moreover, software support for sub-8-bit inference remains limited in many libraries.\n", "\n", "This lab provides a hands-on exploration of these trade-offs. You will:\n", "\n", - "- Train a baseline FP32 model language model.\n", - "- Progressively quantize it to **8-bit**, **4-bit**, **2-bit**, and **1-bit** with varying **group-sizes**.\n", - "- Observe the negative impact on accuracy.\n", - "- Implement Extreme ternary/binary quantization. 
\n", - "- Optimize the extreme quantized model to achieve near FP32 accuracy with Quantized-Aware-Training (QAT).\n", + "- train a baseline FP32 model language model;\n", + "- progressively quantize it to **8-bit**, **4-bit**, **2-bit**, and **1-bit** with varying **group sizes**;\n", + "- observe the negative impact on accuracy;\n", + "- implement extreme ternary/binary quantization; and\n", + "- optimize the extreme quantized model to achieve near FP32 accuracy with quantization aware training (QAT).\n", "\n", - "Our goal is to build a strong intuition for the trade-offs involved in extreme quantization, and to motivate the use of advanced techniques like quantization-aware training (QAT) to unlock even lower bit-widths, potentially inspiring you to design your own highly compressed, efficient models.\n", + "Our goal is to build a strong intuition for the trade-offs involved in extreme quantization, and to motivate the use of advanced techniques like QAT to unlock even lower bit widths, potentially inspiring you to design your own highly compressed, efficient models.\n", "\n", "---\n", "\n", - "### Lab Objectives\n", + "### Learning Objectives\n", "\n", - "1. Demonstrate the impact of 8 → 1-bit quantization with varying group-sizes on model size, speed, and accuracy. \n", - "2. Understand and Implement Quantized-Aware-Training (QAT) to recover accuracy degredation from extreme quantization.\n", - "3. Lay the foundation for future labs on **QAT** and **custom kernel design**.\n", + "1. Demonstrate the impact of 8 → 1-bit quantization with varying group sizes on model size, speed, and accuracy \n", + "2. Understand and implement quantization aware training (QAT) to recover accuracy degredation from extreme quantization\n", + "3. 
Lay the foundation for future labs on **QAT** and **custom kernel design**\n", "\n", "By the end of this lab, you'll understand the trade-offs between compression and performance, particularly for **deployment on Arm-based edge devices**.\n", "\n", @@ -45,11 +45,11 @@ "\n", "## 1. Baseline FP32 Model Training\n", "\n", - "Let's start by training a **Autoregressive Transformer Language Model (LM)** using the popular **TinyShakespeare dataset**, a compact corpus composed of Shakespeare's text. We have already provided the training code for you. The model's task is next-token prediction, given a sequence of text, the model is trained to predict the next word or character.\n", + "Let's start by training an **autoregressive transformer language model** using the popular **TinyShakespeare dataset**, a compact corpus composed of Shakespeare's text. We have already provided the training code for you. The model's task is next-token prediction: given a sequence of text, the model is trained to predict the next word or character.\n", "\n", "After training the model, we will explore post-training quantization, a commonly used method that quantizes the neural network without requiring retraining.\n", "\n", - "> You can run the following cell to begin training the model, It is advised to use a GPU for this as it may take up to 15 minutes. Should you get out of memory errors, try reducing the batch size. " + "> You can run the following cell to begin training the model. It is advisable to use a GPU for this, as training may take up to 15 minutes. Should you get out-of-memory errors, try reducing the batch size. " ] }, { @@ -286298,19 +286298,19 @@ "\n", "While your model is training, let’s take a moment to explore the theory behind quantization through the lens of hardware efficiency so we can understand why it is an effective technique for reducing memory usage and latency.\n", "\n", - "Quantization is a powerful technique for reducing memory usage and latency. 
It directly affects the most computationally intensive component of deep learning: **matrix multiplication**. These operations dominate inference workloads and benefit significantly from reduced precision—enabling smaller data transfers and simpler arithmetic.\n", + "Quantization directly affects the most computationally intensive component of deep learning: **matrix multiplication**. These operations dominate inference workloads and benefit significantly from reduced precision, enabling smaller data transfers and simpler arithmetic.\n", "\n", "\n", "### Matrix Multiply: A Naive Implementation\n", "\n", - "To understand how quantization effections matrix multiplication, lets look at the operation at a element-wise level.\n", + "To understand how quantization affects matrix multiplication, let's look at the operation at an element-wise level.\n", "\n", "For a matrix multiplication between matrices $X \\in \\mathbb{R}^{M \\times K}$ and $W \\in \\mathbb{R}^{K \\times N}$, resulting in $Y \\in \\mathbb{R}^{M \\times N}$ the operation is defined as:\n", "\n", "$Y_{m,n} = \\sum_{k=0}^{K-1} X_{m,k} \\cdot W_{k,n}$\n", "\n", "\n", - "> Lets implement this naively in python.\n" + "> Let's implement this naively in Python.\n" ] }, { @@ -286365,12 +286365,12 @@ "- Outer loops iterate over the **M** (rows of output) and **N** (columns of output) dimensions.\n", "- The inner loop iterates over the **K** dimension and performs the core computation.\n", "\n", - "The innermost loop executes a dot-product using **multiply-accumulate (MAC)** operations, where each step multiplies a weight and activation pair and immediately adds the result to a running sum. 
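The triple loop just described can be written out in plain Python. This is an illustrative sketch (the notebook's own cell may differ in naming), useful for seeing exactly where the MACs occur:

```python
# Naive O(M*N*K) matrix multiply: Y[m][n] = sum over k of X[m][k] * W[k][n]
def matmul_naive(X, W):
    M, K = len(X), len(X[0])
    assert K == len(W), "inner dimensions must match"
    N = len(W[0])
    Y = [[0.0] * N for _ in range(M)]
    for m in range(M):                      # rows of the output
        for n in range(N):                  # columns of the output
            acc = 0.0
            for k in range(K):              # one MAC per iteration:
                acc += X[m][k] * W[k][n]    # two reads, one multiply, one add
            Y[m][n] = acc
    return Y
```

For these shapes the inner loop performs exactly M·N·K multiply-accumulates, which is precisely the work quantization aims to make cheaper.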
This operation is fundamental in neural network inference and training, as it efficiently computes weighted sums that are later passed through nonlinearities.\n", + "The innermost loop executes a dot product using **multiply-accumulate (MAC)** operations, where each step multiplies a weight and activation pair and immediately adds the result to a running sum. This operation is fundamental in neural network inference and training, as it efficiently computes weighted sums that are later passed through nonlinearities.\n", "\n", "\n", "---\n", "\n", - "#### Each MAC requires the folloing operations:\n", + "#### Each MAC requires the following operations:\n", "\n", "- **Two memory reads**: \n", " - Load element `x[m][k]` \n", @@ -286389,14 +286389,14 @@ "\n", "Two key factors limit the speed of matrix multiplication:\n", "\n", - "1. **Memory bandwidth** — how fast activations and weights can be fetched/written to memory. \n", - "2. **Compute throughput** — how fast multiply-accumulate operations can be executed (typically measured in **TOPS**, or Tera Operations Per Second).\n", + "1. **memory bandwidth**—how fast activations and weights can be fetched/written to memory; and\n", + "2. **compute throughput**—how fast multiply-accumulate operations can be executed (typically measured in **TOPS**, or tera operations per second).\n", "\n", "Hardware trends, however, consistently show that compute capabilities (**TOPS**) are improving at a significantly [faster rate than memory bandwidth](https://arxiv.org/abs/2403.14123). 
As a result, **memory bandwidth is increasingly becoming the primary bottleneck** for neural network inference.\n", "\n", "\n", "\n", - "This is clearly illustrated in the figure below: while the performance of floating-point operations (grey line) continues to scale rapidly, both **interconnect bandwidth** (blue) and **system memory bandwidth** (green) lag behind.\n", + "This is clearly illustrated in the figure below: while the performance of floating-point operations (gray line) continues to scale rapidly, both **interconnect bandwidth** (blue) and **system memory bandwidth** (green) lag behind.\n", "\n", "Consequently, **efficient memory access patterns and compression strategies** are becoming essential to reduce latency and improve throughput.\n", "\n", @@ -286434,9 +286434,9 @@ "\n", "In practice:\n", "\n", - "- In **edge AI**, it's common to use [**8-bit integer quantization**](https://arxiv.org/pdf/2106.08295) for weights and activations.\n", - "- This offers up to a **4× reduction in memory bandwidth** compared to FP32, with **negligible accuracy loss** in many cases.\n", - "- Further gains are possible using **4-bit**, **2-bit**, or **1-bit** formats.\n", + "- in **edge AI**, it's common to use [**8-bit integer quantization**](https://arxiv.org/pdf/2106.08295) for weights and activations;\n", + "- this offers up to a **4× reduction in memory bandwidth** compared to FP32, with **negligible accuracy loss** in many cases; and\n", + "- further gains are possible using **4-bit**, **2-bit**, or **1-bit** formats.\n", "\n", "Despite reduced precision, these [ultra-low-bit models can still deliver high accuracy and significant efficiency gains](https://arxiv.org/abs/2402.17764).\n", "\n", @@ -286446,32 +286446,32 @@ "\n", "Let’s start by looking at the simple but effective process of **symmetric linear quantization**, which maps floating-point values to **signed integers**, assuming that the distribution of values is **approximately zero-centered**.\n", "\n", - 
"For a bit-width $(b)$, signed integers can represent values in the range:\n", + "For a bit width $(b)$, signed integers can represent values in the range:\n", " \n", "#### $\\text{ values: } \\{x^{\\text{intb}} \\in \\mathbb{Z} : -2^{b-1} \\leq x < 2^{b-1}\\}$\n", "\n", - "This is well-suited for weight tensors and certain activation distributions (e.g. after BatchNorm), where the data naturally clusters around zero.\n", + "This is well suited for weight tensors and certain activation distributions (e.g. after BatchNorm), where the data naturally clusters around zero.\n", "\n", "---\n", "\n", "### Quantization and Dequantization Equations\n", "\n", - "In symmetric quantization, we assume the value range is symmetric around zero and use a **single scale factor** to map from the floating point paramter distribution to the integer. The quantization and dequantization equations simplify to:\n", + "In symmetric quantization, we assume the value range is symmetric around zero and use a **single scale factor** to map from the floating-point parameter distribution to the integer. The quantization and dequantization equations simplify to:\n", "\n", "### $q = \\text{round}\\left(\\frac{r}{s}\\right)$ \n", "### $r = s \\cdot q$\n", "\n", "Where:\n", "\n", - "- $ s $: **scale factor** — defines the step size between quantization levels\n", + "- $ s $: **scale factor**—defines the step size between quantization levels\n", "\n", "The scale is computed from the absolute max of the observed range:\n", "\n", "### $ s = \\frac{r_{\\text{max-abs}}}{q_{\\text{max}}} \\quad\\quad\\text{where } \\quad r_{\\text{max-abs}} = \\max(|r_{\\text{min}}|, |r_{\\text{max}}|) \\quad \\text{and} \\quad q_{\\text{max}}=2^{b-1}-1$\n", "\n", - " - $b$: bit-width — the number of bits allocated for representing each quantized value. For example, $b=8$ corresponds to an integer range of [−127, +127].\n", + " - $b$: bit width—the number of bits allocated for representing each quantized value. 
For example, $b=8$ corresponds to an integer range of [−127, +127].\n", "\n", - "This ensures the floating point range is linearly mapped to the symmetric integer range with zero always preserved exactly.\n", + "This ensures the floating-point range is linearly mapped to the symmetric integer range with zero always preserved exactly.\n", "\n", "---\n", "\n", @@ -286479,9 +286479,9 @@ "\n", "Symmetric quantization **simplifies hardware implementation** and is particularly effective when:\n", "\n", - "- The distribution is **centered around zero** (e.g., weights after training)\n", - "- Deterministic mapping to and from integer space is desirable\n", - "- Hardware **vector units (e.g., SIMD/NEON)** benefit from no offset handling\n", + "- the distribution is **centered around zero** (e.g., weights after training);\n", + "- deterministic mapping to and from integer space is desirable; and\n", + "- hardware **vector units (e.g., SIMD/NEON)** benefit from no offset handling.\n", "\n", "However, when distributions are heavily skewed (e.g., post-ReLU activations), this method can waste representational capacity compared to asymmetric quantization.\n", "\n", @@ -286588,15 +286588,15 @@ "id": "50509b40", "metadata": {}, "source": [ - "The interactive plot above demonstrates the relationship between bitwidth and quantization in neural networks. As you adjust the bitwidth slider, you can observe how the weight distribution changes and how the quantization error (measured by the Frobenius norm) increases as bitwidth decreases:\n", + "The interactive plot above demonstrates the relationship between bit width and quantization in neural networks. 
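As an aside before examining the plot: the symmetric quantize/dequantize equations above can be condensed into a small per-tensor sketch in plain Python (illustrative only; the lab's own PyTorch implementation later generalizes this to group-wise scales):

```python
def quantize_symmetric(values, bitwidth=8):
    """Symmetric quantization: q = round(r / s), with s = r_max_abs / q_max."""
    q_max = 2 ** (bitwidth - 1) - 1              # e.g. 127 for b = 8
    r_max_abs = max(abs(v) for v in values)      # max(|r_min|, |r_max|)
    scale = r_max_abs / q_max
    q = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return q, scale

def dequantize_symmetric(q, scale):
    """Dequantization: r = s * q (zero is always preserved exactly)."""
    return [scale * qi for qi in q]
```

With `bitwidth=3` (so `q_max = 3`), the values `[3.0, -1.5, 0.0]` quantize to `[3, -2, 0]` with scale `1.0`; dequantizing gives `[3.0, -2.0, 0.0]`, and the gap at `-1.5` is the quantization error discussed here.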
As you adjust the bit-width slider, you can observe how the weight distribution changes and how the quantization error (measured by the Frobenius norm) increases as bit width decreases:\n", "\n", "$ \\text{Quantization Error} = \\sqrt{\\sum_{i,j} (w_{i,j} - Q(w_{i,j}))^2} $\n", "\n", - "This degradation occurs because lower bitwidths (e.g., 2-4 bits) have fewer quantization levels available, leading to increased rounding errors.\n", + "This degradation occurs because lower bit widths (e.g., 2-4 bits) have fewer quantization levels available, leading to increased rounding errors.\n", "\n", - "## 4. Group-wise Quantization\n", + "## 4. Group-Wise Quantization\n", "\n", - "🛠️ To mitigate this accuracy loss at low bitwidths, we can use **groupwise quantization** - a technique that reduces quantization granularity by operating on smaller groups of weights individually rather than the entire tensor at once.\n", + "🛠️ To mitigate this accuracy loss at low bit widths, we can use **group-wise quantization**—a technique that reduces quantization granularity by operating on smaller groups of weights individually rather than the entire tensor at once.\n", "\n" ] }, @@ -286605,13 +286605,13 @@ "id": "cc2ea627-66dd-4f8f-9d25-eb7e6fd4ee1d", "metadata": {}, "source": [ - "Instead of applying a single global scale across the entire tensor (i.e., *per-tensor quantization*), **groupwise quantization** partitions the weight tensor into **fixed-size groups** (e.g., along the input or output dimension). Each group receives its own **independently computed scale**.\n", + "Instead of applying a single global scale across the entire tensor (i.e., *per-tensor quantization*), **group-wise quantization** partitions the weight tensor into **fixed-size groups** (e.g., along the input or output dimension). 
Each group receives its own **independently computed scale**.\n", "\n", "This allows the quantizer to better capture **local variations** in the data distribution, while keeping the quantization scheme **symmetric** (i.e., zero-point is fixed at 0).\n", "\n", "---\n", "\n", - "For a group index $g$, let the group contain a subset of values:\n", + "For a group index, $g$, let the group contain a subset of values:\n", "\n", "### $r_g = \\{ r_{g,0}, r_{g,1}, \\dots, r_{g,n-1} \\}$\n", "\n", @@ -286639,7 +286639,7 @@ "\n", "---\n", "\n", - "While this method increases quantization metadata (i.e., one scale *per-group* instead of *per-tensor*), it **significantly reduces quantization error**—particularly at low bit-widths by better modeling the **local statistics** of the data, all while preserving the efficiency benefits of **symmetric quantization** (e.g., no need to subtract/add zero-points).\n", + "While this method increases quantization metadata (i.e., one scale *per group* instead of *per tensor*), it **significantly reduces quantization error**—particularly at low bit widths—by better modeling the **local statistics** of the data, all while preserving the efficiency benefits of **symmetric quantization** (e.g., no need to subtract/add zero-points).\n", "\n", "> Let’s now implement this in code below!\n", "\n" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "id": "47092234-05bd-4df0-89a8-e450ae97fc61", "metadata": {}, "outputs": [ @@ -286714,8 +286714,8 @@ " scale = x_abs_max / q_max\n", " return scale\n", "\n", - "# This function scales the floating point distirubtion (group-wise) using the scale parameter \n", - "# and rounds of to the nearest integer to quantize to a integer\n", + "# This function scales the floating-point distribution (group-wise) using the scale parameter \n", + "# and rounds to the nearest integer to quantize to an integer\n", "def quantize_int_groupwise(x, scale, bitwidth=8, 
group_size=None):\n", " if group_size is None:\n", " group_size = x.shape[1]\n", @@ -286737,7 +286737,7 @@ "\n", "output = widgets.Output()\n", "\n", - "# Interactive plotting function with sliders for group size and bitwidth\n", + "# Interactive plotting function with sliders for group size and bit width\n", "def plot_interactive(group_size, bitwidth):\n", " with output:\n", " clear_output(wait=True)\n", @@ -286764,7 +286764,7 @@ "# Valid group sizes (divisors of 128, excluding 1)\n", "valid_group_sizes = [i for i in range(2, 129) if 128 % i == 0]\n", "\n", - "# Create sliders for group size and bitwidth\n", + "# Create sliders for group size and bit width\n", "group_size_slider = widgets.SelectionSlider(options=valid_group_sizes, value=128, description='Group Size:')\n", "bitwidth_slider = widgets.IntSlider(value=8, min=2, max=8, step=1, description='Bitwidth:')\n", "\n", @@ -286779,9 +286779,9 @@ "id": "526d376c", "metadata": {}, "source": [ - "### Exploring Bit-Width and Group Size Trade-offs\n", + "### Exploring Bit Width and Group Size Trade-offs\n", "\n", - "In the above figure, you can interactively explore the effects of **bit-width** and **group size** on quantization:\n", + "In the above figure, you can interactively explore the effects of **bit width** and **group size** on quantization:\n", "\n", "- The **left subplot** shows the original full-precision weight distribution. 
\n", "- The **right subplot** shows the quantized distribution using the selected parameters.\n", @@ -286794,18 +286794,18 @@ "\n", "---\n", "\n", - "This **finer-grained quantization** improves the approximation of the original weights—especially at **low bit-widths**, where preserving both dynamic range and resolution is critical.\n", + "This **finer-grained quantization** improves the approximation of the original weights—especially at **low bit widths**, where preserving both dynamic range and resolution is critical.\n", "\n", " However, this comes with an increased cost: more **quantization metadata** (i.e., one scale per group). This must be carefully balanced against hardware and memory constraints.\n", "\n", "---\n", "\n", - "> Let’s now plot quantization error across **bit-widths** and **group-sizes** to visualize the trade-offs. " + "> Let’s now plot quantization error across **bit widths** and **group sizes** to visualize the trade-offs. " ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "id": "44d3e63b", "metadata": {}, "outputs": [ @@ -286837,7 +286837,7 @@ "plt.grid(True, linestyle='--', alpha=0.7)\n", "plt.xlabel('Bit Width', fontsize=12)\n", "plt.ylabel('Quantization Error (Frobenius Norm)', fontsize=12)\n", - "plt.title('Symmetric Integer Quantization Error vs Bit-Width for Different Group-Sizes', fontsize=13, pad=15)\n", + "plt.title('Symmetric Integer Quantization Error vs Bit-Width for Different Group Sizes', fontsize=13, pad=15)\n", "plt.xticks(range(2, 9))\n", "plt.legend(title='Group Sizes', bbox_to_anchor=(1.05, 1), loc='upper left')\n", "\n", @@ -286850,7 +286850,7 @@ "id": "117f99fe", "metadata": {}, "source": [ - "> The plot demonstrates an exponential decrease in quantization error as the bit-width increases. Which makes sense as increasing the bit-width exponentially increases the number of representable states. 
Furthermore, smaller group-sizes consistently achieve lower quantization error, as illustrated by the different curves. This validates our earlier intuition about group size's impact on quantization accuracy.\n", + "> The plot demonstrates an exponential decrease in quantization error as the bit width increases, which makes sense: increasing the bit width exponentially increases the number of representable states. Furthermore, smaller group sizes consistently achieve lower quantization error, as illustrated by the different curves. This validates our earlier intuition about the impact of group size on quantization accuracy.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "117f99fe", "metadata": {}, "source": [ "## 5. Post-Training Quantization Experiments\n", "\n", - "We’ve now explored how linear **symmetric** quantization works and how the group size and bit-width affect quantization error. So far however, we have only applied it on an example tensor. Now let’s apply it to the parameters of your trained model and evaluate its impact on **task accuracy**, which ultimately matters most.\n", + "We’ve now explored how linear **symmetric** quantization works and how the group size and bit width affect quantization error. So far, however, we have only applied it to an example tensor. Now let’s apply it to the parameters of your trained model and evaluate its impact on **task accuracy**, which ultimately matters most.\n", "\n", - "We'll do this by implementing a custom `nn.Module` that performs static quantization on the weights using the ```quantize_int_groupwise``` function we have defined before, and dynamic quantization on the activations where the scale is computed on the fly. The layer will store quantized weights, biases (if present), and quantization parameters (`scale`) as non-trainable buffers in a `nn.Module` object. 
The object will also implement the logic for a forward pass that involves **dequantizing the weights using the scale** and **dynamically quantizing and de-quantizing** and then performing the linear operations." + "We'll do this by implementing a custom `nn.Module` that performs static quantization on the weights using the ```quantize_int_groupwise``` function we have defined before, and dynamic quantization on the activations where the scale is computed on the fly. The layer will store quantized weights, biases (if present), and quantization parameters (`scale`) as non-trainable buffers in a `nn.Module` object. The object will also implement the logic for a forward pass that involves **dequantizing the weights using the scale**, **dynamically quantizing and dequantizing the activations**, and then performing the linear operation." ] }, { @@ -286935,7 +286935,7 @@ "source": [ "### Replacing Linear Layers with Quantized Versions\n", "\n", - "Now that we have defined our quantized version of a linear layer (`QLinear`), we can quantize the entire model model by replacing all `nn.linear`layer instances with `QLinear`. In the next cell we define a recursive function to do this by searching through the model tree. We avoid quantizing the output layer, as it can be sensitive to the noise quantization introuces. \n" + "Now that we have defined our quantized version of a linear layer (`QLinear`), we can quantize the entire model by replacing all `nn.Linear` layer instances with `QLinear`. In the next cell, we define a recursive function to do this by searching through the model tree. We avoid quantizing the output layer, as it can be sensitive to the noise quantization introduces. \n" ] }, { @@ -286968,7 +286968,7 @@ "source": [ "### Evaluating Accuracy vs. 
Post-Training Quantization\n", "\n", - "Now that we have defined the functions required to quantize the model using symmetric linear group-wise integer quantization with varying bit-widths and group sizes, we can evaluate the impact of these parameters on task accuracy by measuring the loss on the test set. In the next cell, we will compute the test loss across bit-widths ranging from 2 to 9 and group sizes specified by `valid_group_sizes`." + "Now that we have defined the functions required to quantize the model using symmetric linear group-wise integer quantization with varying bit widths and group sizes, we can evaluate the impact of these parameters on task accuracy by measuring the loss on the test set. In the next cell, we will compute the test loss across bit widths ranging from 2 to 9 and group sizes specified by `valid_group_sizes`." ] }, { @@ -287059,11 +287059,11 @@ "id": "68ad6187-ea14-4ab7-9ba2-e8290f5e09ad", "metadata": {}, "source": [ - "### The Impact of Bit-Width on Quantization in Neural Networks\n", + "### The Impact of Bit Width on Quantization in Neural Networks\n", "\n", - "As shown in the figures and interactive plots above, reducing the weight bit-width below 4 bits leads to a significant degradation in model performance, evidenced by a marked increase in test loss. While using smaller group sizes, increasing the granularity of quantization can mitigate this effect to some extent, it does not fully preserve task accuracy. This observation aligns with prior findings: aggressive quantization below 4 bits introduces substantial error that finer local adaptation alone cannot overcome.\n", + "As shown in the figures and interactive plots above, reducing the weight bit width below 4 bits leads to a significant degradation in model performance, evidenced by a marked increase in test loss. While using smaller group sizes (increasing the granularity of quantization) can mitigate this effect to some extent, it does not fully preserve task accuracy. 
This observation aligns with prior findings: aggressive quantization below 4 bits introduces substantial error that finer local adaptation alone cannot overcome.\n", "\n", - "> **Post-Training Quantization below 4-bits results in significantly reduced task performance**\n", + "> **Post-training quantization below 4 bits results in significantly reduced task performance.**\n", "\n", "However, despite these limitations, recent research shows that extreme quantization, when applied carefully, can still produce competitive models. Instead of treating sub-4-bit quantization as inherently flawed, we can use **task-aware quantization techniques** that restructure the computation entirely, such as **ternary** and **binary quantization**. These methods don't just compress precision; they reformulate how the network operates.\n", "\n", @@ -287077,7 +287077,7 @@ "\n", "\\{-1, 0, 1\\}\n", "$$\n", "\n", - "This drastically simplifies the required computations: multiply-accumulate (MAC) operations reduce to a combination of sparse additions and subtractions. Since zero-valued weights require no computation, this naturally induces sparsity and leads to **reduced memory bandwidth** and **lower power usage**. These benefits are particularly valuable on edge devices.\n", + "This drastically simplifies the required computations: multiply-accumulate (MAC) operations reduce to a combination of sparse additions and subtractions. Since zero-valued weights require no computation, this naturally induces sparsity and leads to **reduced memory bandwidth** and **lower power usage**. 
These benefits are particularly valuable on edge devices.\n", "\n", "For example, consider computing a dot product between input **$x = [x_1, x_2, x_3, x_4]$** and ternary weights **$w = [1, 0, -1, 1]$**:\n", "\n", @@ -287130,7 +287130,7 @@ "\n", "This method effectively maps dense floating-point weights to a sparse ternary representation, significantly reducing memory and computation costs with minimal accuracy degradation.\n", "\n", - "> In the next cell, we’ll implement this method in PyTorch using **groupwise thresholding and scaling** for improved fidelity.\n", + "> In the next cell, we’ll implement this method in PyTorch using **group-wise thresholding and scaling** for improved fidelity.\n", "\n" ] }, @@ -287221,9 +287221,9 @@ "\n", "### The Key Question\n", "\n", - "> **How can we aggressively reduce bit-width to 1.58 bits per weight while preserving accuracy?**\n", + "> **How can we aggressively reduce bit width to 1.58 bits per weight while preserving accuracy?**\n", "\n", - "The answer: **Quantization-Aware Training (QAT)**—a technique that simulates quantization during training so the model can adapt to it.\n", + "The answer: **quantization aware training (QAT)**—a technique that simulates quantization during training so the model can adapt to it.\n", "\n", "---\n", "\n", @@ -287246,7 +287246,7 @@ "\n", "### $\\frac{d\\hat{W}}{dW} = 0 \\quad \\text{(almost everywhere)}, \\quad \\text{undefined at } \\Delta \\text{ and } -\\Delta$\n", "\n", - "This breaks Backpropagation:\n", + "This breaks backpropagation:\n", "
\n", "
\n", "$\n", @@ -287259,13 +287259,13 @@ "\n", "### Straight-Through Estimator (STE)\n", "\n", - "To work around this, we use **approximate gradients** like the **Straight-Through Estimator (STE)**:\n", + "To work around this, we use **approximate gradients** like the **STE**:\n", "\n", "$\n", "\\frac{d\\hat{W}}{dW} \\approx 1\n", "$\n", "\n", - "STE bypasses the quantization op in the backward pass, letting gradients flow through as if quantization didn't occur. While crude, it works remarkably well in practice.\n", + "The STE bypasses the quantization op in the backward pass, letting gradients flow through as if quantization didn't occur. While crude, it works remarkably well in practice.\n", "\n", "
\n", "\n", @@ -287279,29 +287279,29 @@ "y = W x\n", "$\n", "\n", - "During Ternary QAT, we **simulate quantization during the forward pass**:\n", + "During ternary QAT, we **simulate quantization during the forward pass**:\n", "\n", "$\n", "\\hat{y} = Q^T(W) \\cdot x\n", "$\n", "\n", - "But during the backward pass, we apply STE on the quantization function $Q^T$ and backpropagate to the weights as if they were never quantized:\n", + "But during the backward pass, we apply the STE on the quantization function $Q^T$ and backpropagate to the weights as if they were never quantized:\n", "\n", "$\n", " \\frac{\\partial \\hat{y}}{\\partial Q^T(W)} \\approx \\frac{\\partial \\hat{y}}{\\partial W}\n", "$\n", "\n", - "Then once, training is complete, we can drop the full precision weights $W$ are replace them with the quantized weights $\\hat{W}$ = Q^T(W) and use the quantized weights for inference.\n", + "Then, once training is complete, we can drop the full-precision weights $W$, replace them with the quantized weights $\\hat{W} = Q^T(W)$, and use the quantized weights for inference.\n", "\n", "---\n", "\n", "### Outcome\n", "\n", - "The outcome of QAT is a model that **learns to be robust to quantization**, maintaining strong accuracy even at low bit-widths (4-bit, ternary, binary).\n", + "The outcome of QAT is a model that **learns to be robust to quantization**, maintaining strong accuracy even at low bit widths (4-bit, ternary, binary).\n", "\n", "> 🔜 So now let's give it a go with the ternary example, using just 1.58 bits per weight! \n", "\n", - "To implement it now define a custom **autograd function** to apply ternary quantization to weights during the forward pass, while using the **Straight-Through Estimator (STE)** in the backward pass. This is different, to the earlier post-training quantization process where we only had to implement the quantized forward pass. 
Bear in mind, that in the rest of the lab we will implement simulated quantization, so we can perform full Quantized-Aware-Training. Once QAT is complete, the simulation can be droppped, and specialized low-biwidth kernels can be utilized for improved memory and latency.\n", + "To implement this, we now define a custom **autograd function** to apply ternary quantization to the weights during the forward pass, while using the **STE** in the backward pass. This differs from the earlier post-training quantization process, where we only had to implement the quantized forward pass. Bear in mind that in the rest of the lab we will implement simulated quantization, so we can perform full QAT. Once QAT is complete, the simulation can be dropped, and specialized low-bit-width kernels can be used for improved memory use and latency.\n", "\n", "> The code below implements the forward and backward passes of ternary group-wise weight quantization and per-tensor int8 activation quantization. \n" ] }, { @@ -287411,7 +287411,7 @@ "source": [ "### Integrating Ternary Quantization into a Linear Layer\n", "\n", - "Next, we wrap the ternary weight and int8 activation quantization functionsfunction inside a custom `TernaryLinear` layer. This layer behaves like a standard `nn.Linear`, but quantizes its weights to the ternary set {−1, 0, 1} during the forward pass, and implements the correct STE estimation for the backward pass.\n", + "Next, we wrap the ternary weight and int8 activation quantization functions inside a custom `TernaryLinear` layer. 
This layer behaves like a standard `nn.Linear` layer, but quantizes its weights to the ternary set {−1, 0, 1} during the forward pass, and implements the corresponding STE approximation for the backward pass.\n", "\n", "**This makes it easy to plug into any PyTorch model and optimize, as if it were a completely differentiable network!**" ] }, { @@ -287456,7 +287456,7 @@ "id": "302e53f1-9c6a-4b51-b354-b1709f9336a0", "metadata": {}, "source": [ - "### Recursively Replacing Layers with TernaryLinear\n", + "### Recursively Replacing Layers with `TernaryLinear`\n", "\n", "This function searches through the model and replaces all `nn.Linear` layers, except those specified in `skip_layers`, with their ternary quantized equivalents.\n", "\n", @@ -287494,7 +287494,7 @@ "source": [ "### Verifying Ternary Layer Replacement\n", "\n", - "Let’s load a full precision model and apply ternary quantization to its internal linear layers. After replacement, we print the model to confirm that all expected layers have been converted. You should see instances of `TernaryLinear` in the below cell's output. \n" + "Let’s load a full-precision model and apply ternary quantization to its internal linear layers. After replacement, we print the model to confirm that all expected layers have been converted. You should see instances of `TernaryLinear` in the output of the cell below. \n" ] }, { @@ -287569,7 +287569,7 @@ "source": [ "### Time to Train!\n", "\n", - "With ternary quantization applied, we’re ready to fine-tune the model using quantization-aware training. Let’s see how well the model performs under aggressive compression!\n" + "With ternary quantization applied, we’re ready to fine-tune the model using QAT. Let’s see how well the model performs under aggressive compression!\n" ] }, { @@ -573878,13 +573878,13 @@ "source": [ "## 7. Binary Quantization & QAT\n", "\n", - "Lets now take this a step further with Binary quantization. 
Binary Quantization takes compression to the extreme: each weight is constrained to one of two values,\n", + "Let's now take this a step further with binary quantization. Binary quantization takes compression to the extreme: each weight is constrained to one of two values,\n", "\n", "$$\n", "\\hat{W}_{i,j} \\in \\{-1, +1\\},\n", "$$\n", "\n", - "allowing **1-bit representation per weight** achieve a **32× compression** over 32-bit floats. On hardware, this enables **fast, energy-efficient inference** via **bitwise operations** (e.g., XNOR + popcount) instead of multiply-accumulate (MAC).\n", + "allowing a **1-bit representation per weight** to achieve a **32× compression** over 32-bit floats. On hardware, this enables **fast, energy-efficient inference** via **bitwise operations** (e.g., XNOR + popcount) instead of MAC operations.\n", "\n", "> Binary networks are thus ultra-compact and tailored for digital efficiency.\n", "\n", @@ -573908,7 +573908,7 @@ "\\hat{W}_{\\text{scaled}} = \\alpha \\cdot \\text{sign}(W), \\quad \\alpha = \\frac{1}{|W|} \\sum_{i,j} |W_{i,j}|\n", "$$\n", "\n", - "This scaling preserves coarse magnitude information without increasing bitwidth. Let's implement it below\n" + "This scaling preserves coarse magnitude information without increasing bit width. Let's implement it below.\n" ] }, { @@ -573959,7 +573959,7 @@ "\\frac{d \\hat{W}}{dW} = 0\n", "$$\n", "\n", - "This causes **vanishing gradients** during backpropagation. To enable training, we again use the **Straight-Through Estimator (STE)**:\n", + "This causes **vanishing gradients** during backpropagation. To enable training, we again use the **STE**:\n", "\n", "$$\n", "\\frac{d \\hat{W}}{dW} \\approx 1\n", "$$\n", "\n", @@ -573977,7 +573977,7 @@ "\\hat{W} = \\alpha \\cdot \\text{sign}(W),\n", "$$\n", "\n", - "while the **backward pass** uses STE to propagate gradients through the full-precision weights. At **inference**, only `sign(W)` and the scalar $\alpha$ are retained. 
Let's add the STE to the backward pass of the function by running the cell below. \n" + "while the **backward pass** uses the STE to propagate gradients through the full-precision weights. At **inference**, only `sign(W)` and the scalar $\alpha$ are retained. Let's add the STE to the backward pass of the function by running the cell below. \n" ] }, { @@ -574013,9 +574013,9 @@ "source": [ "---\n", "\n", - "### Quantization-Aware Training (QAT)\n", + "### Quantization Aware Training (QAT)\n", "\n", - "To retain accuracy with binary weights, we use **Quantization-Aware Training (QAT)**, which simulates quantization during training so that the model learns to be robust to low-precision constraints.\n", + "To retain accuracy with binary weights, we use **QAT**, which simulates quantization during training so that the model learns to be robust to low-precision constraints.\n", "\n", "Because binary quantization is an extreme form of compression, it's important to apply **group-wise quantization**, where each group of weights (e.g., 64 or 128 values) shares a single scaling factor:\n", "\n", @@ -574025,7 +574025,7 @@ "\n", "This preserves coarse magnitude information within each group while maintaining 1-bit weights.\n", "\n", - "During backpropagation, gradients are propagated through the quantization step using the **Straight-Through Estimator (STE)**:\n", + "During backpropagation, gradients are propagated through the quantization step using the **STE**:\n", "\n", "$$\n", "\\frac{\\partial \\mathcal{L}}{\\partial \\hat{W}} \\approx \\frac{\\partial \\mathcal{L}}{\\partial W}\n", "$$\n", @@ -574085,9 +574085,9 @@ "\n", "> Let's now integrate this group-wise binary quantization function into a custom `BinaryLinear` layer.\n", "\n", - "### BinaryLinear Layer\n", + "### `BinaryLinear` Layer\n", "\n", - "Below, we utilize these per-group binary quantization function into a linear layer class object, so that we can introduce them into our pytorch models. 
\n", + "Below, we wrap this per-group binary quantization function in a linear layer class so that we can introduce it into our PyTorch models. \n", "\n", "---" ] }, { @@ -574132,7 +574132,7 @@ "source": [ "### Recursively Replacing Layers\n", "\n", - "To apply quantization to the model inplace, we replace each `nn.Linear` layer with our custom `BinaryLinear` version. This can be done recursively, skipping any layers we want to preserve (e.g., output heads).\n", + "To apply quantization to the model in-place, we replace each `nn.Linear` layer with our custom `BinaryLinear` version. This can be done recursively, skipping any layers we want to preserve (e.g., output heads).\n", "\n", "---" ] }, { @@ -917445,9 +917445,9 @@ "source": [ "### Testing and Visualization\n", "\n", - "After training, we evaluate the quantized model on the test set and compare performance to models quantized at higher bit-widths (e.g., 2-bit, 4-bit).\n", + "After training, we evaluate the quantized model on the test set and compare performance to models quantized at higher bit widths (e.g., 2-bit, 4-bit).\n", "\n", - "We can visualize the trade-off between **bitwidth** and **model loss/accuracy**, highlighting how the binary model (1-bit) performs relative to the ternary (1.58-bit) and standard quantized models.\n", + "We can visualize the trade-off between **bit width** and **model loss/accuracy**, highlighting how the binary model (1-bit) performs relative to the ternary (1.58-bit) and standard quantized models.\n", "\n", "---" ] }, { @@ -917496,7 +917496,7 @@ "source": [ "### Outcome\n", "\n", - "With **just 1-bit per weight**, the model is able to maintain **surprisingly competitive performance**, especially when fine-tuned using QAT. 
While binary quantization removes all magnitude information, with STE-based training the model can adapt to this harsh constraint.\n", + "With **just 1-bit per weight**, the model is able to maintain **surprisingly competitive performance**, especially when fine-tuned using QAT. While binary quantization removes all magnitude information, with STE-based training the model can adapt to this harsh constraint.\n", "\n", "This represents one of the most memory- and energy-efficient forms of quantization available today, particularly suited for **edge devices, microcontrollers, and custom ASICs**.\n", "\n", @@ -917557,7 +917557,7 @@ "source": [ "### Summary\n", "\n", - "In this lab, you explored how reducing model precision to very low bit-widths, such as ternary and binary, can significantly shrink memory usage and improve computational efficiency. While extreme quantization can hurt accuracy, techniques like Quantization-Aware Training help recover much of the lost performance. These findings are essential for deploying models on limited hardware and form a solid foundation for future labs focused on advanced optimization techniques." + "In this lab, you explored how reducing model precision to very low bit widths, such as ternary and binary, can significantly shrink memory usage and improve computational efficiency. While extreme quantization can hurt accuracy, techniques like quantization aware training (QAT) help recover much of the lost performance. These findings are essential for deploying models on limited hardware and form a solid foundation for future labs focused on advanced optimization techniques." 
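The forward/backward recipe the summary describes (ternary quantization in the forward pass, STE in the backward pass) can be sketched as a minimal PyTorch autograd function. This is an illustrative per-tensor sketch only: the threshold of 0.75 · mean|W| and the single scale α are assumed choices, whereas the lab itself uses group-wise thresholding and scaling.

```python
import torch

class TernaryQuantSTE(torch.autograd.Function):
    """Ternary weight quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, w):
        # Weights with magnitude below delta snap to zero (inducing sparsity).
        # The 0.75 * mean|W| threshold is an illustrative assumption.
        delta = 0.75 * w.abs().mean()
        ternary = torch.sign(w) * (w.abs() > delta).float()  # values in {-1, 0, 1}
        # A single scale alpha restores coarse magnitude over non-zero weights.
        nonzero = ternary.abs()
        alpha = (w.abs() * nonzero).sum() / nonzero.sum().clamp(min=1)
        return alpha * ternary

    @staticmethod
    def backward(ctx, grad_out):
        # STE: pretend d(quantize)/dW = 1, so gradients reach the FP32 weights.
        return grad_out

# STE check: gradients flow through the quantizer untouched.
w = torch.randn(4, 4, requires_grad=True)
TernaryQuantSTE.apply(w).sum().backward()
```

A `TernaryLinear` layer would call `TernaryQuantSTE.apply(self.weight)` inside `forward`, which is why the wrapped model can be trained as if it were fully differentiable.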
] } ], diff --git a/lab2.ipynb b/lab2.ipynb index 3b6b2de..63ac4f1 100644 --- a/lab2.ipynb +++ b/lab2.ipynb @@ -15,17 +15,17 @@ "\n", "Specifically, you will:\n", "\n", - "- **Wrap every `nn.Linear` in a quantized integer-only `QLinear` module** \n", - "- **Post-quantize both weights and activations layerwise**, selecting precision from **8 → 2 bits** for each individual layer \n", - "- **Measure performance of each quantization configuration** using **Cross Entropy** and **memory consumption** \n", - "- **Perform automated layerwise bit-width search** to optimize a hardware-aware objective function\n", + "- **wrap every `nn.Linear` in a quantized integer-only `QLinear` module**;\n", + "- **post-quantize both weights and activations layerwise**, selecting precision from **8 → 2 bits** for each individual layer; \n", + "- **measure performance of each quantization configuration** using **cross-entropy loss** and **memory consumption**; and\n", + "- **perform automated layerwise bit-width search** to optimize a hardware-aware objective function.\n", "\n", "> **Why co-design matters:** \n", - "> A model that looks efficient in software may still bottleneck on real hardware due to memory access patterns, compute throughput, or unsupported bit-widths. Hardware–software co-design ensures the model structure aligns with hardware constraints, enabling deployment that is both **accurate** and **efficient** on edge devices.\n", + "> A model that looks efficient in software may still bottleneck on real hardware due to memory access patterns, compute throughput, or unsupported bit widths. Hardware–software co-design ensures the model structure aligns with hardware constraints, enabling deployment that is both **accurate** and **efficient** on edge devices.\n", "\n", "---\n", "\n", - "### Lab Objectives\n", + "### Learning Objectives\n", "\n", "1. **Implement `QLinear`**, a simulated integer GEMM layer with scale-offset dequantization, compatible with PyTorch CPU kernels \n", "2. 
**Post-quantize a pretrained model checkpoint** using per-layer {8, 4, 2}-bit precision, and export metadata for downstream hardware cost modeling \n", @@ -73,7 +73,7 @@ "source": [ "## 1. Building `QLinear`\n", "\n", - "The first step is to implement a quantized linear layer that supports configurable bit-widths. This will allow us to experiment with different per-layer quantization strategies and conduct a hardware–software co-design process to identify an optimal configuration for deployment.\n", + "The first step is to implement a quantized linear layer that supports configurable bit widths. This will allow us to experiment with different per-layer quantization strategies and conduct a hardware–software co-design process to identify an optimal configuration for deployment.\n", "\n", "We’ll define a `QLinear` layer that performs **symmetric uniform quantization**, mapping floating-point tensors to signed integers in the range \n", "$ - (2^{b-1} - 1), \\dots, + (2^{b-1} - 1) $, using a single scale factor **s**.\n", @@ -176,7 +176,7 @@ "\n", "Run the cell below to quantize a random matrix at 2, 4, and 8-bit precision, \n", "and print the reconstruction error. \n", - "> As bit-width increases, the reconstruction error should decrease. Let's run the cell below just to check that. \n" + "> As bit width increases, the reconstruction error should decrease. Let's run the cell below just to check that. \n" ] }, { @@ -212,13 +212,13 @@ "source": [ "## 2. 
Swapping Layers In-Place\n", "\n", - "To enable **per-layer mixed-precision quantization**, where each layer can be quantized with an arbitrary bit-width, we define a function that traverses the model and replaces each `nn.Linear` layer with a quantized equivalent.\n", + "To enable **per-layer mixed-precision quantization**, where each layer can be quantized with an arbitrary bit width, we define a function that traverses the model and replaces each `nn.Linear` layer with a quantized equivalent.\n", "\n", - "The key mechanism relies on a dictionary called **`qconfig`** which maps the fully qualified names of `nn.Linear` modules to their desired bit-widths. This configuration is then used to locate and swap the corresponding layers in-place with quantized versions (`QLinear`) that store the quantized weights.\n", + "The key mechanism relies on a dictionary called **`qconfig`** which maps the fully qualified names of `nn.Linear` modules to their desired bit widths. This configuration is then used to locate and swap the corresponding layers in-place with quantized versions (`QLinear`) that store the quantized weights.\n", "\n", "The functions below implement this mechanism:\n", - "- `quantize_linear`: replaces a single `nn.Linear` with a quantized version given a particular weight bit-width.\n", - "- `quantize_model`: applies `quantize_linear` to each layer in the model, quantizing with the per-layer precision specified in `qconfig`.\n", + "- `quantize_linear`: replaces a single `nn.Linear` with a quantized version given a particular weight bit width;\n", + "- `quantize_model`: applies `quantize_linear` to each layer in the model, quantizing with the per-layer precision specified in `qconfig`; and\n", "- `default_qconfig`: generates a default `qconfig` dictionary in which every layer has 8-bit precision.\n", "\n", "The `qconfig` dictionary looks like this:\n", @@ -282,7 +282,7 @@ "\\text{Relative MSE} = \\frac{\\mathbb{E}[(W - \\hat{W})^2]}{\\mathbb{E}[W^2]}\n", 
"$\n", "\n", - "Each bar represents a `nn.Linear` layer in the transformer blocks, quantized to a fixed bit-width (e.g., 4-bit). \n", + "Each bar represents a `nn.Linear` layer in the transformer blocks, quantized to a fixed bit width (e.g., 4-bit). \n", "\n", "- **Blue bars** indicate self-attention projection layers. \n", "- **Orange bars** represent feedforward layers.\n", "\n", @@ -379,7 +379,7 @@ "source": [ "## 3. Accuracy Metric: Validation Loss on a Single Batch\n", "\n", - "As we transition to **software-hardware co-design**, our goal is to explore the space of per-layer bitwidth configurations under a joint cost function that balances **memory footprint** and **task accuracy**.\n", + "As we transition to **hardware–software co-design**, our goal is to explore the space of per-layer bit-width configurations under a joint cost function that balances **memory footprint** and **task accuracy**.\n", "\n", "To evaluate the **accuracy impact** of a given quantization configuration within the cost function, we measure the model's **cross-entropy loss** on a single held-out validation batch. This provides a fast and task-relevant proxy for accuracy degradation due to quantization, enabling efficient evaluation during search.\n", "\n", @@ -399,7 +399,7 @@ "\n", "> **Note**: \n", "> This metric captures the **end-to-end functional effect** of quantization on the model’s output, reflecting how quantization impacts actual task performance. 
\n", - "> In contrast to weight-level metrics (like relative MSE), which assess error in isolation, this loss provides a **task-specific measure of accuracy degradation**, making it more suitable for optimization in a software-hardware co-design loop.\n", + "> In contrast to weight-level metrics (like relative MSE), which assess error in isolation, this loss provides a **task-specific measure of accuracy degradation**, making it more suitable for optimization in a hardware–software co-design loop.\n", "\n" ] }, @@ -431,18 +431,18 @@ "source": [ "## 4. Memory Metric: Static Model Size\n", "\n", - "In our software-hardware co-design cost function, we also need to quantify the **hardware cost** of any given quantization configuration. \n", + "In our hardware–software co-design cost function, we also need to quantify the **hardware cost** of any given quantization configuration. \n", "To do this, we compute the model’s **static memory footprint** in **bytes**, which reflects the total size of the model's parameters after quantization.\n", "\n", - "This metric provides a direct estimate of how much memory the model will occupy on disk or RAM, and depends on the per-layer bit-widths specified in the `qconfig`.\n", + "This metric provides a direct estimate of how much memory the model will occupy on disk or RAM, and depends on the per-layer bit widths specified in the `qconfig`.\n", "\n", "### What gets counted?\n", "\n", - "- **Quantized weights** (e.g., `QLinear.qweight`) are stored using the **bit-width defined in `qconfig`**\n", - "- **Biases, embeddings, and normalization layers** are kept in **full precision**: 32 bits per parameter\n", - "- **Other unquantized parameters** (e.g., LayerNorm scales or modules not explicitly quantized) are also assumed to be **32-bit floats**\n", + "- **Quantized weights** (e.g., `QLinear.qweight`) are stored using the **bit width defined in `qconfig`**.\n", + "- **Biases, embeddings, and normalization layers** are kept in **full 
precision**: 32 bits per parameter.\n", + "- **Other unquantized parameters** (e.g., LayerNorm scales or modules not explicitly quantized) are also assumed to be **32-bit floats**.\n", "\n", - "For a quantized linear layer with weight shape $[m, n]$ and bit-width $b$, the contribution to memory is:\n", + "For a quantized linear layer with weight shape $[m, n]$ and bit width $b$, the contribution to memory is:\n", "\n", "$\n", "\\text{Size}_{\\text{qweight}} = m \\times n \\times b \\text{ bits}\n", @@ -495,10 +495,10 @@ "source": [ "## 5. HyperOpt Objective: Balancing Accuracy and Efficiency\n", "\n", - "With well-defined metrics for **accuracy** and **memory footprint**, we can now construct a **scalarized objective function** to guide the search for optimal per-layer bit-width configurations. Although this lab adopts a specific formulation combining validation loss and model size, the framework is flexible — you may substitute alternative metrics (e.g., latency, energy, throughput) and compose a cost function that aligns with your design goals.\n", + "With well-defined metrics for **accuracy** and **memory footprint**, we can now construct a **scalarized objective function** to guide the search for optimal per-layer bit width configurations. 
Although this lab adopts a specific formulation combining validation loss and model size, the framework is flexible—you may substitute alternative metrics (e.g., latency, energy, throughput) and compose a cost function that aligns with your design goals.\n", "\n", "\n", - "This objective captures the fundamental trade-off in software-hardware co-design: achieving **high predictive performance** while maintaining **compact model size**.\n", + "This objective captures the fundamental trade-off in hardware–software co-design: achieving **high predictive performance** while maintaining **compact model size**.\n", "\n", "We combine the two criteria, validation loss and memory size, into a single scalar objective:\n", "\n", @@ -557,12 +557,12 @@ "\n", "This process requires two components, which are defined inside the `hyperopt_search` function below:\n", "\n", - "- A **search space**: defines which bit-widths can be assigned to each quantizable layer\n", - "- A **search driver**: proposes candidate configurations and evaluates them using the scalarized objective\n", + "- a **search space**: defines which bit widths can be assigned to each quantizable layer; and\n", + "- a **search driver**: proposes candidate configurations and evaluates them using the scalarized objective.\n", "\n", "### Search Space\n", "\n", - "We restrict quantization to a discrete set of bit-widths, typically `[3, 4, 5, 6, 7, 8]`. \n", + "We restrict quantization to a discrete set of bit widths, typically `[3, 4, 5, 6, 7, 8]`. 
\n", "Each `nn.Linear` layer in the model gets its own bit-width choice, so for a model with $L$ quantizable layers, the total search space contains $6^L$ configurations, making exhaustive enumeration intractable for non-trivial models.\n", "\n", "To define the space using `hyperopt`:\n", @@ -572,9 +572,7 @@ " name: hp.choice(name, [3, 4, 5, 6, 7, 8])\n", " for name in linear_layer_names(model)\n", "}\n", - "\n", "```\n", - "\n", "### Search Driver\n", "\n", "We use `hyperopt.fmin()` to minimize the total loss:\n", @@ -584,15 +582,15 @@ "$\n", "\n", "The driver works iteratively:\n", - "1. Propose a candidate `qconfig` from the search space \n", - "2. Apply quantization according to that configuration \n", + "1. Propose a candidate `qconfig` from the search space. \n", + "2. Apply quantization according to that configuration. \n", "3. Evaluate:\n", - " - **Accuracy** via cross-entropy loss on a validation batch \n", - " - **Size** via the static memory estimate of the quantized model \n", + " - **accuracy** via cross-entropy loss on a validation batch; and \n", + " - **size** via the static memory estimate of the quantized model. \n", "\n", "Over many iterations, the optimization converges toward configurations that optimally trade off model size and accuracy, according to the weight $\\alpha$.\n", "\n", - "By default, we use the **Tree-structured Parzen Estimator** (`tpe.suggest`) for guided exploration, but other strategies (e.g., bayesian) can be substituted depending on performance and exploration needs.\n", + "By default, we use the **Tree-Structured Parzen Estimator** (`tpe.suggest`) for guided exploration, but other strategies (e.g., Bayesian) can be substituted depending on performance and exploration needs.\n", "\n", "> **Tip for faster search**: \n", "> Simplify the bit-width set (e.g., restrict to `[4, 6, 8]`) to reduce the search space or reduce the number of trials.\n", @@ -621,7 +619,7 @@ "source": [ "## 7. 
Main Entry Point: Running the Full Quantization Search\n", "\n", - "This function below `optimize_qconfig` orchestrates the **end-to-end quantization search pipeline**, tying together model loading, configuration setup, and the optimization loop.\n", + "The function below, `optimize_qconfig`, orchestrates the **end-to-end quantization search pipeline**, tying together model loading, configuration setup, and the optimization loop.\n", "\n", "It performs the following steps:\n", "\n", @@ -716,16 +714,16 @@ "source": [ "## 8. Visualizing the Quantization Strategy\n", "\n", - "Once the quantization search completes, it’s insightful to examine **how bit-widths were assigned across the model architecture**. This helps identify which components the optimizer considered **sensitive** to quantization and which were more **robust to compression**.\n", + "Once the quantization search completes, it’s insightful to examine **how bit widths were assigned across the model architecture**. This helps identify which components the optimizer considered **sensitive** to quantization and which were more **robust to compression**.\n", "\n", - "### Per-block Bit-width Allocation\n", + "### Per-Block Bit-Width Allocation\n", "\n", - "The visualization below summarizes the selected bit-widths by transformer block, categorized into:\n", + "The visualization below summarizes the selected bit widths by transformer block, categorized into:\n", "\n", - "- **Feed-forward layers**: `linear1`, `linear2` \n", - "- **Self-attention layers**: `qkv_proj`, `out_proj`\n", + "- **feed-forward layers**: `linear1`, `linear2`; and \n", + "- **self-attention layers**: `qkv_proj`, `out_proj`.\n", "\n", - "For each transformer block, we display the bit-widths assigned to its constituent layers **side by side**, enabling pattern recognition — e.g., whether early layers retain higher precision, or certain layer types are consistently quantized more aggressively.\n", + "For each transformer block, we display the 
bit widths assigned to its constituent layers **side by side**, enabling pattern recognition—e.g., whether early layers retain higher precision, or certain layer types are consistently quantized more aggressively.\n", "\n", "Let’s begin by visualizing the bit-width configuration produced by **your own quantization search**:\n" ] @@ -835,8 +833,8 @@ "\n", "To get a sense of what longer or more thorough searches might yield, we can also visualize a **previously optimized configuration**, obtained via:\n", "\n", - "- **2000 trials**\n", - "- A strong accuracy preference: $\\alpha = 0.2$\n", + "- **2000 trials**; and\n", + "- a strong accuracy preference: $\\alpha = 0.2$.\n", "\n", "This provides a useful baseline or reference for understanding how the search process evolves with more iterations and different objective weights.\n", "\n", @@ -885,28 +883,28 @@ "id": "ad775d38", "metadata": {}, "source": [ - "## 9. Precision vs. Accuracy Trade-off\n", + "## 9. Precision vs. Accuracy Trade-Off\n", "\n", - "To assess the benefits of **mixed-precision quantization** in the context of software-hardware co-design, we compare it against simpler **uniform-precision baselines**, where all layers share the same bit-width.\n", + "To assess the benefits of **mixed-precision quantization** in the context of hardware–software co-design, we compare it against simpler **uniform-precision baselines**, where all layers share the same bit width.\n", "\n", "### Experimental Setup\n", "\n", - "We evaluate the model under a range of uniform bit-widths (from 8-bit to 2-bit) and record:\n", + "We evaluate the model under a range of uniform bit widths (from 8-bit to 2-bit) and record:\n", "\n", - "- **Validation loss**: accuracy after quantization \n", - "- **Model size**: static memory footprint (in MB)\n", + "- **validation loss**: accuracy after quantization; and\n", + "- **model size**: static memory footprint (in MB).\n", "\n", "We then compare these baselines to the **mixed-precision 
configuration** found by the search algorithm.\n", "\n", "This analysis addresses a key question: \n", "**Can adaptive bit-width allocation yield better accuracy at a comparable or lower memory cost?**\n", "\n", - "### Plotting the Trade-off\n", + "### Plotting the Trade-Off\n", "\n", "The plot below illustrates the trade-off between model size and validation loss:\n", "\n", - "- The **blue curve** traces uniform-precision results (each point labeled with its bit-width) \n", - "- The **grey diamond** highlights the mixed-precision result\n", + "- The **blue curve** traces uniform-precision results (each point labeled with its bit width). \n", + "- The **grey diamond** highlights the mixed-precision result.\n", "\n", "If the grey diamond appears **below and to the left** of the curve, it demonstrates that the mixed strategy achieves a **superior accuracy–efficiency balance** compared to any fixed-precision alternative.\n" ] @@ -1035,13 +1033,13 @@ "id": "536639d5", "metadata": {}, "source": [ - "You should see in the graph below, taht the Mixed-Precision quantization solution falls, below the uniform precision line plot. It hence, provides an improved memory-accuracy trade-of thanks to the help of software-hardware co-design. \n", + "You should see in the graph below that the mixed-precision quantization solution falls below the uniform-precision line plot. Hence, it provides an improved memory–accuracy trade-off thanks to hardware–software co-design. \n", "\n", "## Conclusion\n", "\n", "In this lab, we explored **post-training quantization** as a hardware–software co-design tool for optimizing transformer models. 
By implementing an integer-only `QLinear` layer and automating per-layer bit-width assignment using `hyperopt`, we demonstrated how quantization directly impacts both **model size** and **predictive performance**.\n", "\n", - "Our experiments showed that selectively reducing bit-widths (e.g., to 2–4 bits) yields significant memory savings with modest accuracy trade-offs. The scalarized loss function allowed us to balance these trade-offs, adapting the quantization strategy to hardware constraints.\n", + "Our experiments showed that selectively reducing bit widths (e.g., to 2–4 bits) yields significant memory savings with modest accuracy trade-offs. The scalarized loss function allowed us to balance these trade-offs, adapting the quantization strategy to hardware constraints.\n", "\n", "This lab highlights a key principle in efficient ML deployment: *quantization is not a post-hoc hack, but a tunable design dimension*. Co-designing quantization levels per layer, based on actual hardware cost and accuracy impact, enables principled, deployable models for edge and embedded applications." ] diff --git a/lab3.md b/lab3.md index b670d75..285863b 100644 --- a/lab3.md +++ b/lab3.md @@ -1,15 +1,15 @@ -# Lab 3: **Running & Quantizing Llama Models on Android** +# Lab 3: **Running and Quantizing Llama Models on Android** -This lab provides a hands-on walkthrough of how to run and optimize compact large language models (LLMs) directly on Android devices using the `llama.cpp` framework. You'll explore the full workflow, including downloading, converting, quantizing, deploying, and benchmarking LLaMA-style models. By the end, you'll have a working offline mobile LLM capable of running locally without server dependencies in an on-device android application. +This lab provides a hands-on walkthrough of how to run and optimize compact large language models (LLMs) directly on Android devices using the `llama.cpp` framework. 
You'll explore the full workflow, including downloading, converting, quantizing, deploying, and benchmarking Llama-style models. By the end, you'll have a working offline mobile LLM capable of running locally without server dependencies in an on-device Android application. -### Learning Goals +### Learning Outcomes - Understand the toolchain for working with `llama.cpp` - Learn how model quantization reduces memory and compute requirements -- Modify an Android app to load and run local quantized models on device -- Benchmark and compare the runtime performance of different quantized formats on an android device +- Modify an Android app to load and run local quantized models on-device +- Benchmark and compare the runtime performance of different quantized formats on an Android device -> ⚙️ Why it matters: Running LLMs natively on mobile offers benefits like reduced latency, improved privacy, and offline capability—critical for edge AI applications. +> **Why it matters:** Running LLMs natively on mobile offers benefits like reduced latency, improved privacy, and offline capability—critical for edge AI applications. --- @@ -28,17 +28,17 @@ This lab provides a hands-on walkthrough of how to run and optimize compact larg --- -## Step 1: Install Android Studio & ADB +## Step 1: Install Android Studio and ADB Android Studio includes the tools needed to build and debug Android apps: 1. Download and install Android Studio **on your computer**. 2. During setup, select components: **SDK**, **SDK Platform Tools**, and **NDK**. 3. In settings, verify: - - SDK path: `~/Android/Sdk/` + - SDK path: `~/Android/Sdk/` - NDK ≥ version 25 - - Platform-tools ≥ version 34 -4. Add `~/Android/Sdk/platform-tools` to your shell `PATH`. + - Platform-tools ≥ version 34. +4. Add `~/Android/Sdk/platform-tools` to your shell `PATH`. 
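Step 4 can be done in one line. This is a sketch for bash on Linux, assuming the default SDK install location `~/Android/Sdk`; adjust the path if your SDK lives elsewhere:

```bash
# Put adb and the other platform-tools on PATH for the current shell...
export PATH="$PATH:$HOME/Android/Sdk/platform-tools"

# ...and persist it for future shells
echo 'export PATH="$PATH:$HOME/Android/Sdk/platform-tools"' >> ~/.bashrc
```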
Check ADB installation **on your computer**: @@ -50,7 +50,7 @@ adb version ## Step 2: Authenticate with Hugging Face -We use Hugging Face to download a pretrained LLaMA-style model. +We use Hugging Face to download a pretrained Llama-style model. **On your computer**, run: @@ -60,13 +60,13 @@ huggingface-cli login ``` 1. Log in at [https://huggingface.co](https://huggingface.co). -2. Go to **Settings → Access Tokens**. +2. Go to "Settings" → "Access Tokens". 3. Create a token with **Read** access. 4. Paste the token in your terminal. --- -## Step 3: Clone & Build `llama.cpp` +## Step 3: Clone and Build `llama.cpp` This builds the tools used to convert and quantize models: @@ -78,7 +78,7 @@ cmake .. -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TOOLS=ON make -> 🧪 **Why:** This creates binaries like `llama-quantize` and `llama-run`, which are used to transform and test models. +> **Why:** This creates binaries like `llama-quantize` and `llama-run`, which are used to transform and test models. --- @@ -122,14 +122,14 @@ python convert_hf_to_gguf.py llama-models/Llama-3.2-3B-Instruct --outfile llama- ## Step 7: Quantize the Model -Quantization is a technique that reduces the precision of the model's weights from 32-bit floating point numbers to lower bit representations (like 8-bit, 4-bit or even 2-bit integers). This significantly reduces model size and speeds up inference, with minimal impact on accuracy when done carefully. +Quantization is a technique that reduces the precision of the model's weights from 32-bit floating point numbers to lower-bit representations (like 8-bit, 4-bit, or even 2-bit integers). This significantly reduces model size and speeds up inference, with minimal impact on accuracy when done carefully. For example, a 3B parameter model normally requires ~12GB in FP32. 
After quantization: -- Q8_0 (8-bit) reduces it to ~3GB -- Q4_K_M (4-bit) reduces it to ~1.5GB -- TQ2_0 (2-bit) reduces it to ~750MB +- Q8_0 (8-bit) reduces it to ~3GB; +- Q4_K_M (4-bit) reduces it to ~1.5GB; and +- TQ2_0 (2-bit) reduces it to ~750MB. -To Quantize, we can use the llama.cpp tools. **On your computer**, run: +To quantize, we can use the llama.cpp tools. **On your computer**, run: ```bash ./build/bin/llama-quantize llama-models/Llama-3.2-3B-Instruct-gguf llama-models/Llama-3.2-3B-Instruct-gguf-Q8_0 Q8_0 @@ -140,10 +140,10 @@ To Quantize, we can use the llama.cpp tools. **On your computer**, run: | Format | Bitwidth | Trade-off | | -------- | -------- | ------------------------- | | Q8_0 | 8-bit | High accuracy, large size | -| Q4_K_M | 4-bit | Balance of sizespeed | +| Q4_K_M | 4-bit | Balance of size and speed | | TQ2_0 | ~2-bit | Tiny, fast, less accurate | -> **Why Quantize?** Reduces memory and compute cost, enabling real-time use on mobile. +> **Why quantize?** Reduces memory and compute cost, enabling real-time use on mobile. --- @@ -163,9 +163,9 @@ To Quantize, we can use the llama.cpp tools. **On your computer**, run: **In Android Studio on your computer**: -- Open `llama.cpp/examples/llama.android` -- Wait for Gradle sync -- Switch to "Project" view for easier navigation. You can do this by selecting 'Android' in the top left of your screen, then selecting 'project' from the dropdown. +- open `llama.cpp/examples/llama.android`; +- wait for Gradle sync; and +- switch to "Project" view for easier navigation. You can do this by selecting "Android" in the top left of your screen, then selecting "project" from the dropdown. --- @@ -173,11 +173,11 @@ To Quantize, we can use the llama.cpp tools. **On your computer**, run: First to enable debugging mode on your phone, do the following on your **mobile**: -1. 
Enable USB debugging in Developer Options of you mobile: - - Navigate to **Settings** > **About phone** - - Tap **Build number** 7 times to enable Developer options - - Return to **Settings** > **Developer options** - - Toggle on **USB debugging** +1. Enable USB debugging in Developer options of your mobile: + - navigate to "Settings" > "About phone"; + - tap "Build number" 7 times to enable Developer options; + - return to "Settings" > "Developer options"; and + - toggle on "USB debugging". **Then, in Android Studio on your computer**, edit the following files: @@ -213,13 +213,13 @@ fun Button(viewModel: MainViewModel, dm: DownloadManager, item: Downloadable) { } ``` -Press the Play button LLaMA Android App UI in android studio to push the application to the device, and initialize the activation space. We haven't pushed the models yet so you won't be able to load them. We can fix that in the next step. +Press the Play button in Android Studio to push the application to the device, and initialize the activation space. We haven't pushed the models yet so you won't be able to load them. We can fix that in the next step. --- ## Step 11: Push Quantized Models to Android Device -1. Connect your device via USB cable and authorize your computer when prompted +1. Connect your device via USB cable and authorize your computer when prompted. 2. Push the model files to device: @@ -249,7 +249,7 @@ Llama-3.2-3B-Instruct-gguf-TQ2_0 --- -## Step 13: Use the App 🌟 +## Step 13: Use the App With the app installed and your models loaded onto the device, it's time to interact with them **on your Android phone**. @@ -273,8 +273,8 @@ This setup lets you evaluate both the usability and responsiveness of different To evaluate how each quantized model performs, tap the **Bench** button after loading a model. 
This will execute a benchmark routine that reports: -- **Prompt processing speed (pp)** – tokens per second during initial input -- **Token generation speed (tg)** – tokens per second during autoregressive generation +- **Prompt processing speed (pp)** – tokens per second during initial input. +- **Token generation speed (tg)** – tokens per second during autoregressive generation. Repeat this for each model version (Q8, Q4, Q2) to compare their efficiency. @@ -290,7 +290,7 @@ And you can summarize your findings in a table: | Q4 | 12.0 tokens/s | 7.1 tokens/s | | Q2 | 9.7 tokens/s | 10.0 tokens/s | -> 💡 **Insight:** Lower-bit models often trade slight accuracy degradation for faster runtime and smaller memory footprint. Particularly in the token generation phases as this is a memory bound process. +> **Insight:** Lower-bit models often trade slight accuracy degradation for faster runtime and a smaller memory footprint, particularly in the token generation phase, as this is a memory-bound process. This step gives you both a subjective impression of quality and a quantitative measure of model efficiency on-device. @@ -300,9 +300,9 @@ This step gives you both a subjective impression of quality and a quantitative m ## Final Recap - Built and tested `llama.cpp` locally -- Converted and quantized a LLaMA model to GGUF +- Converted and quantized a Llama model to GGUF - Integrated quantized models into an Android app - Deployed and benchmarked on-device -> 🌟 You're now equipped to deploy and iterate on LLMs for edge/mobile inference! +> You're now equipped to deploy and iterate on LLMs for edge/mobile inference!
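As a closing sanity check, the size figures quoted in Step 7 follow directly from parameter count and bit width. The sketch below is a back-of-envelope estimate only: real GGUF files also store metadata and per-block scale factors, and k-quant formats like Q4_K_M use slightly more than their nominal bits per weight.

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model footprint in decimal GB: parameters * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 3B-parameter model at the precisions used in this lab
for name, bits in [("FP32", 32), ("Q8_0", 8), ("Q4_K_M", 4), ("TQ2_0", 2)]:
    print(f"{name}: ~{approx_size_gb(3e9, bits):.2f} GB")
```

This reproduces the ~12GB, ~3GB, ~1.5GB, and ~750MB figures from Step 7.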