---
layout: distill
title: A Unified Framework for Diffusion Distillation
description: The explosive growth in one-step and few-step diffusion models has taken the field deep into the weeds of complex notations. In this blog, we cut through the confusion by proposing a coherent set of notations that reveal the connections among these methods.
tags: generative-models diffusion flow
giscus_comments: true
date: 2025-08-21
featured: true
bibliography: 2025-08-18-diff-distill.bib
---
## Introduction
At their core, diffusion models (equivalently, flow matching models) operate by iteratively refining noisy data into high-quality outputs through a series of denoising steps. Similar to divide-and-conquer algorithms <d-footnote>Common examples include mergesort, median finding, and the Fast Fourier Transform.</d-footnote>, diffusion models first *divide* the difficult denoising task into subtasks and *conquer* one of them at a time during training. To obtain a sample, however, we make a sequence of recursive predictions, which means we must *conquer* the entire task end-to-end.
This challenge has spurred research into acceleration strategies at multiple levels of granularity, including hardware optimization, mixed precision training<d-cite key="micikevicius2017mixed"></d-cite>, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), and parameter-efficient fine-tuning<d-cite key="hu2021lora"></d-cite>. In this blog, we focus on an orthogonal approach named **Ordinary Differential Equation (ODE) distillation**. This method introduces an auxiliary structure that bypasses explicit ODE solving, thereby reducing the Number of Function Evaluations (NFEs). As a result, we can generate high-quality samples with fewer denoising steps.
Distillation, in general, is a technique that transfers knowledge from a complex, high-performance model (the *teacher*) to a more efficient, customized model (the *student*). Recent distillation methods have achieved remarkable reductions in sampling steps, from hundreds to a few, or even **one**, while preserving sample quality. This advancement paves the way for real-time applications and deployment in resource-constrained environments.
<div class="row mt-3">
  <div class="col-sm mt-3 mt-md-0">
    {% include video.liquid path="/blog/2025/diff-distill/diff-distill.mp4" class="img-fluid rounded z-depth-1" controls=true autoplay=true loop=true %}
  </div>
</div>
## Notation at a Glance
Modern approaches to generative modelling consist of drawing samples from a base distribution $$\mathbf{x}_{1} \sim p_{\text{noise}}$$, typically an isotropic Gaussian, and learning a map that produces $$\mathbf{x}_{0} \sim p_{\text{data}}$$. The connection between these two distributions can be expressed by establishing an initial value problem controlled by the **velocity field** $$v(\mathbf{x}_{t}, t)$$,
$$
\require{physics}
\begin{equation}
\label{eq:1}
\dv{t}\psi_t(\mathbf{x}) = v(\psi_t(\mathbf{x}), t), \qquad \psi_0(\mathbf{x}) = \mathbf{x},
\end{equation}
$$
where the **flow** $$\psi_t:\mathbb{R}^d\times[0,1]\to \mathbb{R}^d$$ is a diffeomorphic map with $$\psi_t(\mathbf{x}_t)$$ defined as the solution to the above ODE (\ref{eq:1}). If the flow satisfies the push-forward equation<d-footnote>This is also known as the change of variables equation: $[\phi_t]_\# p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right].$</d-footnote> $$p_t=[\psi_t]_\#p_0$$, we say the **probability path** $$(p_t)_{t\in[0,1]}$$ is generated from the velocity vector field. The goal of flow matching<d-cite key="lipman_flow_2023"></d-cite> is to find a velocity field $$v_\theta(\mathbf{x}_t, t)$$ that transforms $$\mathbf{x}_1\sim p_{\text{noise}}$$ into $$\mathbf{x}_0\sim p_{\text{data}}$$ when integrated. In order to receive supervision at each time step, one must predefine a conditional probability path $$p_t(\cdot \vert \mathbf{x}_0)$$<d-footnote>In practice, the most common one is the Gaussian conditional probability path. This arises from a Gaussian conditional vector field, whose analytical form can be derived from the continuity equation $$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v) = 0.$$ See the table for details.</d-footnote> associated with its velocity field. For each datapoint $$\mathbf{x}_0\in \mathbb{R}^d$$, let $$v(\mathbf{x}_t, t\vert\mathbf{x}_0)=\mathbb{E}_{p_t(v_t \vert \mathbf{x}_0)}[v_t]$$ denote a conditional velocity vector field such that the corresponding ODE (\ref{eq:1}) yields the conditional flow.
<div class="row mt-3">
  <div class="col-sm mt-3 mt-md-0">
    {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/teaser_probpath_velocity_field.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<div class="caption">
From left to right<d-cite key="lipman2024flowmatchingguidecode"></d-cite>: conditional and marginal probability paths, and conditional and marginal velocity fields. The velocity field induces a flow by dictating the instantaneous movement of every point in space.
</div>
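To ground the IVP above, here is a minimal sketch of sampling by numerically integrating the ODE (\ref{eq:1}) with the Euler method, assuming a trained velocity network `v_theta(x, t)` that carries noise at $$t=1$$ to data at $$t=0$$. Every step costs one network call, which is precisely the NFE budget that ODE distillation aims to cut.

```python
import torch

@torch.no_grad()
def euler_sample(v_theta, x1, n_steps=100):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) to t = 0 (data).
    Each Euler step costs one function evaluation (NFE)."""
    x, dt = x1, -1.0 / n_steps                    # integrate backward in time
    for i in range(n_steps):
        t = torch.full((x.shape[0],), 1.0 + i * dt, device=x.device)
        x = x + dt * v_theta(x, t)                # x <- x + dt * v(x, t)
    return x                                      # approximately x_0 ~ p_data
```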
Most conditional probability paths are designed as a **differentiable** interpolation between noise and data for simplicity, and we can express sampling from the marginal path as
$$\mathbf{x}_t = \alpha(t)\mathbf{x}_0 + \beta(t)\mathbf{x}_1$$ where $$\alpha(t), \beta(t)$$ are predefined schedules. <d-footnote>The stochastic interpolant paper defines a probability path that summarizes all diffusion models, under several assumptions. Here, we use a simpler interpolant for clean illustration.</d-footnote>
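As a concrete example, the sketch below draws $$\mathbf{x}_t$$ under the linear schedule $$\alpha(t)=1-t, \beta(t)=t$$ (an illustrative choice; the MeanFlow and FACM sections below use the same one) and returns the corresponding conditional velocity target:

```python
import torch

def sample_linear_path(x0, t):
    """Sample x_t = (1 - t) * x0 + t * x1 with x1 ~ N(0, I), and return the
    conditional velocity target v = x1 - x0 for this schedule. For general
    schedules, x_t = alpha(t) x0 + beta(t) x1 and v = alpha'(t) x0 + beta'(t) x1.
    Assumes x0 of shape (B, d) and t of shape (B,)."""
    x1 = torch.randn_like(x0)        # x_1 ~ p_noise
    t = t[..., None]                 # broadcast (B,) -> (B, 1)
    return (1.0 - t) * x0 + t * x1, x1 - x0
```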
## ODE Distillation Methods
Before introducing ODE distillation methods, it is imperative to define a general continuous-time flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$<d-cite key="boffi2025build"></d-cite>, which maps any noisy input $$\mathbf{x}_t, t\in[0,1]$$ to any point $$\mathbf{x}_s, s\in[0,1]$$ on the ODE (\ref{eq:1}) that describes the aforementioned probability flow. This generalizes flow-based distillation and consistency models within a single unified framework. The flow map is well-defined only if its **boundary conditions** satisfy $$f_{t\to t}(\mathbf{x}_t, t, t) = \mathbf{x}_t$$ for all time steps. One popular way to meet the condition is to parameterize the model as $$ f_{t\to s}(\mathbf{x}_t, t, s)= c_{\text{skip}}(t, s)\mathbf{x}_t + c_{\text{out}}(t,s)F_{t\to s}(\mathbf{x}_t, t, s)$$ where $$c_{\text{skip}}(t, t) = 1$$ and $$c_{\text{out}}(t, t) = 0$$ for all $$t$$.
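A minimal sketch of this parameterization, assuming inputs of shape `(B, d)`, times of shape `(B,)`, and the concrete choice $$c_{\text{out}}(t,s)=s-t$$ (one of many coefficients satisfying the boundary condition; it recovers the MeanFlow and FACM parameterizations discussed below):

```python
import torch
import torch.nn as nn

class FlowMap(nn.Module):
    """Flow map f_{t->s} with the boundary condition f_{t->t} = identity
    enforced by construction via c_skip and c_out."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone     # F_{t->s}(x_t, t, s), e.g. a U-Net or DiT

    def forward(self, x_t, t, s):
        # c_skip(t, s) = 1 and c_out(t, s) = s - t, so c_out(t, t) = 0.
        c_out = (s - t)[..., None]
        return x_t + c_out * self.backbone(x_t, t, s)
```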
At its core, ODE distillation boils down to strategically constructing the training objective for the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ so that it can be evaluated efficiently during sampling. In addition, we need to orchestrate the schedule of $$(t,s)$$ pairs for better training dynamics.
In the context of distillation, the forward direction $$s<t$$ is typically taken as the target. Yet the other direction can also carry meaningful structure. Notice that in DDIM<d-cite key="song2020denoising"></d-cite> sampling, the conditional probability path is traversed twice. In our flow map formulation, this can be replaced with the flow maps $$f_{\tau_i\to 0}(\mathbf{x}_{\tau_i}, \tau_i, 0), f_{0\to \tau_{i-1}}(\mathbf{x}_0, 0, \tau_{i-1})$$ where $$0<\tau_{i-1}<\tau_i<1$$. Intuitively, the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ represents a direct mapping, a **displacement field**, while $$F_{t\to s}(\mathbf{x}_t, t, s)$$ measures the increment, which corresponds to a **velocity field**.
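Once trained, sampling reduces to a handful of hops along the trajectory. A sketch, where the step schedule `taus` is an illustrative choice and `taus=(1.0, 0.0)` would give one-step generation:

```python
import torch

@torch.no_grad()
def flow_map_sample(flow_map, x1, taus=(1.0, 0.75, 0.5, 0.25, 0.0)):
    """Few-step sampling: each hop f_{t->s} replaces many ODE solver steps."""
    x = x1                                        # x_1 ~ p_noise
    for t, s in zip(taus[:-1], taus[1:]):
        tt = torch.full((x.shape[0],), t, device=x.device)
        ss = torch.full((x.shape[0],), s, device=x.device)
        x = flow_map(x, tt, ss)                   # jump from time t to time s
    return x                                      # approximately x_0 ~ p_data
```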
### MeanFlow
MeanFlow<d-cite key="geng2025mean"></d-cite> can be trained from scratch or distilled from a pretrained FM model. The conditional probability path is defined as the linear interpolation between noise and data $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$ with the corresponding default conditional velocity field OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ The main contribution consists of identifying and defining an **average velocity field**, which coincides with our flow map as

$$
F_{t\to s}(\mathbf{x}_t, t, s) = \frac{1}{t-s}\int_s^t v(\mathbf{x}_\tau, \tau)\,\mathrm{d}\tau.
$$
Note that in MeanFlow $$\dv{\mathbf{x}_t}{t} = v(\mathbf{x}_t, t\vert \mathbf{x}_0)$$ and $$\dv{s}{t}=0$$ since $s$ is independent of $t$.
</details>
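Since the derivation sits in the collapsed block above, here is a minimal sketch of how the MeanFlow identity $$u = v - (t-s)\dv{u}{t}$$ could be turned into a regression target with forward-mode autodiff; `u_theta` is assumed to be a callable network $$u_\theta(\mathbf{x}, t, s)$$ and `v_t` the linear-path target $$\mathbf{x}_1 - \mathbf{x}_0$$:

```python
import torch
from torch.func import jvp

def meanflow_target(u_theta, x_t, v_t, t, s):
    """MeanFlow identity: u = v - (t - s) * du/dt, where du/dt is the total
    derivative of u along the ODE trajectory, obtained via one jvp call."""
    # Tangent direction (dx/dt, dt/dt, ds/dt) = (v_t, 1, 0); ds/dt = 0
    # because s is independent of t (see the note above).
    u, du_dt = jvp(
        u_theta,
        (x_t, t, s),
        (v_t, torch.ones_like(t), torch.zeros_like(s)),
    )
    target = v_t - (t - s)[..., None] * du_dt     # assumes x_t of shape (B, d)
    return u, target.detach()                     # stop-gradient on the target
```

A training step would then minimize a distance $$d(u, \text{sg}(u_{\text{tgt}}))$$ between the prediction and the detached target, e.g. MeanFlow's adaptive $\ell_2$ loss.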
Similar to the MeanFlow preliminaries, the Flow Anchor Consistency Model (FACM)<d-cite key="peng2025flow"></d-cite> also adopts the linear conditional probability path $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$ with the corresponding default conditional velocity field OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ In our flow map notation, FACM parameterizes the model as $$ f^\theta_{t\to 0}(\mathbf{x}_t, t, 0)= \mathbf{x}_t - tF^\theta_{t\to 0}(\mathbf{x}_t, t, 0) $$ where $$c_{\text{skip}}(t,s)=1$$ and $$c_{\text{out}}(t,s)=-t$$.
FACM imposes a **consistency property**, which requires the total derivative of the consistency function to be zero:
$$
\require{physics}
\dv{t}f^\theta_{t \to 0}(\mathbf{x}, t, 0) = 0.
$$
<span style="color: blue; font-weight: bold;">Training</span>: FACM training algorithm equipped with our flow map notation. Notice that $$d_1, d_2$$ are the $\ell_2$-with-cosine loss<d-footnote>$L_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \dfrac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_{2} \,\|\mathbf{y}\|_{2}}$</d-footnote> and the norm $\ell_2$ loss<d-footnote>$L_{\text{norm}}(\mathbf{x}, \mathbf{y}) =\dfrac{\|\mathbf{x}-\mathbf{y}\|^2}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2+c}}$ where $c$ is a small constant. This is a special case of the adaptive $\ell_2$ loss proposed in MeanFlow<d-cite key="geng2025mean"></d-cite>.</d-footnote> respectively, plus reweighting (a sketch of both distances follows the figure below). Interestingly, they separate the training of FM and CM onto disentangled time intervals: when training with the CM target, we let $$s=0, t\in[0,1]$$; when training with FM anchors, we set $$t'=2-t, t'\in[1,2]$$.
<div class="row mt-3">
  <div class="col-sm mt-3 mt-md-0">
    {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/facm_training.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
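For concreteness, the two distances from the footnotes might be implemented as follows; the per-sample flattening and the constant `c` are our illustrative choices:

```python
import torch
import torch.nn.functional as F

def cos_loss(x, y, eps=1e-8):
    """L_cos(x, y) = 1 - <x, y> / (||x|| ||y||), computed per sample."""
    return 1.0 - F.cosine_similarity(x.flatten(1), y.flatten(1), dim=1, eps=eps)

def norm_l2_loss(x, y, c=1e-3):
    """L_norm(x, y) = ||x - y||^2 / sqrt(||x - y||^2 + c), a special case of
    MeanFlow's adaptive l2 loss; c is a small illustrative constant."""
    sq = (x - y).flatten(1).pow(2).sum(dim=1)
    return sq / torch.sqrt(sq + c)
```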
Type (b) backward loss for AYF-EMD, type (a) forward loss for AYF-LMD.
</details>
## Connections
Now it is time to connect the dots with existing methods. Let's frame their objectives in our flow map notation and identify their loss types where possible.
### Shortcut Models
<div class="row mt-3">
  <div class="col-sm mt-3 mt-md-0">
    {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/shortcut_model.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
### ReFlow
<div class="row mt-3">
  <div class="col-sm mt-3 mt-md-0">
    {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/rectifiedflow.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<span style="color: orange; font-weight: bold;">Sampling</span>: Same as FMs.
### Inductive Moment Matching
<div class="row mt-3">
  <div class="col-sm mt-3 mt-md-0">
    {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/IMM.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>