Commit b4ee3a6

fix the latex rendering issue, update the mathjax.html file, explain the ODE distillation in detail.
1 parent 8e41516 commit b4ee3a6

File tree: 2 files changed (+50, -49)


_includes/scripts/mathjax.html (21 additions, 7 deletions)

@@ -1,12 +1,26 @@
 {%- if site.enable_math -%}
 <!-- MathJax -->
-<script type="text/javascript">
+<script type="text/javascript">
 window.MathJax = {
-  tex: {
-    tags: 'ams'
-  }
+  tex: {
+    tags: 'ams',
+    inlineMath: [['$', '$'], ['\\(', '\\)']],
+    displayMath: [['$$', '$$'], ['\\[', '\\]']]
+  },
+  options: {
+    skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre'],
+    ignoreHtmlClass: 'tex2jax_ignore',
+    processHtmlClass: 'tex2jax_process'
+  },
+  startup: {
+    ready: () => {
+      MathJax.startup.defaultReady();
+      MathJax.startup.promise.then(() => {
+        console.log('MathJax initial typesetting complete');
+      });
+    }
+  }
 };
-</script>
-<script defer type="text/javascript" id="MathJax-script" src="https://cdn.jsdelivr.net/npm/mathjax@{{ site.mathjax.version }}/es5/tex-mml-chtml.js"></script>
-<script defer src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
+</script>
+<script defer type="text/javascript" id="MathJax-script" src="https://cdn.jsdelivr.net/npm/mathjax@{{ site.mathjax.version }}/es5/tex-mml-chtml.js"></script>
 {%- endif %}

_posts/2025-08-18-diff-distill.md (29 additions, 42 deletions)
@@ -2,7 +2,7 @@
 layout: distill
 title: A Unified Framework for Diffusion Distillation
 description: The explosive growth in one-step and few-step diffusion models has taken the field deep into the weeds of complex notations. In this blog, we cut through the confusion by proposing a coherent set of notations that reveal the connections among these methods.
-tags: generative-models diffusion flows
+tags: generative-models diffusion flow
 giscus_comments: true
 date: 2025-08-21
 featured: true
@@ -15,12 +15,6 @@ authors:
 
 bibliography: 2025-08-18-diff-distill.bib
 
-# Optionally, you can add a table of contents to your post.
-# NOTES:
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-#   - we may want to automate TOC generation in the future using
-#     jekyll-toc plugin (https://github.com/toshimaru/jekyll-toc).
 toc:
   - name: Introduction
   - name: Notation at a Glance
@@ -36,24 +30,6 @@ toc:
   - name: ReFlow
   - name: Inductive Moment Matching
   - name: Closing Thoughts
-
-# Below is an example of injecting additional post-specific styles.
-# If you use this post as a template, delete this _styles block.
-# _styles: >
-#   .fake-img {
-#     background: #bbb;
-#     border: 1px solid rgba(0, 0, 0, 0.1);
-#     box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-#     margin-bottom: 12px;
-#   }
-#   .fake-img p {
-#     font-family: monospace;
-#     color: white;
-#     text-align: left;
-#     margin: 12px 0;
-#     text-align: center;
-#     font-size: 16px;
-#   }
 ---
 
 ## Introduction
@@ -62,27 +38,20 @@ Diffusion and flow-based models<d-cite key="ho2020denoising, lipman_flow_2023, a
 
 At its core, diffusion models (equivalently, flow matching models) operate by iteratively refining noisy data into high-quality outputs through a series of denoising steps. Similar to divide-and-conquer algorithms <d-footnote>Common ones like Mergesort, locating the median and Fast Fourier Transform.</d-footnote>, diffusion models first *divide* the difficult denoising task into subtasks and *conquer* one of these at a time during training. To obtain a sample, we make a sequence of recursive predictions which means we need to *conquer* the entire task end-to-end.
 
-This challenge has spurred research into acceleration strategies across multiple granular levels, including hardware optimization, mixed precision training<d-cite key="micikevicius2017mixed"></d-cite>, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), and parameter-efficient fine-tuning<d-cite key="hu2021lora"></d-cite>. In this blog, we focus on an orthogonal approach, **ODE distillation**, which minimize Number of Function Evaluations (NFEs) so that we can generate high-quality samples with as few denoising steps as possible.
+This challenge has spurred research into acceleration strategies across multiple granular levels, including hardware optimization, mixed precision training<d-cite key="micikevicius2017mixed"></d-cite>, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), and parameter-efficient fine-tuning<d-cite key="hu2021lora"></d-cite>. In this blog, we focus on an orthogonal approach named **Ordinary Differential Equation (ODE) distillation**. This method introduces an auxiliary structure that bypasses explicit ODE solving, thereby reducing the Number of Function Evaluations (NFEs). As a result, we can generate high-quality samples with fewer denoising steps.
 
 Distillation, in general, is a technique that transfers knowledge from a complex, high-performance model (the *teacher*) to a more efficient, customized model (the *student*). Recent distillation methods have achieved remarkable reductions in sampling steps, from hundreds to a few and even **one** step, while preserving the sample quality. This advancement paves the way for real-time applications and deployment in resource-constrained environments.
+
 <div class="row mt-3">
     <div class="col-sm mt-3 mt-md-0">
-        {% include video.liquid path="blog/2025/diff-distill/diff-distill.mp4" class="img-fluid rounded z-depth-1" controls=true autoplay=true loop=true%}
+        {% include video.liquid path="/blog/2025/diff-distill/diff-distill.mp4" class="img-fluid rounded z-depth-1" controls=true autoplay=true loop=true %}
     </div>
 </div>
 
 
 ## Notation at a Glance
-<div class="row mt-3">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/teaser_probpath_velocity_field.png" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-    From left to right:<d-cite key="lipman2024flowmatchingguidecode"></d-cite>conditional and marginal probability paths, conditional and marginal velocity fields. The velocity field induces a flow that dictates its instantaneous movement across all points in space.
-</div>
 
-The modern approaches of generative modelling consist of picking some samples from a base distribution $$\mathbf{x}_1\sim p_{\text{noise}}$$, typically an isotropic Gaussian, and learning a map such that $$\mathbf{x}_0\sim p_{\text{data}}$$. The connection between these two distributions can be expressed by establishing an initial value problem controlled by the **velocity field** $v(\mathbf{x}_t, t)$,
+The modern approaches of generative modelling consist of picking some samples from a base distribution $$\mathbf{x}_{1} \sim p_{\text{noise}}$$, typically an isotropic Gaussian, and learning a map such that $$\mathbf{x}_{0} \sim p_{\text{data}}$$. The connection between these two distributions can be expressed by establishing an initial value problem controlled by the **velocity field** $$v(\mathbf{x}_{t}, t)$$,
 
 $$
 \require{physics}
@@ -91,7 +60,17 @@ $$
 \end{equation}
 $$
 
-where the **flow** $\psi_t:\mathbb{R}^d\times[0,1]\to \mathbb{R}^d$ is a diffeomorphic map with $$\psi_t(\mathbf{x}_t)$$ defined as the solution to the above ODE (\ref{eq:1}). If the flow satisfies the push-forward equation<d-footnote>This is also known as the change of variable equation: $[\phi_t]_\# p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right].$</d-footnote> $$p_t=[\psi_t]_\#p_0$$, we say a **probability path** $$(p_t)_{t\in[0,1]}$$ is generated from the velocity vector field. The goal of flow matching<d-cite key="lipman_flow_2023"></d-cite> is to find a velocity field $$v_\theta(\mathbf{x}_t, t)$$ so that it transforms $$\mathbf{x}_1\sim p_{\text{noise}}$$ to $$\mathbf{x}_0\sim p_{\text{data}}$$ when integrated. In order to receive supervision at each time step, one must predefine a condition probability path $$p_t(\cdot \vert \mathbf{x}_0)$$<d-footnote>In practice, the most common one is the Gaussian conditional probability path. This arises from a Gaussian conditional vector field, whose analytical form can be derived from the continuity equation. $$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v) = 0$$ See the table for details.</d-footnote> associated with its velocity field. For each datapoint $$\mathbf{x}_0\in \mathbb{R}^d$$, let $$v(\mathbf{x}_t, t\vert\mathbf{x}_0)=\mathbb{E}_{p_t(v_t \vert \mathbf{x}_0)}[v_t]$$ denote a conditional velocity vector field so that the corresponding ODE (\ref{eq:1}) yields the conditional flow.
+where the **flow** $$\psi_t:\mathbb{R}^d\times[0,1]\to \mathbb{R}^d$$ is a diffeomorphic map with $$\psi_t(\mathbf{x}_t)$$ defined as the solution to the above ODE (\ref{eq:1}). If the flow satisfies the push-forward equation<d-footnote>This is also known as the change of variable equation: $[\phi_t]_\# p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right].$</d-footnote> $$p_t=[\psi_t]_\#p_0$$, we say a **probability path** $$(p_t)_{t\in[0,1]}$$ is generated from the velocity vector field. The goal of flow matching<d-cite key="lipman_flow_2023"></d-cite> is to find a velocity field $$v_\theta(\mathbf{x}_t, t)$$ so that it transforms $$\mathbf{x}_1\sim p_{\text{noise}}$$ to $$\mathbf{x}_0\sim p_{\text{data}}$$ when integrated. In order to receive supervision at each time step, one must predefine a conditional probability path $$p_t(\cdot \vert \mathbf{x}_0)$$<d-footnote>In practice, the most common one is the Gaussian conditional probability path. This arises from a Gaussian conditional vector field, whose analytical form can be derived from the continuity equation. $$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v) = 0$$ See the table for details.</d-footnote> associated with its velocity field. For each datapoint $$\mathbf{x}_0\in \mathbb{R}^d$$, let $$v(\mathbf{x}_t, t\vert\mathbf{x}_0)=\mathbb{E}_{p_t(v_t \vert \mathbf{x}_0)}[v_t]$$ denote a conditional velocity vector field so that the corresponding ODE (\ref{eq:1}) yields the conditional flow.
+
+
+<div class="row mt-3">
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/teaser_probpath_velocity_field.png" class="img-fluid rounded z-depth-1" %}
+    </div>
+</div>
+<div class="caption">
+    From left to right:<d-cite key="lipman2024flowmatchingguidecode"></d-cite> conditional and marginal probability paths, conditional and marginal velocity fields. The velocity field induces a flow that dictates its instantaneous movement across all points in space.
+</div>
 
 Most of the conditional probability paths are designed as the **differentiable** interpolation between noise and data for simplicity, and we can express sampling from a marginal path
 $$\mathbf{x}_t = \alpha(t)\mathbf{x}_0 + \beta(t)\mathbf{x}_1$$ where $$\alpha(t), \beta(t)$$ are predefined schedules. <d-footnote>The stochastic interpolant paper defines this probability path that summarizes all diffusion models, with several assumptions. Here, we use a simpler interpolant for clean illustration.</d-footnote>
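The interpolation and its ODE can be sanity-checked numerically. Below is a minimal 1-D Python sketch, assuming the linear schedule alpha(t) = 1 - t, beta(t) = t and its conditional velocity v = x1 - x0; the helper names are illustrative, not from the post:

```python
# A minimal 1-D sketch of the linear probability path and its ODE.
# Assumes alpha(t) = 1 - t, beta(t) = t, so that
# x_t = (1 - t) * x0 + t * x1 and dx_t/dt = x1 - x0.

def interpolate(x0: float, x1: float, t: float) -> float:
    """Sample x_t on the conditional path between data x0 and noise x1."""
    return (1.0 - t) * x0 + t * x1

def conditional_velocity(x0: float, x1: float) -> float:
    """Conditional velocity field v(x_t, t | x0) for the linear path."""
    return x1 - x0

def euler_sample(x1: float, x0: float, steps: int = 100) -> float:
    """Euler-integrate dx/dt = v from t = 1 (noise) down to t = 0 (data).

    Here the *conditional* velocity stands in for a trained marginal
    field v_theta, purely for illustration.
    """
    x = x1
    dt = 1.0 / steps
    for _ in range(steps):
        x -= dt * conditional_velocity(x0, x1)  # step against increasing t
    return x

# With the exact conditional velocity, Euler integration of the linear
# path recovers x0 (up to float rounding): noise x1 = 2.0 maps to -1.0.
print(euler_sample(x1=2.0, x0=-1.0))
```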
@@ -127,13 +106,15 @@ where $$w(t)$$ is a reweighting function<d-footnote>The weighting function modul
 
 
 ## ODE Distillation methods
+
 Before introducing ODE distillation methods, it is imperative to define a general continuous-time flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$<d-cite key="boffi2025build"></d-cite> which maps any noisy input $$\mathbf{x}_t, t\in[0,1]$$ to any point $$\mathbf{x}_s, s\in[0,1]$$ on the ODE (\ref{eq:1}) that describes the probability flow aforementioned. This is a generalization of flow-based distillation and consistency models within a single unified framework. The flow map is well-defined only if its **boundary conditions** satisfy $$f_{t\to t}(\mathbf{x}_t, t, t) = \mathbf{x}_t$$ for all time steps. One popular way to meet the condition is to parameterize the model as $$ f_{t\to s}(\mathbf{x}_t, t, s)= c_{\text{skip}}(t, s)\mathbf{x}_t + c_{\text{out}}(t,s)F_{t\to s}(\mathbf{x}_t, t, s)$$ where $$c_{\text{skip}}(t, t) = 1$$ and $$c_{\text{out}}(t, t) = 0$$ for all $$t$$.
 
 At its core, ODE distillation boils down to how to strategically construct the training objective of the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ so that it can be efficiently evaluated during sampling. In addition, we need to orchestrate the schedule of $$(t,s)$$ pairs for better training dynamics.
 
 In the context of distillation, the forward direction $$s<t$$ is typically taken as the target. Yet, the other direction can also carry meaningful structure. Notice in DDIM<d-cite key="song2020denoising"></d-cite> sampling, the conditional probability path is traversed twice. In our flow map formulation, this can be replaced with the flow maps $$f_{\tau_i\to 0}(\mathbf{x}_{\tau_i}, \tau_i, 0), f_{0\to \tau_{i-1}}(\mathbf{x}_0, 0, \tau_{i-1})$$ where $$0<\tau_{i-1}<\tau_i<1$$. Intuitively, the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ represents a direct mapping of some **displacement field** where $$F_{t\to s}(\mathbf{x}_t, t, s)$$ measures the increment which corresponds to a **velocity field**.
 
 ### MeanFlow
+
 MeanFlow<d-cite key="geng2025mean"></d-cite> can be trained from scratch or distilled from a pretrained FM model. The conditional probability path is defined as the linear interpolation between noise and data $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$ with the corresponding default conditional velocity field OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ The main contribution consists of identifying and defining an **average velocity field** which coincides with our flow map as
 
 $$
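The boundary-condition parameterization of the flow map can be sketched in a few lines of Python. The coefficient choices below (c_skip = 1, c_out = s - t) are one hypothetical example that satisfies c_skip(t, t) = 1 and c_out(t, t) = 0; they happen to match MeanFlow's form x_s = x_t - (t - s)u when F plays the role of the average velocity:

```python
# Sketch of the flow-map parameterization
#   f_{t->s}(x_t, t, s) = c_skip(t, s) * x_t + c_out(t, s) * F_{t->s}(x_t, t, s),
# with illustrative coefficients that enforce the boundary condition
# f_{t->t}(x_t, t, t) = x_t for *any* network F.

def c_skip(t: float, s: float) -> float:
    return 1.0  # c_skip(t, t) = 1

def c_out(t: float, s: float) -> float:
    return s - t  # vanishes at s = t, so the boundary condition holds

def flow_map(F, x_t: float, t: float, s: float) -> float:
    """Map a point x_t at time t to time s along the (learned) ODE."""
    return c_skip(t, s) * x_t + c_out(t, s) * F(x_t, t, s)

# Any F (here a stand-in for a trained network) respects the boundary:
F_dummy = lambda x, t, s: 123.456  # arbitrary output
assert flow_map(F_dummy, 0.7, 0.5, 0.5) == 0.7  # f_{t->t}(x_t) = x_t
```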
@@ -172,6 +153,7 @@ where $$F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s\vert\mathbf{x}_0)=v - (t-s)(v
 <details>
 <summary>Full derivation of the target</summary>
 Based on the MeanFlow identity, we can compute the target as follows:
+
 $$
 \require{physics}
 \require{cancel}
@@ -181,6 +163,7 @@ F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s\vert\mathbf{x}_0) &= \dv{\mathbf{x}_t
 & = v - (t-s)\left(v \nabla_{\mathbf{x}_t} F_{t\to s}(\mathbf{x}_t, t, s) + \partial_t F_{t\to s}(\mathbf{x}_t, t, s)\right). \\
 \end{align*}
 $$
+
 Note that in MeanFlow $$\dv{\mathbf{x}_t}{t} = v(\mathbf{x}_t, t\vert \mathbf{x}_0)$$ and $$\dv{s}{t}=0$$ since $s$ is independent of $t$.
 </details>

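The MeanFlow target F_tgt = v - (t - s) * d/dt F can also be checked numerically. In the paper the total derivative along the ODE is a JVP of the network; the sketch below approximates it with a central finite difference on a hypothetical closed-form F (all names are illustrative):

```python
# Numerical sketch of the MeanFlow target
#   F_tgt = v - (t - s) * d/dt F(x_t, t, s),
# where d/dt is the total derivative along the ODE (dx_t/dt = v, ds/dt = 0).

def total_derivative(F, x: float, t: float, s: float, v: float,
                     eps: float = 1e-5) -> float:
    """d/dt F(x_t, t, s) = v * dF/dx + dF/dt, with s held fixed."""
    return (F(x + eps * v, t + eps, s) - F(x - eps * v, t - eps, s)) / (2 * eps)

def meanflow_target(F, x: float, t: float, s: float, v: float) -> float:
    return v - (t - s) * total_derivative(F, x, t, s, v)

# Sanity check on a hypothetical F: for a constant field, the true average
# velocity is F(x, t, s) = v everywhere, its total derivative is 0, and the
# target reduces to v itself.
v = 3.0
F_const = lambda x, t, s: v
print(meanflow_target(F_const, x=0.5, t=0.8, s=0.1, v=v))  # prints 3.0
```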
@@ -250,7 +233,8 @@ Type (b) backward loss, while CTMs<d-cite key="kim2023consistency"></d-cite> opt
 
 Similar to MeanFlow preliminaries, Flow Anchor Consistency Model (FACM)<d-cite key="peng2025flow"></d-cite> also adopts the linear conditional probability path $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$ with the corresponding default conditional velocity field OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ In our flow maps notation, FACM parameterizes the model as $$ f^\theta_{t\to s}(\mathbf{x}_t, t, 0)= \mathbf{x}_t - tF^\theta_{t\to s}(\mathbf{x}_t, t, 0) $$ where $$c_{\text{skip}}(t,s)=1$$ and $$c_{\text{out}}(t,s)=-t$$.
 
-FACM imposes a **consistency property** which requires the total derivative of the consistency function to be zero
+FACM imposes a **consistency property** which requires the total derivative of the consistency function to be zero
+
 $$
 \require{physics}
 \dv{t}f^\theta_{t \to 0}(\mathbf{x}, t, 0) = 0.
@@ -268,7 +252,7 @@ Notice this is equivalent to [MeanFlow](#meanflow) where $$s=0$$. This indicates
 <span style="color: blue; font-weight: bold;">Training</span>: FACM training algorithm equipped with our flow map notation. Notice that $$d_1, d_2$$ are $\ell_2$ with cosine loss<d-footnote>$L_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \dfrac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_{2} \, \|\mathbf{y}\|_{2}}$</d-footnote> and norm $\ell_2$ loss<d-footnote>$L_{\text{norm}}(\mathbf{x}, \mathbf{y}) =\dfrac{\|\mathbf{x}-\mathbf{y}\|^2}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2+c}}$ where $c$ is a small constant. This is a special case of adaptive L2 loss proposed in MeanFlow<d-cite key="geng2025mean"></d-cite>.</d-footnote> respectively, plus reweighting. Interestingly, they separate the training of FM and CM on disentangled time intervals. When training with CM target, we let $$s=0, t\in[0,1]$$. On the other hand, we set $$t'=2-t, t'\in[1,2]$$ when training with FM anchors.
 <div class="row mt-3">
     <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/facm_training.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/facm_training.png" class="img-fluid rounded z-depth-1" %}
     </div>
 </div>

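The two distances d_1, d_2 from the footnotes above are easy to implement directly. A minimal Python sketch, assuming the standard forms L_cos(x, y) = 1 - <x, y> / (||x|| ||y||) and L_norm(x, y) = ||x - y||^2 / sqrt(||x - y||^2 + c):

```python
import math

def cosine_loss(x: list[float], y: list[float]) -> float:
    """L_cos(x, y) = 1 - <x, y> / (||x||_2 * ||y||_2)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

def norm_l2_loss(x: list[float], y: list[float], c: float = 1e-3) -> float:
    """L_norm(x, y) = ||x - y||^2 / sqrt(||x - y||^2 + c).

    A special case of MeanFlow's adaptive L2 loss; c is a small constant.
    """
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return sq / math.sqrt(sq + c)

# Parallel vectors incur ~zero cosine loss; identical vectors incur
# exactly zero norm loss.
print(cosine_loss([1.0, 2.0], [2.0, 4.0]))   # ~0.0 (up to float error)
print(norm_l2_loss([1.0, 2.0], [1.0, 2.0]))  # 0.0
```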
@@ -306,12 +290,14 @@ Type (b) backward loss for AYF-EMD, type (a) forward loss for AYF-LMD.
 </details>
 
 ## Connections
+
 Now it is time to connect the dots with some previous existing methods. Let's frame their objectives in our flow map notation and identify their loss types if possible.
 
 ### Shortcut Models
+
 <div class="row mt-3">
     <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/shortcut_model.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/shortcut_model.png" class="img-fluid rounded z-depth-1" %}
     </div>
 </div>
 <div class="caption">
@@ -337,7 +323,7 @@ Type (c) tri-consistency loss
 ### ReFlow
 <div class="row mt-3">
     <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/rectifiedflow.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/rectifiedflow.png" class="img-fluid rounded z-depth-1" %}
     </div>
 </div>
 <div class="caption">
@@ -350,9 +336,10 @@ Unlike most ODE distillation methods that learn to jump from $$t\to s$$ accordin
 <span style="color: orange; font-weight: bold;">Sampling</span>: Same as FMs.
 
 ### Inductive Moment Matching
+
 <div class="row mt-3">
     <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/IMM.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/IMM.png" class="img-fluid rounded z-depth-1" %}
     </div>
 </div>
 <div class="caption">
