
Commit 8358353

improve word use

1 parent 584b3e7 commit 8358353

1 file changed: +3 -3 lines changed

_posts/2025-08-17-cogen-motion.md

Lines changed: 3 additions & 3 deletions
@@ -125,7 +125,7 @@ where $$v_\theta$$ is the parametrized vector field approximating the straight f
### Data-Space Predictive Learning Objectives

From an engineering standpoint, a somewhat **bitter lesson** we encountered is that **existing predictive learning objectives remain remarkably strong**. Despite the appeal of noise-prediction formulations (e.g., $\epsilon$-prediction introduced in DDPM <d-cite key="ho2020denoising"></d-cite> and later adopted in flow matching <d-cite key="lipman2022flow"></d-cite>), straightforward predictive objectives in the data space—such as direct $$\hat{x}_0$$ reconstruction in DDPM notation<d-footnote>
-Note that we follow the flow matching notations in <d-cite key="lipman2022flow"></d-cite> to use $t=1$ as the data distribution and $t=0$ as the noise distribution, which is opposite to the original DDPM notations in <d-cite key="ho2020denoising"></d-cite>.</d-footnote>—consistently yielded more stable convergence.
+Note that we follow the flow matching notations in <d-cite key="lipman2022flow"></d-cite> to use $t=1$ as the data distribution and $t=0$ as the noise distribution, which is opposite to the original DDPM notations in <d-cite key="ho2020denoising"></d-cite>.</d-footnote>—consistently yields more stable convergence.


Concretely, by rearranging the original linear flow objective, we define a neural network
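For readers skimming the diff, a minimal sketch of the rearrangement referenced in the last context line above, assuming the standard linear interpolation path of flow matching with $t=1$ as data (our notation, not necessarily the post's exact parameterization): given $Y^t = t\,Y^1 + (1-t)\,Y^0$, the straight-flow velocity target can be rewritten in terms of the clean data point,

$$
Y^1 - Y^0 = \frac{Y^1 - Y^t}{1 - t},
$$

so a network $\hat{Y}^1_\theta(Y^t, t, C)$ trained with a data-space loss $\big\|\hat{Y}^1_\theta - Y^1\big\|^2$ induces a vector field $v_\theta = \big(\hat{Y}^1_\theta - Y^t\big)/(1 - t)$ that can still be used for ODE sampling.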
@@ -143,7 +143,7 @@ $$
Our empirical observation is that data-space predictive learning objectives outperform denoising objectives. We argue that this is largely influenced by the current evaluation protocol, which heavily rewards model outputs that are close to the ground truth.

During training, the original denoising target matches the vector field $Y^1 - Y^0$, defined as the difference between the data sample (future trajectory) and the noise sample (drawn from the noise distribution). Under the current proximity-based metrics, this objective is harder to optimize than the predictive objective because of the stochasticity introduced by $Y^0$, as the metrics do not adequately reward diverse forecasting. Moreover, during the sampling process, small errors in the vector field model $v_\theta$—measured with respect to the single ground-truth velocity field at intermediate time steps—can be amplified through subsequent iterative steps. Consequently, increasing inference-time compute may not necessarily improve results without incorporating regularization from the data-space loss <d-footnote>
-Interestingly, in our experiments, we found that flow-matching ODEs—thanks to their less noisy inference process—usually perform more stably than diffusion-model SDEs, which is somewhat surprising. In image generation, as shown in SiT <d-cite key="ma2024sit"></d-cite>, ODE-based samplers are generally weaker than SDE-based samplers.
+Interestingly, in our experiments, we found that flow-matching ODEs—thanks to their less noisy inference process—usually perform more stably than diffusion-model SDEs, which is surprising. In image generation, as shown in SiT <d-cite key="ma2024sit"></d-cite>, ODE-based samplers are generally weaker than SDE-based samplers.
</d-footnote>.

### Joint Multi-Modal Learning Losses
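As a companion to the sampling argument in the hunk above, here is a minimal, hypothetical sketch of Euler integration for the flow-matching ODE when the network predicts the clean trajectory and the velocity is derived from it; the function and model names are assumptions, not the post's code.

```python
def sample_flow_ode(x1_model, context, y0, num_steps=32):
    """Euler-integrate the flow-matching ODE from noise (t=0) to data (t=1).

    x1_model(y_t, t, context) is assumed to return a data-space prediction of
    the future trajectory; the velocity is recovered from it at each step.
    Any per-step error in the induced velocity feeds into every later step,
    which is the amplification effect discussed in the diff above.
    y0 can be a NumPy array or a torch tensor of shape (B, T, D).
    """
    y = y0                                   # start from the noise sample
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        y1_hat = x1_model(y, t, context)     # predicted clean trajectory
        v = (y1_hat - y) / max(1.0 - t, dt)  # induced velocity; clamp for numerical safety near t = 1
        y = y + dt * v                       # Euler update
    return y
```

With this parameterization the final Euler step returns the data-space prediction itself, so a single-step run degenerates to the direct predictor discussed earlier in the diff.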
@@ -164,7 +164,7 @@ We acknowledge that some prior works, such as MotionLM <d-cite key="seff2023moti

## Exploring Inference Acceleration

-To accelerate inference in flow-matching models, which typically require tens or even hundreds of iterations for ODE simulation, we adopt a somewhat underrated idea from the image generation literature: conditional **IMLE (implicit maximum likelihood estimation)** <d-cite key="li2018implicit, li2019diverse"></d-cite>. IMLE provides a way to distill an iterative generative model into a **one-step generator**.
+To accelerate inference in flow-matching models, which typically require tens or even hundreds of iterations for ODE simulation, we adopt an underrated idea from the image generation literature: conditional **IMLE (implicit maximum likelihood estimation)** <d-cite key="li2018implicit, li2019diverse"></d-cite>. IMLE provides a way to distill an iterative generative model into a **one-step generator**.

The IMLE family consists of generative models designed to produce diverse samples in a single forward pass, conceptually similar to the generator in GANs <d-cite key="goodfellow2020generative"></d-cite>. In our setting, we construct a conditional IMLE model that takes the same context $$C$$ as the teacher flow-matching model and learns to match the teacher’s motion prediction results directly in the data space.
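To make the one-step distillation idea concrete, here is a minimal, hypothetical sketch of a conditional-IMLE training step; the student interface, latent dimension, and tensor shapes are assumptions, not the actual implementation.

```python
import torch


def imle_distill_step(student, teacher_traj, context, num_draws=8, latent_dim=64):
    """One conditional-IMLE distillation step (sketch with assumed shapes).

    teacher_traj: (B, T, D) trajectories sampled from the teacher flow-matching model.
    student(z, context) -> (B, num_draws, T, D), one-step generations from latent codes z.
    For each teacher trajectory, only the closest of the num_draws candidates
    is pulled toward it, so the student stays diverse while matching the
    teacher directly in data space.
    """
    B = teacher_traj.shape[0]
    z = torch.randn(B, num_draws, latent_dim)                  # latent codes, same context C for all draws
    candidates = student(z, context)                           # (B, num_draws, T, D)
    target = teacher_traj.unsqueeze(1)                         # (B, 1, T, D)
    dists = ((candidates - target) ** 2).flatten(2).mean(-1)   # (B, num_draws) mean squared error
    nearest = dists.argmin(dim=1)                              # index of the closest candidate
    loss = dists[torch.arange(B), nearest].mean()              # optimize only the nearest match
    return loss
```

The nearest-candidate selection is what distinguishes IMLE from plain regression onto teacher outputs: pulling every candidate toward the same target would collapse the student toward a single mode.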