_posts/2025-08-17-cogen-motion.md (3 additions & 3 deletions)
@@ -125,7 +125,7 @@ where $$v_\theta$$ is the parametrized vector field approximating the straight f
### Data-Space Predictive Learning Objectives
From an engineering standpoint, a somewhat **bitter lesson** we encountered is that **existing predictive learning objectives remain remarkably strong**. Despite the appeal of noise-prediction formulations (e.g., $\epsilon$-prediction introduced in DDPM <d-cite key="ho2020denoising"></d-cite> and later adopted in flow matching <d-cite key="lipman2022flow"></d-cite>), straightforward predictive objectives in the data space—such as direct $$\hat{x}_0$$ reconstruction in DDPM notation<d-footnote>
- Note that we follow the flow matching notations in <d-cite key="lipman2022flow"></d-cite> to use $t=1$ as the data distribution and $t=0$ as the noise distribution, which is opposite to the original DDPM notations in <d-cite key="ho2020denoising"></d-cite>.</d-footnote>—consistently yielded more stable convergence.
+ Note that we follow the flow matching notations in <d-cite key="lipman2022flow"></d-cite> to use $t=1$ as the data distribution and $t=0$ as the noise distribution, which is opposite to the original DDPM notations in <d-cite key="ho2020denoising"></d-cite>.</d-footnote>—consistently yield more stable convergence.
Concretely, by rearranging the original linear flow objective, we define a neural network
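A minimal sketch of this data-space predictive objective, assuming a linear interpolation path $$x_t = t\,x_1 + (1-t)\,x_0$$ with $$t=1$$ as data and $$t=0$$ as noise, and a hypothetical PyTorch-style interface `x1_model(x_t, t)`; this is illustrative rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def data_space_predictive_loss(x1_model, x1, t):
    """Data-space predictive loss for a linear flow (sketch).

    x1: clean data sample, e.g. a future trajectory, shape (B, ...); the t = 1 end.
    t:  interpolation times in [0, 1], shape (B,).
    x1_model(x_t, t) is a hypothetical network that predicts x1 directly.
    """
    x0 = torch.randn_like(x1)                     # noise sample, the t = 0 end
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))     # broadcast t over the data dims
    x_t = t_b * x1 + (1.0 - t_b) * x0             # linear interpolation x_t = t*x1 + (1-t)*x0
    x1_hat = x1_model(x_t, t)                     # direct prediction of the clean sample
    return F.mse_loss(x1_hat, x1)                 # regress in data space
```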
@@ -143,7 +143,7 @@ $$
Our empirical observation is that data-space predictive learning objectives outperform denoising objectives. We argue that this advantage is largely driven by the current evaluation protocol, which heavily rewards model outputs that are close to the ground truth.
During training, the original denoising target matches the vector field $Y^1 - Y^0$, defined as the difference between the data sample (future trajectory) and the noise sample (drawn from the noise distribution). Under the current proximity-based metrics, this objective is harder to optimize than the predictive objective because of the stochasticity introduced by $Y^0$, as the metrics do not adequately reward diverse forecasting. Moreover, during the sampling process, small errors in the vector field model $v_\theta$—measured with respect to the single ground-truth velocity field at intermediate time steps—can be amplified through subsequent iterative steps. Consequently, increasing inference-time compute may not necessarily improve results without incorporating regularization from the data-space loss <d-footnote>
- Interestingly, in our experiments, we found that flow-matching ODEs—thanks to their less noisy inference process—usually perform more stably than diffusion-model SDEs, which is somewhat surprising. In image generation, as shown in SiT <d-cite key="ma2024sit"></d-cite>, ODE-based samplers are generally weaker than SDE-based samplers.
+ Interestingly, in our experiments, we found that flow-matching ODEs—thanks to their less noisy inference process—usually perform more stably than diffusion-model SDEs, which is surprising. In image generation, as shown in SiT <d-cite key="ma2024sit"></d-cite>, ODE-based samplers are generally weaker than SDE-based samplers.
</d-footnote>.
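To make the error-accumulation argument concrete, here is a minimal Euler simulation of the flow-matching ODE. It is a sketch that assumes the learned vector field is exposed as `v_theta(y, t, context)` and integrated with a fixed step size, not the actual sampler used in the paper; the point it illustrates is that a biased velocity estimate at any intermediate step is carried into every later step.

```python
import torch

@torch.no_grad()
def sample_flow_ode(v_theta, context, y0, num_steps=20):
    """Euler simulation of dY/dt = v_theta(Y, t, C), from noise (t = 0) to data (t = 1).

    y0 is the noise sample Y^0 with the same shape as the predicted trajectory.
    Errors in v_theta at an intermediate step are reused by all later steps,
    which is why extra steps alone need not improve proximity-based metrics.
    """
    y = y0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((y.shape[0],), i * dt, device=y.device)  # current time for the batch
        y = y + dt * v_theta(y, t, context)                      # follow the estimated velocity
    return y
```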
### Joint Multi-Modal Learning Losses
@@ -164,7 +164,7 @@ We acknowledge that some prior works, such as MotionLM <d-cite key="seff2023moti
## Exploring Inference Acceleration
- To accelerate inference in flow-matching models, which typically require tens or even hundreds of iterations for ODE simulation, we adopt a somewhat underrated idea from the image generation literature: conditional **IMLE (implicit maximum likelihood estimation)** <d-cite key="li2018implicit, li2019diverse"></d-cite>. IMLE provides a way to distill an iterative generative model into a **one-step generator**.
+ To accelerate inference in flow-matching models, which typically require tens or even hundreds of iterations for ODE simulation, we adopt an underrated idea from the image generation literature: conditional **IMLE (implicit maximum likelihood estimation)** <d-cite key="li2018implicit, li2019diverse"></d-cite>. IMLE provides a way to distill an iterative generative model into a **one-step generator**.
The IMLE family consists of generative models designed to produce diverse samples in a single forward pass, conceptually similar to the generator in GANs <d-cite key="goodfellow2020generative"></d-cite>. In our setting, we construct a conditional IMLE model that takes the same context $$C$$ as the teacher flow-matching model and learns to match the teacher’s motion prediction results directly in the data space.
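Below is a minimal sketch of what one conditional IMLE distillation step could look like, assuming a hypothetical `generator(z, context)` interface, illustrative tensor shapes, and squared error as the distance. The key property is that only the candidate nearest to the teacher's sample receives gradient, which is how IMLE encourages the one-step generator to cover the teacher's diverse outputs.

```python
import torch

def conditional_imle_step(generator, teacher_pred, context, latent_dim, m=16):
    """One conditional IMLE distillation step (sketch, illustrative shapes).

    teacher_pred: motion prediction sampled from the flow-matching teacher, (B, T, D).
    context:      shared conditioning C, (B, F).
    generator(z, ctx): hypothetical one-step generator returning (B, m, T, D).
    """
    B = context.shape[0]
    z = torch.randn(B, m, latent_dim, device=context.device)   # m latent codes per context
    ctx = context.unsqueeze(1).expand(-1, m, -1)                # repeat C for every latent
    candidates = generator(z, ctx)                              # one forward pass, (B, m, T, D)
    target = teacher_pred.unsqueeze(1)                          # (B, 1, T, D)
    dists = ((candidates - target) ** 2).flatten(2).mean(-1)    # (B, m) squared error per candidate
    loss = dists.min(dim=1).values.mean()                       # only the nearest candidate gets gradient
    return loss
```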