
Add Helios (14B minute-scale video generation)#21

Open
dmunch wants to merge 40 commits into Blaizzy:main from dmunch:helios

Conversation

@dmunch
Contributor

dmunch commented Mar 16, 2026

Helios is a 14B-parameter autoregressive video generation model that produces minute-scale, temporally coherent video. This implementation targets the Helios-Distilled variant for text-to-video generation on Apple Silicon via MLX.

Again, there's heavy AI engineering here, so do let me know where we can clean up :)

Copilot Summary

This pull request introduces the initial implementation of the Helios text-to-video model for Apple Silicon, including model architecture, configuration, attention mechanisms, and loading utilities. It also adds comprehensive documentation for setup, conversion, and usage. The changes are grouped below by documentation and codebase additions.

Documentation:

  • Added a detailed README.md for mlx_video/models/helios, providing step-by-step instructions for downloading, converting, quantizing, and running the Helios-Distilled model, as well as an in-depth explanation of the model architecture and generation pipeline.
  • Updated the porting guide to reflect the new location of model-specific diagnostic scripts under mlx_video/models/helios/scripts/.

Helios Model Implementation:

  • Introduced the Helios model configuration in config.py, defining all model, VAE, and scheduler parameters, with support for the distilled variant.
  • Implemented core attention modules in attention.py, including custom RMSNorm and LayerNorm, self-attention with history restriction and 3-way RoPE, and cross-attention with QK normalization and key/value caching.
  • Added model loading utilities in loading.py, supporting quantized model loading and reusing Wan's T5 encoder and VAE components, in line with the Helios architecture.

dmunch and others added 30 commits March 12, 2026 08:39
Add complete Helios video generation support (distilled, T2V):

- Transformer backbone: 40 layers, dim=5120, multi-scale history memory
- 3-stage pyramid denoising: denoise at 1/4 → 1/2 → full resolution
- DMD scheduler with x0-prediction, dynamic shifting, block noise
- Autoregressive 33-frame chunking with short/mid/long history
- Weight conversion from HuggingFace diffusers format
- T5 weight sanitization for UMT5-XXL encoder
- CLI: --pyramid-steps (default 2 2 2), --amplify-first-chunk
- 40 tests covering config, scheduler, RoPE, attention, pyramid helpers
- Usage docs in mlx_video/models/helios/README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two root causes for uniformly grey video output:

1. Dynamic time shift formula was inverted:
   Wrong: mu*t / (mu + (1-mu)*t)
   Correct: mu*t / (1 + (mu-1)*t)
   This caused sigma values at each pyramid stage to be wildly incorrect
   (e.g., 0.61 instead of 0.998 for stage 0).

2. VAE weight keys from HF diffusers format were not mapped to WanVAE
   structure. Added sanitize_helios_vae_weights() with complete key
   mapping: post_quant_conv→conv2, decoder.conv_in→decoder.conv1,
   up_blocks→upsamples flat indexing, resnet sub-key mapping, etc.

Also:
- Added pyramid-aware spatial alignment (latent dims divisible by 8)
- Added patch-size truncation safety in _patchify for odd dimensions
- 46 tests passing (6 new VAE sanitization tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
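The inverted time-shift formula can be checked in a few lines. This is a standalone sketch (function names are mine, not the PR's); the mu and t values are illustrative, chosen to echo the stage-0 numbers quoted above.

```python
def shift_wrong(t: float, mu: float) -> float:
    # Inverted formula that caused the grey output.
    return mu * t / (mu + (1 - mu) * t)

def shift_correct(t: float, mu: float) -> float:
    # Formula matching the reference scheduler.
    return mu * t / (1 + (mu - 1) * t)

# Illustrative values: for mu < 1 and t near 1, the wrong formula
# collapses sigma toward mu instead of staying near 1.
t, mu = 0.998, 0.61
print(shift_wrong(t, mu))    # ~0.61
print(shift_correct(t, mu))  # ~0.997
```

Note that both formulas agree at t=0 and differ most near t=1, which is exactly where early pyramid stages sample, so the bug hit stage 0 hardest.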
- Add classifier-free guidance (CFG) with configurable guidance_scale (default 5.0)
  The distilled model requires CFG for correct color output despite being distilled
- Add negative_prompt parameter for unconditional CFG baseline
- Trim first stride_t-1 (3) warmup frames from VAE decode output
  The causal convolution warmup produces garbage frames at the start
- Add --guidance-scale and --negative-prompt CLI arguments
- Update README with new CLI options

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
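For reference, the CFG combine step described above is the standard classifier-free guidance formula; this sketch uses NumPy and illustrative names rather than the PR's exact API.

```python
import numpy as np

def cfg_combine(noise_cond: np.ndarray,
                noise_uncond: np.ndarray,
                guidance_scale: float = 5.0) -> np.ndarray:
    # Standard CFG: push the prediction along the conditional direction.
    # guidance_scale=1.0 reduces to the conditional prediction alone.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

cond = np.array([1.0, 2.0])
uncond = np.array([0.5, 1.0])
print(cfg_combine(cond, uncond, 5.0))  # [3. 6.]
```

The conditional prediction comes from the text prompt and the unconditional one from the negative prompt (or empty prompt), which is why the PR adds a `negative_prompt` parameter alongside `--guidance-scale`.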
- Fix color distortion: use fresh random noise as start_point for pyramid
  stages > 0 instead of blended signal (prevents variance inflation)
- Add CFG support with guidance_scale parameter (default 1.0 for distilled)
- Cast timesteps to int for text embedding conditioning
- Use float32 precision in scheduler step_dmd to prevent numerical drift
- Set restrict_self_attn=False (full attention matches reference behavior)
- Add keep_first_frame logic for multi-chunk generation
- Add VAE temporal frame trimming (remove causal padding warmup frames)

All 46 tests passing. Generation produces correct colors with motion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three changes to address color bias in pyramid denoising:

1. Normalize DMD start_point per-channel (zero mean, unit std) for stages > 0.
   The blended signal carries mean bias from the previous stage's x0 prediction,
   which cascades through re-noising causing monotonically growing channel means
   (-0.25 -> -0.49 -> -0.83 for ch0). Normalizing preserves spatial structure
   while removing the bias.

2. Cast latents and history to bfloat16 before model calls, matching the
   reference which uses bfloat16 throughout (model trained with bf16 activations).

3. Cast scheduler step_dmd output back to original dtype (bfloat16), matching
   the reference's convert_flow_pred_to_x0 which returns in original_dtype.

Before: R=206, G=107, B=43 (orange bias for beach prompt)
After:  R=152, G=111, B=75 (balanced warm tones)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comprehensive technical reference covering:
- All verified components (transformer 0.999 correlation, VAE, scheduler)
- 8 resolved bugs with root causes and fixes
- Open problems (chunk 2 instability, warm bias, performance)
- Things to watch out for (bfloat16 promotion, VAE offset, history resolution)
- Key constants, formulas, and diagnostic recipes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The start-point normalization added in c5acde72 to fix color bias was
actually breaking the DMD denoising trajectory. The normalization changed
the scale of the noise tensor used in re-noising, destroying the
signal-to-noise ratio that the alpha/beta blending coefficients were
designed for.

The reference implementation (pipeline_helios_diffusers.py) does NOT
normalize start_point — it simply appends the blended latent. The mild
per-channel mean growth across pyramid stages is inherent model behavior.

Changes:
- Revert start_point normalization to match reference: raw append
- Add --debug flag for per-step latent statistics logging
- Update bilinear downsample documentation (already equivalent to
  F.interpolate for 2x integer factors)
- Document Bug 9 in HELIOS-DIAGNOSTICS.md

Test output (seed=42, 'A calm ocean at sunset', 384x640, 33 frames):
- R=114, G=59, B=17 (warm sunset tones)
- Frame-to-frame diff: 3.46 avg (temporally coherent)
- Entropy: 5.54 bits (structured, not noise)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cision

The root cause of the pure noise/uniform color output was in the
zero-history timestep embedding computation. History tokens (81.6% of
all tokens) were getting modulated with the wrong timestep embedding:

- Bug: t0_emb = zeros (all 0s), which produces wrong MLP output
- Fix: t0_emb = [cos(0*freq), sin(0*freq)] = [1,...,1, 0,...,0]
  matching the reference Timesteps(0) sinusoidal encoding

This single bug caused catastrophic divergence from block 0 onward
(cosine similarity -0.30 → 0.999982 after fix).

Also fixes scheduler step_dmd to return float32 (matching reference)
instead of casting back to bfloat16, preventing precision loss across
denoising steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
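The fix above hinges on what a sinusoidal timestep encoding looks like at t=0. A minimal sketch (the dim and the cos-first layout follow the commit's description; exact dimensions in the model differ):

```python
import numpy as np

def timestep_embedding(t: float, dim: int = 8) -> np.ndarray:
    # Diffusers-style sinusoidal encoding, cos half first as the
    # commit describes. At t=0: cos(0)=1 everywhere, sin(0)=0.
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.cos(t * freqs), np.sin(t * freqs)])

print(timestep_embedding(0.0))  # [1. 1. 1. 1. 0. 0. 0. 0.]
```

The buggy version fed a zero vector into the timestep MLP instead of this [1,...,1, 0,...,0] pattern, so every history token was modulated with a garbage embedding.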
At chunk boundaries, the first few latent frames of each new chunk are
blurrier due to lack of temporal context during denoising. This adds a
latent-space blend that mixes boundary frames toward the last sharp frame
of the previous chunk before VAE decode.

Default: --chunk-blend 2 (blends 2 latent frames, ≈8 pixel frames).
Use --chunk-blend 0 to disable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace stat-normalization (which reduced variance/detail) with raw
blend + per-channel mean correction. This preserves the detail transfer
from blending with the sharp reference frame while preventing brightness
shift. With correlated real latents, yields ~1.37x detail improvement
at chunk boundaries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match reference behavior: decode each chunk's 9 latent frames
independently rather than the full concatenated sequence. This avoids
cross-chunk VAE temporal convolution artifacts (grid patterns, brightness
bleeding) that occurred when quality discontinuities at chunk boundaries
hit the causal convolutions.

Changes:
- Per-chunk VAE decode loop with per-chunk warmup trimming
- Tiling config uses per-chunk frame count (33) not full video
- chunk_blend default changed from 2 to 0 (off)
- Blend code retained as opt-in (--chunk-blend N)
- Updated HELIOS-DIAGNOSTICS.md with findings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the reference's AdaptiveAntiDrifting mechanism. Tracks per-channel
latent mean/variance via EMA across chunks. When both drift beyond L2
thresholds (0.15), adds Gaussian noise (default 10%) to the chunk's
latents before saving to history, forcing subsequent chunks to re-anchor
to global statistics.

Usage: --anti-drifting [--anti-drift-strength 0.1]
Off by default (matching reference).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous approach (adding Gaussian noise when drift detected) degraded
output quality and caused camera jumps — noise was applied to both decoded
output AND history, cascading into future chunks.

New approach:
- Clean latents always saved for decoding (no output quality impact)
- When drift detected, normalize only the HISTORY copy's per-channel
  mean/var toward the running EMA (deterministic, no random noise)
- Blend controlled by --anti-drift-blend (0-1, default 0.5)
- Fix EMA update order to match reference (update before detect)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
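A sketch of the revised anti-drifting idea, under my own naming (the real implementation tracks a running EMA across chunks; here the EMA stats are passed in):

```python
import numpy as np

def normalize_history(latents: np.ndarray,
                      ema_mean: np.ndarray,
                      ema_std: np.ndarray,
                      blend: float = 0.5,
                      threshold: float = 0.15) -> np.ndarray:
    # latents: (C, ...) channel-first layout assumed for simplicity.
    # Only the HISTORY copy is normalized; decoded output stays clean.
    axes = tuple(range(1, latents.ndim))
    mean = latents.mean(axis=axes, keepdims=True)
    std = latents.std(axis=axes, keepdims=True)
    drift = np.linalg.norm(mean.ravel() - ema_mean.ravel())
    if drift <= threshold:
        return latents  # no drift detected, pass through unchanged
    # Deterministically pull per-channel stats toward the running EMA.
    target_mean = (1 - blend) * mean + blend * ema_mean.reshape(mean.shape)
    target_std = (1 - blend) * std + blend * ema_std.reshape(std.shape)
    return (latents - mean) / (std + 1e-6) * target_std + target_mean
```

Because the correction is a deterministic re-standardization rather than added Gaussian noise, it cannot inject new content into future chunks, which is the stated reason for abandoning the noise-based approach.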
Two mitigations for the inherent autoregressive chunk boundary artifacts:

1. Enable --amplify-first-chunk by default (matching reference distilled
   script). Doubles DMD steps for the first chunk, providing a better
   spatial anchor for subsequent chunks via history.

2. Add pixel-space cross-fade (--crossfade-frames 4, default ON). Linear
   blend of last N frames of chunk K with first N of chunk K+1. Unlike
   latent-space blending, pixel-space is clean since VAE decode has
   already resolved block noise patterns.

Confirmed NOT a bug: thorough reference comparison shows identical
schedules, indices, history, and DMD steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Helios transformer block was computing residual connections (self-attn,
cross-attn, FFN) entirely in bfloat16. The reference PyTorch implementation
promotes to float32 for these additions (.float() + ...).type_as()).

In bfloat16 (7 bits mantissa), small residual updates from attention/FFN
are systematically truncated. Over 48 blocks × 3 residuals × 6 model calls
per chunk, this truncation preferentially removes high-frequency spatial
content. When these smoothed latents become history for the next chunk,
the effect compounds, producing a progressive 'zooming in' artifact.

Fix: cast to float32 before residual additions, matching the reference:
  x = (x.astype(float32) + attn_out * gate).astype(w_dtype)

Verified via ablation:
- Frozen history (same history all chunks): NO zoom → proves history causes it
- Model forward pass: cos_sim=0.996 vs reference → model correct per-call
- After fix: spatial gradient 2x higher, declining slope 3x smaller

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
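The truncation mechanism is easy to demonstrate. NumPy has no bfloat16, so this sketch uses float16 as a stand-in; bf16 has even fewer mantissa bits (7 vs 10), so the effect there is worse.

```python
import numpy as np

x = np.float16(256.0)      # a "large" residual stream value
update = np.float16(0.1)   # a small attention/FFN contribution

# Low-precision accumulation: the update falls below the ulp at this
# magnitude and is rounded away entirely.
lp = x + update            # still 256.0 in float16

# The fix: promote to float32 for the addition, cast back afterwards,
# mirroring x = (x.astype(float32) + attn_out * gate).astype(w_dtype).
hp = np.float32(x) + np.float32(update)
print(float(lp), float(hp))  # 256.0 vs ~256.1
```

One truncated update is harmless; the commit's point is that 3 residuals per block over dozens of blocks and multiple model calls per chunk turn this rounding into a systematic low-pass filter on the latents.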
Reference pipeline comparison showed cross-fade was causing 40% sharpness
drops at chunk boundaries. Without it, boundaries show the correct pattern:
first frame is sharpest (matching reference's UP +16-21% spikes).

Cross-fade can still be enabled via --crossfade-frames N if desired.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reference: PKU-YuanGroup/Helios#2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each chunk's first frame is a distorted reconstruction of the previous chunk's
last frame via history conditioning. Dropping it gives 32 frames/chunk (exactly
2s at 16fps) and eliminates visible boundary artifacts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The VAE's causal temporal convolutions lack context at the start of each
independently-decoded chunk, causing a ~7% contrast drop in the first few
frames. This manifests as visible brightness jumps at chunk boundaries.

Correct this by matching each non-first chunk's initial frames to the
contrast level of the previous chunk's last frame, with a smooth 6-frame
linear ramp. Reduces the contrast discontinuity from 7% to 1.5% and the
pixel diff ratio from 4.1x to 3.0x (matching the reference pipeline).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tching

Replace the global contrast-only correction with per-channel mean and std
matching. This eliminates the remaining brightness jump (0.9% -> 0.0%) and
color shift (1.8 -> 0.3 pixel max channel shift) at chunk boundaries.
The pixel diff ratio improves to 2.4x, better than the reference's 3.0x.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
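The per-channel mean/std matching can be sketched as follows (function name and array layout are mine; the PR applies this with a ramp over the first frames of each non-first chunk):

```python
import numpy as np

def match_stats(frames: np.ndarray,
                reference: np.ndarray,
                eps: float = 1e-6) -> np.ndarray:
    # frames: (T, H, W, C) first frames of the new chunk
    # reference: (H, W, C) last frame of the previous chunk
    f_mean = frames.mean(axis=(0, 1, 2))
    f_std = frames.std(axis=(0, 1, 2))
    r_mean = reference.mean(axis=(0, 1))
    r_std = reference.std(axis=(0, 1))
    # Re-standardize each channel to the reference statistics.
    return (frames - f_mean) / (f_std + eps) * r_std + r_mean
```

This replaces the earlier contrast-only (std-only) correction; matching the mean as well is what removes the residual brightness jump and per-channel color shift.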
The VAE causal warmup causes not just global contrast drops but spatial
brightness redistribution (face darkens while background brightens).
Fix with a two-stage correction:

1. Low-frequency spatial match: downsample frames 16x, compute per-channel
   brightness difference vs previous chunk's last frame, upsample and apply
   as a smooth additive correction
2. Per-channel contrast scaling to match reference std dev

Results: center brightness shift reduced from -1.85 to -0.15, periphery
from +0.68 to +0.09. Contrast jump improved to -1.0%.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
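Step 1 of the two-stage correction can be sketched with block-average downsampling and nearest-neighbour upsampling (helper names are mine; the PR's resampling filters may differ):

```python
import numpy as np

def lowfreq_correct(frame: np.ndarray,
                    reference: np.ndarray,
                    factor: int = 16) -> np.ndarray:
    # frame, reference: (H, W, C). Compute a 16x-downsampled per-channel
    # brightness difference and add it back as a smooth correction.
    h, w, c = frame.shape
    bh, bw = h // factor, w // factor

    def down(x: np.ndarray) -> np.ndarray:
        # Block-average downsample by `factor` in each spatial dim.
        return x[:bh * factor, :bw * factor].reshape(
            bh, factor, bw, factor, c).mean(axis=(1, 3))

    diff = down(reference) - down(frame)
    # Upsample the low-frequency difference back to full resolution.
    up = np.repeat(np.repeat(diff, factor, axis=0), factor, axis=1)
    return frame + up
```

Because only the 16x-downsampled difference is transferred, high-frequency detail in the corrected frame is untouched; the correction can brighten a face while leaving its texture alone, which a global gain cannot do.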
- Add scripts/helios/ with 4 consolidated diagnostic tools:
  analyze_boundaries.py, run_reference.py, compare_pipelines.py, compare_models.py
- Update boundary quality section with full fix chain and metrics table
- Document failed VAE overlap decode approach
- Add diagnostic recipes section referencing new scripts
- Update file layout and commit history appendix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove dead RoPE computation in restricted self-attention path
  (helios_rope_apply was called but results were immediately discarded)
- Fuse V projection: single self.v() call instead of two separate calls
- Cache constant t=0 timestep embedding in __init__ instead of
  recomputing every forward pass (saves 4 linear layers per call)
- Replace concatenate+repeat padding with mx.pad(mode='edge') in
  rope.py and transformer.py _patchify_history
- Add mx.compile support to generate_helios.py with --no-compile flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
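The padding replacement is a pure equivalence: repeating the last element and concatenating is the same as edge padding. Shown here with NumPy (MLX's `mx.pad` accepts the same `'edge'` mode per the commit):

```python
import numpy as np

x = np.array([[1, 2, 3]])

# Old approach: concatenate + repeat of the last column.
old = np.concatenate([x, np.repeat(x[:, -1:], 2, axis=1)], axis=1)

# New approach: a single edge-mode pad on the trailing axis.
new = np.pad(x, ((0, 0), (0, 2)), mode='edge')

print(old)  # [[1 2 3 3 3]]
print(new)  # [[1 2 3 3 3]]
```

A single pad op gives the framework one kernel to fuse instead of a repeat plus a concatenate, which is the motivation alongside the readability win.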
When --amplify-first-chunk is enabled (default), the first chunk
gets doubled steps per stage. The progress bar now reflects this
so it shows 12/12 instead of 12/6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OPT-9: Pre-convert timesteps/sigmas to Python lists before denoising
loop, avoiding .item() and float() sync points that force GPU evaluation.

OPT-10: Remove intermediate mx.eval(noise_pred) — the full computation
graph (model forward + scheduler step) is now evaluated in a single
mx.eval(latents) call, giving MLX maximum fusion opportunity.

OPT-11: scheduler.step_dmd() accepts optional sigma_t/sigma_next as
Python floats, bypassing float(self.sigmas[idx]) sync points.

OPT-12: Batch modulation dtype cast in transformer block — cast the
full mod tensor to w_dtype once instead of 6 individual casts per block
(240 unnecessary casts eliminated across 40 blocks).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dmunch and others added 10 commits March 12, 2026 09:45
The cached t=0 timestep projection was being computed in __init__
before load_weights() populated the trained parameters. This meant
the history conditioning used random/initial weights instead of the
actual trained time_embedding and time_projection layers, causing:
- Degraded colors in the first chunk
- Loss of inter-chunk continuity (mostly solid color)

Fix: make _t0_proj lazy — computed and cached on first forward pass
when the trained weights are guaranteed to be present.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
History tensors (h_short, h_mid, h_long) are constant within each
pyramid stage but were being cast to bfloat16 on every denoising
step. Move the casts before the loop to avoid redundant work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a 'quantize' subcommand to convert_helios.py that takes an existing
MLX model directory and produces a quantized copy:

  python -m mlx_video.convert_helios quantize <model_dir> [--bits 4]

Creates <model_dir>-4bit/ by default (customizable via --output-dir).
The generate script auto-detects quantization from config.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace subparser-based CLI with flat --flag pattern matching
convert_wan.py:
- --checkpoint-dir / --output-dir (named args, not positional)
- --quantize-only flag (instead of separate 'quantize' subcommand)
- --bits choices=[4,8], --group-size choices=[32,64,128]
- Smart copy: only copies non-transformer files for quantize-only
- source_dir support in _quantize_saved_model

Update README.md examples to match new CLI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>