Add Helios (14B minute-scale video generation)#21
Open
dmunch wants to merge 40 commits into Blaizzy:main from
Conversation
Add complete Helios video generation support (distilled, T2V):

- Transformer backbone: 40 layers, dim=5120, multi-scale history memory
- 3-stage pyramid denoising: denoise at 1/4 → 1/2 → full resolution
- DMD scheduler with x0-prediction, dynamic shifting, block noise
- Autoregressive 33-frame chunking with short/mid/long history
- Weight conversion from HuggingFace diffusers format
- T5 weight sanitization for the UMT5-XXL encoder
- CLI: --pyramid-steps (default 2 2 2), --amplify-first-chunk
- 40 tests covering config, scheduler, RoPE, attention, pyramid helpers
- Usage docs in mlx_video/models/helios/README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
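The 3-stage pyramid schedule mentioned above can be sketched as a simple resolution ladder (function name hypothetical; the real implementation operates on latent shapes):

```python
def pyramid_stage_shapes(height: int, width: int, scales=(4, 2, 1)):
    """Illustrative sketch of the 3-stage pyramid schedule:
    denoise at 1/4, then 1/2, then full resolution."""
    return [(height // s, width // s) for s in scales]

# e.g. for a 384x640 target, the stages would be:
assert pyramid_stage_shapes(384, 640) == [(96, 160), (192, 320), (384, 640)]
```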
Two root causes for uniformly grey video output:

1. The dynamic time shift formula was inverted:
   - Wrong: mu*t / (mu + (1-mu)*t)
   - Correct: mu*t / (1 + (mu-1)*t)
   This caused sigma values at each pyramid stage to be wildly incorrect (e.g., 0.61 instead of 0.998 for stage 0).
2. VAE weight keys from the HF diffusers format were not mapped to the WanVAE structure. Added sanitize_helios_vae_weights() with a complete key mapping: post_quant_conv→conv2, decoder.conv_in→decoder.conv1, up_blocks→upsamples flat indexing, resnet sub-key mapping, etc.

Also:
- Added pyramid-aware spatial alignment (latent dims divisible by 8)
- Added patch-size truncation safety in _patchify for odd dimensions
- 46 tests passing (6 new VAE sanitization tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
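The two shift formulas behave very differently at the endpoints, which is why the bug was so destructive. A minimal sketch (function names hypothetical):

```python
def dynamic_shift(t: float, mu: float) -> float:
    """Corrected dynamic time shift: mu*t / (1 + (mu-1)*t)."""
    return mu * t / (1 + (mu - 1) * t)

def dynamic_shift_inverted(t: float, mu: float) -> float:
    """The inverted (buggy) form: mu*t / (mu + (1-mu)*t)."""
    return mu * t / (mu + (1 - mu) * t)

# The correct form keeps both endpoints fixed (shift(0)=0, shift(1)=1)
# while pulling intermediate sigmas toward 1 for mu > 1:
assert dynamic_shift(0.0, 3.0) == 0.0
assert dynamic_shift(1.0, 3.0) == 1.0
assert dynamic_shift(0.5, 3.0) == 0.75
# The inverted form even escapes the [0, 1] range at t=1:
assert dynamic_shift_inverted(1.0, 3.0) == 3.0
```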
- Add classifier-free guidance (CFG) with a configurable guidance_scale (default 5.0). The distilled model requires CFG for correct color output despite being distilled.
- Add negative_prompt parameter for the unconditional CFG baseline.
- Trim the first stride_t-1 (3) warmup frames from the VAE decode output; the causal convolution warmup produces garbage frames at the start.
- Add --guidance-scale and --negative-prompt CLI arguments.
- Update README with the new CLI options.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix color distortion: use fresh random noise as start_point for pyramid stages > 0 instead of the blended signal (prevents variance inflation)
- Add CFG support with a guidance_scale parameter (default 1.0 for distilled)
- Cast timesteps to int for text embedding conditioning
- Use float32 precision in scheduler step_dmd to prevent numerical drift
- Set restrict_self_attn=False (full attention matches reference behavior)
- Add keep_first_frame logic for multi-chunk generation
- Add VAE temporal frame trimming (remove causal padding warmup frames)

All 46 tests passing. Generation produces correct colors with motion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three changes to address color bias in pyramid denoising:

1. Normalize the DMD start_point per-channel (zero mean, unit std) for stages > 0. The blended signal carries mean bias from the previous stage's x0 prediction, which cascades through re-noising, causing monotonically growing channel means (-0.25 → -0.49 → -0.83 for ch0). Normalizing preserves spatial structure while removing the bias.
2. Cast latents and history to bfloat16 before model calls, matching the reference, which uses bfloat16 throughout (the model was trained with bf16 activations).
3. Cast the scheduler step_dmd output back to the original dtype (bfloat16), matching the reference's convert_flow_pred_to_x0, which returns in original_dtype.

Before: R=206, G=107, B=43 (orange bias for a beach prompt)
After: R=152, G=111, B=75 (balanced warm tones)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comprehensive technical reference covering:
- All verified components (transformer 0.999 correlation, VAE, scheduler)
- 8 resolved bugs with root causes and fixes
- Open problems (chunk 2 instability, warm bias, performance)
- Things to watch out for (bfloat16 promotion, VAE offset, history resolution)
- Key constants, formulas, and diagnostic recipes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The start-point normalization added in c5acde72 to fix color bias was actually breaking the DMD denoising trajectory. The normalization changed the scale of the noise tensor used in re-noising, destroying the signal-to-noise ratio that the alpha/beta blending coefficients were designed for. The reference implementation (pipeline_helios_diffusers.py) does NOT normalize start_point; it simply appends the blended latent. The mild per-channel mean growth across pyramid stages is inherent model behavior.

Changes:
- Revert the start_point normalization to match the reference: raw append
- Add a --debug flag for per-step latent statistics logging
- Update bilinear downsample documentation (already equivalent to F.interpolate for 2x integer factors)
- Document Bug 9 in HELIOS-DIAGNOSTICS.md

Test output (seed=42, 'A calm ocean at sunset', 384x640, 33 frames):
- R=114, G=59, B=17 (warm sunset tones)
- Frame-to-frame diff: 3.46 avg (temporally coherent)
- Entropy: 5.54 bits (structured, not noise)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cision

The root cause of the pure noise / uniform color output was in the zero-history timestep embedding computation. History tokens (81.6% of all tokens) were being modulated with the wrong timestep embedding:

- Bug: t0_emb = zeros (all 0s), which produces the wrong MLP output
- Fix: t0_emb = [cos(0*freq), sin(0*freq)] = [1,...,1, 0,...,0], matching the reference Timesteps(0) sinusoidal encoding

This single bug caused catastrophic divergence from block 0 onward (cosine similarity -0.30 → 0.999982 after the fix).

Also fixes scheduler step_dmd to return float32 (matching the reference) instead of casting back to bfloat16, preventing precision loss across denoising steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
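A minimal sketch of why Timesteps(0) is not the zero vector, assuming the [cos | sin] ordering described in the commit message (the function name and max_period are illustrative):

```python
import numpy as np

def sinusoidal_timestep_embedding(t: float, dim: int, max_period: float = 10000.0) -> np.ndarray:
    """Sketch of a sinusoidal timestep encoding with the cos half first,
    as described in the commit message above."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.cos(angles), np.sin(angles)])

emb = sinusoidal_timestep_embedding(0.0, dim=8)
# At t=0 every angle is 0, so the encoding is [1,...,1, 0,...,0] --
# not the all-zeros vector the buggy code fed into the time-embedding MLP.
assert np.allclose(emb, [1, 1, 1, 1, 0, 0, 0, 0])
```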
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
At chunk boundaries, the first few latent frames of each new chunk are blurrier due to lack of temporal context during denoising. This adds a latent-space blend that mixes boundary frames toward the last sharp frame of the previous chunk before VAE decode.

Default: --chunk-blend 2 (blends 2 latent frames, ≈8 pixel frames). Use --chunk-blend 0 to disable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
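A sketch of such a boundary blend (function name, tensor layout, and the decaying weight schedule are assumptions, not the PR's exact code):

```python
import numpy as np

def blend_chunk_boundary(latents: np.ndarray, ref_frame: np.ndarray, n_blend: int = 2) -> np.ndarray:
    """Hypothetical sketch: mix the first n_blend latent frames of a new
    chunk toward the previous chunk's last sharp frame, with the blend
    weight decaying away from the boundary.

    latents: (T, C, H, W) latents of the new chunk
    ref_frame: (C, H, W) last sharp latent frame of the previous chunk
    """
    out = latents.copy()
    for i in range(min(n_blend, len(latents))):
        w = (n_blend - i) / (n_blend + 1)  # e.g. n_blend=2 -> weights 2/3, 1/3
        out[i] = w * ref_frame + (1 - w) * out[i]
    return out

latents = np.zeros((4, 1, 2, 2))
ref = np.ones((1, 2, 2))
blended = blend_chunk_boundary(latents, ref, n_blend=2)
assert np.allclose(blended[0], 2 / 3)
assert np.allclose(blended[1], 1 / 3)
assert np.allclose(blended[2], 0.0)  # frames past n_blend untouched
```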
Replace the stat-normalization (which reduced variance/detail) with a raw blend plus per-channel mean correction. This preserves the detail transfer from blending with the sharp reference frame while preventing brightness shift. With correlated real latents, this yields a ~1.37x detail improvement at chunk boundaries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match reference behavior: decode each chunk's 9 latent frames independently rather than the full concatenated sequence. This avoids cross-chunk VAE temporal convolution artifacts (grid patterns, brightness bleeding) that occurred when quality discontinuities at chunk boundaries hit the causal convolutions.

Changes:
- Per-chunk VAE decode loop with per-chunk warmup trimming
- Tiling config uses the per-chunk frame count (33), not the full video
- chunk_blend default changed from 2 to 0 (off)
- Blend code retained as opt-in (--chunk-blend N)
- Updated HELIOS-DIAGNOSTICS.md with findings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the reference's AdaptiveAntiDrifting mechanism. It tracks per-channel latent mean/variance via an EMA across chunks. When both drift beyond the L2 threshold (0.15), it adds Gaussian noise (default 10%) to the chunk's latents before saving them to history, forcing subsequent chunks to re-anchor to global statistics.

Usage: --anti-drifting [--anti-drift-strength 0.1]

Off by default (matching the reference).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
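A sketch of this mechanism as described in this commit (class name, EMA decay, and tensor layout are assumptions; a later commit replaces the noise injection with history normalization):

```python
import numpy as np

class AntiDriftTracker:
    """Hypothetical sketch: track per-channel latent mean/var with an EMA
    across chunks; when both drift beyond an L2 threshold, inject Gaussian
    noise into the latents saved to history."""

    def __init__(self, decay: float = 0.9, threshold: float = 0.15, strength: float = 0.1):
        self.decay, self.threshold, self.strength = decay, threshold, strength
        self.ema_mean = None
        self.ema_var = None

    def update(self, latents: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        # Per-channel statistics over (T, H, W); latents shaped (T, C, H, W).
        mean = latents.mean(axis=(0, 2, 3))
        var = latents.var(axis=(0, 2, 3))
        if self.ema_mean is None:
            self.ema_mean, self.ema_var = mean, var
            return latents
        self.ema_mean = self.decay * self.ema_mean + (1 - self.decay) * mean
        self.ema_var = self.decay * self.ema_var + (1 - self.decay) * var
        mean_drift = np.linalg.norm(mean - self.ema_mean)
        var_drift = np.linalg.norm(var - self.ema_var)
        if mean_drift > self.threshold and var_drift > self.threshold:
            # Re-anchor: perturb what goes into history so later chunks
            # cannot lock onto the drifted statistics.
            latents = latents + self.strength * rng.standard_normal(latents.shape)
        return latents

tracker = AntiDriftTracker()
rng = np.random.default_rng(0)
chunk1 = rng.standard_normal((2, 3, 8, 8))
out1 = tracker.update(chunk1, rng)   # first chunk just initializes the EMA
chunk2 = 2 * chunk1 + 10             # drifted mean AND variance
out2 = tracker.update(chunk2, rng)
assert np.array_equal(out1, chunk1)
assert not np.allclose(out2, chunk2)  # drift detected -> noise injected
```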
The previous approach (adding Gaussian noise when drift is detected) degraded output quality and caused camera jumps: noise was applied to both the decoded output AND the history, cascading into future chunks.

New approach:
- Clean latents are always saved for decoding (no output quality impact)
- When drift is detected, normalize only the HISTORY copy's per-channel mean/var toward the running EMA (deterministic, no random noise)
- Blend controlled by --anti-drift-blend (0-1, default 0.5)
- Fix EMA update order to match the reference (update before detect)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two mitigations for the inherent autoregressive chunk boundary artifacts:

1. Enable --amplify-first-chunk by default (matching the reference distilled script). This doubles DMD steps for the first chunk, providing a better spatial anchor for subsequent chunks via history.
2. Add a pixel-space cross-fade (--crossfade-frames 4, default ON): a linear blend of the last N frames of chunk K with the first N of chunk K+1. Unlike latent-space blending, pixel-space blending is clean since the VAE decode has already resolved block noise patterns.

Confirmed NOT a bug: a thorough reference comparison shows identical schedules, indices, history, and DMD steps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
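A sketch of a pixel-space cross-fade (function name and layout hypothetical; this version assumes the two chunks overlap by n frames at the seam and merges them):

```python
import numpy as np

def crossfade_chunks(chunk_a: np.ndarray, chunk_b: np.ndarray, n: int = 4) -> np.ndarray:
    """Hypothetical sketch: linearly blend the last n frames of chunk_a
    with the first n frames of chunk_b, then append chunk_b's remainder.

    chunk_a, chunk_b: (T, H, W, C) decoded pixel frames
    """
    head = chunk_a[:-n] if n else chunk_a
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # fade weight ramps from A toward B
        blended.append((1 - w) * chunk_a[-n + i] + w * chunk_b[i])
    return np.concatenate([head, np.stack(blended), chunk_b[n:]], axis=0)

a = np.zeros((8, 2, 2, 3))
b = np.ones((8, 2, 2, 3))
out = crossfade_chunks(a, b, n=4)
assert out.shape[0] == 12  # 8 + 8 - 4 overlapping frames
assert np.allclose(out[4], 0.2) and np.allclose(out[7], 0.8)
```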
The Helios transformer block was computing residual connections (self-attn, cross-attn, FFN) entirely in bfloat16. The reference PyTorch implementation promotes to float32 for these additions (.float() + ... .type_as()). In bfloat16 (7 mantissa bits), small residual updates from attention/FFN are systematically truncated. Over 48 blocks × 3 residuals × 6 model calls per chunk, this truncation preferentially removes high-frequency spatial content. When these smoothed latents become history for the next chunk, the effect compounds, producing a progressive 'zooming in' artifact.

Fix: cast to float32 before the residual additions, matching the reference:

    x = (x.astype(float32) + attn_out * gate).astype(w_dtype)

Verified via ablation:
- Frozen history (same history for all chunks): NO zoom → proves history causes it
- Model forward pass: cos_sim=0.996 vs reference → model correct per-call
- After the fix: spatial gradient 2x higher, declining slope 3x smaller

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
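The truncation effect is easy to demonstrate. NumPy has no bfloat16, so this sketch uses float16 (the same phenomenon, different mantissa width): each residual update is smaller than the spacing between representable values near 1.0, so the half-precision running sum never moves, while accumulating in float32 and casting once at the end preserves it.

```python
import numpy as np

residual = 1e-4  # a "small residual update", below float16 spacing at 1.0
x16 = np.float16(1.0)
x32 = np.float32(1.0)
for _ in range(100):
    x16 = np.float16(x16 + np.float16(residual))  # rounded away every step
    x32 = x32 + np.float32(residual)              # accumulated in fp32

assert x16 == np.float16(1.0)       # all 100 updates silently lost
assert float(x32) > 1.009           # fp32 kept the accumulated signal
assert abs(float(np.float16(x32)) - 1.01) < 1e-2  # cast once at the end
```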
A reference pipeline comparison showed the cross-fade was causing 40% sharpness drops at chunk boundaries. Without it, boundaries show the correct pattern: the first frame is sharpest (matching the reference's UP +16-21% spikes). Cross-fade can still be enabled via --crossfade-frames N if desired.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reference: PKU-YuanGroup/Helios#2 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each chunk's first frame is a distorted reconstruction of the previous chunk's last frame via history conditioning. Dropping it gives 32 frames/chunk (exactly 2s at 16fps) and eliminates visible boundary artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The VAE's causal temporal convolutions lack context at the start of each independently-decoded chunk, causing a ~7% contrast drop in the first few frames. This manifests as visible brightness jumps at chunk boundaries. Correct this by matching each non-first chunk's initial frames to the contrast level of the previous chunk's last frame, with a smooth 6-frame linear ramp. Reduces the contrast discontinuity from 7% to 1.5% and the pixel diff ratio from 4.1x to 3.0x (matching the reference pipeline). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tching

Replace the global contrast-only correction with per-channel mean and std matching. This eliminates the remaining brightness jump (0.9% -> 0.0%) and color shift (1.8 -> 0.3 pixel max channel shift) at chunk boundaries. The pixel diff ratio improves to 2.4x, better than the reference's 3.0x.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
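Per-channel mean/std matching can be sketched as follows (function name and frame layout are assumptions, not the PR's exact code):

```python
import numpy as np

def match_mean_std(frame: np.ndarray, ref: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical sketch: rescale each channel of `frame` so its mean
    and std match `ref`, removing the residual brightness/color jump.

    frame, ref: (H, W, C) pixel frames
    """
    out = np.empty_like(frame, dtype=np.float64)
    for c in range(frame.shape[-1]):
        f, r = frame[..., c], ref[..., c]
        scale = r.std() / (f.std() + eps)
        out[..., c] = (f - f.mean()) * scale + r.mean()
    return out

rng = np.random.default_rng(1)
ref = rng.normal(120.0, 30.0, (16, 16, 3))
frame = rng.normal(100.0, 20.0, (16, 16, 3))
matched = match_mean_std(frame, ref)
for c in range(3):
    assert abs(matched[..., c].mean() - ref[..., c].mean()) < 1e-6
    assert abs(matched[..., c].std() - ref[..., c].std()) < 1e-3
```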
The VAE causal warmup causes not just global contrast drops but spatial brightness redistribution (the face darkens while the background brightens). Fix with a two-stage correction:

1. Low-frequency spatial match: downsample frames 16x, compute the per-channel brightness difference vs the previous chunk's last frame, upsample, and apply it as a smooth additive correction
2. Per-channel contrast scaling to match the reference std dev

Results: center brightness shift reduced from -1.85 to -0.15, periphery from +0.68 to +0.09. Contrast jump improved to -1.0%.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
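Stage 1 can be sketched as follows (function name hypothetical; nearest-neighbor upsampling is used here for brevity, where a real implementation would use bilinear upsampling to keep the correction smooth):

```python
import numpy as np

def lowfreq_brightness_correction(frame: np.ndarray, ref: np.ndarray, factor: int = 16) -> np.ndarray:
    """Hypothetical sketch of the low-frequency spatial match: block-average
    both frames by `factor`, take the per-channel brightness difference,
    upsample it, and add it back as an additive correction.

    frame, ref: (H, W, C) with H, W divisible by `factor`
    """
    h, w, c = frame.shape

    def block_mean(x: np.ndarray) -> np.ndarray:
        return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

    diff = block_mean(ref) - block_mean(frame)  # low-frequency mismatch
    up = np.repeat(np.repeat(diff, factor, axis=0), factor, axis=1)
    return frame + up

ref = np.full((32, 32, 3), 100.0)
frame = np.full((32, 32, 3), 90.0)
frame[:16] = 95.0  # spatially non-uniform brightness error
out = lowfreq_brightness_correction(frame, ref)
assert np.allclose(out, 100.0)  # both halves corrected toward the reference
```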
- Add scripts/helios/ with 4 consolidated diagnostic tools: analyze_boundaries.py, run_reference.py, compare_pipelines.py, compare_models.py
- Update the boundary quality section with the full fix chain and a metrics table
- Document the failed VAE overlap decode approach
- Add a diagnostic recipes section referencing the new scripts
- Update the file layout and commit history appendix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove dead RoPE computation in the restricted self-attention path (helios_rope_apply was called but its results were immediately discarded)
- Fuse the V projection: a single self.v() call instead of two separate calls
- Cache the constant t=0 timestep embedding in __init__ instead of recomputing it every forward pass (saves 4 linear layers per call)
- Replace concatenate+repeat padding with mx.pad(mode='edge') in rope.py and transformer.py _patchify_history
- Add mx.compile support to generate_helios.py with a --no-compile flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When --amplify-first-chunk is enabled (default), the first chunk gets doubled steps per stage. The progress bar now reflects this so it shows 12/12 instead of 12/6. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- OPT-9: Pre-convert timesteps/sigmas to Python lists before the denoising loop, avoiding the .item() and float() sync points that force GPU evaluation.
- OPT-10: Remove the intermediate mx.eval(noise_pred); the full computation graph (model forward + scheduler step) is now evaluated in a single mx.eval(latents) call, giving MLX maximum fusion opportunity.
- OPT-11: scheduler.step_dmd() accepts optional sigma_t/sigma_next as Python floats, bypassing float(self.sigmas[idx]) sync points.
- OPT-12: Batch the modulation dtype cast in the transformer block: cast the full mod tensor to w_dtype once instead of 6 individual casts per block (240 unnecessary casts eliminated across 40 blocks).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The cached t=0 timestep projection was being computed in __init__, before load_weights() populated the trained parameters. This meant the history conditioning used random/initial weights instead of the actual trained time_embedding and time_projection layers, causing:

- Degraded colors in the first chunk
- Loss of inter-chunk continuity (mostly solid color)

Fix: make _t0_proj lazy, computed and cached on the first forward pass, when the trained weights are guaranteed to be present.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
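The lazy-cache pattern can be sketched as follows (class, attribute, and method names are hypothetical; a toy matrix stands in for the trained time-embedding layers):

```python
import numpy as np

class TimeEmbedder:
    """Hypothetical sketch of the lazy t=0 cache. Computing the cached
    projection in __init__ would freeze the output of the *initial*
    weights; deferring it to first use guarantees the weights installed
    by load_weights() are the ones projected."""

    def __init__(self, dim: int = 8):
        self.weight = np.zeros((dim, dim))  # placeholder until weights load
        self._t0_proj = None                # lazily computed cache

    def load_weights(self, weight: np.ndarray) -> None:
        self.weight = weight

    def t0_proj(self) -> np.ndarray:
        if self._t0_proj is None:
            half = self.weight.shape[0] // 2
            # Timesteps(0) sinusoidal encoding: [1,...,1, 0,...,0]
            t0_emb = np.concatenate([np.ones(half), np.zeros(half)])
            self._t0_proj = self.weight @ t0_emb
        return self._t0_proj

emb = TimeEmbedder(dim=4)
emb.load_weights(np.eye(4))  # weights arrive AFTER __init__, as in the bug
assert np.allclose(emb.t0_proj(), [1, 1, 0, 0])  # trained weights used
```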
History tensors (h_short, h_mid, h_long) are constant within each pyramid stage but were being cast to bfloat16 on every denoising step. Move the casts before the loop to avoid redundant work. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a 'quantize' subcommand to convert_helios.py that takes an existing MLX model directory and produces a quantized copy:

    python -m mlx_video.convert_helios quantize <model_dir> [--bits 4]

Creates <model_dir>-4bit/ by default (customizable via --output-dir). The generate script auto-detects quantization from config.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the subparser-based CLI with a flat --flag pattern matching convert_wan.py:

- --checkpoint-dir / --output-dir (named args, not positional)
- --quantize-only flag (instead of a separate 'quantize' subcommand)
- --bits choices=[4,8], --group-size choices=[32,64,128]
- Smart copy: only copies non-transformer files for quantize-only
- source_dir support in _quantize_saved_model

Update README.md examples to match the new CLI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Helios is a 14B-parameter autoregressive video generation model that produces minute-scale, temporally coherent video. This implementation targets the Helios-Distilled variant for text-to-video generation on Apple Silicon via MLX.
Again, heavy AI engineering here, so do let me know where we can clean up :)
Copilot Summary
This pull request introduces the initial implementation of the Helios text-to-video model for Apple Silicon, including model architecture, configuration, attention mechanisms, and loading utilities. It also adds comprehensive documentation for setup, conversion, and usage. The changes are grouped below by documentation and codebase additions.
Documentation:
- README.md for mlx_video/models/helios, providing step-by-step instructions for downloading, converting, quantizing, and running the Helios-Distilled model, as well as an in-depth explanation of the model architecture and generation pipeline.
- Scripts in mlx_video/models/helios/scripts/.

Helios Model Implementation:
- config.py, defining all model, VAE, and scheduler parameters, with support for the distilled variant.
- attention.py, including custom RMSNorm and LayerNorm, self-attention with history restriction and 3-way RoPE, and cross-attention with QK normalization and key/value caching.
- loading.py, supporting quantized model loading and reusing Wan's T5 encoder and VAE components, in line with the Helios architecture.