
feat: support kimi 1t training #45

Draft

xrsrke wants to merge 20 commits into dev-updated-again from phuc/kimi1t_training

Conversation

xrsrke commented Jan 20, 2026

No description provided.

Phuc Nguyen and others added 5 commits January 19, 2026 13:04
- Add activation checkpoint offload module
- Add memory defragmentation utilities
- Add deep memory profiler script
- Add various Kimi 1T training configs (EP64, EP96, EP128, CP2, etc.)
- Add Qwen3 activation offload test configs
- Add slurm launch scripts
- Update DeepSeek V3 model with MoE improvements

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove kimi_1t, debug, exp1a training configs, qwen3 test configs,
and root-level debug scripts from git tracking.
xrsrke force-pushed the phuc/kimi1t_training branch from 9cfe11d to e04c0f6 on January 20, 2026 at 19:51
xrsrke and others added 15 commits January 20, 2026 12:00
…control

- fsdp_reshard_after_forward now accepts integer N for partial resharding
  to N-GPU groups (e.g., N=8 for intra-node NVLink communication)
- Add fsdp_bucket_cap_mb config to control gradient reduction bucket size
- Add fsdp_disable_prefetch config to disable forward/backward prefetching
- Pass new options through to apply_fsdp() in deepseek_v3 and llama4
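The new FSDP knobs above could appear in a job config roughly as follows. This is a hedged sketch: only the key names come from the commit message; the section name and the values are illustrative assumptions.

```toml
[parallelism]  # section name assumed
# Reshard FSDP parameters only within 8-GPU intra-node groups after
# forward, so re-gathering for backward stays on NVLink.
fsdp_reshard_after_forward = 8
# Cap gradient-reduction bucket size in MiB (value illustrative).
fsdp_bucket_cap_mb = 512
# Trade some throughput for memory by disabling forward/backward prefetch.
fsdp_disable_prefetch = true
```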
- Add nvidia-smi memory reporting for verification against PyTorch stats
- Display active memory (actual tensor usage) as primary metric instead of reserved
- Log detailed memory breakdown (active/reserved/nvidia-smi) on rank 0
- Enable profile_memory=True in profiler to track allocations per operation
- New AggressiveMemoryManager with 4 modes: minimal, balanced, aggressive, maximum
- Clears CUDA cache at strategic points (post-backward, post-optimizer)
- Add aggressive_memory_mode and aggressive_memory_verbose config options
- Integrate into training loop with post_backward(), post_optimizer(), step_complete() hooks
- Add model presets: qwen3 (2048 dim, 128 experts) and kimi_k2 (7168 dim, 384 experts)
- Add init_dist_torchrun() for torchrun environment compatibility
- Add CLI arguments: --model, --hidden, --num-experts, --num-topk
- Change from group-based to uniform token distribution for routing
- Fix MASTER_PORT to be consistent across ranks
- Visualize GPU allocation across DP, PP, TP, CP, EP dimensions
- Show mesh structure, submeshes, and coordinate mappings
- Visualize expert parallel and context parallel group allocation
- Display FSDP sharding details for expert vs non-expert parameters
- Add return_outputs parameter for PP compatibility
- Accept **kwargs to handle additional PP arguments
- kimi_k2_12n_ep96_cp16_32k_ctx_lbs11.toml: 12-node baseline config
  - EP=96, CP=16, DP=1, LBS=11, 32k context
  - Expected: 402 TPS, 67.55 GiB (85.2%), 17.72% MFU

- kimi_k2_36n_ep96_cp16_32k_ctx_hsdp_replicate3_shard6_lbs10.toml: 36-node HSDP config
  - EP=96, CP=16, dp_replicate=3, dp_shard=6, LBS=10, 32k context
  - Expected: 378 TPS, 69.45 GiB (87.6%), 16.64% MFU

Both configs include aggressive memory management (mode=maximum).
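Read from the second filename, the 36-node HSDP layout might look like the following TOML fragment. All section and key names are assumptions; only the degrees (EP=96, CP=16, dp_replicate=3, dp_shard=6) and the memory mode come from the commit message.

```toml
[parallelism]                       # section/key names assumed
expert_parallel_degree = 96         # EP=96
context_parallel_degree = 16        # CP=16
data_parallel_replicate_degree = 3  # HSDP replica groups
data_parallel_shard_degree = 6      # FSDP shards within each replica group

[memory]
aggressive_memory_mode = "maximum"  # per the commit message
```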

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>