
feat: support kimi 1t training #45

Draft

xrsrke wants to merge 20 commits into dev-updated-again from phuc/kimi1t_training

Conversation

xrsrke commented Jan 20, 2026

No description provided.

Phuc Nguyen and others added 5 commits January 19, 2026 13:04
- Add activation checkpoint offload module
- Add memory defragmentation utilities
- Add deep memory profiler script
- Add various Kimi 1T training configs (EP64, EP96, EP128, CP2, etc.)
- Add Qwen3 activation offload test configs
- Add slurm launch scripts
- Update DeepSeek V3 model with MoE improvements

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove kimi_1t, debug, exp1a training configs, qwen3 test configs,
and root-level debug scripts from git tracking.
xrsrke force-pushed the phuc/kimi1t_training branch from 9cfe11d to e04c0f6 on January 20, 2026 at 19:51
xrsrke and others added 15 commits January 20, 2026 12:00
…control

- fsdp_reshard_after_forward now accepts integer N for partial resharding
  to N-GPU groups (e.g., N=8 for intra-node NVLink communication)
- Add fsdp_bucket_cap_mb config to control gradient reduction bucket size
- Add fsdp_disable_prefetch config to disable forward/backward prefetching
- Pass new options through to apply_fsdp() in deepseek_v3 and llama4
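The new FSDP knobs above could appear in a job config roughly as follows. This is a hedged sketch: only the key names come from the commit message; the section name and the values are illustrative assumptions.

```toml
[parallelism]  # section name assumed
# Reshard FSDP parameters only within 8-GPU intra-node groups after
# forward, so re-gathering for backward stays on NVLink.
fsdp_reshard_after_forward = 8
# Cap gradient-reduction bucket size in MiB (value illustrative).
fsdp_bucket_cap_mb = 512
# Trade some throughput for memory by disabling forward/backward prefetch.
fsdp_disable_prefetch = true
```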
- Add nvidia-smi memory reporting for verification against PyTorch stats
- Display active memory (actual tensor usage) as primary metric instead of reserved
- Log detailed memory breakdown (active/reserved/nvidia-smi) on rank 0
- Enable profile_memory=True in profiler to track allocations per operation
- New AggressiveMemoryManager with 4 modes: minimal, balanced, aggressive, maximum
- Clears CUDA cache at strategic points (post-backward, post-optimizer)
- Add aggressive_memory_mode and aggressive_memory_verbose config options
- Integrate into training loop with post_backward(), post_optimizer(), step_complete() hooks
- Add model presets: qwen3 (2048 dim, 128 experts) and kimi_k2 (7168 dim, 384 experts)
- Add init_dist_torchrun() for torchrun environment compatibility
- Add CLI arguments: --model, --hidden, --num-experts, --num-topk
- Change from group-based to uniform token distribution for routing
- Fix MASTER_PORT to be consistent across ranks
- Visualize GPU allocation across DP, PP, TP, CP, EP dimensions
- Show mesh structure, submeshes, and coordinate mappings
- Visualize expert parallel and context parallel group allocation
- Display FSDP sharding details for expert vs non-expert parameters
- Add return_outputs parameter for PP compatibility
- Accept **kwargs to handle additional PP arguments
- kimi_k2_12n_ep96_cp16_32k_ctx_lbs11.toml: 12-node baseline config
  - EP=96, CP=16, DP=1, LBS=11, 32k context
  - Expected: 402 TPS, 67.55 GiB (85.2%), 17.72% MFU

- kimi_k2_36n_ep96_cp16_32k_ctx_hsdp_replicate3_shard6_lbs10.toml: 36-node HSDP config
  - EP=96, CP=16, dp_replicate=3, dp_shard=6, LBS=10, 32k context
  - Expected: 378 TPS, 69.45 GiB (87.6%), 16.64% MFU

Both configs include aggressive memory management (mode=maximum).
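Read from the second filename, the 36-node HSDP layout might look like the following TOML fragment. All section and key names are assumptions; only the degrees (EP=96, CP=16, dp_replicate=3, dp_shard=6) and the memory mode come from the commit message.

```toml
[parallelism]                       # section/key names assumed
expert_parallel_degree = 96         # EP=96
context_parallel_degree = 16        # CP=16
data_parallel_replicate_degree = 3  # HSDP replica groups
data_parallel_shard_degree = 6      # FSDP shards within each replica group

[memory]
aggressive_memory_mode = "maximum"  # per the commit message
```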

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>