# TorchSpec Roadmap 2026 Q2
## Model Support
- Minimax M 2.5
- Qwen 3.5
- Continuous training of the MTP layer from GLM 5
## Training
- Packed sequence training: pack multiple shorter sequences into a single training sample to maximize GPU utilization and reduce padding waste, especially for datasets with variable-length inputs
- Additional training methods: expand beyond Eagle3 to support DFlash, MTP, and other speculative decoding training approaches, broadening the range of draft model architectures TorchSpec can train
- LK Loss (PR #29): add LK^alpha and LK^lambda losses for direct acceptance rate optimization, improving average acceptance length by 3-8% over Forward KL on Eagle3
- Context Parallel under DP ranks: support context parallelism within each data-parallel rank, sharding long sequences across the GPUs of a rank
- FlexAttention native FA4 backend (Issue #30): adopt `BACKEND="FLASH"` in FlexAttention to unify the `flex_attention` and `fa_experimental` code paths, replacing the manual CuTeDSL integration with a stable PyTorch API for FA4-level performance on Hopper/Blackwell GPUs
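The packed-sequence item above can be illustrated with a minimal sketch. The greedy first-fit policy and the `pack_sequences`/`flatten_with_boundaries` helpers are illustrative assumptions, not TorchSpec's actual implementation; the key idea is that packed samples carry segment ids so attention can be masked per-sequence.

```python
def pack_sequences(seqs, max_len):
    """Greedy first-fit packing: place each sequence into the first
    bin with enough remaining room, minimizing padding waste."""
    bins = []  # each bin holds sequences whose total length <= max_len
    for seq in sorted(seqs, key=len, reverse=True):
        for b in bins:
            if sum(len(s) for s in b) + len(seq) <= max_len:
                b.append(seq)
                break
        else:
            bins.append([seq])
    return bins

def flatten_with_boundaries(packed, pad_id, max_len):
    """Concatenate one bin into a single training sample and record
    segment ids so attention masking can keep sequences separate."""
    tokens, seg_ids = [], []
    for seg, seq in enumerate(packed):
        tokens.extend(seq)
        seg_ids.extend([seg] * len(seq))
    pad = max_len - len(tokens)
    return tokens + [pad_id] * pad, seg_ids + [-1] * pad
```

For example, sequences of lengths 5, 4, 3, and 2 fit into two bins of length 8 instead of four padded samples, cutting padding from 18 tokens to 2.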
## Inference
- TensorRT-LLM integration: add as an inference backend alongside SGLang and vLLM so users can plug in whichever engine best fits their deployment stack
- Inference auto-expansion: automatically scale inference when more nodes become available
- Chunked prefill: process long prompts in fixed-size chunks to support longer contexts
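The chunked-prefill item can be sketched as follows. The `chunk_prefill`/`run_prefill` names, the `forward` callback, and the fixed chunk size are illustrative assumptions; real engines interleave these chunks with decode steps of other requests.

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Split a long prompt into fixed-size chunks so prefill does not
    monopolize a step's compute and memory budget."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

def run_prefill(prompt_tokens, chunk_size, forward):
    """Feed each chunk through the model with the current KV-cache
    offset so attention sees all previously prefilled tokens."""
    kv_len = 0
    for chunk in chunk_prefill(prompt_tokens, chunk_size):
        forward(chunk, past_len=kv_len)  # engine-specific call in practice
        kv_len += len(chunk)
    return kv_len
```

A 10-token prompt with `chunk_size=4` becomes three forward passes of 4, 4, and 2 tokens at KV offsets 0, 4, and 8.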
## Framework
- Placement group node pinning by IP: allow users to pin inference to specific nodes by IP, with finer granularity for multiple inference engines on the same node
- Automatic Mooncake config determination: derive the Mooncake transfer config from batch size and max sampling pool size; auto-compute max sampling pool size as `global_batch_size * delay_deletion_ratio`
- Debugging mode: add a debugging mode for both the inference and training sides
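The auto-computed pool size above reduces to a one-line formula. The function name and the ceiling rounding are assumptions for illustration; the `global_batch_size * delay_deletion_ratio` expression itself comes from the roadmap item.

```python
import math

def max_sampling_pool_size(global_batch_size, delay_deletion_ratio):
    """Auto-compute the max sampling pool size as described in the
    roadmap: global_batch_size * delay_deletion_ratio, rounded up to
    a whole number of samples (rounding policy is an assumption)."""
    return math.ceil(global_batch_size * delay_deletion_ratio)
```

For example, a global batch size of 256 with a delay-deletion ratio of 1.5 yields a pool of 384 samples.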