DeepSeek V3.1 became the best non-reasoning model in February 2025. This is a recreation based on the paper.
Paper (DeepSeek-V3 Technical Report): https://arxiv.org/pdf/2412.19437
Learning:
- Multi-head latent attention
  - Attention basics
  - RoPE
  - MLA
- Mixture of experts
  - Gate
  - Expert
- Parallelization across GPUs
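
To make the "Gate" and "Expert" items concrete, here is a minimal pure-Python sketch of top-k routing in a mixture-of-experts layer. It assumes sigmoid affinity scores that are normalized over the selected experts, which is how the paper describes the gating values. The names `gate`, `moe_layer`, `centroids`, and the toy scaling experts are hypothetical and chosen only for illustration. Real experts are feed-forward networks, and this sketch leaves out shared experts, the residual connection, and load balancing.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(token, centroids, top_k):
    """Return {expert_index: weight} for the top_k highest-scoring experts."""
    # Affinity of the token to each expert: sigmoid of a dot product.
    scores = [
        sigmoid(sum(t * c for t, c in zip(token, centroid)))
        for centroid in centroids
    ]
    # Keep the indices of the top_k experts by score.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Normalize the selected scores so the gating weights sum to 1.
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


def moe_layer(token, centroids, experts, top_k=2):
    """Weighted sum of the outputs of the experts selected by the gate."""
    weights = gate(token, centroids, top_k)
    out = [0.0] * len(token)
    for i, w in weights.items():
        expert_out = experts[i](token)
        out = [o + w * v for o, v in zip(out, expert_out)]
    return out


# Toy experts (hypothetical): each one scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
# One routing vector per expert (hypothetical values).
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

token = [2.0, 1.0]
weights = gate(token, centroids, top_k=2)
output = moe_layer(token, centroids, experts, top_k=2)
```

Only `top_k` experts run per token, so compute per token stays roughly constant as the total number of experts grows. That sparsity is the main reason to use a gate at all.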