Use Triton to implement a high-performance FlashAttention-2 kernel that balances ease of use and speed for users without deep CUDA knowledge.
Sub-Issues:
- Understand Triton Kernel Design for Attention
- Develop Triton Kernel for FlashAttention-2 Forward Pass
- Develop Triton Kernel for FlashAttention-2 Backward Pass
- Integrate Triton Kernel with PyTorch
- Test and Benchmark Triton Kernel
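Before diving into Triton, it helps to have a CPU reference for the core algorithm the forward-pass kernel implements: FlashAttention's online (streaming) softmax, which processes K/V in blocks while keeping a running max, a running denominator, and a rescaled output accumulator. The sketch below is a hedged, pure-Python illustration of that recurrence for a single query vector; the function names and block size are illustrative, not part of any planned API, and a real Triton kernel would operate on tiles of queries with vectorized loads.

```python
import math

def naive_attention(q, K, V):
    # Reference: softmax(q . K^T) @ V for a single query vector q.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(V[0])
    return [sum(w[i] * V[i][j] for i in range(len(V))) / z for j in range(d)]

def streaming_attention(q, K, V, block=2):
    # FlashAttention-style streaming pass (illustrative, not the real kernel):
    # visit K/V one block at a time, maintaining
    #   m   - running max of the scores seen so far
    #   l   - running softmax denominator, in the current max's scale
    #   acc - un-normalized output, rescaled whenever the max increases
    d = len(V[0])
    m = float("-inf")
    l = 0.0
    acc = [0.0] * d
    for start in range(0, len(K), block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        # Rescale previous partial results into the new max's scale.
        scale = math.exp(m - m_new)
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, Vb):
            p = math.exp(s - m_new)
            l += p
            for j in range(d):
                acc[j] += p * v[j]
        m = m_new
    return [a / l for a in acc]
```

Because the rescaling keeps every partial sum in a consistent scale, the streaming result matches the naive two-pass softmax exactly (up to floating-point error), which is what makes the blocked, memory-efficient kernel possible.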