A PyTorch + Triton + FlexAttention implementation of NSA, combining the compression, selection, and sliding-window attention branches described in DeepSeek's Native Sparse Attention paper.
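Following the paper, each query's output is a gate-weighted mix of the three branches; in simplified notation:

```
o_t = g_cmp * Attn(q_t, K_cmp, V_cmp)
    + g_slc * Attn(q_t, K_slc, V_slc)
    + g_swa * Attn(q_t, K_swa, V_swa)
```

where each gate is a learned score in [0, 1] computed from the query token's hidden state, and the three K/V sets are the compressed blocks, the selected top-scoring blocks, and the local sliding window, respectively.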
For a deep dive into sparse attention mechanisms and the design of this kernel, check out our blog post: Sparsity is Cool.
```bash
# Using uv (recommended)
uv sync
```

```python
import torch
from nsa import nsa_func
# Run NSA
output = nsa_func(
    q, k, v,
    g_cmp=g_cmp,
    g_slc=g_slc,
    g_swa=g_swa,
    block_counts=16,
    block_size=16,
    window_size=32,
    scale=None  # Defaults to 1/sqrt(D)
)
```
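A minimal end-to-end sketch of the call above. The tensor layout, head counts, and gate shapes here are assumptions for illustration (a `[batch, seq_len, heads, head_dim]` layout with 64 query heads sharing 4 key/value heads, and per-token, per-query-head gates); check the `nsa_func` docstring for the shapes the kernel actually expects.

```python
import torch
from nsa import nsa_func

# Assumed shapes: q is [B, T, HQ, D]; k and v are [B, T, H, D] with HQ a
# multiple of H (GQA); the gates are [B, T, HQ] scores in [0, 1].
B, T, HQ, H, D = 1, 2048, 64, 4, 64
device, dtype = "cuda", torch.bfloat16

q = torch.randn(B, T, HQ, D, device=device, dtype=dtype)
k = torch.randn(B, T, H, D, device=device, dtype=dtype)
v = torch.randn(B, T, H, D, device=device, dtype=dtype)

# In a real model the gates come from a learned projection; random values
# in [0, 1) are enough to exercise the kernel here.
g_cmp = torch.rand(B, T, HQ, device=device, dtype=dtype)
g_slc = torch.rand(B, T, HQ, device=device, dtype=dtype)
g_swa = torch.rand(B, T, HQ, device=device, dtype=dtype)

output = nsa_func(
    q, k, v,
    g_cmp=g_cmp,
    g_slc=g_slc,
    g_swa=g_swa,
    block_counts=16,   # selected blocks per query
    block_size=16,     # tokens per compression/selection block
    window_size=32,    # width of the sliding-window branch
)
print(output.shape)    # expected to match q's shape
```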
- Supports toggling between one-pass (atomic) and two-pass backward variants for selection attention
- GQA (Grouped Query Attention) compatible
- Efficient Triton kernels for high throughput
This implementation uses components from flash-linear-attention, specifically the parallel NSA implementation for the two-pass variant. We thank the FLA team for their excellent work on efficient attention mechanisms.
The kernel is implemented following the Native Sparse Attention paper by DeepSeek: arXiv:2502.11089.