Flash Hog

This repo contains the code for Flash Higher-Order-Gradients, aka. Flash Hog. This kernel achieves around a 3.7x speedup over an XLA optimized kernel, with linear memory scaling instead of quadratic scaling.

Installation

TODO

Method

Flash Hog does 4 recomputation passes to avoid any atomics or saving any intermediary tensors of shape (N_Q, N_K). This shakes out to be thread-wise tiling across Q in 3 passes first, once to compute dd, then once for b, then once for both dQ' and ddO. Finally we do another pass tiled over K, producing dK' and dV'. The equations we implement are the following:

Name		Name	Last commit message	Last commit date
Latest commit History 118 Commits
assets		assets
experiments		experiments
flash_hog		flash_hog
jax_refs		jax_refs
references		references
tests		tests
.gitignore		.gitignore
.python-version		.python-version
Flash-HOG.pdf		Flash-HOG.pdf
README.md		README.md
pyproject.toml		pyproject.toml
test_data.zip		test_data.zip
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Flash Hog

Installation

Method

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Flash Hog

Installation

Method

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages