A spelled-out, plug-and-play, header-only implementation of a GPU-friendly hash table that performs well in single-core, multicore, and GPU-accelerated applications

Adaptive GPU Hash Table

A high-performance, GPU-accelerated hash table implementation using CUDA cooperative groups for optimized concurrent operations. Designed for large-scale parallel workloads requiring efficient key-value storage and retrieval on NVIDIA GPUs.

Overview

The Adaptive GPU Hash Table is a plug-and-play, header-only hash table optimized for NVIDIA CUDA architectures. It leverages cooperative groups (CG) to enable efficient parallel insertions, lookups, and deletions across thousands of GPU threads. The implementation uses linear probing with tombstone-based deletion and supports configurable load factors and probing strategies.

Architecture

Design Principles

  • Cooperative Groups: Uses CUDA cooperative groups (tile size: 16 threads) to parallelize probing sequences
  • Linear Probing: Open addressing with linear probing for cache-friendly memory access
  • Tombstone Deletion: Tombstone markers enable efficient deletion without rehashing
  • Structure of Arrays: Separate arrays for keys, values, and metadata for optimal memory coalescing

Core Components

  • AhgDeviceView: Device-side view providing insert, find, and erase operations
  • AhgHost: Host-side table management for memory allocation and initialization
  • AdaptiveHashTableGPU: High-level C++ API wrapper for batch operations
  • Kernels: Optimized CUDA kernels for batch insert, find, and erase operations

Requirements

Hardware

  • Recommended GPU Hardware: NVIDIA GeForce RTX 4070 or better
  • Memory: Sufficient GPU memory for your dataset (capacity × 17 bytes per element)

Software

  • CUDA Toolkit: Version 12.4 or compatible
  • Compiler: nvcc (CUDA compiler) and g++ (C++17 support)
  • OS: Linux (all reported benchmarks were run on the cuda5.cims.nyu.edu cluster at NYU Courant)

Installation

Step 1: Setup CUDA Environment

# Load CUDA module (on systems with module system)
module load cuda-12.4

# Verify CUDA installation
nvcc --version

Step 2: Clone Repository

git clone https://github.com/NerdyVisky/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable

Step 3: Include in Your Project

Simply include the header file in your CUDA source:

#include "path/to/repo/src/hashtable_gpu_adaptive.cuh"

Step 4: Compile

nvcc -std=c++17 -O3 -arch=sm_80 your_file.cu -o your_executable

Note: Adjust -arch=sm_80 based on your GPU's compute capability:

  • sm_75 for Turing (RTX 20xx)
  • sm_80 for Ampere (A100)
  • sm_86 for Ampere (RTX 30xx)
  • sm_89 for Ada Lovelace (RTX 40xx)

Quick Start

Basic Usage

#include "src/hashtable_gpu_adaptive.cuh"
#include <vector>

using namespace gpu::hashtable;

int main() {
    // Create hash table with capacity (will be rounded up to power of 2)
    size_t capacity = 1'000'000;
    AdaptiveHashTableGPU<uint64_t> table(capacity, 256); // max_probe = 256
    
    // Prepare data
    std::vector<uint64_t> keys = {1, 2, 3, 4, 5};
    std::vector<uint64_t> values = {100, 200, 300, 400, 500};
    std::vector<uint8_t> success(keys.size());
    
    // Batch insert
    table.insert(keys.data(), values.data(), keys.size(), success.data());
    
    // Batch find
    std::vector<uint64_t> out_values(keys.size());
    std::vector<uint8_t> found(keys.size());
    table.find(keys.data(), out_values.data(), found.data(), keys.size());
    
    // Batch erase
    table.erase(keys.data(), success.data(), keys.size());
    
    return 0;
}

API Reference

AdaptiveHashTableGPU<KeyType>

Constructor

AdaptiveHashTableGPU(size_t capacity_pow2, uint32_t max_probe = 256)

Creates a new hash table with the specified capacity (rounded up to power of 2) and maximum probe length.

Parameters:

  • capacity_pow2: Desired capacity (automatically rounded to next power of 2)
  • max_probe: Maximum number of probe steps before giving up (default: 256)

Methods

void insert(const KeyType* h_keys, const uint64_t* h_vals, size_t n, uint8_t* h_success)

Batch insert operation. Inserts n key-value pairs into the hash table.

Parameters:

  • h_keys: Host pointer to array of keys
  • h_vals: Host pointer to array of values
  • n: Number of elements to insert
  • h_success: Output array indicating success (1) or failure (0) for each insertion

Returns: void (success flags written to h_success)

void find(const KeyType* h_keys, uint64_t* h_out_vals, uint8_t* h_found, size_t n) const

Batch find operation. Looks up n keys in the hash table.

Parameters:

  • h_keys: Host pointer to array of keys to find
  • h_out_vals: Output array for found values
  • h_found: Output array indicating whether each key was found (1) or not (0)
  • n: Number of keys to look up

Returns: void (results written to h_out_vals and h_found)

void erase(const KeyType* h_keys, uint8_t* h_success, size_t n)

Batch erase operation. Removes n keys from the hash table using tombstone markers.

Parameters:

  • h_keys: Host pointer to array of keys to erase
  • h_success: Output array indicating success (1) or failure (0) for each erasure
  • n: Number of keys to erase

Returns: void (success flags written to h_success)

size_t capacity() const

Returns the actual capacity of the hash table (power of 2).

std::string hash_policy_name() const

Returns the name of the hash function used ("ahg_hash64").

std::string probing_policy_name() const

Returns the name of the probing strategy ("LinearProbing(CG-optimized)").

Benchmarks

We provide comprehensive benchmarking tools to evaluate performance across different configurations.

Running Benchmarks

Testing Dynamic Resizing and Compaction

The project includes a dedicated benchmark for evaluating the dynamic features (resizing and compaction) of the adaptive GPU hash table.

To run the adaptive resizing benchmark:

  1. Compile the benchmark:

    nvcc -std=c++17 -O3 -arch=sm_80 benchmarks/benchmark_adaptive_resize.cu -o exe/nvcc/benchmark_adaptive_resize
  2. Run the benchmark:

    ./exe/nvcc/benchmark_adaptive_resize

What this benchmark does:

  • Insertion Phase: Inserts a large number of items in batches, triggering table growth and dynamic resizing.
  • Deletion Phase: Deletes a significant portion of the table, creating tombstones and fragmentation.
  • Compaction Phase: Triggers table shrinking and compaction, efficiently cleaning up tombstones.
  • Performance Reporting: Prints table capacity, memory usage, and read performance before and after compaction, showing the effect of dynamic resizing.

Output Example:

--- INSERTION PHASE (Growth) ---
Batch,Capacity,Memory(MB),Time(ms)
...
Final capacity: 262144

--- DELETION PHASE (80% Data) ---
Pre-Compact Capacity: 262144 (Fragmented)
Fragmented Read Time (100k items): 12345 us

--- COMPACTION TRIGGERED ---
Post-Compact Capacity: 32768
Memory Saved: 3.6 MB
Compacted Read Time (100k items): 6789 us
Speedup: 1.8x

You can adjust the number of insertions, batch size, and load factor thresholds in benchmarks/benchmark_adaptive_resize.cu to suit your experiments.

Option 1: Using Precompiled Executables (Recommended)

Fixed Load Factor, Variable Size:

./scripts/run_benchmark.sh \
    ./exe/nvcc/syn_gpu_adaptive_benchmark \
    ./results/gpu/adaptive \
    linear \
    adaptive \
    0.75 \
    64

Fixed Size, Variable Load Factor:

./scripts/run_benchmark_load_factors.sh \
    ./exe/nvcc/syn_gpu_adaptive_benchmark \
    ./results/gpu/adaptive \
    linear \
    adaptive \
    1000000 \
    64

Parameters:

  • EXECUTABLE: Path to benchmark executable
  • RESULTS_DIR: Directory to save results
  • PROBING_STRATEGY: Probing method (linear, quadratic, double)
  • HASHTABLE_IMPLEMENTATION: Implementation type (adaptive, naive, default)
  • MAX_LOAD_FACTOR or N_SIZE: Load factor (0.0-1.0) or number of elements
  • KEY_SIZE: Key size in bits (32 or 64)

Option 2: Compile Your Own Benchmark

# GPU Adaptive
nvcc -std=c++17 -O3 -arch=sm_80 \
    benchmarks/synthetic/run_benchmark_gpu_adaptive.cu \
    -o exe/nvcc/syn_gpu_adaptive_benchmark

# GPU Naive
nvcc -std=c++17 -O3 -arch=sm_80 \
    benchmarks/synthetic/run_benchmark_gpu_naive.cu \
    -o exe/nvcc/syn_gpu_naive_benchmark

# CPU
g++ -std=c++17 -O3 \
    benchmarks/synthetic/run_benchmark.cpp \
    -o exe/g++/syn_cpu_benchmark

Performance Results

Static Operations Performance

Performance analysis of insert, find, and erase operations at various input sizes:

[Plot: Static Operations Performance]

Dynamic Performance Comparison

Comparing the C++ standard library's std::unordered_map on the CPU with the adaptive GPU hash table:

[Plot: Dynamic Performance Comparison]

Load Factor Analysis

Effect of load factor on CPU vs. GPU performance at 10M elements:

[Plots: CPU-10M and GPU-10M]

Repository Structure

adaptive-gpu-hashtable/
├── src/                          # Core implementation
│   ├── adaptive_hash_gpu.cuh    # Low-level device operations
│   ├── hashtable_gpu_adaptive.cuh # High-level C++ API
│   ├── hashtable_gpu_naive.cuh   # Naive GPU implementation
│   └── hashtable_cpu.hpp         # CPU reference implementation
├── benchmarks/                   # Benchmarking suite
│   └── synthetic/
│       ├── run_benchmark_gpu.cu  # GPU benchmark driver
│       ├── run_benchmark.cpp     # CPU benchmark driver
│       └── cuda_helpers/         # GPU benchmark utilities
├── scripts/                      # Benchmark automation scripts
│   ├── run_benchmark.sh          # Variable size benchmarks
│   └── run_benchmark_load_factors.sh # Variable load factor benchmarks
├── results/                      # Benchmark results
│   ├── cpu/                      # CPU benchmark results
│   ├── gpu/                      # GPU benchmark results
│   └── plots/                    # Performance plots
├── exe/                          # Precompiled executables
│   ├── g++/                      # CPU executables
│   └── nvcc/                     # GPU executables
└── assets/                       # Documentation assets

Advanced Configuration

Switching Between Serial and Cooperative Group Modes

To use serial (scalar) operations instead of cooperative groups, modify src/adaptive_hash_gpu.cuh:

// In AhgDeviceView struct, change wrapper methods:
template<unsigned int TILE_SIZE, typename Parent>
__device__ bool insert(cg::thread_block_tile<TILE_SIZE, Parent> tile, const K& k, const V& v) {
    // return insert_cg(tile, k, v);  // Comment out CG version
    return insert_scalar(k, v);       // Use scalar version
}

template<unsigned int TILE_SIZE, typename Parent>
__device__ bool erase(cg::thread_block_tile<TILE_SIZE, Parent> tile, const K& k) {
    // return erase_cg(tile, k);      // Comment out CG version
    return erase_scalar(k);           // Use scalar version
}

Customizing Tile Size

Modify the tile size by defining AHG_TILE before including the header:

#define AHG_TILE 32  // Default is 16
#include "src/adaptive_hash_gpu.cuh"

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository and create a feature branch
  2. Follow code style: Use consistent formatting and naming conventions
  3. Add tests: Include benchmarks or unit tests for new features
  4. Update documentation: Keep README and code comments up to date
  5. Submit a pull request with a clear description of changes

Development Setup

# Clone your fork
git clone https://github.com/your-username/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable

# Create a branch
git checkout -b feature/your-feature-name

# Make changes and test
nvcc -std=c++17 -O3 -arch=sm_80 your_test.cu -o test
./test

# Commit and push
git commit -m "Add your feature"
git push origin feature/your-feature-name

Known Issues

  • Memory Alignment: Ensure keys and values are properly aligned for optimal performance
  • Load Factor: Very high load factors (>0.95) may result in increased probe lengths and reduced performance
  • Key Restrictions: Keys cannot be 0xFFFFFFFFFFFFFFFF (EMPTY_KEY) or 0xFFFFFFFFFFFFFFFE (TOMB_KEY)
  • Concurrent Modifications: While thread-safe, extreme contention may reduce performance

Future Work

  • Support for custom hash functions
  • Dynamic resizing/rehashing
  • Additional probing strategies (quadratic, double hashing)
  • Multi-GPU support
  • Stream-based asynchronous operations
  • Integration with CUDA unified memory
  • Support for non-64-bit value types
  • Performance profiling and optimization tools

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this implementation in your research, please cite:

@software{adaptive_gpu_hashtable,
  title = {Adaptive GPU Hash Table},
  author = {Trivedi, Vishesh and Sathi, Vaigarai},
  year = {2025},
  url = {https://github.com/NerdyVisky/adaptive-gpu-hashtable}
}

Contact

For questions, issues, or contributions, please open an issue on GitHub.


Note: This implementation is optimized for NVIDIA GPUs with CUDA support. For CPU-based hash tables, consider using the reference implementation in src/hashtable_cpu.hpp or standard library containers.
