A high-performance, GPU-accelerated hash table implementation using CUDA cooperative groups for optimized concurrent operations. Designed for large-scale parallel workloads requiring efficient key-value storage and retrieval on NVIDIA GPUs.
The Adaptive GPU Hash Table is a plug-and-play, header-only hash table optimized for NVIDIA CUDA architectures. It leverages cooperative groups (CG) to enable efficient parallel insertions, lookups, and deletions across thousands of GPU threads. The implementation uses linear probing with tombstone-based deletion and supports configurable load factors and probing strategies.
- Cooperative Groups: Uses CUDA cooperative groups (tile size: 16 threads) to parallelize probing sequences
- Linear Probing: Open addressing with linear probing for cache-friendly memory access
- Tombstone Deletion: Tombstone markers enable efficient deletion without rehashing
- Structure of Arrays: Separate arrays for keys, values, and metadata for optimal memory coalescing
- `AhgDeviceView`: Device-side view providing insert, find, and erase operations
- `AhgHost`: Host-side table management for memory allocation and initialization
- `AdaptiveHashTableGPU`: High-level C++ API wrapper for batch operations
- Kernels: Optimized CUDA kernels for batch insert, find, and erase operations
- Recommended GPU Hardware: NVIDIA GeForce RTX 4070 or better
- Memory: Sufficient GPU memory for your dataset (capacity × 17 bytes per element)
- CUDA Toolkit: Version 12.4 or compatible
- Compiler: `nvcc` (CUDA compiler) and `g++` (C++17 support)
- OS: Linux (tested on CUDA servers; all reported benchmarks are from the cuda5.cims.nyu.edu cluster at NYU Courant)
```bash
# Load CUDA module (on systems with a module system)
module load cuda-12.4

# Verify the CUDA installation
nvcc --version
```

Clone the repository:

```bash
git clone https://github.com/NerdyVisky/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable
```

Simply include the header file in your CUDA source:
```cpp
#include "path/to/repo/src/hashtable_gpu_adaptive.cuh"
```

Compile with:

```bash
nvcc -std=c++17 -O3 -arch=sm_80 your_file.cu -o your_executable
```

Note: Adjust `-arch=sm_80` based on your GPU's compute capability:

- `sm_75` for Turing (RTX 20xx)
- `sm_80` for Ampere (A100)
- `sm_86` for Ampere (RTX 30xx, desktop and mobile)
- `sm_89` for Ada Lovelace (RTX 40xx)
```cpp
#include "src/hashtable_gpu_adaptive.cuh"
#include <vector>

using namespace gpu::hashtable;

int main() {
    // Create hash table with capacity (rounded up to the next power of 2)
    size_t capacity = 1'000'000;
    AdaptiveHashTableGPU<uint64_t> table(capacity, 256); // max_probe = 256

    // Prepare data
    std::vector<uint64_t> keys = {1, 2, 3, 4, 5};
    std::vector<uint64_t> values = {100, 200, 300, 400, 500};
    std::vector<uint8_t> success(keys.size());

    // Batch insert
    table.insert(keys.data(), values.data(), keys.size(), success.data());

    // Batch find
    std::vector<uint64_t> out_values(keys.size());
    std::vector<uint8_t> found(keys.size());
    table.find(keys.data(), out_values.data(), found.data(), keys.size());

    // Batch erase
    table.erase(keys.data(), success.data(), keys.size());

    return 0;
}
```

`AdaptiveHashTableGPU(size_t capacity_pow2, uint32_t max_probe = 256)`

Creates a new hash table with the specified capacity (rounded up to the next power of 2) and maximum probe length.
Parameters:
- `capacity_pow2`: Desired capacity (automatically rounded up to the next power of 2)
- `max_probe`: Maximum number of probe steps before giving up (default: 256)
Batch insert operation. Inserts n key-value pairs into the hash table.
Parameters:
- `h_keys`: Host pointer to the array of keys
- `h_vals`: Host pointer to the array of values
- `n`: Number of elements to insert
- `h_success`: Output array indicating success (1) or failure (0) for each insertion
Returns: void (success flags written to h_success)
Batch find operation. Looks up n keys in the hash table.
Parameters:
- `h_keys`: Host pointer to the array of keys to find
- `h_out_vals`: Output array for found values
- `h_found`: Output array indicating whether each key was found (1) or not (0)
- `n`: Number of keys to look up
Returns: void (results written to h_out_vals and h_found)
Batch erase operation. Removes n keys from the hash table using tombstone markers.
Parameters:
- `h_keys`: Host pointer to the array of keys to erase
- `h_success`: Output array indicating success (1) or failure (0) for each erasure
- `n`: Number of keys to erase
Returns: void (success flags written to h_success)
Returns the actual capacity of the hash table (power of 2).
Returns the name of the hash function used ("ahg_hash64").
Returns the name of the probing strategy ("LinearProbing(CG-optimized)").
We provide comprehensive benchmarking tools to evaluate performance across different configurations.
The project includes a dedicated benchmark for evaluating the dynamic (resizing and compaction) features of the adaptive GPU hash table.
To run the adaptive resizing benchmark:
- Compile the benchmark:

  ```bash
  nvcc -std=c++17 -O3 -arch=sm_80 benchmarks/benchmark_adaptive_resize.cu -o exe/nvcc/benchmark_adaptive_resize
  ```

- Run the benchmark:

  ```bash
  ./exe/nvcc/benchmark_adaptive_resize
  ```
What this benchmark does:
- Insertion Phase: Inserts a large number of items in batches, triggering table growth and dynamic resizing.
- Deletion Phase: Deletes a significant portion of the table, creating tombstones and fragmentation.
- Compaction Phase: Triggers table shrinking and compaction, efficiently cleaning up tombstones.
- Performance Reporting: Prints table capacity, memory usage, and read performance before and after compaction, showing the effect of dynamic resizing.
Output Example:

```text
--- INSERTION PHASE (Growth) ---
Batch,Capacity,Memory(MB),Time(ms)
...
Final capacity: 262144
--- DELETION PHASE (80% Data) ---
Pre-Compact Capacity: 262144 (Fragmented)
Fragmented Read Time (100k items): 12345 us
--- COMPACTION TRIGGERED ---
Post-Compact Capacity: 32768
Memory Saved: 3.6 MB
Compacted Read Time (100k items): 6789 us
Speedup: 1.8x
```
You can adjust the number of insertions, batch size, and load factor thresholds in benchmarks/benchmark_adaptive_resize.cu to suit your experiments.
Fixed Load Factor, Variable Size:

```bash
./scripts/run_benchmark.sh \
  ./exe/nvcc/syn_gpu_adaptive_benchmark \
  ./results/gpu/adaptive \
  linear \
  adaptive \
  0.75 \
  64
```

Fixed Size, Variable Load Factor:

```bash
./scripts/run_benchmark_load_factors.sh \
  ./exe/nvcc/syn_gpu_adaptive_benchmark \
  ./results/gpu/adaptive \
  linear \
  adaptive \
  1000000 \
  64
```

Parameters:

- `EXECUTABLE`: Path to the benchmark executable
- `RESULTS_DIR`: Directory to save results
- `PROBING_STRATEGY`: Probing method (`linear`, `quadratic`, `double`)
- `HASHTABLE_IMPLEMENTATION`: Implementation type (`adaptive`, `naive`, `default`)
- `MAX_LOAD_FACTOR` or `N_SIZE`: Load factor (0.0-1.0) or number of elements
- `KEY_SIZE`: Key size in bits (32 or 64)
```bash
# GPU Adaptive
nvcc -std=c++17 -O3 -arch=sm_80 \
  benchmarks/synthetic/run_benchmark_gpu_adaptive.cu \
  -o exe/nvcc/syn_gpu_adaptive_benchmark

# GPU Naive
nvcc -std=c++17 -O3 -arch=sm_80 \
  benchmarks/synthetic/run_benchmark_gpu_naive.cu \
  -o exe/nvcc/syn_gpu_naive_benchmark

# CPU
g++ -std=c++17 -O3 \
  benchmarks/synthetic/run_benchmark.cpp \
  -o exe/g++/syn_cpu_benchmark
```

Performance analysis of insert, find, and erase operations at various input sizes:
Comparing C++ STL's `unordered_map` on CPU vs. the adaptive GPU hash table:
Effect of load factor on CPU vs. GPU performance at 10M elements:
```text
adaptive-gpu-hashtable/
├── src/                                  # Core implementation
│   ├── adaptive_hash_gpu.cuh             # Low-level device operations
│   ├── hashtable_gpu_adaptive.cuh        # High-level C++ API
│   ├── hashtable_gpu_naive.cuh           # Naive GPU implementation
│   └── hashtable_cpu.hpp                 # CPU reference implementation
├── benchmarks/                           # Benchmarking suite
│   └── synthetic/
│       ├── run_benchmark_gpu.cu          # GPU benchmark driver
│       ├── run_benchmark.cpp             # CPU benchmark driver
│       └── cuda_helpers/                 # GPU benchmark utilities
├── scripts/                              # Benchmark automation scripts
│   ├── run_benchmark.sh                  # Variable size benchmarks
│   └── run_benchmark_load_factors.sh     # Variable load factor benchmarks
├── results/                              # Benchmark results
│   ├── cpu/                              # CPU benchmark results
│   ├── gpu/                              # GPU benchmark results
│   └── plots/                            # Performance plots
├── exe/                                  # Precompiled executables
│   ├── g++/                              # CPU executables
│   └── nvcc/                             # GPU executables
└── assets/                               # Documentation assets
```
To use serial (scalar) operations instead of cooperative groups, modify src/adaptive_hash_gpu.cuh:

```cpp
// In the AhgDeviceView struct, change the wrapper methods:
template<unsigned int TILE_SIZE, typename Parent>
__device__ bool insert(cg::thread_block_tile<TILE_SIZE, Parent> tile, const K& k, const V& v) {
    // return insert_cg(tile, k, v); // Comment out the CG version
    return insert_scalar(k, v);      // Use the scalar version
}

template<unsigned int TILE_SIZE, typename Parent>
__device__ bool erase(cg::thread_block_tile<TILE_SIZE, Parent> tile, const K& k) {
    // return erase_cg(tile, k); // Comment out the CG version
    return erase_scalar(k);      // Use the scalar version
}
```

Modify the tile size by defining `AHG_TILE` before including the header:

```cpp
#define AHG_TILE 32 // Default is 16
#include "src/adaptive_hash_gpu.cuh"
```

Contributions are welcome! Please follow these guidelines:
- Fork the repository and create a feature branch
- Follow code style: Use consistent formatting and naming conventions
- Add tests: Include benchmarks or unit tests for new features
- Update documentation: Keep README and code comments up to date
- Submit a pull request with a clear description of changes
```bash
# Clone your fork
git clone https://github.com/your-username/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable

# Create a branch
git checkout -b feature/your-feature-name

# Make changes and test
nvcc -std=c++17 -O3 -arch=sm_80 your_test.cu -o test
./test

# Commit and push
git commit -m "Add your feature"
git push origin feature/your-feature-name
```

- Memory Alignment: Ensure keys and values are properly aligned for optimal performance
- Load Factor: Very high load factors (>0.95) may result in increased probe lengths and reduced performance
- Key Restrictions: Keys cannot be `0xFFFFFFFFFFFFFFFF` (EMPTY_KEY) or `0xFFFFFFFFFFFFFFFE` (TOMB_KEY)
- Concurrent Modifications: While thread-safe, extreme contention may reduce performance
- Support for custom hash functions
- Dynamic resizing/rehashing
- Additional probing strategies (quadratic, double hashing)
- Multi-GPU support
- Stream-based asynchronous operations
- Integration with CUDA unified memory
- Support for non-64-bit value types
- Performance profiling and optimization tools
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this implementation in your research, please cite:
```bibtex
@software{adaptive_gpu_hashtable,
  title  = {Adaptive GPU Hash Table},
  author = {Trivedi, Vishesh and Sathi, Vaigarai},
  year   = {2025},
  url    = {https://github.com/NerdyVisky/adaptive-gpu-hashtable}
}
```

For questions, issues, or contributions, please open an issue on GitHub.
Note: This implementation is optimized for NVIDIA GPUs with CUDA support. For CPU-based hash tables, consider using the reference implementation in src/hashtable_cpu.hpp or standard library containers.



