A spelled-out, plug-and-play, header-only implementation of a GPU-friendly hash table that performs well in single-core, multicore, and GPU-accelerated applications

Adaptive GPU Hash Table

A high-performance, GPU-accelerated hash table implementation using CUDA cooperative groups for optimized concurrent operations. Designed for large-scale parallel workloads requiring efficient key-value storage and retrieval on NVIDIA GPUs.

Overview

The Adaptive GPU Hash Table is a plug-and-play, header-only hash table optimized for NVIDIA CUDA architectures. It leverages cooperative groups (CG) to enable efficient parallel insertions, lookups, and deletions across thousands of GPU threads. The implementation uses linear probing with tombstone-based deletion and supports configurable load factors and probing strategies.

Architecture

Design Principles

  • Cooperative Groups: Uses CUDA cooperative groups (tile size: 16 threads) to parallelize probing sequences
  • Linear Probing: Open addressing with linear probing for cache-friendly memory access
  • Tombstone Deletion: Tombstone markers enable efficient deletion without rehashing
  • Structure of Arrays: Separate arrays for keys, values, and metadata for optimal memory coalescing

Core Components

  • AhgDeviceView: Device-side view providing insert, find, and erase operations
  • AhgHost: Host-side table management for memory allocation and initialization
  • AdaptiveHashTableGPU: High-level C++ API wrapper for batch operations
  • Kernels: Optimized CUDA kernels for batch insert, find, and erase operations

Requirements

Hardware

  • Recommended GPU Hardware: NVIDIA GeForce RTX 4070 or better
  • Memory: Sufficient GPU memory for your dataset (capacity × 17 bytes per element)

Software

  • CUDA Toolkit: Version 12.4 or compatible
  • Compiler: nvcc (CUDA compiler) and g++ (C++17 support)
  • OS: Linux (all reported benchmarks were run on the cuda5.cims.nyu.edu cluster at NYU Courant)

Installation

Step 1: Setup CUDA Environment

# Load CUDA module (on systems with module system)
module load cuda-12.4

# Verify CUDA installation
nvcc --version

Step 2: Clone Repository

git clone https://github.com/NerdyVisky/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable

Step 3: Include in Your Project

Simply include the header file in your CUDA source:

#include "path/to/repo/src/hashtable_gpu_adaptive.cuh"

Step 4: Compile

nvcc -std=c++17 -O3 -arch=sm_80 your_file.cu -o your_executable

Note: Adjust -arch=sm_80 based on your GPU's compute capability:

  • sm_75 for Turing (RTX 20xx)
  • sm_80 for Ampere (A100)
  • sm_86 for Ampere (RTX 30xx)
  • sm_89 for Ada Lovelace (RTX 40xx)

Quick Start

Basic Usage

#include "src/hashtable_gpu_adaptive.cuh"
#include <vector>

using namespace gpu::hashtable;

int main() {
    // Create hash table with capacity (will be rounded up to power of 2)
    size_t capacity = 1'000'000;
    AdaptiveHashTableGPU<uint64_t> table(capacity, 256); // max_probe = 256
    
    // Prepare data
    std::vector<uint64_t> keys = {1, 2, 3, 4, 5};
    std::vector<uint64_t> values = {100, 200, 300, 400, 500};
    std::vector<uint8_t> success(keys.size());
    
    // Batch insert
    table.insert(keys.data(), values.data(), keys.size(), success.data());
    
    // Batch find
    std::vector<uint64_t> out_values(keys.size());
    std::vector<uint8_t> found(keys.size());
    table.find(keys.data(), out_values.data(), found.data(), keys.size());
    
    // Batch erase
    table.erase(keys.data(), success.data(), keys.size());
    
    return 0;
}

API Reference

AdaptiveHashTableGPU<KeyType>

Constructor

AdaptiveHashTableGPU(size_t capacity_pow2, uint32_t max_probe = 256)

Creates a new hash table with the specified capacity (rounded up to power of 2) and maximum probe length.

Parameters:

  • capacity_pow2: Desired capacity (automatically rounded to next power of 2)
  • max_probe: Maximum number of probe steps before giving up (default: 256)

Methods

void insert(const KeyType* h_keys, const uint64_t* h_vals, size_t n, uint8_t* h_success)

Batch insert operation. Inserts n key-value pairs into the hash table.

Parameters:

  • h_keys: Host pointer to array of keys
  • h_vals: Host pointer to array of values
  • n: Number of elements to insert
  • h_success: Output array indicating success (1) or failure (0) for each insertion

Returns: void (success flags written to h_success)

void find(const KeyType* h_keys, uint64_t* h_out_vals, uint8_t* h_found, size_t n) const

Batch find operation. Looks up n keys in the hash table.

Parameters:

  • h_keys: Host pointer to array of keys to find
  • h_out_vals: Output array for found values
  • h_found: Output array indicating whether each key was found (1) or not (0)
  • n: Number of keys to look up

Returns: void (results written to h_out_vals and h_found)

void erase(const KeyType* h_keys, uint8_t* h_success, size_t n)

Batch erase operation. Removes n keys from the hash table using tombstone markers.

Parameters:

  • h_keys: Host pointer to array of keys to erase
  • h_success: Output array indicating success (1) or failure (0) for each erasure
  • n: Number of keys to erase

Returns: void (success flags written to h_success)

size_t capacity() const

Returns the actual capacity of the hash table (power of 2).

std::string hash_policy_name() const

Returns the name of the hash function used ("ahg_hash64").

std::string probing_policy_name() const

Returns the name of the probing strategy ("LinearProbing(CG-optimized)").

Benchmarks

We provide comprehensive benchmarking tools to evaluate performance across different configurations.

Running Benchmarks

Testing Dynamic Resizing and Compaction

The project includes a dedicated benchmark for evaluating the dynamic features (resizing and compaction) of the adaptive GPU hash table.

To run the adaptive resizing benchmark:

  1. Compile the benchmark:

    nvcc -std=c++17 -O3 -arch=sm_80 benchmarks/benchmark_adaptive_resize.cu -o exe/nvcc/benchmark_adaptive_resize
  2. Run the benchmark:

    ./exe/nvcc/benchmark_adaptive_resize

What this benchmark does:

  • Insertion Phase: Inserts a large number of items in batches, triggering table growth and dynamic resizing.
  • Deletion Phase: Deletes a significant portion of the table, creating tombstones and fragmentation.
  • Compaction Phase: Triggers table shrinking and compaction, efficiently cleaning up tombstones.
  • Performance Reporting: Prints table capacity, memory usage, and read performance before and after compaction, showing the effect of dynamic resizing.

Output Example:

--- INSERTION PHASE (Growth) ---
Batch,Capacity,Memory(MB),Time(ms)
...
Final capacity: 262144

--- DELETION PHASE (80% Data) ---
Pre-Compact Capacity: 262144 (Fragmented)
Fragmented Read Time (100k items): 12345 us

--- COMPACTION TRIGGERED ---
Post-Compact Capacity: 32768
Memory Saved: 3.6 MB
Compacted Read Time (100k items): 6789 us
Speedup: 1.8x

You can adjust the number of insertions, batch size, and load factor thresholds in benchmarks/benchmark_adaptive_resize.cu to suit your experiments.

Option 1: Using Precompiled Executables (Recommended)

Fixed Load Factor, Variable Size:

./scripts/run_benchmark.sh \
    ./exe/nvcc/syn_gpu_adaptive_benchmark \
    ./results/gpu/adaptive \
    linear \
    adaptive \
    0.75 \
    64

Fixed Size, Variable Load Factor:

./scripts/run_benchmark_load_factors.sh \
    ./exe/nvcc/syn_gpu_adaptive_benchmark \
    ./results/gpu/adaptive \
    linear \
    adaptive \
    1000000 \
    64

Parameters:

  • EXECUTABLE: Path to benchmark executable
  • RESULTS_DIR: Directory to save results
  • PROBING_STRATEGY: Probing method (linear, quadratic, double)
  • HASHTABLE_IMPLEMENTATION: Implementation type (adaptive, naive, default)
  • MAX_LOAD_FACTOR or N_SIZE: Load factor (0.0-1.0) or number of elements
  • KEY_SIZE: Key size in bits (32 or 64)

Option 2: Compile Your Own Benchmark

# GPU Adaptive
nvcc -std=c++17 -O3 -arch=sm_80 \
    benchmarks/synthetic/run_benchmark_gpu_adaptive.cu \
    -o exe/nvcc/syn_gpu_adaptive_benchmark

# GPU Naive
nvcc -std=c++17 -O3 -arch=sm_80 \
    benchmarks/synthetic/run_benchmark_gpu_naive.cu \
    -o exe/nvcc/syn_gpu_naive_benchmark

# CPU
g++ -std=c++17 -O3 \
    benchmarks/synthetic/run_benchmark.cpp \
    -o exe/g++/syn_cpu_benchmark

Performance Results

Static Operations Performance

Performance analysis of insert, find, and erase operations at various input sizes:

[Plot: Static Operations Performance]

Dynamic Performance Comparison

Comparing the C++ standard library's std::unordered_map on the CPU with the adaptive GPU hash table:

[Plot: Dynamic Performance Comparison]

Load Factor Analysis

Effect of load factor on CPU vs. GPU performance at 10M elements:

[Plots: CPU-10M and GPU-10M]

Repository Structure

adaptive-gpu-hashtable/
├── src/                          # Core implementation
│   ├── adaptive_hash_gpu.cuh    # Low-level device operations
│   ├── hashtable_gpu_adaptive.cuh # High-level C++ API
│   ├── hashtable_gpu_naive.cuh   # Naive GPU implementation
│   └── hashtable_cpu.hpp         # CPU reference implementation
├── benchmarks/                   # Benchmarking suite
│   └── synthetic/
│       ├── run_benchmark_gpu.cu  # GPU benchmark driver
│       ├── run_benchmark.cpp     # CPU benchmark driver
│       └── cuda_helpers/         # GPU benchmark utilities
├── scripts/                      # Benchmark automation scripts
│   ├── run_benchmark.sh          # Variable size benchmarks
│   └── run_benchmark_load_factors.sh # Variable load factor benchmarks
├── results/                      # Benchmark results
│   ├── cpu/                      # CPU benchmark results
│   ├── gpu/                      # GPU benchmark results
│   └── plots/                    # Performance plots
├── exe/                          # Precompiled executables
│   ├── g++/                      # CPU executables
│   └── nvcc/                     # GPU executables
└── assets/                       # Documentation assets

Advanced Configuration

Switching Between Serial and Cooperative Group Modes

To use serial (scalar) operations instead of cooperative groups, modify src/adaptive_hash_gpu.cuh:

// In AhgDeviceView struct, change wrapper methods:
template<unsigned int TILE_SIZE, typename Parent>
__device__ bool insert(cg::thread_block_tile<TILE_SIZE, Parent> tile, const K& k, const V& v) {
    // return insert_cg(tile, k, v);  // Comment out CG version
    return insert_scalar(k, v);       // Use scalar version
}

template<unsigned int TILE_SIZE, typename Parent>
__device__ bool erase(cg::thread_block_tile<TILE_SIZE, Parent> tile, const K& k) {
    // return erase_cg(tile, k);      // Comment out CG version
    return erase_scalar(k);           // Use scalar version
}

Customizing Tile Size

Modify the tile size by defining AHG_TILE before including the header:

#define AHG_TILE 32  // Default is 16
#include "src/adaptive_hash_gpu.cuh"

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository and create a feature branch
  2. Follow code style: Use consistent formatting and naming conventions
  3. Add tests: Include benchmarks or unit tests for new features
  4. Update documentation: Keep README and code comments up to date
  5. Submit a pull request with a clear description of changes

Development Setup

# Clone your fork
git clone https://github.com/your-username/adaptive-gpu-hashtable.git
cd adaptive-gpu-hashtable

# Create a branch
git checkout -b feature/your-feature-name

# Make changes and test
nvcc -std=c++17 -O3 -arch=sm_80 your_test.cu -o test
./test

# Commit and push
git commit -m "Add your feature"
git push origin feature/your-feature-name

Known Issues

  • Memory Alignment: Ensure keys and values are properly aligned for optimal performance
  • Load Factor: Very high load factors (>0.95) may result in increased probe lengths and reduced performance
  • Key Restrictions: Keys cannot be 0xFFFFFFFFFFFFFFFF (EMPTY_KEY) or 0xFFFFFFFFFFFFFFFE (TOMB_KEY)
  • Concurrent Modifications: While thread-safe, extreme contention may reduce performance

Future Work

  • Support for custom hash functions
  • Dynamic resizing/rehashing
  • Additional probing strategies (quadratic, double hashing)
  • Multi-GPU support
  • Stream-based asynchronous operations
  • Integration with CUDA unified memory
  • Support for non-64-bit value types
  • Performance profiling and optimization tools

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this implementation in your research, please cite:

@software{adaptive_gpu_hashtable,
  title = {Adaptive GPU Hash Table},
  author = {Trivedi, Vishesh and Sathi, Vaigarai},
  year = {2025},
  url = {https://github.com/NerdyVisky/adaptive-gpu-hashtable}
}

Contact

For questions, issues, or contributions, please open an issue on GitHub.


Note: This implementation is optimized for NVIDIA GPUs with CUDA support. For CPU-based hash tables, consider using the reference implementation in src/hashtable_cpu.hpp or standard library containers.
