Conversation

@Playmaker3334

Summary

This PR adds GPU support to the Engram demo and fixes two bugs that prevented it from running on CUDA devices.

Bug Fixes

  1. Device mismatch - torch.from_numpy() in Engram.forward() always created CPU tensors, causing runtime errors when the model was on a GPU (see the sketch after this list)
  2. Index out of bounds - CompressedTokenizer._compress() performed no bounds checking and could crash on certain input_ids
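
A minimal sketch of the device fix; hashes_to_device and np_hashes are illustrative names, not the PR's actual code:

import numpy as np
import torch

# torch.from_numpy() always yields a CPU tensor; the fix is to move the
# result onto the caller's device before any mixed-device op can run.
def hashes_to_device(hash_array: np.ndarray, device: torch.device) -> torch.Tensor:
    return torch.from_numpy(hash_array).to(device)  # no-op when already on CPU

# Inside Engram.forward() this would look roughly like:
#   hashes = hashes_to_device(np_hashes, input_ids.device)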

New Features

HybridNgramHashMapping

Replaces NgramHashMapping with a proper nn.Module that supports both CPU and GPU:

if input_size < self.gpu_threshold:
    # NumPy path - lower overhead for small inputs
    return self._hash_cpu(input_ids)  # CPU counterpart to _hash_gpu (name assumed)
else:
    # PyTorch path - better throughput for large inputs
    return self._hash_gpu(input_ids)

Key changes:

  • Multipliers stored as register_buffer() for automatic device transfer
  • _hash_gpu() uses torch.bitwise_xor instead of numpy
  • Configurable threshold via EngramConfig.gpu_threshold
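
A condensed sketch of how the buffer and the GPU hash path fit together; the constructor signature and the exact mixing arithmetic are assumptions, only register_buffer(), torch.bitwise_xor, and gpu_threshold come from this PR:

import torch
import torch.nn as nn

class HybridNgramHashMapping(nn.Module):
    def __init__(self, num_hashes: int = 4, gpu_threshold: int = 4096, seed: int = 0):
        super().__init__()
        self.gpu_threshold = gpu_threshold
        gen = torch.Generator().manual_seed(seed)
        # A buffer (unlike a plain tensor attribute) follows .to()/.cuda(),
        # so the multipliers always end up on the same device as the model.
        self.register_buffer(
            "multipliers",
            torch.randint(1, 2**31 - 1, (num_hashes,), generator=gen),
        )

    def _hash_gpu(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Multiply-xor mixing expressed in torch ops only, so it runs on
        # whichever device input_ids and the buffer share.
        mixed = input_ids.unsqueeze(-1) * self.multipliers
        return torch.bitwise_xor(mixed, mixed >> 16)

Because multipliers is a buffer, model.to("cuda") carries it along automatically, which is what keeps the torch path device-safe without manual copies.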

CompressedTokenizer

  • compress_cpu(): fast numpy path with bounds checking
  • compress_gpu(): lazy tensor initialization, tracks device
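
Rough sketches of the two paths; the table-lookup formulation is an assumption, but the bounds checking, lazy initialization, and device tracking are the behaviors this PR adds:

import numpy as np
import torch

def compress_cpu(input_ids: np.ndarray, table: np.ndarray) -> np.ndarray:
    # Bounds check: clip out-of-range ids rather than indexing past the table.
    return table[np.clip(input_ids, 0, len(table) - 1)]

class _LazyGpuTable:
    def __init__(self, table: np.ndarray):
        self._np_table = table
        self._tensor = None  # built on first GPU call, not at construction

    def compress_gpu(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Lazily materialize the lookup tensor on the incoming ids' device,
        # and rebuild it if the ids later arrive on a different device.
        if self._tensor is None or self._tensor.device != input_ids.device:
            self._tensor = torch.from_numpy(self._np_table).to(input_ids.device)
        return self._tensor[input_ids.clamp(0, self._tensor.numel() - 1)]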

Benchmark Suite

New files for validation:

  • benchmark.py: measures latency across configs
  • test_correctness.py: verifies numerical equivalence
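
benchmark.py's internals aren't reproduced here, but measuring CUDA latency honestly requires explicit synchronization; the usual pattern looks like this:

import time
import torch

def time_forward(model, input_ids, warmup: int = 10, iters: int = 100) -> float:
    for _ in range(warmup):
        model(input_ids)
    if input_ids.is_cuda:
        torch.cuda.synchronize()  # don't start the clock with kernels still queued
    start = time.perf_counter()
    for _ in range(iters):
        model(input_ids)
    if input_ids.is_cuda:
        torch.cuda.synchronize()  # wait for the last kernel before stopping the clock
    return (time.perf_counter() - start) / iters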

Benchmark Results

Metric            Value
Mean speedup      1.02x
Median speedup    1.01x
Max speedup       1.09x
Memory delta      -0.03%

Tested on an NVIDIA GPU with batch sizes [2, 4, 8] and sequence lengths [128, 256, 512].

Backward Compatibility

  • Original engram_demo_v1.py unchanged
  • All original APIs preserved in optimized version
  • Numerical outputs match within rtol=1e-4, atol=1e-5
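
At those tolerances the equivalence check presumably reduces to torch.testing.assert_close; the actual harness in test_correctness.py is not shown here:

import torch

def check_equivalence(original_out: torch.Tensor, optimized_out: torch.Tensor) -> None:
    # Tolerances quoted above: rtol=1e-4, atol=1e-5.
    torch.testing.assert_close(optimized_out, original_out, rtol=1e-4, atol=1e-5)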

- Convert CUDA tensors to CPU before numpy conversion in CompressedTokenizer
- Fixes TypeError when running on GPU: 'can't convert cuda:0 device type tensor to numpy'
- Maintains backward compatibility with CPU-only usage
- Use actual tokenizer vocab size instead of config value
- Prevents IndexError when generating test data
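
The tokenizer-side fix amounts to routing CUDA tensors through .cpu() before handing them to NumPy, roughly:

import torch

def ids_to_numpy(input_ids):
    # NumPy can only view host memory, so copy CUDA tensors to CPU first;
    # CPU tensors and plain arrays pass through, preserving old behavior.
    if isinstance(input_ids, torch.Tensor):
        return input_ids.detach().cpu().numpy()
    return input_ids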