AdaptiveHashMapCLI is a teaching and benchmarking tool that turns abstract data structures into live, measurable systems.
hashmap_cli.py is a batteries-included command-line harness for exploring and validating three hash-map backends:
- TwoLevelChainingMap – cache-friendly two-level chaining tuned for heavy writes.
- RobinHoodMap – open addressing with tombstone-aware Robin Hood displacement.
- HybridAdaptiveHashMap – automatically migrates between chaining and Robin Hood based on live health metrics.
The toolset covers workload generation, profiling, CSV replays with live metrics, latency sampling, snapshot/restore flows (with safe repairs), and verification utilities. Everything ships in a single Python file so you can drop it into pipelines or run ad-hoc tests without extra dependencies.
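To make the adaptive behaviour concrete, here is a minimal sketch of the health check the hybrid wrapper performs before migrating. The class body and default group-length threshold are illustrative assumptions; only AdaptiveConfig, the threshold names, and the 0.80 chaining rehash point come from this README.

```python
# Illustrative sketch only: the real HybridAdaptiveHashMap tracks more state,
# but the migration trigger follows this shape (thresholds from AdaptiveConfig).
from dataclasses import dataclass

@dataclass
class AdaptiveConfig:
    max_lf_chaining: float = 0.80   # load-factor ceiling while on the chaining backend
    max_group_len: int = 8          # longest tolerated chaining group (illustrative default)

class HybridSketch:
    def __init__(self, cfg: AdaptiveConfig):
        self.cfg = cfg
        self.backend_name = "chaining"
        self.backend = {}           # stand-in for TwoLevelChainingMap

    def _health_check(self, load_factor: float, max_group_len: int) -> None:
        # Migrate once chaining trips either guardrail; Robin Hood handles probes afterwards.
        if self.backend_name == "chaining" and (
            load_factor > self.cfg.max_lf_chaining
            or max_group_len > self.cfg.max_group_len
        ):
            self._migrate_to_robinhood()

    def _migrate_to_robinhood(self) -> None:
        # The real code rehashes every live entry into a RobinHoodMap and
        # bumps the hashmap_migrations_total counter.
        self.backend_name = "robinhood"
```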
- hashmap_cli.py – main CLI and data-structure implementations
- audit.md – step-by-step validation checklist
- perf_*.json – JSON summaries from the most recent audit runs
- state.pkl.gz – snapshots generated during the audit
- data/ – reusable assets (workloads/ holds generated CSV workloads; config/ contains demo overrides).
- snapshots/ – serialized map states referenced in the audit (Robin Hood compaction, adaptive resume, etc.).
- docs/data_inventory.md – file tree with per-artifact explanations and which test produced each file.
- Python 3.9+ (tested with CPython 3.10).
- POSIX-like shell (macOS/Linux). PowerShell users should replace \ line continuations with backticks or single-line commands.
- Local port access when using the built-in metrics dashboard (default ports 9090/9091/9100). If your sandbox restricts binding ports, run the commands with elevated permissions.
python hashmap_cli.py -h # explore available subcommands
python hashmap_cli.py --mode adaptive put K V # single operation using adaptive backend
python hashmap_cli.py generate-csv --outfile workload.csv --ops 50000 --read-ratio 0.8
python hashmap_cli.py profile --csv workload.csv
python hashmap_cli.py --mode adaptive run-csv --csv workload.csv --json-summary-out perf.json
python hashmap_cli.py --mode adaptive run-csv --csv workload.csv --metrics-port 9090
- Dashboard: http://localhost:9090/
- Metrics: http://localhost:9090/metrics
Counters include:
- total ops/puts/gets/dels
- migrations, compactions, gauges for load factor
- max chaining group length, average Robin Hood probe distance
- an identifying backend label.
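Because the metrics endpoint serves plain text, you can scrape it with the standard library alone. The sketch below assumes Prometheus-style `name value` lines and uses hashmap_migrations_total, which is the only counter named verbatim in this README; other counter names may differ in the actual exposition.

```python
# Minimal scrape of the built-in metrics endpoint (assumes a run-csv process
# is serving on port 9090, as in the quick-start command above).
import urllib.request

def scrape_metrics(url: str = "http://localhost:9090/metrics") -> dict:
    counters = {}
    with urllib.request.urlopen(url, timeout=5) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip comment/help lines if present
            name, _, value = line.rpartition(" ")
            try:
                counters[name] = float(value)
            except ValueError:
                pass  # ignore lines without a numeric sample
    return counters

if __name__ == "__main__":
    snapshot = scrape_metrics()
    print("migrations so far:", snapshot.get("hashmap_migrations_total"))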
ADAPTIVE_MAX_LF_CHAINING=0.55 ADAPTIVE_MAX_GROUP_LEN=2 \
python hashmap_cli.py --mode adaptive run-csv \
--csv data/workloads/stress_big.csv \
--metrics-port 8000 \
--json-summary-out results/json/run_stress_big_tuned.json
- Open http://localhost:8000/ while it’s running.
- You’ll see the backend label flip between chaining and robinhood, the mgl chart jump to 2 on the chaining phases, and the ape chart populate once Robin Hood takes over.
- After the run finishes (about 14 s), run_stress_big_tuned.json confirms that migrations were triggered.
When running the tuned stress workload, the adaptive system automatically transitions from chaining to Robin Hood mode. In this run:
- Backend label flipped to robinhood after thresholds tripped.
- Max group length (mgl) chart dropped to zero, reflecting the move away from chaining.
- Average probe estimate (ape) chart populated with non-zero values once Robin Hood displacement began.
- Ops summary recorded ~570k operations with 3 migrations, confirming the adaptive engine was active.
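To confirm programmatically that migrations fired, the JSON summary can be inspected after the run. The key names below are placeholders, since the exact summary schema is not documented in this README; check the real file for the actual fields.

```python
# Quick sanity check of a run-csv JSON summary. The key names are assumptions;
# inspect results/json/run_stress_big_tuned.json for the real schema.
import json
from pathlib import Path

summary = json.loads(Path("results/json/run_stress_big_tuned.json").read_text())
print("total ops:", summary.get("total_ops", "<see actual schema>"))
print("migrations:", summary.get("migrations", "<see actual schema>"))
```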
Comprehensive end-to-end coverage lives in audit.md. It is organized into 12 sections plus an optional “demo” flow:
- CLI sanity checks.
- Workload generation (uniform, skewed, adversarial).
- CSV profiling to recommend backends.
- CSV replays with live metrics, JSON summaries, and per-op latency reservoirs.
- Proactive compaction ticks.
- Snapshot save/load/continue workflows (including verifying and resuming from snapshots).
- Offline compaction for RobinHood snapshots.
- Safe repair path for corrupted RobinHood snapshots.
- One-shot smoke tests across all modes.
- Adversarial stress workload with metrics streaming.
- Minimal interactive smoke checks.
- Failure-path verification (ensures compaction refuses incompatible snapshots).
- Optional “everything in one go” demo.
All items were executed successfully on 2025-09-22; checkmarks in audit.md reflect the exact commands run and artifacts produced. Step 6 notes that when splitting CSV files manually, you must keep the header row (op,key,value).
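If you do split a workload by hand (as Step 6 warns), a small helper can repeat the op,key,value header on every chunk so each piece stays replayable on its own. This is a convenience sketch, not part of the CLI.

```python
# Split a generated workload CSV into fixed-size chunks, repeating the
# op,key,value header so every chunk can be replayed independently.
import csv
from pathlib import Path

def split_workload(src: str, rows_per_chunk: int = 50_000) -> list[Path]:
    chunks = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)            # keep the op,key,value header
        buffer, idx = [], 0
        for row in reader:
            buffer.append(row)
            if len(buffer) == rows_per_chunk:
                chunks.append(_write_chunk(src, idx, header, buffer))
                buffer, idx = [], idx + 1
        if buffer:
            chunks.append(_write_chunk(src, idx, header, buffer))
    return chunks

def _write_chunk(src: str, idx: int, header, rows) -> Path:
    out = Path(src).with_suffix(f".part{idx}.csv")
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return out
```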
| Workload | Ops | Mode | Ops/s | Backend at Finish | p99 Latency (ms, overall) | JSON Summary |
|---|---|---|---|---|---|---|
| w_uniform.csv | 100k | adaptive | ~2.28k | chaining | 3.35 | results/json/perf_uniform.json |
| w_skew_adv.csv | 100k | adaptive | ~3.34k | chaining | 1.68 | results/json/perf_skew_adv.json |
| w_uniform_rest.csv (resume) | 50k | adaptive | ~1.47k | chaining | 4.13 | results/json/perf_rest.json |
| w_heavy_adv.csv | 200k | adaptive | ~2.54k | chaining | 0.89 | results/json/perf_heavy_adv.json |
| demo.csv (full demo) | 80k | adaptive | ~5.25k | chaining | 0.86 | results/json/demo_perf.json |
- Throughput ranges from ~1.5k to ~5.2k ops/s for the Python implementation on a single core, increasing with workload locality. These numbers align with expectations for a pure-Python, single-threaded benchmark.
- No adaptive migrations or compactions were triggered under the audited workloads; load factors and probe estimates stayed within configured guardrails.
- Per-op latency reservoirs captured GET sub-millisecond response times and PUT/DEL latencies in the ~0.8–3.4 ms range depending on load.
- Safety features verified: snapshots serialize without callback issues, Robin Hood repair paths compact tombstones, and compact-snapshot correctly rejects non-RobinHood inputs.
Snapshots now store a callback-free state. On load (including direct pickle usage), HybridAdaptiveHashMap.__setstate__ rebuilds a fresh AdaptiveConfig. Runtime callbacks for metrics are reattached automatically by _reattach_runtime_callbacks, ensuring migrations/compactions continue to update counters even when starting from restored snapshots.
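The pattern is standard pickle hygiene: drop unpicklable callbacks before serializing, rebuild configuration on load, and reattach runtime hooks afterwards. The sketch below shows the shape of that path; attribute and helper names other than __getstate__/__setstate__ are illustrative, not the actual class bodies.

```python
# Shape of the snapshot-safe serialization path described above.
class HybridAdaptiveHashMapSketch:
    def __init__(self, cfg):
        self.cfg = cfg
        self._on_migration = None      # runtime-only metrics callback

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_on_migration"] = None  # never pickle live callbacks
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Rebuild a fresh config so restored maps never carry stale callback
        # references; metrics hooks are reattached afterwards.
        self._on_migration = None

def reattach_runtime_callbacks(m, metrics):
    # Mirrors the role of _reattach_runtime_callbacks: wire counters back in
    # so migrations/compactions keep updating after a snapshot load.
    m._on_migration = lambda: metrics.__setitem__(
        "hashmap_migrations_total", metrics.get("hashmap_migrations_total", 0) + 1
    )
```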
- Ensure Python 3.9+ is installed and you are in the repository root.
- Run through audit.md, checking each item. The file already lists the commands with proper line continuations.
- For steps that bind HTTP ports, run with appropriate privileges if your environment blocks low-numbered or privileged ports.
- Verify that CSV/JSON/snapshot outputs match expectations (names are hard-coded in the checklist so results overwrite existing artifacts).
- Per-op latency reservoirs are on by default whenever --json-summary-out is provided; to reduce overhead you can raise --latency-sample-every or set --latency-sample-k 0 to disable sampling (see the sketch after this list).
- The dashboards load Chart.js via CDN, so viewing them in a browser requires outbound network access.
- compact-snapshot intentionally refuses non-RobinHood snapshots (as seen in the demo flow) to avoid corrupting incompatible formats.
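Per-op latency sampling of this kind is typically a fixed-size reservoir. The sketch below shows the textbook technique (Algorithm R plus a sampling stride); the parameter names are borrowed from the CLI flags, but the internals are assumed rather than taken from hashmap_cli.py.

```python
# Fixed-size latency reservoir in the spirit of --latency-sample-k /
# --latency-sample-every. Generic technique, not the CLI's actual code.
import random

class LatencyReservoir:
    def __init__(self, k: int = 1000, sample_every: int = 1):
        self.k = k
        self.sample_every = sample_every
        self.samples: list[float] = []
        self.seen = 0        # every operation observed
        self.candidates = 0  # operations that passed the sampling stride

    def observe(self, latency_ms: float) -> None:
        if self.k == 0:
            return                      # --latency-sample-k 0 disables sampling
        self.seen += 1
        if self.seen % self.sample_every:
            return                      # honor the sampling stride
        self.candidates += 1
        if len(self.samples) < self.k:
            self.samples.append(latency_ms)
        else:
            j = random.randrange(self.candidates)   # Algorithm R replacement
            if j < self.k:
                self.samples[j] = latency_ms

    def p99(self) -> float:
        ordered = sorted(self.samples)
        return ordered[int(0.99 * (len(ordered) - 1))] if ordered else 0.0
```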
The adaptive wrapper only migrates when chaining trips its health limits. Under many workloads it happily stays in chaining—ape stays at 0 and hashmap_migrations_total never increments. To demonstrate the Robin Hood leg:
# 1) Generate a highly adversarial, write-heavy workload
python hashmap_cli.py generate-csv \
--outfile stress.csv \
--ops 200000 \
--read-ratio 0.10 \
--key-skew 1.6 \
--key-space 2048 \
--adversarial-ratio 0.80 \
--adversarial-lowbits 12 \
--seed 42
# 2) (Optional) temporarily tighten thresholds in build_map()
# max_lf_chaining = 0.55 # below the chaining rehash at 0.80
# max_group_len = 2
# 3) Replay with metrics enabled
python hashmap_cli.py --mode adaptive run-csv \
--csv stress.csv \
--metrics-port 8000 \
  --json-summary-out run_stress.json

Watch http://localhost:8000/ during the run:
- The backend label flips to chaining -> robinhood (migrating) and the ape chart starts plotting non-zero values once Robin Hood takes over.
- hashmap_migrations_total in /metrics and the JSON summary will increment accordingly.
- If you only want to visualise probe distances, run the same CSV with --mode fast-lookup to stay on RobinHoodMap from the start.
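For readers new to the displacement idea, the core of Robin Hood insertion is a small loop: whenever the incoming entry has probed further than the resident entry, they swap slots, which is what keeps the average probe estimate (ape) bounded. The sketch below is the generic algorithm, not the tombstone-aware version in hashmap_cli.py.

```python
# Generic Robin Hood insertion (no tombstones, no resizing) to illustrate why
# probe distances stay low: "rich" entries yield their slot to "poorer" ones.
def robin_hood_put(table: list, key, value) -> None:
    cap = len(table)                 # table is a list of None or (key, value, dist)
    idx = hash(key) % cap
    entry = (key, value, 0)          # dist = how far the entry sits from home
    while True:
        slot = table[idx]
        if slot is None:
            table[idx] = entry
            return
        if slot[0] == entry[0]:
            table[idx] = entry       # overwrite existing key
            return
        if slot[2] < entry[2]:       # resident is richer: swap and keep probing
            table[idx], entry = entry, slot
        idx = (idx + 1) % cap
        entry = (entry[0], entry[1], entry[2] + 1)
```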
Prefer not to touch the source? Use environment variables to tighten the adaptive thresholds for a single run:
ADAPTIVE_MAX_LF_CHAINING=0.55 ADAPTIVE_MAX_GROUP_LEN=2 \
python hashmap_cli.py --mode adaptive run-csv \
--csv data/workloads/stress_big.csv \
--metrics-port 8000 \
  --json-summary-out results/json/run_stress_big_tuned.json

With these overrides the workload triggers migrations immediately.
- The latest audit captured results/json/run_stress_big_tuned.json (600k ops, three migrations, sustained throughput ~43k ops/s).
- Keep the CLI running while you open http://localhost:8000/; you’ll see the backend label toggle and the ape chart populate in real time.
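One plausible way the overrides are consumed is sketched below. Only the variable names ADAPTIVE_MAX_LF_CHAINING and ADAPTIVE_MAX_GROUP_LEN come from this README; the defaults and the actual parsing in build_map() belong to hashmap_cli.py and may differ.

```python
# Plausible env-var wiring for the adaptive thresholds used above.
import os

def thresholds_from_env(default_lf: float = 0.80, default_group_len: int = 8):
    # Defaults here are illustrative; the CLI's own defaults are authoritative.
    max_lf = float(os.environ.get("ADAPTIVE_MAX_LF_CHAINING", default_lf))
    max_group = int(os.environ.get("ADAPTIVE_MAX_GROUP_LEN", default_group_len))
    return max_lf, max_group

# Example: ADAPTIVE_MAX_LF_CHAINING=0.55 ADAPTIVE_MAX_GROUP_LEN=2 python hashmap_cli.py ...
```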
- Production-ready today: snapshot serialization hardened, metrics hooks reattached after load, metrics server lifecycle wrapped in try/finally + server_close().
- Next up: a C++ port using the same snapshot binary spec (Adaptive HMAP v1) so we can benchmark native throughput and keep cross-language compatibility.
- Integration roadmap: publish the binary snapshot schema, wire metrics into Grafana/Prometheus exporters, add JSONL streaming, and let CI pipelines diff latencies across implementations.
- Learning track: explore the audit.md demo flow for a quick “generate → profile → run → snapshot → verify → repair” walkthrough, then plug in the forthcoming C++ reader once available.
Contributions or experiments should update audit.md with new scenarios and attach fresh artifacts so future auditors can reproduce and benchmark the same paths.
This project is primarily a portfolio + research artifact. The intent is to demonstrate systems design, benchmarking, and adaptive data-structure engineering, not to provide a ready-made commercial library.
- If you are here to learn from the code (students, researchers, interview prep, portfolio reviewers): you are welcome to read, fork, and experiment.
- If you want to build products, services, or commercial tools using this code: you need explicit permission from the author.
- See LICENSE for the full text.
This repository is released under a dual-purpose license:
- GPLv3: for anyone reusing code in other open-source projects. GPL ensures derivatives must also remain open, preventing closed-source commercial use without permission.
- Commercial / production use: requires explicit written consent from the author.
Copyright (c) 2025 Justin Guida. All rights reserved.
This code is provided for educational and portfolio purposes only. Commercial or production use requires explicit permission.
GPLv3 was selected to balance openness and control:
- It allows others to study and modify the code in open contexts.
- It prevents companies from silently embedding this into proprietary products without sharing improvements.
- It keeps the project focused on its intent: research, education, and portfolio demonstration.