
Distributed Key Value Store System Inspired by Amazon DynamoDB

Design Principles

The design of this project follows Dynamo-style data partitioning, including the use of a consistent hash ring and replication policies, with a simplified consistency strategy. The communication framework uses gRPC.


Data Partitioning

The partitioning algorithm is based on a hash ring. Both nodes and keys are hashed into unique positions in the hash space, and these positions determine the key ranges each node is responsible for and which key range a key falls in.

The hashing function is Splitmix64, a fast 64-bit mixing function that produces deterministic, uniformly distributed tokens for nodes and keys. Because the hash is uniform, keys are spread evenly around the hash ring, which partitions them uniformly across the nodes' key ranges.
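Splitmix64's increment-and-mix step can be sketched as follows (a minimal Python sketch, assuming node and key identifiers have already been reduced to a 64-bit integer seed; the constants are the standard Splitmix64 constants):

```python
MASK64 = (1 << 64) - 1  # keep arithmetic in 64 bits, as in the C reference

def splitmix64(seed: int) -> int:
    # Advance the seed by the golden-ratio constant, then scramble it
    # with two multiply-xorshift rounds and a final xorshift.
    z = (seed + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return (z ^ (z >> 31)) & MASK64
```

The same seed always yields the same token, which is what makes node and key placement deterministic across the cluster.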

As an example, if node A is hashed to position 1 on the hash ring and node B to position 5 as the next node following A, then a key hashed to position 3 falls within the key range A–B and is therefore owned by node B, as illustrated in the figure below.

Figure 2: Partitioning and replication of keys in Dynamo ring.
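The owner lookup described above amounts to a binary search over the sorted node tokens, wrapping around at the top of the ring. A minimal sketch (function name is illustrative):

```python
import bisect

def primary_for(key_token: int, node_tokens: list[int]) -> int:
    # First node clockwise whose token is >= the key's token;
    # past the largest token, wrap around to the smallest one.
    tokens = sorted(node_tokens)
    i = bisect.bisect_left(tokens, key_token)
    return tokens[i % len(tokens)]
```

With the example above, `primary_for(3, [1, 5])` returns 5: the key at position 3 belongs to node B at position 5.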

The hashed value (token), IP address, port, and the version number of the hash ring used for maintaining consistency are stored in a Routing Table. This table is updated and maintained by the manager node, cached locally by each storage node, and fetched by the client during get or put operations.
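The routing-table fields listed above could be modeled roughly as below (field and type names are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RouteEntry:
    token: int   # node's hashed position on the ring
    ip: str      # node's IP address
    port: int    # node's gRPC port

@dataclass
class RoutingTable:
    version: int                                   # hash-ring version, used for consistency
    entries: list[RouteEntry] = field(default_factory=list)
```

Storage nodes cache a `RoutingTable` locally, and clients fetch one before routing a get or put.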


Data Replication

Replication is handled by walking the hash ring and assigning replica nodes after the primary node. The primary node is defined as the first node whose token is greater than or equal to the key’s token.

For example, when the replication factor is 3 and a key K falls within the key range A–B, the replicas are assigned to the next two nodes following the primary node on the ring.

Because placement is fully determined by the hash ring and all nodes follow the same rules, every node knows exactly which key ranges it should store after any membership change. This keeps the replication algorithm simple, reliable, and deterministic.
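The placement rule above can be sketched as a preference-list computation (a minimal sketch; the function name is illustrative, and `rf` is the replication factor):

```python
import bisect

def preference_list(key_token: int, node_tokens: list[int], rf: int = 3) -> list[int]:
    # Primary = first node with token >= key token; replicas are the
    # next rf-1 nodes clockwise around the ring.
    tokens = sorted(node_tokens)
    start = bisect.bisect_left(tokens, key_token) % len(tokens)
    n = min(rf, len(tokens))  # never assign more replicas than nodes
    return [tokens[(start + j) % len(tokens)] for j in range(n)]
```

For example, with nodes at tokens [1, 5, 9, 13] and rf = 3, a key at token 3 is stored on the nodes at 5 (primary), 9, and 13.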


Data Consistency

Manager Node Behavior

The manager maintains two routing tables:

  • Current Table: Represents the state that has been synchronized with storage nodes in the previous round.
  • Next Table: Represents the new system state that has not yet been synchronized (including newly added nodes, removed failed nodes, and rehashed tokens).

Separating the two tables keeps the system stable and prevents overlap between old and new states.

The manager enforces consistency with storage nodes at a fixed interval (every 5 seconds in this implementation). During each synchronization round, the next table becomes the current table, and the manager broadcasts the updated routing table to all storage nodes.

Each storage node then uses this table to perform consistency operations and fetch key ranges it is newly responsible for.
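The two-table round described above can be sketched like this (a simplified sketch, assuming the table is a plain dict and `broadcast` is a callback standing in for the gRPC fan-out; names are illustrative):

```python
import copy
import time

class Manager:
    def __init__(self, table: dict):
        self.current = table                     # state synced in the previous round
        self.next_table = copy.deepcopy(table)   # pending changes accumulate here

    def sync_round(self, broadcast) -> None:
        # Promote the next table, then push it to every storage node.
        self.current = self.next_table
        self.next_table = copy.deepcopy(self.current)
        broadcast(self.current)

    def run(self, broadcast, interval: float = 5.0) -> None:
        # Fixed-interval synchronization loop (5 seconds in this project).
        while True:
            self.sync_round(broadcast)
            time.sleep(interval)
```

Membership changes (joins, failures, rehashed tokens) are applied only to `next_table`, so in-flight operations always see a stable `current` table until the next round.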


Client Behavior

The client.put() function writes each key–value pair to all primary and replica nodes. However, partial failures can lead to inconsistent versions across replicas. For example, with a replication factor of 3, two nodes may successfully apply an update while the third fails.

To handle this, each key is associated with a timestamp. During client.get(), the client queries all nodes that hold a copy of the key and returns the value with the most recent timestamp, ensuring the newest version is selected regardless of replica inconsistencies. Any inconsistencies are later corrected during synchronization initiated by the manager.
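The read-side resolution above is last-write-wins on the timestamp, which might look roughly like this (a sketch; the function name is illustrative, and `None` stands for a replica that failed to respond):

```python
def resolve_read(replies):
    # replies: one (value, timestamp) tuple per replica holding the key,
    # or None for replicas that did not answer; newest timestamp wins.
    live = [r for r in replies if r is not None]
    if not live:
        return None  # no replica reachable
    value, _ts = max(live, key=lambda vt: vt[1])
    return value
```

So even if one of three replicas missed an update, the client still returns the newest value as long as any replica that applied it responds.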


Storage Node Behavior

When a storage node receives a synchronization message from the manager containing the current routing table, it computes all token ranges it is now responsible for, including both primary and replica ranges.

The node then contacts other nodes that previously owned these ranges and fetches all key–value pairs that fall within them. Among replicas, the latest value is selected based on timestamps.
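The timestamp-based merge of fetched replicas might be sketched as follows (illustrative; assumes each previous owner returns a `{key: (value, timestamp)}` map for the requested range):

```python
def merge_fetched(snapshots):
    # snapshots: one {key: (value, timestamp)} map per previous owner;
    # for each key, keep the copy with the newest timestamp.
    merged = {}
    for snap in snapshots:
        for key, (value, ts) in snap.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return merged
```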

To ensure consistency, storage nodes wait for all peers to complete the fetch phase using a barrier mechanism implemented via a blocking gRPC call to the manager. Once all acknowledgments are received (or a timeout occurs), the node installs the new routing table, removes keys that no longer belong to its ranges, and inserts newly fetched keys.

The use of timestamps guarantees correctness even if a replica node fails during an update, since newer timestamps always dominate older values.

Experiment Usage

  • Run make to compile the project.
  • Run make validate to run the correctness tests.
  • Run make perf to run the throughput and load-balance performance tests.
  • Run make clean to remove build files.
  • Run make kill to terminate the manager and storage processes.

References

  1. Rosetta Code contributors. (n.d.). Pseudo-random numbers/Splitmix64. Rosetta Code.
    Retrieved November 14, 2025, from https://rosettacode.org/wiki/Pseudo-random_numbers/Splitmix64

  2. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A.,
    Sivasubramanian, S., Vosshall, P., & Vogels, W. (2007).
    Dynamo: Amazon’s highly available key-value store.
    Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP’07), 205–220.

About

A scalable and fault-tolerant distributed key-value store system inspired by the Amazon Dynamo paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
