Is your feature request related to a problem? Please describe.
Partitioning a DS leads to more write-scaling; however, the write-scaling is bottlenecked by the amount of replication we do for a DS. With CNR, the best scaling factor we can get for a partitionable data-structure is min(cores per replica, partitions). Hence, if we vary the amount of replication dynamically (#52), we also need to be able to adjust the partitioning.
To clarify the formula: 4 sockets, 48 cores per socket:
- with a replication factor of four (48 cores per replica), it doesn't make sense to have more than 48 partitions
- with a replication factor of one (192 cores per replica), it makes sense to have up to 192 partitions
Describe the solution you'd like
Dynamic Partitions
Some thoughts on dynamically varying the partitions in CNR data-structures
Background: A usize is nice because
- We have a neutral element 0 (identity)
- We can merge multiple usize together with +
- and it's associative, which makes it a monoid, I guess
- It's also commutative (not sure if this matters)
- We can split it with subtraction (or division) operations
- Not sure which is better; division allows us to go from 1 -> n partitions
So let's say we have Partitionable(usize); we can…
- Make a new partition with Partitionable(0)
- Merge two partitions (e.g., drop one) with Partitionable(a) + Partitionable(b) = Partitionable(a+b)
- Split a partition into two, e.g., by doing subtraction: Partitionable(x) -> Partitionable(max(x - part, 0)), Partitionable(part)
And we still have some meaningful things we can define on it:
- Increment (on one partition)
- Sum (as a scan over all partitions)
[TODO: Need to figure out exactly what properties we need from Monoid, Ring etc.]
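A minimal Rust sketch of what such a Partitionable(usize) wrapper could look like. None of these names exist in the library; this is just an illustration, and split clamps the amount so the sum of both halves stays preserved:

```rust
/// Hypothetical wrapper around a usize that forms a (commutative) monoid
/// under addition: identity element 0, merge = +, split = subtraction.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partitionable(usize);

impl Partitionable {
    /// Neutral element: a fresh, empty partition.
    fn new() -> Self {
        Partitionable(0)
    }

    /// Merge two partitions (e.g., when dropping one of them).
    fn merge(self, other: Self) -> Self {
        Partitionable(self.0 + other.0)
    }

    /// Split off `part` from this partition; the sum of both halves
    /// equals the original value (the monoid "mass" is conserved).
    fn split(self, part: usize) -> (Self, Self) {
        let part = part.min(self.0);
        (Partitionable(self.0 - part), Partitionable(part))
    }

    /// Increment: a mutating op on a single partition.
    fn increment(&mut self) {
        self.0 += 1;
    }
}

/// Sum, expressed as a scan over all partitions.
fn sum(partitions: &[Partitionable]) -> usize {
    partitions.iter().map(|p| p.0).sum()
}

fn main() {
    let mut a = Partitionable::new();
    a.increment();
    let (a, b) = a.merge(Partitionable(41)).split(10);
    assert_eq!(sum(&[a, b]), 42);
}
```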
Background: OS page-tables
For a page-table in nrkernel, we partitioned statically; every PML4Entry (first level of the radix tree) was one partition.
For dynamic partitions, maybe the data-structure for every partition is a slice of [&PML4Entry(idx)] with the index going from 0..512
- The regular NR case is just [PML4Entry; 512]
- The original CNR case is 512 partitions of [PML4Entry; 1]
- The new case is something in between these two (including these two cases)
We have a neutral element: []
- We can make a new Partition by making a slice with no entries [] (neutral element)
- The sum of all the entries in all slices will always be <= 512; we could potentially make more by using lower levels, but there is always (even if it's just theoretical) a limit
- Not sure if "no entries" is better here, as it's technically the neutral element of slices
We can add things together by merging the slices [a] + [b] = [a, b], [] + [a] = [a]
- Unclear if we should require that the addition [a, b] + [c] only works if a < b < c and there is no "gap" (other PML4 entries between a and b or between b and c)?
- This would probably simplify the hash-function, which now needs to change dynamically (more on that later), but I don't think it's strictly necessary?
- We also have associativity (not sure yet if this is a strict requirement for us)
- Slice addition isn’t commutative (not sure if it would be nice to have this property)
We can split things up by “sub-slicing” [a,b] -> [a], [b] or [a] -> [a], []
We can define some operations on this
- Map (return an error if we would need to map somewhere we don't have a partition for in the slice; the hasher needs to ensure it routes to the correct partition, more on that below)
- Unmap (same as above)
- etc.
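A rough Rust sketch of the slice-based partition idea. All names below are made up for illustration (this is not the nrkernel page-table code), and merging simply concatenates and re-sorts; whether to require adjacency (no gaps) is the open question above:

```rust
/// Hypothetical dynamic partition of a page-table: it owns a set of
/// PML4 slots (indices 0..512). The neutral element is the empty set.
#[derive(Debug, Clone, PartialEq)]
struct Pml4Partition {
    /// PML4 indices owned by this partition (kept sorted).
    slots: Vec<usize>,
}

impl Pml4Partition {
    /// Neutral element: a partition that owns no PML4 entries.
    fn empty() -> Self {
        Pml4Partition { slots: Vec::new() }
    }

    /// Merge: [a] + [b] = [a, b]; concatenate and re-sort.
    fn merge(mut self, other: Self) -> Self {
        self.slots.extend(other.slots);
        self.slots.sort_unstable();
        self
    }

    /// Split by sub-slicing: [a, b] -> ([a], [b]).
    fn split_at(mut self, mid: usize) -> (Self, Self) {
        let right = self.slots.split_off(mid);
        (self, Pml4Partition { slots: right })
    }

    /// Map-like op: fail if the PML4 index does not fall into a slot
    /// owned by this partition (the hasher should prevent this).
    fn map(&self, pml4_idx: usize) -> Result<(), &'static str> {
        if self.slots.contains(&pml4_idx) {
            Ok(()) // ... would install the mapping here
        } else {
            Err("operation routed to the wrong partition")
        }
    }
}

fn main() {
    // Regular NR: a single partition owning all 512 PML4 entries.
    let all = Pml4Partition { slots: (0..512).collect() };
    // Something in between NR and per-entry CNR: two 256-entry partitions.
    let (left, right) = all.split_at(256);
    assert!(left.map(10).is_ok());
    assert!(right.map(10).is_err());
    // Merging restores the single-partition case.
    assert_eq!(left.merge(right).slots.len(), 512);
    assert_eq!(Pml4Partition::empty().slots.len(), 0);
}
```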
Some thoughts on implementing this
We need to be able to route the operations to the right partitions. Right now we have a static hash(op, &mut Vec<logIds>) function that the user provides for a data-structure.
In order for dynamic partitions to work, the hash function needs to be dynamically generated, which means the user should provide a higher-order function, e.g., something like this:
generate_hashfn(description of all partitions) -> fn hash(op, &mut Vec<LogIds>) { … } (TBD: not clear what "description of all partitions" looks like in code)
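One possible shape for such a higher-order function, sketched in Rust. The Op type, the partition description, and generate_hashfn itself are hypothetical, not the current library API:

```rust
/// Hypothetical op type for the page-table example.
enum Op {
    Map { pml4_idx: usize },
    Unmap { pml4_idx: usize },
}

type LogId = usize;

/// The "description of all partitions": for every partition (= log),
/// the set of PML4 indices it owns.
type PartitionDesc = Vec<Vec<usize>>;

/// Higher-order function: given the current partition layout, return a
/// hash function that routes an op to the log(s) it touches.
fn generate_hashfn(desc: PartitionDesc) -> impl Fn(&Op, &mut Vec<LogId>) {
    move |op: &Op, log_ids: &mut Vec<LogId>| {
        let idx = match op {
            Op::Map { pml4_idx } | Op::Unmap { pml4_idx } => *pml4_idx,
        };
        log_ids.clear();
        for (log_id, slots) in desc.iter().enumerate() {
            if slots.contains(&idx) {
                log_ids.push(log_id);
            }
        }
    }
}

fn main() {
    // Two partitions: PML4 entries 0..256 and 256..512.
    let hash = generate_hashfn(vec![(0..256).collect(), (256..512).collect()]);
    let mut logs = Vec::new();
    hash(&Op::Map { pml4_idx: 300 }, &mut logs);
    assert_eq!(logs, vec![1]);
}
```

When the partition layout changes, the library would presumably call generate_hashfn again with the new description and swap in the returned closure.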
Adding a new partition:
- We need to allocate a log for it
- We need to register with the log on all replicas
- We need to use the generate_hashfn to route operations to the new partition
- If the partition is made from the neutral element (e.g., 0 in the usize case) and not from merging some other partitions, then it might be that we're done here
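Put together, adding a partition could look roughly like the sketch below; Instance, Log, Replica, and all method names are placeholders for illustration, not the existing node-replication API:

```rust
struct Log;
struct Replica {
    registered_logs: Vec<usize>,
}

struct Instance {
    logs: Vec<Log>,
    replicas: Vec<Replica>,
    partition_desc: Vec<Vec<usize>>, // which keys/entries each partition owns
}

impl Instance {
    fn add_partition(&mut self) -> usize {
        // 1. Allocate a log for the new partition.
        self.logs.push(Log);
        let log_id = self.logs.len() - 1;
        // 2. Register the new log with all replicas.
        for replica in &mut self.replicas {
            replica.registered_logs.push(log_id);
        }
        // 3. The new partition starts as the neutral element (owns nothing).
        self.partition_desc.push(Vec::new());
        // 4. Regenerate the hash function so ops can be routed to the new
        //    partition once it is given something to own, e.g.:
        //    self.hashfn = generate_hashfn(self.partition_desc.clone());
        log_id
    }
}

fn main() {
    let mut instance = Instance {
        logs: Vec::new(),
        replicas: vec![Replica { registered_logs: Vec::new() }],
        partition_desc: Vec::new(),
    };
    assert_eq!(instance.add_partition(), 0);
}
```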
Merging two partitions:
- Let’s assume we only ever merge two partitions into one for simplicity
- We need a quiescent state for (at least) one of the partitions
- Make sure no new ops are incoming (e.g., take combiner lock of both replicas?)
- Then drain the log; apply all outstanding ops on all replicas
- What about scans? They might complicate things: not sure if we can just "drain" a log by applying all outstanding operations (a scan depends on having entries in multiple logs and one root log). A scan whose "root" is in another log depends on its entry in this log still being there once the scan executes, but now we're just about to throw this log away
- Maybe we can keep the log around for a while until all scans have completed?
- Not sure this is a good idea either; seems like a bad idea to have two logs for a single partition…
- Now we call a merge function that takes the partition and merges it with the one that we haven’t drained
- Now the log of the other partition can be dropped
- Finally, we need to use generate_hashfn to route ops to the new, merged partition
- We could use a scan operation to merge two partitions (a scan op is already getting a consistent view over n partitions)
- One limitation in the current code: we don't support scans over fewer than the maximum number of logs in the instance (a scan always affects all logs)
- Reason was: hard to reason about linearizability
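A very rough Rust sketch of this merge sequence. Everything below (types, locks, method names) is hypothetical, only a single "replica" is modelled, and the open questions about scans are ignored:

```rust
use std::sync::Mutex;

struct Op;
struct Log {
    outstanding: Vec<Op>,
}
struct Partition {
    value: usize, // e.g., the Partitionable(usize) example from above
}

struct PartitionHandle {
    combiner: Mutex<()>, // stand-in for the per-replica combiner locks
    log: Log,
    state: Partition,
}

/// Merge `src` into `dst` and drop `src`'s log.
fn merge_partitions(dst: &mut PartitionHandle, mut src: PartitionHandle) {
    // 1. Quiesce `src`: hold the combiner lock so no new ops are appended.
    let _guard = src.combiner.lock().unwrap();

    // 2. Drain `src`'s log: apply all outstanding ops (on all replicas).
    for _op in src.log.outstanding.drain(..) {
        // apply(op, &mut src.state) on every replica ...
    }

    // 3. Merge the drained partition into the one we did not drain.
    dst.state.value += src.state.value;

    // 4. `src` and its log are dropped here; afterwards the caller must
    //    regenerate the hash function to route ops to the merged partition.
}

fn main() {
    let mut dst = PartitionHandle {
        combiner: Mutex::new(()),
        log: Log { outstanding: Vec::new() },
        state: Partition { value: 40 },
    };
    let src = PartitionHandle {
        combiner: Mutex::new(()),
        log: Log { outstanding: vec![Op, Op] },
        state: Partition { value: 2 },
    };
    merge_partitions(&mut dst, src);
    assert_eq!(dst.state.value, 42);
}
```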
Splitting a partition into two
TBD
Interaction of replicas and partitions
- We eventually want both varying replicas and varying partitions (Dynamic Replication #52)
- This brings additional complexity due to NUMA affinity
- E.g., imagine having the page-table with 192 partitions (on the 4-socket machine) and one replica
- This has maximum write scalability
- Eventually we want to have read scaling and add 4 replicas
- Now when we create a copy, we need to be smart about cloning, e.g., we track the NUMA affinity of every partition and sort them so that each replica gets the partitions that are already NUMA-local and copies the other partitions that aren't (a small sketch follows below)
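A small Rust sketch of the NUMA-aware cloning idea; the node ids and the helper below are made up for illustration:

```rust
/// Hypothetical partition metadata: which NUMA node its memory lives on.
struct PartitionInfo {
    id: usize,
    numa_node: usize,
}

/// Given the existing partitions and the NUMA node of a new replica,
/// decide which partitions the replica can keep as-is (already local)
/// and which it has to copy over from a remote node.
fn plan_clone(partitions: &[PartitionInfo], replica_node: usize) -> (Vec<usize>, Vec<usize>) {
    let (local, remote): (Vec<_>, Vec<_>) =
        partitions.iter().partition(|p| p.numa_node == replica_node);
    (
        local.iter().map(|p| p.id).collect(),  // already NUMA-local, no copy
        remote.iter().map(|p| p.id).collect(), // must be copied to this node
    )
}

fn main() {
    // 192 partitions spread over the 4-socket machine (48 per socket).
    let parts: Vec<PartitionInfo> = (0..192)
        .map(|id| PartitionInfo { id, numa_node: id / 48 })
        .collect();
    // A new replica on socket 2 keeps 48 partitions and copies 144.
    let (keep, copy) = plan_clone(&parts, 2);
    assert_eq!(keep.len(), 48);
    assert_eq!(copy.len(), 144);
}
```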
Policies for adding/removing replicas and partitions
We need an interface so the client using the library can implement the policy for when to change partitions (and replicas, fwiw)
Example policies:
- Page-table
- read-write ratio (PMUs give approx reads), mmap gives writes
- Which cores the process runs on is also known
- Let the application decide
- FS
- Look at read:write ratio
- Here it makes less sense to have the application decide
- Scheduler
- Maybe change partitions based on HW ACPI core-hotplug events?
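A minimal Rust sketch of what such a policy interface could look like; the trait, Stats, Decision, and the thresholds are invented for illustration, not an existing API:

```rust
/// Hypothetical per-data-structure statistics the library could expose.
struct Stats {
    reads: u64,   // approximated, e.g., via PMUs
    writes: u64,  // e.g., mmap calls in the page-table case
    active_cores: usize,
}

/// What the policy asks the library to do next.
enum Decision {
    Nothing,
    AddPartition,
    MergePartitions,
    AddReplica,
    RemoveReplica,
}

/// Hypothetical interface the client implements to drive repartitioning
/// (and re-replication, fwiw).
trait ReconfigurationPolicy {
    fn decide(&mut self, stats: &Stats) -> Decision;
}

/// Example: a read:write-ratio based policy, e.g., for an FS.
struct ReadWriteRatioPolicy;

impl ReconfigurationPolicy for ReadWriteRatioPolicy {
    fn decide(&mut self, stats: &Stats) -> Decision {
        let ratio = stats.reads as f64 / stats.writes.max(1) as f64;
        if ratio > 100.0 {
            Decision::AddReplica // read-heavy: favor replication
        } else if ratio < 1.0 {
            Decision::AddPartition // write-heavy: favor partitioning
        } else {
            Decision::Nothing
        }
    }
}

fn main() {
    let mut policy = ReadWriteRatioPolicy;
    let d = policy.decide(&Stats { reads: 10, writes: 1_000, active_cores: 96 });
    assert!(matches!(d, Decision::AddPartition));
}
```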
API pseudo-code: adding and merging partitions
TBD
Describe alternatives you've considered
- Don't ever change partitions; set them to the number of cores in the system. What's the overhead? What are the downsides? Deadlocks(?); we might not know the number of cores in advance (hotplugging).