
Dynamic Partitioning #53

@gz

Description

Is your feature request related to a problem? Please describe.

Partitioning a DS leads to more write-scaling; however, the write-scaling is bottlenecked by the amount of replication we do for a DS. With CNR, the best scaling factor we can get for a partitionable data-structure is min(cores per replica, partitions). Hence, if we vary the amount of replication dynamically (#52), we also need to be able to adjust the partitioning.

To clarify the formula: 4 sockets, 48 cores per socket:

  • with a replication factor of four, it doesn't make sense to have more than 48 partitions
  • with a replication factor of 1, it makes sense to have 192 partitions
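
To make the arithmetic concrete, here is a tiny sketch (the helper name is made up for illustration, not part of CNR):

```rust
/// Hypothetical helper: the number of partitions beyond which write-scaling
/// no longer improves, since every replica applies every partition's log
/// with only its local cores.
fn max_useful_partitions(total_cores: usize, replicas: usize) -> usize {
    total_cores / replicas // cores per replica
}

fn main() {
    // 4 sockets x 48 cores, one replica per socket:
    assert_eq!(max_useful_partitions(192, 4), 48);
    // A single replica spanning all 192 cores:
    assert_eq!(max_useful_partitions(192, 1), 192);
}
```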

Describe the solution you'd like

Dynamic Partitions

Some thoughts on dynamically varying the partitions in CNR data-structures

Background: A usize is nice because

  • We have a neutral element 0 (identity)
  • We can merge multiple usize together with +
    • and it’s associative, which makes it a monoid I guess
    • It’s also commutative; not sure if this matters
  • We can split it with subtraction (or division) operations
    • Not sure which is better; division allows us to go from 1 -> n partitions

So let’s say we have Partitionable(usize), we can…

  • Make a new partition with Partitionable(0)
  • Merge two partitions (e.g., drop one) with Partitionable(a) + Partitionable(b) = Partitionable(a+b)
  • Split a partition into two, e.g., by doing subtraction: Partitionable(max(x - part, 0)) and Partitionable(part)

And we still have some meaningful things we can define on it:

  • Increment (on one partition)
  • Sum (as a scan over all partitions)
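
A minimal sketch of this monoid-ish structure (the Partitionable type and its methods here are hypothetical, just to make the new/merge/split/increment/sum idea concrete):

```rust
/// Hypothetical wrapper around a partitioned counter; 0 is the neutral element.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partitionable(usize);

impl Partitionable {
    /// Make a new (empty) partition from the neutral element.
    fn new() -> Self {
        Partitionable(0)
    }

    /// Merge two partitions with + (associative and commutative).
    fn merge(self, other: Self) -> Self {
        Partitionable(self.0 + other.0)
    }

    /// Split off `part` into its own partition via subtraction.
    fn split(self, part: usize) -> (Self, Self) {
        (Partitionable(self.0.saturating_sub(part)), Partitionable(part))
    }

    /// Increment: a write that touches a single partition.
    fn increment(&mut self) {
        self.0 += 1;
    }
}

/// Sum: a scan over all partitions.
fn sum(parts: &[Partitionable]) -> usize {
    parts.iter().map(|p| p.0).sum()
}

fn main() {
    let (mut a, b) = Partitionable(10).split(4);
    a.increment();
    assert_eq!(a.merge(b), Partitionable(11));
    assert_eq!(sum(&[a, b, Partitionable::new()]), 11);
}
```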

[TODO: Need to figure out exactly what properties we need from Monoid, Ring etc.]

Background: OS page-tables

For the page-table in nrkernel, we partitioned statically; every PML4Entry (first level of the radix tree) was one partition.
For dynamic partitions, maybe the data-structure for every partition is a slice of [&PML4Entry(idx)] with indices going from 0..512

  • The regular NR case is just [PML4Entry; 512]
  • The original CNR case is 512 partitions of [PML4Entry; 1]
  • The new case is something in between these two (including these two cases)

We have a neutral element: []

  • We can make a new Partition by making a slice with no entries [] (neutral element)
  • The sum of the entries across all slices will always be <= 512; we could potentially make more partitions by using lower levels of the tree, but there is always (even if it’s just theoretical) a limit
  • Not sure if no entries is better here, as it’s technically the neutral element of the slice

We can add things together by merging the slices [a] + [b] = [a, b], [] + [a] = [a]

  • Unclear if we should require that the addition [a, b] + [c] only works if a < b < c and there is no “gap” (other PML4 entries between a and b or between b and c)?
  • This would probably simplify the hash-function, which now needs to change dynamically (more on that later), but I don’t think it’s strictly necessary?
  • We also have associativity (not sure yet if this is a strict requirement for us)
  • Slice addition isn’t commutative (not sure if it would be nice to have this property)

We can split things up by “sub-slicing” [a,b] -> [a], [b] or [a] -> [a], []

We can define some operations on this

  • Map (return an error if we would need to map at a location for which the slice doesn’t contain the partition – the hasher needs to ensure it routes ops to the correct partition, more on that below)
  • Unmap (same as above)
  • etc.
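
To make the slice-of-entries idea concrete, here is a hedged sketch (the Pml4Partition type, its representation as a list of owned PML4 indices, and the operations are assumptions for illustration, not the nrkernel types):

```rust
/// Hypothetical partition: the set of PML4 indices (0..512) this partition owns.
/// The neutral element is the empty slice.
#[derive(Debug, Default, Clone)]
struct Pml4Partition {
    entries: Vec<usize>, // indices into the 512-entry PML4
}

impl Pml4Partition {
    /// Merge: [a] + [b] = [a, b]; [] + [a] = [a].
    fn merge(mut self, mut other: Self) -> Self {
        self.entries.append(&mut other.entries);
        self
    }

    /// Split by sub-slicing: [a, b] -> ([a], [b]).
    fn split(mut self, at: usize) -> (Self, Self) {
        let rest = self.entries.split_off(at);
        (self, Pml4Partition { entries: rest })
    }

    /// Map: error out if the PML4 index isn't owned by this partition;
    /// the hash function must never route such an op here.
    fn map(&mut self, pml4_idx: usize) -> Result<(), &'static str> {
        if self.entries.contains(&pml4_idx) {
            Ok(()) // the actual mapping would be installed here
        } else {
            Err("PML4 entry not owned by this partition")
        }
    }
}

fn main() {
    // Regular NR: all 512 entries in one partition; original CNR: 512
    // partitions of one entry each; anything in between is the new case.
    let all = Pml4Partition { entries: (0..512).collect() };
    let (mut left, right) = all.split(256);
    assert!(left.map(0).is_ok());
    assert!(left.map(300).is_err());
    assert!(left.merge(right).map(300).is_ok());
}
```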

Some thoughts on implementing this

We need to be able to route the operations to the right partitions. Right now we have a static hash(op, &mut Vec<logIds>) function that the user provides for a data-structure.

In order for dynamic partitions to work, the hash function needs to be dynamically generated, which means the user should provide a higher-order function, e.g., something like:
generate_hashfn(description of all partitions) -> fn hash(op, &mut Vec<LogIds>) { … } (TBD: not clear what "description of all partitions" looks like in code)
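
A hedged sketch of what such a higher-order function could look like (LogId, PartitionDesc, and the range-based "description of all partitions" are assumptions for illustration, not the existing CNR API):

```rust
/// Hypothetical log identifier.
type LogId = usize;

/// Hypothetical description of one partition: it owns a half-open range of
/// keys (e.g., PML4 indices) and is backed by exactly one log.
struct PartitionDesc {
    key_range: std::ops::Range<usize>,
    log: LogId,
}

/// The user-provided higher-order function: given the current partition
/// layout, return a routing function that fills in the target log ids.
fn generate_hashfn(parts: Vec<PartitionDesc>) -> impl Fn(usize, &mut Vec<LogId>) {
    move |key, log_ids| {
        log_ids.clear();
        for p in &parts {
            if p.key_range.contains(&key) {
                log_ids.push(p.log);
            }
        }
    }
}

fn main() {
    // Regenerated whenever the set of partitions changes.
    let hash = generate_hashfn(vec![
        PartitionDesc { key_range: 0..256, log: 0 },
        PartitionDesc { key_range: 256..512, log: 1 },
    ]);
    let mut logs = Vec::new();
    hash(300, &mut logs);
    assert_eq!(logs, vec![1]);
}
```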

Adding a new partition:

  • We need to allocate a log for it
  • We need to register with the log on all replicas
  • We need to use the generate_hashfn to route operations to the new partition
  • If the partition is made from the neutral element (e.g., 0 in the usize case) and isn’t made by merging some other partitions, then it might be that we’re done here (a rough sketch of these steps follows below)
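
Roughly, in code, the steps above could look like this (Log, Replica, register, and the router regeneration are hypothetical placeholders, not the node-replication API):

```rust
use std::sync::Arc;

/// Hypothetical per-partition log and replica handles.
struct Log;
struct Replica;

impl Replica {
    /// Hypothetical: make this replica consume entries from `log`.
    fn register(&self, _log: &Arc<Log>) {}
}

/// Hypothetical routing table, rebuilt via the user's generate_hashfn.
struct Router;
fn generate_hashfn(_num_partitions: usize) -> Router { Router }

/// Add a fresh partition created from the neutral element: it holds no
/// state yet, so no draining or copying of other partitions is required.
fn add_partition(replicas: &[Replica], num_partitions: &mut usize) -> (Arc<Log>, Router) {
    let log = Arc::new(Log); // 1. allocate a log for it
    for r in replicas {
        r.register(&log); // 2. register the log on all replicas
    }
    *num_partitions += 1;
    let router = generate_hashfn(*num_partitions); // 3. regenerate the hash fn
    (log, router)
}

fn main() {
    let replicas = vec![Replica, Replica];
    let mut parts = 4;
    let (_log, _router) = add_partition(&replicas, &mut parts);
    assert_eq!(parts, 5);
}
```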

Merging two partitions:

  • Let’s assume we only ever merge two partitions into one for simplicity
  • We need a quiescent state for (at least) one of the partitions
    • Make sure no new ops are incoming (e.g., take combiner lock of both replicas?)
  • Then drain the log; apply all outstanding ops on all replicas
    • What about scans? They might complicate things: not sure if we can just “drain” a log by applying all outstanding operations. A scan depends on having elements in multiple logs and one root log; a scan whose “root” is in another log depends on that entry still being there once it executes the scan, but now we’re just about to throw this log away
      • Maybe we can keep the log around for a while until all scans have completed?
        • Not sure this is a good idea either; seems like a bad idea to have two logs for a single partition…
  • Now we call a merge function that takes the partition and merges it with the one that we haven’t drained
  • Now the log of the other partition can be dropped
  • Finally, we need to use generate_hashfn to route ops to the new merged partition
  • We could use a scan operation to merge two partitions (the scan op is already getting a consistent view on n partitions)
    • One limit: the current code doesn’t support scans over fewer than the max number of logs in the instance (a scan always affects all logs)
      • Reason was: hard to reason about linearizability
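
A hedged sketch of that merge protocol (the combiner lock, draining, and merge step are simplified placeholders here, and scans are ignored entirely):

```rust
use std::sync::Mutex;

/// Hypothetical partition: state guarded by a combiner-style lock plus a log
/// of outstanding (not yet applied) operations.
struct Partition {
    state: Mutex<Vec<u64>>, // placeholder for the partition's data
    log: Vec<u64>,          // placeholder for outstanding log entries
}

/// Merge `victim` into `survivor`; `victim`'s log is dropped afterwards.
fn merge_partitions(survivor: &Partition, mut victim: Partition) {
    // 1. Quiesce: hold the (combiner) lock so no new ops land on the victim.
    let mut victim_state = victim.state.lock().unwrap();

    // 2. Drain: apply all outstanding ops from the victim's log.
    for op in victim.log.drain(..) {
        victim_state.push(op); // placeholder for "apply op on all replicas"
    }

    // 3. Merge the drained victim state into the surviving partition.
    survivor.state.lock().unwrap().append(&mut victim_state);

    // 4. The victim (and its log) is dropped when it goes out of scope;
    //    afterwards generate_hashfn must re-route ops to the survivor.
}

fn main() {
    let survivor = Partition { state: Mutex::new(vec![1]), log: vec![] };
    let victim = Partition { state: Mutex::new(vec![2]), log: vec![3] };
    merge_partitions(&survivor, victim);
    assert_eq!(*survivor.state.lock().unwrap(), vec![1, 2, 3]);
}
```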

Splitting a partition into two

TBD

Interaction of replicas and partition

  • We eventually want both, varying replicas and varying partitions (Dynamic Replication #52)
  • This brings additional complexity due to NUMA affinity
  • E.g., imagine having the page-table with 192 partitions (on the 4-socket machine) and one replica
    • This has maximum write scalability
  • Eventually we want to have read scaling and add 4 replicas
    • Now when we create a copy, we need to be smart about cloning: e.g., we track the NUMA affinity of every partition and sort them so that each replica keeps the partitions that are already NUMA-local and copies the other partitions that aren’t (see the sketch below)
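
A rough sketch of what NUMA-aware cloning could look like (the Partition type, its NUMA tag, and the selection logic are assumptions for illustration):

```rust
/// Hypothetical partition handle tagged with the NUMA node its memory lives on.
#[derive(Clone)]
struct Partition {
    numa_node: usize,
    // ... the partition's state would live here
}

/// When bringing up a replica on `replica_node`, reuse partitions that are
/// already NUMA-local and copy (and migrate) the ones that aren't.
fn partitions_for_new_replica(all: &[Partition], replica_node: usize) -> Vec<Partition> {
    all.iter()
        .map(|p| {
            if p.numa_node == replica_node {
                p.clone() // already local: cheap, no cross-socket copy needed
            } else {
                // remote: copy the state over and place it on replica_node
                Partition { numa_node: replica_node }
            }
        })
        .collect()
}

fn main() {
    // 192 partitions spread over 4 sockets (48 per socket).
    let parts: Vec<Partition> = (0..192).map(|i| Partition { numa_node: i / 48 }).collect();
    let replica1 = partitions_for_new_replica(&parts, 1);
    assert!(replica1.iter().all(|p| p.numa_node == 1));
}
```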

Policies for adding/removing replicas and partitions

We need an interface so the client using the library can implement the policy for when to change partitions (and replicas, for that matter).

Example policies:

  • Page-table
    • Read:write ratio (PMUs give approximate reads, mmap calls give writes)
      • Which cores the process runs on is also known
    • Let the application decide
  • FS
    • Look at read:write ratio
    • Here it makes less sense to have the application decide
  • Scheduler
    • Maybe change partitions based on HW ACPI core-hotplug events?
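
As a sketch of such an interface (the trait, the statistics it sees, and the example thresholds are all hypothetical, not an existing CNR API):

```rust
/// Hypothetical workload statistics the library could expose to a policy.
struct Stats {
    reads: u64,
    writes: u64,
    active_cores: usize,
}

/// Hypothetical policy interface: the client decides when to reconfigure.
trait ReconfigPolicy {
    /// Desired (replicas, partitions) for the observed workload.
    fn target(&self, stats: &Stats) -> (usize, usize);
}

/// Example: a read:write-ratio policy, e.g., for a file system.
struct RwRatioPolicy;

impl ReconfigPolicy for RwRatioPolicy {
    fn target(&self, stats: &Stats) -> (usize, usize) {
        if stats.reads > 4 * stats.writes {
            (4, stats.active_cores / 4) // read-heavy: favor replicas
        } else {
            (1, stats.active_cores) // write-heavy: favor partitions
        }
    }
}

fn main() {
    let p = RwRatioPolicy;
    assert_eq!(p.target(&Stats { reads: 100, writes: 5, active_cores: 192 }), (4, 48));
    assert_eq!(p.target(&Stats { reads: 10, writes: 50, active_cores: 192 }), (1, 192));
}
```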

API pseudo-code: adding and merging partitions

TBD

Describe alternatives you've considered

  • Don't ever change partitions; set them to the number of cores in the system. What's the overhead? What are the downsides? Deadlocks(?); we might not know the number of cores in advance (hotplugging).
