This repository was archived by the owner on Apr 30, 2020. It is now read-only.

Conversation

@JohnStrunk
Member

Describe what this PR does
This PR provides a proposed set of CRs for the Gluster operator. While this proposal encompasses much more than what an initial implementation will support, my goal is to make room for the features we know we will need to support eventually so we can incorporate them over time, maintaining compatibility of the CRD.

Is there anything that requires special attention?

  • The CRDs described here will tangentially interact with the StorageClass definition, currently open for review in Add design for StorageClass parameters gluster-csi-driver#30.
  • Though the main goal is moving toward GCS, I have attempted to incorporate sufficient information that this same CRD could be used to deploy a Heketi-based cluster.

While I expect this proposal to be fairly contentious, please try to offer concrete suggestions for improvement along with the criticisms. 🙏

Related issues:
Closes #3

Signed-off-by: John Strunk <jstrunk@redhat.com>
@ghost ghost assigned JohnStrunk Sep 28, 2018
@ghost ghost added the in progress label Sep 28, 2018
@JohnStrunk
Member Author

Folks who should take a look:

@jarrpa
Collaborator

jarrpa commented Sep 28, 2018

An off-hand remark to start after my first skimming: While "quiesce" is a technically appropriate term, I think it's a little too uncommon and hard to quickly understand. I think either of "inactive" or "disabled" would be better.

# hostPath mount
mode: device # (device | volume)
# In device mode, the pod would use the entire device(s) as a single VG
device: # (iff mode==device)
Member


Not sure, but I thought these belong to the Node CR? Is it guaranteed to be the same for all nodes in the zone?

Other than this, looks fine on the top level!

Member Author


That's a tough one. Here's my reasoning...
The Zone provides 2 functions:

  • A grouping of nodes in a single fault domain
  • A template for how to deploy nodes

It's the 2nd function that resulted in the whole storage subtree. The main option that is expected to be used is mode: volume to get block PVs. In this case, a standard device template for the whole Zone is fairly easy to see. mode: device is really an escape hatch for clusters that (1) don't yet support block PVs or (2) are using Heketi & the old stack. I expect even bare metal to use block PVs along w/ dynamic local provisioning once that is fully merged.

The other difficulty of using the Node object is that (as currently planned) it will always be created by the operator. In that case, there's no way to set the device list when the object is first created. Either we'd have to find a way to change the semantics or an admin would need to annotate each Node object after the operator creates it 👎 .

I do agree with you that having a fixed device list for a whole zone isn't desirable. The question is how common it is for the device list to vary by host. I would expect most environments to be fairly symmetric.

But here's an addl wrinkle that I hadn't considered...
We don't really want to use sd* designations since they are not stable. Instead, we'd need something in the /dev/disk/by-* tree, and these will vary by host. Perhaps the sd* can be used as a "template" in keeping w/ the Zone notion, and the actual IDs get used internally (and potentially annotated onto the Node object).

Collaborator


While targeting a symmetric and automated cluster model is definitely the way to go, we can't ignore that on-prem scenarios are especially likely to not be symmetric. I've run across a number of them when dealing with customer cases. We also can't deny the reality that local block PVs may not reach beta for a good while after we want the operator to be released. I say the way to do this is to leverage (or mimic?) some of the functionality of the local storage provisioner such that we provide a regular expression of devices that are valid for storage consumption.
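The local-storage-provisioner-style matching being suggested could look roughly like this (a sketch in Go; the function name and patterns are mine, not from the proposal):

```go
// Sketch of regex-based device selection: filter candidate device
// paths against admin-supplied patterns, similar in spirit to the
// local storage provisioner's discovery config.
package main

import (
	"fmt"
	"regexp"
)

// matchDevices returns the devices whose path matches any of the
// given patterns. Patterns are anchored so "/dev/sd[b-f]" cannot
// accidentally match "/dev/sdb1".
func matchDevices(devices, patterns []string) ([]string, error) {
	var res []*regexp.Regexp
	for _, p := range patterns {
		re, err := regexp.Compile("^" + p + "$")
		if err != nil {
			return nil, fmt.Errorf("bad device pattern %q: %v", p, err)
		}
		res = append(res, re)
	}
	var out []string
	for _, d := range devices {
		for _, re := range res {
			if re.MatchString(d) {
				out = append(out, d)
				break
			}
		}
	}
	return out, nil
}

func main() {
	found := []string{"/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/nvme0n1"}
	matched, _ := matchDevices(found, []string{"/dev/sd[b-f]"})
	fmt.Println(matched) // [/dev/sdb /dev/sdc]
}
```

The operator would feed the matched set to the next layer down; the pattern syntax itself would need to be pinned in the CRD docs.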

We don't really want to use sd* designations since they are not stable. Instead, we'd need something in the /dev/disk/by-* tree, and these will vary by host. Perhaps the sd* can be used as a "template" in keeping w/ the Zone notion, and the actual IDs get used internally (and potentially annotated onto the Node object).

Could we have the operator detect a /dev/disk name for a given /dev/sd* name under the covers? We could update the status of the specific Node CR to indicate which device names are actually in use.
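That detection could be sketched roughly like this (a hypothetical helper, not actual operator code): walk /dev/disk/by-id and map each stable symlink back to its /dev/sdX target, then record the stable names in the Node CR status.

```go
// Sketch: resolve a transient /dev/sdX name to its stable
// /dev/disk/by-id aliases on the local host.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// stableNamesFor returns the /dev/disk/by-id entries that resolve to
// dev (e.g. "/dev/sdb"). These names vary by host, which is exactly
// why they can't live in a Zone-level template.
func stableNamesFor(dev string) ([]string, error) {
	links, err := filepath.Glob("/dev/disk/by-id/*")
	if err != nil {
		return nil, err
	}
	var out []string
	for _, l := range links {
		target, err := filepath.EvalSymlinks(l)
		if err != nil {
			continue // stale symlink; skip it
		}
		if target == dev {
			out = append(out, l)
		}
	}
	return out, nil
}

func main() {
	names, err := stableNamesFor("/dev/sda")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```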

@JohnStrunk
Member Author

Here's something I'd like opinions on:
Currently, there is both a Cluster CR and a Zone CR that are mandatory. In many (most?) environments, the number of Zones would be small (1-3). Should we move the information of the Zone CR into the Cluster CR as a list?
Imagine adding the following to the spec section of the Cluster CR (taking the place of 3x Zone CRs in the AWS multi-az example):

zones:
  - name: az1
    management:
      mode: automatic
    automatic:
      minNodes: 3
      maxNodes: 99
      maxStorage: 100Ti
      freeStorageMin: 500Gi
      freeStorageMax: 2Ti
    storage:
      mode: volume
      volume:
        size: 1Ti
        storageClass: ebs-az1
  - name: az2
    management:
      mode: automatic
    automatic:
      minNodes: 3
      maxNodes: 99
      maxStorage: 100Ti
      freeStorageMin: 500Gi
      freeStorageMax: 2Ti
    storage:
      mode: volume
      volume:
        size: 1Ti
        storageClass: ebs-az2
  - name: az3
    management:
      mode: automatic
    automatic:
      minNodes: 3
      maxNodes: 99
      maxStorage: 100Ti
      freeStorageMin: 500Gi
      freeStorageMax: 2Ti
    storage:
      mode: volume
      volume:
        size: 1Ti
        storageClass: ebs-az3

Whatever would have gone into the status of the Zone CR could be placed into a corresponding list in the Cluster CR's status.

The benefit to the above change would be that the admin need only create a single CR to get a cluster up and running (or show their configuration). The down-side is that the CR is bigger w/ more fiddly indentation.
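As a rough sketch of what the folded-in shape might look like as Go types (field names are assumptions derived from the YAML above, not a finished API):

```go
// Sketch: zones become a named list inside the Cluster CR spec, and
// zone status entries mirror them by name in the Cluster CR status,
// replacing the standalone Zone CR.
package main

import "fmt"

type AutomaticSpec struct {
	MinNodes       int
	MaxNodes       int
	MaxStorage     string
	FreeStorageMin string
	FreeStorageMax string
}

type StorageVolume struct {
	Size         string
	StorageClass string
}

type StorageSpec struct {
	Mode   string // "device" | "volume"
	Volume *StorageVolume
}

// ZoneSpec is what previously lived in a standalone Zone CR.
type ZoneSpec struct {
	Name      string
	Automatic *AutomaticSpec
	Storage   StorageSpec
}

type ClusterSpec struct {
	Zones []ZoneSpec
}

// ZoneStatus mirrors spec.zones by name.
type ZoneStatus struct {
	Name         string
	CurrentNodes int
}

type ClusterStatus struct {
	Zones []ZoneStatus
}

func main() {
	spec := ClusterSpec{Zones: []ZoneSpec{
		{Name: "az1", Automatic: &AutomaticSpec{MinNodes: 3, MaxNodes: 99},
			Storage: StorageSpec{Mode: "volume", Volume: &StorageVolume{Size: "1Ti", StorageClass: "ebs-az1"}}},
		{Name: "az2"},
		{Name: "az3"},
	}}
	fmt.Println(len(spec.Zones)) // 3
}
```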

@jarrpa
Collaborator

jarrpa commented Oct 1, 2018

@JohnStrunk I say the ease of having only one CR is worth the increased length. In particular, I see the majority of users only having one Zone anyway.

@phlogistonjohn

phlogistonjohn commented Oct 1, 2018

First off, this is a really awesome start! :-)

I have a couple questions based on my first read-through:

  • One thing I expected to see but didn't was a mechanism to specify what images should be used. In other words, something that the operator would use to select what image to use for initial deployments and/or upgrades in the future. Understandably, we may not be doing upgrades right away but I think automatic mode would need something to key off of in terms of images & tags.
  • I am quite fuzzy on the "manual" mode. To me the name makes it sound like an alternative to external, but that's not the way it appears to function. Could you clarify what this mode is supposed to do, please?

(Edit: The next two points I wrote overlap a bit with what @JohnStrunk and @jarrpa were discussing above. Apologies for any redundancy, but rather than interject into that subthread I'll leave what I wrote intact but admit that I didn't read through all that when I wrote my comments)

  • There seems to be a bit of asymmetry between the node CRD and the node references in the "external" mode section. Perhaps I'm misreading things but it sounds like the external config would be statically defined in the spec and the operator would not create node CRs. Regardless of that, I wonder if there wouldn't be an advantage to the admin (manually) creating & editing node CRs for the external case. This way all node configuration is done at the same "level" in the CRD hierarchy you laid out at the top of the doc. One disadvantage of my idea is that it might be tedious for initial deployment with the need to type out the YAML for each and every node... but perhaps that could be solved by having some template support there.
  • Currently, you have put the storage section at the same level as "management" and the intent appears to have this section apply to all items in the zone. My concern with this is that the current system defines the devices at the level of the node. This approach is preferred if you can create cookie-cutter nodes, but I wonder if we shouldn't also have a per-node or even multiple named storage subsections. I think that a situation like this could easily arise in a bare-metal on-prem situation like, "I bought some servers last year and they came with N disks" then "I expanded my cluster this year and the new servers have M disks and P ssds".

To expand on my "multiple named subsections" idea above: I imagine that we could define the storage section with something like:

  storage:
    # the default storage policy
    default:
      mode: device
      device: 
        - /dev/sdb
        - /dev/sdc
    # storage policy to be used on oldservers (TBD how oldservers are matched with the pods)
    oldservers:
      mode: device
      device:
        - /dev/sdc
        - /dev/sdd
        - /dev/sde

If we wanted to do per node storage we could have a key like "storageOverride".

@jarrpa
Collaborator

jarrpa commented Oct 2, 2018

@phlogistonjohn I would modify your proposal to have spec.storage be a list, and each item in storage have a name key to identify. Then you could have spec.zone.storage be a list of names of storage that apply to that zone.

More broadly, to account for legacy scenarios where nodes have asymmetric storage topology, we could go as far as just having one Zone per node such that we can specify exact devices for each node. It's an abuse of the intent of the Zone, but one that's technically feasible given that Zones are (to the Operator) just logical distinctions and have no technical underpinnings (as far as Gluster itself is concerned).

@phlogistonjohn

I would modify your proposal to have spec.storage be a list, and each item in storage have a name key to identify. Then you could have spec.zone.storage be a list of names of storage that apply to that zone.

I understand using a list, it's six of one, half dozen of the other for me. I'm not sure what the 2nd sentence is getting at though. Could you provide an example of what you mean?

More broadly, to account for legacy scenarios where nodes have asymmetric storage topology, we could go as far as just having one Zone per node such that we can specify exact devices for each node. It's an abuse of the intent of the Zone, but one that's technically feasible given that Zones are (to the Operator) just logical distinctions and have no technical underpinnings (as far as Gluster itself is concerned).

That sounds a bit too hacky / clever to me. If I really thought asymmetry in the storage devices of the nodes would be a legacy thing you might be able to convince me. However, I really do think these sorts of asymmetries will occur naturally in the real world as clusters age & grow. If we need to have a strict matching of device layout to abstraction, it seems that maybe we need an additional level of abstraction between node and zone. This may not need its own CRD but it would prevent us from overloading the concept of zone beyond the natural understanding that users will have (roughly an Availability Zone) IMO.

@jarrpa
Collaborator

jarrpa commented Oct 2, 2018

I understand using a list, it's six of one, half dozen of the other for me. I'm not sure what the 2nd sentence is getting at though. Could you provide an example of what you mean?

A list would be more graceful in implementation, IMO. :) As far as the second sentence, think like volumes and volumeMounts in kube Pods. To put the two together, something like this:

spec:
  management:
    mode: external
    external:
      manual:
        nodes: 2
        storage:
        - storage1
  storage:
  - name: storage1
    mode: device
    device: 
      - /dev/sdb
      - /dev/sdc
  - name: storage2
    mode: device
    device:
      - /dev/sdc
      - /dev/sdd
      - /dev/sde

More broadly, to account for legacy scenarios where nodes have asymmetric storage topology, we could go as far as just having one Zone per node such that we can specify exact devices for each node. It's an abuse of the intent of the Zone, but one that's technically feasible given that Zones are (to the Operator) just logical distinctions and have no technical underpinnings (as far as Gluster itself is concerned).

That sounds a bit too hacky / clever to me. If I really thought asymmetry in the storage devices of the nodes would be a legacy thing you might be able to convince me. However, I really do think these sorts of asymmetries will occur naturally in the real world as clusters age & grow. If we need to have a strict matching of device layout to abstraction, it seems that maybe we need an additional level of abstraction between node and zone. This may not need its own CRD but it would prevent us from overloading the concept of zone beyond the natural understanding that users will have (roughly an Availability Zone) IMO.

In that case, we could expand the Zone spec to allow multiple entries in spec.management. To expand on the above:

spec:
  management:
  - mode: external
    external:
      manual:
        nodes: 2
        storage:
        - storage1
  - mode: automatic
    automatic:
      minNodes: 2
      maxNodes: 99
      storage:
      - storage2
  storage:
  - name: storage1
    mode: device
    device: 
      - /dev/sdb
      - /dev/sdc
  - name: storage2
    mode: volume
    volume:
      size: 500Gi
      storageClassName: my-az1-class

I'm not sure how likely it is to have mixed-management-mode clusters, but it should be technically feasible. More likely, as you said, you'd only use this for external clusters that grow over time.

@JohnStrunk
Member Author

Following up on some of the questions/discussion so far...

Since my question about merging the Zone w/ the Cluster got a positive response, I'm going to write up and push that change.

@phlogistonjohn You asked about automatic/manual/external... (I think you figured it out, so this is more for the benefit of others following along)
The idea is that there are 2 ways to run: in the kube cluster as pods or connecting to an external gluster cluster. "automatic" and "manual" would be for running as pods, and the difference between these is just that in auto mode, the operator can create & destroy nodes at will, whereas in manual mode, it maintains a fixed number of nodes in that zone.
External mode is for using an external cluster. In that case, you have to tell the operator where to find the nodes, and you also have to manage them yourself (create/destroy/upgrade).
Perhaps a better naming scheme would be: dynamic, fixed, external?

I, too, have a dislike for the asymmetry of the "external" nodes in the Zone. I'm trying to come up with a way to push the device list down to the Node CR (so we can use device UUIDs), and convert what is now at the Zone level into a "template" (as that's what it actually is). Perhaps with some re-arranging, it's possible to make the "external" nodes have the contact info in Node CRs, and have corresponding Zones with a "custom" setting (i.e., child Nodes are created manually and don't follow a template). 🤔

For storage.mode: device, there's just no safe way to use /dev/sd*. I was trying to figure out a way to, on first start, record the UUID of the device, but there's no guarantee that even on 1st start the devices are what the admin thinks they are. So, I'm starting to be rather pessimistic about finding a way to templatize mode=device at all.

@phlogistonjohn

In that case, we could expand the Zone spec to allow multiple entries in spec.management. To expand on the above:

I think I like this, although I think this "thing" needs some sort of name so we can talk about it. (Not entirely serious suggestions: corrals)

@phlogistonjohn You asked about automatic/manual/external... (I think you figured it out, so this is more for the benefit of others following along)

Actually, I hadn't, so this is great.

The idea is that there are 2 ways to run: in the kube cluster as pods or connecting to an external gluster cluster. "automatic" and "manual" would be for running as pods, and the difference between these is just that in auto mode, the operator can create & destroy nodes at will, whereas in manual mode, it maintains a fixed number of nodes in that zone.
External mode is for using an external cluster. In that case, you have to tell the operator where to find the nodes, and you also have to manage them yourself (create/destroy/upgrade).
Perhaps a better naming scheme would be: dynamic, fixed, external?

I agree a better name would help make the concept clearer. I also think this is something akin to the current approach taken by gk-deploy and openshift-ansible, no?
If that's so, the fixed count threw me a little bit. IIRC the current style is based on node labeling. If I'm not too off base, would it make sense to make that counter a minimum value (used to assist with upgrades / reboots) and an optional label? Perhaps a list of starter nodes?
As for naming, maybe something like: "predefined" or "restricted"?

I, too, have a dislike for the asymmetry of the "external" nodes in the Zone. I'm trying to come up with a way to push the device list down to the Node CR (so we can use device UUIDs), and convert what is now at the Zone level into a "template" (as that's what it actually is). Perhaps with some re-arranging, it's possible to make the "external" nodes have the contact info in Node CRs, and have corresponding Zones with a "custom" setting (i.e., child Nodes are created manually and don't follow a template). 🤔

+1 to the sentiment. I think what might be helpful here is to imagine what the user workflow would be. Since I think it's not particularly good UX to make someone type a lot of boilerplate, we could have a sort of initial template (initialNodes?) buried in the zone, but that always expands out to node CRs, and post-install we always manage via nodes?

For storage.mode: device, there's just no safe way to use /dev/sd*. I was trying to figure out a way to, on first start, record the UUID of the device, but there's no guarantee that even on 1st start the devices are what the admin thinks they are. So, I'm starting to be rather pessimistic about finding a way to templatize mode=device at all.

Hehe. This mirrors a discussion I have been part of at the heketi repo. I'm convinced that normal people think of their devices as /dev/sdX etc. But, that's not "safe" for long term use. But I think we're relatively safe if we are up front about saying these device nodes are what's used for initial discovery but not long term use. I think it could be sufficient if the operator only uses these devices to feed the next step in the management chain (e.g. heketi or gd2) and then perhaps rely on those subsystems to come back to the operator with a stable identifier?

@jarrpa
Collaborator

jarrpa commented Oct 3, 2018

I've finally taken the time to do a more thorough review of the proposal. Below are my thoughts as well as some repeat of prior discussion:

  • Cluster CRD:
    • Rename CRD to GlusterCluster
    • glusterCA:
      • Should probably have name and namespace fields, just to make parsing and readability a little easier.
      • Maybe rename to glusterCASecret.
      • It should also be optional.
    • Per previous discussion, I support the idea of having a spec.zones field in the cluster CRD. The field would be a list of what is currently the Zone CR spec, with an additional spec.zones[i].name field as an identifier.
  • Zone CR / zones spec:
    • Each Zone should have an optional disabled boolean field. If true, all nodes in that Zone should be marked as disabled and the Operator should not perform any actions on those nodes.
    • The current set of management types is a little confusing. I like @JohnStrunk's suggested names, but I think we could keep most of the originals and go with dynamic, manual, and external.
    • I brought up the idea of having multiple "management" entries in the Zone spec, so that we avoid using the Zone grouping for something other than its intended purpose (delineating failure domains). Thinking on it further, it may make sense to introduce the concept of "node management groups", and rename the management field to nodeTemplates.
    • In fact, we could move the management spec out of spec.zones into spec.nodeTemplates where each nodeTemplate has a name identifier, then have spec.zones[n].nodes which would be a list of node template names.
    • We technically don't really need the mode field at all. Speaking from Go, we could take a page from the k8s PersistentVolume spec and just have a ManagementType field that's anonymous in the parent struct, with fields of the format DynamicManagementType `json:dynamic,omitempty` such that we only need to specify the type field. The operator would have to check to make sure that only one management type field is specified, and fail otherwise.
    • nodeTemplate should be extended to have a stateVolume field that defines the type of storage that would be used to maintain each node's state. This field would only apply to dynamic and manual management types, and be ignored for external types. stateVolume should have the following fields:
      • local: boolean. If True, indicates that a hostPath volume should be used and all other stateVolume fields should be ignored. The Operator would take care of determining the path on the host.
      • claimName: A specific PVC name to use.
      • claimNamespace: PVC's namespace, optional
      • storageClassName: StorageClass for a dynamic PV. Ignored if either local or claimName are used.
      • size: Size of state volume. Defaults to 1GiB
        If no fields are specified, a 1GiB volume would be provisioned from the cluster's default StorageClass.
    • The external mode should allow for a name and namespace for the credentials Secret, much like I propose for glusterCA.
    • The storage field should be renamed to storageTemplates, with subfields of name, devices, and volumes.
      • The storageTemplates[n].devices list should behave like storage.device and be extended to accept regular expressions of device names.
      • The storageTemplates[n].volumes field should be a list of possible PV types. Fields would include:
        • storageClassName: a StorageClass name to provision dynamic PVs.
        • size: size of dynamic PVCs from storageClassName. Defaults to 1GiB
        • minCount: minimum number of dynamic PVs to provision, defaults to 1.
        • maxCount: maximum number of dynamic PVs to provision. If not specified, limit is 1 for manual nodes and -1 (no limit) for dynamic nodes.
          If no fields are specified, a 1GiB volume would be provisioned from the cluster's default StorageClass.
      • If a storageTemplate specifies neither devices nor volumes, the volumes defaults are assumed.
      • To reference a storageTemplate, a nodeTemplate should have a storage field, which is a list of storageTemplates. If the storage field is absent, a default storageTemplate is used that behaves with the volumes defaults.

Some examples that put together the above:

Minimalist Cluster CR

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  name: my-cluster
  namespace: gcs
spec:
  zones:
  - name: az1
    nodes:
    - nodes-dynamic
  nodeTemplates:
  - name: nodes-dynamic
    dynamic:
      minNodes: 2
      maxNodes: 99

Complex Cluster CR

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterCluster
metadata:
  name: my-cluster
  namespace: gcs
spec:
  clusterOptions:
    "cluster.halo-enabled": "yes"
  glusterCA: # optional
    name: ca-secret
    namespace: gcs # optional, defaults to GlusterCluster's namespace
  zones:
  - name: az1
    nodes:
    - nodes-dynamic
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - us-west-2a
  - name: az2
    nodes:
    - nodes-dynamic
    disabled: True
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              values:
                - us-east-2a
  - name: az3
    nodes:
    - nodes-manual
    - nodes-external
  nodeTemplates:
  - name: nodes-dynamic
    dynamic:
      minNodes: 2
      maxNodes: 99
      storage:
      - storage1
  - name: nodes-manual
    manual:
      nodes: 2
      storage:
      - storage2
    stateVolume:
      storageClassName: manual-state-class
  - name: nodes-external
    external:
      nodes:
      - 1.2.3.4
      - some.node.net
      credentialsSecret: # required
        name: ssh-secret
        namespace: gcs # optional, defaults to the GlusterCluster's namespace
  storageTemplates:
  - name: storage1
    volumes:
    - size: 500Gi
      storageClassName: az1-class
      minCount: 1
      maxCount: 100
  - name: storage2
    devices: 
    - /dev/sdb
    - "/dev/sd[d-f]"
  • Node CRD:
    • CRD should be renamed to GlusterNode.
    • Metadata should include the node group, as defined above for the cluster CRD.
    • targetState could be renamed to desiredState
      • Available states should be active, disabled, and unavailable. Only active and disabled are permitted for spec.desiredState, while all three may be present in status.currentState. unavailable in this case indicates that the Operator can currently not perform management operations on the node but would keep trying until the node comes back, is configured disabled, or is deleted. If the Node CR is marked for deletion but is in an unavailable state, the Operator would simply remove the node from the cluster configuration and allow the resource deletion to succeed.
      • desiredState would default to disabled if the owning cluster CR's spec.zones[n].disabled field is True, and otherwise default to active.
    • The spec should allow for node-specific overrides of configurations in the cluster CR, namely spec.storage, spec.nodeAffinity, and spec.credentialsSecret. The overrides would behave differently depending on the node's management mode:
      • dynamic: all spec fields are immutable except for spec.desiredState
      • manual: all fields are mutable and optional, would override the cluster configs if present
      • external: all fields are mutable, with most being optional except for spec.storage
    • spec.storage would be a list that behaves as the cluster CRD field spec.storageTemplates, with the following exceptions:
      • spec.storage entries would not have a name field.
      • spec.storage[n].volumes entries would be extended to recognize claimName and claimNamespace fields that identify a specific PVC for use as a Block volume.

Examples for the above:

Node CR - Dynamic (created by Operator, immutable except for spec.desiredState)

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: az1-nodes-dynamic-001
  namespace: gcs
  cluster: my-cluster
  zone: az1
  nodeGroup: nodes-dynamic
spec:
  desiredState: active
status:
  currentState: active

Node CR - Manual (created by Operator, mutable by admin)

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: az1-nodes-manual-001
  namespace: gcs
  cluster: my-cluster
  zone: az3
  nodeGroup: nodes-manual
spec:
  storage:
  - volumes: 
    - claimName: block-pvc
      claimNamespace: gcs # optional, defaults to the GlusterNode's namespace
  desiredState: disabled
status:
  currentState: disabled

Node CR - External (created by admin, mutable by admin)

apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: some.node.net
  namespace: gcs
  cluster: my-cluster
  zone: az3
  nodeGroup: nodes-external
spec:
  credentialsSecret: # required
    name: ssh-secret
    namespace: gcs # optional, defaults to the GlusterNode's namespace
  storage:
  - devices: 
      - /dev/sdb
  desiredState: active
status:
  currentState: unavailable
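The state rules above could be encoded roughly like this (a sketch; the helper names are mine, not part of the proposal):

```go
// Sketch of the proposed node states: active and disabled are
// admin-settable in spec.desiredState, while unavailable may only
// appear in status.currentState.
package main

import "fmt"

type NodeState string

const (
	StateActive      NodeState = "active"
	StateDisabled    NodeState = "disabled"
	StateUnavailable NodeState = "unavailable"
)

// validDesired reports whether s is allowed in spec.desiredState.
func validDesired(s NodeState) bool {
	return s == StateActive || s == StateDisabled
}

// defaultDesired implements the proposed defaulting: disabled if the
// owning zone is disabled, otherwise active.
func defaultDesired(zoneDisabled bool) NodeState {
	if zoneDisabled {
		return StateDisabled
	}
	return StateActive
}

func main() {
	fmt.Println(validDesired(StateUnavailable)) // false
	fmt.Println(defaultDesired(true))           // disabled
}
```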

This all seems somewhat fiddly, but I think it's worth it to try and strike a balance between flexibility and maintenance overhead. If we wanted, we could ship utilities to generate examples inside the Operator container, such that you could run something like kubectl exec <anthill_pod> /gen-cluster-template.sh > cluster.yaml and have something to start from.

I have some more thoughts on cluster/node state, but I'll save those for another comment. :)

@JohnStrunk
Member Author

Sorry to have disappeared for so long on this one...

@phlogistonjohn asked:

One thing I expected to see but didn't was a mechanism to specify what images should be used.

Since the operator needs to be able to manage upgrades (potentially a multi-step process walking across versions of server and csi pods), it needs to have a manifest knowing what transitions are legal (and tested). This precludes allowing the admin to specify arbitrary images & tags. The actual info will be a part of the manifest bundled w/ the operator. This is similar to how OLM works.

@JohnStrunk
Member Author

@jarrpa
The configurability seems to have exploded. :)
If I back up a step here, I think we want 3 possible configurations for a gluster pod:

  1. Usage of block PV (desired)
  2. Usage of device in /dev (as stopgap until 1 is everywhere)
  3. External nodes

I don't think I want to expose directly specifying the PVCs that would be used in (1). Do you have a use case in mind where that is important?

@JohnStrunk
Member Author

We technically don't really need the mode field at all. Speaking from Go, we could take a page from the k8s PersistentVolume spec and just have an ManagementType field that's anonymous in the parent struct...

If I'm understanding correctly this would be:

management:
  automatic:
    # stuff for auto

or

management:
  manual:
    # stuff for manual

but an error if

management:
  automatic:
    # stuff
  manual:
    # stuff

so the mode: auto|manual becomes superfluous. I like it. 👍
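In Go, the pattern being referenced could be sketched like this (type and field names are assumptions, mirroring the PersistentVolumeSource idiom rather than any merged API):

```go
// Sketch: management modes as an embedded union. The populated field
// implies the mode, so no separate "mode" key is needed; the operator
// validates that exactly one field is set and fails otherwise.
package main

import (
	"errors"
	"fmt"
)

type AutomaticMgmt struct {
	MinNodes int `json:"minNodes"`
	MaxNodes int `json:"maxNodes"`
}

type ManualMgmt struct {
	Nodes int `json:"nodes"`
}

// ManagementType would be embedded anonymously in the parent spec,
// the way PersistentVolumeSource is embedded in PersistentVolumeSpec.
type ManagementType struct {
	Automatic *AutomaticMgmt `json:"automatic,omitempty"`
	Manual    *ManualMgmt    `json:"manual,omitempty"`
}

// Validate enforces that exactly one management mode is populated.
func (m ManagementType) Validate() error {
	n := 0
	if m.Automatic != nil {
		n++
	}
	if m.Manual != nil {
		n++
	}
	if n != 1 {
		return errors.New("exactly one of automatic or manual must be set")
	}
	return nil
}

func main() {
	ok := ManagementType{Automatic: &AutomaticMgmt{MinNodes: 3, MaxNodes: 99}}
	both := ManagementType{Automatic: &AutomaticMgmt{}, Manual: &ManualMgmt{}}
	fmt.Println(ok.Validate())   // <nil>
	fmt.Println(both.Validate() != nil) // true
}
```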

@JohnStrunk
Member Author

I'm convinced that normal people think of their devices as /dev/sdX etc. But, that's not "safe" for long term use. But I think we're relatively safe if we are up front about saying these device nodes are what's used for initial discovery but not long term use. I think it could be sufficient if the operator only uses these devices to feed the next step in the management chain (eg. heketi or gd2) and then perhaps rely on those subsystems to come back to the operator with a stable identifier?

I think this is really dangerous. The CRs hold the authoritative, declarative state. The only (almost) safe thing I can think of would be for the operator to resolve to UUIDs, then overwrite the sdX w/ UUIDs in the CR. Putting it in the status: doesn't fix the problem, nor does trying to preserve an sdX==>UUID mapping someplace. It works if nothing will ever be changed in the future (add or remove), but breaks as soon as changes are allowed.
Complicating this is that the operator modifying the admin's settings is frowned upon (and a race).

@JohnStrunk
Member Author

So we're talking about splitting the "template" functionality from the "failure domain" specification. Is this really necessary? I feel like it's bringing in added room for misconfiguration.
Leaving them tied together means you can't put "dev"-backed pods in the same domain as "pv"-backed or external, but does that really matter?

@phlogistonjohn brought up different templates for different server configs... if we mandate UUID device specification, there aren't really any templates when using /dev, so they could all go in the same zone. Dev-based zones would require manual creation of Node CRs, so they need not be homogeneous, and the operator wouldn't overwrite them.

@jarrpa
Collaborator

jarrpa commented Oct 11, 2018

The configurability seems to have exploded. :)

It's a bad habit of mine. ;) See my openshift-ansible work.

If I back up a step here, I think we want 3 possible configurations for a gluster pod:

  1. Usage of block PV (desired)
  2. Usage of device in /dev (as stopgap until 1 is everywhere)
  3. External nodes
    I don't think I want to expose directly specifying the PVCs that would be used in (1). Do you have a usecase in mind where that is important?

As always, I'm trying to think about on-prem edge cases, because that's where the challenge is going to lie. I'm thinking of situations where 1. We want to use local block PVs, 2. Dynamic provisioning of local block PVs doesn't exist, and 3. The admin wants to ensure that very specific drives are used. We should also consider the possibility that users will come up with their own reasons we can't think of right now for not wanting to use StorageClasses, in whatever management mode. I can kinda guess at the code path for this when we already have the SC-based path, and it doesn't seem like it would be bad to implement and maintain.

I think this is really dangerous. The CRs hold the authoritative, declarative state. The only (almost) safe thing I can think of would be for the operator to resolve to UUIDs, then overwrite the sdX w/ UUIDs in the CR. Putting it in the status: doesn't fix the problem, nor does trying the preserve a sdX==>UUID mapping someplace. It works if nothing will ever be changed in the future (add or remove), but breaks as soon as changes are allowed.
Complicating this is that the operator modifying the admin's settings is frowned upon (and a race).

I imagine this is something the local block provisioner team has hit. How are they dealing with it? Personally I'm fine with letting the admin put in whatever device path they want, so long as we provide a stern warning somewhere in our spec docs that they REALLY SHOULD use UUIDs. Especially for throw-away environments, let 'em have /dev/sdX.

So we're talking about splitting the "template" functionality from the "failure domain" specification. Is this really necessary? I feel like it's bringing in added room for misconfiguration.

I think it makes things easier to understand. It also splits the functionality so that those who aren't interested in configuring failure domains don't have to do anything with them. If no zone is specified, just assume all nodeTemplates are part of a single default zone.

My default stance is always against artificially constraining configurations "for the user's benefit". It is a disservice to competent and power users who want to do potentially interesting things with our technology. We always have the authority to say that whatever they're trying is unsupported and/or won't be fixed. :)

Leaving them tied together means you can't put "dev"-backed pods in the same domain as "pv"-backed or external, but does that really matter?

I don't think that mandating they be split is of any meaningful benefit or importance.

@phlogistonjohn brought up different templates for different server configs... if we mandate UUID device specification, there aren't really any templates when using /dev, so they could all go in the same zone. Dev-based zones would require manual creation of Node CRs, so they need not be homogeneous, and the operator wouldn't overwrite them.

I'm not sure I quite understand the meaning of "server templates" here. Is that different from the concept I presented of just have different node templates, each with their (one and only one) management mode/type? Also, as above, I see no concrete reason to prevent users from using dev letter names so long as we make it clear in all our examples and documentation that UUIDs are safer.

@JohnStrunk
Member Author

I imagine this is something the local block provisioner team has hit. How are they dealing with it? Personally I'm fine with letting the admin put in whatever device path they want, so long as we provide a stern warning somewhere in our spec docs that they REALLY SHOULD use UUIDs. Especially for throw-away environments, let 'em have /dev/sdX.

They are going to permit specifying either way.

I went looking for a good explanation of the options and found this: https://wiki.archlinux.org/index.php/persistent_block_device_naming
After reading that page, I'm not sure there's a good answer that would work in all cases. Further, the UUID I was hoping to use... I have now learned it's actually pulled from the file system, so it's useless for raw block devices. Even items in by-id such as wwn or ata* aren't guaranteed to be there.

I think our only option is to let them specify whatever path they want and warn them that it must be stable.
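One way an operator could surface that warning is a simple heuristic check on the supplied paths. This is purely an assumption for illustration (not part of the proposed CRD): udev's /dev/disk/* schemes (by-id, by-path, by-partuuid, ...) are stable across reboots when present, while bare /dev/sdX names can be reshuffled.

```go
// Naive "is this device path persistent?" heuristic, for illustration
// only. Paths under /dev/disk/ use udev's persistent naming schemes;
// bare kernel-enumerated names like /dev/sdX may change across reboots.
package main

import (
	"fmt"
	"strings"
)

// isLikelyStable reports whether a device path uses a persistent
// naming scheme rather than a kernel-enumerated name.
func isLikelyStable(devicePath string) bool {
	return strings.HasPrefix(devicePath, "/dev/disk/")
}

func main() {
	for _, p := range []string{
		"/dev/sda",
		"/dev/disk/by-id/wwn-0x5000c500a1b2c3d4",
	} {
		fmt.Printf("%s stable=%v\n", p, isLikelyStable(p))
	}
}
```

As noted above, even this is best-effort: by-id entries are not guaranteed to exist for every device, so the check could only warn, never reject.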

@jarrpa
Collaborator

jarrpa commented Oct 15, 2018

After reading that page, I'm not sure there's a good answer that would work in all cases. Further, the UUID I was hoping to use... I have now learned it's actually pulled from the file system, so it's useless for raw block devices. Even items in by-id such as wwn or ata* aren't guaranteed to be there.

I think our only option is to let them specify whatever path they want and warn them that it must be stable.

Oof... yeah, all right.

Any chance you can put together an update to the design doc with the things you're willing to incorporate? Or any chance we can collaborate on a branch somewhere so I can propose commits?

@JohnStrunk
Member Author

Any chance you can put together an update to the design doc with the things you're willing to incorporate? Or any chance we can collaborate on a branch somewhere so I can propose commits?

I'm aiming for tomorrow. I need to get this and the csi driver StorageClass definition done for GCS/0.2.


A given Gluster cluster is defined by several different Custom Resources (CRs)
that form a hierarchy. At the top level is the "cluster" CR that describes
cluster-wide configuration options such as the "name" for the cluster, the TLS
Collaborator

@JohnStrunk can you provide some details about "name" ?

Member Author

The "name" is the name field in the Cluster (soon to be GlusterCluster) CR. It would show up as the Service name to which CSI connects to in order to access this Gluster cluster.

that form a hierarchy. At the top level is the "cluster" CR that describes
cluster-wide configuration options such as the "name" for the cluster, the TLS
credentials that will be used for securing communication, and peer clusters for
geo-replication.
Collaborator

peer cluster -> Does it mean a separate TSP?

Member Author

"Peer clusters"... yes separate TSPs, usable as a georep target.

The "peering" for georep is handled at the cluster-level just as georep is set up at the cluster-level today for standalone. Once the peering is done, individual volumes can be replicated. This would be configured via StorageClass parameters that would list the name of this peer cluster and maybe some other options for the volume-level configuration of the replication.

where (DNS name) to find existing nodes (for independent mode).

Below the zone definitions are individual node definitions that track the state
of the individual Gluster pods. Manipulating these objects permits an
Collaborator

Can we add some details on mentioned state?

Member Author

The Zone and Node specs are being reworked based on feedback from @jarrpa and @phlogistonjohn but the state of the individual pods are still tracked. This interacts w/ the node state machine (#36). As well, it will hold the desired configuration for the node. This Node object gets turned into a (currently) StatefulSet of replica=1 to run a Gluster pod (see details of GCS deployment for the StatefulSet stuff)


```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: Cluster
Collaborator

@JohnStrunk IMO, it would be better if the kind somehow mentioned that it is a Gluster cluster.

Collaborator

+1

Member Author

Will become GlusterCluster

replication: # (optional)
# Credentials for using this cluster as a target
credentials: secretRef
targets:
Collaborator

target->name is a cluster name, isn't it? There could be scenarios where one volume is geo-replicated between clusters foo and bar, with foo as source and bar as target, and another volume from bar to foo. So isn't it a good idea to have an optional volname field in targets?

Member Author

This just sets up the peering between clusters; it doesn't actually cause any replication to happen. If foo can replicate to bar, then in foo's GlusterCluster, it would list bar as a target. If the reverse is also possible, bar's GC would list foo also.

Actual volume replication is handled at the SC level.

kind: Cluster
metadata:
# Name for the Gluster cluster that will be created by the operator
name: my-cluster
Collaborator

@humblec humblec Oct 17, 2018

@JohnStrunk Isn't it a good idea to group zones (for example, zone 1 is high-end RAM systems and zone 2 has powerful processors)? If yes, how can we achieve that with the current Cluster CR?

Member Author

SC will list the allowed zones for a volume. If you have a set of fast storage nodes, you'd put them into zones (fast1, fast2, fast3), and the slow ones into (slow1, 2, 3). The fast storage class would then say to use zones fast1, fast2, ...
Default is to spread across all available zones (i.e., zones: "*")
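To make that concrete, a StorageClass restricted to the fast zones might look roughly like the fragment below. This is purely illustrative: the `zones` parameter name and the provisioner value are assumptions here; the real parameter definitions are under review in gluster-csi-driver#30.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: glusterfs-fast
provisioner: org.gluster.glusterfs  # assumed CSI driver name
parameters:
  # Hypothetical parameter: restrict volume placement to these
  # Gluster zones; default would be all zones ("*")
  zones: "fast1,fast2,fast3"
```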


```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: Node
Collaborator

@JohnStrunk The Node CR should have a param to define a certain resource or its capabilities. The current CR does not allow that.

Member Author

You mean reservations and limits? These should be part of the operator's manifest based on what we have qualified. This should not be exposed.

status:
# TBD operator state
# Possible states: (active | deleting | quiesced)
currentState: active
Collaborator

Isn't it good to match the state (e.g., Ready) of the Node CR to the OpenShift/Kube node state/status?

Member Author

This is more than just a boolean Ready|NotReady. I am open to using Conditions, however.

namespace: gluster
spec:
# Admin (or operator) sets desired state for the node.
targetState: active # (active | quiesce)
Collaborator

Why two states here? Can't the operator and admin coordinate on one state?

Collaborator

The distinction between desired state and actual state is important. Just because the admin wants to mark the node as active doesn't mean it's actually available immediately.

@humblec
Collaborator

humblec commented Oct 17, 2018

@JohnStrunk late to the party, but started my review. Will push more; however, one high-level question I have on our CRs is: how do our own topology-based CRs play with topology-based volume provisioning in Kube/OCP? Isn't that going to confuse admins?

# Min unallocated capacity in the zone
freeStorageMin: 1Ti
# Max unallocated capacity in the zone
freeStorageMax: 3Ti
Collaborator

Is the operator going to calculate and update the CR? If yes, is it done periodically, and how? Is it based on the number of volumes in the cluster? If yes, does the operator account for node addition and deletion?

Collaborator

It would create and update the Node CRs, yes. My guess is that this would be done in response to alerts from agents (possibly within the operator, but not necessarily) monitoring # of volumes and total raw storage availability.

Member Author

It would update the status:, not these fields. These fields determine limits for autoscaling of the cluster. The operator would probe GD2 to ensure unallocated device capacity remains between Min & Max. It would then add/remove nodes if it gets out of bounds.

@phlogistonjohn

Catching up on this discussion after a while of not having much time to look at it, excuse me if I reiterate anything previously discussed. I tried to read every update but I may not have fully absorbed everything.

After reading that page, I'm not sure there's a good answer that would work in all cases. Further, the UUID I was hoping to use... I have now learned it's actually pulled from the file system, so it's useless for raw block devices. Even items in by-id such as wwn or ata* aren't guaranteed to be there.

Yup. I found the same thing while researching options for a request for similar behavior natively baked into Heketi.

I think our only option is to let them specify whatever path they want and warn them that it must be stable.

Since the operator is not the end of the management chain, and we're going to have things like gd2/heketi/etc in the mix my thought is to simply (strongly) recommend persistent devices, but otherwise treat the given devices as a filter. As in "ok gd2/heketi feel free to absorb these devices into your management scheme" and those subsystems should be expected to map whatever currently is /dev/sdX in a persistent manner.

@phlogistonjohn

Since the operator needs to be able to manage upgrades (potentially a multi-step process walking across versions of server and csi pods), it needs to have a manifest knowing what transitions are legal (and tested). This precludes allowing the admin to specify arbitrary images & tags. The actual info will be a part of the manifest bundled w/ the operator. This is similar to how OLM works.

OK, I'm fairly ignorant of OLM. I can buy that strategy for most production scenarios but it strikes me as fairly rigid for developing with. I'm fine with leaving it out of the CRD but when it gets time to coding/implementing I hope that there's some mechanism where I can slot in my custom test build and have that get deployed without having to compile the operator from scratch.

@JohnStrunk
Member Author

Since the operator is not the end of the management chain, and we're going to have things like gd2/heketi/etc in the mix my thought is to simply (strongly) recommend persistent devices, but otherwise treat the given devices as a filter. As in "ok gd2/heketi feel free to absorb these devices into your management scheme" and those subsystems should be expected to map whatever currently is /dev/sdX in a persistent manner.

I can't figure out how to make this work. Consider:

  • Scenario 1:
    1. Set CR: use sda
    2. low level: sda is currently D1... will use D1
    3. Set CR: also use sdb
    4. low level: sdb is currently D2... will add D2
    5. reboot shuffles mapping: sda=D2, sdb=D1
    6. Set CR: remove sdb
    7. What does low level do? If they remember sdb was D2 when added, we have other problems...
  • Scenario 2: (with remembering original mapping)
    1. Set CR: use sda
    2. low level: sda is D1... will use D1
    3. reboot shuffles. sda is now D2 and sdb is D1
    4. Problem: Admin wants to add D2, but sda is already in the CR.
  • Scenario 3: Like 2, but w/o tracking, there is a similar problem on remove.

@JohnStrunk
Member Author

Since the operator needs to be able to manage upgrades (potentially a multi-step process walking across versions of server and csi pods), it needs to have a manifest knowing what transitions are legal (and tested). This precludes allowing the admin to specify arbitrary images & tags. The actual info will be a part of the manifest bundled w/ the operator. This is similar to how OLM works.

OK, I'm fairly ignorant of OLM. I can buy that strategy for most production scenarios but it strikes me as fairly rigid for developing with. I'm fine with leaving it out of the CRD but when it gets time to coding/implementing I hope that there's some mechanism where I can slot in my custom test build and have that get deployed without having to compile the operator from scratch.

My plan here is that there would be a text manifest in the container image. For development, you are free to adjust these however you see fit. A release, however, contains a fixed version of the manifest which points to a very specific set of images that define "version X".

@JohnStrunk
Member Author

@JohnStrunk late to the party, but started my review. Will push more; however, one high-level question I have on our CRs is: how do our own topology-based CRs play with topology-based volume provisioning in Kube/OCP? Isn't that going to confuse admins?

Unifying would be very cool. Do you have suggestions here?

The topology stuff I've seen seems very cloud provider specific. The current link is that (AWS example)... If you have a multi-AZ cluster, you would want to create 3 "zones" for your Gluster cluster, specifying nodeAffinity such that it would put each zone into a specific AWS AZ, and the StorageClass for that zone's PVs would have their affinities set to grab EBS volumes from the given zone. From my pov, you'd want to name your gluster zone "east-1a" if it's going to be running in AWS AZ "east-1a". I think that would make it fairly straightforward.

@JohnStrunk
Member Author

Progress update: We discussed the CRD in the operator meeting this morning.

  • Current scratch pad: https://gist.github.com/JohnStrunk/63fd6141a25ca959e49b9c99758eeff7
    • This has the current draft CRDs & no text. I was having trouble keeping up w/ both.
  • Changes:
All secrets have an optional namespace. I personally think their use should be discouraged, but I don't think it'll hurt anything
    • We now just have GlusterCluster and GlusterNode
    • Zones have been replaced w/ nodeTemplate (list) in the GC, and zone groupings are now just a text label
    • The template-based config that used to have manual and automatic has been combined. You can now just set a fixed number of nodes via nodes or you can use the other 4 thresholds to allow autosizing. maxStorage has been removed since it was redundant.
    • GlusterNodes can be created either by the operator as a result of the templates or manually by the admin.
      • templates must use PVs
      • Template-based GNs will have an annotation so we can tell them apart from admin-created. Template-based will have config changes reverted by the operator.
      • GN storage is now a list, where items are devices or PVCs. Yes, you can mix/match; you probably shouldn't
      • GNs that use device paths must:
        • Ensure the path is stable
        • Apply a nodeAffinity that matches exactly 1 node

I'm going to make another pass through the comments, then try and update the real doc.
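For illustration, the summarized changes above might combine into a GlusterNode CR along these lines. This is a sketch of the in-progress draft, not the final CRD; field names beyond those mentioned in the thread (`storage`, `desiredState`, the template annotation) are assumptions.

```yaml
apiVersion: "operator.gluster.org/v1alpha1"
kind: GlusterNode
metadata:
  name: my-node
  namespace: gluster
  annotations:
    # Present only on template-created nodes; the operator reverts
    # any admin change other than .spec.desiredState
    anthill.gluster.org/template: myTemplate
spec:
  desiredState: enabled
  # Mixed list of devices and PVCs (allowed, but mixing is discouraged)
  storage:
  - device: /dev/disk/by-id/wwn-0x5000c500a1b2c3d4  # must be a stable path
  - pvc: my-brick-pvc
  # Device-based nodes must pin to exactly one host (sketched, not full
  # nodeAffinity syntax)
  affinity:
    kubernetes.io/hostname: worker-1
```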

Signed-off-by: John Strunk <jstrunk@redhat.com>
@JohnStrunk
Member Author

I know you thought it'd never happen. 💀 But I managed to update this PR.

nodeTemplates: # (optional)
- name: myTemplate
# Zone is the "failure domain"
zone: my-zone # default is .nodeTemplates.name
Collaborator

I like this, very nice.

to automatically scale the Gluster cluster as required and to automatically
replace failed storage nodes.

Within this template, there is a `zone` tag that allows the nodes cretaed from
Collaborator

s/cretaed/created/

Member Author

done

of nodes by setting `nodes` to the desired value. The operator can also
dynamically size the template if, instead of setting `nodes`, the `minNodes`,
`maxNodes`, `freeStorageMin`, and `freeStorageMax` are configured. In this
case, the number of stroage nodes always remains between the min and max, and
Collaborator

s/stroage/storage/

Member Author

done

when scaling should be invoked. The template can have a fixed (constant) number
of nodes by setting `nodes` to the desired value. The operator can also
dynamically size the template if, instead of setting `nodes`, the `minNodes`,
`maxNodes`, `freeStorageMin`, and `freeStorageMax` are configured. In this
Collaborator

Not proposing a change, but just to clarify: Is the strategy for keeping available storage between freeStorageMin and freeStorageMax to spin nodes up and down as needed, and not to manipulate devices used by individual nodes?

Member Author

Yes, that is the intent.

# When this annotation is present, the admin may only modify
# .spec.desiredState or delete the CR. Any other change will be
# overwritten.
anthill.gluster.org/template: template-name
Collaborator

👍

# TBD operator state
# Possible states: (active | deleting | quiesced)
currentState: active
# Possible states: (enabled | deleting | disabled)
Collaborator

A question on the deleting state... any distinction between that state and the resource being marked for deletion (in kube)?

Member Author

They are related, but not quite the same.

An admin may trigger the deletion of a gluster node by deleting the corresponding GlusterNode CR. The operator will have a finalizer on the object, preventing the actual removal until it is properly drained and removed from the cluster. The currentState: deleting is intended to signal that the operator has noticed the request to delete (or decided to do it on its own), and has started the process of draining/waiting/removing from cluster. The CR would actually be deleted (finalizer removed) only after the node and its resources are truly gone since the CR is the only way the operator can track the pending operations.

corresponding pod would also be freed.
- By changing the `.spec.desiredState` of the Node, the administrator can notify
the operator (and by extension, other Gluster management layers) that the
particular node should be considered "in maintenance" (`disabled`)such that it
Collaborator

s/)such/) such/

Member Author

done

Signed-off-by: John Strunk <jstrunk@redhat.com>
@JohnStrunk
Member Author

Anyone else?

Collaborator

@jarrpa jarrpa left a comment

Sorry, for some reason I didn't get notified of the latest comments. LGTM!

@JohnStrunk JohnStrunk merged commit 4b8be53 into gluster:master Nov 2, 2018
@ghost ghost removed the in progress label Nov 2, 2018
@JohnStrunk JohnStrunk deleted the 3-crd-design branch December 19, 2018 14:55