Want to surface the discussion which took place via emails, a couple of conference calls, and f2f discussions in this ticket.
The aim of this ticket is to discuss the fundamental differences OLM v1 introduces relative to OLM v0 when it comes to supporting and handling multi-tenant clusters. Specifically for large software vendors who have implemented a significant number of operators (like IBM, with 250+ operators) that can be deployed at the tenant-level scope (both operators & operands), we need to understand the migration strategy from OLM v0 to OLM v1.
The purpose of this document is to capture the terminology and enterprise use cases of IBM Cloud Pak customers with OLM-based operators.
Glossary
Cluster Admin - typically a member of the IT Infrastructure Ops team, responsible for providing the Kubernetes cluster infrastructure to the individual tenant teams.
Cluster Admins are responsible for:
- setting up and upgrading Kubernetes/OCP clusters and related infrastructure (registry servers, infrastructure monitoring & scanning tools, etc.)
- making sure that security policies are applied across all the clusters
- creating namespaces
- managing the tenant admins (CRUD of users and their roles)
Tenant - a team which is part of the customer's organisation and is independent from other teams in the same company.
Tenants, aka Lines of Business, are responsible for providing value via deployment and usage of business apps like IBM Cloud Paks. This aligns most closely with the Kubernetes Teams multi-tenancy use case.
Tenants are provided a set of Kubernetes namespaces; their users are granted namespace admin roles for those namespaces, plus roles to deploy operators into their own namespaces.
- TODO: There is a slight ambiguity here: if a single Line of Business has dev and test instances on the shared cluster, then we actually have 2 tenants
Two types of Tenant namespaces are relevant here:
- Control: Hosts Operators and has access to the Kubernetes API server.
- Data: May be colocated with the Control plane, but may also be further isolated. The data plane hosts the Operand CRs and does not typically have access to the Kubernetes API server.
Tenant Admin - A Kubernetes user that has permissions to administer Kubernetes resources within one or more of the Control Plane namespaces that collectively represent the tenant. These users do not have privileges to affect other tenants.
IBM Cloud Pak - a suite of logically related business applications which use operators for their deployment and lifecycle management. Each application might have 1 or more operators (typically one top-level operator and several nested operators).
Typically:
- One or more Lead Operators (represents the Cloud Pak)
- One or more Capability Operators (can be part of a Cloud Pak or standalone... e.g. ibm-mq)
- One or more Component Operators (typically a dependency and not useful by itself).
Separation of control-and-data - the deployment topology where a tenant's application is split into separate namespaces: one for the operators and one or more for their operands. Customers set up firewalls (network policies) to block traffic to the k8s API Server from the operand namespaces.
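As an illustration of the network-policy firewalling mentioned above, a default-deny egress policy in an operand (data-plane) namespace is a common way to cut workloads off from the Kubernetes API server, since the API server endpoint sits outside the pod network. This is a minimal sketch; the namespace name is hypothetical and real policies depend on the CNI and cluster layout.

```yaml
# Sketch: deny all egress from an operand (data-plane) namespace except
# in-namespace traffic and DNS. Namespace name is illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-api-access
  namespace: tenant-a-operands   # hypothetical data-plane namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector: {}        # allow traffic within the namespace
    - ports:
        - protocol: UDP
          port: 53               # allow DNS lookups
        - protocol: TCP
          port: 53
```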
Workload isolation - the deployment pattern where multiple tenants can deploy their applications like Cloud Paks (both operators and operands) independently from each other and manage their lifecycle independently from each other. It is acceptable to have multiple copies of the same operator at different versions, managed by different tenants.
- accepted limitations: given the nature of K8s, true multi-tenancy is not possible, thus we need to accept that control plane services (etcd, StorageClasses, etc.) and some cluster-level services (like Monitoring or Cert-manager) remain cluster-level services. Yet, they shall be as resilient to noisy neighbours as possible.
Relevant Kubernetes resources that are in scope of tenant namespace isolation:
- Namespaced objects
- NetworkPolicies
- DNS
- Access Controls
- Quotas
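As one example of the namespaced isolation primitives listed above, a per-tenant ResourceQuota caps what a tenant can consume in one of its namespaces. Namespace name and limits are illustrative, not a recommendation.

```yaml
# Sketch: a quota applied per tenant namespace; all values are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a-control    # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```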
Scenarios
Typically, tenants are provided access to one or more namespaces which are under their control. Tenants deploy operator(s) into their namespaces. A single tenant has a single instance of any operator. It is accepted that a single operator watches multiple namespaces, as long as those namespaces belong to the same tenant.
Workload isolation scenarios
Cloud Pak operators can be installed either in AllNamespaces mode (openshift-operators), meaning that there is just a single tenant on the cluster, or in OwnNamespace mode whenever there are more tenants on the cluster. It is expected to have multiple tenants, each of them running the same Cloud Pak, most probably each at a different version (aka dev/test deployments).
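For reference, in OLM v0 the OwnNamespace mode is selected via an OperatorGroup whose only target namespace is the operator's own namespace (names below are illustrative):

```yaml
# Sketch: OLM v0 OperatorGroup for OwnNamespace mode; names are illustrative.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: tenant-a-og
  namespace: tenant-a-control
spec:
  targetNamespaces:
    - tenant-a-control           # operator watches only its own namespace
```

Omitting `spec.targetNamespaces` entirely would instead select AllNamespaces mode.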
Operator dependencies
IBM Cloud Paks leverage two types of statically defined OLM operator dependencies:
- based on the GVK
- based on the package name (and version range)
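In OLM v0 these two dependency types are declared in the bundle's `metadata/dependencies.yaml`. The group, kind, and package names below are illustrative:

```yaml
# Sketch: OLM v0 bundle dependencies.yaml with both dependency types;
# group/kind/packageName values are illustrative.
dependencies:
  - type: olm.gvk                # dependency on a provided API (GVK)
    value:
      group: example.ibm.com
      kind: ExampleOperand
      version: v1
  - type: olm.package            # dependency on a package + version range
    value:
      packageName: ibm-example-operator
      version: ">=1.2.0 <2.0.0"
```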
IBM Cloud Paks also create Operators dynamically (via the IBM Operand Deployment Lifecycle Manager, ODLM), which enables auto-provisioning of Operators and Operands (CRs) on demand when required, typically for shared capabilities/components (like user identity and access management, or a common UI platform experience). With the Cloud Pak 3.0 architecture, these shared capabilities/components are deployed as individual instances per tenant.
CatalogSource management
Currently, CatalogSources for IBM Cloud Paks are deployed by the Cluster Admin as global catalogs in openshift-marketplace, but this leads to issues when CatalogSources are updated, causing uncontrolled operator upgrades across tenant namespaces. Mitigations include catalog source pinning, usage of manual approval mode, and exploration of private (namespaced) CatalogSources (each tenant having its own CatalogSource).
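A namespaced (per-tenant) CatalogSource in OLM v0 terms would look like the following; the image reference and names are illustrative, and pinning to a fixed tag or digest is what prevents uncontrolled catalog updates:

```yaml
# Sketch: a private, tenant-scoped catalog; image and names are illustrative.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: tenant-a-catalog
  namespace: tenant-a-control    # visible to this tenant only
spec:
  sourceType: grpc
  image: icr.io/example/tenant-catalog:1.2.3   # pinned catalog image
  displayName: Tenant A Catalog
```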
Proposal for OLM v1 discussion
Introduction of the API Bundles
De-couple the API Bundle from the Operator controller bundle. Have a semver-versioned API Bundle which is cluster-scoped and registers only CRDs and their conversion webhooks (if needed). The Operator controller bundle can be deployed either in AllNamespaces mode (openshift-operators) or into each of the tenant namespaces. It is acceptable for controller operators to be deployed into multiple namespaces, each at a different version. There shall be a way for the controller operator to define a compatibility version range on the API Bundle(s). The controller code itself would be responsible for making sure the individual CRs are properly structured (TBD whether validation webhooks are really required) and for reacting accordingly by providing proper .status updates.
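A purely hypothetical sketch of what such a split could look like; none of these kinds or fields exist in OLM today, they are invented here only to make the proposal concrete:

```yaml
# Hypothetical, illustrative only: no such API exists in OLM.
# A cluster-scoped, semver-versioned bundle carrying only CRDs
# and their conversion webhooks.
apiVersion: olm.example.io/v1alpha1
kind: APIBundle
metadata:
  name: ibm-example-apis
spec:
  version: 2.1.0
  crds:
    - examples.example.ibm.com
---
# The controller bundle, deployable per tenant namespace, declares a
# compatibility version range on the API Bundle it depends on.
apiVersion: olm.example.io/v1alpha1
kind: ControllerBundle
metadata:
  name: ibm-example-controller
  namespace: tenant-a-control    # or AllNamespaces mode
spec:
  apiBundles:
    - name: ibm-example-apis
      versionRange: ">=2.0.0 <3.0.0"
```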
The API Bundle shall provide backwards and forwards compatibility as much as possible - and ideally there shall be a tool / method for validating the CRD evolution.
Ideally there shall be some migration tool (part of operator-sdk) which would take an existing OLM v0 bundle and separate it into two OLM v1 bundles: one with the APIs (CRDs) and one with the rest of the code, properly defining the dependency relationship between them. Perhaps it could even be executed automatically for backwards compatibility with OLM v0 operators on OLM v1 OCP clusters.
RBAC
There shall be a way in OLM v1 to define the RBAC for a tenant-level operator, based on a prescriptive input (list of namespaces) which defines the topology of the given tenant (a sort of OperatorGroup, which defines the WATCH_NAMESPACES for the controller and the namespaces to create RBAC for). Something like the IBM Namespace Scope operator or the Oria operator. Whenever a tenant is just a single namespace, no topology definition shall be required - defaults should be assumed and RBAC properly created based on the operator controller bundle metadata. Tenant admins shall delegate the CRUD of RBAC to OLM v1, based on the operator metadata and topology definition. Such RBAC shall be easily auditable - via a few kubectl commands.
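A hypothetical topology definition for such a mechanism might look like the following; the kind and fields are invented for illustration (loosely modelled on what an OperatorGroup or the IBM Namespace Scope operator expresses today):

```yaml
# Hypothetical, illustrative only: an OperatorGroup-like object describing
# a tenant's namespace topology, from which OLM derives WATCH_NAMESPACES
# and the namespaces in which to create RBAC.
apiVersion: olm.example.io/v1alpha1
kind: TenantTopology
metadata:
  name: tenant-a
  namespace: tenant-a-control
spec:
  watchNamespaces:               # fed to controllers as WATCH_NAMESPACES
    - tenant-a-control
    - tenant-a-operands
  rbacNamespaces:                # where Roles/RoleBindings get created
    - tenant-a-control
    - tenant-a-operands
```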
Customers would deploy the actual controller operators, which should automatically load the related API Bundles (TBD: compatibility checking).
Dependency management
Dependencies (TBD whether we need them) shall deploy dependent operators in the same mode and namespace as the requesting operator, AND configure RBAC using the same topology definition.
Dependency resolution should be executed in the scope of the tenant (one or more namespaces).
Catalog and Subscription management
CatalogSource shall be tenant-level.
Update of CatalogSource shall not impact other tenants.
There shall still be the concept of a Subscription which allows subscribing to fixes. There shall be some equivalent of approval mode, but perhaps not working (like in OLM v0) at the level of a namespace (like InstallPlan), but rather at the level of a tenant (set of namespaces).
There shall be a way to preview the available upgrade and what's involved with the upgrade (i.e. whether any additional cluster-level dependencies are introduced).
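For reference, this is how OLM v0 expresses manual approval today, per namespace, via a Subscription (names are illustrative); the proposal above is to lift this approval from the namespace to the tenant level:

```yaml
# Sketch: OLM v0 Subscription with manual approval; names are illustrative.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: ibm-example-operator
  namespace: tenant-a-control
spec:
  channel: stable
  name: ibm-example-operator     # package name in the catalog
  source: tenant-a-catalog
  sourceNamespace: tenant-a-control
  installPlanApproval: Manual    # upgrades wait for explicit InstallPlan approval
```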
Related Info
- https://github.ibm.com/IBMPrivateCloud/BedrockServices/blob/master/ArchitectureSpecifications/CloudPak3.0-FoundationalServices.md
- https://github.ibm.com/IBMPrivateCloud/BedrockServices/blob/master/OverallRequiredCapabilities/Multi-tenantDetails.md
- https://kubernetes.io/docs/concepts/security/multi-tenancy/
TODOs
- TBD what shall be the definition of Channel - channels group all the fixes and updates to a semver-compatible version range
- TBD relationship between Channels and API Bundles, if any
- TBD schema evolution checking / validation

