feat(chart): add NetworkPolicy templates for all Slurm components #166

giuliocalzo wants to merge 5 commits into SlinkyProject:main
## Conversation
Add opt-in Kubernetes NetworkPolicy Helm templates for controller, nodeset, accounting, restapi, and loginset. Disabled by default via `networkPolicy.enabled: false` with per-component toggles and support for extra ingress/egress rules at both global and component level.
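As an illustrative sketch of what that configuration could look like in `values.yaml` (key names follow the description above; the defaults, the example egress rule, and the exact structure are assumptions, not the final chart):

```yaml
# Sketch only -- key names follow the PR description, not necessarily the chart
networkPolicy:
  enabled: false      # global opt-in switch, off by default
  extraIngress: []    # chart-wide extra ingress rules
  extraEgress:        # chart-wide extra egress rules, e.g. to an LDAP server
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
      ports:
        - protocol: TCP
          port: 389

controller:
  networkPolicy:
    enabled: true     # per-component toggle
    extraIngress: []
    extraEgress: []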
Generate one NetworkPolicy per enabled loginset/nodeset map entry instead of a single blanket policy, scoped via `app.kubernetes.io/instance`. Move the networkPolicy config (`enabled`, `extraIngress`, `extraEgress`) into each map entry and remove the now-redundant top-level nodeset/loginset flags.
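Per-entry configuration might then look like this (the entry names `debug` and `gpu` are hypothetical):

```yaml
nodesets:
  debug:
    networkPolicy:
      enabled: true      # rendered as its own NetworkPolicy,
      extraIngress: []   # scoped via app.kubernetes.io/instance
      extraEgress: []
  gpu:
    networkPolicy:
      enabled: false     # no policy rendered for this instance
```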
I notice that you are missing rules for LoginSet -> Accounting (via sacct/sacctmgr), LoginSet -> NodeSet (via srun), and NodeSet -> NodeSet (via srun).
See https://slurm.schedmd.com/overview.html#architecture for details.
Also, across both charts, `networkPolicy=true` probably should not be the default. srun is problematic here: by default it uses all available ports, but it can be constrained with `SrunPortRange`.
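For context, srun's ephemeral ports can be pinned down in slurm.conf, which would let a policy use a bounded port range instead of all TCP. A sketch (the port range and the `login` label value are illustrative; `endPort` requires NetworkPolicy support in Kubernetes >= 1.22):

```yaml
# slurm.conf (illustrative):
#   SrunPortRange=60001-63000
#
# Matching NetworkPolicy ingress rule sketch for nodeset pods:
ingress:
  - from:
      - podSelector:
          matchLabels:
            app.kubernetes.io/name: login
    ports:
      - protocol: TCP
        port: 60001
        endPort: 63000
```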
Good points. I will update it, adding the instance and the namespace.

I've missed this completely, let me fix it.
This should also handle cases where the ports for Slurm, SSH, and MariaDB are not the defaults.
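One way to address non-default ports is to template them from values rather than hard-coding them. A sketch (the `accounting.port` key is hypothetical):

```yaml
# Sketch of an accounting-netpol.yaml fragment with a templated port
egress:
  - to:
      - podSelector:
          matchLabels:
            app.kubernetes.io/name: slurmdbd
    ports:
      - protocol: TCP
        port: {{ .Values.accounting.port | default 6819 }}
```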
- LoginSet: add all-TCP egress to slurmd (srun), conditional accounting egress (sacct/sacctmgr on TCP 6819)
- NodeSet: allow all TCP from slurmd and login pods (srun ephemeral ports), all TCP egress to slurmd
- All templates: add `app.kubernetes.io/instance` to from/to selectors for singleton components (slurmctld, slurmdbd, slurmrestd)
Singletons (slurmctld, slurmdbd, slurmrestd) use `slurm.fullname` as the instance. Map components (slurmd, login) iterate over the map to generate per-instance from/to entries with the CR name as the instance. Accounting ingress now also allows traffic from loginset pods (sacct/sacctmgr).
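A rendered `from` entry for a singleton might then look roughly like this (the label values and release name are illustrative):

```yaml
- namespaceSelector: {}   # allow traffic from any namespace
  podSelector:
    matchLabels:
      app.kubernetes.io/name: slurmctld
      app.kubernetes.io/instance: my-release   # slurm.fullname for singletons
```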
Good morning @SkylerMalinowski, I've updated the label selectors to match the required component and instance, and I've also sorted out the `SrunPortRange`.
## Summary
Add opt-in Kubernetes NetworkPolicy Helm templates for each Slurm component and the slurm-operator itself, providing network-level isolation between all components.
## Component Traffic Diagram

```mermaid
flowchart LR
  subgraph kube [Kubernetes Control Plane]
    kubeapi["Kube API Server\n(443)"]
  end
  subgraph operator [Slurm Operator]
    op["Operator\n(metrics:8080)"]
    wh["Webhook\n(server:9443)"]
  end
  subgraph slurm [Slurm Cluster]
    ctrl["Controller\n(slurmctld:6817)"]
    worker["NodeSet\n(slurmd:6818, srun:*, ssh:22)"]
    acct["Accounting\n(slurmdbd:6819)"]
    rest["RestApi\n(slurmrestd:6820)"]
    login["LoginSet\n(ssh:22)"]
  end
  db["External DB\n(3306)"]
  users["Users / Clients"]

  worker <-->|"6817 / 6818"| ctrl
  worker <-->|"all TCP (srun)"| worker
  acct <-->|"6817 / 6819"| ctrl
  rest -->|"6817"| ctrl
  login -->|"6817"| ctrl
  login -->|"6819 (sacct)"| acct
  login -->|"all TCP (srun/ssh)"| worker
  op -->|"6820"| rest
  op -->|"443"| kubeapi
  wh -->|"443"| kubeapi
  kubeapi -->|"9443"| wh
  acct -->|"3306"| db
  users -->|"6820"| rest
  users -->|"22"| login
```

## Charts
`helm/slurm` (Slurm workloads):

- `controller-netpol.yaml` -- slurmctld ingress/egress
- `nodeset-netpol.yaml` -- one NetworkPolicy per enabled nodeset instance; all TCP from slurmd (srun) and login (srun/ssh)
- `accounting-netpol.yaml` -- slurmdbd ingress/egress (conditional on `accounting.enabled`)
- `restapi-netpol.yaml` -- slurmrestd ingress/egress
- `loginset-netpol.yaml` -- one NetworkPolicy per enabled loginset instance; all TCP egress to slurmd (srun/ssh), conditional accounting egress (sacct/sacctmgr)

`helm/slurm-operator` (operator infrastructure):

- `operator-netpol.yaml` -- egress to K8s API (443) and slurmrestd (6820), metrics ingress
- `webhook-netpol.yaml` -- ingress from K8s API (9443), egress to K8s API (443)

## Features
- `networkPolicy.enabled: false` (disabled by default, in both charts)
- Singleton components (`controller`, `restapi`, `accounting`, `operator`, `webhook`): flag under `<component>.networkPolicy.enabled`
- Map components (`nodesets`, `loginsets`): per-instance `networkPolicy.enabled` inside each map entry
- Per-instance policies scoped via `app.kubernetes.io/instance`
- `from`/`to` rules for singleton components include `app.kubernetes.io/instance` for precise targeting
- All-TCP rules for srun traffic (constrainable via `SrunPortRange` in slurm.conf)
- Accounting policy conditional on `accounting.enabled`
- `extraIngress`/`extraEgress` at global (`networkPolicy.*`), per-component (`<component>.networkPolicy.*`), and per-instance (inside each nodeset/loginset entry) levels
- LoginSet accounting egress conditional on `accounting.enabled`
- `namespaceSelector: {}` to support different namespaces
- Selectors match the `app.kubernetes.io/name` and `app.kubernetes.io/instance` labels applied by the operator

## Files
`helm/slurm` chart:
- `helm/slurm/templates/networkpolicy/controller-netpol.yaml` (new)
- `helm/slurm/templates/networkpolicy/nodeset-netpol.yaml` (new)
- `helm/slurm/templates/networkpolicy/accounting-netpol.yaml` (new)
- `helm/slurm/templates/networkpolicy/restapi-netpol.yaml` (new)
- `helm/slurm/templates/networkpolicy/loginset-netpol.yaml` (new)
- `helm/slurm/tests/networkpolicy_test.yaml` (new - 34 test cases)
- `helm/slurm/tests/__snapshot__/networkpolicy_test.yaml.snap` (new - 5 snapshots)
- `helm/slurm/values.yaml` (modified)

`helm/slurm-operator` chart:
- `helm/slurm-operator/templates/networkpolicy/operator-netpol.yaml` (new)
- `helm/slurm-operator/templates/networkpolicy/webhook-netpol.yaml` (new)
- `helm/slurm-operator/tests/networkpolicy_test.yaml` (new - 15 test cases)
- `helm/slurm-operator/tests/__snapshot__/networkpolicy_test.yaml.snap` (new - 2 snapshots)
- `helm/slurm-operator/values.yaml` (modified)

## Test plan
- `helm unittest --strict helm/slurm` passes (129 tests, 12 suites, 10 snapshots)
- `helm unittest --strict helm/slurm-operator` passes (82 tests, 12 suites, 12 snapshots)
- `helm template` with `networkPolicy.enabled=true` renders the expected policies for each chart
- `from`/`to` selectors for singletons verified