feat(chart): add NetworkPolicy templates for all Slurm components #166

Open
giuliocalzo wants to merge 5 commits into SlinkyProject:main from
giuliocalzo:feat/helm-networkpolicy

Conversation

@giuliocalzo
Contributor

@giuliocalzo giuliocalzo commented Apr 2, 2026

Summary

Add opt-in Kubernetes NetworkPolicy Helm templates for each Slurm component and the slurm-operator itself, providing network-level isolation between all components.

Component Traffic Diagram

flowchart LR
    subgraph kube [Kubernetes Control Plane]
        kubeapi["Kube API Server\n(443)"]
    end
    subgraph operator [Slurm Operator]
        op["Operator\n(metrics:8080)"]
        wh["Webhook\n(server:9443)"]
    end
    subgraph slurm [Slurm Cluster]
        ctrl["Controller\n(slurmctld:6817)"]
        worker["NodeSet\n(slurmd:6818, srun:*, ssh:22)"]
        acct["Accounting\n(slurmdbd:6819)"]
        rest["RestApi\n(slurmrestd:6820)"]
        login["LoginSet\n(ssh:22)"]
    end
    db["External DB\n(3306)"]
    users["Users / Clients"]

    worker <-->|"6817 / 6818"| ctrl
    worker <-->|"all TCP (srun)"| worker
    acct <-->|"6817 / 6819"| ctrl
    rest -->|"6817"| ctrl
    login -->|"6817"| ctrl
    login -->|"6819 (sacct)"| acct
    login -->|"all TCP (srun/ssh)"| worker
    op -->|"6820"| rest
    op -->|"443"| kubeapi
    wh -->|"443"| kubeapi
    kubeapi -->|"9443"| wh
    acct -->|"3306"| db
    users -->|"6820"| rest
    users -->|"22"| login

Charts

helm/slurm (Slurm workloads):

  • controller-netpol.yaml -- slurmctld ingress/egress
  • nodeset-netpol.yaml -- one NetworkPolicy per enabled nodeset instance; all TCP from slurmd (srun) and login (srun/ssh)
  • accounting-netpol.yaml -- slurmdbd ingress/egress (conditional on accounting.enabled)
  • restapi-netpol.yaml -- slurmrestd ingress/egress
  • loginset-netpol.yaml -- one NetworkPolicy per enabled loginset instance; all TCP egress to slurmd (srun/ssh), conditional accounting egress (sacct/sacctmgr)
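As a rough illustration of the per-instance shape described above, a rendered NodeSet policy might look something like the following. This is only a sketch, not the template output from this PR: the policy name, the instance values `slinky` and `slurm`, and the label values `slurmd`, `login`, and `slurmctld` are assumptions made for the example.

```yaml
# Hypothetical rendering for a single nodeset instance.
# Names, instance values, and label values are assumed, not taken from the chart.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: slurm-worker-slinky            # assumed name
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: slurmd
      app.kubernetes.io/instance: slinky
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # slurmctld -> slurmd on the slurmd port
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: slurmctld
              app.kubernetes.io/instance: slurm
      ports:
        - protocol: TCP
          port: 6818
    # srun: all TCP from slurmd and login pods
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: slurmd
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: login
      ports:
        - protocol: TCP                # no port given, so every TCP port matches
  egress:
    # slurmd -> slurmctld
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: slurmctld
              app.kubernetes.io/instance: slurm
      ports:
        - protocol: TCP
          port: 6817
    # DNS resolution
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

A port entry that carries only `protocol` matches every port of that protocol, which is how an "all TCP" rule can be expressed without listing a range.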

helm/slurm-operator (operator infrastructure):

  • operator-netpol.yaml -- egress to K8s API (443) and slurmrestd (6820), metrics ingress
  • webhook-netpol.yaml -- ingress from K8s API (9443), egress to K8s API (443)
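For orientation, the operator policy summarized above could be sketched roughly as follows. The policy name and label values are assumptions, and leaving `from`/`to` unset on the metrics and API server rules is a simplification chosen for the example; the actual templates in this PR may scope them differently.

```yaml
# Hypothetical operator policy; name and label values are assumed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: slurm-operator                 # assumed name
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: slurm-operator   # assumed label value
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # metrics scrape; no "from" means any source may reach this port
    - ports:
        - protocol: TCP
          port: 8080
  egress:
    # Kubernetes API server; no "to" means any destination on 443
    - ports:
        - protocol: TCP
          port: 443
    # slurmrestd in any namespace
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              app.kubernetes.io/name: slurmrestd
      ports:
        - protocol: TCP
          port: 6820
    # DNS resolution
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Placing `namespaceSelector` and `podSelector` in the same list element combines them with AND, so the rule matches slurmrestd pods in every namespace rather than every pod in every namespace.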

Features

  • Global toggle: networkPolicy.enabled: false (disabled by default, in both charts)
  • Per-component toggle: each component can be individually disabled
    • Singleton components (controller, restapi, accounting, operator, webhook): flag under <component>.networkPolicy.enabled
    • Map components (nodesets, loginsets): per-instance networkPolicy.enabled inside each map entry
  • Per-instance policies: nodesets and loginsets each generate one NetworkPolicy per enabled instance, scoped via app.kubernetes.io/instance
  • Instance-scoped selectors: all from/to rules for singleton components include app.kubernetes.io/instance for precise targeting
  • srun support: all TCP allowed between slurmd<->slurmd and login->slurmd to support srun ephemeral ports (constrainable via SrunPortRange in slurm.conf)
  • sacct/sacctmgr: loginset egress to slurmdbd (TCP 6819) conditional on accounting.enabled
  • Extra rules: extraIngress / extraEgress at global (networkPolicy.*), per-component (<component>.networkPolicy.*), and per-instance (inside each nodeset/loginset entry) levels
  • Conditional logic: accounting egress on controller/loginset only when accounting.enabled
  • DNS: all policies allow UDP/TCP 53 egress for DNS resolution
  • Cross-namespace: operator egress to slurmrestd uses namespaceSelector: {} to support different namespaces
  • Pod selectors: use app.kubernetes.io/name and app.kubernetes.io/instance labels applied by the operator

Files

helm/slurm chart:

  • helm/slurm/templates/networkpolicy/controller-netpol.yaml (new)
  • helm/slurm/templates/networkpolicy/nodeset-netpol.yaml (new)
  • helm/slurm/templates/networkpolicy/accounting-netpol.yaml (new)
  • helm/slurm/templates/networkpolicy/restapi-netpol.yaml (new)
  • helm/slurm/templates/networkpolicy/loginset-netpol.yaml (new)
  • helm/slurm/tests/networkpolicy_test.yaml (new - 34 test cases)
  • helm/slurm/tests/__snapshot__/networkpolicy_test.yaml.snap (new - 5 snapshots)
  • helm/slurm/values.yaml (modified)

helm/slurm-operator chart:

  • helm/slurm-operator/templates/networkpolicy/operator-netpol.yaml (new)
  • helm/slurm-operator/templates/networkpolicy/webhook-netpol.yaml (new)
  • helm/slurm-operator/tests/networkpolicy_test.yaml (new - 15 test cases)
  • helm/slurm-operator/tests/__snapshot__/networkpolicy_test.yaml.snap (new - 2 snapshots)
  • helm/slurm-operator/values.yaml (modified)

Test plan

  • helm unittest --strict helm/slurm passes (129 tests, 12 suites, 10 snapshots)
  • helm unittest --strict helm/slurm-operator passes (82 tests, 12 suites, 12 snapshots)
  • helm template with networkPolicy.enabled=true renders correct policies for each chart
  • Default values render 0 policies (disabled by default)
  • Per-component and per-instance disable flags verified
  • Conditional accounting egress on controller and loginset verified
  • All-TCP srun rules (slurmd<->slurmd, login->slurmd) verified
  • Instance-scoped from/to selectors for singletons verified
  • Global, per-component, and per-instance extraIngress/extraEgress verified
  • Cross-namespace operator-to-slurmrestd egress verified

Add opt-in Kubernetes NetworkPolicy Helm templates for controller,
nodeset, accounting, restapi, and loginset. Disabled by default via
`networkPolicy.enabled: false` with per-component toggles and support
for extra ingress/egress rules at both global and component level.
Generate one NetworkPolicy per enabled loginset/nodeset map entry
instead of a single blanket policy, scoped via app.kubernetes.io/instance.
Move networkPolicy config (enabled, extraIngress, extraEgress) into each
map entry and remove the now-redundant top-level nodeset/loginset flags.
@SkylerMalinowski SkylerMalinowski self-requested a review April 3, 2026 14:27
@SkylerMalinowski SkylerMalinowski self-assigned this Apr 3, 2026
Contributor

@SkylerMalinowski SkylerMalinowski left a comment

I notice that you are missing rules for LoginSet -> Accounting (via sacct/sacctmgr), LoginSet -> NodeSet (via srun), and NodeSet -> NodeSet (via srun).

See https://slurm.schedmd.com/overview.html#architecture for details.

Also, across both charts, networkPolicy=true probably should not be the default. srun is problematic here: by default it uses all available ports, but it can be constrained by SrunPortRange.
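To illustrate the SrunPortRange point: if slurm.conf constrained srun to a fixed range, an ingress rule could match that range with `endPort` instead of opening all TCP. The range below and the `login` label value are example assumptions, and `endPort` only takes effect when the cluster's network plugin supports it.

```yaml
# Fragment of a NetworkPolicy spec.
# Assumes slurm.conf sets SrunPortRange=60001-63000 (example range).
ingress:
  - from:
      - podSelector:
          matchLabels:
            app.kubernetes.io/name: login   # assumed label value
    ports:
      - protocol: TCP
        port: 60001        # first port of the range
        endPort: 63000     # last port of the range, inclusive
```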

Comment thread helm/slurm/templates/networkpolicy/accounting-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/accounting-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/accounting-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/controller-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/controller-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/nodeset-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/nodeset-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/nodeset-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/restapi-netpol.yaml
Comment thread helm/slurm/templates/networkpolicy/restapi-netpol.yaml
@giuliocalzo
Contributor Author

Good points. I will update it to add the instance and the namespace.

@giuliocalzo
Contributor Author

I notice that you are missing rules for LoginSet -> Accounting (via sacct/sacctmgr), LoginSet -> NodeSet (via srun), and NodeSet -> NodeSet (via srun).

See https://slurm.schedmd.com/overview.html#architecture for details.

Also, across both charts, networkPolicy=true probably should not be the default without opening all ports for srun or constraining them by default with SrunPortRange.

I missed this completely; let me fix it.

@SkylerMalinowski
Contributor

This also should handle cases where the ports for Slurm, ssh, mariadb are not default.

- LoginSet: add all-TCP egress to slurmd (srun), conditional accounting
  egress (sacct/sacctmgr on TCP 6819)
- NodeSet: allow all TCP from slurmd and login pods (srun ephemeral
  ports), all TCP egress to slurmd
- All templates: add app.kubernetes.io/instance to from/to selectors
  for singleton components (slurmctld, slurmdbd, slurmrestd)
…tors

Singletons (slurmctld, slurmdbd, slurmrestd) use slurm.fullname as
instance. Map components (slurmd, login) iterate over the map to
generate per-instance from/to entries with the CR name as instance.
Accounting ingress now also allows from loginset pods (sacct/sacctmgr).
@giuliocalzo
Contributor Author

Good morning @SkylerMalinowski, I've updated the label selectors to match the required component and instance. I've also sorted out the SrunPortRange.
