dockerfile: run buildkitd within a cgroup namespace for cgroup v2 #6368

marxarelli · 2025-11-17T20:18:38Z

Introduce a new entrypoint script for the Linux image. When cgroup v2 is in use, the script uses unshare to create a new cgroup namespace and mount namespace for buildkitd, then remounts /sys/fs/cgroup to restrict buildkitd's view of the unified cgroup hierarchy. This ensures that its init cgroup and all cgroups managed by the OCI worker are kept beneath the root cgroup of the initial entrypoint process.
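As a rough illustration of the approach, here is a minimal sketch. It is not the script from this PR: the function names `is_cgroup_v2` and `run_in_cgroupns` are hypothetical, and it assumes the util-linux `unshare` (with `--cgroup` and `--mount`) and enough privilege to unmount and remount `/sys/fs/cgroup`.

```shell
#!/bin/sh
# Hypothetical sketch only; names and structure are illustrative and are
# not taken from the actual entrypoint script in this PR.

# cgroup v2 exposes a cgroup.controllers file at the root of the unified
# hierarchy; cgroup v1 does not. The root path can be overridden via $1.
is_cgroup_v2() {
    [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# Run the given command in new cgroup and mount namespaces with
# /sys/fs/cgroup remounted, or run it directly when cgroup v2 is not in use.
run_in_cgroupns() {
    if is_cgroup_v2; then
        # Assumption: util-linux unshare and a privileged container.
        # The remount makes the visible hierarchy start at the root of the
        # new cgroup namespace.
        exec unshare --cgroup --mount sh -c '
            umount /sys/fs/cgroup &&
            mount -t cgroup2 cgroup2 /sys/fs/cgroup &&
            exec "$@"
        ' sh "$@"
    fi
    exec "$@"
}
```

An entrypoint built this way would end with something like `run_in_cgroupns buildkitd "$@"`, so that buildkitd replaces the shell in either branch.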

When buildkitd runs in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP), the OCI worker spawns processes in cgroups that are outside of the cgroup hierarchy created for the buildkitd container. This leads to incorrect resource accounting and enforcement, which in turn can cause OOM errors and CPU contention on the node.

Example behavior without this change:

root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}

Example behavior with this change:

root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
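The difference between the two outputs can also be checked mechanically from the node. The following is a small sketch with hypothetical helper names (`cgroup_v2_path`, `is_beneath`); it only parses the `0::` entry that cgroup v2 writes to `/proc/<pid>/cgroup` and compares path prefixes.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of this PR.

# Print the cgroup v2 path from a /proc/<pid>/cgroup file (the "0::" entry).
cgroup_v2_path() {
    sed -n 's/^0:://p' "$1"
}

# Succeed if cgroup path $1 is equal to or nested beneath cgroup path $2.
is_beneath() {
    case "$1" in
        "$2"|"${2%/}"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_beneath "$(cgroup_v2_path /proc/$(pgrep -n some-build-process)/cgroup)" "$container_cgroup"` would be expected to fail without this change and succeed with it, where `$container_cgroup` is an assumed variable holding the container's cgroup path as seen from the node.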

Note: this was developed as an alternative approach to #6343.

marxarelli · 2025-11-17T20:27:53Z

@tonistiigi this is the alternative approach I mentioned in #6343 (comment).

Note that I first tried to implement the namespace creation and remounting in buildkitd itself using calls to unix.Unshare and unix.Mount, but encountered some strange behavior: the main buildkitd process was placed in a new cgroup namespace, but for some reason buildkit-runc was not. It may have been that not all Go threads were moved into the new cgroup namespace; I'm not sure.

In any case, using unshare in the entrypoint seems less error-prone.

Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, creates a new cgroup and mount namespace for buildkitd within a new entrypoint using `unshare` and remounts `/sys/fs/cgroup` to restrict its view of the unified cgroup hierarchy. This will ensure its `init` cgroup and all OCI worker managed cgroups are kept beneath the root cgroup of the initial entrypoint process.

When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was created for the buildkitd container, leading to incorrect resource accounting and enforcement which in turn can cause OOM errors and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note this was developed as an alternative approach to moby#6343

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <dduvall@wikimedia.org>

github-actions bot added the area/project label Nov 17, 2025

github-actions bot assigned marxarelli Nov 17, 2025

marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from e12f7af to 3cf93c3 Compare November 17, 2025 20:40

marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from 3cf93c3 to 7a50ed7 Compare November 17, 2025 20:57
