@marxarelli
Contributor

Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, uses `unshare` to create a new cgroup and mount namespace for buildkitd and remounts `/sys/fs/cgroup` to restrict its view of the unified cgroup hierarchy. This ensures that its `init` cgroup and all cgroups managed by the OCI worker are kept beneath the root cgroup of the initial entrypoint process.

When buildkitd runs in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI worker spawns processes in cgroups outside the cgroup hierarchy that was created for the buildkitd container. This leads to incorrect resource accounting and enforcement, which in turn can cause OOM errors and CPU contention on the node.
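The steps described above can be sketched roughly as follows. This is a hypothetical illustration, not the script added by this PR: the function names `cgroup_v2_in_use` and `run_buildkitd` are invented here, cgroup v2 is detected by the presence of `cgroup.controllers` at the hierarchy root, and util-linux `unshare` and `mount` are assumed to be available in the image.

```sh
#!/bin/sh
# Hypothetical sketch; NOT the entrypoint script from this PR.
# Assumes util-linux unshare(1)/mount(8) and buildkitd on PATH.

# Return 0 if the directory $1 (default /sys/fs/cgroup) is the root of a
# cgroup v2 unified hierarchy, detected via its cgroup.controllers file.
cgroup_v2_in_use() {
  [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# Commands run inside the new namespaces: replace the inherited cgroup
# mount with a fresh cgroup2 mount, whose root is the root cgroup of the
# new cgroup namespace, then hand over to buildkitd.
REMOUNT_AND_EXEC='umount /sys/fs/cgroup && mount -t cgroup2 cgroup2 /sys/fs/cgroup && exec buildkitd "$@"'

# Start buildkitd, wrapping it in new cgroup and mount namespaces only
# when cgroup v2 is in use.
run_buildkitd() {
  if cgroup_v2_in_use; then
    exec unshare --cgroup --mount sh -c "$REMOUNT_AND_EXEC" sh "$@"
  fi
  exec buildkitd "$@"
}
```

A real entrypoint would end with `run_buildkitd "$@"`. The sketch falls back to a plain `exec buildkitd` when cgroup v2 is not detected, so the wrapper only changes behavior in the case the description targets.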

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note that this was developed as an alternative approach to #6343.

@marxarelli
Contributor Author

@tonistiigi this is the alternative approach I mentioned in #6343 (comment).

Note that I first tried to implement the namespace creation and remounting in buildkitd itself using calls to `unix.Unshare` and `unix.Mount`, but encountered some strange behavior: the main buildkitd process was placed in a new cgroup namespace, but for some reason buildkit-runc was not. It may have been that not all Go threads were moved into the new cgroup namespace; I'm not sure.

In any case, using `unshare` in the entrypoint seems less error-prone.
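One way to test the thread hypothesis above would be to compare the cgroup namespace of every thread in the process, since on Linux `unshare(2)` changes namespaces for the calling thread only. The helper below is a hypothetical sketch (the name `thread_cgroup_namespaces` and the overridable procfs root are invented for illustration); it was not used in this PR.

```sh
# Print each distinct cgroup namespace seen across the threads of PID $1,
# one per line. $2 is the procfs root (default /proc), overridable so the
# function can be exercised against a fake directory tree.
thread_cgroup_namespaces() {
  for link in "${2:-/proc}/$1"/task/*/ns/cgroup; do
    readlink "$link"
  done | sort -u
}
```

Running `thread_cgroup_namespaces "$(pgrep -n buildkitd)"` and seeing more than one line of output would indicate that the threads of the process are spread across different cgroup namespaces.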

@marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from e12f7af to 3cf93c3 on November 17, 2025 20:40

Introduce a new entrypoint script for the Linux image that, if cgroup v2
is in use, uses `unshare` to create a new cgroup and mount namespace for
buildkitd and remounts `/sys/fs/cgroup` to restrict its view of the
unified cgroup hierarchy. This ensures that its `init` cgroup and all
cgroups managed by the OCI worker are kept beneath the root cgroup of
the initial entrypoint process.

When buildkitd runs in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker spawns processes in cgroups outside the cgroup hierarchy that
was created for the buildkitd container. This leads to incorrect
resource accounting and enforcement, which in turn can cause OOM errors
and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note that this was developed as an alternative approach to moby#6343.

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <dduvall@wikimedia.org>

@marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from 3cf93c3 to 7a50ed7 on November 17, 2025 20:57