Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 19 additions & 6 deletions src/nspawn/nspawn-mount.c
Original file line number Diff line number Diff line change
Expand Up @@ -440,6 +440,7 @@ int tmpfs_patch_options(
int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
_cleanup_free_ char *top = NULL, *full = NULL;;
unsigned long extra_flags = 0;
bool is_mount_point;
int r;

top = path_join(dest, "/sys");
Expand All @@ -449,12 +450,9 @@ int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
r = path_is_mount_point(top);
if (r < 0)
return log_error_errno(r, "Failed to determine if '%s' is a mountpoint: %m", top);
if (r == 0) {
/* If this is not a mount point yet, then mount a tmpfs there */
r = mount_nofollow_verbose(LOG_ERR, "tmpfs", top, "tmpfs", MS_NOSUID|MS_NOEXEC|MS_NODEV, "mode=0555" TMPFS_LIMITS_SYS);
if (r < 0)
return r;
} else {
is_mount_point = r > 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: nit: bool is_mount_point is declared at the top but not initialized until several lines later. It could be declared at the point of initialization for clarity:

bool is_mount_point = r > 0;


if (is_mount_point) {
r = path_is_fs_type(top, SYSFS_MAGIC);
if (r < 0)
return log_error_errno(r, "Failed to determine filesystem type of %s: %m", top);
Expand All @@ -467,6 +465,21 @@ int mount_sysfs(const char *dest, MountSettingsMask mount_settings) {
return 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: suggestion: When is_mount_point is true AND the existing mount is already sysfs (checked in the block above), the function returns 0 at line 465 before reaching the new !FLAGS_SET(mount_settings, MOUNT_APPLY_APIVFS_RO) check. This means if /sys is already mounted as sysfs and we're in the user namespace path, we return without ensuring it's read-only. The comment at line 468 says "we still have to mount it read-only" but this early-return path doesn't enforce that. Is this intentional?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: must-fix: When /sys is already a mount point with sysfs (the is_mount_point && is_sysfs path), the function returns 0 here before reaching the new !FLAGS_SET(mount_settings, MOUNT_APPLY_APIVFS_RO) check at line 469. In the user namespace case MOUNT_APPLY_APIVFS_RO is cleared, so the intent is to mount sysfs read-only directly — but this early return bypasses that entirely. The container ends up with whatever mount flags sysfs already had, contradicting the comment at line 468 that says "we still have to mount it read-only". The early return should either be moved below the new userns check, or this path should explicitly handle the read-only remount.

}

/* When running in a user namespace, to enable mounting sysfs in nested containers, we cannot
* overmount it, so we mount it as is. While the user namespace won't be able to write to sysfs, we
* still have to mount it read-only as that's part of the container interface and various units
* conditionalize themselves based on whether /sys is mounted read-only or not. */
if (!FLAGS_SET(mount_settings, MOUNT_APPLY_APIVFS_RO))
return mount_nofollow_verbose(LOG_ERR, "sysfs", top, "sysfs",
MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, NULL);

if (!is_mount_point) {
/* If this is not a mount point yet, then mount a tmpfs there */
r = mount_nofollow_verbose(LOG_ERR, "tmpfs", top, "tmpfs", MS_NOSUID|MS_NOEXEC|MS_NODEV, "mode=0555" TMPFS_LIMITS_SYS);
if (r < 0)
return r;
}

full = path_join(top, "/full");
if (!full)
return log_oom();
Expand Down
44 changes: 38 additions & 6 deletions src/nspawn/nspawn.c
Original file line number Diff line number Diff line change
Expand Up @@ -1642,6 +1642,19 @@ static int parse_argv(int argc, char *argv[]) {
return 1;
}

static int container_in_userns(void) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: suggestion: container_in_userns() calls namespace_is_init() (a stat() syscall on /proc/ns/user) every time it is invoked, and it is called in 5 different places (verify_arguments, setup_boot_id, setup_kmsg, run_container, and indirectly via mount_sysfs). Since the result cannot change during the process lifetime, consider computing this once in verify_arguments() and encoding the result into a flag (e.g., in arg_mount_settings), rather than re-checking the namespace status at each call site.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: suggestion: container_in_userns() calls namespace_is_init() (a procfs stat) on every invocation, and it is called from 4 separate sites (verify_arguments, setup_boot_id, setup_kmsg, run_container). Since the user namespace status cannot change during the process lifetime, consider computing this once in verify_arguments() and storing the result in a flag (e.g., extending arg_mount_settings or a dedicated static variable) to avoid redundant syscalls and the repeated error-handling boilerplate at each call site.

int r;

if (arg_userns_mode != USER_NAMESPACE_NO)
return true;

r = namespace_is_init(NAMESPACE_USER);
if (r < 0 && !IN_SET(r, -EBADR, -ENOSYS))
return log_error_errno(r, "Failed to check if in initial user namespace: %m");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: suggestion: When namespace_is_init() returns -EBADR or -ENOSYS, the error is silently swallowed and the function falls through to return r == 0, which evaluates to false ("not in a user namespace"). The -ENOSYS case (procfs not mounted / not available) could indicate a constrained environment where assuming "not in a user namespace" may be incorrect. Consider whether these error cases should default to "assume we're in a user namespace" for safety, or at minimum log a debug/warning message so the fallback behavior is observable.


return r == 0;
}

static int verify_arguments(void) {
int r;

Expand All @@ -1654,6 +1667,15 @@ static int verify_arguments(void) {

SET_FLAG(arg_mount_settings, MOUNT_USE_USERNS, arg_userns_mode != USER_NAMESPACE_NO);

/* When running in a user namespace the kernel will protect procfs/sysfs for us, so there's no need
* to mount them read-only or mask individual files. This applies both when we allocate a user
* namespace ourselves, and when nspawn is invoked from within an existing user namespace. */
r = container_in_userns();
if (r < 0)
return r;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: suggestion: Clearing MOUNT_APPLY_APIVFS_RO disables not just read-only sysfs/procfs mounts but also all PROC_INACCESSIBLE_REG and PROC_READ_ONLY overmounts in mount_all() — entries like /proc/kallsyms, /proc/kcore, /proc/keys, /proc/sysrq-trigger, /proc/timer_list, and directories like /proc/acpi, /proc/bus, /proc/scsi are left exposed. While the kernel does restrict access to most of these in user namespaces, some entries like /proc/kallsyms (depending on kptr_restrict sysctl) may still leak kernel address information. The commit message and comments should explicitly acknowledge that these masking overmounts are also being removed, not just the read-only flags.

if (r > 0)
arg_mount_settings &= ~MOUNT_APPLY_APIVFS_RO;

if (arg_private_network)
SET_FLAG(arg_mount_settings, MOUNT_APPLY_APIVFS_NETNS, arg_private_network);

Expand Down Expand Up @@ -1735,9 +1757,6 @@ static int verify_arguments(void) {
if (arg_userns_mode != USER_NAMESPACE_NO && (arg_mount_settings & MOUNT_APPLY_APIVFS_NETNS) && !arg_private_network)
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Invalid namespacing settings. Mounting sysfs with --private-users requires --private-network.");

if (arg_userns_mode != USER_NAMESPACE_NO && !(arg_mount_settings & MOUNT_APPLY_APIVFS_RO))
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Cannot combine --private-users with read-write API VFS mounts.");

if (arg_expose_ports && !arg_private_network)
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "Cannot use --port= without private networking.");

Expand Down Expand Up @@ -2130,6 +2149,10 @@ static int setup_boot_id(void) {
const char *to;
int r;

r = container_in_userns();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: must-fix: setup_boot_id() skips generating a unique boot ID when in a user namespace, meaning the container will inherit the host's boot ID. This is a functional regression — containers sharing the host boot ID can confuse systemd-journald and other services that use boot ID to distinguish boots. If bind-mounting over /proc/sys/kernel/random/boot_id is not possible in a user namespace, this should at minimum be documented with a comment explaining why skipping is acceptable, or an alternative mechanism should be used.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: suggestion: In both setup_boot_id() (here) and setup_kmsg() (line 2565), the pattern if (r != 0) return r; returns true (1) on the success-skip path from functions that conventionally return 0 on success or negative on error. While the callers only check r < 0, propagating a positive value from a setup function is non-idiomatic. Consider separating the error and skip cases:

r = container_in_userns();
if (r < 0)
        return r;
if (r > 0)
        return 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: must-fix: setup_boot_id() uses if (r != 0) return r; which returns true (1) when container_in_userns() indicates a user namespace. Functions like setup_boot_id() follow the 0-on-success / negative-on-error convention. The current callers only check r < 0 so this works by accident, but returning a positive value is non-idiomatic and fragile — any future caller checking r > 0 or r != 0 would misinterpret the result. Please split the check:

r = container_in_userns();
if (r < 0)
        return r;
if (r > 0)
        return 0;

if (r != 0)
return r;

/* Generate a new randomized boot ID, so that each boot-up of the container gets a new one */

r = tempfn_random_child("/run", "proc-sys-kernel-random-boot-id", &path);
Expand Down Expand Up @@ -2539,6 +2562,10 @@ static int setup_kmsg(int fd_inner_socket) {

assert(fd_inner_socket >= 0);

r = container_in_userns();
if (r != 0)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: must-fix: Same issue as setup_boot_id(): setup_kmsg() returns 1 (true) via if (r != 0) return r; on the user-namespace skip path. Should be:

r = container_in_userns();
if (r < 0)
        return r;
if (r > 0)
        return 0;

return r;

BLOCK_WITH_UMASK(0000);

/* We create the kmsg FIFO as a temporary file in /run, but immediately delete it after bind mounting it to
Expand Down Expand Up @@ -5793,9 +5820,14 @@ static int run_container(
(void) sd_event_add_signal(event, NULL, SIGCHLD, on_sigchld, pid);

/* Retrieve the kmsg fifo allocated by inner child */
fd_kmsg_fifo = receive_one_fd(fd_inner_socket_pair[0], 0);
if (fd_kmsg_fifo < 0)
return log_error_errno(fd_kmsg_fifo, "Failed to receive kmsg fifo from inner child: %m");
r = container_in_userns();
if (r < 0)
return r;
if (r == 0) {
fd_kmsg_fifo = receive_one_fd(fd_inner_socket_pair[0], 0);
if (fd_kmsg_fifo < 0)
return log_error_errno(fd_kmsg_fifo, "Failed to receive kmsg fifo from inner child: %m");
}

if (arg_expose_ports) {
r = expose_port_watch_rtnl(event, fd_inner_socket_pair[0], on_address_change, expose_args, &rtnl);
Expand Down
Loading