Skip to content

Conversation

@erasernoob
Copy link
Contributor

  1. Add SecurityContext field to let user manually handle the capabilities of container, and also support privilege-mode. Details see https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
  2. Remove pause sandbox start from existing bundle path directly. Instead pull image registry.k8s.io/pause:3.9 to run pause.

@erasernoob
Copy link
Contributor Author

cc @tcytan @ichAB

@genedna genedna requested review from Copilot, ichAB and tcytan December 16, 2025 16:48
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for Kubernetes-style security contexts and refactors the pod sandbox initialization logic. It enables users to configure container capabilities and privileged mode, while changing from direct pause bundle usage to pulling the registry.k8s.io/pause:3.9 image.

Key Changes:

  • Added SecurityContext, env, volume_mounts, command, and working_dir fields to ContainerSpec to support Kubernetes security configuration
  • Introduced OCISpecGenerator to centralize OCI specification generation with capability management
  • Modified run_pod_sandbox to pull and configure the pause container image instead of using a pre-existing bundle

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
project/common/src/lib.rs Added new security-related structs (SecurityContext, Capabilities, EnvVar, VolumeMount) and extended ContainerSpec
project/common/Cargo.toml Added libcontainer dependency for capability types
project/rkl/src/oci/mod.rs New module implementing OCISpecGenerator for centralized OCI spec generation with capability handling
project/rkl/src/task.rs Refactored run_pod_sandbox to pull pause image and generate spec; refactored create_container to use OCISpecGenerator; removed manual capability functions
project/rkl/src/commands/container/mod.rs Updated to use new capability defaults and renamed mount conversion function for clarity
project/rkl/src/{main.rs,lib.rs} Exposed new oci module
project/rkl/tests/* Updated test container specs with new optional fields
project/rks/tests/* Updated test pod specs with new optional fields
project/libscheduler/tests/xline_test.rs Updated test pod specs with new optional fields

if p {
set_all_capabilities(&mut capabilities);
}
} else if let Some(caps) = &ctx.capabilities {
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for handling privileged mode and capabilities uses else if, which means custom capabilities are ignored when privileged is Some(false). The privileged check should only apply when privileged is Some(true), and capabilities should be processed independently. Consider restructuring: check privileged == Some(true) first to set all capabilities, then apply custom capability changes regardless of the privileged flag value.

Suggested change
} else if let Some(caps) = &ctx.capabilities {
}
if let Some(caps) = &ctx.capabilities {

Copilot uses AI. Check for mistakes.
erasernoob and others added 5 commits December 17, 2025 12:57
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 10 comments.

Comment on lines +1 to +300
use lazy_static::lazy_static;
use libcontainer::oci_spec::runtime::{
Capability, LinuxBuilder, LinuxCapabilities, LinuxNamespaceBuilder, LinuxNamespaceType,
ProcessBuilder, Spec,
};

use crate::cri::cri_api::ContainerConfig;
use anyhow::{Result, anyhow};
use common::ContainerSpec;
use oci_spec::runtime::{LinuxNamespace, RootBuilder};
use std::collections::HashSet;

// Default supported capabilities (from docker's implementation)
lazy_static! {
pub static ref DEFAULT_CAPABILITIES: Vec<Capability> = {
vec![
Capability::Chown,
Capability::DacOverride,
Capability::Fsetid,
Capability::Fowner,
Capability::Mknod,
Capability::NetRaw,
Capability::Setgid,
Capability::Setuid,
Capability::Setfcap,
Capability::Setpcap,
Capability::NetBindService,
Capability::SysChroot,
Capability::Kill,
Capability::AuditWrite,
]
};
}

pub struct OCISpecGenerator {
inner_spec: Spec,
container_config: ContainerConfig,
container_spec: ContainerSpec,
pause_pid: Option<i32>,
}

impl OCISpecGenerator {
pub fn new(config: &ContainerConfig, spec: &ContainerSpec, pause_pid: Option<i32>) -> Self {
Self {
inner_spec: Spec::default(),
container_config: config.clone(),
container_spec: spec.clone(),
pause_pid,
}
}

fn get_capabilities(&self) -> Result<LinuxCapabilities> {
let mut capabilities = new_linux_capabilities_with_defaults();

// Handle the passed capabilities from users
if let Some(ctx) = &self.container_spec.security_context {
// FIXME: Handle the privileged Mode
// Note: Privileged mode typically requires additional configurations, such as:
// - Disable Seccomp filters.
// - Disable AppArmor/SELinux profiles.
// - Allow access to all /dev devices.
// give all the available capabilities
if let Some(p) = ctx.privileged
&& p
{
set_all_capabilities(&mut capabilities);
}

if let Some(caps) = &ctx.capabilities {
caps.add.iter().for_each(|cap| {
add_cap(*cap, &mut capabilities);
});

caps.drop.iter().for_each(|cap| {
drop_cap(cap, &mut capabilities);
});
}
}

Ok(capabilities)
}

pub fn process_set(&mut self) -> Result<()> {
let mut process = ProcessBuilder::default().build()?;

// Choose from container_spec(first) or container_config(from container's image)
let arg = if self.container_spec.args.is_empty() {
self.container_config.args.clone()
} else {
self.container_spec.args.clone()
};
process.set_args(Some(arg));

let capabilities = self.get_capabilities()?;
process.set_capabilities(Some(capabilities));

self.inner_spec.set_process(Some(process));
Ok(())
}

pub fn generate(mut self) -> Result<Spec> {
let root = RootBuilder::default().readonly(false).build()?;
self.inner_spec.set_root(Some(root));

let namespaces = self
.create_container_namespaces()
.map_err(|e| anyhow!("failed to setup Linux namespace: {e}"))?;

self.process_set()
.map_err(|e| anyhow!("failed to setup oci process: {e}"))?;

let mut linux_builder = LinuxBuilder::default().namespaces(namespaces);

if let Some(linux_config) = &self.container_config.linux
&& let Some(resources) = &linux_config.resources
{
linux_builder = linux_builder.resources(&resources.clone());
}

let linux = linux_builder.build()?;
self.inner_spec.set_linux(Some(linux));

Ok(self.inner_spec)
}

fn create_container_namespaces(&self) -> Result<Vec<LinuxNamespace>> {
let mut namespaces = Vec::new();

// Mount and Cgroup will always be specific
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Mount)
.build()?,
);
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Cgroup)
.build()?,
);

if let Some(pid) = self.pause_pid {
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Pid)
.path(format!("/proc/{pid}/ns/pid"))
.build()?,
);
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Network)
.path(format!("/proc/{pid}/ns/net"))
.build()?,
);
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Ipc)
.path(format!("/proc/{pid}/ns/ipc"))
.build()?,
);
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Uts)
.path(format!("/proc/{pid}/ns/uts"))
.build()?,
);
} else {
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Pid)
.build()?,
);
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Network)
.build()?,
);
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Ipc)
.build()?,
);
namespaces.push(
LinuxNamespaceBuilder::default()
.typ(LinuxNamespaceType::Uts)
.build()?,
);
}

Ok(namespaces)
}
}

macro_rules! drop_capability_from_set {
($caps:expr, $getter:ident, $setter:ident, $cap:expr) => {
let mut set = $caps.$getter().as_ref().cloned().unwrap_or_default();

set.remove($cap);

$caps.$setter(Some(set));
};
}

macro_rules! add_capability_to_set {
($caps:expr, $getter:ident, $setter:ident, $cap:expr) => {
let mut set = $caps.$getter().as_ref().cloned().unwrap_or_default();

set.insert($cap.clone());

$caps.$setter(Some(set));
};
}

fn get_default_set() -> Option<HashSet<Capability>> {
Some(DEFAULT_CAPABILITIES.iter().cloned().collect())
}

pub fn new_linux_capabilities_with_defaults() -> LinuxCapabilities {
let mut capabilities = LinuxCapabilities::default();

let default_set = get_default_set();
capabilities.set_bounding(default_set.clone());
capabilities.set_effective(default_set.clone());
capabilities.set_inheritable(default_set.clone());
capabilities.set_permitted(default_set.clone());
capabilities.set_ambient(default_set);

capabilities
}

pub fn drop_cap(cap: &Capability, capabilities: &mut LinuxCapabilities) {
drop_capability_from_set!(capabilities, bounding, set_bounding, cap);
drop_capability_from_set!(capabilities, effective, set_effective, cap);
drop_capability_from_set!(capabilities, inheritable, set_inheritable, cap);
drop_capability_from_set!(capabilities, permitted, set_permitted, cap);
drop_capability_from_set!(capabilities, ambient, set_ambient, cap);
}

pub fn add_cap(cap: Capability, capabilities: &mut LinuxCapabilities) {
add_capability_to_set!(capabilities, bounding, set_bounding, cap);
add_capability_to_set!(capabilities, effective, set_effective, cap);
add_capability_to_set!(capabilities, inheritable, set_inheritable, cap);
add_capability_to_set!(capabilities, permitted, set_permitted, cap);
add_capability_to_set!(capabilities, ambient, set_ambient, cap);
}

pub fn set_all_capabilities(capabilities: &mut LinuxCapabilities) {
let all_caps = get_all_capabilities();
capabilities.set_bounding(Some(all_caps.clone()));
capabilities.set_effective(Some(all_caps.clone()));
capabilities.set_inheritable(Some(all_caps.clone()));
capabilities.set_permitted(Some(all_caps.clone()));
capabilities.set_ambient(Some(all_caps));
}
fn get_all_capabilities() -> HashSet<Capability> {
[
Capability::Chown,
Capability::DacOverride,
Capability::DacReadSearch,
Capability::Fowner,
Capability::Fsetid,
Capability::Kill,
Capability::Setgid,
Capability::Setuid,
Capability::Setpcap,
Capability::LinuxImmutable,
Capability::NetBindService,
Capability::NetBroadcast,
Capability::NetAdmin,
Capability::NetRaw,
Capability::IpcLock,
Capability::IpcOwner,
Capability::SysModule,
Capability::SysRawio,
Capability::SysChroot,
Capability::SysPtrace,
Capability::SysPacct,
Capability::SysAdmin,
Capability::SysBoot,
Capability::SysNice,
Capability::SysResource,
Capability::SysTime,
Capability::SysTtyConfig,
Capability::Mknod,
Capability::Lease,
Capability::AuditWrite,
Capability::AuditControl,
Capability::Setfcap,
Capability::MacOverride,
Capability::MacAdmin,
Capability::Syslog,
Capability::WakeAlarm,
Capability::BlockSuspend,
Capability::AuditRead,
Capability::Perfmon,
Capability::Bpf,
Capability::CheckpointRestore,
]
.into_iter()
.collect::<HashSet<_>>()
}
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new oci module lacks test coverage. Given that this module handles critical security functionality including capability management and privileged mode, it should include comprehensive tests covering: default capabilities, adding/dropping specific capabilities, privileged mode, namespace creation with and without pause_pid, and OCI spec generation.

Copilot uses AI. Check for mistakes.
let sandbox_spec = ContainerSpec {
name: "sandbox".to_string(),
// FIXME: SHOULD define a const variable image name
image: "pause:3.9".to_string(),
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pause image name is "pause:3.9" but the PR description mentions "registry.k8s.io/pause:3.9". The current image name is missing the registry prefix which may cause the image pull to fail or pull from an unexpected source. Consider using the fully qualified image name with registry prefix.

Copilot uses AI. Check for mistakes.
Comment on lines +228 to +241
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct SecurityContext {
#[serde(rename = "runAsUser")]
pub run_as_user: Option<i64>,

#[serde(rename = "runAsGroup")]
pub run_as_group: Option<i64>,

#[serde(default)]
pub privileged: Option<bool>,

#[serde(rename = "allowPrivilegeEscalation", default)]
pub allow_privilege_escalation: Option<bool>,

Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SecurityContext struct lacks documentation explaining its fields and their usage. Given that it controls container security settings with fields like privileged mode and capabilities, it should include doc comments describing each field's purpose, valid values, and security implications.

Suggested change
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct SecurityContext {
#[serde(rename = "runAsUser")]
pub run_as_user: Option<i64>,
#[serde(rename = "runAsGroup")]
pub run_as_group: Option<i64>,
#[serde(default)]
pub privileged: Option<bool>,
#[serde(rename = "allowPrivilegeEscalation", default)]
pub allow_privilege_escalation: Option<bool>,
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
/// Container-level security options controlling user/group identity,
/// privilege escalation and Linux capabilities.
///
/// This roughly mirrors the Kubernetes `SecurityContext` semantics and is
/// serialized/deserialized for use in API objects. All fields are optional:
/// if a value is `None`, the runtime-specific defaults or image metadata
/// typically apply.
pub struct SecurityContext {
/// Override for the effective user ID (UID) inside the container.
///
/// When set, the container process will attempt to run as this UID
/// instead of the image's default user. Using `0` (root) grants full
/// privileges inside the container and increases the risk of container
/// breakout or privilege escalation on the host; prefer non-root UIDs
/// wherever possible.
#[serde(rename = "runAsUser")]
pub run_as_user: Option<i64>,
/// Override for the effective primary group ID (GID) inside the container.
///
/// When set, the container process will attempt to run with this GID as
/// its primary group. Similar to `run_as_user`, using a highly privileged
/// group (for example, a group with write access to sensitive paths)
/// can weaken isolation.
#[serde(rename = "runAsGroup")]
pub run_as_group: Option<i64>,
/// Run the container in "privileged" mode when `Some(true)`.
///
/// Privileged containers are granted almost full access to the host,
/// including device access and most kernel capabilities. This effectively
/// disables many isolation mechanisms and should only be enabled for
/// carefully audited workloads. When `None`, the runtime's default
/// (usually non-privileged) is used.
#[serde(default)]
pub privileged: Option<bool>,
/// Allow the container process to gain more privileges than its parent.
///
/// When `Some(false)`, common escalation paths such as setuid/setgid
/// binaries or acquiring new file-based capabilities are blocked,
/// providing stronger confinement. When `Some(true)` (or if left to the
/// runtime default), processes may increase their privileges, which can
/// be necessary for some legacy applications but reduces overall safety.
#[serde(rename = "allowPrivilegeEscalation", default)]
pub allow_privilege_escalation: Option<bool>,
/// Linux capabilities configuration applied to the container.
///
/// Capabilities provide fine-grained control over privileged operations.
/// Use this field to add or drop specific capabilities instead of running
/// the container as fully privileged; only grant the minimal set required
/// by the workload.

Copilot uses AI. Check for mistakes.
Comment on lines +255 to +260
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct EnvVar {
pub name: String,

#[serde(default)]
pub value: Option<String>,
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EnvVar struct lacks documentation. It should include doc comments explaining its purpose and fields, particularly the relationship between value and the commented-out value_from field.

Suggested change
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct EnvVar {
pub name: String,
#[serde(default)]
pub value: Option<String>,
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
/// Describes a single environment variable for a container, following Kubernetes-style semantics.
///
/// An `EnvVar` typically has a `name` and either a literal `value` or a "value from" source
/// (such as a Secret, ConfigMap, or pod field). In this implementation only literal values
/// are supported via [`value`]; the `value_from` mechanism is currently commented out.
///
/// When/if `value_from` is enabled, only one of `value` or `value_from` should be set for
/// a given `EnvVar`, mirroring Kubernetes' `EnvVar` contract.
pub struct EnvVar {
/// Name of the environment variable. This must be unique within the container's
/// environment and should be a valid shell variable identifier (e.g. `MY_VAR`).
pub name: String,
/// Literal value of the environment variable.
///
/// If `None`, the variable may be omitted or populated by other mechanisms (such as
/// defaults or a future `value_from` source). When `value_from` is supported, this
/// field is intended to be mutually exclusive with `value_from`.
#[serde(default)]
pub value: Option<String>,
// Source for the environment variable's value, such as a Secret, ConfigMap, or pod field.
// Only one of `value` or `value_from` should be set when this is enabled.

Copilot uses AI. Check for mistakes.
Comment on lines +282 to +291
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct VolumeMount {
pub name: String,

#[serde(rename = "mountPath")]
pub mount_path: String,

#[serde(rename = "readOnly", default)]
pub read_only: Option<bool>,

Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VolumeMount struct lacks documentation. It should include doc comments explaining its purpose and fields, particularly what sub_path is used for and how mount_path is interpreted.

Suggested change
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
pub struct VolumeMount {
pub name: String,
#[serde(rename = "mountPath")]
pub mount_path: String,
#[serde(rename = "readOnly", default)]
pub read_only: Option<bool>,
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq, Eq)]
/// Describes how a named volume is mounted into a container filesystem.
///
/// This is conceptually similar to Kubernetes' `VolumeMount`. The `name` field
/// must match a defined volume, and the volume (or a sub-path within it) is
/// exposed inside the container at `mount_path`.
pub struct VolumeMount {
/// Name of the volume to mount, matching a volume defined in the pod spec.
pub name: String,
/// Absolute path inside the container where the volume (or `sub_path`) is mounted.
///
/// This path is interpreted in the container's filesystem namespace, not on
/// the host. The path must be absolute (e.g. `/data`, `/var/log/app`).
#[serde(rename = "mountPath")]
pub mount_path: String,
/// Whether the container should see this mount as read-only.
///
/// If `None`, the mount defaults to read-write, subject to the underlying
/// volume's capabilities and permissions.
#[serde(rename = "readOnly", default)]
pub read_only: Option<bool>,
/// Optional sub-path within the volume to mount instead of the volume root.
///
/// When set, only this path inside the referenced volume is mounted at
/// `mount_path`. If omitted, the entire volume root is mounted.

Copilot uses AI. Check for mistakes.
Comment on lines +230 to +240
#[serde(rename = "runAsUser")]
pub run_as_user: Option<i64>,

#[serde(rename = "runAsGroup")]
pub run_as_group: Option<i64>,

#[serde(default)]
pub privileged: Option<bool>,

#[serde(rename = "allowPrivilegeEscalation", default)]
pub allow_privilege_escalation: Option<bool>,
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SecurityContext fields run_as_user, run_as_group, and allow_privilege_escalation are defined but never used in the OCISpecGenerator implementation. These fields should either be implemented in the OCI spec generation logic or removed if they're planned for future work.

Copilot uses AI. Check for mistakes.
Comment on lines +322 to +333

#[serde(default)]
pub env: Option<Vec<EnvVar>>,

#[serde(rename = "volumeMounts", default)]
pub volume_mounts: Option<Vec<VolumeMount>>,

#[serde(default)]
pub command: Option<Vec<String>>,

#[serde(rename = "workingDir", default)]
pub working_dir: Option<String>,
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added ContainerSpec fields (env, volume_mounts, command, working_dir) are not implemented in the OCISpecGenerator. These fields are added to the API but don't affect container behavior, which could lead to user confusion. They should either be implemented or removed if planned for future work.

Suggested change
#[serde(default)]
pub env: Option<Vec<EnvVar>>,
#[serde(rename = "volumeMounts", default)]
pub volume_mounts: Option<Vec<VolumeMount>>,
#[serde(default)]
pub command: Option<Vec<String>>,
#[serde(rename = "workingDir", default)]
pub working_dir: Option<String>,

Copilot uses AI. Check for mistakes.
if let Some(mut config_b) = config_builder {
config = config_b
.container_spec(sandbox_spec.clone())?
.clone()
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chained calls include an unnecessary .clone() on line 155. The config_b.container_spec(...).build() already returns the owned value, so the .clone() before .build() is redundant and creates an extra allocation.

Suggested change
.clone()

Copilot uses AI. Check for mistakes.
if let Some(linux_config) = &self.container_config.linux
&& let Some(resources) = &linux_config.resources
{
linux_builder = linux_builder.resources(&resources.clone());
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .clone() on resources is unnecessary. The builder pattern for LinuxBuilder likely accepts a reference, so you can pass &resources directly instead of &resources.clone().

Suggested change
linux_builder = linux_builder.resources(&resources.clone());
linux_builder = linux_builder.resources(resources);

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +50
pub struct OCISpecGenerator {
inner_spec: Spec,
container_config: ContainerConfig,
container_spec: ContainerSpec,
pause_pid: Option<i32>,
}

impl OCISpecGenerator {
pub fn new(config: &ContainerConfig, spec: &ContainerSpec, pause_pid: Option<i32>) -> Self {
Self {
inner_spec: Spec::default(),
container_config: config.clone(),
container_spec: spec.clone(),
pause_pid,
}
}
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OCISpecGenerator struct and its public methods lack documentation. Given that this is a new module handling critical OCI specification generation with security implications, it should include comprehensive doc comments explaining its purpose, usage examples, and the meaning of each field (especially pause_pid).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant