Commit 8887d7c

fix(sandbox): harden seccomp filter to block dangerous syscalls (#740)
1 parent b56f830 commit 8887d7c

3 files changed

Lines changed: 248 additions & 9 deletions

architecture/sandbox.md

Lines changed: 38 additions & 8 deletions
@@ -24,7 +24,7 @@ All paths are relative to `crates/openshell-sandbox/src/`.
2424
| `sandbox/mod.rs` | Platform abstraction -- dispatches to Linux or no-op |
2525
| `sandbox/linux/mod.rs` | Linux composition: Landlock then seccomp |
2626
| `sandbox/linux/landlock.rs` | Filesystem isolation via Landlock LSM (ABI V1) |
27-
| `sandbox/linux/seccomp.rs` | Syscall filtering via BPF on `SYS_socket` |
27+
| `sandbox/linux/seccomp.rs` | Syscall filtering via BPF: socket domain blocks, dangerous syscall blocks, conditional flag blocks |
2828
| `bypass_monitor.rs` | Background `/dev/kmsg` reader for iptables bypass detection events |
2929
| `sandbox/linux/netns.rs` | Network namespace creation, veth pair setup, bypass detection iptables rules, cleanup on drop |
3030
| `l7/mod.rs` | L7 types (`L7Protocol`, `TlsMode`, `EnforcementMode`, `L7EndpointConfig`), config parsing, validation, access preset expansion, deprecated `tls` value handling |
@@ -451,22 +451,52 @@ Kernel-level error behavior (e.g., Landlock ABI unavailable) depends on `Landloc
451451

452452
**File:** `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`
453453

454-
Seccomp blocks socket creation for specific address families. The filter targets a single syscall (`SYS_socket`) and inspects argument 0 (the domain).
455-
456-
**Always blocked** (regardless of network mode):
457-
- `AF_NETLINK`, `AF_PACKET`, `AF_BLUETOOTH`, `AF_VSOCK`
458-
459-
**Additionally blocked in `Block` mode** (no proxy):
460-
- `AF_INET`, `AF_INET6`
454+
Seccomp provides three layers of syscall restriction: socket domain blocks, unconditional syscall blocks, and conditional syscall blocks. The filter uses a default-allow policy (`SeccompAction::Allow`) with targeted rules that return `Errno(EPERM)`.
461455

462456
**Skipped entirely** in `Allow` mode.
463457

464458
Setup:
465459
1. `prctl(PR_SET_NO_NEW_PRIVS, 1)` -- required before seccomp
466460
2. `seccompiler::apply_filter()` with default action `Allow` and per-rule action `Errno(EPERM)`
467461

462+
#### Socket domain blocks
463+
464+
| Domain | Always blocked | Additionally blocked in Block mode |
465+
|--------|:-:|:-:|
466+
| `AF_PACKET` | Yes | |
467+
| `AF_BLUETOOTH` | Yes | |
468+
| `AF_VSOCK` | Yes | |
469+
| `AF_INET` | | Yes |
470+
| `AF_INET6` | | Yes |
471+
| `AF_NETLINK` | | Yes |
472+
468473
In `Proxy` mode, `AF_INET`/`AF_INET6` are allowed because the sandboxed process needs to connect to the proxy over the veth pair. The network namespace ensures it can only reach the proxy's IP (`10.200.0.1`).
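The mode-dependent domain list in the table above can be sketched as a small pure function. This is an illustration only: the `Mode` enum is a hypothetical stand-in for the crate's `NetworkMode`, and the address-family numbers are the Linux values hard-coded so the sketch does not need the `libc` crate.

```rust
/// Hypothetical stand-in for the crate's `NetworkMode` (assumption: the
/// three variants named in this document).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Mode {
    Allow,
    Proxy,
    Block,
}

// Linux address-family numbers, hard-coded to avoid a `libc` dependency.
pub const AF_INET: i32 = 2;
pub const AF_INET6: i32 = 10;
pub const AF_NETLINK: i32 = 16;
pub const AF_PACKET: i32 = 17;
pub const AF_BLUETOOTH: i32 = 31;
pub const AF_VSOCK: i32 = 40;

/// Returns the socket domains rejected in `mode`, following the table above.
/// `Allow` mode skips seccomp entirely, so nothing is blocked there.
pub fn blocked_domains(mode: Mode) -> Vec<i32> {
    if mode == Mode::Allow {
        return Vec::new();
    }
    // Blocked in every mode where the filter is installed.
    let mut blocked = vec![AF_PACKET, AF_BLUETOOTH, AF_VSOCK];
    if mode == Mode::Block {
        // No proxy to reach, so IP and netlink sockets are rejected too.
        blocked.extend_from_slice(&[AF_INET, AF_INET6, AF_NETLINK]);
    }
    blocked
}
```

Under these assumptions, `blocked_domains(Mode::Proxy)` leaves `AF_INET` and `AF_INET6` usable so the process can reach the proxy, while `blocked_domains(Mode::Block)` rejects them.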
469474

475+
#### Unconditional syscall blocks
476+
477+
These syscalls are blocked entirely (EPERM for any invocation):
478+
479+
| Syscall | Reason |
480+
|---------|--------|
481+
| `memfd_create` | Fileless binary execution bypasses Landlock filesystem restrictions |
482+
| `ptrace` | Cross-process memory inspection and code injection |
483+
| `bpf` | Kernel BPF program loading |
484+
| `process_vm_readv` | Cross-process memory read |
485+
| `io_uring_setup` | Async I/O subsystem with extensive CVE history |
486+
| `mount` | Filesystem mount could subvert Landlock or overlay writable paths |
487+
488+
#### Conditional syscall blocks
489+
490+
These syscalls are only blocked when specific flag patterns are present:
491+
492+
| Syscall | Condition | Reason |
493+
|---------|-----------|--------|
494+
| `execveat` | `AT_EMPTY_PATH` flag set (arg4) | Fileless execution from an anonymous fd |
495+
| `unshare` | `CLONE_NEWUSER` flag set (arg0) | User namespace creation enables privilege escalation |
496+
| `seccomp` | operation == `SECCOMP_SET_MODE_FILTER` (arg0) | Prevents sandboxed code from replacing the active filter |
497+
498+
Conditional blocks use `MaskedEq` for flag checks (bit-test) and `Eq` for exact-value matches. This allows normal use of these syscalls while blocking the dangerous flag combinations.
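The difference between the two comparison operators can be shown with two small helper functions. This is a sketch of the comparison semantics only, assuming a 32-bit (`Dword`) argument length; `masked_eq` and `eq_dword` are hypothetical helper names, not seccompiler functions.

```rust
/// Models a masked-equality check on the low 32 bits of an argument:
/// true when every bit of `mask` is set, whatever the other bits are.
pub fn masked_eq(arg: u64, mask: u64) -> bool {
    let arg32 = arg as u32;
    let mask32 = mask as u32;
    (arg32 & mask32) == mask32
}

/// Models an exact-equality check on the low 32 bits of an argument.
pub fn eq_dword(arg: u64, value: u64) -> bool {
    (arg as u32) == (value as u32)
}

// Linux constant values used in the examples below.
pub const AT_EMPTY_PATH: u64 = 0x1000;
pub const AT_SYMLINK_NOFOLLOW: u64 = 0x100;
pub const SECCOMP_SET_MODE_STRICT: u64 = 0;
pub const SECCOMP_SET_MODE_FILTER: u64 = 1;
```

A flag argument is a bit set, so `masked_eq(AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH)` is true while an exact comparison against `AT_EMPTY_PATH` would miss that combination. The `seccomp` operation argument is an enumerated value rather than a bit set, which is why an exact comparison fits there.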
499+
470500
### Network namespace isolation
471501

472502
**File:** `crates/openshell-sandbox/src/sandbox/linux/netns.rs`

architecture/security-policy.md

Lines changed: 28 additions & 1 deletion
@@ -850,6 +850,10 @@ The response includes an `X-OpenShell-Policy` header and `Connection: close`. Se
850850

851851
## Seccomp Filter Details
852852

853+
The seccomp filter uses a default-allow policy (`SeccompAction::Allow`) with targeted rules that return `EPERM`. It provides three layers of protection: socket domain blocks, unconditional syscall blocks, and conditional syscall blocks. See `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`.
854+
855+
### Blocked socket domains
856+
853857
Regardless of network mode, certain socket domains are always blocked:
854858

855859
| Domain | Constant | Reason |
@@ -861,7 +865,30 @@ Regardless of network mode, certain socket domains are always blocked:
861865

862866
In proxy mode (which is always active), `AF_INET` (2) and `AF_INET6` (10) are allowed so the sandbox process can reach the proxy.
863867

864-
The seccomp filter uses a default-allow policy (`SeccompAction::Allow`) with specific `socket()` syscall rules that return `EPERM` when the first argument (domain) matches a blocked value. See `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`.
868+
### Blocked syscalls
869+
870+
These syscalls are blocked unconditionally (EPERM for any invocation):
871+
872+
| Syscall | NR (x86-64) | Reason |
873+
|---------|-------------|--------|
874+
| `memfd_create` | 319 | Fileless binary execution bypasses Landlock filesystem restrictions |
875+
| `ptrace` | 101 | Cross-process memory inspection and code injection |
876+
| `bpf` | 321 | Kernel BPF program loading |
877+
| `process_vm_readv` | 310 | Cross-process memory read |
878+
| `io_uring_setup` | 425 | Async I/O subsystem with extensive CVE history |
879+
| `mount` | 165 | Filesystem mount could subvert Landlock or overlay writable paths |
880+
881+
### Conditionally blocked syscalls
882+
883+
These syscalls are blocked only when specific flag patterns are present in their arguments:
884+
885+
| Syscall | NR (x86-64) | Condition | Reason |
886+
|---------|-------------|-----------|--------|
887+
| `execveat` | 322 | `AT_EMPTY_PATH` (0x1000) set in flags (arg4) | Fileless execution from an anonymous fd |
888+
| `unshare` | 272 | `CLONE_NEWUSER` (0x10000000) set in flags (arg0) | User namespace creation enables privilege escalation |
889+
| `seccomp` | 317 | operation == `SECCOMP_SET_MODE_FILTER` (1) in arg0 | Prevents sandboxed code from replacing the active filter |
890+
891+
Flag checks use `MaskedEq` (`(arg & mask) == mask`) to detect the flag bit regardless of other bits. The `seccomp` syscall check uses `Eq` for exact value comparison on the operation argument.
865892

866893
---
867894

crates/openshell-sandbox/src/sandbox/linux/seccomp.rs

Lines changed: 182 additions & 0 deletions
@@ -2,6 +2,15 @@
22
// SPDX-License-Identifier: Apache-2.0
33

44
//! Seccomp syscall filtering.
5+
//!
6+
//! The filter uses a default-allow policy with targeted blocks:
7+
//!
8+
//! 1. **Socket domain blocks** -- prevent raw/kernel sockets that bypass the proxy
9+
//! 2. **Unconditional syscall blocks** -- block syscalls that enable sandbox escape
10+
//! (fileless exec, ptrace, BPF, cross-process memory access, io_uring, mount)
11+
//! 3. **Conditional syscall blocks** -- block dangerous flag combinations on otherwise
12+
//! needed syscalls (execveat+AT_EMPTY_PATH, unshare+CLONE_NEWUSER,
13+
//! seccomp+SET_MODE_FILTER)
514
615
use crate::policy::{NetworkMode, SandboxPolicy};
716
use miette::{IntoDiagnostic, Result};
@@ -13,6 +22,9 @@ use std::collections::BTreeMap;
1322
use std::convert::TryInto;
1423
use tracing::debug;
1524

25+
/// Value of `SECCOMP_SET_MODE_FILTER` (linux/seccomp.h).
26+
const SECCOMP_SET_MODE_FILTER: u64 = 1;
27+
1628
pub fn apply(policy: &SandboxPolicy) -> Result<()> {
1729
if matches!(policy.network.mode, NetworkMode::Allow) {
1830
return Ok(());
@@ -37,6 +49,7 @@ pub fn apply(policy: &SandboxPolicy) -> Result<()> {
3749
fn build_filter(allow_inet: bool) -> Result<seccompiler::BpfProgram> {
3850
let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
3951

52+
// --- Socket domain blocks ---
4053
let mut blocked_domains = vec![libc::AF_PACKET, libc::AF_BLUETOOTH, libc::AF_VSOCK];
4154
if !allow_inet {
4255
blocked_domains.push(libc::AF_INET);
@@ -49,6 +62,51 @@ fn build_filter(allow_inet: bool) -> Result<seccompiler::BpfProgram> {
4962
add_socket_domain_rule(&mut rules, domain)?;
5063
}
5164

65+
// --- Unconditional syscall blocks ---
66+
// These syscalls are blocked entirely (empty rule vec = unconditional EPERM).
67+
68+
// Fileless binary execution via memfd bypasses Landlock filesystem restrictions.
69+
rules.entry(libc::SYS_memfd_create).or_default();
70+
// Cross-process memory inspection and code injection.
71+
rules.entry(libc::SYS_ptrace).or_default();
72+
// Kernel BPF program loading.
73+
rules.entry(libc::SYS_bpf).or_default();
74+
// Cross-process memory read.
75+
rules.entry(libc::SYS_process_vm_readv).or_default();
76+
// Async I/O subsystem with extensive CVE history.
77+
rules.entry(libc::SYS_io_uring_setup).or_default();
78+
// Filesystem mount could subvert Landlock or overlay writable paths.
79+
rules.entry(libc::SYS_mount).or_default();
80+
81+
// --- Conditional syscall blocks ---
82+
83+
// execveat with AT_EMPTY_PATH enables fileless execution from an anonymous fd.
84+
add_masked_arg_rule(
85+
&mut rules,
86+
libc::SYS_execveat,
87+
4, // flags argument
88+
libc::AT_EMPTY_PATH as u64,
89+
)?;
90+
91+
// unshare with CLONE_NEWUSER allows creating user namespaces to escalate privileges.
92+
add_masked_arg_rule(
93+
&mut rules,
94+
libc::SYS_unshare,
95+
0, // flags argument
96+
libc::CLONE_NEWUSER as u64,
97+
)?;
98+
99+
// seccomp(SECCOMP_SET_MODE_FILTER) would let sandboxed code replace the active filter.
100+
let condition = SeccompCondition::new(
101+
0, // operation argument
102+
SeccompCmpArgLen::Dword,
103+
SeccompCmpOp::Eq,
104+
SECCOMP_SET_MODE_FILTER,
105+
)
106+
.into_diagnostic()?;
107+
let rule = SeccompRule::new(vec![condition]).into_diagnostic()?;
108+
rules.entry(libc::SYS_seccomp).or_default().push(rule);
109+
52110
let arch = std::env::consts::ARCH
53111
.try_into()
54112
.map_err(|_| miette::miette!("Unsupported architecture for seccomp"))?;
@@ -74,3 +132,127 @@ fn add_socket_domain_rule(rules: &mut BTreeMap<i64, Vec<SeccompRule>>, domain: i
74132
rules.entry(libc::SYS_socket).or_default().push(rule);
75133
Ok(())
76134
}
135+
136+
/// Block a syscall when a specific bit pattern is set in an argument.
137+
///
138+
/// Uses `MaskedEq` to check `(arg & flag_bit) == flag_bit`, which triggers
139+
/// EPERM when the flag is present regardless of other bits in the argument.
140+
fn add_masked_arg_rule(
141+
rules: &mut BTreeMap<i64, Vec<SeccompRule>>,
142+
syscall: i64,
143+
arg_index: u8,
144+
flag_bit: u64,
145+
) -> Result<()> {
146+
let condition = SeccompCondition::new(
147+
arg_index,
148+
SeccompCmpArgLen::Dword,
149+
SeccompCmpOp::MaskedEq(flag_bit),
150+
flag_bit,
151+
)
152+
.into_diagnostic()?;
153+
let rule = SeccompRule::new(vec![condition]).into_diagnostic()?;
154+
rules.entry(syscall).or_default().push(rule);
155+
Ok(())
156+
}
157+
158+
#[cfg(test)]
159+
mod tests {
160+
use super::*;
161+
162+
#[test]
163+
fn build_filter_proxy_mode_compiles() {
164+
let filter = build_filter(true);
165+
assert!(filter.is_ok(), "build_filter(true) should succeed");
166+
}
167+
168+
#[test]
169+
fn build_filter_block_mode_compiles() {
170+
let filter = build_filter(false);
171+
assert!(filter.is_ok(), "build_filter(false) should succeed");
172+
}
173+
174+
#[test]
175+
fn add_masked_arg_rule_creates_entry() {
176+
let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
177+
let result = add_masked_arg_rule(&mut rules, libc::SYS_execveat, 4, 0x1000);
178+
assert!(result.is_ok());
179+
assert!(
180+
rules.contains_key(&libc::SYS_execveat),
181+
"should have an entry for SYS_execveat"
182+
);
183+
assert_eq!(
184+
rules[&libc::SYS_execveat].len(),
185+
1,
186+
"should have exactly one rule"
187+
);
188+
}
189+
190+
#[test]
191+
fn unconditional_blocks_present_in_filter() {
192+
let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
193+
194+
// Simulate what build_filter does for unconditional blocks
195+
rules.entry(libc::SYS_memfd_create).or_default();
196+
rules.entry(libc::SYS_ptrace).or_default();
197+
rules.entry(libc::SYS_bpf).or_default();
198+
rules.entry(libc::SYS_process_vm_readv).or_default();
199+
rules.entry(libc::SYS_io_uring_setup).or_default();
200+
rules.entry(libc::SYS_mount).or_default();
201+
202+
// Unconditional blocks have an empty Vec (no conditions = always match)
203+
for syscall in [
204+
libc::SYS_memfd_create,
205+
libc::SYS_ptrace,
206+
libc::SYS_bpf,
207+
libc::SYS_process_vm_readv,
208+
libc::SYS_io_uring_setup,
209+
libc::SYS_mount,
210+
] {
211+
assert!(
212+
rules.contains_key(&syscall),
213+
"syscall {syscall} should be in the rules map"
214+
);
215+
assert!(
216+
rules[&syscall].is_empty(),
217+
"syscall {syscall} should have empty rules (unconditional block)"
218+
);
219+
}
220+
}
221+
222+
#[test]
223+
fn conditional_blocks_have_rules() {
224+
// Rebuild the conditional rules the way build_filter does and verify the syscalls have rule entries
225+
// (non-empty Vec means conditional match)
226+
let mut rules: BTreeMap<i64, Vec<SeccompRule>> = BTreeMap::new();
227+
228+
add_masked_arg_rule(
229+
&mut rules,
230+
libc::SYS_execveat,
231+
4,
232+
libc::AT_EMPTY_PATH as u64,
233+
)
234+
.unwrap();
235+
add_masked_arg_rule(&mut rules, libc::SYS_unshare, 0, libc::CLONE_NEWUSER as u64).unwrap();
236+
237+
let condition = SeccompCondition::new(
238+
0,
239+
SeccompCmpArgLen::Dword,
240+
SeccompCmpOp::Eq,
241+
SECCOMP_SET_MODE_FILTER,
242+
)
243+
.unwrap();
244+
let rule = SeccompRule::new(vec![condition]).unwrap();
245+
rules.entry(libc::SYS_seccomp).or_default().push(rule);
246+
247+
for syscall in [libc::SYS_execveat, libc::SYS_unshare, libc::SYS_seccomp] {
248+
assert!(
249+
rules.contains_key(&syscall),
250+
"syscall {syscall} should be in the rules map"
251+
);
252+
assert!(
253+
!rules[&syscall].is_empty(),
254+
"syscall {syscall} should have conditional rules"
255+
);
256+
}
257+
}
258+
}
