Skip to content

Commit b0bb43b

Browse files
committed
fix(supervisor): drop sandbox child capability bounding set
Reduce the Linux capability bounding set in the common privilege-drop path before executing sandbox workloads or connect shells and use capctl Signed-off-by: Adrien Langou <alangou@nvidia.com>
1 parent 7bce122 commit b0bb43b

6 files changed

Lines changed: 270 additions & 36 deletions

File tree

Cargo.lock

Lines changed: 42 additions & 24 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

architecture/sandbox.md

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,12 @@ Each sandbox workload has two trust levels:
1414
| Agent child | Runs as an unprivileged user with filesystem, process, and network restrictions applied. |
1515

1616
The supervisor keeps enough privilege to manage the sandbox, but the agent child
17-
loses that privilege before user code runs.
17+
loses that privilege before user code runs. On Linux, child setup clears the
18+
capability bounding set during privilege drop so later execs cannot regain
19+
container-granted capabilities. This is fail-closed: the supervisor retains
20+
`CAP_SETPCAP` solely to perform the clear, and spawning the workload or SSH shell
21+
aborts unless the bounding set ends up empty. A `setpcap` `EPERM` is tolerated
22+
only when the set is already empty; any other outcome fails the spawn.
1823

1924
## Startup Flow
2025

crates/openshell-driver-podman/README.md

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -42,7 +42,7 @@ The container spec in `container.rs` sets these security-critical fields:
4242
|---|---|---|
4343
| `user` | `0:0` | The supervisor needs root inside the container for namespace creation, proxy setup, Landlock, seccomp, and filesystem preparation. |
4444
| `cap_drop` | Selected unneeded defaults | Podman's default capability set is already restricted. The driver drops capabilities the supervisor does not need. |
45-
| `cap_add` | `SYS_ADMIN`, `NET_ADMIN`, `SYS_PTRACE`, `SYSLOG`, `DAC_READ_SEARCH` | Grants supervisor-only capabilities required for namespace setup, process identity, and bypass diagnostics. |
45+
| `cap_add` | `SYS_ADMIN`, `NET_ADMIN`, `SYS_PTRACE`, `SYSLOG`, `DAC_READ_SEARCH`, `SETPCAP` | Grants supervisor-only capabilities required for namespace setup, process identity, bypass diagnostics, and child bounding-set cleanup. |
4646
| `no_new_privileges` | `true` | Prevents privilege escalation after exec. |
4747
| `seccomp_profile_path` | `unconfined` | The supervisor installs its own policy-aware BPF filter. A container-level profile can block Landlock/seccomp syscalls during setup. |
4848
| `mounts` | Private tmpfs at `/run/netns` | Lets the supervisor create named network namespaces in rootless Podman. |
@@ -98,12 +98,15 @@ openshell sandbox create \
9898
| `SYS_PTRACE` | Reading `/proc/<pid>/exe` and walking process ancestry for binary identity. |
9999
| `SYSLOG` | Reading `/dev/kmsg` for bypass-detection diagnostics. |
100100
| `DAC_READ_SEARCH` | Reading `/proc/<pid>/fd/` across UIDs so the proxy can resolve the binary responsible for a connection. |
101+
| `SETPCAP` | Clearing the restricted child process capability bounding set before exec. |
101102

102103
The driver intentionally keeps Podman's default `SETUID`, `SETGID`, `CHOWN`,
103104
and `FOWNER` capabilities because the supervisor needs them to drop privileges
104-
and prepare writable sandbox directories. It drops unneeded defaults such as
105+
and prepare writable sandbox directories. It also keeps `SETPCAP` until child
106+
setup so `drop_privileges()` can clear the child capability bounding set before
107+
exec. It drops unneeded defaults such as
105108
`DAC_OVERRIDE`, `FSETID`, `KILL`, `NET_BIND_SERVICE`, `NET_RAW`, `SETFCAP`,
106-
`SETPCAP`, and `SYS_CHROOT`.
109+
and `SYS_CHROOT`.
107110

108111
## Supervisor Sideloading
109112

crates/openshell-driver-podman/src/container.rs

Lines changed: 16 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -877,8 +877,6 @@ pub fn build_container_spec_with_token_and_gpu_devices(
877877
"NET_RAW".into(),
878878
// Not needed: the supervisor does not manipulate file capabilities.
879879
"SETFCAP".into(),
880-
// Not needed: the supervisor does not manage its own capability bounding set.
881-
"SETPCAP".into(),
882880
// Not needed: the supervisor does not call chroot().
883881
"SYS_CHROOT".into(),
884882
],
@@ -899,13 +897,18 @@ pub fn build_container_spec_with_token_and_gpu_devices(
899897
// Without it the proxy cannot determine which binary made each outbound
900898
// connection and all traffic is denied.
901899
"DAC_READ_SEARCH".into(),
900+
// Child setup clears the capability bounding set before exec, which
901+
// requires CAP_SETPCAP in the supervisor until drop_privileges().
902+
"SETPCAP".into(),
902903
],
903-
// SETUID, SETGID, CHOWN, and FOWNER are intentionally kept from Podman's
904-
// default set and not dropped:
904+
// SETUID, SETGID, SETPCAP, CHOWN, and FOWNER are intentionally kept from
905+
// Podman's default set and not dropped:
905906
// SETUID/SETGID – drop_privileges(): setuid()/setgid()/initgroups() to the
906907
// sandbox user. In rootless Podman cap_drop:ALL removes them
907908
// from the bounding set even though uid=0 owns the user
908909
// namespace — so we keep them by not dropping them explicitly.
910+
// SETPCAP – drop_privileges(): clears the child capability
911+
// bounding set before the sandbox user execs.
909912
// CHOWN – prepare_filesystem(): chown(path, uid, gid) on newly
910913
// created read_write directories so the sandbox user can
911914
// write to them.
@@ -1451,12 +1454,14 @@ mod tests {
14511454
added.contains(&"DAC_READ_SEARCH"),
14521455
"missing DAC_READ_SEARCH"
14531456
);
1457+
assert!(added.contains(&"SETPCAP"), "missing SETPCAP");
14541458

14551459
// SETUID and SETGID are NOT in cap_add — they remain available from the
14561460
// default bounding set because we no longer use cap_drop:ALL. Verify they
1457-
// are also not explicitly dropped. Similarly CHOWN and FOWNER must not be
1458-
// dropped because prepare_filesystem() calls chown() on newly created
1459-
// read_write directories before the supervisor drops privileges.
1461+
// are also not explicitly dropped. Similarly SETPCAP, CHOWN and FOWNER
1462+
// must not be dropped because child setup clears the bounding set and
1463+
// prepare_filesystem() calls chown() on newly created read_write
1464+
// directories before the supervisor drops privileges.
14601465
let dropped: Vec<&str> = spec["cap_drop"]
14611466
.as_array()
14621467
.expect("cap_drop should be an array")
@@ -1473,6 +1478,10 @@ mod tests {
14731478
!dropped.contains(&"FOWNER"),
14741479
"FOWNER must not be dropped (needed for chown on non-owned files)"
14751480
);
1481+
assert!(
1482+
!dropped.contains(&"SETPCAP"),
1483+
"SETPCAP must not be dropped (needed for child bounding-set clear)"
1484+
);
14761485
assert!(
14771486
!dropped.contains(&"ALL"),
14781487
"must not use cap_drop:ALL in rootless Podman"

crates/openshell-supervisor-process/Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ libc = "0.2"
3535
rustix = { workspace = true }
3636

3737
[target.'cfg(target_os = "linux")'.dependencies]
38+
capctl = "0.2.4"
3839
landlock = "0.4"
3940
seccompiler = "0.5"
4041
tempfile = "3"

0 commit comments

Comments
 (0)