README: split into docs/ folder
nadiamoe committed May 31, 2024
1 parent c3862c2 commit 860b68b
Showing 3 changed files with 204 additions and 186 deletions.
193 changes: 7 additions & 186 deletions README.md
@@ -1,191 +1,12 @@
# 🐊🪙 Crocochrome

Crocochrome is a chromium supervisor.
Crocochrome is a chromium supervisor, which runs and reaps chromium processes on demand.

## On linux capabilities
Crocochrome needs to be granted certain linux capabilities to function, see [doc/capabilities.md](/doc/capabilities.md) for details.

This project, by nature of using a supervisor _and_ chromium, interfaces strongly with linux capabilities inside kubernetes and/or docker. This is a complex topic and even the maintainer's knowledge of it is not perfect. We try to summarize our findings, assumptions, and the things we have verified below.
Crocochrome runs chromium with `--no-sandbox`. The reason for this is that to run with sandboxing enabled, [chromium needs user namespaces to work](docs/chromium-sandbox), which are not available everywhere.

The first relevant bit is that the crocochrome supervisor wants to run chromium as a different user. We do this to prevent chromium from reading certain files (while allowing crocochrome to do so), and to prevent the chromium process from interacting with crocochrome altogether, e.g. reading its environment, or sending signals to it. To run chromium as a different user, crocochrome uses Go's ability to specify `syscall.Credential` when launching a process, and it specifies the uid/gid of the nobody user.

Normally, a process running as non-root wouldn't have the ability to do this. To allow crocochrome to do this, we add the `cap_setuid`, `cap_setgid` and `cap_kill` capabilities to the binary, with the first two allowing it to start another process as another user (any user), and the third allowing it to kill any other process. In order for this to be effective, we also need to add those three capabilities to the "bounding set". In kubernetes, this is done through the container's `securityContext`. These two different places where we specify capabilities (the binary and the container `securityContext`) interact in the following ways:
- If the capabilities are not specified anywhere, crocochrome will return an error while trying to run chromium as a different user: It has no permissions.
- If the capabilities are set in the securityContext but not in the binary, the same thing happens. Capabilities added to the securityContext (bounding set) are not granted automatically to any binary inside the container.
- If the capabilities are set in the binary, but not in the securityContext, the CRI will refuse to even start crocochrome, as it has capabilities that are disallowed in the bounding set.
- If capabilities are set both in the binary and in the securityContext, crocochrome will start and will be able to start and kill processes that run as different users.

A third variable is `allowPrivilegeEscalation`. As per the [k8s docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/), this enforces the `no_new_privs` flag. This flag is defined in the [linux docs](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt), but the relevant part is the following one:
> With no_new_privs set [...] file capabilities will not add to the permitted set [...]
This means that setting `allowPrivilegeEscalation: false` will effectively result in the same scenario as if we never added the capabilities to our binary (scenario #2 in the list above).

### On chromium in particular

Being a web browser, chromium implements a series of security measures to try and isolate individual processes that run external code (javascript). This functionality is on by default, and can be disabled by launching chromium with `--no-sandbox`.

On linux, chromium achieves this isolation by creating new PID and network namespaces to child processes using [clone with `CLONE_NEWPID | CLONE_NEWNET`](https://man7.org/linux/man-pages/man2/clone.2.html). Linux restricts the use of these flags to processes that have `CAP_SYS_ADMIN` (or, it is implied, run as `root`).

It would seem that chromium tries multiple ways of doing this. If the `chromium` process is started as non-root, or without `CAP_SYS_ADMIN`, it will try to use a helper called `chrome-sandbox`, present in `/usr/lib/chromium/chrome-sandbox`. This helper has the setuid bit, so it will run as root regardless of the user who invokes this.

We have verified that chromium will _not_ use this helper if it has `CAP_SYS_ADMIN`, by simply removing that file and trying to start chromium with the sandbox enabled: It runs without errors:

```Dockerfile
FROM alpine:3.20.0

RUN adduser --home / --uid 6666 --shell /bin/nologin --disabled-password k6
RUN apk --no-cache add chromium-swiftshader
RUN rm /usr/lib/chromium/chrome-sandbox

USER k6

ENTRYPOINT [ "chromium", "--headless", "--remote-debugging-address=0.0.0.0", "--remote-debugging-port=5222" ]
```

```
# We have run rm /usr/lib/chromium/chrome-sandbox in this image.
18:43:00 ~/Devel/crocochrome $> dr --cap-add=sys_admin localhost:5000/browser:latest
[0529/164303.395670:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0529/164303.395966:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0529/164303.395993:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
Fontconfig error: No writable cache directories
[0529/164303.403245:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0529/164303.403264:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
DevTools listening on ws://0.0.0.0:5222/devtools/browser/b6f2885e-1f1b-4119-8182-12eea1bae20f
[0529/164303.405298:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[0529/164303.410270:WARNING:sandbox_linux.cc(420)] InitializeSandbox() called with multiple threads in process gpu-process.
```

If we do not add that specific capability, chromium won't start:
```
# We have run rm /usr/lib/chromium/chrome-sandbox in this image.
18:47:58 ~/Devel/crocochrome $> dr --cap-add=all --cap-drop=sys_admin localhost:5000/browser:latest
[0529/164805.302206:FATAL:zygote_host_impl_linux.cc(126)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.
```

Interestingly, the latter scenario does **not** reproduce in k8s: Chromium is able to start with the sandbox enabled, without the `sys_admin` capability, and with the `/usr/lib/chromium/chrome-sandbox` helper removed from the image:

```
[0529/173417.131639:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0529/173417.132032:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0529/173417.132056:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[0529/173417.133214:ERROR:zygote_host_impl_linux.cc(273)] Failed to adjust OOM score of renderer with pid 29: Permission denied (13)
Fontconfig error: No writable cache directories
Fontconfig error: No writable cache directories
[0529/173417.139017:INFO:config_dir_policy_loader.cc(118)] Skipping mandatory platform policies because no policy file was found at: /etc/chromium/policies/managed
[0529/173417.139027:INFO:config_dir_policy_loader.cc(118)] Skipping recommended platform policies because no policy file was found at: /etc/chromium/policies/recommended
DevTools listening on ws://0.0.0.0:5222/devtools/browser/7bf56a77-f480-415e-996d-b6e1d9861362
[0529/173417.140865:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[0529/173417.144703:WARNING:sandbox_linux.cc(420)] InitializeSandbox() called with multiple threads in process gpu-process.
```

The logs above are produced by the following pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: chromium
  labels:
    app.kubernetes.io/name: chromium
spec:
  securityContext:
    runAsUser: 6666
    runAsGroup: 6666
    fsGroup: 6666
  containers:
    - name: chromium-tip
      image: localhost:5000/browser # Same image as we ran with docker
      imagePullPolicy: Always
      securityContext:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["all"]
```
### How chromium sandboxes stuff
Chromium can sandbox processes on linux in two different ways. One is a helper binary with the `setuid` bit set (`/usr/lib/chromium/chrome-sandbox`), and the other is using user namespaces. There's evidence of this here:
https://github.com/chromium/chromium/blob/ebb9dbcfdb158f1d45ef1c4ecab2c5a143c90355/sandbox/policy/linux/sandbox_linux.h#L46-L49

The code that seems to be initializing the sandbox(es) is this function:
https://github.com/chromium/chromium/blob/ebb9dbcfdb158f1d45ef1c4ecab2c5a143c90355/content/browser/zygote_host/zygote_host_impl_linux.cc#L84

That is where `--no-sandbox` is handled, for example, and where it checks whether the setuid'd sandbox binary (`/usr/lib/chromium/chrome-sandbox`) is present and sane.

The code below juggles some flags for preferences and some sanity checks and will error as we saw in docker if no sandbox is available and a sandbox was implicitly requested.
Of interest is the call to `CanCreateProcessInNewUserNS()`, which determines whether chromium will fall back to the setuid'd helper or not, as that is checked first.

https://github.com/chromium/chromium/blob/ebb9dbcfdb158f1d45ef1c4ecab2c5a143c90355/sandbox/linux/services/credentials.cc#L263

`CanCreateProcessInNewUserNS` checks what the name says by sheer experimentation, running quite a few checks. Let's go over them:
1. First, it calls `GetRESIds`. This function essentially fetches the uid and gid of the process by calling `getresuid(2)` and `getresgid(2)` respectively, while also sanity checking that effective and current are equal.
2. Then, it tries to call `ForkWithFlags`, a [wrapper for clone with fork-like behavior](https://github.com/chromium/chromium/blob/ebb9dbcfdb158f1d45ef1c4ecab2c5a143c90355/base/process/launch.h#L432) with [`CLONE_NEWUSER`](https://man7.org/linux/man-pages/man7/user_namespaces.7.html), and returns false if the attempt fails.
3. Checks continue in the child, which the parent monitors and fails if the child itself fails. The child checks the following:
   1. It calls `SetGidAndUidMaps`, as a prerequisite for calling `unshare(2)` later according to a comment. This function essentially writes the current uid/gid to `/proc/self/{u,g}id_map`, while also calling `KernelSupportsDenySetgroups`, which will fail if `/proc/self/setgroups` exists but `deny` cannot be written to it.
   2. It calls `DropAllCapabilities`, which eventually just calls `capset(2)` with an empty list.
   3. It attempts to call `unshare(2)` with `CLONE_NEWUSER`.

3.3 is interesting, as according to the [kernel docs](https://man7.org/linux/man-pages/man7/user_namespaces.7.html)
> A call to clone(2) or unshare(2) with the CLONE_NEWUSER flag
makes the new child process (for clone(2)) or the caller (for
unshare(2)) a member of the new user namespace created by the
call.

In any case, running chromium with `strace` on both docker and k8s reveals the hidden truth:
```Dockerfile
ENTRYPOINT [ "strace", "--", "chromium", "--enable-logging=stderr", "--v=1", "--headless", "--remote-debugging-address=0.0.0.0", "--remote-debugging-port=5222" ]
```

In docker:
```
[...]
# Check if the setuid sandbox exists, as part of the setuid sandbox init. This is done before checking if the sandbox is there.
access("/usr/lib/chromium/chrome-sandbox", F_OK) = -1 ENOENT (No such file or directory)
stat("/proc/self/exe", {st_mode=S_IFREG|0755, st_size=209133120, ...}) = 0
getuid() = 6666
# Call to GetRESIds
getresuid([6666], [6666], [6666]) = 0
getresgid([6666], [6666], [6666]) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
# Try to ForkWithFlags
clone(child_stack=0x7ffd0e837588, flags=CLONE_NEWUSER|SIGCHLD) = -1 EPERM (Operation not permitted)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
gettid() = 9
# Complain.
write(2, "[0530/111418.691871:FATAL:zygote"..., 352[0530/111418.691871:FATAL:zygote_host_impl_linux.cc(126)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.
```

In k8s, however:
```
2024-05-30T11:26:05.289753521Z access("/usr/lib/chromium/chrome-sandbox", F_OK) = -1 ENOENT (No such file or directory)
2024-05-30T11:26:05.289775254Z stat("/proc/self/exe", {st_mode=S_IFREG|0755, st_size=209133120, ...}) = 0
2024-05-30T11:26:05.289790531Z getuid() = 6666
2024-05-30T11:26:05.289810913Z getresuid([6666], [6666], [6666]) = 0
2024-05-30T11:26:05.289827857Z getresgid([6666], [6666], [6666]) = 0
2024-05-30T11:26:05.289845350Z rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
# Clone succeeds!
2024-05-30T11:26:05.290460149Z clone(child_stack=0x7ffdbb321468, flags=CLONE_NEWUSER|SIGCHLD) = 13
2024-05-30T11:26:05.290481058Z rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
# Wait for child.
2024-05-30T11:26:05.291189440Z wait4(13, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 13
# Child exits correctly.
2024-05-30T11:26:05.291200788Z --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=13, si_uid=6666, si_status=0, si_utime=0, si_stime=0} ---
```

Both in docker and k8s this should be allowed as per the kernel setting:
```
cat /proc/sys/kernel/unprivileged_userns_clone
1
```

But docker denies this call through its own default [seccomp policies](https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile). CRI-O, and possibly containerd, do not deny it, as reported [here](https://github.com/cgwalters/container-cve-2021-22555/blob/main/README.md#note-criopodman-runtimedefault-policy-vs-docker) ([archive.org](https://web.archive.org/web/20240530113241/https://github.com/cgwalters/container-cve-2021-22555/blob/main/README.md#note-criopodman-runtimedefault-policy-vs-docker)).

We verified this by checking that the following command successfully launches chromium, with the setuid'd binary removed:
```console
$ dr --cap-drop=all --security-opt seccomp=unconfined localhost:5000/browser:latest
```
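To check from inside a container whether a seccomp filter is installed at all, the `Seccomp` field of `/proc/self/status` can be read. The sketch below is only an illustration with a hypothetical helper name (`seccompMode`). It reports the mode (0 disabled, 1 strict, 2 filter), and cannot tell whether a given filter actually blocks `clone` with `CLONE_NEWUSER`.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// seccompMode extracts the "Seccomp" field from the contents of
// /proc/<pid>/status. The boolean is false if the field is missing or malformed.
func seccompMode(status string) (int, bool) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok || key != "Seccomp" {
			continue
		}
		mode, err := strconv.Atoi(strings.TrimSpace(val))
		return mode, err == nil
	}
	return 0, false
}

func main() {
	raw, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	if mode, ok := seccompMode(string(raw)); ok {
		fmt.Println("seccomp mode:", mode) // 0 = disabled, 1 = strict, 2 = filter
	} else {
		fmt.Println("seccomp field not found")
	}
}
```

With `--security-opt seccomp=unconfined` one would expect mode 0, and mode 2 under docker's default profile.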
Moreover, chromium's sandbox focuses on isolating the processes running untrusted code from other processes, the network, and the filesystem.
Regarding process isolation, we only run one chromium process at a time, and that is the only process in the container running as the (unprivileged) chromium user. Therefore we do not see much value in this isolation.
Regarding filesystem access, the whole container is run with a read-only filesystem. The Crocochrome binary is not readable or runnable by the user chromium is running as, and there should be no sensitive files to be accessed whatsoever.
Regarding the network, we can use `NetworkPolicy` objects to forbid the Crocochrome container from accessing private IP ranges.
27 changes: 27 additions & 0 deletions doc/capabilities.md
@@ -0,0 +1,27 @@
# On linux capabilities

This project, by nature of using a supervisor _and_ chromium, interfaces strongly with linux capabilities inside kubernetes and/or docker. This is a complex topic and even the maintainer's knowledge of it is not perfect. We try to summarize our findings, assumptions, and the things we have verified below.

The first relevant bit is that the crocochrome supervisor wants to run chromium as a different user. We do this to prevent chromium from reading certain files (while allowing crocochrome to do so), and to prevent the chromium process from interacting with crocochrome altogether, e.g. reading its environment, or sending signals to it. To run chromium as a different user, crocochrome uses Go's ability to specify `syscall.Credential` when launching a process, and it specifies the uid/gid of the nobody user.
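As a minimal sketch of that mechanism (not crocochrome's actual code), the following builds a command that will be started under a different uid/gid. The helper name `commandAs` and the `nobodyID` value of 65534 are assumptions for illustration, the latter being the conventional id of `nobody` on many distributions.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// nobodyID is an assumption: 65534 is the usual uid/gid of "nobody",
// but it should be checked against /etc/passwd in the actual image.
const nobodyID = 65534

// commandAs builds (but does not start) a command that will run with the
// given uid and gid instead of those of the calling process.
func commandAs(uid, gid uint32, name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Credential: &syscall.Credential{Uid: uid, Gid: gid},
	}
	return cmd
}

func main() {
	out, err := commandAs(nobodyID, nobodyID, "id").CombinedOutput()
	if err != nil {
		// Expected when the caller is neither root nor holds cap_setuid/cap_setgid.
		fmt.Println("could not run as nobody:", err)
		return
	}
	fmt.Printf("%s", out)
}
```

Started by an unprivileged user without the capabilities discussed below, this should fail when the child tries to switch users.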

Normally, a process running as non-root wouldn't have the ability to do this. To allow crocochrome to do this, we add the `cap_setuid`, `cap_setgid` and `cap_kill` capabilities to the binary, with the first two allowing it to start another process as another user (any user), and the third allowing it to kill any other process. In order for this to be effective, we also need to add those three capabilities to the "bounding set". In kubernetes, this is done through the container's `securityContext`. These two different places where we specify capabilities (the binary and the container `securityContext`) interact in the following ways:
- If the capabilities are not specified anywhere, crocochrome will return an error while trying to run chromium as a different user: It has no permissions.
- If the capabilities are set in the securityContext but not in the binary, the same thing happens. Capabilities added to the securityContext (bounding set) are not granted automatically to any binary inside the container.
- If the capabilities are set in the binary, but not in the securityContext, the CRI will refuse to even start crocochrome, as it has capabilities that are disallowed in the bounding set.
- If capabilities are set both in the binary and in the securityContext, crocochrome will start and will be able to start and kill processes that run as different users.
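When debugging which of these scenarios applies, it can help to look at the capability sets the running process actually ended up with. The sketch below, using hypothetical helper names (`parseCapSets`, `hasCap`), decodes the `Cap*` bitmasks from `/proc/self/status`. Roughly, `CapBnd` reflects what the `securityContext` allows, while `CapPrm` and `CapEff` show what the process really obtained.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Capability bit positions, as defined in linux/capability.h.
const (
	capKill   = 5
	capSetgid = 6
	capSetuid = 7
)

// parseCapSets extracts the Cap* hexadecimal bitmasks (CapPrm, CapEff,
// CapBnd, ...) from the contents of /proc/<pid>/status.
func parseCapSets(status string) map[string]uint64 {
	sets := map[string]uint64{}
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok || !strings.HasPrefix(key, "Cap") {
			continue
		}
		if mask, err := strconv.ParseUint(strings.TrimSpace(val), 16, 64); err == nil {
			sets[key] = mask
		}
	}
	return sets
}

// hasCap reports whether capability bit n is set in mask.
func hasCap(mask uint64, n uint) bool { return mask&(1<<n) != 0 }

func main() {
	raw, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	sets := parseCapSets(string(raw))
	for _, name := range []string{"CapBnd", "CapPrm", "CapEff"} {
		m := sets[name]
		fmt.Printf("%s: setuid=%t setgid=%t kill=%t\n",
			name, hasCap(m, capSetuid), hasCap(m, capSetgid), hasCap(m, capKill))
	}
}
```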

A third variable is `allowPrivilegeEscalation`. As per the [k8s docs](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/), this enforces the `no_new_privs` flag. This flag is defined in the [linux docs](https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt), but the relevant part is the following one:
> With no_new_privs set [...] file capabilities will not add to the permitted set [...]
This means that setting `allowPrivilegeEscalation: false` will effectively result in the same scenario as if we never added the capabilities to our binary (scenario #2 in the list above).
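Whether `no_new_privs` is in effect for a process can be queried with `prctl(2)`. This is a small sketch assuming Linux and using a hypothetical helper name (`noNewPrivs`). The constant 39 is `PR_GET_NO_NEW_PRIVS` from `linux/prctl.h`.

```go
package main

import (
	"fmt"
	"syscall"
)

// prGetNoNewPrivs is PR_GET_NO_NEW_PRIVS from linux/prctl.h.
const prGetNoNewPrivs = 39

// noNewPrivs reports whether the calling process has the no_new_privs flag set.
func noNewPrivs() (bool, error) {
	// The remaining prctl arguments must be zero for this option.
	r, _, errno := syscall.Syscall6(syscall.SYS_PRCTL, prGetNoNewPrivs, 0, 0, 0, 0, 0)
	if errno != 0 {
		return false, errno
	}
	return r == 1, nil
}

func main() {
	set, err := noNewPrivs()
	if err != nil {
		fmt.Println("prctl failed:", err)
		return
	}
	fmt.Println("no_new_privs set:", set)
}
```

In a container started with `allowPrivilegeEscalation: false`, this would be expected to report `true`.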

The recommended (container) `securityContext` for crocochrome is the following:
```yaml
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    add: ["setuid", "setgid", "kill"] # For dropping privileges and killing children.
    drop: ["all"]
```