ci: published arm64 image runs without file capabilities (rly.best deploy crashes) #54

Closed
opened 2026-05-03 15:38:10 +00:00 by gofix · 0 comments
Owner

Summary

The arm64 image successfully built and pushed by the new Kaniko-based daemonless workflow (#50) is missing the binary file capabilities that the relay needs to configure the kernel WireGuard overlay as a non-root process. The container starts and immediately crash-loops on:

initialize wireguard overlay runtime
configure kernel wireguard interface wg-portal on udp port 51820
Operation not permitted (os error 1)

This came up trying to deploy v2.1.8+rs.1 (commit ab8f612) onto the production relay at rly.best. The deploy was rolled back; rly.best is back on the prior locally-built image (portal-relay-gofix:169c301-arm64).

Evidence

The Dockerfile already runs setcap in the build stage:

setcap cap_net_admin,cap_net_bind_service=+ep /usr/local/bin/portal-relay

Verified by extracting the binary from each image on the relay VM:

=== OLD (locally-built portal-relay-gofix:169c301-arm64) ===
/tmp/relay-old cap_net_bind_service,cap_net_admin=ep
=== NEW (registry-pulled portal-tunnel-rs:v2.1.8-rs.1-arm64) ===
# (no caps set)

Same Dockerfile line, different result. Two factors are relevant:

  1. Builder: previously local docker buildx, now Kaniko via gcr.io/kaniko-project/executor:debug (introduced in #50).
  2. The setcap is applied in the build stage; the binary is then COPY --from=build --chown=65532:65532 into a gcr.io/distroless/cc-debian12:nonroot final stage.

Kaniko has a known class of issues in which extended attributes (xattrs) do not survive COPY --chown across multi-stage builds. File capabilities are stored in the security.capability xattr, so they are lost along with it. The local docker buildx path preserves them.
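The comparison above can be turned into a repeatable check. The following is a minimal sketch, assuming the binary has already been extracted to a local path with its xattrs intact and that getcap (from libcap2-bin) is installed. The function names has_file_caps and check_relay_caps are hypothetical helpers, not part of the project.

```shell
#!/bin/sh
# has_file_caps LINE CAP...
# Succeeds only if every CAP name appears in LINE, where LINE is one line
# of getcap output. Plain substring matching keeps this independent of the
# exact getcap output layout. It does not inspect the flag suffix (e.g. =ep).
has_file_caps() {
  line=$1
  shift
  for cap in "$@"; do
    case "$line" in
      *"$cap"*) ;;          # this capability is present, keep checking
      *) return 1 ;;        # missing capability, or no caps at all
    esac
  done
  return 0
}

# check_relay_caps PATH
# Runs getcap on PATH and requires both capabilities the relay needs.
check_relay_caps() {
  has_file_caps "$(getcap "$1")" cap_net_admin cap_net_bind_service
}

# Example (not executed here):
#   check_relay_caps /tmp/relay-old || echo "missing file capabilities"
```

Run as a CI step before pushing, a check along these lines would have failed on the Kaniko-built binary, since its getcap output contains neither capability name.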

Impact

The published v2.1.8+rs.1 container image (arm64) is non-functional out of the box. Any user pulling code.rly.best/gofix/portal-tunnel-rs:v2.1.8-rs.1-arm64 and running it with --cap-add NET_ADMIN --user 65532:65532 (the documented setup) will hit the same crash loop. Together with #53 (amd64 not building at all), this means no user of the published v2.1.8+rs.1 release can actually run the relay.
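To confirm on a running container that the capability is available but not effective, the masks the kernel reports in /proc/<pid>/status can be decoded. This is a rough sketch, assuming the Linux capability numbering in which CAP_NET_BIND_SERVICE is bit 10 and CAP_NET_ADMIN is bit 12. The helper names mask_has_cap and proc_cap_mask are hypothetical. For a non-root process without file capabilities, --cap-add NET_ADMIN would typically show up in CapBnd (the bounding set) while CapEff (the effective set) stays empty.

```shell
#!/bin/sh
# mask_has_cap HEXMASK BIT
# Succeeds if bit number BIT is set in HEXMASK (hex digits without 0x prefix,
# as printed in /proc/<pid>/status).
mask_has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# proc_cap_mask FIELD [PID]
# Prints the hex mask for FIELD (e.g. CapEff, CapPrm, CapBnd) of PID,
# defaulting to the current process.
proc_cap_mask() {
  awk -v field="$1:" '$1 == field { print $2 }' "/proc/${2:-self}/status"
}

# Example (not executed here):
#   mask_has_cap "$(proc_cap_mask CapBnd)" 12 && echo "NET_ADMIN in bounding set"
#   mask_has_cap "$(proc_cap_mask CapEff)" 12 || echo "NET_ADMIN not effective"
```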

Suggested fixes

  1. Apply setcap after the COPY in the final stage. Distroless has no setcap, so this would require either a multi-FROM trick or a small intermediate image that holds setcap.
  2. Use a non-distroless final stage that has libcap2-bin, then RUN setcap ... && rm -rf /var/lib/apt/lists/*.
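  3. Use Linux ambient capabilities at runtime instead of file capabilities, by wrapping the binary in a small launcher that calls prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, ...) for each capability granted via --cap-add. A capability can only be raised into the ambient set if the launcher already holds it in its permitted and inheritable sets, so the launcher would have to start privileged and drop to 65532 itself. Less invasive for the build, but adds a launcher dependency.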
  4. Switch the workflow back to docker buildx (or another builder that preserves xattrs across multi-stage COPY --chown).

Fix path 1 with a builder image alongside distroless is probably the cleanest:

FROM debian:bookworm-slim AS capstamp
RUN apt-get update && apt-get install -y --no-install-recommends libcap2-bin && rm -rf /var/lib/apt/lists/*
COPY --from=build /usr/local/bin/portal-relay /portal-relay
RUN setcap cap_net_admin,cap_net_bind_service=+ep /portal-relay

FROM gcr.io/distroless/cc-debian12:nonroot
COPY --from=capstamp --chown=65532:65532 /portal-relay /usr/local/bin/portal-relay

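Running setcap in a stage whose output is COPY'd straight into the final image (with no --chown on that step, or with a cap-preserving builder for that one COPY) should keep the xattr. As written, the final COPY above still carries --chown, so it needs to be verified under Kaniko before it is relied on; if the xattr is still dropped, set the ownership inside capstamp before running setcap and remove --chown from the final COPY.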

Release status implication

With #53 (amd64 not built) and this issue (arm64 published image broken), v2.1.8+rs.1 is not actually shippable. Recommend:

  • Treat v2.1.8+rs.1 as a botched release and re-cut as v2.1.8+rs.2 once both this issue and #53 are resolved.
  • Or roll back the tag and Forgejo release entry until both are fixed.

Reproduction

docker run --rm --cap-add NET_ADMIN --user 65532:65532 \
  -e WIREGUARD_PORT=51820 -e API_PORT=4017 -e SNI_PORT=443 \
  -e PORTAL_URL=https://example -e BOOTSTRAPS=https://example \
  -e DISCOVERY=true -e LANDING_PAGE_ENABLED=true \
  -e IDENTITY_PATH=/portal-certs \
  -v $(mktemp -d):/portal-certs \
  code.rly.best/gofix/portal-tunnel-rs:v2.1.8-rs.1-arm64

Expect the container to exit (and crash-loop under a restart policy) with Operation not permitted (os error 1) from the WireGuard interface configuration step.
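For a scripted smoke test, the failure can be detected from the captured container output. This is a minimal sketch that matches the literal error text quoted in this issue; is_cap_failure is a hypothetical helper name, and the exact log wording may differ in other releases.

```shell
#!/bin/sh
# is_cap_failure LOG_TEXT
# Succeeds if LOG_TEXT contains the permission error reported in this issue.
# grep -F treats the pattern as a fixed string, so the parentheses are literal.
is_cap_failure() {
  printf '%s\n' "$1" | grep -qF 'Operation not permitted (os error 1)'
}

# Example (not executed here), with the docker run command from above
# stored in a hypothetical run_relay function:
#   logs=$(run_relay 2>&1)
#   is_cap_failure "$logs" && echo "relay lacks CAP_NET_ADMIN at runtime"
```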

Related: #53 (amd64 not building).

gofix closed this issue 2026-05-03 15:48:32 +00:00
Reference
gofix/portal-tunnel-rs#54