fix: cap amd64 build memory to fit Forgejo runner budget #56

Merged
gofix merged 1 commit from fix/amd64-build-oom-mitigation into master 2026-05-03 16:36:11 +00:00

Refs #53.

Diagnosis

The amd64 container-image job has been failing deterministically inside Kaniko's cargo build step (signal: 9, SIGKILL) with no log surfaced through the Forgejo Actions API. Two pieces of evidence pointed at OOM:

  1. The same SIGKILL signature reproduced when building the same Dockerfile on the rly.best VM (462 MiB RAM, 475 MiB swap). The rustc invocation that died was the final binary link, just as on the Forgejo runner.
  2. Adding a 3 GiB temporary swap file on rly.best made the build complete cleanly. No code changes, just memory headroom.
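
The swap experiment in item 2 can be sketched roughly as follows. This is a minimal sketch, not the exact commands that were run on rly.best: the function names, the `DRY_RUN` switch, and the swap file path are all hypothetical, and it assumes a Linux host with GNU `dd`, `mkswap`, and `swapon` available and root privileges.

```shell
# Hypothetical helpers for adding and removing a temporary swap file.
# With DRY_RUN=1 the commands are only printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# add_temp_swap PATH SIZE_MIB
# Writes a zero-filled file (dd avoids sparse files, which swapon rejects),
# restricts its permissions, formats it as swap, and enables it.
add_temp_swap() {
  run dd if=/dev/zero of="$1" bs=1M count="$2" &&
    run chmod 600 "$1" &&
    run mkswap "$1" &&
    run swapon "$1"
}

# remove_temp_swap PATH
# Disables the swap file and deletes it.
remove_temp_swap() {
  run swapoff "$1" && run rm -f "$1"
}
```

For a 3 GiB file this would be called as `add_temp_swap /swapfile.tmp 3072` before the build and `remove_temp_swap /swapfile.tmp` afterwards; prefixing either call with `DRY_RUN=1` prints the commands without touching the system.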

The Forgejo amd64 runner appears to have a similar memory ceiling. Kaniko's runner cleanup also wipes the task log artifacts, which is why the API has been returning "task with job_id N and attempt 0: resource does not exist" for these failed runs: the OOM happens during the longest step, the runner gets killed, and the log never flushes.

Fix

Cap memory pressure inside the Dockerfile so the build no longer needs runner-side swap:

  • CARGO_BUILD_JOBS=1 — only one rustc at a time, eliminating the multiplicative peak from parallel codegen on a 4-8 core runner.
  • CARGO_PROFILE_RELEASE_CODEGEN_UNITS=256 — smaller codegen units mean a lower per-unit memory peak when the linker pulls them in. With jobs=1 the wall-clock impact is small.
  • --verbose — if it still crashes, rustc's progress shows up in the Forgejo log.
  • uname / nproc / free -h / df -h printed before and after the build so the next failure tells us the runner envelope without UI log access.
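
Taken together, the four items above amount to something like the following script, which a Dockerfile `RUN` step could source before building. This is a sketch under assumptions rather than the contents of the actual Dockerfile: the function names `print_envelope` and `run_with_envelope` are hypothetical, and the availability guard around each diagnostic tool is an addition for minimal images.

```shell
#!/bin/sh
# Memory caps from the list above; Cargo reads both from the environment.
export CARGO_BUILD_JOBS=1
export CARGO_PROFILE_RELEASE_CODEGEN_UNITS=256

# Print the runner envelope under a labelled banner. Each tool is guarded
# because a minimal image may not ship all of them.
print_envelope() {
  echo "== runner envelope: $1 =="
  for tool in "uname -a" "nproc" "free -h" "df -h"; do
    name=${tool%% *}
    if command -v "$name" >/dev/null 2>&1; then
      $tool
    else
      echo "$name: not available"
    fi
  done
}

# Run a command with the envelope printed before and after it, and
# return the command's own exit status so a failed build still fails.
run_with_envelope() {
  print_envelope before
  status=0
  "$@" || status=$?
  print_envelope after
  return "$status"
}
```

A build step could then call `run_with_envelope cargo build --release --verbose`; the `--release` flag is assumed here because the codegen-units override targets the release profile. Because the "after" block is printed even when the build command fails, a killed rustc still leaves the memory and disk figures in the log.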

Why this doesn't change the released binary

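codegen-units controls how each crate is split into chunks for compilation. opt-level is unchanged from the release default (3), and the lto setting is unchanged from its default (false). The output binary is functionally equivalent. Runtime performance is expected to be close, although the Cargo documentation notes that a higher codegen-units value may produce slower code, so this is worth a benchmark if performance matters for the release.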

Validation

Local Rust CI matrix on uvm:

  • cargo fmt --check — clean.
  • cargo test --locked — 86 passed, 3 ignored (kernel-WG, expected).
  • cargo clippy --locked --all-targets -- -D warnings — clean.

CI will need a tag push to actually exercise the container-image workflow. After this PR merges I will push a temporary v0.0.0-debug-amd64 tag, watch the workflow, and either close #53 (if amd64 succeeds) or attach the new error signature for further work.
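
The temporary tag round trip could look like the sketch below. The helper names are hypothetical, the remote name `origin` is an assumption, and the tag is created at whatever commit is currently checked out, so it assumes the merged master commit is checked out first.

```shell
# Hypothetical helpers for a throwaway tag that triggers the workflow.
# Assumes the remote is named "origin".

# Create a lightweight tag at HEAD and push only that tag.
push_debug_tag() {
  tag=$1
  git tag "$tag" && git push origin "refs/tags/$tag"
}

# Remove the tag from the remote first, then locally.
delete_debug_tag() {
  tag=$1
  git push origin --delete "refs/tags/$tag" && git tag -d "$tag"
}
```

Usage would be `push_debug_tag v0.0.0-debug-amd64` to start the run and `delete_debug_tag v0.0.0-debug-amd64` once the result has been recorded on #53.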

fix: cap amd64 build memory to fit Forgejo runner budget
All checks were successful
Rust CI / Format, lint, and test (pull_request) Successful in 1m26s
5af7c6e083
The amd64 container-image job has been failing deterministically inside
the Kaniko cargo build step (`signal: 9, SIGKILL`) because the Rust
release link plus parallel rustc invocations exceed the runner's memory
budget. The same OOM signature reproduces on the rly.best arm64 VM at
462 MiB RAM, where it disappears once a temporary swap file is added.

Cap memory pressure inside the Dockerfile so the build fits regardless
of runner sizing:

- `CARGO_BUILD_JOBS=1` so only one rustc runs at a time, eliminating
  the 8x peak from parallel codegen.
- `CARGO_PROFILE_RELEASE_CODEGEN_UNITS=256` so each codegen unit is
  smaller in memory, even though only one runs at once. With jobs=1
  the total compile time is similar; the win is a lower per-unit peak
  during link.
- `--verbose` so the Forgejo log surfaces what rustc is actually doing
  if it crashes again.
- Diagnostics block (`uname`, `nproc`, `free -h`, `df -h`) printed
  before and after the build so the next failure surfaces the runner's
  memory and disk envelope without needing UI log access.

This does not change the released binary's runtime behavior;
codegen-units only affects the compilation pipeline, not the binary's
optimization (`opt-level` is unchanged from default release).

Refs #53.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gofix merged commit 7e55e73d94 into master 2026-05-03 16:36:11 +00:00
Reference
gofix/portal-tunnel-rs!56