ci: amd64 container image build fails deterministically (release v2.1.8+rs.1 blocked) #53

Closed
opened 2026-05-03 15:26:08 +00:00 by gofix · 0 comments
Owner

Summary

The Build and publish container image workflow consistently fails on the linux/amd64 matrix job. The linux/arm64 job succeeds with the same code, Dockerfile, build args, and registry auth. As a result, multi-arch manifest publishing is skipped and tagged releases (e.g., v2.1.8+rs.1) ship with only the arm64 image in the registry; pulling :<tag> (no arch suffix) fails.

This blocks the first stable Rust port release.

Evidence

Last two workflow runs for tag v2.1.8+rs.1 (commit ab8f612):

Run              amd64          arm64          publish
39 (first push)  failure 1m16s  success 5m17s  skipped
40 (re-push)     failure 30s    success 1m11s  skipped

The second attempt fails about 2.5× faster than the first (30s vs. 1m16s), which suggests a deterministic failure in the Kaniko step that warm caches simply reach sooner, rather than a flaky runner.
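The speed-up between the two amd64 attempts can be checked directly from the reported durations. This is a minimal sketch; the helper name `parse_duration` is hypothetical and only handles the `XmYs` and `Ys` forms that appear in the table above.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '1m16s' or '30s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2))
    return minutes * 60 + seconds


first = parse_duration("1m16s")  # run 39, amd64 job
second = parse_duration("30s")   # run 40, amd64 job
ratio = first / second
print(f"{first}s vs {second}s: {ratio:.1f}x faster")  # 76s vs 30s: 2.5x faster
```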

Full container-image workflow history on this repo:

run 40  failure  v2.1.8+rs.1   (chore: derive RELEASE_VERSION ...)
run 39  failure  v2.1.8+rs.1   (initial tag push)
run 17  failure  v2.1.8-rs.dev5
run 10  failure  v2.1.8-rs.dev4
run  7  failure  v2.1.8-rs.dev4
run  6  failure  v2.1.8-rs.dev4
run  2  success  v2.1.8-rs.dev2  (legacy buildx workflow, not Kaniko)
run  1  failure  v2.1.8-rs.dev2

The daemonless Kaniko workflow introduced in #50 was validated only against linux/arm64 with --no-push; the linux/amd64 matrix arm has never produced a successful image.

What is in the registry now

  • code.rly.best/gofix/portal-tunnel-rs:v2.1.8-rs.1-arm64 ✓ published
  • code.rly.best/gofix/portal-tunnel-rs:v2.1.8-rs.1 ✗ multi-arch index missing (publish job skipped)
  • code.rly.best/gofix/portal-tunnel-rs:latest ✗ not updated

The git tag v2.1.8+rs.1 and the corresponding Forgejo release entry exist but reference a release whose container image is only partially shipped.
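The registry state above could be re-checked by requesting each tag's manifest. The sketch below only builds the request (URL and `Accept` header) following the OCI distribution `GET /v2/<name>/manifests/<reference>` route. The function name `manifest_request` is hypothetical, and authentication with REGISTRY_TOKEN and the HTTP call itself are deliberately left out.

```python
from urllib.parse import quote

# Media types a registry may use for a multi-arch index.
INDEX_MEDIA_TYPES = (
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
)


def manifest_request(image_ref: str) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for GET /v2/<name>/manifests/<tag>."""
    repo, sep, tag = image_ref.rpartition(":")
    # Reject refs without a tag (e.g. 'host:5000/name' would split on the port).
    if not sep or "/" in tag:
        raise ValueError(f"expected host/name:tag, got {image_ref!r}")
    host, _, name = repo.partition("/")
    url = f"https://{host}/v2/{name}/manifests/{quote(tag)}"
    return url, {"Accept": ", ".join(INDEX_MEDIA_TYPES)}


url, headers = manifest_request("code.rly.best/gofix/portal-tunnel-rs:v2.1.8-rs.1")
print(url)
```

A 404 response for the unsuffixed tag alongside a 200 for the `-arm64` tag would be consistent with the state listed above.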

Steps that succeed before the failure

From the run page state of run #40, job Build linux/amd64 image:

Set up job                                    success  2s
Checkout repository                           success  1s
Compute image tag and rs build identifier     success  0s
Prepare registry authentication               success  0s
Build and publish image with Kaniko           failure  27s   <-- here
Complete job                                  failure  0s

So: tag-to-image-tag conversion (+ → -), rs.N extraction from the ref, and registry auth are all working. The failure is inside the Kaniko executor invocation itself.
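For reference, the two steps that already work can be sketched as follows. This is an illustration, not the workflow's actual script: the function names are hypothetical, and the assumption that only a trailing `rs.<digits>` counts as a build identifier (so `rs.dev5` yields nothing) is mine.

```python
import re
from typing import Optional


def image_tag_from_ref(ref: str) -> str:
    """Map a git tag ref to an image tag; '+' is not valid in image tags."""
    tag = ref.removeprefix("refs/tags/")
    return tag.replace("+", "-")


def rs_identifier(ref: str) -> Optional[str]:
    """Return the trailing 'rs.N' build identifier, or None if absent."""
    match = re.search(r"[+-](rs\.\d+)$", ref)
    return match.group(1) if match else None


ref = "refs/tags/v2.1.8+rs.1"
print(image_tag_from_ref(ref))  # v2.1.8-rs.1
print(rs_identifier(ref))       # rs.1
```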

Why this issue does not include a log excerpt

Forgejo's job log endpoints return task with job_id N and attempt 0: resource does not exist for both run #39 and run #40 amd64 jobs. The runner appears to either drop logs on failed cleanup or never upload them in this configuration. The UI may still show partial output that the API does not expose.

Suspected causes (none confirmed without logs)

  1. amd64 runner OOM during Rust release build (-C lto=..., aws-lc-sys, ring C deps are memory-hungry).
  2. amd64 runner disk full from accumulated Cargo cache mounts (portal-cargo-target-amd64 has been used by every previous failed run; the cache mount persists across runs, and sharing=locked only serializes access to it).
  3. Kaniko cache-repo push permission missing for code.rly.best/gofix/portal-tunnel-rs-cache-amd64 specifically (REGISTRY_TOKEN may have access to the main repo but not this side repo).
  4. amd64 runner missing one of gcc-x86-64-linux-gnu / libc6-dev-amd64-cross / linux-libc-dev-amd64-cross packages from a base-image change.
  5. Cross-compile env var setup in the Dockerfile happens to interact badly with native-amd64 build (CC_x86_64_unknown_linux_gnu=x86_64-linux-gnu-gcc is set even when building natively, which forces use of a cross GCC that may behave differently from the default).
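Causes 1 and 2 could be narrowed down with a pre-flight check on the runner before the Kaniko step. The sketch below is an assumption-laden illustration: the function names and the 4 GiB / 10 GiB thresholds are placeholders, not measured requirements of this build, and the memory reading relies on the Linux `/proc/meminfo` format.

```python
import shutil


def parse_mem_available_kib(meminfo_text: str) -> int:
    """Extract MemAvailable (in KiB) from /proc/meminfo-formatted text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def free_disk_gib(path: str) -> float:
    """Free space in GiB on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30


def preflight(meminfo_text: str, path: str,
              min_mem_gib: float = 4.0,
              min_disk_gib: float = 10.0) -> list[str]:
    """Return warnings for low memory or disk; thresholds are placeholders."""
    warnings = []
    mem_gib = parse_mem_available_kib(meminfo_text) / 2**20
    if mem_gib < min_mem_gib:
        warnings.append(f"low memory: {mem_gib:.1f} GiB available")
    disk_gib = free_disk_gib(path)
    if disk_gib < min_disk_gib:
        warnings.append(f"low disk: {disk_gib:.1f} GiB free at {path}")
    return warnings
```

On the runner this would be called with the contents of `/proc/meminfo` and the directory that backs the cache mount. An empty result does not rule out OOM during the build, since peak usage under LTO can be far higher than what is free at job start.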

Suggested next steps

  • Pull amd64 job log from the Forgejo Actions UI (web view, not API) and attach.
  • If OOM: reduce build parallelism (CARGO_BUILD_JOBS=1) for amd64 only, or turn LTO off for the release profile.
  • If disk: prune the portal-cargo-target-amd64 cache mount and portal-tunnel-rs-cache-amd64 cache repo on the registry.
  • If cache-repo auth: verify REGISTRY_TOKEN has push permission on the per-arch cache repo.
  • If none of the above, simplify the amd64 Dockerfile path to skip the cross-GCC export when TARGETARCH == amd64 and the runner is also amd64, and let cargo use the default linker.
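The last step amounts to exporting the cross compiler only when the target and build architectures differ. A minimal sketch of that decision follows. The function name, the `build_arch` input, and the arm64 table entry are assumptions made by analogy with the amd64 variable named in this issue; how the runner's architecture reaches the Dockerfile is not shown.

```python
# Arch names as used by TARGETARCH, mapped to (target triple, cross GCC).
# The arm64 entry is an assumption; only the amd64 pair appears in this issue.
CROSS_TOOLCHAINS = {
    "amd64": ("x86_64_unknown_linux_gnu", "x86_64-linux-gnu-gcc"),
    "arm64": ("aarch64_unknown_linux_gnu", "aarch64-linux-gnu-gcc"),
}


def cross_cc_env(target_arch: str, build_arch: str) -> dict[str, str]:
    """Return CC_<triple> overrides, or an empty dict for a native build."""
    if target_arch == build_arch:
        return {}  # native build: leave the default compiler in place
    triple, gcc = CROSS_TOOLCHAINS[target_arch]
    return {f"CC_{triple}": gcc}


print(cross_cc_env("amd64", "amd64"))  # {}
print(cross_cc_env("amd64", "arm64"))
```

With this shape, a native amd64 build would no longer depend on the cross GCC packages listed under cause 4.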

Release status

Until this is resolved, v2.1.8+rs.1 is effectively arm64-only. Consider one of:

  • Hold the release as draft until amd64 is fixed.
  • Document an arm64-only first stable in the release notes and re-release as v2.1.8+rs.2 once amd64 builds.
  • Build and push the amd64 image manually after the fact, then publish the multi-arch manifest for the existing tag.

Filed while attempting v2.1.8+rs.1. Not specific to that tag — symptom predates the tag and predates the k3s-style versioning convention adopted in #51 / #52.

gofix closed this issue 2026-05-03 16:36:11 +00:00