ARC-ADR-020 — Self-Hosted CI Runner Trust & Isolation Policy

Field	Value
ID	ARC-ADR-020
Status	Proposed
Date	2026-05-25
Deciders	Architecture Review (HITL — hub owner decides)
Supersedes	—
Superseded by	—
Tags	ci, security, self-hosted-runners, actions, isolation, supply-chain, aca, docker-local

Context and Problem Statement

To escape the GitHub-hosted Actions minutes cap, the fleet migrated all PR-gating CI across the hub + 3 spokes onto self-hosted runners (ACA/KEDA aca-linux + local docker-local), triggered on pull_request. Those jobs execute repository code (npm ci, cargo build, pip install, npm run build-storybook, etc.).

A Codex review (on frontend-core storybook.yml, 2026-05-25) flagged the general risk, correctly: self-hosted runner + untrusted pull_request code = arbitrary code execution on the runner, which can read whatever credentials and network the runner can reach. This is a well-known GitHub anti-pattern, and here it is fleet-wide (every self-hosted pull_request job, not one workflow).
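For illustration, the flagged pattern looks roughly like the sketch below. It is a hypothetical reduction, not the actual storybook.yml: the job name and the use of `actions/checkout@v4` are assumptions, while `aca-linux` is the runner label named in this ADR.

```yaml
# Hypothetical reduction of a self-hosted PR job; not the real storybook.yml.
on:
  pull_request:

jobs:
  storybook:
    runs-on: aca-linux                 # self-hosted label from this ADR
    steps:
      - uses: actions/checkout@v4      # checks out the PR merge commit, i.e. code the PR author controls
      - run: npm ci                    # dependency install scripts execute on the runner
      - run: npm run build-storybook   # so does whatever the PR's package.json defines for this script
```

Every `run:` step here executes with whatever credentials and network position the runner process has, which is why the origin of the PR matters.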

Verified current exposure (2026-05-25): all four repos are PRIVATE, have zero outside collaborators, 0 forks, and fork-PR workflow access is none. So untrusted parties cannot open PRs → cannot run code on the runners. The risk is not exploitable today.

But the safety lives entirely in repo settings, not the workflow YAML — so it can regress silently. The moment any repo goes public, or gains an untrusted collaborator, every self-hosted PR job becomes a remote-code-execution vector. The blast radius is real: the ACA runners currently hold ACR-admin credentials + Key Vault access.

The decision: what is the trust/isolation policy that keeps self-hosted CI safe and keeps it from silently regressing?

Decision Drivers

#	Driver
D1	Keep CI free (self-hosted, off the metered minutes cap) — the reason for the migration.
D2	No untrusted code on self-hosted runners, ever — untrusted PR code must not execute where it can reach fleet credentials/network.
D3	The safety must be enforceable / hard to regress, not a tribal-knowledge "keep it private" hope.
D4	Least-privilege the runner identity: a compromised job should reach as little as possible (ACR-admin + KV is too much standing power).
D5	Don't break the current fast, free PR feedback for trusted contributors.

Considered Options

1. Private-only invariant + fork-origin guard + fork-PR approval (recommended). Keep all repos private with trusted-only write (the current posture, made explicit as an invariant). Add defense-in-depth so a settings regression can't silently arm the vector: (a) a fork-origin guard on every self-hosted job, `if: github.event.pull_request.head.repo.full_name == github.repository` (a no-op today; permanently prevents fork-origin PRs from running on self-hosted); (b) GitHub "require approval for fork PRs"; and (c) least-privilege the runner identity (drop standing ACR-admin/KV where possible; scope per-repo).
2. Split routing (trusted → self-hosted, fork/untrusted → GitHub-hosted). Gate by PR origin: same-repo branches run self-hosted (free); fork PRs run on `ubuntu-latest`. Safe for public-repo futures, but fork CI then consumes the minutes cap (the thing we're avoiding) and adds per-workflow conditional complexity.
3. Status quo (rely on the private posture alone). Works today; no guardrails. A single visibility flip means instant fleet-wide RCE with no tripwire. Cheapest now, highest latent risk (fails D3).
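As a sketch of the fork-origin guard in Option 1(a), a guarded job could look like the following. The job name, steps, and `actions/checkout@v4` are illustrative assumptions; only the `if:` expression and the `aca-linux` label come from this ADR.

```yaml
# Hypothetical workflow fragment showing the Option 1(a) guard.
on:
  pull_request:

jobs:
  storybook:
    # True only when the PR head branch lives in this repository.
    # For a fork-origin PR the job is skipped and never reaches the runner.
    if: github.event.pull_request.head.repo.full_name == github.repository
    runs-on: aca-linux
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build-storybook
```

One caveat under this sketch: on events other than `pull_request` the `github.event.pull_request` object is absent, so the expression is false and the job is skipped. A workflow that also runs on `push` would need something like `if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository`.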

Decision Outcome

To be decided by the hub owner (Proposed stub — options + recommendation, not a unilateral call).

Recommendation note (not a decision)

The owner has confirmed that the threat model excludes forks and untrusted collaborators (2026-05-25: "we're not building forks"): all work happens on branches within private repos, by the owner and trusted agent apps. That puts the fork-RCE vector out of scope, so the fork-origin guard and fork-PR approval machinery from Option 1 (the option marked recommended above) is not adopted: it would guard a path that does not exist in this model, at a maintenance cost for zero real coverage.

Adopted posture (lightweight):

1. Document the invariant. Self-hosted CI is safe because these repos stay private, with trusted-only write and no forks. That is the security boundary; record it so it isn't silently lost.
2. Tripwire on the invariant, not the PR. If any repo is ever made public (or gains an untrusted collaborator), self-hosted CI must be revisited before that change lands, at which point Option 1 (fork-origin `if:` guard) or Option 2 (route fork PRs to GitHub-hosted) applies. Until then, no per-job guards.
3. Optional defense-in-depth. Least-privilege the runner identity (drop standing ACR-admin/KV where a build doesn't need it). This caps the blast radius of a supply-chain compromise (a malicious dependency in even a trusted PR), the one residual risk that survives a no-forks model. Adopt if cheap; not urgent.
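The tripwire in item 2 is a process rule, but part of it could be backed by an automated check. The sketch below is one hypothetical way to do that, not something this ADR adopts: the file name, schedule, and job layout are assumptions. It only covers visibility and fork count for the repo it runs in, it does not inspect collaborators, and it detects a regression after the fact rather than before the change lands.

```yaml
# Hypothetical file: .github/workflows/trust-invariant.yml
name: trust-invariant
on:
  schedule:
    - cron: "0 6 * * *"      # daily; the cadence is an arbitrary choice
  workflow_dispatch:

permissions:
  contents: read

jobs:
  check:
    # GitHub-hosted so the gh CLI is known to be present; costs a few seconds
    # of metered time per run. Swap in a self-hosted label if gh is installed there.
    runs-on: ubuntu-latest
    steps:
      - name: Fail if the repo is public or has forks
        env:
          GH_TOKEN: ${{ github.token }}
          REPO: ${{ github.repository }}
        run: |
          private=$(gh api "repos/$REPO" --jq '.private')
          forks=$(gh api "repos/$REPO" --jq '.forks_count')
          echo "private=$private forks=$forks"
          if [ "$private" != "true" ] || [ "$forks" != "0" ]; then
            echo "::error::Trust invariant broken: revisit self-hosted CI (ARC-ADR-020)"
            exit 1
          fi
```

A failing run here would be the signal to apply Option 1 or Option 2 before any further self-hosted PR jobs execute.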

Options 1 and 2 stay documented for a possible public-repo future, but under the current no-forks model the fork machinery isn't warranted.

Affected Layers / Repos

Layer	Repo	Impact
all	hub + frontend-core + backend-core + middle-core	if Option 1 is adopted: fork-origin `if:` guard on every self-hosted `pull_request` job; fork-PR approval setting
(infra)	hub templates (`aca-github-runner`, `local-docker-runner`)	least-privilege the runner identity; document the trust invariant in the template README

Pros and Cons of the Options

Option 1 — Private invariant + fork-origin guard + approval (recommended)

Pros: keeps CI free (D1); makes safety enforceable, not implicit (D3); the `if:` guard is a zero-cost permanent tripwire; least-privilege shrinks blast radius (D4).
Cons: a guard line to maintain on each self-hosted job; fork PRs (if ever enabled) get no CI until explicitly routed.

Option 2 — Split routing (fork → GitHub-hosted)

Pros: safe even for public repos; trusted PRs stay free.
Cons: fork CI consumes the minutes cap (re-introduces the cost problem); more conditional complexity per workflow.
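To make the "conditional complexity" concrete, split routing could be sketched with an expression in `runs-on`, as below. This is an assumption about how Option 2 might be written, not an agreed design; the job name and steps are hypothetical, and `aca-linux` is the self-hosted label from this ADR.

```yaml
# Hypothetical workflow fragment for Option 2 (split routing by PR origin).
on:
  pull_request:

jobs:
  build:
    # Same-repo branches go to the self-hosted label (free);
    # fork-origin PRs fall back to a GitHub-hosted runner (metered).
    runs-on: ${{ github.event.pull_request.head.repo.full_name == github.repository && 'aca-linux' || 'ubuntu-latest' }}
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build-storybook
```

The `cond && a || b` form is the usual way to express a ternary in workflow expressions; it behaves as intended here because the label string `'aca-linux'` is truthy. Each self-hosted job would need its own copy of this expression, which is the per-workflow cost noted above.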

Option 3 — Status quo (private-only, no guardrails)

Pros: nothing to do now.
Cons: safety is invisible and one settings flip away from fleet-wide RCE; fails D3.

Related Decisions / Links

ARC-ADR-011 (runtime secret resolution): the runner identity's creds (least-privilege per D4) resolve via that scheme.
ARC-ADR-015 (backlog; deployment & promotion): where/how the runners are deployed and their identity scoped.
ARC-ADR-017 (backlog; connector egress/SSRF): sibling "untrusted input reaching privileged execution" concern, on the data plane rather than CI.
Origin: Codex PR review on frontend-core storybook.yml (2026-05-25); the self-hosted runner migration (hub #180/#183 + spoke equivalents).

Revision History

Version	Date	Author	Change
0.1	2026-05-25	security review (Codex finding)	Initial Proposed draft — options + recommendation; HITL decision pending