ADR 0005: Full cross-resource lookups in v1
Status. Accepted, implemented in v1.
Context
A privileged-container check is simple — container.securityContext.privileged == true. Real security policies often need cross-resource reasoning: "every Deployment in production must have at least one PodDisruptionBudget selecting it", "no Service of type LoadBalancer may exist in a namespace marked internal", "every Pod must be selected by a default-deny NetworkPolicy".
These rules are impossible to express against object.* alone — they require querying the rest of the cluster.
The early PLAN draft scoped cross-resource lookups to v2 (informer-based, but layered on top of the v1 admission/audit pipeline). The final PLAN promotes them to v1.
Decision
Cross-resource lookups are in v1. cluster.<gvk>.byName(ns, name) and cluster.<gvk>.list(ns, selector) are first-class expression-language bindings, backed by internal/audit's shared informer caches. A reverse-dependency index re-evaluates dependent rules when referenced resources change.
The implementation:
- internal/lookup/cluster.go: the helper bindings.
- internal/lookup/depindex.go: the reverse-dependency index (bounded LRU, default 500 k entries).
- internal/lookup/virtual.go: admission-time virtual cluster view (inbound object materialised into the read path so admission rules see the soon-to-be cluster state).
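As a rough illustration of how the helper bindings and the virtual view could fit together, the sketch below overlays one inbound object on a map-based cache. It is not the code in internal/lookup: the names Key, Object, Cache, and View are hypothetical, the cache is a plain map standing in for an informer cache, and the selector is reduced to exact label matching.

```go
package main

import "fmt"

// Key identifies an object within one GVK (hypothetical type).
type Key struct{ Namespace, Name string }

// Object is a minimal stand-in for a cached Kubernetes object.
type Object struct {
	Key
	Labels map[string]string
}

// Cache stands in for one GVK's informer cache (assumption: a plain map).
type Cache map[Key]Object

// View overlays the inbound admission object on the cache, so reads see
// the state the cluster would have if the request were admitted.
type View struct {
	cache   Cache
	inbound *Object // nil when there is no admission request
}

// ByName returns the inbound object if it has this key, else the cached one.
func (v View) ByName(ns, name string) (Object, bool) {
	k := Key{ns, name}
	if v.inbound != nil && v.inbound.Key == k {
		return *v.inbound, true
	}
	o, ok := v.cache[k]
	return o, ok
}

// List returns objects in ns whose labels contain every selector pair.
// The inbound object shadows any cached copy with the same key.
func (v View) List(ns string, selector map[string]string) []Object {
	var out []Object
	consider := func(o Object) {
		if o.Namespace != ns {
			return
		}
		for k, want := range selector {
			if got, ok := o.Labels[k]; !ok || got != want {
				return
			}
		}
		out = append(out, o)
	}
	for k, o := range v.cache {
		if v.inbound != nil && v.inbound.Key == k {
			continue // stale cached copy of the object being updated
		}
		consider(o)
	}
	if v.inbound != nil {
		consider(*v.inbound)
	}
	return out
}

func main() {
	cache := Cache{
		{"prod", "web"}: {Key: Key{"prod", "web"}, Labels: map[string]string{"app": "web", "rev": "1"}},
	}
	// An UPDATE in flight: the cache still holds rev 1.
	inbound := Object{Key: Key{"prod", "web"}, Labels: map[string]string{"app": "web", "rev": "2"}}
	v := View{cache: cache, inbound: &inbound}

	o, _ := v.ByName("prod", "web")
	fmt.Println(o.Labels["rev"])                                    // prints 2
	fmt.Println(len(v.List("prod", map[string]string{"app": "web"}))) // prints 1
}
```

Keeping the overlay per request means the shared cache is never mutated by admission traffic; a View without an inbound object behaves like a plain cache read.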
Rationale
- It's the single biggest expressiveness win for security rules. Without it, Portal is "another podwatcher-poc with admission". With it, Portal covers PDB-coverage, allowed-registries-via-namespace-allowlist, dependent-resource checks — the things real policy teams want.
- The plumbing is already there. v1 ships informers for audit. Exposing the caches behind a helper API is incremental, not a new subsystem. Deferring it to v2 would mean shipping v1 with informer caches that already exist but go unused for lookups.
- Per-rule engine selection (v3) needs a stable env shape. If cluster.* lands in v2, every rule that uses it has to be re-validated when v3 changes the env. Landing it in v1 freezes the shape early.
Cost we accept
- Reverse-dependency index complexity. When rule R evaluates object O and reads X, we record (referenced=X) → depends=(R, O). On informer events for X, every (R, O) dependent is re-enqueued for evaluation. This:
  - Adds a bounded LRU sized to 500 k entries by default. Per internal/lookup/depindex.go, capacity is configurable.
  - Bounds the cluster-wide blast radius of a single resource change. A Secret update touching 10 k dependents would otherwise pile-drive the work queue.
- Cycle protection. A rule that re-derives its own dependency on every eval would loop forever. The mitigation: per (rule, object) pair, allow at most N (default 3) re-evals in a sliding W-second (default 10s) window. Excess increments portal_lookup_cycle_suppressed_total and is captured in the audit log. Correctness preserved by the 10-minute resync safety net.
- Admission-time consistency. The inbound object isn't in the informer cache yet (CREATE), or is stale (UPDATE). internal/lookup/virtual.go provides a per-request overlay that materialises the inbound object before reads. For rules that need stricter semantics (e.g. cluster-wide uniqueness), cluster.consistent.<gvk>.byName(...) bypasses the cache and does a direct API call (one round-trip added; opt-in).
- RBAC widening. The informer needs get, list, watch on every GVK referenced by cluster.*. The chart computes this from the watchedGvks Helm value plus a startup pass over the loaded rule corpus.
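The first two costs above can be sketched together: recording (referenced=X) → (R, O) edges, then gating re-enqueues with a sliding-window limit. This is a minimal sketch under stated assumptions, not the contents of internal/lookup/depindex.go. Ref, Dep, DepIndex, Limiter, and the at helper are hypothetical names, identifiers are plain strings, LRU eviction and the resync safety net are omitted, and the Suppressed field merely stands in for portal_lookup_cycle_suppressed_total.

```go
package main

import (
	"fmt"
	"time"
)

// Ref names a referenced resource X; Dep names a (rule, object) pair.
// Both are hypothetical string-based stand-ins for real identifiers.
type Ref string
type Dep struct{ Rule, Object string }

// DepIndex maps referenced=X to the set of (R, O) pairs that read it.
// Assumption: unbounded map; the LRU bound is left out of this sketch.
type DepIndex struct{ deps map[Ref]map[Dep]struct{} }

func NewDepIndex() *DepIndex { return &DepIndex{deps: map[Ref]map[Dep]struct{}{}} }

// Record notes that evaluating d read x. Recording twice is a no-op.
func (i *DepIndex) Record(x Ref, d Dep) {
	if i.deps[x] == nil {
		i.deps[x] = map[Dep]struct{}{}
	}
	i.deps[x][d] = struct{}{}
}

// Dependents lists every (R, O) pair to re-enqueue after an event on x.
func (i *DepIndex) Dependents(x Ref) []Dep {
	out := make([]Dep, 0, len(i.deps[x]))
	for d := range i.deps[x] {
		out = append(out, d)
	}
	return out
}

// Limiter allows at most limit re-evals per Dep within a sliding window.
type Limiter struct {
	limit      int
	window     time.Duration
	recent     map[Dep][]time.Time
	Suppressed int // stand-in for a suppression counter metric
}

func NewLimiter(limit int, window time.Duration) *Limiter {
	return &Limiter{limit: limit, window: window, recent: map[Dep][]time.Time{}}
}

// DefaultLimiter uses the defaults stated above: N=3 re-evals per 10s.
func DefaultLimiter() *Limiter { return NewLimiter(3, 10*time.Second) }

// Allow reports whether d may be re-evaluated at time now.
// Suppressed attempts are counted but do not extend the window.
func (l *Limiter) Allow(d Dep, now time.Time) bool {
	kept := l.recent[d][:0]
	for _, t := range l.recent[d] {
		if now.Sub(t) < l.window { // drop timestamps outside the window
			kept = append(kept, t)
		}
	}
	if len(kept) >= l.limit {
		l.recent[d] = kept
		l.Suppressed++
		return false
	}
	l.recent[d] = append(kept, now)
	return true
}

// at returns a fixed timestamp sec seconds after the Unix epoch, so the
// example is deterministic.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

func main() {
	idx := NewDepIndex()
	idx.Record("PodDisruptionBudget/prod/web-pdb", Dep{Rule: "pdb-coverage", Object: "prod/web"})
	lim := DefaultLimiter()
	// Five informer events for the same PDB, one second apart.
	for s := 0; s < 5; s++ {
		for _, d := range idx.Dependents("PodDisruptionBudget/prod/web-pdb") {
			fmt.Println(s, lim.Allow(d, at(s))) // true for s=0..2, then false
		}
	}
	fmt.Println("suppressed:", lim.Suppressed) // suppressed: 2
}
```

Passing the timestamp into Allow rather than reading the clock inside it is a choice made for this sketch only; it keeps the window arithmetic easy to test.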
Alternative considered: defer to v2
- Pro. v1 ships sooner; the dep-index + cycle protection + virtual view are all non-trivial.
- Con. Every cross-resource policy a user wants in v1 has to be implemented externally (a different tool, or a one-off controller). The wedge against Kyverno weakens — Kyverno has cross-resource lookups today.
- Decision. The complexity is one-time engineering; the user-visible payoff is permanent. Land it now.
Consequences
- The interface seam (api.Lookup, api.DepRecorder in internal/api/lookup.go) is part of v1's public surface. Refactoring it requires CR migration. We took care to land it once.
- Tests under internal/lookup/ cover the dep-index, cycle protection, and the virtual view. The latency budget for cross-resource lookups is "cache read = nanoseconds; consistent path = single API round-trip". See docs/concepts/cross-resource.md (parallel author) for the user-facing semantics.
- The cluster.consistent.<gvk>.* opt-in is documented as a power tool: use only when correctness demands it.