| 1 | +# Terraform & Bicep Settings Lifecycle Refresh |
| 2 | + |
| 3 | +- **Author**: Yetkin Timocin (@ytimocin) |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +We will externalize Radius Terraform and Bicep recipe configuration into dedicated settings resources, centralize Terraform binary lifecycle management, and let platform teams supply Terraform settings exactly as they do today. This keeps Radius orchestration intact while removing opinionated guardrails that block mature Terraform estates. The work is anchored to the feature spec [`2025-08-14-terraform-bicep-settings.md`](../features/2025-08-14-terraform-bicep-settings.md). |
| 8 | + |
| 9 | +## Terms and Definitions |
| 10 | + |
| 11 | +- **TerraformSettings**: New `Radius.Core/terraformSettings` resource encapsulating `.terraformrc`, backend, environment, and logging settings. Migrates everything currently in `recipeConfig.terraform` (provider mirrors/credentials, backend blocks, `env` variables, trace logging flags) while shifting env secret injection to recipe parameters. |
| 12 | +- **BicepSettings**: New `Radius.Core/bicepSettings` resource describing registry authentication. Carries forward the existing `recipeConfig.bicep.authentication` map (registry host → secret ID); no additional Bicep settings exist today. |
| 13 | +- **Installer Async Handler**: Implementation registered with the existing async worker service (`pkg/server/asyncworker.go`) that consumes install/uninstall queue messages (`pkg/components/queue`) and manages Terraform binaries. |
| 14 | + |
| 15 | +## Objectives |
| 16 | + |
| 17 | +> **Issue Reference:** <https://github.com/radius-project/radius/issues/10615> |
| 18 | +
|
| 19 | +### Goals |
| 20 | + |
| 21 | +**Terraform** |
| 22 | + |
| 23 | +- Add `Radius.Core/terraformSettings` resources and wire `Radius.Core/environments` (the new environment type) to reference reusable configuration. |
| 24 | +- Introduce an installer async pipeline and CLI-driven Terraform binary lifecycle (`rad terraform install|uninstall|status|validate`) with operator control over version, source URL, and checksum. |
| 25 | +- Allow Terraform settings (provider mirrors, credentials, env vars, backend blocks) to flow through unchanged so Radius is unopinionated about Terraform configuration. |
| 26 | +- Support Tier-1 backends (kubernetes, azurerm, s3) with trace logging and secret-safe execution while paving a path for Tier-2 backends as pass-through (for example `oss`, `gcs`, `http`, `oci`, `pg`, `cos`). |
| 27 | + |
| 28 | +**Bicep** |
| 29 | + |
| 30 | +- Add `bicepSettings` resources so registry authentication is reusable outside `recipeConfig`. |
| 31 | +- Preserve support for BasicAuth, Azure workload identity, and AWS IRSA secret injection with no new runtime behaviors. |
| 32 | + |
| 33 | +### Non Goals |
| 34 | + |
| 35 | +- Delivering `terraform plan` functionality (covered by a future spec). |
| 36 | +- Supporting additional Terraform backends beyond Tier-1 in this release. |
| 37 | +- Reworking Bicep execution beyond registry authentication parity. |
| 38 | +- Adding new capabilities for legacy `Applications.Core/environments` (kept as-is; migration targets `Radius.Core/environments`). |
| 39 | +- Changing recipe parameter or SecretStore semantics; env secret injection continues via recipe parameters. |
| 40 | +- Modifying Bicep runtime behavior or the bundled Bicep CLI; only registry authentication moves into `bicepSettings`. |
| 41 | + |
| 42 | +### User Scenarios (optional) |
| 43 | + |
| 44 | +#### User Story 1 - Terraform lifecycle |
| 45 | + |
| 46 | +A platform engineer runs `rad terraform install --version 1.6.4 --wait` to seed the control plane with the organization's pinned Terraform build. The installer async handler downloads from the internal mirror, validates the checksum, writes metadata, and exposes status. A follow-up `rad terraform install --version 1.7.0` automatically queues behind the first job and runs after it completes. The engineer confirms success with `rad terraform status` and runs `rad terraform validate --environment prod-east` (Phase 2) before dispatching recipe executions. Result: the control plane holds a single active Terraform version at any time, and sequential installs guard against race conditions or partial upgrades. |
| 47 | + |
| 48 | +#### User Story 2 - Migrating settings |
| 49 | + |
| 50 | +Another engineer owns an environment that still uses `recipeConfig`. They deploy a new `terraformSettings` resource mirroring their existing `.terraformrc`, backend block, and env vars, plus a `bicepSettings` resource for private registries. After updating the environment to reference the new resources, `rad terraform migrate --environment prod-east --apply` removes the legacy fields. Recipes keep running with no downtime, and the environment controller emits telemetry showing the new settings path is active. Result: legacy configuration disappears from the environment definition while Terraform/Bicep recipes continue operating against the new reusable settings resources. |
| 51 | + |
| 52 | +## User Experience (if applicable) |
| 53 | + |
| 54 | +**Sample Input:** |
| 55 | + |
| 56 | +```bash |
| 57 | +rad terraform install --version 1.6.4 |
| 58 | +rad terraform status |
| 59 | +rad terraform validate --environment my-env # Phase 2 |
| 60 | +``` |
| 61 | + |
| 62 | +**Sample Output:** |
| 63 | + |
| 64 | +```text |
| 65 | +Terraform 1.6.4 install started... |
| 66 | +Terraform 1.6.4 ready (installed 2025-10-10T10:30Z) |
| 67 | +Environment my-env: configuration valid; backend azurerm reachable. |
| 68 | +``` |
| 69 | + |
| 70 | +## Design |
| 71 | + |
| 72 | +### High Level Design |
| 73 | + |
| 74 | +- Environments reference `terraformSettings` / `bicepSettings` resources. |
| 75 | +- Controllers prefer new settings resources and fall back to legacy `recipeConfig` during migration. |
| 76 | +- Migration tooling removes legacy `recipeConfig` once environments point at the new resources; legacy fields exist only for compatibility during rollout. |
| 77 | +- Installer REST endpoint stores install/uninstall requests in a dedicated async queue (`pkg/components/queue/queueprovider`) configured with single-flight semantics. The installer async handler (running inside the existing worker service) consumes jobs sequentially, manages binaries on the shared Terraform storage, and updates status metadata. (The Helm chart will drop the old init-container download path in favour of this queue-driven workflow.) |
| 78 | +- Terraform executor resolves versioned binary paths, renders `.terraformrc`, configures backends, and emits structured logs. |
| 79 | + |
| 80 | +### Architecture Diagram |
| 81 | + |
| 82 | +```mermaid |
| 83 | +flowchart LR |
| 84 | + subgraph Operator Workstation |
| 85 | + CLI[rad CLI] |
| 86 | + end |
| 87 | + subgraph Kubernetes Cluster |
| 88 | + subgraph Radius Control Plane |
| 89 | + API[Installer REST Endpoint] |
| 90 | + QUEUE["Installer Queue\n(QueueMessage CR)"] |
| 91 | + CTRL["Installer Async Handler"] |
| 92 | + ENVCTRL[Environment Controller] |
| 93 | + EXEC[Terraform Executor] |
| 94 | + end |
| 95 | + PVC[(Terraform Binary PVC)] |
| 96 | + SETTINGS[terraformSettings / bicepSettings] |
| 97 | + end |
| 98 | +
|
| 99 | + CLI -->|install/uninstall/status/validate| API |
| 100 | + API -->|enqueue job| QUEUE |
| 101 | + QUEUE -->|lease job| CTRL |
| 102 | + CTRL -->|download & verify| PVC |
| 103 | + CTRL -->|persist metadata| API |
| 104 | + ENVCTRL --> SETTINGS |
| 105 | + ENVCTRL --> EXEC |
| 106 | + EXEC -->|reads binaries| PVC |
| 107 | + SETTINGS --> ENVCTRL |
| 108 | +``` |
| 109 | + |
| 110 | +### Detailed Design |
| 111 | + |
| 112 | +- TypeSpec adds `Radius.Core/terraformSettings@2025-08-01-preview` and `Radius.Core/bicepSettings@2025-08-01-preview` aligned with the feature spec. |
| 113 | +- Controllers read the new settings resource when present, fall back to the legacy environment config during migration, and validate required secrets/backends before Terraform executes; implementation reuses the existing air-gapped download/auth/TLS helpers so we avoid duplicating plumbing. |
| 114 | +- Installer async handler (plugged into the existing worker pipeline in `pkg/server/asyncworker.go`) downloads Terraform binaries, verifies checksums, persists version metadata (requested URL, checksum, install timestamp, health) in installer status storage, and places binaries on the shared PVC mount (for example `/mnt/radius-terraform/<version>/`). Its queue is configured with `MaxOperationConcurrency = 1`, so install/uninstall jobs execute strictly in submission order; uninstall only removes a version once no executions reference it. |
| 115 | +- CLI invokes installer APIs for install/uninstall/status/validate; status reports active versions and health probes. |
| 116 | +- TerraformSettings serializer covers `.terraformrc` (provider mirrors, credentials, env vars) and backend blocks. Backend builders (azurerm, s3, kubernetes) translate settings into Terraform JSON and inject credentials via env vars. Tier-2 backend definitions pass through but ship without managed auth until prioritized. |
| 117 | +- Migration tooling converts legacy `recipeConfig` to settings resources and adds deprecation warnings. |
| 118 | +- Recipe execution resolves the pinned binary path via the stored metadata, preserving multi-tenant isolation so environments can run different Terraform versions without interference. |
| 119 | + |
| 120 | +### API Design (if applicable) |
| 121 | + |
| 122 | +- New ARM resources `Radius.Core/terraformSettings` and `Radius.Core/bicepSettings` (API version `2025-08-01-preview`). |
| 123 | +- Installer REST endpoints: |
| 124 | + - `POST /installer/terraform/install` `{ "version": "1.6.4", "source": {...} }` |
| 125 | + - `POST /installer/terraform/uninstall` `{ "version": "1.5.7" }` |
| 126 | + - `GET /installer/terraform/status` |
| 127 | + - `POST /installer/terraform/validate` `{ "environmentId": "/.../environments/my-env" }` |
| 128 | + |
| 129 | +### CLI Design (if applicable) |
| 130 | + |
| 131 | +- `rad terraform install [--version|--url|--checksum]` _(required by feature spec)_ |
| 132 | +- `rad terraform uninstall [--version]` _(required by feature spec)_ |
| 133 | +- `rad terraform status` _(new; surfaces installer async status persisted by the worker so operators can diagnose installs quickly)_ |
| 134 | +- `rad terraform validate --environment <envId>` _(Phase 2; runs preflight validation of settings/backends to catch errors before recipe execution)_ |
| 135 | +- `rad terraform migrate --environment <envId> [--apply]` _(Phase 2; optional helper for migrating legacy `recipeConfig` to settings resources)_ |
| 136 | + |
| 137 | +**Sync / Async Options** |
| 138 | + |
| 139 | +- `rad terraform install` and `rad terraform uninstall` accept an optional `--wait` flag. By default the command returns immediately after submitting the request (async). When `--wait` is supplied the CLI polls status until the operation succeeds or fails, giving teams flexibility for interactive or automated flows. |
| 140 | + |
| 141 | +**Why Async?** |
| 142 | + |
| 143 | +- Terraform archives can be large; returning immediately prevents CLI timeouts and lets installs continue even if a terminal disconnects. |
| 144 | +- The installer queue runs with `MaxOperationConcurrency = 1`, so repeated installs (for example `rad terraform install --version 1.6.0` followed by `rad terraform install --version 1.7.0`) execute strictly in order without overlapping downloads. |
| 145 | +- Automation pipelines can trigger installs and move on, using `rad terraform status` (or `--wait`) to gate later steps when needed. |
| 146 | + |
| 147 | +### Implementation Details |
| 148 | + |
| 149 | +#### UCP (if applicable) |
| 150 | + |
| 151 | +- Register an installer-specific async handler with the existing worker service (`pkg/server/asyncworker.go`). Installer REST endpoints enqueue jobs using `pkg/components/queue` under a dedicated queue name configured with `MaxOperationConcurrency = 1`, and that worker loop dequeues and executes jobs sequentially—so no new Kubernetes controller is required. |
| 152 | + |
| 153 | +- Ensure Helm charts mount Terraform binary PVC and expose installer endpoints. |
| 154 | + |
| 155 | +#### Core RP (if applicable) |
| 156 | + |
| 157 | +- Update Environments controller to resolve settings and validate secrets/backends. |
| 158 | + |
| 159 | +#### Portable Resources / Recipes RP (if applicable) |
| 160 | + |
| 161 | +- Terraform driver consumes `terraformSettings` data for `.terraformrc`, backend config, env vars, and logging. Secret injection for custom providers continues via recipe parameters referencing Radius Secrets; no sensitive values are persisted. |
| 162 | +- Bicep driver reads `bicepSettings` for registry auth (BasicAuth, Azure workload identity, AWS IRSA) and drops the legacy Secret kind switch; no execution changes beyond reference handling. Azure WI client/tenant IDs and AWS IAM ARN remain plain properties (not Secrets) per the feature spec. |
| 163 | + |
| 164 | +### Error Handling |
| 165 | + |
| 166 | +- If a download, checksum, or `terraform init` step fails, we keep the previous Terraform version and mark the install as failed so operators can retry. |
| 167 | +- When required secrets or backend settings are missing, the environment reconcile stops before any Terraform code runs. |
| 168 | +- CLI commands return clear errors such as “install in progress, retry after status shows Succeeded” so users know what to do next. |
| 169 | + |
| 170 | +## Test Plan |
| 171 | + |
| 172 | +- Unit tests cover the new schemas, installer REST endpoints, queue handler, and CLI flag parsing. |
| 173 | +- Integration tests cover sequential installs, status reporting, rollback behaviour, and the legacy fallback path. |
| 174 | +- Functional pipelines (`functional-test-cloud`, `functional-test-noncloud`, `long-running-azure`, nightly CLI jobs) are updated to run `rad terraform install` before Terraform recipes and to verify that recipes succeed with the new settings resources. |
| 175 | + |
| 176 | +## Security |
| 177 | + |
| 178 | +- Secrets stay in `Radius.Security/secrets`; we only fetch them at runtime and never write the values to disk or logs. |
| 179 | +- Installer downloads use HTTPS, and operators can supply custom CA bundles when needed. |
| 180 | +- Only authenticated callers can hit the installer REST/CLI entry points; no new identities are introduced. |
| 181 | + |
| 182 | +## Compatibility (Optional) |
| 183 | + |
| 184 | +- `Applications.Core/environments` keep working during migration; we emit warnings when legacy `recipeConfig` is still in use. |
| 185 | +- The new CLI is required for installer commands, but older CLIs continue to run legacy recipes until environments migrate. |
| 186 | + |
| 187 | +## Monitoring and Logging |
| 188 | + |
| 189 | +- Metrics track queue depth, install/uninstall duration, success and failure counts, and the active Terraform version. |
| 190 | +- Logs include environment IDs, Terraform versions, and correlation IDs; Terraform stdout/stderr continues to flow through the standard sink. |
| 191 | +- Distributed traces wrap installer requests and Terraform execution so operators can see end-to-end timing. |
| 192 | + |
| 193 | +## Deployment Considerations |
| 194 | + |
| 195 | +- **Helm Upgrades**: the chart mounts the shared Terraform PVC and registers the installer queue (single concurrency). After upgrading, operators run `rad terraform install` to seed the desired version, then deploy `terraformSettings`/`bicepSettings` like any other ARM resource. |
| 196 | +- **GitOps**: commit the new settings resources to your repo and add a bootstrap step (pipeline or operator action) that runs the installer so Flux/Argo has a Terraform binary available. |
| 197 | +- **Air-gapped**: set mirror URLs, checksums, and TLS bundles in `terraformSettings`. The installer honours those values, reusing the existing air-gapped logic for provider mirrors and registries. |
| 198 | + |
| 199 | +## Development Plan |
| 200 | + |
| 201 | +Work is delivered in two phases. We will work from a dedicated feature branch and land each numbered item as its own PR into that branch before merging back to `main` once the plan is complete. |
| 202 | + |
| 203 | +1. **Phase 1 – Core Implementation** |
| 204 | + |
| 205 | + 1. Add `terraformSettings` / `bicepSettings` TypeSpec definitions, regenerate SDKs, update datamodel converters, and plumb through `Radius.Core/environments` validation. Include unit tests for the new schemas and conversions. |
| 206 | + - Reuse the shapes and conversion coverage already prototyped in the air-gapped branch (`radius-air-gapped/typespec/Applications.Core/environments.tsp`, `pkg/corerp/datamodel/recipe_types.go`, and `pkg/corerp/api/v20231001preview/environment_conversion*_test.go`) as the authoritative source for mirror/module registry/version/TLS/auth fields. |
| 207 | + 2. Introduce installer status storage (datamodel + persistence layer) and the installer REST endpoints (install/uninstall/status) including request validation and unit tests. |
| 208 | + 3. Register a dedicated installer queue/worker (`pkg/server/asyncworker.go`, `pkg/components/queue`) with `MaxOperationConcurrency = 1`; implement the async handler that downloads, verifies, and stages Terraform binaries with `current/previous` symlink management. Add integration tests exercising sequential installs and failure rollback. |
| 209 | + 4. Implement binary lifecycle helpers (mirror downloads, checksum validation, PVC layout) and ensure uninstall removes unused versions only when idle. Cover these helpers with unit tests. |
| 210 | + - Lift the validation and mirror logic out of `radius-air-gapped/pkg/recipes/terraform/install.go` and `pkg/recipes/terraform/customsource/*` (custom releases URL, direct archive download, TLS enforcement) so the installer reuses those hardened code paths. |
| 211 | + 5. Update the `rad` CLI with `terraform install|uninstall|status --wait` semantics, polling logic, and CLI unit tests. |
| 212 | + 6. Teach controllers/executors to consume `terraformSettings`/`bicepSettings`, fall back to legacy `recipeConfig`, and emit adoption telemetry. Update regression tests to cover both paths. |
| 213 | + - The air-gapped branch already wires provider mirror auth, env/secret extraction, and registry logging through the Terraform driver (`pkg/recipes/driver/terraform/*.go`) and keeps the Bicep registry auth flow (`pkg/recipes/driver/bicep/bicep.go`, `pkg/rp/util/authclient/*`); adapt that implementation to pull data from the new settings resources instead of `recipeConfig`. |
| 214 | + 7. Add migration scaffolding (warnings + legacy compatibility checks) and integration tests that submit sequential installs, conflicting installs, and migrations. |
| 215 | + - Reuse the secret-tracking helpers added in `radius-air-gapped/pkg/recipes/terraform/types.go` to ensure migrations surface missing secrets when environments move off legacy configuration. |
| 216 | + 8. Update deployment assets (Helm values/ConfigMaps) to provision the installer queue, PVC mounts, default Terraform download settings, and adjust GitHub workflows/functional test pipelines (e.g., `functional-test-cloud`, `functional-test-noncloud`, `long-running-azure`, nightly CLI tests) so they seed Terraform via the installer. |
| 217 | + 9. Extend monitoring: metrics for installer queue latency, install/uninstall duration, failure counts, and structured logging; refresh unit/integration/functional suites where needed and ensure CI passes on the feature branch. |
| 218 | + - Carry forward the Terraform log-level plumbing from `radius-air-gapped` (`pkg/recipes/driver/terraform/terraform.go`, Helm chart `global.terraform.loglevel`) so trace logging is available once installs run through the new pipeline. |
| 219 | + |
| 220 | +2. **Phase 2 – Enhancements & Nice-to-haves** |
| 221 | + 1. Implement `rad terraform validate` (preflight backend/env checks) leveraging installer status metadata. |
| 222 | + - Build atop the existing backend/TLS verification routines in `radius-air-gapped/pkg/recipes/terraform/execute.go` and `pkg/recipes/driver/terraform/registry.go`, which already test connectivity to mirrors and inject CA bundles. |
| 223 | + 2. Provide `rad terraform migrate` tooling to move environments off `recipeConfig`. |
| 224 | + 3. Add richer telemetry/dashboards, documentation polish, and any optional automation once adoption targets are met. |
| 225 | + |
| 226 | +Each task maps to reviewable PRs following the backlog captured in this document, merged into the feature branch before the branch itself merges to `main`. |
| 227 | + |
| 228 | +## Open Questions |
| 229 | + |
| 230 | +## Alternatives Considered |
| 231 | + |
| 232 | +- **Installer async handler vs. other install mechanisms** |
| 233 | + - Keep the initContainer and let it pull from mirrors (rejected: still per-pod download, no central version control, contradicts spec). |
| 234 | + - Bake Terraform into the Application RP image (rejected: single version for all environments, rebuilds required for upgrades, no rollback validation). |
| 235 | + - Run ad-hoc installer Jobs from the CLI (rejected: lacks idempotency/status, prone to race conditions, no persisted metadata). |
| 236 | + - Require external tooling to provision Terraform binaries (rejected: Radius would not meet the spec requirement to install/manage Terraform). |
| 237 | +- Encode settings per Environment only (rejected: no reuse/sharing, hard to manage at scale). |
| 238 | + |
| 239 | +## Design Review Notes |
| 240 | + |
| 241 | +_TBD post review_ |