
Feasibility/design: SSH bridge for AuthBridge (proxy-sidecar mode) — guardrails, authz, audit for agent SSH egress #528

Description

@huang195

Summary

Investigate and design an SSH bridge for AuthBridge: outbound SSH interception, in proxy-sidecar mode only, as a sibling to the recently merged TLS bridge (#522, #523). The goal is to bring SSH egress under the same guardrail, authz, and audit model AuthBridge already applies to HTTP(S), and to close the SSH-shaped hole in the egress story.

Motivation

Agentic workloads emit a meaningful amount of SSH that AuthBridge currently treats as an opaque tunnel:

| Agentic workload | SSH it emits | Guardrail / authz / audit value |
|---|---|---|
| SRE / infra-ops / "fix-my-server" agent | Interactive shell + exec into VMs/bastions | Strongest case: an SSH shell is arbitrary RCE with no other choke point. Command allow/deny, per-host authz (staging vs prod), full command/keystroke audit + replay. PAM/bastion-for-agents. |
| Config-management agent (Ansible/Salt) | Heavy exec + file copy (Ansible transport is SSH) | Restrict hosts/playbooks, block destructive modules, audit every module run. |
| CI/CD & deploy agents | ssh deploy@host '…', rsync/scp | Authz on deploy targets, audit what shipped where. |
| Data-movement / backup agents | SFTP/SCP subsystem | DLP: block exfil of sensitive paths, size caps, file/direction audit. |
| Network-automation agents | exec to routers/switches/firewalls | Command guardrails on high-blast-radius config; per-device authz. |
| Any agent using SSH tunneling (incl. compromised) | direct-tcpip / SOCKS / reverse shell | Egress-completeness (see below). |
| Git agents | git-upload-pack/receive-pack over exec | Repo allowlist, block force-push, secret-scan on push. |

Two arguments are platform-level, not workload-specific:

  1. Egress-completeness / closing the bypass. AuthBridge's egress controls (TLS bridge, HTTP_PROXY, iptables L7 filtering) have a hole: a sophisticated or compromised agent opens one SSH connection and tunnels everything over direct-tcpip, defeating the HTTP-layer controls. Today port 22 is blind-tunneled, so the hole is open. (Relates to threat-model gap C5 — blind-tunnel.)
  2. Credential substitution = the SSH analog of OBO. SSH authenticates with keys at connection setup, not per-request headers, so there is no Authorization header to inject. The value-add instead is: the bridge mints a short-lived SSH cert scoped to allowed principals/hosts per session, so the agent never holds a standing SSH key — the same "agents never hold downstream creds" property OBO gives HTTP.

Caveat (don't oversize the surface): much "remote access" is actually HTTPS, not SSH, and already falls under the TLS bridge — kubectl exec (SPDY/websocket over HTTPS), EC2 Instance Connect / GCP OS Login (API-mediated), cloud-shell. The genuinely-SSH surface is VM/bastion shell, Ansible, rsync/SFTP, network gear, and tunneling.

Feasibility (vs the TLS bridge)

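Honest assessment: an SSH bridge is meaningfully more work than the TLS bridge, because the two things that made the TLS bridge cheap (reusing the HTTP pipeline after termination, and injecting a CA into the agent's trust store) do not transfer. SSH also multiplexes several channel types over one connection, and each can need a different guardrail: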

  • session + exec → one-shot remote command (git, ansible, deploy)
  • session + shell/pty-req → interactive shell
  • session + subsystem → SFTP/SCP
  • direct-tcpip / forwarded-tcpip → port forwarding / tunneling
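
The mapping above can be sketched as a small classifier over RFC 4254 channel and request type strings. This is a hypothetical illustration, not AuthBridge code: `guardrail`, `classify`, and the category names are invented here. One adjustment is deliberate: `pty-req` (like `env`) only prepares a session and can also precede `exec` (for example `ssh -t host cmd`), so the sketch keys on the request that actually starts the session.

```go
package main

// guardrail names the policy family that applies to an SSH channel.
// The names are illustrative, not AuthBridge identifiers.
type guardrail string

const (
	guardExec      guardrail = "exec"       // one-shot remote command
	guardShell     guardrail = "shell"      // interactive shell
	guardSubsystem guardrail = "subsystem"  // e.g. sftp
	guardForward   guardrail = "forwarding" // port forwarding / tunneling
	guardUnknown   guardrail = "unknown"    // anything else: deny by default
)

// classify maps a channel type and the ordered channel requests seen on it
// to a guardrail family.
func classify(channelType string, requests []string) guardrail {
	switch channelType {
	case "direct-tcpip", "forwarded-tcpip":
		return guardForward
	case "session":
		for _, r := range requests {
			switch r {
			case "exec":
				return guardExec
			case "shell":
				return guardShell
			case "subsystem":
				return guardSubsystem
			}
			// pty-req, env, etc. only set up the session; keep scanning.
		}
	}
	// x11, agent forwarding, vendor extensions, or a session that never
	// started: fall through to the default-deny bucket.
	return guardUnknown
}
```

Remote forwarding is requested with the `tcpip-forward` global request (RFC 4254 section 7.1) before any `forwarded-tcpip` channel appears, so a deny-forwarding policy would also need to refuse that global request.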

Reusable (cheap):

  • Connection capture — in enforce-redirect mode the iptables catch-all (proxy-init/init-iptables.sh:336) already redirects all TCP incl. port 22 into HandleTransparentConn; SSH is captured today and just blind-tunneled. (The legacy redirect/envoy chain explicitly RETURNs :22 — another reason to scope this to proxy-sidecar mode.)
  • Config/cmd plumbing — mirror the TLS-bridge pattern: an SSHBridge *SSHBridgeConfig pointer + Validate() + ~40 lines of cmd glue + one fpSrv.X = engine assignment, feature-flagged off by default.
  • Decision/skip machinery — port-classify + runtime skip-set + fall-open is transport-agnostic; swap looksLikeTLSRecord for an SSH-2.0- banner sniff.
  • Operator CRD mode field — cheap, reuses the existing AuthBridgeMode/MTLSMode slot.
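
The banner-sniff swap in the decision machinery might look like the following minimal sketch. `looksLikeSSHBanner`, `decide`, and `action` are hypothetical names chosen to echo `looksLikeTLSRecord`; accepting the `SSH-1.99-` compatibility form is an assumption beyond the issue's `SSH-2.0-`. RFC 4253 section 4.2 has both sides send their identification string once the connection is established, so a client-first peek normally sees the banner, though the capture path should still bound the peek with a read deadline and fall open.

```go
package main

import "bytes"

// action says what the capture path should do with a connection.
type action int

const (
	actTunnel action = iota // fall open: blind-tunnel, as today
	actBridge               // hand off to the (hypothetical) SSH bridge engine
)

// looksLikeSSHBanner reports whether the client's first bytes look like an
// SSH identification string (RFC 4253 section 4.2).
func looksLikeSSHBanner(peek []byte) bool {
	return bytes.HasPrefix(peek, []byte("SSH-2.0-")) ||
		bytes.HasPrefix(peek, []byte("SSH-1.99-"))
}

// decide mirrors the shape of the TLS bridge's decision step: skip-listed
// destinations and anything that is not an SSH banner fall open to the
// existing blind tunnel. dst is "host:port"; skip is the runtime skip-set.
func decide(dst string, peek []byte, skip map[string]bool) action {
	if skip[dst] {
		return actTunnel
	}
	if looksLikeSSHBanner(peek) {
		return actBridge
	}
	return actTunnel
}
```

Sniffing on content rather than port also catches SSH on non-standard ports, which matters for the tunneling bypass; a port check would then only be a hint.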

Does NOT carry over (the real cost):

  1. No pipeline reuse, which was the TLS bridge's biggest free lunch. After TLS termination the stream is HTTP, so the existing outbound pipeline runs unchanged (OBO/Authorization injection, redaction, reject). SSH is not HTTP: terminator.go, serve.go, and the HTTP handler would be replaced by a from-scratch SSH channel-relay engine (~200–400 LOC, protocol-nuanced).
  2. Trust model mismatch — TLS injects a forged CA into the agent trust store (NODE_EXTRA_CA_CERTS/SSL_CERT_FILE). SSH pins per-host keys in known_hosts (TOFU). Interception requires an OpenSSH-CA @cert-authority entry (needs SSH-cert host keys, not the x509 minter.go output) or disabling host-key checking (defeats the purpose). tlsbridge/ca.go does not map.
  3. Auth-substitution design — agent→bridge vs bridge→upstream are different credentials; new design with no TLS-side analog, plus a new per-agent credential reconciler + secret shape on the operator side.
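
The trust-injection side of item 2 can be made concrete with a minimal sketch, assuming the operator holds an OpenSSH host CA whose public key is in authorized_keys form (`<type> <base64> [comment]`). `certAuthorityLine` is a hypothetical helper, and the patterns and key in any call are placeholders. The output follows the known_hosts layout documented in sshd(8): the `@cert-authority` marker, comma-separated host patterns, key type, base64 key, optional comment.

```go
package main

import (
	"errors"
	"strings"
)

// certAuthorityLine renders a known_hosts entry telling OpenSSH clients to
// accept host certificates signed by caPubKey for the given host patterns.
// Hypothetical helper; not an AuthBridge or x/crypto API.
func certAuthorityLine(hostPatterns []string, caPubKey string) (string, error) {
	if len(hostPatterns) == 0 {
		return "", errors.New("need at least one host pattern")
	}
	for _, p := range hostPatterns {
		// Patterns are comma-separated and fields are whitespace-delimited,
		// so neither may appear inside a single pattern.
		if p == "" || strings.ContainsAny(p, " \t,") {
			return "", errors.New("invalid host pattern: " + p)
		}
	}
	key := strings.Fields(caPubKey)
	if len(key) < 2 {
		return "", errors.New("CA key must be \"<type> <base64> [comment]\"")
	}
	return "@cert-authority " + strings.Join(hostPatterns, ",") + " " + strings.Join(key, " "), nil
}
```

This only covers the agent side. The bridge would then have to present each intercepted upstream as an OpenSSH host certificate signed by that CA, with the hostname the agent dialed among the certificate's principals; that is the piece the x509 output of minter.go and tlsbridge/ca.go does not provide.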

Sizing: the same ~60 LOC of config/cmd and free capture, but a from-scratch engine, harder trust injection, and a new auth-substitution design ⇒ roughly 2–3× the authbridge-side work of the TLS bridge, plus a harder operator/trust story. The crypto primitives exist (golang.org/x/crypto/ssh: NewServerConn/Dial), so it is feasible, but it is not a "similar effort."

Proposed Phase 1 scope

Anchor on the highest-value, lowest-ambiguity slice:

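  1. SSH channel-type policy (no exec parsing or credential substitution required yet): on captured :22 connections, allow exec/shell to approved hosts and hard-block direct-tcpip port-forwarding, closing the egress bypass cheaply. Channel-open requests are only sent after the SSH key exchange, inside the encrypted transport, so even this step needs the bridge to terminate SSH toward the agent.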
  2. exec-command policy + audit for the ops/Ansible/deploy agents (command allow/deny + structured audit log).
  3. Short-lived SSH-cert credential substitution (the OBO analog) — agent never holds a standing key.

Git push/pull guardrails are real but the weakest standalone justification — defer.

Open questions

  • Termination vs metadata-only: how much can be done by inspecting channel-open/exec requests without fully terminating + re-originating?
  • Host-key trust injection: OpenSSH-CA @cert-authority into agent known_hosts — which agent runtimes/images can we support, and how does the operator provision it?
  • Bridge→upstream credential store + per-agent reconciler shape.
  • Interaction with SPIRE/SPIFFE identity for minting the agent's SSH cert.

Assisted-By: Claude Code

Metadata

Labels

Identity (Issues related to agent identity attestation, SPIFFE, and Authorization bearer tokens), enhancement (New feature or request)


Projects

Status
New/ToDo
