rfc-0008: shared SDK rust core and language specific bindings#1764

Open
maxdubrinsky wants to merge 1 commit into main from md/rfc-0005-shared-sdk-core

@maxdubrinsky maxdubrinsky commented Jun 4, 2026

Collaborator

Summary

Lands RFC 0005 — Shared Rust SDK core and TypeScript binding as a standalone document. This is the design split out of the larger implementation PR #1617 so the direction can be reviewed and merged on its own; the openshell-sdk crate and the @openshell/sdk Node binding will follow as their own focused PRs.

RFC 0005 proposes extracting an openshell-sdk Rust crate from the gRPC client plumbing that currently lives in openshell-cli (transport, TLS, OIDC refresh, edge tunnel), refactoring the CLI to consume it, and shipping a TypeScript SDK (@openshell/sdk) as a napi-rs wrapper over the same core — so the CLI, the TS SDK, and future bindings share one transport, auth, and error implementation.

The document covers:

  • Motivation and what exists today (a hand-written Python gRPC client plus the CLI's production transport stack).
  • Proposed crate layout and the openshell-sdk / openshell-sdk-node API surfaces.
  • The five transport/auth modes the MVP must preserve: plaintext, mTLS, OIDC bearer, Cloudflare Access tunnel, and insecure TLS.
  • Key design decisions: napi-rs v3, async-only API, thiserror → napi error mapping with a discriminable code, single-flight OIDC refresh in the core, and a raw escape hatch for uncovered RPCs.
  • A dependency-ordered three-phase implementation plan, plus risks, alternatives, prior art, and open questions.

Non-goals: replacing the Python SDK, gRPC contract changes, browser/WASM support, and bundling the CLI binary inside the npm package.

Related Issue

Split out of #1617, which will be closed in favor of three focused PRs: this RFC → openshell-sdk (crate extraction + CLI refactor) → openshell-sdk-node (napi binding, Pi example, CI).

Changes

  • Add rfc/0005-shared-sdk-core-and-ts-binding/README.md. Documentation only — no code, schema, or behavior changes.

Testing

  • mise run pre-commit passes (lint, format, license headers, full Rust test suite — 767 passed)
  • Unit tests added/updated — N/A, documentation only
  • E2E tests added/updated (if applicable) — N/A, documentation only

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) — N/A, this PR is the design doc

@drew drew left a comment

Collaborator

Looks like a good direction to me.


## Proposal

### New and changed crates

Collaborator

We should be able to refactor a lot of the TUI to reuse this crate as well. cc @johntmyers


## Open questions

- **Retry policy shape.** Builder on `ClientConfig` (declarative) or `tower::Layer` (composable)? Composable is more flexible; declarative is friendlier for napi/PyO3 consumers who can't construct a `Layer`.

Collaborator

What's this?

Collaborator Author

I think the concern was that if we rely on tower for our retry logic, the Python/TS layer can't configure it; changing retry values would mean recompiling the Rust SDK. This was flagged by an agent and feels a bit overblown.

## Open questions

- **Retry policy shape.** Builder on `ClientConfig` (declarative) or `tower::Layer` (composable)? Composable is more flexible; declarative is friendlier for napi/PyO3 consumers who can't construct a `Layer`.
- **Should `OpenShellClient::from_gateway_name(name)` exist in `openshell-sdk` at all,** or only in a CLI-config helper crate? Tradeoff between ergonomics and keeping `openshell-sdk` filesystem-free.

Collaborator

Can openshell-sdk be used as the Rust SDK? If so, I think it makes sense to have client factories for this.

Collaborator Author

I think so, as long as we're careful to decouple the actual client from the factory that can read from config files. I think this came from me pushing the POC agent to develop independently of the config structure the CLI relies on.

- Wire the OIDC refresh callback path between Rust and JS.
- Map SDK errors to JS errors with a discriminable `code` field.
- Resolve the tunnel-vs-refresh interaction with one targeted test (does the CF tunnel re-handshake on bearer rotation, swap headers in place, or tear down and rebuild?).
- Smoke test against a plaintext local gateway.

Collaborator

Would be good to build some sort of test suite around this. e2e/typescript?

Collaborator Author

That'd be the goal. Currently the tests in #1617 don't cover plain auth (just OIDC), but it would make sense to cover everything.

@drew drew changed the title from "docs(rfc): RFC 0005 — shared SDK core and TypeScript binding" to "rfc-0008: shared SDK rust core and language specific bindings" Jun 16, 2026
drew commented Jun 16, 2026

Collaborator

@maxdubrinsky rename to rfc-0008

@drew drew moved this from Todo to In progress in OpenShell Roadmap Jun 16, 2026
@pimlock pimlock mentioned this pull request Jun 24, 2026
8 tasks
Captures the design behind extracting the shared client core out of
openshell-cli into a standalone openshell-sdk crate, plus the napi-rs
TypeScript binding (openshell-sdk-node, published as @openshell/sdk).

Covers motivation (CLI/TUI/embedders sharing one transport, OIDC, and
edge-tunnel implementation), surface area, error model, and the path
for future language bindings.

Signed-off-by: Max Dubrinsky <mdubrinsky@nvidia.com>
@maxdubrinsky maxdubrinsky force-pushed the md/rfc-0005-shared-sdk-core branch from 6df4e19 to bb5f5c6 Compare June 26, 2026 16:42
copy-pr-bot Bot commented Jun 26, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


## Non-goals

- **Replacing the pure-Python SDK.** That migration is a separate, larger decision (API parity, deprecation window, packaging). This RFC keeps Python on its current pure-Python stack and only ensures the shared core is shaped so a future PyO3 wrapper is feasible.

Collaborator

@maxdubrinsky Is this still the plan? I.e. to leave Python SDK as is?

One item I have on my list to tackle is to remove the CLI from Python SDK and depending on how the new SDK work goes, I could do this right away and keep Python SDK as pure-Python, which greatly simplifies the build process.

If we are going to use Rust as a base for Python SDK in the future, it could be better to keep the build process (it took a while to get it to a reasonably okay state).

WDYT?

Collaborator

+1 we definitely want to replace the python sdk with this rust core.

rhuss commented Jul 1, 2026


Nice RFC, @maxdubrinsky. Extracting the transport/auth code from the CLI into its own crate is a good idea regardless of where the FFI discussion lands.

I've been looking at this from the perspective of a potential Go SDK and wanted to raise a question: is ~1,230 lines of transport plumbing enough shared logic to justify the FFI layer?

What would actually be shared

I (and my friendly French AI agent :) went through the code that would move into openshell-sdk:

| Module | Lines | What it does |
|---|---|---|
| tls.rs | 451 | TLS config, channel construction, cert loading |
| oidc_auth.rs | 534 | OIDC browser flow, token refresh |
| edge_tunnel.rs | 245 | Cloudflare Access tunnel proxy |
| Total | ~1,230 | Transport/auth plumbing |

The SDK methods proposed for the MVP (sandbox CRUD, health, exec, wait) are thin wrappers over the proto API. The Python SDK's wait_ready is a 10-line poll loop. Create/Get/Delete are 5-line proto call + type conversion each. The whole Python SDK including transport setup is 1,382 lines. This isn't algorithmic complexity where a single implementation prevents bugs. It's configuration assembly, and every language's gRPC stack already handles it natively.
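To make the "poll loop" point concrete, here is a minimal Go sketch of a wait-until-ready helper. It is an illustration only, not the Python SDK's `wait_ready` and not code from any OpenShell SDK; `waitReady`, `readyFunc`, `readyAfter`, and `neverReady` are hypothetical names, and the readiness check is a stub standing in for a real "get sandbox and inspect its state" call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// readyFunc reports whether the resource is ready. Hypothetical stand-in for
// a real RPC such as "get sandbox, then check its phase".
type readyFunc func(ctx context.Context) (bool, error)

// waitReady polls ready at the given interval until it returns true, returns
// an error, or the context is cancelled or times out.
func waitReady(ctx context.Context, interval time.Duration, ready readyFunc) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := ready(ctx)
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// readyAfter simulates a resource that becomes ready on the n-th poll and
// returns how many polls were made.
func readyAfter(n int) (polls int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err = waitReady(ctx, time.Millisecond, func(context.Context) (bool, error) {
		polls++
		return polls >= n, nil
	})
	return polls, err
}

// neverReady exercises the deadline path and reports whether waitReady gave
// up with context.DeadlineExceeded.
func neverReady() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	err := waitReady(ctx, time.Millisecond, func(context.Context) (bool, error) {
		return false, nil
	})
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	polls, err := readyAfter(3)
	fmt.Println(polls, err, neverReady())
}
```

The loop itself is about a dozen lines; the rest is scaffolding to exercise it without a gateway.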

The RFC mentions OIDC single-flight refresh coalescing and the Cloudflare Access tunnel as areas of non-trivial logic. But single-flight token refresh is a well-known pattern that takes ~50 lines in any language (Go's singleflight package, Python's asyncio.Lock, Node's promise caching). The CF tunnel proxy is 245 lines of Rust. File transfer/upload could add real complexity, but it hasn't been designed yet.
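For readers unfamiliar with the pattern, the following is a minimal standard-library-only Go sketch of single-flight refresh coalescing. It is an assumption-laden illustration, not the RFC's design and not code from the Go prototype: `tokenSource`, `refreshFunc`, and `demo` are hypothetical names, and the token endpoint is faked with a sleep. In real Go code the `golang.org/x/sync/singleflight` package mentioned above provides this behavior ready-made.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// refreshFunc stands in for the real OIDC token exchange (hypothetical).
type refreshFunc func() (string, error)

// call tracks one in-flight refresh shared by all concurrent waiters.
type call struct {
	done  chan struct{}
	token string
	err   error
}

// tokenSource coalesces concurrent refreshes into one upstream request.
type tokenSource struct {
	mu       sync.Mutex
	inflight *call
	refresh  refreshFunc
}

// Refresh starts a refresh if none is running; otherwise it waits for the
// in-flight one and returns that result.
func (s *tokenSource) Refresh() (string, error) {
	s.mu.Lock()
	if c := s.inflight; c != nil {
		s.mu.Unlock()
		<-c.done // wait for the leader to finish
		return c.token, c.err
	}
	c := &call{done: make(chan struct{})}
	s.inflight = c
	s.mu.Unlock()

	c.token, c.err = s.refresh() // only the leader reaches this line

	s.mu.Lock()
	s.inflight = nil
	s.mu.Unlock()
	close(c.done) // publish the result to all waiters
	return c.token, c.err
}

// demo fires n concurrent Refresh calls against a slow fake refresher and
// reports how many upstream refreshes actually ran.
func demo(n int) (upstream int64, tokens []string) {
	var count int64
	src := &tokenSource{refresh: func() (string, error) {
		atomic.AddInt64(&count, 1)
		time.Sleep(100 * time.Millisecond) // simulate a slow token endpoint
		return "token-1", nil
	}}
	tokens = make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			tokens[i], _ = src.Refresh()
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&count), tokens
}

func main() {
	upstream, tokens := demo(8)
	fmt.Println(upstream, len(tokens))
}
```

The sketch deliberately omits token caching and expiry checks; it only shows that callers arriving while a refresh is in flight share its result instead of issuing their own.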

The prior art is compelling, but the scale is different

The RFC cites Polars, Temporal, and Statsig. All solid examples. I wonder, though, whether they share a trait that OpenShell doesn't have yet: massive, correctness-critical business logic where behavioral divergence between languages would be an actual bug.

Polars has thousands of optimized query operators. Temporal has durable-execution state machines that must behave identically across languages (and Spencer Judge still called async FFI bridging "particularly challenging"). Statsig runs a complex feature-flag evaluation engine across 24 SDKs with 7 developers, and they're candid that the approach "makes things a little worse before they get better."

OpenShell's proposed shared core is transport configuration and auth token management. Those are well-understood problems with mature, battle-tested libraries in every gRPC ecosystem.

Proto versioning already prevents SDK drift

One argument for the Rust core is preventing SDK consumers from getting out of sync when the API evolves. But the proto definition already serves as the compile-time contract check. When you add a field to CreateSandboxRequest, every language's protoc codegen picks it up. Additive field changes are wire-compatible by design; gRPC has well-established patterns for field deprecation and service versioning.

When you add a new RPC (say PauseSandbox), the Rust-core path still requires updating the Rust crate, updating the napi-rs binding, updating the PyO3 binding, and updating any SDK that doesn't use the core (like Go). That's the same number of touchpoints as updating each native SDK directly. You've replaced "update Python gRPC client" with "update PyO3 FFI wrapper."

Kubernetes deals with this at larger scale (10+ officially supported client libraries, rapidly evolving API) and their approach is generating independent native clients from the OpenAPI spec. No shared runtime core. The spec itself is the single source of truth.

Costs worth considering

The RFC flags "napi-rs prebuilt binary CI complexity" as a risk and notes the six-target build matrix "has only been exercised on darwin-arm64 so far." That's not a one-time setup cost. An napi-rs SDK ships prebuilt native binaries for each platform, which means CI cross-compiles Rust to every target using platform-specific toolchains (zig linker, musl, per-target Docker images). When any piece of that chain changes (a new Rust edition, GitHub switching macOS runners from Intel to M1, a new platform target), the builds can silently break. A native TypeScript SDK using @grpc/grpc-js or a Go SDK using go build simply doesn't have this class of problem. And every SDK contributor now needs Rust plus the binding layer plus the target language.

For Go, cgo makes this worse. It disables cross-compilation by default, breaks static linking, slows builds, and loses Go's goroutine scheduling. The Go ecosystem treats cgo as a last resort. Users expect go get to work without a C/Rust toolchain installed. A Go SDK on a Rust core via cgo would be harder to adopt than a pure-Go one.

Alternative: start native, share tests

  1. Extract openshell-sdk as a Rust crate (Phase 1 of the RFC, which makes sense on its own to clean up the CLI).
  2. Build native TypeScript and Go SDKs using each language's gRPC ecosystem. The SDK methods are thin enough that the per-language effort is modest.
  3. Share a conformance test suite against a common test gateway that verifies behavioral parity across SDKs. This catches the "forgot to update the Python SDK" problem directly and is independent of implementation language.
  4. Revisit the shared-core question once the SDK surface grows to include genuinely complex client-side logic (local caching, retry with circuit-breaking, client-side policy evaluation). That would be the point where a shared core clearly pays for itself.

Data point: a pure-Go prototype

As a concrete example, I've been prototyping a pure-Go SDK that covers the full proto surface without any Rust dependency. Some numbers on the effort beyond vanilla gRPC wrappers:

| Area | Lines | What it adds over raw proto stubs |
|---|---|---|
| TCP forwarding | 376 | Bidirectional stream multiplexing over ForwardTcp |
| Exec (streaming + interactive) | 306 | Stream chunking, stdin piping, exit-code extraction |
| SSH session management | 152 | Session lifecycle, key handling |
| Proto-to-SDK type converters | 1,894 | Idiomatic Go types, deep copy at boundaries |
| Transport/auth setup | 143 | TLS config, bearer interceptor, channel construction |

Total: ~8,300 lines of non-test code (plus ~13,700 lines of tests). The transport/auth layer is 143 lines. The bulk of the work is type conversion and the streaming RPCs (TCP, exec), not transport plumbing. All of it uses standard google.golang.org/grpc and compiles with a plain go build.

This is early-stage, but it suggests the per-language effort for a native SDK is manageable, and the parts that need the most code (type mapping, streaming) are inherently language-specific anyway.


I'm new to the project but happy to dig into any of these points or help shape the conformance test approach if there's interest.

AI attribution: AIA HAb SeNc Hin R Claude Opus 4.6 v1.0

rhuss commented Jul 1, 2026


Small correction to my Go SDK numbers above: the transport/auth layer is actually 220 lines (not 143), once you include the internal TLS credential builder. It covers all four TLS modes from the RFC (plaintext, CA-only, mTLS, insecure) plus static bearer token auth.
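As a rough picture of what covering the four TLS modes involves, here is a standard-library Go sketch that maps each mode to a `*tls.Config`. It is not the prototype's credential builder: `Mode`, `Options`, and `tlsConfig` are hypothetical names, certificate material is assumed to be loaded by the caller, and the static bearer token (which in gRPC would be attached as per-RPC credentials) is left out.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// Mode enumerates the four transport security modes named above.
type Mode int

const (
	Plaintext Mode = iota
	CAOnly
	MTLS
	Insecure
)

// Options holds already-loaded material; reading files is left to the caller.
type Options struct {
	Mode       Mode
	RootCAs    *x509.CertPool   // required for CAOnly and MTLS
	ClientCert *tls.Certificate // required for MTLS
}

// tlsConfig returns nil for plaintext, otherwise a *tls.Config for the mode.
func tlsConfig(o Options) (*tls.Config, error) {
	switch o.Mode {
	case Plaintext:
		return nil, nil // caller dials without transport security
	case CAOnly:
		if o.RootCAs == nil {
			return nil, errors.New("ca-only mode requires RootCAs")
		}
		return &tls.Config{MinVersion: tls.VersionTLS12, RootCAs: o.RootCAs}, nil
	case MTLS:
		if o.RootCAs == nil || o.ClientCert == nil {
			return nil, errors.New("mtls mode requires RootCAs and ClientCert")
		}
		return &tls.Config{
			MinVersion:   tls.VersionTLS12,
			RootCAs:      o.RootCAs,
			Certificates: []tls.Certificate{*o.ClientCert},
		}, nil
	case Insecure:
		// Skips server certificate verification; for local testing only.
		return &tls.Config{MinVersion: tls.VersionTLS12, InsecureSkipVerify: true}, nil
	default:
		return nil, fmt.Errorf("unknown TLS mode %d", o.Mode)
	}
}

func main() {
	cfg, err := tlsConfig(Options{Mode: CAOnly, RootCAs: x509.NewCertPool()})
	fmt.Println(cfg != nil, err)
}
```

Keeping the builder free of file I/O is one way to separate the client from config loading, which matches the concern raised elsewhere in this thread about keeping the SDK filesystem-free.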

What it doesn't include: OIDC interactive flows (browser auth, token refresh, discovery) and the Cloudflare Access tunnel. But looking into the planned design, those are explicitly scoped out of openshell-sdk too ("SDK never sees a browser", "auth token file loading NOT in openshell-sdk directly"). The OIDC browser flow stays in the CLI, and the SDK consumes a Refresh trait that the CLI implements.

So the 220 lines in Go and the ~1,230 lines in Rust aren't really comparable at face value. Roughly half of the Rust OIDC code (lines 300-534 of oidc_auth.rs) is a localhost HTTP callback server for the browser flow plus tests. That's CLI-specific UX, not SDK logic that would be shared through FFI bindings.

The part of the Rust transport code that would actually cross the FFI boundary into language bindings is closer to the same scope the Go SDK already covers natively in 220 lines.

I'm planning to implement the remaining pieces (OIDC token refresh, single-flight coalescing, from_active_cluster config loading) in the Go prototype and will report back with the full numbers. That should give us a concrete data point on what the complete transport/auth layer looks like in a native SDK.

@maxdubrinsky

Collaborator Author

Thanks for the review @rhuss, sorry for taking so long to get back. I did some digging and I think you're right on the FFI front: a proto-first SDK approach is the way forward, with a "hand"-written layer in each of the languages we target.

This RFC was written before a number of RPC endpoints were made available, and you're right that the only Rust code we'd be sharing is the auth/transport layer (something individual languages do better for themselves vs. a Rust crate).

To that end (and @drew keep me honest here), I'm going to post a PR with a TS SDK similar to your Go SDK. It'll be using a generated proto client along with some handwritten surfaces to make using things like ExecSandbox a bit cleaner.

Once that's in, I think we should start looking at how we might bring in a first-party Go SDK using the same principles since it's my understanding that we want to keep all this code in-repo.

rhuss commented Jul 3, 2026


Thanks for the thoughtful response, @maxdubrinsky, and for landing on the proto-first direction. I know my earlier comment was lengthy, but I felt it was important to back my arguments with concrete numbers rather than hand-waving about complexity. Glad the analysis was useful (even if it was mostly driven by AI and manually verified).

On the Go SDK: I'd love to bring my prototype into this repo (if this is helpful). It's already more than a prototype, as it covers the full gRPC API surface: Sandbox, Provider, Exec, File, Health, Watch, Services, Profiles, and Credential Refresh. That includes OIDC token refresh with single-flight coalescing, browser-based auth flow integration, TLS with all four modes, a complete fake client package for testing, and CI with a proto sync workflow. Unit test coverage is also quite decent at ~90%, across 32 test files and 368 test functions. Details and design docs are in the proposal issue #2044 and the prototype repo. Of course it still has to be battle-tested manually, especially the OIDC flow, but I'm on it.

Happy to prepare a PR that moves it into sdk/go/ (mirroring the TS SDK's layout), wired up with mise tasks and license headers. If there's a preferred structure, I can adapt. One open point to discuss is probably where to put the API documentation (this is currently published as GitHub Pages on my repo).

One design question worth discussing early: should we aim for a common API shape across SDKs (same method names, same resource grouping, same error taxonomy), or should each SDK be idiomatic to its language ecosystem?

I'd lean toward idiomatic per language. The Go SDK follows client-go conventions (typed sub-clients, watch.Interface, IsNotFound() helpers, functional options) because that's what Go developers building operators and controllers for Kubernetes already know. A TypeScript SDK should feel like a TypeScript library. Forcing a uniform surface across languages means every SDK feels foreign to its audience, and you end up fighting the language's conventions rather than using them.

The protos already give us the shared contract at the wire level. Above that, each SDK should meet developers where they are. What do you think?
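To illustrate two of the client-go style conventions mentioned above, functional options and `IsNotFound()` helpers, here is a small self-contained Go sketch. Everything in it is hypothetical and chosen for illustration (`Client`, `Option`, `WithTimeout`, `WithBearerToken`, `StatusError`, `Code`, `getSandbox`); it does not reflect the prototype's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Code is a hypothetical SDK error taxonomy.
type Code int

const (
	CodeUnknown Code = iota
	CodeNotFound
)

// StatusError carries a discriminable code alongside the message.
type StatusError struct {
	Code    Code
	Message string
}

func (e *StatusError) Error() string { return e.Message }

// IsNotFound reports whether any error in err's chain is a NotFound StatusError.
func IsNotFound(err error) bool {
	var se *StatusError
	return errors.As(err, &se) && se.Code == CodeNotFound
}

// Client is a stand-in for an SDK client; it makes no network calls here.
type Client struct {
	endpoint string
	timeout  time.Duration
	token    string
}

// Option mutates a Client during construction (functional options pattern).
type Option func(*Client)

// WithTimeout overrides the default per-call timeout.
func WithTimeout(d time.Duration) Option { return func(c *Client) { c.timeout = d } }

// WithBearerToken sets a static bearer token.
func WithBearerToken(t string) Option { return func(c *Client) { c.token = t } }

// NewClient applies defaults first, then each option in order.
func NewClient(endpoint string, opts ...Option) *Client {
	c := &Client{endpoint: endpoint, timeout: 30 * time.Second}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

// getSandbox is a stub that always fails, to show matching through wrapping.
func getSandbox(name string) error {
	return fmt.Errorf("get sandbox %q: %w", name,
		&StatusError{Code: CodeNotFound, Message: "not found"})
}

func main() {
	c := NewClient("localhost:8080", WithTimeout(5*time.Second), WithBearerToken("example"))
	err := getSandbox("demo")
	fmt.Println(c.timeout, IsNotFound(err))
}
```

A TypeScript SDK would more naturally express the same ideas with an options object and an error subclass carrying a `code` field, which is the kind of per-language divergence the question above is about.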
