rfc-0008: shared SDK rust core and language specific bindings#1764
rfc-0008: shared SDK rust core and language specific bindings#1764maxdubrinsky wants to merge 1 commit into
Conversation
drew
left a comment
There was a problem hiding this comment.
Looks like a good direction to me.
|
|
||
| ## Proposal | ||
|
|
||
| ### New and changed crates |
There was a problem hiding this comment.
We should be able to refactor a lot of the TUI to reuse this crate as well. cc @johntmyers
|
|
||
| ## Open questions | ||
|
|
||
| - **Retry policy shape.** Builder on `ClientConfig` (declarative) or `tower::Layer` (composable)? Composable is more flexible; declarative is friendlier for napi/PyO3 consumers who can't construct a `Layer`. |
There was a problem hiding this comment.
I think the concern found was if we rely on tower to handle our retry logic, that won't allow for any configuration in the python/ts layer in the same way that recompiling the rust sdk with different values would. This was found by an agent, feels a bit overblown.
| ## Open questions | ||
|
|
||
| - **Retry policy shape.** Builder on `ClientConfig` (declarative) or `tower::Layer` (composable)? Composable is more flexible; declarative is friendlier for napi/PyO3 consumers who can't construct a `Layer`. | ||
| - **Should `OpenShellClient::from_gateway_name(name)` exist in `openshell-sdk` at all,** or only in a CLI-config helper crate? Tradeoff between ergonomics and keeping `openshell-sdk` filesystem-free. |
There was a problem hiding this comment.
Can openshell-sdk be used as the Rust SDK? If so, I think it makes sense to have client factories for this.
There was a problem hiding this comment.
I think so, as long as we're careful to decouple the actual client from the factory that can read from config files. Think this came from me pushing the POC agent to develop independent of the config structure the CLI relies on.
| - Wire the OIDC refresh callback path between Rust and JS. | ||
| - Map SDK errors to JS errors with a discriminable `code` field. | ||
| - Resolve the tunnel-vs-refresh interaction with one targeted test (does the CF tunnel re-handshake on bearer rotation, swap headers in place, or tear down and rebuild?). | ||
| - Smoke test against a plaintext local gateway. |
There was a problem hiding this comment.
Would be good to build some sort of test suite around this. e2e/typescript?
There was a problem hiding this comment.
That'd be the goal. Currently the tests in #1617 don't cover plain auth (just OIDC), but it would make sense to cover everything.
|
@maxdubrinsky rename to rfc-0008 |
Captures the design behind extracting the shared client core out of openshell-cli into a standalone openshell-sdk crate, plus the napi-rs TypeScript binding (openshell-sdk-node, published as @openshell/sdk). Covers motivation (CLI/TUI/embedders sharing one transport, OIDC, and edge-tunnel implementation), surface area, error model, and the path for future language bindings. Signed-off-by: Max Dubrinsky <mdubrinsky@nvidia.com>
6df4e19 to
bb5f5c6
Compare
|
|
||
| ## Non-goals | ||
|
|
||
| - **Replacing the pure-Python SDK.** That migration is a separate, larger decision (API parity, deprecation window, packaging). This RFC keeps Python on its current pure-Python stack and only ensures the shared core is shaped so a future PyO3 wrapper is feasible. |
There was a problem hiding this comment.
@maxdubrinsky Is this still the plan? I.e. to leave Python SDK as is?
One item I have on my list to tackle is to remove the CLI from Python SDK and depending on how the new SDK work goes, I could do this right away and keep Python SDK as pure-Python, which greatly simplifies the build process.
If we are going to use Rust as a base for Python SDK in the future, it could be better to keep the build process (it took a while to get it to a reasonably okay state).
WDYT?
There was a problem hiding this comment.
+1 we definitely want to replace the python sdk with this rust core.
|
Nice RFC, @maxdubrinsky. Extracting the transport/auth code from the CLI into its own crate is a good idea regardless of where the FFI discussion lands. I've been looking at this from the perspective of a potential Go SDK and wanted to raise a question: is ~1,230 lines of transport plumbing enough shared logic to justify the FFI layer? What would actually be sharedI (and my friendly french AI agent :) went through the code that would move into
The SDK methods proposed for the MVP (sandbox CRUD, health, exec, wait) are thin wrappers over the proto API. The Python SDK's The RFC mentions OIDC single-flight refresh coalescing and the Cloudflare Access tunnel as areas of non-trivial logic. But single-flight token refresh is a well-known pattern that takes ~50 lines in any language (Go's The prior art is compelling, but the scale is differentThe RFC cites Polars, Temporal, and Statsig. All solid examples. I wonder, though, whether they share a trait that OpenShell doesn't have yet: massive, correctness-critical business logic where behavioral divergence between languages would be an actual bug. Polars has thousands of optimized query operators. Temporal has durable-execution state machines that must behave identically across languages (and Spencer Judge still called async FFI bridging "particularly challenging"). Statsig runs a complex feature-flag evaluation engine across 24 SDKs with 7 developers, and they're candid that the approach "makes things a little worse before they get better." OpenShell's proposed shared core is transport configuration and auth token management. Those are well-understood problems with mature, battle-tested libraries in every gRPC ecosystem. Proto versioning already prevents SDK driftOne argument for the Rust core is preventing SDK consumers from getting out of sync when the API evolves. But the proto definition already serves as the compile-time contract check. When you add a field to When you add a new RPC (say Kubernetes deals with this at larger scale (10+ officially supported client libraries, rapidly evolving API) and their approach is generating independent native clients from the OpenAPI spec. No shared runtime core. The spec itself is the single source of truth. Costs worth consideringThe RFC flags "napi-rs prebuilt binary CI complexity" as a risk and notes the six-target build matrix "has only been exercised on darwin-arm64 so far." That's not a one-time setup cost. An napi-rs SDK ships prebuilt native binaries for each platform, which means CI cross-compiles Rust to every target using platform-specific toolchains (zig linker, musl, per-target Docker images). When any piece of that chain changes (a new Rust edition, GitHub switching macOS runners from Intel to M1, a new platform target), the builds can silently break. A native TypeScript SDK using For Go, cgo makes this worse. It disables cross-compilation by default, breaks static linking, slows builds, and loses Go's goroutine scheduling. The Go ecosystem treats cgo as a last resort. Users expect Alternative: start native, share tests
Data point: a pure-Go prototypeAs a concrete example, I've been prototyping a pure-Go SDK that covers the full proto surface without any Rust dependency. Some numbers on the effort beyond vanilla gRPC wrappers:
Total: ~8,300 lines of non-test code (plus ~13,700 lines of tests). The transport/auth layer is 143 lines. The bulk of the work is type conversion and the streaming RPCs (TCP, exec), not transport plumbing. All of it uses standard This is early-stage, but it suggests the per-language effort for a native SDK is manageable, and the parts that need the most code (type mapping, streaming) are inherently language-specific anyway. I'm new to the project but happy to dig into any of these points or help shape the conformance test approach if there's interest. AI attribution: AIA HAb SeNc Hin R Claude Opus 4.6 v1.0 |
|
Small correction to my Go SDK numbers above: the transport/auth layer is actually 220 lines (not 143), once you include the internal TLS credential builder. It covers all four TLS modes from the RFC (plaintext, CA-only, mTLS, insecure) plus static bearer token auth. What it doesn't include: OIDC interactive flows (browser auth, token refresh, discovery) and the Cloudflare Access tunnel. But looking into the planned design, those are explicitly scoped out of So the 220 lines in Go and the ~1,230 lines in Rust aren't really comparable at face value. Roughly half of the Rust OIDC code (lines 300-534 of The part of the Rust transport code that would actually cross the FFI boundary into language bindings is closer to the same scope the Go SDK already covers natively in 220 lines. I'm planning to implement the remaining pieces (OIDC token refresh, single-flight coalescing, |
|
Thanks for the review @rhuss, sorry for taking long to get back. Did some digging and I think you're right on the FFI-front and a proto-first SDK approach is the way forward with a "hand"-written layer in the various languages we will be compiling for. This RFC was written before a number of RPC endpoints were made available, and you're right that the only Rust code we'd be sharing is the auth/transport layer (something individual languages do better for themselves vs. a Rust crate). To that end (and @drew keep me honest here), I'm going to post a PR with a TS SDK similar to your Go SDK. It'll be using a generated proto client along with some handwritten surfaces to make using things like Once that's in, I think we should start looking at how we might bring in a first-party Go SDK using the same principles since it's my understanding that we want to keep all this code in-repo. |
|
Thanks for the thoughtful response, @maxdubrinsky , and for landing on the proto-first direction. I know my earlier comment was lengthy, but I felt it was important to back my arguments in concrete numbers rather than hand-waving about complexity. Glad the analysis was useful (even when mostly driven by AI, and manually verified) On the Go SDK: I'd love to bring my prototype into this repo (if this is helpful). It's alread more than a prototype, as it covers the full gRPC API surface: Sandbox, Provider, Exec, File, Health, Watch, Services, Profiles, and Credential Refresh. That includes OIDC token refresh with single-flight coalescing, browser-based auth flow integration, TLS with all four modes, a complete fake client package for testing, and CI with a proto sync workflow. Unit test coverage is also quite decent with ~ 90% coverage. 32 test files, 368 test functions. Details and design docs are in the proposal issue #2044 and the prototype repo. Of course it still has to be battle-tested manually, especially the OIDC flow, but I'm on it. Happy to prepare a PR that moves it into One design question worth discussing early: should we aim for a common API shape across SDKs (same method names, same resource grouping, same error taxonomy), or should each SDK be idiomatic to its language ecosystem? I'd lean toward idiomatic per language. The Go SDK follows client-go conventions (typed sub-clients, The protos already give us the shared contract at the wire level. Above that, each SDK should meet developers where they are. What do you think? |
Summary
Lands RFC 0005 — Shared Rust SDK core and TypeScript binding as a standalone document. This is the design split out of the larger implementation PR #1617 so the direction can be reviewed and merged on its own; the
openshell-sdkcrate and the@openshell/sdkNode binding will follow as their own focused PRs.RFC 0005 proposes extracting an
openshell-sdkRust crate from the gRPC client plumbing that currently lives inopenshell-cli(transport, TLS, OIDC refresh, edge tunnel), refactoring the CLI to consume it, and shipping a TypeScript SDK (@openshell/sdk) as a napi-rs wrapper over the same core — so the CLI, the TS SDK, and future bindings share one transport, auth, and error implementation.The document covers:
openshell-sdk/openshell-sdk-nodeAPI surfaces.thiserror→ napi error mapping with a discriminablecode, single-flight OIDC refresh in the core, and arawescape hatch for uncovered RPCs.Non-goals: replacing the Python SDK, gRPC contract changes, browser/WASM support, and bundling the CLI binary inside the npm package.
Related Issue
Split out of #1617, which will be closed in favor of three focused PRs: this RFC →
openshell-sdk(crate extraction + CLI refactor) →openshell-sdk-node(napi binding, Pi example, CI).Changes
rfc/0005-shared-sdk-core-and-ts-binding/README.md. Documentation only — no code, schema, or behavior changes.Testing
mise run pre-commitpasses (lint, format, license headers, full Rust test suite — 767 passed)Checklist