Releases: NVIDIA/OSMO
6.3.1
Highlights
- Bulk cancel workflows — Select multiple workflows in the UI list and cancel them with a single confirmed action.
- Gateway PodMonitors — Envoy and oauth2-proxy now expose Prometheus metrics endpoints for cluster monitoring.
- Streaming endpoints stop timing out — Workflow log, event, and error-log streams no longer hit the default route timeout mid-stream.
- Envoy v1.38.1 with header-sanitization refactor — Gateway image upgraded from v1.29.0 and identity-header stripping moved off the Lua filter onto Envoy's native mechanism.
- OAuth2 callback no longer self-blocks — When oauth2-proxy is enabled, its own callback endpoints bypass the authn metadata filter so login completes.
- Default role policy merge respects scope — Updates to the shipped default roles append actions only to policies with matching effect and resources, preserving operator-added grants.
Helm Charts
- Gateway PodMonitors: The service chart now ships PodMonitors for Envoy's admin metrics endpoint and oauth2-proxy's `/metrics` endpoint when `podMonitor.enabled` is true and the matching gateway component is deployed. (#1095)
- Envoy upgrade to v1.38.1: Default `gateway.envoy.image` bumped from `envoyproxy/envoy:v1.29.0`. (#1081)
- Identity header sanitization refactor: Client `x-osmo-{user,roles,allowed-pools}` headers are now stripped via Envoy's native header sanitization instead of a Lua filter. JWT-only deployments (oauth2-proxy and authz disabled, JWT providers configured) now sanitize client headers as well. Minimal/demo mode (all three auth sources disabled) continues to trust them; see the chart README for the full identity-header trust table. (#1081)
- HPA-managed deployments skip the replicas field: Gateway Envoy, oauth2-proxy, and authz deployments omit `spec.replicas` from their manifests, so Helm apply no longer contends with the autoscaler on each reconcile. (#1081)
- ConfigMap extra annotations: New `services.configs.extraAnnotations` annotates the generated configs ConfigMap, useful for setting `argocd.argoproj.io/sync-options: ServerSideApply=true` on large config payloads. (#1081)
- Streaming API route timeouts: Workflow `/logs`, `/events`, and `/error_logs` routes use `timeout: 0s` with `idle_timeout: 60s` so quiet-but-open streams are not cut. Other `/api/` and `/client/` routes get an explicit 60s timeout, up from Envoy's 15s default. (#1085)
- OAuth2 control routes skip ext_authz: `/signout` and `/oauth2/` routes disable the external authorization filter so the browser can complete login and logout without authz sidecar calls. (#1085)
- OAuth2 callback added to authn skip paths: Setting `gateway.oauth2Proxy.enabled` now adds `/oauth2/` and `/signout` to the authn skip set, so oauth2-proxy callbacks reach the proxy instead of being pre-checked by its own `/oauth2/auth` endpoint. (#1091)
- Router affinity cookie defaults to session lifetime: `gateway.envoy.routerRoute.cookie.ttl` now defaults to `0s` (session cookie) instead of `60s`. CLI sessions no longer get reassigned to a different router pod after 60 seconds of idle activity. (#1098)
- Logger upstream uses headless service: The gateway now connects to `osmo-logger-headless:8000` so Envoy load-balances directly to pod IPs, avoiding the default 1024-connection circuit breaker that the cluster-IP service was hitting. (#1098)
Workflow Execution
- Retry pods recreate generated file secrets: Retrying a pod that mounts file-backed credential secrets now recreates the per-pod Secret alongside the new pod, so the retry no longer fails to mount missing file references. (#1090)
Web UI
- Bulk cancel workflows: Select multiple rows in the Workflows list and cancel them with one action behind a confirmation dialog. (#1050)
Authorization
- Default role policy merge by scope: Default-role updates compare and append actions per `(effect, resources)` scope instead of flattening everything into the first policy of the existing role. Operator-added grants are preserved verbatim, and missing scopes from the shipped default are appended as new policies. (#1072)
- Default `osmo-user` role narrowed to default pool: The shipped role now grants read/list actions across all resources and scopes `workflow:*` to `pool/default` only. Existing deployments retain their stored policies because the merge is append-only; fresh installs and operators who reseed defaults pick up the narrower scope. (#1072)
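The scope-based merge described above can be pictured with a small sketch. This is an illustration only, not the OSMO implementation: the policy shape (dicts with `effect`, `resources`, `actions`), the helper names `scope_key` and `merge_default_role`, and the action strings such as `workflow:Read` are all hypothetical, and treating resource order as irrelevant is an assumption.

```python
from copy import deepcopy


def scope_key(policy):
    """Return the (effect, resources) scope used to match policies.

    Assumption: the order of entries in `resources` does not matter.
    """
    return (policy["effect"], tuple(sorted(policy["resources"])))


def merge_default_role(existing_policies, default_policies):
    """Append-only merge of shipped default policies into a stored role."""
    merged = deepcopy(existing_policies)
    by_scope = {}
    for policy in merged:
        # If two stored policies share a scope, the first one receives new actions.
        by_scope.setdefault(scope_key(policy), policy)

    for default in default_policies:
        target = by_scope.get(scope_key(default))
        if target is None:
            # Scope missing from the stored role: append it as a new policy.
            new_policy = deepcopy(default)
            merged.append(new_policy)
            by_scope[scope_key(default)] = new_policy
            continue
        # Matching scope: add only the actions that are not already present.
        for action in default["actions"]:
            if action not in target["actions"]:
                target["actions"].append(action)
    return merged


# Hypothetical stored role: one default-scope policy plus an operator-added grant.
existing = [
    {"effect": "Allow", "resources": ["pool/default"], "actions": ["workflow:Read"]},
    {"effect": "Allow", "resources": ["pool/team-a"], "actions": ["workflow:*"]},
]
# Hypothetical shipped default with one new action and one new scope.
shipped = [
    {"effect": "Allow", "resources": ["pool/default"],
     "actions": ["workflow:Read", "workflow:Cancel"]},
    {"effect": "Allow", "resources": ["*"], "actions": ["pool:List"]},
]
merged = merge_default_role(existing, shipped)
```

In this sketch the `pool/team-a` grant is left untouched, `workflow:Cancel` is appended only to the policy whose scope matches, and the `*` scope arrives as a new third policy. Nothing is ever removed, which is why a narrower shipped default does not shrink an existing deployment's stored role.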
Getting OSMO
Helm Charts and Containers
Helm charts and container images are available on NGC.
CLI Client
Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.
6.3.0
Highlights
- ConfigMap-based configuration: All service configs (pools, backends, pod templates, roles, and more) can now be managed as Helm values via a Kubernetes ConfigMap, following standard K8s patterns and enabling GitOps workflows.
- TLS support: The service chart now terminates TLS at the gateway, with values for cert/key, redirect from HTTP, and SAN configuration.
- Service chart consolidation: The standalone `router` and `web-ui` Helm charts have been folded into the `service` chart, making a full deployment a single Helm release.
- Multi-provider deploy scripts: `deploy-k8s.sh` now provisions OSMO on Azure AKS, AWS EKS, microk8s, or any existing Kubernetes cluster, with idempotent installers for KAI Scheduler, GPU Operator, MinIO, and configurable storage backends (MinIO, Azure Blob, AWS S3, BYO S3).
- Per-group timeouts: `exec_timeout` and `queue_timeout` now meter each group independently instead of running against the workflow as a whole, so a stuck simulation group no longer kills the rest of the workflow.
- Dataset CLI and API deprecated: `osmo dataset` commands and the `/datasets` API endpoints are deprecated and will be removed in 6.4. Migrate to workflow-managed dataset outputs.
- Rsync download support: Pull files from running workflow tasks to your local machine with `osmo workflow rsync download`, complementing the existing upload capability.
- Visual transfer progress: File sync operations now display a progress bar showing bytes transferred, percentage, rate, and ETA.
- Workload identity for core services: Run OSMO services under a cloud-issued federated identity (Azure Workload Identity on AKS/Arc, AWS IRSA / EKS Pod Identity) via new cloud-neutral `serviceAccount` annotations and per-component `extraPodLabels` hooks, removing the need to mount cloud storage keys as Kubernetes Secrets.
- Privilege escalation fix: Policies with empty resources lists no longer grant access to resource-scoped endpoints.
Breaking Changes
- Router chart removed: The standalone `router` Helm chart is gone. Router pods now deploy as part of the `service` chart. Existing router resources (`osmo-router`, `osmo-router-headless`) continue to work, but you must remove the separate router Helm release before upgrading. See the 6.2 to 6.3 upgrade guide for migration steps. (#897)
- Web UI chart removed: The standalone `web-ui` Helm chart has been merged into the `service` chart. Set `ui.enabled: true` in service values to deploy the UI alongside the API. Remove the separate `web-ui` release before upgrading. (#907)
- Squid proxy removed from backend operator: The egress allowlist and squid-proxy sidecar have been removed from the backend operator chart. Network policies now restrict pod-to-pod access directly. (#823)
- Per-group timeout semantics: `exec_timeout` and `queue_timeout` are now enforced per group (clock starts on the group's `RUNNING` or `SCHEDULING` transition) instead of per workflow. An expired group is marked `FAILED_EXEC_TIMEOUT` or `FAILED_QUEUE_TIMEOUT`; sibling groups continue and the workflow status aggregates only after all groups finish. (#925)
- Dataset CLI and API deprecated: All `osmo dataset` subcommands print a stderr deprecation warning, and the `/datasets` REST endpoints are marked deprecated in the OpenAPI schema. The Datasets page in the UI shows a deprecation banner. Both will be removed in 6.4. (#872)
- S3 addressing default: For S3-compatible backends with a custom `endpoint_url`, the addressing style now defaults to virtual-hosted instead of boto3's auto-selection (which picks path style for custom endpoints), fixing compatibility with providers that require virtual hosts. If a backend requires path addressing, set the `addressing_style` attribute to `path`, or force OSMO to always use path addressing via the `AWS_S3_FORCE_PATH_STYLE` environment variable. (#950)
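The S3 addressing change can be read as a small precedence rule. The sketch below is one plausible reading, not OSMO's code: the function name `resolve_addressing_style` is hypothetical, the set of values that switch `AWS_S3_FORCE_PATH_STYLE` on is an assumption (the notes do not say what value the variable expects), and falling back to `auto` when no custom endpoint is set is also an assumption.

```python
import os


def resolve_addressing_style(endpoint_url=None, addressing_style=None, environ=None):
    """Pick an S3 addressing style: "path", "virtual", or "auto"."""
    environ = os.environ if environ is None else environ

    # Assumption: any of these values turns the global override on.
    forced = environ.get("AWS_S3_FORCE_PATH_STYLE", "").strip().lower()
    if forced in ("1", "true", "yes"):
        return "path"

    # An explicit per-backend setting wins over the new default.
    if addressing_style:
        return addressing_style

    # New 6.3.0 default: custom endpoints use virtual-hosted addressing.
    if endpoint_url:
        return "virtual"

    # Assumption: without a custom endpoint, leave the choice to boto3.
    return "auto"


# Hypothetical endpoint URL, used only for illustration.
default_style = resolve_addressing_style("https://s3.example.com", environ={})
explicit_style = resolve_addressing_style("https://s3.example.com", "path", environ={})
forced_style = resolve_addressing_style(
    "https://s3.example.com", environ={"AWS_S3_FORCE_PATH_STYLE": "true"}
)
```

The resulting string is one of the values botocore accepts for its `addressing_style` option, so it could be passed to a client through `botocore.config.Config(s3={"addressing_style": style})`.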
Helm Charts
- ConfigMap configuration mode: Set `services.configs.enabled: true` to manage all service configs via Helm values. CLI/API writes return HTTP 409 when active. The chart ships with default roles, pod templates, resource validations, backend, and pool. (#822)
- ConfigMap mode for worker, agent, and logger: The ConfigMapWatcher now runs in the worker, agent, and logger services. Previously only the API service watched the ConfigMap, so workflow pods built by the worker could be constructed from stale config. (#926)
- TLS termination at the gateway: Configure a serving cert/key, optional HTTP-to-HTTPS redirect, and SAN list via `gateway.tls`. The gateway template generates the matching Envoy listener config. (#953)
- Cloud workload identity: New top-level `serviceAccount` block (`create`, `name`, `annotations`) and per-component `extraPodLabels` on `agent`, `api`, `worker`, `logger`, `router`, and `delayedJobMonitor`. The hooks are cloud-neutral; set the annotations and labels your CSP's identity webhook expects:
  - Azure (AKS / Arc): annotate the SA with `azure.workload.identity/client-id: <uami-client-id>` and label pods with `azure.workload.identity/use: "true"`. The Azure storage backend falls back to `DefaultAzureCredential` when no static connection string is supplied.
  - AWS (EKS IRSA / Pod Identity): annotate the SA with `eks.amazonaws.com/role-arn: <iam-role-arn>`. The S3 backend picks up the federated token from boto3's default credential chain, so no pod labels are required.
- Gateway consolidation: A unified gateway now handles load balancing for all service types (API, router, UI), simplifying ingress configuration. (#817, #799)
- Gateway extension hooks: Inject custom Envoy filters and additive auth-skip paths via `gateway.envoy.extensions` and `gateway.envoy.authSkipPaths`, useful for sidecar integrations and bypassing authz on specific endpoints. (#1009)
- Default identity headers: Minimal deployments can now inject default `x-osmo-user`, `x-osmo-roles`, and `x-osmo-allowed-pools` headers for unauthenticated browser requests via `gateway.envoy.defaultIdentity` values. (#902)
- oauth2-proxy extraEnv: Expose environment variables on the oauth2-proxy container via `gateway.oauth2Proxy.extraEnv`, needed for Redis AUTH when using session storage. (#898)
- Custom HPA metrics: Specify custom metrics for Horizontal Pod Autoscalers on service components. (#858)
- Pool computed fields resolved at load time: ConfigMap pools no longer require pre-expanded `parsed_pod_template` and `parsed_resource_validations`, reducing config file size by ~60%. (#866)
- Per-field Secret mounts: Create credential Secrets with `kubectl --from-literal` instead of packaging all fields into a single `cred.yaml`. (#884)
- Default pod templates on default pool: The chart's default pool now sets `common_pod_template`, so workflows submitted without an explicit template pick up `default_ctrl` and `default_user` automatically. (#1010, #1012)
- Backend-operator startup probe configurable: `startupProbe` thresholds on the backend listener and worker are now exposed in values, with relaxed defaults to handle slow image pulls on cold clusters. (#961)
- Service startup probe extended: The API service `startupProbe` failure threshold now allows up to ~2 minutes for migrations and DB warm-up before the pod is restarted. (#967)
- podMonitor disabled by default: Both the service and backend-operator charts now default `podMonitor.enabled` to `false`, avoiding errors on clusters without Prometheus Operator CRDs installed. (#962, #963)
- Config export script: New `deployments/upgrades/export_configs_to_helm.py` exports existing database configs to Helm values format. (#866)
Deployment Scripts
- Multi-provider deploy: `deploy-k8s.sh` provisions a Kubernetes cluster on Azure AKS, AWS EKS, microk8s, or registers an existing cluster, then installs OSMO end-to-end. Cluster-agnostic dependency installers detect existing KAI Scheduler, GPU Operator, and MinIO so re-runs are safe. (#979)
- Storage backend wiring: `configure-storage.sh` provisions and registers the workflow storage backend for MinIO, Azure Blob, AWS S3, or a bring-your-own S3 endpoint, including credential creation and bucket setup. (#979, #988)
- Idempotent token mint: Backend operator token reconciliation now deletes any pre-existing `backend-token` before re-minting, so partial prior runs and microk8s PVC carryover no longer wedge re-deploys. (#988)
- Helm values for minimal install: `deploy-osmo-minimal.sh` accepts `--values` to layer custom Helm values on top of the minimal preset. (#993)
Workflow Execution
- Per-group exec and queue timeouts: Each group's clock starts on its own `RUNNING` (exec) or `SCHEDULING` (queue) transition. Expired groups are marked `FAILED_EXEC_TIMEOUT` or `FAILED_QUEUE_TIMEOUT`; downstream groups cascade as `FAILED_UPSTREAM`, sibling groups keep running. Delayed jobs serialized before the upgrade fall back to the previous workflow-level enforcement with a warning log. (#925)
- Pool quota accounting handles Jinja: `osmo-ctrl` resource requests and limits are now pre-rendered for pool-quota accounting, so templated values like `{% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %}` are counted correctly instead of being silently treated as zero. (#931)
- service_auth wired into worker, agent, logger: These services now read `service_auth` and stop reading `service_base_url` from the database, fixing config-mode authentication for non-API pods. (#930)
- KAI queues sync on every registration: Backend registration now syncs KAI Scheduler queues unconditionally, instead of only on the first registration. (#941)
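The per-group timeout behaviour described above can be sketched as a small state update. This is a simplified model under stated assumptions, not the OSMO scheduler: the `Group` class, the `enforce_timeouts` function, the `WAITING` status for groups that have not started, and the use of plain seconds for timestamps are all hypothetical. Only the `RUNNING`, `SCHEDULING`, `FAILED_EXEC_TIMEOUT`, `FAILED_QUEUE_TIMEOUT`, and `FAILED_UPSTREAM` names come from the notes.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    status: str          # "SCHEDULING", "RUNNING", or a hypothetical "WAITING"
    entered_at: float    # time (seconds) the group entered its current status
    upstream: list = field(default_factory=list)  # names of groups it depends on


def enforce_timeouts(groups, now, exec_timeout, queue_timeout):
    """Expire groups on their own clocks, then cascade to downstream groups."""
    for group in groups.values():
        elapsed = now - group.entered_at
        if group.status == "RUNNING" and elapsed > exec_timeout:
            group.status = "FAILED_EXEC_TIMEOUT"
        elif group.status == "SCHEDULING" and elapsed > queue_timeout:
            group.status = "FAILED_QUEUE_TIMEOUT"

    # Repeat until stable so failures propagate through chains of dependencies.
    changed = True
    while changed:
        changed = False
        for group in groups.values():
            if group.status.startswith("FAILED") or group.status == "COMPLETED":
                continue
            if any(groups[name].status.startswith("FAILED") for name in group.upstream):
                group.status = "FAILED_UPSTREAM"
                changed = True
    return groups


groups = {
    "sim": Group("sim", "RUNNING", entered_at=0.0),
    "train": Group("train", "RUNNING", entered_at=90.0),
    "eval": Group("eval", "WAITING", entered_at=0.0, upstream=["sim"]),
}
enforce_timeouts(groups, now=100.0, exec_timeout=60.0, queue_timeout=30.0)
```

Here `sim` has been running for 100 seconds and expires, `eval` depends on it and cascades, while the sibling `train` group started its own clock at 90 seconds and keeps running. Under the earlier workflow-level semantics a single clock would have covered all three.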
CLI
- Rsync download: Pull files from running tasks to your local machine with `osmo workflow rsync download wf-id ...
6.3.0-prerelease-rc11
Getting OSMO
Helm Charts and Containers
Helm charts and container images are available on NGC.
CLI Client
Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.
6.3.0-prerelease-rc9
Getting OSMO
Helm Charts and Containers
Helm charts and container images are available on NGC.
CLI Client
Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.
6.2.10
Highlights
- Authorization Bug Fixes — Multiple RBAC path corrections for credentials, workflow exec, and rsync operations
- Interactive Exec Improvements — Dynamic terminal sizing and resize support for full-screen tools like vim when exec-ing into a running workflow task
- Default Pool Submission — Users can now submit workflows to the default pool via osmo-user
Authorization
- Credentials create path: Fixed RBAC action registry to include the more specific path required for credential creation (#737)
- Workflow exec permissions: Corrected authorization paths for workflow exec and rsync operations (#738, #739)
- Restart API access: Added missing action registry entry for the restart API (#716)
Interactive Development
osmo workflow exec and the browser shell now correctly handle terminal geometry, improving the experience when running interactive tools inside a workflow task:
- Dynamic terminal sizing: The terminal reports its actual dimensions to the runtime, fixing rendering of full-screen applications like vim (#717)
- Shell resize support: Resizing the browser shell window propagates correctly to the running session (#727)
Workflow Engine
- Default pool submission: `osmo-user` can now submit workflows without specifying a pool, falling back to the user's default pool (#728)
Brev Deployment
- New UI support: The new UI is now supported by allowing cross-domain URL proxying
- KAI Scheduler v0.13.4: Quick-start and one-click launchable updated to KAI Scheduler v0.13.4 with corrected Helm chart registry path (#725)
Getting OSMO
Helm Charts and Containers
Helm charts and docker containers are available in NGC
CLI Client
The installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.
6.2.8
Highlights
- RBAC & Authentication Overhaul — Full OAuth2 proxy integration, RBAC authorization sidecar, user mapping, and JWT-based auth APIs
- New UI Platform — Complete UI rewrite with OAuth2 integration, dataset collections, workflow submission flow, and WCAG 2.1 accessibility compliance
- AI Agentic Skills — Agent skills framework with workflow-expert, logs-reader, and language-specific expert sub-agents for autoscaling workflow submissions
- Database Migration — pgroll-based database migration system
- NVLink & Topology-Aware Scheduling — NVLink topology support and intelligent pool grouping for shared nodes within the same nodeset
Authentication & Authorization
OSMO introduces a comprehensive authentication and authorization layer to secure access across all services:
- RBAC Authorization Sidecar: Dedicated authz sidecar deployed alongside services, enabled by default, enforcing role-based access control at the request level (#445, #471)
- OAuth2 Proxy Integration: Full OAuth2 proxy support for both UI and backend services, replacing the previous auth model with standard OAuth2 flows including device code login and token refresh (#443, #520, #585)
- User Mapping: Map external identity provider users to OSMO roles and pool permissions, with syncing between role maps and pool assignments (#418, #515)
AI Agentic Skills
A new agent framework enables AI-driven workflow management and codebase assistance:
- Skills Framework: Extensible skill system with cross-platform installation via npx, structured for framework-agnostic usage (#555, #598, #599, #605)
- Workflow Expert Agent: Specialized agent with detailed knowledge of workflow execution phases for intelligent troubleshooting and guidance (#565)
Scheduling & Compute
- NVLink Topology Support: Scheduling-aware NVLink detection enabling topology-aware task placement for multi-GPU workloads (#479)
- KAI Scheduler Default: Switched default scheduler to KAI for improved scheduling performance (#115)
Workflow Engine & Backend
- CLI Workflow Events: Workflow event streaming available through the CLI for real-time monitoring (#533)
- Supporting Large Workflows: Websocket connection between agent service and backend worker will no longer break on large workflows, and status updates are now sped up by at least 30% (#398, #391, #655, #676)
- Workflow Submission Speedup: For large workflows (e.g. 100 tasks), workflow submission response is 4x faster (#701)
Data & Storage
- Non-AWS S3 Support: S3-compatible storage backends (MinIO, Azure Blob, etc.) work without requiring AWS environment variables, with automatic endpoint detection during data auth validation (#421, #385)
- Credential-less Data Operations: Data Access Layer supports operations without explicit credentials when environment-based auth is available, with client-side auth checks (#159, #177)
Database
- pgroll Migration System: Pre-upgrade migration jobs using pgroll for schema changes
Web UI
- The UI has been completely rewritten and relocated to `/src/ui`, replacing the legacy frontend.
Getting OSMO
Helm Charts and Containers
Helm charts and docker containers are available in NGC
CLI Client
The installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.
6.2.6
Release Candidate for v6.2
6.0.0
Major features
Workflow Management
OSMO provides a sophisticated workflow orchestration system that allows users to define, submit, and monitor complex AI workflows through both a web UI and CLI:
- Multi-Task Orchestration: Define complex workflows with serial and parallel task execution patterns, with automatic dependency management and synchronization through barriers
- Priority-Based Scheduling: Support for HIGH, NORMAL, and LOW priority levels with intelligent preemption and GPU borrowing across pools to maximize utilization
- Interactive Development: Exec into running containers, port-forward services, and rsync files between local workstations and remote tasks for seamless debugging
- Resource Management: Flexible resource specification with support for GPUs, CPUs, memory, and storage across multiple platforms and node types
- Automatic Rescheduling: Handle transient failures gracefully with configurable retry policies and exit code handling
- Template Support: Create reusable, parameterized workflow specifications with Jinja templating for automation and scaling
Data Management
OSMO's data layer provides efficient storage and access to datasets with versioning and metadata support:
- Dataset Versioning: Track dataset evolution with automatic versioning and deduplication to optimize storage
- Multiple Storage Backends: Support for AWS S3, Azure Blob Storage, and Google Storage with seamless integration
- Efficient Data Transfer: Multi-threaded, multi-process uploads and downloads with automatic resume capabilities
- Collections: Group related datasets together for easier organization and management
- Metadata and Labels: Tag datasets with custom metadata and labels for powerful querying and discovery
- Regex-Based Selection: Upload or download partial datasets using regex patterns for fine-grained control
Applications
The Apps feature allows users to create reusable applications from workflow specifications:
- Parameterized Applications: Define apps with customizable parameters that users can adjust at launch time
- Easy Sharing: Package complex workflows as simple-to-launch applications for team collaboration
- Workflow Abstraction: Hide complexity behind user-friendly interfaces while maintaining full workflow power
Pools and Resource Management
OSMO introduces a sophisticated resource management system based on pools and platforms:
- Pool-Based Access Control: Teams are granted access to specific resource pools with RBAC for secure multi-tenancy within an organization
- Dynamic Pool Sizing: Pool sizes can be adjusted dynamically to respond to changing workload priorities
- Platform Support: Each pool supports one or more platform types (GPU models, architectures) with automatic resource validation
- Resource Sharing: Resources can be allocated to multiple pools simultaneously for maximum utilization
- Quota Management: View and track resource quotas and usage across pools
- Maintenance Mode: Admins can mark pools for maintenance to prevent new submissions during updates
Compute Backend Integration
OSMO seamlessly integrates with Kubernetes clusters and various compute backends:
- Multi-Cluster Support: Manage workflows across multiple Kubernetes clusters (AWS EKS, Azure AKS, GCP GKE, on-premise)
- KAI Scheduler Support: Support for advanced workflow scheduling with NVIDIA KAI Scheduler
- Customizable Pod Templates: Flexible pod template configurations allowing administrators to customize resource requests, limits, tolerations, and node affinities per backend
Web User Interface
A modern, responsive web interface provides comprehensive workflow and data management:
- Workflow Dashboard: View, filter, and manage workflows with real-time status updates
- Interactive Task Graphs: Visualize workflow structure and task dependencies
- Live Log Streaming: Stream logs in real-time from running workflows with syntax highlighting
- Resource Visualization: Monitor cluster resources, pool quotas, and node utilization
- Dataset Browser: Browse, visualize, and manage datasets with metadata editing
- Shell Access: Browser-native terminal for executing commands in running tasks
- Pool Management: View pool information, supported platforms, and available quotas
Command Line Interface
A powerful CLI provides full access to OSMO capabilities with scripting and automation support:
- Intuitive Commands: Organized command structure for workflows, datasets, resources, pools, and configuration
- Multi-Platform Support: Native support for macOS (Apple Silicon) and for both x86-64 and ARM64 architectures on Linux
- Auto-Completion: Tab completion support on Linux and macOS for faster command entry
- Multiple Output Formats: JSON and human-readable text output formats for easy integration with scripts
- Profile Management: Configure default settings for backend, bucket, and notification preferences
- Automatic Reconnection: Port-forward and exec commands automatically reconnect on disconnection
Security and Authentication
Enterprise-grade security features protect workflows and data:
- OIDC Integration: OAuth 2.0-based authentication via Keycloak, which can be configured to connect to other OAuth 2.0 or SAML authentication providers
- RBAC: Role-based access control for pools, backends, and resources
- Token Scoping: Limited-scope JWT tokens with appropriate time-to-live durations
- Limited Scope Access Tokens: Users can create access tokens with restricted scopes, enabling secure and granular control over permissions. These tokens can be used for login and automated access, ensuring users and services only have the access they require.
Framework Integration
OSMO integrates seamlessly with popular AI/ML frameworks and tools, and comes with tutorials to demonstrate their use:
- Distributed Training: TorchRun, DeepSpeed, and elastic training support for multi-node DNN training
- Reinforcement Learning: Isaac Lab integration for RL training workflows
- Simulation: Isaac Sim integration for synthetic data generation (SDG) and simulation workflows
- ROS/ROS2: Support for robotics workflows with multi-node communication and hardware-in-the-loop testing
- Development Tools: Jupyter Notebook, VSCode, and File Browser integration for interactive development
- ML Tools: Weights & Biases (wandb) integration for experiment tracking
- NVIDIA GR00T: Sample workflows for GR00T fine-tuning, GR00T mimic, and GR00T interactive notebook
Getting OSMO
Helm Charts and Containers
Helm charts and docker containers are available in NGC
CLI Client
The installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.