Releases: NVIDIA/OSMO

6.3.1

22 Jun 18:28
ed59b69

Highlights

  • Bulk cancel workflows — Select multiple workflows in the UI list and cancel them with a single confirmed action.
  • Gateway PodMonitors — Envoy and oauth2-proxy now expose Prometheus metrics endpoints for cluster monitoring.
  • Streaming endpoints stop timing out — Workflow log, event, and error-log streams no longer hit the default route timeout mid-stream.
  • Envoy v1.38.1 with header-sanitization refactor — Gateway image upgraded from v1.29.0 and identity-header stripping moved off the Lua filter onto Envoy's native mechanism.
  • OAuth2 callback no longer self-blocks — When oauth2-proxy is enabled, its own callback endpoints bypass the authn metadata filter so login completes.
  • Default role policy merge respects scope — Updates to the shipped default roles append actions only to policies with matching effect and resources, preserving operator-added grants.

Helm Charts

  • Gateway PodMonitors: The service chart now ships PodMonitors for Envoy's admin metrics endpoint and oauth2-proxy's /metrics endpoint when podMonitor.enabled is true and the matching gateway component is deployed. (#1095)
  • Envoy upgrade to v1.38.1: Default gateway.envoy.image bumped from envoyproxy/envoy:v1.29.0. (#1081)
  • Identity header sanitization refactor: Client x-osmo-{user,roles,allowed-pools} headers are now stripped via Envoy's native header sanitization instead of a Lua filter. JWT-only deployments (oauth2-proxy and authz disabled, JWT providers configured) now sanitize client headers as well. Minimal/demo mode (all three auth sources disabled) continues to trust them — see the chart README for the full identity-header trust table. (#1081)
  • HPA-managed deployments skip the replicas field: Gateway Envoy, oauth2-proxy, and authz deployments omit spec.replicas from their manifests, so Helm apply no longer contends with the autoscaler on each reconcile. (#1081)
  • ConfigMap extra annotations: New services.configs.extraAnnotations annotates the generated configs ConfigMap, useful for setting argocd.argoproj.io/sync-options: ServerSideApply=true on large config payloads. (#1081)
  • Streaming API route timeouts: Workflow /logs, /events, and /error_logs routes use timeout: 0s with idle_timeout: 60s so quiet-but-open streams are not cut. Other /api/ and /client/ routes get an explicit 60s timeout, up from Envoy's 15s default. (#1085)
  • OAuth2 control routes skip ext_authz: /signout and /oauth2/ routes disable the external authorization filter so the browser can complete login and logout without authz sidecar calls. (#1085)
  • OAuth2 callback added to authn skip paths: Setting gateway.oauth2Proxy.enabled now adds /oauth2/ and /signout to the authn skip set, so oauth2-proxy callbacks reach the proxy instead of being pre-checked by its own /oauth2/auth endpoint. (#1091)
  • Router affinity cookie defaults to session lifetime: gateway.envoy.routerRoute.cookie.ttl now defaults to 0s (session cookie) instead of 60s. CLI sessions are no longer reassigned to a different router pod after 60 seconds of inactivity. (#1098)
  • Logger upstream uses headless service: The gateway now connects to osmo-logger-headless:8000 so Envoy load-balances directly to pod IPs, avoiding the default 1024-connection circuit breaker that the cluster-IP service was hitting. (#1098)
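
The identity-header rule described above can be sketched in a few lines. This is only an illustration of the two cases the notes spell out (JWT-only deployments sanitize, minimal/demo mode trusts); it assumes that any enabled auth source triggers sanitization, and the function and parameter names are hypothetical. The chart README's trust table remains the authority, and the real stripping happens inside Envoy, not in Python.

```python
# Headers named in the release notes as client-supplied identity headers.
IDENTITY_HEADERS = frozenset({"x-osmo-user", "x-osmo-roles", "x-osmo-allowed-pools"})


def trusts_client_identity(oauth2_proxy: bool, authz: bool, jwt_providers: bool) -> bool:
    """Assumption: client headers are trusted only when all three auth sources are off."""
    return not (oauth2_proxy or authz or jwt_providers)


def sanitize(headers: dict, oauth2_proxy: bool = False, authz: bool = False,
             jwt_providers: bool = False) -> dict:
    """Return a copy of `headers` with identity headers removed unless they are trusted."""
    if trusts_client_identity(oauth2_proxy, authz, jwt_providers):
        return dict(headers)
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in IDENTITY_HEADERS  # header names are case-insensitive
    }
```

For example, a JWT-only deployment would call `sanitize(request_headers, jwt_providers=True)` and drop a spoofed `x-osmo-user` header, while a call with all three flags left at `False` returns the headers unchanged.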

Workflow Execution

  • Retry pods recreate generated file secrets: Retrying a pod that mounts file-backed credential secrets now recreates the per-pod Secret alongside the new pod, so the retry no longer fails to mount missing file references. (#1090)

Web UI

  • Bulk cancel workflows: Select multiple rows in the Workflows list and cancel them with one action behind a confirmation dialog. (#1050)

Authorization

  • Default role policy merge by scope: Default-role updates compare and append actions per (effect, resources) scope instead of flattening everything into the first policy of the existing role. Operator-added grants are preserved verbatim, and missing scopes from the shipped default are appended as new policies. (#1072)
  • Default osmo-user role narrowed to default pool: The shipped role now grants read/list actions across all resources and scopes workflow:* to pool/default only. Existing deployments retain their stored policies because the merge is append-only; fresh installs and operators who reseed defaults pick up the narrower scope. (#1072)
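
A minimal sketch of the scope-aware merge described above, assuming each policy is a dict with `effect`, `resources`, and `actions` keys. The data shape, the function names, and the choice to ignore resource ordering when comparing scopes are all assumptions made for illustration; this is not OSMO's actual implementation.

```python
from copy import deepcopy


def scope_key(policy: dict) -> tuple:
    """Identify a policy by its (effect, resources) scope; resource order is ignored."""
    return (policy["effect"], frozenset(policy["resources"]))


def merge_default_role(existing: list, shipped: list) -> list:
    """Append shipped default actions to matching scopes without touching other policies."""
    merged = deepcopy(existing)  # never mutate the stored role in place
    by_scope = {}
    for policy in merged:
        # If two stored policies share a scope, the first one receives new actions.
        by_scope.setdefault(scope_key(policy), policy)

    for default in shipped:
        target = by_scope.get(scope_key(default))
        if target is None:
            # Scope is missing from the stored role: append it as a new policy.
            new_policy = deepcopy(default)
            merged.append(new_policy)
            by_scope[scope_key(default)] = new_policy
            continue
        for action in default["actions"]:
            if action not in target["actions"]:
                target["actions"].append(action)  # append-only: nothing is removed
    return merged
```

Because the merge never removes actions, a stored policy that already grants `workflow:*` on all resources keeps that grant after an upgrade, which matches the note that existing deployments retain their stored policies.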

Getting OSMO

Helm Charts and Containers

Helm charts and container images are available on NGC.

CLI Client

Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.

6.3.0

05 May 20:44
07b7140

Highlights

  • ConfigMap-based configuration — All service configs (pools, backends, pod templates, roles, and more) can now be managed as Helm values via a Kubernetes ConfigMap, following standard K8s patterns and enabling GitOps workflows.
  • TLS support — The service chart now terminates TLS at the gateway, with values for cert/key, redirect from HTTP, and SAN configuration.
  • Service chart consolidation — The standalone router and web-ui Helm charts have been folded into the service chart, making a full deployment a single Helm release.
  • Multi-provider deploy scripts — deploy-k8s.sh now provisions OSMO on Azure AKS, AWS EKS, microk8s, or any existing Kubernetes cluster, with idempotent installers for KAI Scheduler, GPU Operator, MinIO, and configurable storage backends (MinIO, Azure Blob, AWS S3, BYO S3).
  • Per-group timeouts — exec_timeout and queue_timeout now meter each group independently instead of running against the workflow as a whole, so a stuck simulation group no longer kills the rest of the workflow.
  • Dataset CLI and API deprecated — osmo dataset commands and the /datasets API endpoints are deprecated and will be removed in 6.4. Migrate to workflow-managed dataset outputs.
  • Rsync download support — Pull files from running workflow tasks to your local machine with osmo workflow rsync download, complementing the existing upload capability.
  • Visual transfer progress — File sync operations now display a progress bar showing bytes transferred, percentage, rate, and ETA.
  • Workload identity for core services — Run OSMO services under a cloud-issued federated identity (Azure Workload Identity on AKS/Arc, AWS IRSA / EKS Pod Identity) via new cloud-neutral serviceAccount annotations and per-component extraPodLabels hooks, removing the need to mount cloud storage keys as Kubernetes Secrets.
  • Privilege escalation fix — Policies with empty resources lists no longer grant access to resource-scoped endpoints.
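
To make the privilege escalation fix concrete, the check below shows one way an empty resources list can be made to match nothing. It is a sketch under assumptions: the policy shape, the glob-style matching via `fnmatchcase`, the `workflow:submit` action name in the usage note, and the handling of endpoints that are not resource-scoped are all hypothetical rather than taken from OSMO's code.

```python
from fnmatch import fnmatchcase
from typing import Optional


def policy_grants(policy: dict, action: str, resource: Optional[str]) -> bool:
    """Return True if an allow policy covers `action` on `resource`.

    `resource` is None for endpoints that are not scoped to a resource
    (assumption: such endpoints need only a matching action).
    """
    if policy.get("effect") != "allow":
        return False
    if not any(fnmatchcase(action, pattern) for pattern in policy.get("actions", [])):
        return False
    if resource is None:
        return True
    # any() over an empty list is False, so an empty resources list grants
    # no access to resource-scoped endpoints.
    return any(fnmatchcase(resource, pattern) for pattern in policy.get("resources", []))
```

With this check, a policy such as `{"effect": "allow", "actions": ["workflow:*"], "resources": []}` does not authorize `workflow:submit` on `pool/default`, whereas the same policy with `resources: ["pool/default"]` does.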

Breaking Changes

  • Router chart removed: The standalone router Helm chart is gone. Router pods now deploy as part of the service chart. Existing router resources (osmo-router, osmo-router-headless) continue to work, but you must remove the separate router Helm release before upgrading. See the 6.2 to 6.3 upgrade guide for migration steps. (#897)
  • Web UI chart removed: The standalone web-ui Helm chart has been merged into the service chart. Set ui.enabled: true in service values to deploy the UI alongside the API. Remove the separate web-ui release before upgrading. (#907)
  • Squid proxy removed from backend operator: The egress allowlist and squid-proxy sidecar have been removed from the backend operator chart. Network policies now restrict pod-to-pod access directly. (#823)
  • Per-group timeout semantics: exec_timeout and queue_timeout are now enforced per group (clock starts on the group's RUNNING or SCHEDULING transition) instead of per workflow. An expired group is marked FAILED_EXEC_TIMEOUT or FAILED_QUEUE_TIMEOUT; sibling groups continue and the workflow status aggregates only after all groups finish. (#925)
  • Dataset CLI and API deprecated: All osmo dataset subcommands print a stderr deprecation warning, and the /datasets REST endpoints are marked deprecated in the OpenAPI schema. The Datasets page in the UI shows a deprecation banner. Both will be removed in 6.4. (#872)
  • S3 addressing default: For S3-compatible backends with a custom endpoint_url, the addressing style now defaults to virtual-hosted instead of boto3's auto-selection (which picks path style for custom endpoints), fixing compatibility with providers that require virtual hosts. If a backend requires path addressing, set the addressing_style attribute to path, or force OSMO to always use path addressing via the AWS_S3_FORCE_PATH_STYLE environment variable. (#950)
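
The new S3 addressing default can be summarized as a small selection function. This is a hedged sketch: the function name, the accepted truthy values for AWS_S3_FORCE_PATH_STYLE, the precedence of the environment variable over the addressing_style attribute, and the fallback to auto when no custom endpoint is set are assumptions, not documented OSMO behavior.

```python
import os
from typing import Mapping, Optional


def resolve_addressing_style(
    endpoint_url: Optional[str],
    addressing_style: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick an S3 addressing style: "path", "virtual", or "auto"."""
    env = os.environ if env is None else env
    # Assumption: the environment variable wins and accepts common truthy spellings.
    if env.get("AWS_S3_FORCE_PATH_STYLE", "").strip().lower() in ("1", "true", "yes"):
        return "path"
    if addressing_style:
        return addressing_style  # explicit per-backend setting
    if endpoint_url:
        return "virtual"  # 6.3.0 default for custom endpoints
    return "auto"  # assumption: plain AWS S3 keeps boto3's auto-selection
```

The returned value is one of the strings boto3 accepts for this option, so it could be passed as `botocore.config.Config(s3={"addressing_style": style})` when building a client.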

Helm Charts

  • ConfigMap configuration mode: Set services.configs.enabled: true to manage all service configs via Helm values. CLI/API writes return HTTP 409 when active. The chart ships with default roles, pod templates, resource validations, backend, and pool. (#822)

  • ConfigMap mode for worker, agent, and logger: The ConfigMapWatcher now runs in the worker, agent, and logger services. Previously only the API service watched the ConfigMap, so workflow pods built by the worker could be constructed from stale config. (#926)

  • TLS termination at the gateway: Configure a serving cert/key, optional HTTP-to-HTTPS redirect, and SAN list via gateway.tls. The gateway template generates the matching Envoy listener config. (#953)

  • Cloud workload identity: New top-level serviceAccount block (create, name, annotations) and per-component extraPodLabels on agent, api, worker, logger, router, and delayedJobMonitor. The hooks are cloud-neutral — set the annotations and labels your CSP's identity webhook expects:

    • Azure (AKS / Arc): annotate the SA with azure.workload.identity/client-id: <uami-client-id> and label pods with azure.workload.identity/use: "true". The Azure storage backend falls back to DefaultAzureCredential when no static connection string is supplied.
    • AWS (EKS IRSA / Pod Identity): annotate the SA with eks.amazonaws.com/role-arn: <iam-role-arn>. The S3 backend picks up the federated token from boto3's default credential chain — no pod labels required.
  • Gateway consolidation: A unified gateway now handles load balancing for all service types (API, router, UI), simplifying ingress configuration. (#817, #799)

  • Gateway extension hooks: Inject custom Envoy filters and additive auth-skip paths via gateway.envoy.extensions and gateway.envoy.authSkipPaths, useful for sidecar integrations and bypassing authz on specific endpoints. (#1009)

  • Default identity headers: Minimal deployments can now inject default x-osmo-user, x-osmo-roles, and x-osmo-allowed-pools headers for unauthenticated browser requests via gateway.envoy.defaultIdentity values. (#902)

  • oauth2-proxy extraEnv: Expose environment variables on the oauth2-proxy container via gateway.oauth2Proxy.extraEnv, needed for Redis AUTH when using session storage. (#898)

  • Custom HPA metrics: Specify custom metrics for Horizontal Pod Autoscalers on service components. (#858)

  • Pool computed fields resolved at load time: ConfigMap pools no longer require pre-expanded parsed_pod_template and parsed_resource_validations, reducing config file size by ~60%. (#866)

  • Per-field Secret mounts: Create credential Secrets with kubectl --from-literal instead of packaging all fields into a single cred.yaml. (#884)

  • Default pod templates on default pool: The chart's default pool now sets common_pod_template, so workflows submitted without an explicit template pick up default_ctrl and default_user automatically. (#1010, #1012)

  • Backend-operator startup probe configurable: startupProbe thresholds on the backend listener and worker are now exposed in values, with relaxed defaults to handle slow image pulls on cold clusters. (#961)

  • Service startup probe extended: The API service startupProbe failure threshold now allows up to ~2 minutes for migrations and DB warm-up before the pod is restarted. (#967)

  • podMonitor disabled by default: Both the service and backend-operator charts now default podMonitor.enabled to false, avoiding errors on clusters without Prometheus Operator CRDs installed. (#962, #963)

  • Config export script: New deployments/upgrades/export_configs_to_helm.py exports existing database configs to Helm values format. (#866)

Deployment Scripts

  • Multi-provider deploy: deploy-k8s.sh provisions a Kubernetes cluster on Azure AKS, AWS EKS, microk8s, or registers an existing cluster, then installs OSMO end-to-end. Cluster-agnostic dependency installers detect existing KAI Scheduler, GPU Operator, and MinIO so re-runs are safe. (#979)
  • Storage backend wiring: configure-storage.sh provisions and registers the workflow storage backend for MinIO, Azure Blob, AWS S3, or a bring-your-own S3 endpoint, including credential creation and bucket setup. (#979, #988)
  • Idempotent token mint: Backend operator token reconciliation now deletes any pre-existing backend-token before re-minting, so partial prior runs and microk8s PVC carryover no longer wedge re-deploys. (#988)
  • Helm values for minimal install: deploy-osmo-minimal.sh accepts --values to layer custom Helm values on top of the minimal preset. (#993)

Workflow Execution

  • Per-group exec and queue timeouts: Each group's clock starts on its own RUNNING (exec) or SCHEDULING (queue) transition. Expired groups are marked FAILED_EXEC_TIMEOUT or FAILED_QUEUE_TIMEOUT; downstream groups cascade as FAILED_UPSTREAM, sibling groups keep running. Delayed jobs serialized before the upgrade fall back to the previous workflow-level enforcement with a warning log. (#925)
  • Pool quota accounting handles Jinja: osmo-ctrl resource requests and limits are now pre-rendered for pool-quota accounting, so templated values like {% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %} are counted correctly instead of being silently treated as zero. (#931)
  • service_auth wired into worker, agent, logger: These services now read service_auth and stop reading service_base_url from the database, fixing config-mode authentication for non-API pods. (#930)
  • KAI queues sync on every registration: Backend registration now syncs KAI Scheduler queues unconditionally, instead of only on the first registration. (#941)
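
The per-group timeout semantics above can be sketched as follows. The `Group` type, the function names, and the strict "greater than" comparison are assumptions chosen for illustration; only the status names and the rule that each clock starts on the group's own SCHEDULING or RUNNING transition come from the notes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Group:
    name: str
    scheduling_at: Optional[datetime] = None  # when the group entered SCHEDULING
    running_at: Optional[datetime] = None     # when the group entered RUNNING


def timeout_status(group: Group, now: datetime,
                   exec_timeout: timedelta, queue_timeout: timedelta) -> Optional[str]:
    """Return a failure status if this group's own clock has expired, else None."""
    if group.running_at is not None:
        return "FAILED_EXEC_TIMEOUT" if now - group.running_at > exec_timeout else None
    if group.scheduling_at is not None and now - group.scheduling_at > queue_timeout:
        return "FAILED_QUEUE_TIMEOUT"
    return None


def evaluate(groups: List[Group], upstream: Dict[str, List[str]], now: datetime,
             exec_timeout: timedelta, queue_timeout: timedelta) -> Dict[str, Optional[str]]:
    """Map each group to a failure status or None; `groups` must be in dependency order."""
    statuses: Dict[str, Optional[str]] = {}
    for group in groups:
        if any(statuses.get(dep) is not None for dep in upstream.get(group.name, [])):
            statuses[group.name] = "FAILED_UPSTREAM"  # cascade from a failed dependency
        else:
            statuses[group.name] = timeout_status(group, now, exec_timeout, queue_timeout)
    return statuses
```

In this sketch a group that started late is measured from its own start time, so a sibling that has been running for only a few minutes is unaffected when an older group exceeds exec_timeout.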

CLI

  • Rsync download: Pull files from running tasks to your local machine with `osmo workflow rsync download wf-id ...

6.3.0-prerelease-rc11

22 May 01:52
07b7140

Pre-release

Getting OSMO

Helm Charts and Containers

Helm charts and container images are available on NGC.

CLI Client

Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.

6.3.0-prerelease-rc9

14 May 22:23
ea72cff

Pre-release

Getting OSMO

Helm Charts and Containers

Helm charts and container images are available on NGC.

CLI Client

Installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.

6.2.10

25 Mar 15:55
7904629

Highlights

  • Authorization Bug Fixes — Multiple RBAC path corrections for credentials, workflow exec, and rsync operations
  • Interactive Exec Improvements — Dynamic terminal sizing and resize support for full-screen tools like vim when exec-ing into a running workflow task
  • Default Pool Submission — Users with the osmo-user role can now submit workflows to the default pool

Authorization

  • Credentials create path: Fixed RBAC action registry to include the more specific path required for credential creation (#737)
  • Workflow exec permissions: Corrected authorization paths for workflow exec and rsync operations (#738, #739)
  • Restart API access: Added missing action registry entry for the restart API (#716)

Interactive Development

osmo workflow exec and the browser shell now correctly handle terminal geometry, improving the experience when running interactive tools inside a workflow task:

  • Dynamic terminal sizing: The terminal reports its actual dimensions to the runtime, fixing rendering of full-screen applications like vim (#717)
  • Shell resize support: Resizing the browser shell window propagates correctly to the running session (#727)

Workflow Engine

  • Default pool submission: osmo-user can now submit workflows without specifying a pool, falling back to the user's default pool (#728)

Brev Deployment

  • New UI support: Brev deployments now support the new UI by allowing cross-domain URL proxying
  • KAI Scheduler v0.13.4: Quick-start and one-click launchable updated to KAI Scheduler v0.13.4 with corrected Helm chart registry path (#725)

Getting OSMO

Helm Charts and Containers

Helm charts and Docker containers are available on NGC.

CLI Client

The installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.

6.2.8

18 Mar 20:01
8dab0c7

Highlights

  • RBAC & Authentication Overhaul — Full OAuth2 proxy integration, RBAC authorization sidecar, user mapping, and JWT-based auth APIs
  • New UI Platform — Complete UI rewrite with OAuth2 integration, dataset collections, workflow submission flow, and WCAG 2.1 accessibility compliance
  • AI Agentic Skills — Agent skills framework with workflow-expert, logs-reader, and language-specific expert sub-agents for autoscaling workflow submissions
  • Database Migration — pgroll-based database migration system
  • NVLink & Topology-Aware Scheduling — NVLink topology support and intelligent pool grouping for shared nodes within the same nodeset

Authentication & Authorization

OSMO introduces a comprehensive authentication and authorization layer to secure access across all services:

  • RBAC Authorization Sidecar: Dedicated authz sidecar deployed alongside services, enabled by default, enforcing role-based access control at the request level (#445, #471)
  • OAuth2 Proxy Integration: Full OAuth2 proxy support for both UI and backend services, replacing the previous auth model with standard OAuth2 flows including device code login and token refresh (#443, #520, #585)
  • User Mapping: Map external identity provider users to OSMO roles and pool permissions, with syncing between role maps and pool assignments (#418, #515)

AI Agentic Skills

A new agent framework enables AI-driven workflow management and codebase assistance:

  • Skills Framework: Extensible skill system with cross-platform installation via npx, structured for framework-agnostic usage (#555, #598, #599, #605)
  • Workflow Expert Agent: Specialized agent with detailed knowledge of workflow execution phases for intelligent troubleshooting and guidance (#565)

Scheduling & Compute

  • NVLink Topology Support: Scheduling-aware NVLink detection enabling topology-aware task placement for multi-GPU workloads (#479)
  • KAI Scheduler Default: Switched default scheduler to KAI for improved scheduling performance (#115)

Workflow Engine & Backend

  • CLI Workflow Events: Workflow event streaming available through the CLI for real-time monitoring (#533)
  • Supporting Large Workflows: The WebSocket connection between the agent service and the backend worker no longer breaks on large workflows, and status updates are at least 30% faster (#398, #391, #655, #676)
  • Workflow Submission Speedup: For large workflows (e.g. 100 tasks), workflow submission response is 4x faster (#701)

Data & Storage

  • Non-AWS S3 Support: S3-compatible storage backends (MinIO, Azure Blob, etc.) work without requiring AWS environment variables, with automatic endpoint detection during data auth validation (#421, #385)
  • Credential-less Data Operations: Data Access Layer supports operations without explicit credentials when environment-based auth is available, with client-side auth checks (#159, #177)

Database

  • pgroll Migration System: Pre-upgrade migration jobs using pgroll for schema changes

Web UI

  • The UI has been completely rewritten and relocated to /src/ui, replacing the legacy frontend.

Getting OSMO

Helm Charts and Containers

Helm charts and Docker containers are available on NGC.

CLI Client

The installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.

6.2.6

05 Mar 21:07
94e6f63

Pre-release

Release Candidate for v6.2

6.0.0

20 Nov 18:25

Major features

Workflow Management

OSMO provides a sophisticated workflow orchestration system that allows users to define, submit, and monitor complex AI workflows through both a web UI and CLI:

  • Multi-Task Orchestration: Define complex workflows with serial and parallel task execution patterns, with automatic dependency management and synchronization through barriers
  • Priority-Based Scheduling: Support for HIGH, NORMAL, and LOW priority levels with intelligent preemption and GPU borrowing across pools to maximize utilization
  • Interactive Development: Exec into running containers, port-forward services, and rsync files between local workstations and remote tasks for seamless debugging
  • Resource Management: Flexible resource specification with support for GPUs, CPUs, memory, and storage across multiple platforms and node types
  • Automatic Rescheduling: Handle transient failures gracefully with configurable retry policies and exit code handling
  • Template Support: Create reusable, parameterized workflow specifications with Jinja templating for automation and scaling

Data Management

OSMO's data layer provides efficient storage and access to datasets with versioning and metadata support:

  • Dataset Versioning: Track dataset evolution with automatic versioning and deduplication to optimize storage
  • Multiple Storage Backends: Support for AWS S3, Azure Blob Storage, and Google Cloud Storage with seamless integration
  • Efficient Data Transfer: Multi-threaded, multi-process uploads and downloads with automatic resume capabilities
  • Collections: Group related datasets together for easier organization and management
  • Metadata and Labels: Tag datasets with custom metadata and labels for powerful querying and discovery
  • Regex-Based Selection: Upload or download partial datasets using regex patterns for fine-grained control

Applications

The Apps feature allows users to create reusable applications from workflow specifications:

  • Parameterized Applications: Define apps with customizable parameters that users can adjust at launch time
  • Easy Sharing: Package complex workflows as simple-to-launch applications for team collaboration
  • Workflow Abstraction: Hide complexity behind user-friendly interfaces while maintaining full workflow power

Pools and Resource Management

OSMO introduces a sophisticated resource management system based on pools and platforms:

  • Pool-Based Access Control: Teams are granted access to specific resource pools with RBAC for secure multi-tenancy within an organization
  • Dynamic Pool Sizing: Pool sizes can be adjusted dynamically to respond to changing workload priorities
  • Platform Support: Each pool supports one or more platform types (GPU models, architectures) with automatic resource validation
  • Resource Sharing: Resources can be allocated to multiple pools simultaneously for maximum utilization
  • Quota Management: View and track resource quotas and usage across pools
  • Maintenance Mode: Admins can mark pools for maintenance to prevent new submissions during updates

Compute Backend Integration

OSMO seamlessly integrates with Kubernetes clusters and various compute backends:

  • Multi-Cluster Support: Manage workflows across multiple Kubernetes clusters (AWS EKS, Azure AKS, GCP GKE, on-premise)
  • KAI Scheduler Support: Support for advanced workflow scheduling with NVIDIA KAI Scheduler
  • Customizable Pod Templates: Flexible pod template configurations allowing administrators to customize resource requests, limits, tolerations, and node affinities per backend

Web User Interface

A modern, responsive web interface provides comprehensive workflow and data management:

  • Workflow Dashboard: View, filter, and manage workflows with real-time status updates
  • Interactive Task Graphs: Visualize workflow structure and task dependencies
  • Live Log Streaming: Stream logs in real-time from running workflows with syntax highlighting
  • Resource Visualization: Monitor cluster resources, pool quotas, and node utilization
  • Dataset Browser: Browse, visualize, and manage datasets with metadata editing
  • Shell Access: Browser-native terminal for executing commands in running tasks
  • Pool Management: View pool information, supported platforms, and available quotas

Command Line Interface

A powerful CLI provides full access to OSMO capabilities with scripting and automation support:

  • Intuitive Commands: Organized command structure for workflows, datasets, resources, pools, and configuration
  • Multi-Platform Support: Native support for macOS (Apple Silicon) and for both x86-64 and ARM64 architectures on Linux
  • Auto-Completion: Tab completion support on Linux and macOS for faster command entry
  • Multiple Output Formats: JSON and human-readable text output formats for easy integration with scripts
  • Profile Management: Configure default settings for backend, bucket, and notification preferences
  • Automatic Reconnection: Port-forward and exec commands automatically reconnect on disconnection

Security and Authentication

Enterprise-grade security features protect workflows and data:

  • OIDC Integration: OAuth 2.0-based authentication via Keycloak, which can be configured to connect to other OAuth 2.0 or SAML authentication providers
  • RBAC: Role-based access control for pools, backends, and resources
  • Token Scoping: Limited-scope JWT tokens with appropriate time-to-live durations
  • Limited Scope Access Tokens: Users can create access tokens with restricted scopes, enabling secure and granular control over permissions. These tokens can be used for login and automated access, ensuring users and services only have the access they require.

Framework Integration

OSMO integrates seamlessly with popular AI/ML frameworks and tools, and comes with tutorials to demonstrate their use:

  • Distributed Training: TorchRun, DeepSpeed, and elastic training support for multi-node DNN training
  • Reinforcement Learning: Isaac Lab integration for RL training workflows
  • Simulation: Isaac Sim integration for synthetic data generation (SDG) and simulation workflows
  • ROS/ROS2: Support for robotics workflows with multi-node communication and hardware-in-the-loop testing
  • Development Tools: Jupyter Notebook, VSCode, and File Browser integration for interactive development
  • ML Tools: Weights & Biases (wandb) integration for experiment tracking
  • NVIDIA GR00T: Sample workflows for GR00T fine-tuning, GR00T mimic, and GR00T interactive notebook

Getting OSMO

Helm Charts and Containers

Helm charts and Docker containers are available on NGC.

CLI Client

The installers for the CLI client for macOS (Apple Silicon), x86-64 Linux, and ARM64 Linux are attached as assets to this release.