Releases: mivertowski/RustCompute

v1.1.0

20 Apr 20:29

v1.1.0 — Multi-GPU runtime + VynGraph NSAI integration

Second release. Adds multi-GPU migration over NVLink P2P, per-tenant
K2K isolation, PROV-O provenance, hot rule reload, live introspection
streaming, and TLA+ formally verified protocols. Validated on 2× H100
NVL (Azure NC80adis_H100_v5, NV12 NVLink topology).

Headline results (2× NVIDIA H100 NVL, NVLink 12-link):
- NVLink P2P migration: 8.7× faster than host-staging at 16 MiB
- Multi-GPU K2K sustained bandwidth: 258 GB/s (81% of 318 GB/s peak)
- K2K tier hierarchy measured directly: SMEM 6.7us / DSMEM 9-15us /
  HBM 10-18us (all three tiers via cluster_hbm_k2k kernel)
- Lifecycle rule overhead: 23 ns mean / 30 ns p99, flat across all
  5 rules (Spawn/Activate/Quiesce/Terminate/Restart)
- Sustained throughput: 5.10M ops/s, CV 0.66% over 4× 60s trials
- Cross-tenant leak count: 0 across 13 isolation tests
- Formal verification: 6/6 TLA+ specs pass TLC, no counterexamples

Single-GPU v1.0 baseline preserved:
- 8,698× faster persistent actor injection vs cuLaunchKernel
- 3,005× faster than CUDA Graph replay
- 0.628 us cluster.sync() (2.98× vs grid.sync())
- 0.544 ns zero-copy serialization

New since v1.0.0:
- Multi-GPU runtime facade (cuCtxEnablePeerAccess + cuMemcpyPeerAsync)
- NVLink topology probe + PlacementHint::NvlinkPreferred
- 3-phase actor migration with CRC32 byte-for-byte verification
- PROV-O provenance header (8 relations, chain walk, signature hook)
- Multi-tenant K2K (per-tenant sub-brokers, AuditTag{org_id, engagement_id},
  quota enforcement, cross-tenant rejection audit)
- Hot rule reload (CompiledRule artifact, version-monotonic, rollback)
- Live introspection streaming (EWMA, drop-tolerant ring)
- Six TLA+ specs + TLC model-checking pipeline
- HBM tier direct K2K measurement via cluster_hbm_k2k kernel
- Intra-block warp work stealing (warp_work_steal kernel)
- Delta checkpoints (content_digest, delta_from, applied_with_delta)
- cudarc 0.19.3 upgrade; RUSTUP_TOOLCHAIN stabilized to 1.95 in CI
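The CRC32 byte-for-byte verification used in the 3-phase migration can be sketched as follows. This is the standard bitwise CRC-32 (IEEE polynomial), shown as a CPU illustration of the check, not the crate's actual implementation; `migration_verified` is a hypothetical helper name.

```rust
// Standard reflected CRC-32 (poly 0xEDB88320, init/final-xor 0xFFFFFFFF).
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // 0x0 or 0xFFFF_FFFF
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

// Migration check: checksum the serialized actor state on the source GPU,
// re-checksum after the peer copy lands, and compare.
fn migration_verified(src_state: &[u8], dst_state: &[u8]) -> bool {
    crc32(src_state) == crc32(dst_state)
}

fn main() {
    let state = vec![0xABu8; 4096];
    assert!(migration_verified(&state, &state.clone()));
    // Standard CRC-32 check value for "123456789":
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    println!("crc32 ok");
}
```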

See CHANGELOG.md for full details and
docs/benchmarks/v1.1-2x-h100-results.md for the reproducible,
paper-quality benchmark suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v0.4.2: Warp-Shuffle Reductions, __nanosleep, libcu++ Atomics

06 Feb 22:13

What's New

This release upgrades the CUDA codegen with practical findings from CUDA hardware research, targeting CC 6.0+ GPUs with the existing cudarc 0.18.2 runtime.

Warp-Shuffle Block Reductions

  • Two-phase warp-shuffle reduction replaces tree reduction in all generated CUDA reduction code
  • Phase 1: Intra-warp __shfl_down_sync(0xFFFFFFFF, val, offset) — zero __syncthreads() calls
  • Phase 2: Cross-warp reduction via shared memory — one __syncthreads() call
  • Reduces barrier count from O(log N) to 1 per block reduction (e.g., 9 → 1 for 512-thread blocks)
  • Applied to: persistent FDTD energy reduction, standalone block/grid reduce helpers, and all inline reduction generators
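The two-phase shape can be illustrated on the CPU as follows. This Rust sketch mirrors the structure of the generated CUDA, not the generated code itself: `warp_reduce` plays the role of the `__shfl_down_sync` loop, and the final pass over per-warp partials corresponds to the single shared-memory step behind the one `__syncthreads()`.

```rust
// Phase 1 analogue: the offset-halving loop of __shfl_down_sync, run on a
// 32-lane "warp" with no barrier.
fn warp_reduce(mut lane: [f32; 32]) -> f32 {
    let mut offset = 16;
    while offset > 0 {
        for i in 0..offset {
            lane[i] += lane[i + offset]; // shfl_down(val, offset) + add
        }
        offset /= 2;
    }
    lane[0]
}

// Full 512-thread block reduction: 16 warps reduce independently, then the
// 16 partials are combined once (the single cross-warp barrier step).
fn block_reduce(vals: &[f32; 512]) -> f32 {
    let mut partials = [0.0f32; 16];
    for (w, chunk) in vals.chunks_exact(32).enumerate() {
        let mut lane = [0.0f32; 32];
        lane.copy_from_slice(chunk);
        partials[w] = warp_reduce(lane); // phase 1, no barrier
    }
    partials.iter().sum() // phase 2, one barrier in the generated code
}

fn main() {
    let vals = [1.0f32; 512];
    assert_eq!(block_reduce(&vals), 512.0);
    println!("two-phase reduce ok");
}
```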

__nanosleep() Power Efficiency

  • Persistent FDTD idle spin-wait now uses __nanosleep() instead of volatile counter loop
  • Software grid barrier spin-loop uses __nanosleep(100) to reduce power consumption
  • Configurable via PersistentFdtdConfig::with_idle_sleep(ns) (default: 1000ns)

libcu++ Ordered Atomics (opt-in)

  • Opt-in cuda::atomic_ref support for H2K/K2H queue operations and software barriers
  • Uses memory_order_acquire/memory_order_release instead of __threadfence_system() pairs
  • Software barrier uses cuda::thread_scope_device (narrower scope) with memory_order_acq_rel
  • Compile-time CUDA 11.0+ version guard
  • Enable via PersistentFdtdConfig::with_libcupp_atomics(true)
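A minimal mock of the configuration surface named above, for illustration. The builder method names (`with_idle_sleep`, `with_libcupp_atomics`) and the 1000 ns default come from these notes; the struct layout is an assumption, not the crate's real type.

```rust
// Hypothetical stand-in for the codegen config described in the notes.
#[derive(Clone, Copy, Debug, PartialEq)]
struct PersistentFdtdConfig {
    idle_sleep_ns: u32,    // passed to __nanosleep() in the idle spin-wait
    libcupp_atomics: bool, // opt-in cuda::atomic_ref codegen
}

impl PersistentFdtdConfig {
    fn new() -> Self {
        // Notes: 1000 ns default sleep; libcu++ atomics are opt-in.
        Self { idle_sleep_ns: 1000, libcupp_atomics: false }
    }
    fn with_idle_sleep(mut self, ns: u32) -> Self {
        self.idle_sleep_ns = ns;
        self
    }
    fn with_libcupp_atomics(mut self, enabled: bool) -> Self {
        self.libcupp_atomics = enabled;
        self
    }
}

fn main() {
    let cfg = PersistentFdtdConfig::new()
        .with_idle_sleep(100)
        .with_libcupp_atomics(true);
    assert_eq!(cfg.idle_sleep_ns, 100);
    assert!(cfg.libcupp_atomics);
    println!("config ok");
}
```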

Files Changed

  • crates/ringkernel-cuda-codegen/src/persistent_fdtd.rs — config fields, nanosleep, warp-shuffle reduction, libcu++ atomics
  • crates/ringkernel-cuda-codegen/src/reduction_intrinsics.rs — warp-shuffle upgrade for all reduction helpers

Test Results

  • 215 codegen unit tests + 12 integration tests — all passing
  • 6 CUDA GPU execution tests — verified on RTX 2000 Ada (CC 8.9)
  • Full workspace — zero failures

Full Changelog: v0.4.1...v0.4.2

v0.4.1

06 Feb 21:19

What's New

Property-Based Testing

  • 13 proptest property tests for queue invariants (FIFO ordering, capacity bounds, stats consistency) and HLC properties (total ordering, causality preservation, pack/unpack round-trip)
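The FIFO-ordering and capacity-bound invariants those tests exercise look like the predicate below, written against a plain `VecDeque` stand-in. The real tests run proptest against the crate's queues; this sketch just states the property being checked.

```rust
use std::collections::VecDeque;

// Property: for any input sequence and capacity, (a) the queue never holds
// more than `cap` items, and (b) pops return exactly the accepted pushes,
// in push order.
fn fifo_and_capacity_hold(input: &[u32], cap: usize) -> bool {
    let mut q: VecDeque<u32> = VecDeque::new();
    let mut accepted = Vec::new();
    for &x in input {
        if q.len() < cap {
            q.push_back(x); // push is rejected once the queue is full
            accepted.push(x);
        }
        if q.len() > cap {
            return false; // capacity bound violated
        }
    }
    let popped: Vec<u32> = std::iter::from_fn(|| q.pop_front()).collect();
    popped == accepted // FIFO ordering
}

fn main() {
    assert!(fifo_and_capacity_hold(&[3, 1, 4, 1, 5, 9, 2, 6], 4));
    assert!(fifo_and_capacity_hold(&[], 0));
    println!("queue invariants ok");
}
```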

Ecosystem Feature Bundles

  • web = axum + tower + grpc
  • data = arrow + polars
  • monitoring = tracing-integration + prometheus

Codebase Consolidation

  • Shared DSL marker functions — 27 functions deduplicated across CUDA and WGSL codegen backends (~300 lines removed)
  • unavailable_backend! macro — single macro replaces triplicated backend stubs (~100 lines removed)
  • Structured logging — replaced eprintln! with tracing macros across 6 crates
  • Unsafe documentation: // SAFETY: comments on all ~80 unsafe blocks in GPU code
  • Hot-path #[inline] — queue operations, HLC timestamps, control block accessors

Bug Fixes

  • Tenant suspension now correctly deactivates tenants (was a no-op)
  • Handler registration returns Result instead of panicking on duplicate ID
  • TLS session resumption stores actual session ticket data
  • CloudWatch audit sink returns explicit error instead of silently dropping events

Security Upgrades

  • jsonwebtoken 9.2 → 10.3.0 (fixes type confusion auth bypass)
  • pyo3 0.22 → 0.24.2 (fixes buffer overflow in PyString)
  • iced 0.13 → 0.14.0 (fixes lru Stacked Borrows violation)
  • bytes 1.11.0 → 1.11.1 (fixes integer overflow in BytesMut)
  • time 0.3.44 → 0.3.47 (fixes stack exhaustion DoS)

Stats

  • 1,416 tests passing, 0 failures, 96 GPU-only ignored
  • Zero clippy warnings
  • Net -224 lines of code (consolidation)

Install

[dependencies]
ringkernel = "0.4.1"

Full Changelog: v0.4.0...v0.4.1

v0.4.0: GPU Infrastructure Generalization & Python Bindings

25 Jan 21:23

Highlights

This release extracts ~7,000 lines of proven GPU infrastructure from RustGraph into RingKernel, making these capabilities available to all RingKernel users.

New: Python Bindings (ringkernel-python)

PyO3-based Python wrapper with full async/await support:

import ringkernel
import asyncio

async def main():
    runtime = await ringkernel.RingKernel.create(backend="cpu")
    kernel = await runtime.launch("processor", ringkernel.LaunchOptions())
    await kernel.terminate()
    await runtime.shutdown()

asyncio.run(main())

Features:

  • Async/await with sync fallbacks
  • HLC timestamps and K2K messaging
  • CUDA device enumeration and GPU memory pool management
  • Benchmark framework with regression detection
  • Hybrid CPU/GPU dispatcher with adaptive thresholds
  • Resource guard for memory limit enforcement
  • Type stubs for IDE support

New: PTX Compilation Cache

Disk-based PTX caching for faster kernel loading with SHA-256 content hashing and compute capability awareness.
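The keying idea can be sketched as follows: key on source content plus compute capability, so identical CUDA source compiled for different GPUs never shares a cache entry. `DefaultHasher` stands in for the SHA-256 the notes describe, and `ptx_cache_key` is a hypothetical helper name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Cache key = content hash of the CUDA source + target compute capability.
// (Stand-in hash; the release notes specify SHA-256.)
fn ptx_cache_key(cuda_src: &str, cc_major: u32, cc_minor: u32) -> String {
    let mut h = DefaultHasher::new();
    cuda_src.hash(&mut h);
    format!("{:016x}-sm_{}{}", h.finish(), cc_major, cc_minor)
}

fn main() {
    let src = "__global__ void k() {}";
    // Deterministic for the same source and target:
    assert_eq!(ptx_cache_key(src, 8, 9), ptx_cache_key(src, 8, 9));
    // Compute-capability aware: sm_89 and sm_90 get distinct entries.
    assert_ne!(ptx_cache_key(src, 8, 9), ptx_cache_key(src, 9, 0));
    println!("cache key ok");
}
```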

New: GPU Stratified Memory Pool

Size-stratified GPU VRAM pool with 6 size classes (256B-256KB), O(1) allocation from free lists.
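The size-class lookup behind the O(1) allocation can be sketched as follows. The notes state only the class count and range (6 classes, 256 B to 256 KiB), so the geometric 4× spacing shown here is an assumption; each class keeps its own free list, and allocation is a pop from the matching list.

```rust
// Assumed 4x-spaced classes covering 256 B .. 256 KiB.
const CLASSES: [usize; 6] = [256, 1_024, 4_096, 16_384, 65_536, 262_144];

// Smallest class that fits the request; None means the request exceeds the
// largest class and would fall through to a direct allocation.
fn size_class(bytes: usize) -> Option<usize> {
    CLASSES.iter().copied().find(|&c| bytes <= c)
}

fn main() {
    assert_eq!(size_class(100), Some(256));
    assert_eq!(size_class(256), Some(256));
    assert_eq!(size_class(257), Some(1_024));
    assert_eq!(size_class(262_144), Some(262_144));
    assert_eq!(size_class(262_145), None);
    println!("size classes ok");
}
```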

New: Multi-Stream Execution Manager

Multi-stream CUDA execution for compute/transfer overlap with event-based synchronization.

New: Benchmark Framework

Comprehensive benchmarking with regression detection, baseline comparison, and multiple report formats (Markdown, JSON, LaTeX).

New: Hybrid CPU-GPU Dispatcher

Intelligent workload routing with adaptive threshold learning between CPU and GPU execution.

New: Resource Guard

Memory limit enforcement with safety margins and RAII reservation patterns.

New: Kernel Mode Selector

Intelligent kernel launch configuration based on workload profile and GPU architecture.


See CHANGELOG.md for full details.

v0.3.2: GPU Profiling Infrastructure

21 Jan 09:54

What's New

GPU Profiling Infrastructure

  • CUDA event-based timing and NVTX markers
  • Memory allocation tracking
  • Chrome trace export for visualization

Publishing Fixes

  • Fixed publish script to add User-Agent header for crates.io API
  • Updated dependency versions across all crates for v0.3.2 publishing
  • ringkernel-ir, ringkernel-graph, ringkernel-montecarlo now use workspace versions

Crates Published

  • ringkernel-core, ringkernel-cuda-codegen, ringkernel-wgpu-codegen
  • ringkernel-derive, ringkernel-cpu, ringkernel-cuda, ringkernel-wgpu, ringkernel-metal
  • ringkernel-codegen, ringkernel-ecosystem, ringkernel-audio-fft
  • ringkernel (main crate)

See crates.io/crates/ringkernel for the published crates.

v0.3.1: Enterprise Readiness

19 Jan 20:16

This release adds comprehensive enterprise-grade features for production deployments.

🔐 Enterprise Security

  • Real Cryptography: AES-256-GCM, ChaCha20-Poly1305, Argon2 key derivation
  • Secrets Management: SecretStore trait with key rotation, caching, and chained stores
  • K2K Message Encryption: Kernel-to-kernel encryption with forward secrecy
  • TLS/mTLS Support: Full TLS with rustls, certificate rotation, SNI resolution

🔑 Authentication & Authorization

  • Authentication Providers: ApiKeyAuth, JwtAuth (RS256/HS256), ChainedAuthProvider
  • RBAC: Role-based access control with deny-by-default PolicyEvaluator
  • Multi-tenancy: TenantContext, ResourceQuota, usage tracking

📊 Observability

  • OpenTelemetry: OTLP export to Jaeger, Honeycomb, Datadog, Grafana Cloud
  • Structured Logging: Multi-sink logger with trace correlation (JSON/Text)
  • Alert Routing: Severity-based routing with deduplication (Slack, Teams, PagerDuty)
  • Remote Audit Sinks: Syslog, CloudWatch Logs, Elasticsearch

⚡ Rate Limiting

  • Algorithms: TokenBucket, SlidingWindow, LeakyBucket
  • Builder API: Fluent configuration with RateLimiterBuilder
  • Distributed: SharedRateLimiter for multi-instance deployments
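The TokenBucket variant follows the standard arithmetic matching the builder's `rate`/`burst` parameters: refill continuously at `rate` tokens per second, cap at `burst`, spend one token per request. This is a sketch of the algorithm, not this crate's internals.

```rust
// Minimal token bucket: starts full at `burst` tokens.
struct TokenBucket {
    tokens: f64,
    rate: f64,  // tokens refilled per second
    burst: f64, // bucket capacity
}

impl TokenBucket {
    fn new(rate: f64, burst: f64) -> Self {
        Self { tokens: burst, rate, burst }
    }
    /// `elapsed_secs` is the time since the previous call.
    fn allow(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + self.rate * elapsed_secs).min(self.burst);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Mirrors .rate(1000).burst(100) from the builder example.
    let mut rl = TokenBucket::new(1000.0, 100.0);
    assert!((0..100).all(|_| rl.allow(0.0))); // full burst succeeds
    assert!(!rl.allow(0.0)); // 101st back-to-back request is rejected
    assert!(rl.allow(0.005)); // 5 ms later: 5 tokens refilled, allowed again
    println!("token bucket ok");
}
```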

🔧 Operational Excellence

  • Automatic Recovery: Configurable policies per failure type (Restart, Migrate, Checkpoint, Notify, Escalate, Circuit)
  • Operation Timeouts: Deadline propagation with Timeout and Deadline types
  • Recovery Manager: Retry tracking, cooldown periods, automatic escalation

📦 Feature Flags

[dependencies]
ringkernel-core = { version = "0.3.1", features = ["enterprise"] }

# Or select specific features:
ringkernel-core = { version = "0.3.1", features = ["crypto", "auth", "tls", "rate-limiting", "alerting"] }

📈 Metrics

  • Test Coverage: 900+ tests (up from 825+)
  • Crates Published: 21 crates to crates.io

🚀 Quick Start

use ringkernel_core::prelude::*;

// Enterprise runtime with production preset
let runtime = RuntimeBuilder::new()
    .production()
    .build()?;

// API key authentication
let auth = ApiKeyAuth::new()
    .add_key("sk-prod-abc123", Identity::new("service-a"));

// Rate limiting
let limiter = RateLimiterBuilder::new()
    .algorithm(RateLimitAlgorithm::TokenBucket)
    .rate(1000)
    .burst(100)
    .build();

Full Changelog

See CHANGELOG.md for complete details.

v0.3.0: Multi-Kernel Dispatch, Memory Pools, Global Reductions

19 Jan 09:34

GPU-native persistent actor model framework for Rust. This release adds multi-kernel dispatch, memory pools, global reduction primitives, and two new crates.

Highlights

  • 21 crates published to crates.io - Full workspace now available
  • 825+ tests across the workspace
  • cudarc 0.18.2 and wgpu 27.0 support

New Features

Multi-Kernel Dispatch and Persistent Message Routing

  • #[derive(PersistentMessage)] macro for GPU kernel dispatch
  • KernelDispatcher component with builder pattern and metrics
  • CUDA handler dispatch code generator (CudaDispatchTable)
  • Queue tiering system (QueueTier, QueueFactory, QueueMonitor)

Memory Pool Management

  • StratifiedMemoryPool with 5 size buckets (256B to 64KB)
  • AnalyticsContext for grouped buffer lifecycle
  • PressureHandler for memory pressure monitoring
  • CUDA ReductionBufferCache and WebGPU StagingBufferPool

Global Reduction Primitives

  • ReductionOp enum: Sum, Min, Max, And, Or, Xor, Product
  • ReductionBuffer<T> using mapped memory (zero-copy host read)
  • Multi-phase kernel execution with SyncMode (Cooperative, SoftwareBarrier, MultiLaunch)
  • PageRank example with dangling node handling
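The scalar meaning of the ReductionOp variants can be written out as a CPU reference, each fold starting from the operation's identity element. The crate evaluates these on-GPU via the multi-phase kernels; this sketch only pins down the semantics.

```rust
// The seven variants listed above.
enum ReductionOp { Sum, Min, Max, And, Or, Xor, Product }

// CPU reference: fold with the identity element of each operation.
fn reduce(op: &ReductionOp, xs: &[u64]) -> u64 {
    use ReductionOp::*;
    match op {
        Sum => xs.iter().fold(0, |a, &x| a.wrapping_add(x)),
        Product => xs.iter().fold(1, |a, &x| a.wrapping_mul(x)),
        Min => xs.iter().copied().fold(u64::MAX, u64::min),
        Max => xs.iter().copied().fold(0, u64::max),
        And => xs.iter().fold(u64::MAX, |a, &x| a & x),
        Or => xs.iter().fold(0, |a, &x| a | x),
        Xor => xs.iter().fold(0, |a, &x| a ^ x),
    }
}

fn main() {
    let xs = [7u64, 5, 6];
    assert_eq!(reduce(&ReductionOp::Sum, &xs), 18);
    assert_eq!(reduce(&ReductionOp::Min, &xs), 5);
    assert_eq!(reduce(&ReductionOp::Max, &xs), 7);
    assert_eq!(reduce(&ReductionOp::And, &xs), 4); // 0b111 & 0b101 & 0b110
    assert_eq!(reduce(&ReductionOp::Xor, &xs), 4); // 7 ^ 5 ^ 6
    println!("reductions ok");
}
```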

CUDA NVRTC Compilation

  • compile_ptx() function for runtime CUDA compilation
  • Downstream crates can compile CUDA without direct cudarc dependency

Domain System

  • 20 business domains with reserved type ID ranges
  • #[message(domain = "FraudDetection")] attribute
  • Domains: GraphAnalytics, FraudDetection, ProcessIntelligence, Banking, etc.

New Crates

  • ringkernel-montecarlo - Philox RNG, antithetic variates, control variates, importance sampling
  • ringkernel-graph - CSR matrix, BFS, SCC (Tarjan/Kosaraju), Union-Find, SpMV

Breaking Changes

  • cudarc API updated to 0.18.2 (module loading, kernel launch builder pattern)
  • wgpu API updated to 27.0 (Arc-based resources)

Installation

[dependencies]
ringkernel = "0.3.0"

# Optional backends
ringkernel-cuda = "0.3.0"
ringkernel-wgpu = "0.3.0"

Full Changelog: v0.2.0...v0.3.0

RingKernel v0.2.0

14 Jan 16:48

What's Changed

  • Claude/persistent kernel implementation by @mivertowski in #9

Full Changelog: v0.1.3...v0.2.0

v0.1.3 - Dependency Updates & CI Fixes

17 Dec 14:18

Highlights

  • wgpu 27.0 - Major update with Arc-based resource tracking (~40% performance improvement in some workloads)
  • Dependency updates - tokio 1.48, axum 0.8, tonic 0.14, egui 0.31, winit 0.30
  • CI/CD fixes - Workspace builds without CUDA/nvcc installed

What's Changed

Dependencies Updated

| Package | From | To |
|---------|------|----|
| wgpu | 0.19 | 27.0 |
| tokio | 1.35 | 1.48 |
| thiserror | 1.0 | 2.0 |
| axum | 0.7 | 0.8 |
| tower | 0.4 | 0.5 |
| tonic | 0.11 | 0.14 |
| prost | 0.12 | 0.14 |
| egui/egui-wgpu/egui-winit | 0.27 | 0.31 |
| winit | 0.29 | 0.30 |
| glam | 0.27 | 0.29 |
| metal | 0.27 | 0.31 |
| arrow | 52 | 54 |
| polars | 0.39 | 0.46 |
| rayon | 1.10 | 1.11 |
| actix-rt | 2.9 | 2.10 |

Deferred Updates

  • iced: Kept at 0.13 (0.14 requires major application API rewrite)
  • rkyv: Kept at 0.7 (0.8 has incompatible data format)

CI/CD Improvements

  • CUDA features are now opt-in (not default)
  • Workspace builds succeed without nvcc installed
  • Feature-gated CUDA tests with #[cfg(feature = "cuda")]

See CHANGELOG.md for full details.

v0.1.2

11 Dec 09:55

- **WaveSim3D** - 3D acoustic wave simulation with realistic physics
  - Full 3D FDTD wave propagation solver
  - Binaural audio rendering with HRTF support
  - Volumetric ray marching visualization
  - GPU-native actor system for distributed simulation

- Expanded GPU intrinsics from ~45 to 120+ operations across 13 categories
- Atomic operations: and, or, xor, inc, dec
- 3D stencil intrinsics: up, down, at(dx, dy, dz)
- Warp match operations (Volta+, SM 7.0+) and warp reduce operations (SM 8.0+)
- Bit manipulation, memory, special, and timing ops
- 171 tests (up from 143)

- Added required-features to CUDA-only wavesim binaries
- Updated GitHub Actions release workflow

See CHANGELOG.md for full details.