mivertowski/RustCompute

RingKernel v1.1.0


Persistent GPU actors for NVIDIA CUDA, proven on H100 and 2x H100 NVL.

RingKernel treats GPU thread blocks as long-running actors that maintain state, communicate via lock-free queues, and manage lifecycle (create, destroy, restart, supervise) -- all within a single persistent kernel launch. No kernel re-launch overhead. No host round-trip for inter-actor messaging. v1.1 adds multi-GPU migration over NVLink P2P, per-tenant isolation, PROV-O provenance, hot rule reload, and protocols formally verified in TLA+.

Key Results

Single-GPU (H100 NVL, v1.0 baseline)

Measured with locked clocks, exclusive compute mode, and statistical rigor (95% CI, Cohen's d, Welch's t-test). Full data in docs/benchmarks/ACADEMIC_PROOF.md.

| Metric | Value | vs Baseline |
|---|---|---|
| Persistent actor command injection | 55 ns | 8,698x faster than cuLaunchKernel |
| vs CUDA Graphs (best traditional) | 0.2 us total | 3,005x faster than graph replay |
| Sustained throughput (60 seconds) | 5.54M ops/s | CV 0.05%, zero degradation |
| Cluster sync (H100 Hopper) | 0.628 us/sync | 2.98x faster than grid.sync() |
| Zero-copy serialization | 0.544 ns | Sub-nanosecond pointer cast |
| Actor create (on-GPU) | 163 us | Erlang-style, no kernel re-launch |
| Actor restart (on-GPU) | 197 us | State reset + reactivation |
| Streaming pipeline | 610K events/s | 4-stage GPU-native pipeline |
| WaveSim3D GPU stencil | 78K Mcells/s | 217.9x vs CPU (40-core EPYC) |
| Async memory alloc | 878 ns | 116.9x vs cuMemAlloc |

Multi-GPU (2x H100 NVL, v1.1, NVLink NV12 topology)

Full data in docs/benchmarks/v1.1-2x-h100-results.md.

| Metric | Value | vs Baseline |
|---|---|---|
| K2K SMEM tier latency | 6.7 us | baseline (intra-block, payload-independent) |
| K2K DSMEM tier latency | 9.0 - 15.3 us | 1.3 - 2.3x vs SMEM (cross-block within cluster) |
| K2K HBM tier latency | 10.5 - 18.2 us | 1.5 - 2.7x vs SMEM (cross-cluster via global memory) |
| NVLink P2P migration (16 MiB) | 69 us | 8.7x faster than host-staged |
| Multi-GPU K2K sustained bandwidth | 258 GB/s @ 16 MiB | 81% of 318 GB/s NVLink peak |
| Lifecycle rule overhead (all 5 rules) | 23 ns mean / 30 ns p99 | Flat within measurement noise |
| Cross-tenant leak count | 0 | 13 isolation tests, multiple stress scenarios |
| TLA+ specs verified | 6 / 6 | 24K+ distinct states combined, no counterexamples |

Quick Start

CPU Backend (default, no GPU required)

```toml
[dependencies]
ringkernel = "1.1"
tokio = { version = "1.48", features = ["full"] }
```

```rust
use ringkernel::prelude::*;

#[tokio::main]
async fn main() -> std::result::Result<(), Box<dyn std::error::Error>> {
    let runtime = RingKernel::builder()
        .backend(Backend::Cpu)
        .build()
        .await?;

    // Launch a persistent kernel (auto-activates)
    let kernel = runtime.launch("processor", LaunchOptions::default()).await?;
    println!("State: {:?}", kernel.state());

    // Lifecycle management
    kernel.deactivate().await?;
    kernel.activate().await?;
    kernel.terminate().await?;

    runtime.shutdown().await?;
    Ok(())
}
```

CUDA Backend (requires NVIDIA GPU + CUDA toolkit)

```toml
[dependencies]
ringkernel = { version = "1.1", features = ["cuda"] }
tokio = { version = "1.48", features = ["full"] }
```

```rust
use ringkernel::prelude::*;

#[tokio::main]
async fn main() -> std::result::Result<(), Box<dyn std::error::Error>> {
    let runtime = RingKernel::builder()
        .backend(Backend::Cuda)
        .build()
        .await?;

    let kernel = runtime.launch("gpu_processor", LaunchOptions::default()).await?;
    println!("Running on CUDA: {:?}", kernel.state());

    kernel.terminate().await?;
    runtime.shutdown().await?;
    Ok(())
}
```

Architecture

```
Host (CPU)                              Device (GPU)
+----------------------------+          +------------------------------------+
|  Application (async)       |          |  Supervisor (Block 0)              |
|  ActorSupervisor           |<- DMA -->|  Actor Pool (Blocks 1-N)           |
|  ActorRegistry             |          |    +- Control Block (256B each)    |
|  FlowController            |          |    +- H2K / K2H Queues             |
|  DeadLetterQueue           |          |    +- Inter-Actor Buffers          |
|  MemoryPressureMonitor     |          |    +- K2K Routes (device mem)      |
+----------------------------+          |  grid.sync() / cluster.sync()      |
                                        +------------------------------------+
```

Core Components

  • ringkernel-core: Actor lifecycle, lock-free queues, HLC timestamps, control blocks, K2K messaging (single- and multi-tenant), PubSub, enterprise features, PROV-O provenance, hot rule reload, introspection streaming, incremental/delta checkpoints
  • ringkernel-cuda: Persistent CUDA kernels, cooperative groups, Thread Block Clusters, DSMEM, async memory pools, NVTX profiling, multi-GPU runtime with real cuCtxEnablePeerAccess + cuMemcpyPeerAsync, NVLink topology probe, migration protocol, hierarchical work stealing (block / cluster / grid), opt-in NVSHMEM bindings
  • ringkernel-cuda-codegen: Rust-to-CUDA transpiler with 155+ intrinsics (global, stencil, ring, persistent FDTD kernels)
  • ringkernel-ir: Unified intermediate representation for code generation; supports BF16, FP8 (E4M3/E5M2), FP6 (E3M2/E2M3), and FP4 (E2M1) scalar types with per-type compute-capability gating
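The per-type compute-capability gating in ringkernel-ir can be pictured with a minimal sketch. The code below is hypothetical (it is not the crate's implementation); each scalar type reports the minimum compute capability it needs, with floors copied from the compatibility tables later in this README, and codegen rejects anything below that floor:

```rust
// Hypothetical sketch of per-type CC gating. Floors follow the documented
// table: BF16 -> 8.0 (Ampere), FP8 -> 9.0 (Hopper), FP6/FP4 -> 10.0 (Blackwell).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScalarType {
    F32,
    BF16,
    FP8E4M3,
    FP6E3M2,
    FP4E2M1,
}

impl ScalarType {
    /// Minimum (major, minor) compute capability that can execute this type.
    fn min_compute_capability(self) -> (u32, u32) {
        match self {
            ScalarType::F32 => (6, 0),      // any supported device
            ScalarType::BF16 => (8, 0),     // Ampere
            ScalarType::FP8E4M3 => (9, 0),  // Hopper
            ScalarType::FP6E3M2 => (10, 0), // Blackwell
            ScalarType::FP4E2M1 => (10, 0), // Blackwell
        }
    }

    /// Reject-at-codegen check: no downlevel emulation.
    fn supported_on(self, cc: (u32, u32)) -> bool {
        cc >= self.min_compute_capability() // tuples compare lexicographically
    }
}

fn main() {
    let h100 = (9, 0); // Hopper
    assert!(ScalarType::FP8E4M3.supported_on(h100));
    assert!(!ScalarType::FP4E2M1.supported_on(h100)); // FP4 needs Blackwell
    println!("gating ok");
}
```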

Multi-GPU runtime (v1.1)

```rust
use ringkernel_cuda::multi_gpu::{MultiGpuRuntime, PlacementHint, MigrationPlan};
use ringkernel_core::actor::ActorId;

// Fragment: runs inside an async context; `opts` is a LaunchOptions value.
// Probe NVLink topology, retain 2 primary contexts, wire peer access.
let mgpu = MultiGpuRuntime::new(&[0, 1]).await?;

// NVLink-aware placement: co-locate on connected GPUs.
let _a = mgpu.launch("actor-a", PlacementHint::Pinned(0), opts.clone()).await?;
let _b = mgpu.launch("actor-b", PlacementHint::NvlinkPreferred(0), opts).await?;

// 3-phase migration (quiesce, transfer via cuMemcpyPeerAsync, swap).
let report = mgpu.migrate_actor(MigrationPlan::new(0, 1, ActorId(42))).await?;
assert!(report.used_nvlink && report.state_checksum_match);
```

GPU Actor Lifecycle

Actors on GPU follow the same lifecycle as Erlang processes:

Dormant -> Initializing -> Active <-> Draining -> Terminated / Failed

```rust
use std::time::Duration;

use ringkernel_core::actor::{ActorConfig, ActorRegistry, ActorSupervisor, RestartPolicy};

let mut supervisor = ActorSupervisor::new(128); // 127 actor slots + 1 supervisor

let config = ActorConfig::named("sensor-reader")
    .with_restart_policy(RestartPolicy::OneForOne {
        max_restarts: 3,
        window: Duration::from_secs(60),
    });
let actor = supervisor.create_actor(&config, None)?;
supervisor.activate_actor(actor)?;

// Cascading kill (Erlang-style)
supervisor.kill_tree(actor); // Kills actor and all descendants

// Named registry for service discovery (`kernel_id` obtained at launch)
let mut registry = ActorRegistry::new();
registry.register("isa_ontology", kernel_id);
let actor = registry.lookup("standards/isa/*"); // Wildcard patterns
```
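The lifecycle diagram above can be made precise as a transition relation. The following is an illustrative model built only from the documented arrows, not RingKernel's internal representation:

```rust
// Illustrative model of the documented lifecycle arrows:
// Dormant -> Initializing -> Active <-> Draining -> Terminated / Failed
#[derive(Debug, Clone, Copy, PartialEq)]
enum ActorState {
    Dormant,
    Initializing,
    Active,
    Draining,
    Terminated,
    Failed,
}

fn can_transition(from: ActorState, to: ActorState) -> bool {
    use ActorState::*;
    matches!(
        (from, to),
        (Dormant, Initializing)      // create_actor
            | (Initializing, Active) // activate_actor
            | (Active, Draining)     // deactivate
            | (Draining, Active)     // reactivate
            | (Draining, Terminated) // graceful terminate
            | (Draining, Failed)     // drain error
    )
}

fn main() {
    assert!(can_transition(ActorState::Active, ActorState::Draining));
    assert!(!can_transition(ActorState::Terminated, ActorState::Active)); // terminal
    println!("lifecycle model ok");
}
```

Terminal states (Terminated, Failed) have no outgoing edges, which is what makes restart a supervisor-level operation (a fresh slot) rather than a state transition.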

Crate Structure

| Crate | Purpose | Status |
|---|---|---|
| ringkernel | Main facade, re-exports everything | Stable |
| ringkernel-core | Core traits, actor lifecycle, HLC, K2K, queues, enterprise | Stable |
| ringkernel-derive | Proc macros (#[derive(RingMessage)], #[ring_kernel]) | Stable |
| ringkernel-cpu | CPU backend (testing, fallback) | Stable |
| ringkernel-cuda | NVIDIA CUDA: persistent kernels, Hopper features, cooperative groups | Stable, H100-verified |
| ringkernel-cuda-codegen | Rust-to-CUDA transpiler (155+ intrinsics) | Stable |
| ringkernel-codegen | GPU kernel code generation | Stable |
| ringkernel-ir | Unified IR for multi-backend codegen | Stable |
| ringkernel-ecosystem | Actix, Axum, Tower, gRPC, Arrow, Polars integrations | Stable |
| ringkernel-cli | CLI: scaffolding, codegen, compatibility checking | Stable |
| ringkernel-montecarlo | Philox RNG, variance reduction, importance sampling | Stable |
| ringkernel-graph | CSR, BFS, SCC, Union-Find, SpMV | Stable |
| ringkernel-audio-fft | GPU-accelerated audio FFT | Stable |
| ringkernel-wavesim | 2D wave simulation with educational modes | Stable |
| ringkernel-txmon | GPU transaction monitoring with fraud detection | Stable |
| ringkernel-accnet | GPU accounting network visualization | Stable |
| ringkernel-procint | GPU process intelligence, DFG mining, conformance checking | Stable |
| ringkernel-python | Python bindings | Stable |

GPU Compatibility

RingKernel targets NVIDIA datacenter GPUs (Tesla / A-series / H-series / B-series). Consumer cards (GeForce / RTX) will compile and run the core persistent-actor path, but several features are intentionally gated to datacenter silicon: ECC memory, NVLink P2P, MIG, and the Hopper/Blackwell tensor formats assume a datacenter pipeline. There is no downlevel emulation — a kernel that asks for a type or launch mode the device can't execute is rejected at codegen.

Gating is not hand-maintained documentation; it lives in the runtime. GpuArchitecture::from_compute_capability(major, minor) (in ringkernel-cuda/src/launch_config/mode.rs) routes the detected device to a preset, and each feature is guarded by a supports_*() method. Scalar types carry their own ScalarType::min_compute_capability() (in ringkernel-ir/src/types.rs).
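Condensed, that routing looks roughly like the sketch below. This is illustrative code, not the module's actual source; the presets and floors are copied from the compatibility tables that follow:

```rust
// Hypothetical condensation of the runtime gating described above; the real
// code lives in ringkernel-cuda/src/launch_config/mode.rs.
#[derive(Debug, Clone, Copy, PartialEq)]
enum GpuArchitecture {
    Volta,
    Ampere,
    Ada,
    Hopper,
    Blackwell,
}

impl GpuArchitecture {
    fn from_compute_capability(major: u32, minor: u32) -> Option<Self> {
        match (major, minor) {
            (7, _) => Some(Self::Volta),
            (8, 9) => Some(Self::Ada),
            (8, _) => Some(Self::Ampere),
            (9, _) => Some(Self::Hopper),
            (10 | 11, _) => Some(Self::Blackwell),
            _ => None, // e.g. 12.x Rubin: preset placeholder only
        }
    }

    /// Thread Block Clusters (cluster launch, DSMEM): Hopper and newer.
    fn supports_cluster_launch(self) -> bool {
        matches!(self, Self::Hopper | Self::Blackwell)
    }

    /// NVLink 5: Blackwell only.
    fn supports_nvlink5(self) -> bool {
        matches!(self, Self::Blackwell)
    }
}

fn main() {
    let h100 = GpuArchitecture::from_compute_capability(9, 0).unwrap();
    assert!(h100.supports_cluster_launch());

    let a100 = GpuArchitecture::from_compute_capability(8, 0).unwrap();
    assert!(!a100.supports_cluster_launch()); // kernel rejected, not emulated
    println!("gating ok");
}
```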

Architecture to compute capability

| Arch | CC | Datacenter GPUs | RingKernel status |
|---|---|---|---|
| Volta | 7.0 | V100 | Compiles; cooperative groups only (no persistent-actor verification) |
| Ampere | 8.0 / 8.6 | A100, A30, A40 | Persistent actors work; no DSMEM / cluster.sync |
| Ada | 8.9 | L40, L40S, L4 | Persistent actors work; no DSMEM / cluster.sync |
| Hopper | 9.0 | H100, H100 NVL, H200, GH200 | Primary target; v1.0 and v1.1 verified |
| Blackwell | 10.0 / 11.x | B100, B200, GB200 | Codegen stubs compile; runtime paths await B200 silicon |
| Rubin | 12.x | (future) | Preset placeholder only |

DGX Spark is not datacenter Blackwell. The GB10 Superchip in DGX Spark reports compute capability sm_121 — it carries the Blackwell ISA name but is a single-die consumer/workstation part (48 SMs, LPDDR5X shared with Grace via NVLink-C2C, no HBM, no NVSwitch fabric, no MIG, reduced FP64 density). Datacenter Blackwell (B100 / B200 / GB200) is sm_100 / sm_101 / sm_103 with HBM3e and multi-GPU NVLink 5. RingKernel's v1.1 multi-GPU K2K path (cuMemcpyPeerAsync between peer GPUs) does not apply to DGX Spark — there is one GPU die. FP4/FP6 tensor ops and the cluster programming model are available; NVLink peer bandwidth, peer-to-peer migration, and multi-tenant GPU partitioning (MIG) are not.

Feature to minimum compute capability

| Feature | Min CC | Arch floor | Gate | v1.x status |
|---|---|---|---|---|
| Persistent kernel / grid.sync() | 6.0 | Pascal | supports_cooperative_groups() | Shipped, v1.0 |
| Cluster launch (cuLaunchKernelEx + cluster dims) | 9.0 | Hopper | supports_cluster_launch() | Shipped, v1.0 |
| DSMEM ring-topology K2K | 9.0 | Hopper | Implicit (requires clusters) | Shipped, v1.0 |
| cluster.sync() at 0.628 us | 9.0 | Hopper | Implicit (requires clusters) | Shipped, v1.0 |
| TMA (tensor memory accelerator) | 9.0 | Hopper | Implicit (Hopper preset) | Used by codegen, v1.0 |
| NVLink 4 P2P (cuMemcpyPeerAsync) | 9.0 | datacenter Hopper | PeerAccessManager::supports_p2p() | Shipped, v1.1 (258 GB/s @ 16 MiB) |
| Confidential Computing / TEE | 9.0 | Hopper | supports_tee() | Surface exposed; not exercised in benchmarks |
| Dynamic Cluster Launch Control | 10.0 | Blackwell | supports_cluster_launch_control() | Query wired; runtime awaits B200 |
| NVLink 5 | 10.0 | Blackwell | supports_nvlink5() | Query wired; awaits B200 |
| BF16 scalar type | 8.0 | Ampere | ScalarType::BF16.min_compute_capability() | Codegen (CUDA / MSL / WGSL) |
| FP8 (E4M3 / E5M2) scalar types | 9.0 | Hopper | ScalarType::FP8E4M3.min_compute_capability() | Codegen; rejected on pre-Hopper |
| FP6 (E3M2 / E2M3) scalar types | 10.0 | Blackwell | ScalarType::FP6E3M2.min_compute_capability() | Codegen; rejected on pre-Blackwell |
| FP4 (E2M1) scalar type | 10.0 | Blackwell | ScalarType::FP4E2M1.min_compute_capability() | Codegen; rejected on pre-Blackwell |
| NVSHMEM symmetric heap | any | CUDA-capable (host-ABI feature) | nvshmem Cargo feature on ringkernel-cuda | Shipped as v1.2 groundwork; opt-in |

Practical guidance

  • Running the published benchmarks (persistent command injection, DSMEM K2K, NVLink P2P migration) requires Hopper (H100 / H100 NVL / H200 / GH200). They will not be reproducible on Ampere, because cluster.sync and DSMEM do not exist below CC 9.0.
  • Running the persistent-actor programming model (queues, supervision, HLC, multi-tenant isolation) works on anything from Pascal upward. The performance numbers change; the API does not.
  • Blackwell-specific codegen paths compile today and will start exercising at runtime once we have B200 access. The supports_* queries guarantee that pre-Blackwell devices refuse the kernel rather than launching something they cannot execute.

Build Commands

```bash
# Build
cargo build --workspace                           # Build entire workspace
cargo build --workspace --features cuda           # With CUDA backend

# Test
cargo test --workspace                            # Run all tests (1,617 passing)
cargo test -p ringkernel-core                     # Core tests (600+ tests, incl. multi-tenant / provenance / rules / delta checkpoints)
cargo test -p ringkernel-cuda --test gpu_execution_verify  # CUDA GPU tests (requires NVIDIA GPU)
cargo test -p ringkernel-ecosystem --features "persistent,actix,tower,axum,grpc"
cargo bench --package ringkernel                  # Criterion benchmarks

# Examples
cargo run -p ringkernel --example basic_hello_kernel
cargo run -p ringkernel --example kernel_to_kernel
cargo run -p ringkernel --example global_kernel         # CUDA codegen
cargo run -p ringkernel --example stencil_kernel
cargo run -p ringkernel --example ring_kernel_codegen
cargo run -p ringkernel --example educational_modes
cargo run -p ringkernel --example enterprise_runtime
cargo run -p ringkernel --example pagerank_reduction --features cuda

# Applications
cargo run -p ringkernel-txmon --release --features cuda-codegen
cargo run -p ringkernel-txmon --bin txmon-benchmark --release --features cuda-codegen
cargo run -p ringkernel-procint --release
cargo run -p ringkernel-procint --bin procint-benchmark --release
cargo run -p ringkernel-ecosystem --example axum_persistent_api --features "axum,persistent"

# CLI
cargo run -p ringkernel-cli -- new my-app --template persistent-actor
cargo run -p ringkernel-cli -- codegen src/kernels/mod.rs --backend cuda
cargo run -p ringkernel-cli -- check --backends all
```

Performance

H100 Benchmarks (95% CI)

Command Injection Latency

| Model | Per-Command (us) | 95% CI (us) | vs Traditional |
|---|---|---|---|
| Traditional (cuLaunchKernel) | 1.583 | +/-6.0 | 1.0x |
| CUDA Graph Replay | 0.547 | +/-12.3 | 2.9x |
| Persistent Actor | 0.000 | +/-0.0 | 8,698x |

Lock-Free Queue Throughput

| Payload | Latency (ns) | 95% CI (ns) | Throughput (Mmsg/s) |
|---|---|---|---|
| 64 B | 72.28 | [72.26, 72.32] | 13.83 |
| 256 B | 74.67 | [74.65, 74.70] | 13.39 |
| 1 KB | 82.64 | [82.61, 82.68] | 12.10 |
| 4 KB | 177.50 | [177.47, 177.53] | 5.63 |
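The throughput column is simply the reciprocal of the per-message latency (a single producer/consumer pair pushing one message at a time); a quick consistency check:

```rust
// Sanity check that the queue table is internally consistent:
// throughput (Mmsg/s) = 1000 / latency (ns), since 1 s = 1e9 ns.
fn throughput_mmsg_per_s(latency_ns: f64) -> f64 {
    1000.0 / latency_ns
}

fn main() {
    // (latency_ns, reported Mmsg/s) rows from the table above.
    let rows = [(72.28, 13.83), (74.67, 13.39), (82.64, 12.10), (177.50, 5.63)];
    for (latency_ns, reported) in rows {
        assert!((throughput_mmsg_per_s(latency_ns) - reported).abs() < 0.01);
    }
    println!("table consistent");
}
```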

60-Second Sustained Run: 315M operations, 5.54 Mops/s, CV 0.05%, zero throughput degradation, zero memory growth.

Hopper Features: cluster.sync() at 0.628 us/sync (2.98x faster than grid.sync()), DSMEM ring-topology K2K messaging verified across 4-block clusters.

Documentation

Citation

If you use RingKernel or the formal model in academic work, please cite the preprint:

Michael Ivertowski. Persistent GPU Actors: A Formal Model for Living Computation on the Device. Preprint, April 2026. DOI: 10.13140/RG.2.2.24010.12484.

BibTeX:

```bibtex
@misc{ivertowski2026ringkernel,
  author       = {Michael Ivertowski},
  title        = {Persistent {GPU} Actors: A Formal Model for Living
                  Computation on the Device},
  year         = {2026},
  month        = apr,
  howpublished = {Preprint},
  doi          = {10.13140/RG.2.2.24010.12484},
  url          = {https://doi.org/10.13140/RG.2.2.24010.12484},
  note         = {arXiv version in preparation}
}
```

The paper source is in docs/paper/ (LaTeX); make in that directory reproduces the PDF.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.
