Merged
27 commits
c0d53b9
AD-324: Switch RO-Crate generation to PROV model
Apr 9, 2026
aa21ca9
AD-324: fix RO-Crate Linux validation fallout
Apr 9, 2026
a2f9e88
Merge branch 'main' into AD-324-ro-create-mods-for-naerm-data-team
Apr 13, 2026
5dee39b
resolving merge conflicts
Apr 13, 2026
5d918b0
Merge branch 'AD-324-ro-create-mods-for-naerm-data-team' of https://g…
Apr 13, 2026
be7f28e
AD-324: format RO-Crate utils after pull
Apr 13, 2026
a67836f
removing references to data team
Apr 13, 2026
eb8c0ab
Merge branch 'AD-324-ro-create-mods-for-naerm-data-team' of https://g…
Apr 13, 2026
f2fe824
AD-324: Stop tracking tmp artifacts
Apr 13, 2026
3b000fb
wip
Apr 20, 2026
0581f58
fixing pipeline failures
arjlai221 Apr 21, 2026
1c30d32
fix client test api compatibility
arjlai221 Apr 21, 2026
597db62
reverting unintentional AI mods
arjlai221 Apr 21, 2026
ceffce6
failing test
arjlai221 Apr 21, 2026
aad2da7
Refine RO-Crate export behavior
arjlai221 Apr 21, 2026
0a94a9f
merging with main
arjlai221 Apr 21, 2026
4474bec
removing old run_id parameter, adding data flow design doc
arjlai221 Apr 21, 2026
1d6f7f2
review mods
arjlai221 Apr 27, 2026
febed43
docs update
arjlai221 Apr 28, 2026
63708ac
format md file
arjlai221 Apr 29, 2026
05d41ac
move provenance creation to workflow creation
arjlai221 May 4, 2026
a98f4b3
Merge branch 'main' into AD-324-ro-create-mods-for-naerm-data-team
arjlai221 May 4, 2026
84fe49e
applying updates from main branch
arjlai221 May 4, 2026
22ad32a
fixing compiliation errors
arjlai221 May 4, 2026
f483f4f
using #201 improvements for entity id filtering
arjlai221 May 4, 2026
b8666fe
refine RO-Crate provenance creation and run timing updates
arjlai221 May 5, 2026
0977f9a
implement cache for getting file model
arjlai221 May 5, 2026
2 changes: 2 additions & 0 deletions .github/workflows/lint.yml
@@ -60,6 +60,8 @@ jobs:
run: cargo clippy --all --all-targets --all-features -- -D warnings

- name: Check Rust-owned OpenAPI artifacts
env:
DATABASE_URL: sqlite:db/sqlite/dev.db
run: |
cargo test --lib --no-default-features --features openapi-codegen
bash api/check_openapi_codegen_parity.sh
1 change: 1 addition & 0 deletions .gitignore
@@ -38,3 +38,4 @@ output
torc_output
.mcp.json
.dprint-cache/
/tmp
1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
@@ -126,6 +126,7 @@
- [AI-Assisted Recovery Design](./specialized/design/ai-assisted-recovery.md)
- [Workflow Graph](./specialized/design/workflow-graph.md)
- [Interface Architecture](./specialized/design/interfaces.md)
- [RO-Crate Generation Design](./specialized/design/ro-crate.md)
- [API Generation Architecture](./specialized/design/api-generation.md)
- [Slurm Job Step Monitoring](./specialized/design/srun-monitoring.md)

38 changes: 25 additions & 13 deletions docs/src/core/concepts/ro-crate.md
@@ -31,8 +31,9 @@ other research object with JSON-LD properties. Entities can be:
### Always recorded (all workflows)

During workflow initialization, Torc creates **SoftwareApplication** entities for the torc binaries
(server, job runner, etc.) that processed the workflow. These record the software name and version,
providing a baseline provenance record even when full RO-Crate tracking is not enabled.
(server, CLI, job runner, etc.) that processed the workflow. In the current model, these are written
as both `SoftwareApplication` and `prov:SoftwareAgent` so the exported RO-Crate uses a PROV-shaped
provenance model.

### When `enable_ro_crate: true`

@@ -42,12 +43,15 @@ When you enable RO-Crate on a workflow, Torc additionally creates file and job p

- File entities are created for all **input files** (files that exist on disk)
- Entities include MIME type inference, file size, and modification date
- Torc creates workflow-level provenance entities: `#torc-workflow` and `#torc-run-{run_id}`

**When jobs complete successfully:**

- File entities are created for all **output files**
- CreateAction entities are created for each job (provenance)
- Output files are linked to their producing job via `wasGeneratedBy`
- Output files are linked to their producing job via `prov:wasGeneratedBy`
- Output files are linked to the workflow run via `prov:wasAttributedTo`
- Output files are linked to file inputs via `prov:wasDerivedFrom`

This creates a complete provenance graph linking inputs → jobs → outputs.

@@ -58,13 +62,15 @@ Automatically generated File entities include:
```json
{
"@id": "data/output.csv",
"@type": "File",
"@type": ["File", "prov:Entity"],
"name": "output.csv",
"encodingFormat": "text/csv",
"contentSize": 1024,
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"dateModified": "2024-01-01T00:00:00Z",
"wasGeneratedBy": { "@id": "#job-42-attempt-1" }
"prov:wasGeneratedBy": { "@id": "#job-42-attempt-1" },
"prov:wasAttributedTo": { "@id": "#torc-run-1" },
"prov:wasDerivedFrom": { "@id": "data/input.csv" }
}
```

@@ -75,13 +81,18 @@ Job provenance is captured as CreateAction entities:
```json
{
"@id": "#job-42-attempt-1",
"@type": "CreateAction",
"@type": ["CreateAction", "prov:Activity"],
"name": "process_data",
"instrument": { "@id": "#workflow-123" },
"prov:hadPlan": { "@id": "#torc-workflow" },
"isPartOf": { "@id": "#torc-run-1" },
"instrument": { "@id": "#software-torc-run-1" },
"prov:used": { "@id": "data/input.csv" },
"result": [{ "@id": "data/output.csv" }]
}
```

The exported `@context` includes the `prov` namespace.

## Enabling Automatic RO-Crate

Add `enable_ro_crate: true` to your workflow specification:
@@ -111,8 +122,9 @@ workflow is created. Files that exist are marked as inputs; files that don't exi
After running this workflow:

- `input_data` will have an RO-Crate File entity (created during initialization)
- `output_data` will have an RO-Crate File entity with `wasGeneratedBy` linking to the job
- `output_data` will have an RO-Crate File entity with `prov:wasGeneratedBy` linking to the job
- A CreateAction entity will describe the `process` job execution
- `#torc-workflow` and `#torc-run-{run_id}` will describe the workflow plan and run activity

## Dataset Entities for Directories

@@ -149,11 +161,11 @@ Dataset entities include file count, total size, and an optional hash:

Torc supports three hash modes for datasets:

| Mode | Description | Speed | Detects |
| ---------- | ----------------------------------------- | ------- | ---------------------------------- |
| `manifest` | Hash of sorted (path, size, mtime) list | Fast | File additions, deletions, renames |
| `content` | SHA256 of all file contents (Merkle tree) | Slow | Any content change |
| `none` | No hash, only file count and size | Fastest | Nothing (stats only) |
| Mode | Description | Speed | Detects |
| ---------- | --------------------------------- | ------- | -------------------------- |
| `manifest` | Hash of sorted path/size/mtime | Fast | Additions, deletions, moves |
| `content` | SHA256 of all file contents | Slow | Any content change |
| `none` | No hash, only file count and size | Fastest | Nothing |

For large datasets, `manifest` mode provides a good balance—it detects structural changes without
the I/O cost of reading terabytes of data.
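
As a rough illustration of `manifest` mode, the sketch below hashes a sorted list of (path, size, mtime) tuples. The `ManifestEntry` alias and the `manifest_hash` function are hypothetical names, and `DefaultHasher` from the Rust standard library is a stand-in: this page does not say which hash function Torc uses for manifests.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One manifest row: (relative path, size in bytes, mtime in seconds).
/// Hypothetical type for illustration only.
pub type ManifestEntry = (String, u64, i64);

/// Hash a sorted copy of the manifest so that input order does not matter.
/// `DefaultHasher` is an assumption made to keep the sketch dependency-free;
/// it is not a cryptographic hash.
pub fn manifest_hash(entries: &[ManifestEntry]) -> u64 {
    let mut sorted = entries.to_vec();
    sorted.sort();
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}
```

Because only metadata is hashed, two directories whose files differ in content but share path, size, and mtime produce the same value, which is the trade-off the table describes.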
29 changes: 18 additions & 11 deletions docs/src/core/how-to/ro-crate-metadata.md
@@ -36,8 +36,10 @@ When automatic generation is enabled:

- **Input files** (files that exist on disk) get File entities created during workflow
initialization
- **Output files** get File entities with provenance (`wasGeneratedBy`) created when jobs complete
- **Jobs** get CreateAction entities linking to their output files
- **Output files** get File entities with provenance (`prov:wasGeneratedBy`) created when jobs
complete
- **Jobs** get CreateAction entities linking inputs, outputs, plan, and run metadata
- **Workflow runs** get `#torc-workflow` and `#torc-run-{run_id}` provenance entities

After running the workflow, export the metadata:

@@ -50,10 +52,11 @@ The exported document includes complete provenance:
```json
{
"@id": "data/output.csv",
"@type": "File",
"@type": ["File", "prov:Entity"],
"name": "output.csv",
"encodingFormat": "text/csv",
"wasGeneratedBy": { "@id": "#job-1-attempt-1" }
"prov:wasGeneratedBy": { "@id": "#job-1-attempt-1" },
"prov:wasAttributedTo": { "@id": "#torc-run-1" }
}
```

@@ -86,7 +89,7 @@ Each RO-Crate entity has:
| `file_id` | Optional link to a Torc file record |

Entities are stored per-workflow. The `export` command assembles them into a complete RO-Crate
document with the required metadata descriptor and root dataset.
document with the required metadata descriptor, root dataset, and PROV-aware context.

## Creating Entities

@@ -300,7 +303,10 @@ The exported document has this structure:

```json
{
"@context": "https://w3id.org/ro/crate/1.1/context",
"@context": [
"https://w3id.org/ro/crate/1.1/context",
{ "prov": "http://www.w3.org/ns/prov#" }
],
"@graph": [
{
"@id": "ro-crate-metadata.json",
@@ -319,22 +325,23 @@ The exported document has this structure:
},
{
"@id": "data/output.parquet",
"@type": "File",
"@type": ["File", "prov:Entity"],
"name": "Simulation Output",
"encodingFormat": "application/x-parquet"
"encodingFormat": "application/x-parquet",
"prov:wasGeneratedBy": { "@id": "#job-1-attempt-1" }
},
{
"@id": "https://example.com/simulation/v2.1",
"@type": "SoftwareApplication",
"@type": ["SoftwareApplication", "prov:SoftwareAgent"],
"name": "My Simulation",
"version": "2.1.0"
}
]
}
```

The `@id` and `@type` fields are always set from the entity record, overriding any values in the
metadata JSON.
Torc preserves any explicit `@id` and `@type` already present in the stored metadata. If either
field is missing, the exporter fills it in from the entity record.
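
The fill-if-missing rule can be sketched as follows. This is a simplified model under stated assumptions: `Metadata` is a flat string map rather than nested JSON, and `fill_identity` is a hypothetical name, not a function from the Torc exporter.

```rust
use std::collections::BTreeMap;

/// Simplified metadata: a flat map of JSON-LD keys to string values.
/// Real stored metadata is nested JSON; a flat map keeps the sketch small.
pub type Metadata = BTreeMap<String, String>;

/// Keep any explicit `@id` / `@type` already in the stored metadata and
/// fill in missing ones from the entity record. Hypothetical helper.
pub fn fill_identity(mut metadata: Metadata, record_id: &str, record_type: &str) -> Metadata {
    metadata
        .entry("@id".to_string())
        .or_insert_with(|| record_id.to_string());
    metadata
        .entry("@type".to_string())
        .or_insert_with(|| record_type.to_string());
    metadata
}
```

With stored metadata that sets only `@type`, the stored `@type` survives and `@id` is taken from the entity record.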

## Workflow Export/Import

1 change: 1 addition & 0 deletions docs/src/specialized/design/index.md
@@ -13,5 +13,6 @@ Internal design documentation for developers.
- [AI-Assisted Recovery Design](./ai-assisted-recovery.md) - AI-assisted error classification
- [Workflow Graph](./workflow-graph.md) - Dependency graph implementation
- [Interface Architecture](./interfaces.md) - Interface design patterns
- [RO-Crate Generation Design](./ro-crate.md) - RO-Crate entity lifecycle and provenance flow
- [Slurm Job Step Monitoring](./srun-monitoring.md) - srun wrapping, sstat monitoring, sacct
collection
138 changes: 138 additions & 0 deletions docs/src/specialized/design/ro-crate.md
@@ -0,0 +1,138 @@
# RO-Crate Generation Design

This page describes how Torc creates and updates automatic RO-Crate provenance entities.
Comment thread
daniel-thom marked this conversation as resolved.

## Current Model

The important identity rules are:

- Workflow plan entity: one per workflow, `#torc-workflow`
- Workflow run entity: one per run, `#torc-run-{run_id}`
- Torc software entities: one per run, `#software-{binary_name}-run-{run_id}`
- Job execution entities: one per job attempt, `#job-{job_id}-attempt-{attempt_id}`
- File entities: one per file record/path, updated in place across runs

That last point is why `build_file_entity()` does not take `run_id`. Plain file entities are not
modeled as run-scoped records. Run-scoped provenance is attached through relationships:

- Output files link to the workflow run with `prov:wasAttributedTo`
- Output files link to the producing job with `prov:wasGeneratedBy`
- Job `CreateAction` entities link to the run with `isPartOf`
- Job `CreateAction` entities link to software agents with `instrument` and `prov:wasAssociatedWith`

If `run_id` were written directly into the base file entity metadata again, it would mix a stable
file identity with run-specific state. The current code instead keeps file identity stable and
updates the same file entity as a file moves from "input known at initialization" to "output with
provenance after job completion".

This design is also consistent with the multi-run behavior covered by
`test_auto_ro_crate_second_run_replaces_entities`: file entities are replaced in place, while
software and job execution entities accumulate across runs and attempts.
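
The identity rules above can be summarized as a handful of formatting helpers. This is a minimal sketch: the constant and function names are hypothetical and are not taken from the Torc source; only the id patterns come from this page.

```rust
/// The workflow plan id is constant: one entity per workflow.
pub const WORKFLOW_PLAN_ID: &str = "#torc-workflow";

/// One run entity per run.
pub fn run_entity_id(run_id: i64) -> String {
    format!("#torc-run-{run_id}")
}

/// One software entity per binary per run.
pub fn software_entity_id(binary_name: &str, run_id: i64) -> String {
    format!("#software-{binary_name}-run-{run_id}")
}

/// One job execution entity per job attempt.
pub fn job_entity_id(job_id: i64, attempt_id: i64) -> String {
    format!("#job-{job_id}-attempt-{attempt_id}")
}

/// File entities are keyed by path only. There is deliberately no `run_id`
/// parameter, mirroring the reasoning given for `build_file_entity()`.
pub fn file_entity_id(path: &str) -> String {
    path.to_string()
}
```

Only the file helper lacks a run or attempt component, which is the stable-identity property the section describes.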

## Entity Creation Flow

```mermaid
flowchart TD
A[Workflow initialize_jobs] --> B{enable_ro_crate?}
A --> C[Server creates<br/>#software-torc-server-run-N]
A --> D[Client attempts to create<br/>#software-torc-run-N<br/>and optional<br/>#software-torc-slurm-job-runner-run-N]

B -->|yes| E[Server upserts input File entities<br/>from DB rows with st_mtime]
B -->|yes| F[Client creates or updates<br/>#torc-workflow and #torc-run-N]
B -->|yes| G[Client creates or updates<br/>input File entities]
B -->|no| H[No automatic file provenance]

G --> I[Workflow execution]
E --> I
F --> I
C --> I
D --> I

I --> J[Job completes successfully]
J --> J2{Job has output files?}
J2 -->|yes| K[Client refreshes<br/>#torc-workflow and #torc-run-N]
J2 -->|yes| L[Client creates<br/>#job-job_id-attempt-attempt_id]
J2 -->|yes| M[Client creates or updates<br/>output File entity]
J2 -->|no| P[No additional automatic<br/>RO-Crate entities for this job]

L --> N[Job CreateAction metadata]
N --> N1[prov:hadPlan -> #torc-workflow]
N --> N2[isPartOf -> #torc-run-N]
N --> N3[instrument -> #software-torc-run-N]
N --> N4[prov:used -> input file paths]
N --> N5[result -> output file paths]

M --> O[Output File metadata]
O --> O1[prov:wasGeneratedBy -> job CreateAction]
O --> O2[prov:wasAttributedTo -> #torc-run-N]
O --> O3[prov:wasDerivedFrom -> input file paths]

classDef init fill:#dbeafe,stroke:#1d4ed8,color:#0f172a,stroke-width:2px;
classDef software fill:#dcfce7,stroke:#15803d,color:#0f172a,stroke-width:2px;
classDef input fill:#fef3c7,stroke:#b45309,color:#0f172a,stroke-width:2px;
classDef run fill:#ede9fe,stroke:#6d28d9,color:#0f172a,stroke-width:2px;
classDef job fill:#fee2e2,stroke:#b91c1c,color:#0f172a,stroke-width:2px;
classDef output fill:#cffafe,stroke:#0f766e,color:#0f172a,stroke-width:2px;
classDef disabled fill:#e5e7eb,stroke:#4b5563,color:#111827,stroke-dasharray: 5 3;

class A,I,J init;
class C,D software;
class E,G input;
class F,K run;
class L,N,N1,N2,N3,N4,N5 job;
class M,O,O1,O2,O3 output;
class H,P disabled;
```

## What Gets Created

### Torc binaries

- The server always creates `#software-torc-server-run-{run_id}` during `initialize_jobs()`
- The client attempts to create run-scoped software entities for `torc` and, on Linux,
`torc-slurm-job-runner`
- Client-side software entities are skipped when the corresponding binary cannot be found next to
the current executable or on `PATH`
- These are `SoftwareApplication` plus `prov:SoftwareAgent`
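
The lookup order for client-side binaries (next to the current executable, then `PATH`) might look like the sketch below. `find_binary` and `candidate_in` are hypothetical names, and the sketch ignores platform-specific executable suffixes; the actual Torc lookup may differ.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Return `dir/binary_name` if it exists as a regular file.
fn candidate_in(dir: &Path, binary_name: &str) -> Option<PathBuf> {
    let candidate = dir.join(binary_name);
    candidate.is_file().then_some(candidate)
}

/// Search next to the current executable first, then each `PATH` directory.
/// Returns `None` when the binary cannot be found, in which case the caller
/// would skip creating the corresponding software entity.
pub fn find_binary(binary_name: &str) -> Option<PathBuf> {
    let beside_exe = env::current_exe()
        .ok()
        .and_then(|exe| exe.parent().map(Path::to_path_buf))
        .and_then(|dir| candidate_in(&dir, binary_name));
    beside_exe.or_else(|| {
        let path_var = env::var_os("PATH")?;
        let found = env::split_paths(&path_var).find_map(|dir| candidate_in(&dir, binary_name));
        found
    })
}
```

A caller could treat `find_binary("torc-slurm-job-runner")` returning `None` as the signal to skip that software entity rather than fail the run.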

### Jobs

- The client creates one `CreateAction` per successful job completion **that has at least one output
file**
- The entity id is `#job-{job_id}-attempt-{attempt_id}`
- Jobs with no output files currently do not emit an automatic `CreateAction`
- When present, the job entity is the main join point between inputs, outputs, workflow run, and
software agents
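
To make the "join point" role concrete, the sketch below lists the links a job `CreateAction` carries in the flow diagram, and returns nothing for a job without output files. `create_action_links` is a hypothetical helper, the `torc` software id is taken from the diagram, and `prov:wasAssociatedWith` is left out for brevity.

```rust
/// Build (property, target id) pairs for a job `CreateAction`.
/// Returns `None` when the job has no output files, since no automatic
/// `CreateAction` is emitted in that case. Hypothetical helper.
pub fn create_action_links(
    run_id: i64,
    inputs: &[&str],
    outputs: &[&str],
) -> Option<Vec<(String, String)>> {
    if outputs.is_empty() {
        return None;
    }
    let mut links = vec![
        ("prov:hadPlan".to_string(), "#torc-workflow".to_string()),
        ("isPartOf".to_string(), format!("#torc-run-{run_id}")),
        ("instrument".to_string(), format!("#software-torc-run-{run_id}")),
    ];
    // Each input path becomes a `prov:used` link, each output a `result` link.
    links.extend(inputs.iter().map(|p| ("prov:used".to_string(), (*p).to_owned())));
    links.extend(outputs.iter().map(|p| ("result".to_string(), (*p).to_owned())));
    Some(links)
}
```

For one input and one output this yields five links; for a job with inputs but no outputs it yields `None`.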

### Input files

- Input files are detected by `st_mtime IS NOT NULL`
- During initialization, both the server and the client currently upsert the same input file entity
- The entity is keyed by workflow and `file_id`, with `entity_id = file.path`
- Input file entities are expected to exist before jobs run, but the code does not rely on them
being create-only; it is intentionally upsert-based
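
The `st_mtime IS NOT NULL` rule translates naturally to an `Option` check. The `FileRecord` struct below is a minimal stand-in for a Torc file row, and `input_files` is a hypothetical name; only the `st_mtime`, `file_id`, and `path` fields are mentioned on this page.

```rust
/// Minimal stand-in for a Torc file row (illustrative only).
#[derive(Debug, Clone, PartialEq)]
pub struct FileRecord {
    pub file_id: i64,
    pub path: String,
    /// `None` corresponds to a NULL `st_mtime` in the database.
    pub st_mtime: Option<f64>,
}

/// Select the files treated as inputs: those with a recorded `st_mtime`.
pub fn input_files(files: &[FileRecord]) -> Vec<&FileRecord> {
    files.iter().filter(|f| f.st_mtime.is_some()).collect()
}
```

Each selected record would then be upserted with `entity_id` set to its `path`, so repeating the step on the server and the client converges on one entity per file.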

### Output files

- Output file entities are created or replaced after a job succeeds and the file record has been
refreshed
- If a file already had an entity from initialization or a prior run, the same DB row is updated
rather than creating a new file entity for each run
- Run-specific provenance is recorded in the metadata relationships, not by giving the file entity a
run-specific identity
- The same successful-job path also refreshes `#torc-workflow`, refreshes `#torc-run-{run_id}`, and
creates the job `CreateAction`, but only when there is at least one output file to process

## Important Asymmetries

- Software entities are run-scoped and accumulate across runs
- Job `CreateAction` entities are attempt-scoped and accumulate across attempts
- File entities are file-scoped and are replaced in place across runs

These asymmetries are intentional and match `tests/test_auto_ro_crate.rs`, especially
`test_auto_ro_crate_second_run_replaces_entities`, which expects:

- file entity count to stay stable across runs
- software entity count to grow across runs
- output file provenance to point at the newer `#torc-run-{run_id}`
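
The expectations above can be reproduced with a toy in-memory store. This is a sketch, not the Torc database layer: `EntityStore` and `record_run` are hypothetical, and the assumption that the second run uses a new attempt id is made only for illustration.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the per-workflow entity table.
/// Keys are entity ids; values are the id of the run that last wrote them.
#[derive(Default)]
pub struct EntityStore {
    entities: BTreeMap<String, i64>,
}

impl EntityStore {
    /// Insert or replace an entity. Returns true if a new row was created.
    pub fn upsert(&mut self, entity_id: &str, run_id: i64) -> bool {
        self.entities.insert(entity_id.to_string(), run_id).is_none()
    }

    /// Count entities whose id starts with `prefix`.
    pub fn count_with_prefix(&self, prefix: &str) -> usize {
        self.entities.keys().filter(|k| k.starts_with(prefix)).count()
    }

    /// The run that last wrote `entity_id`, if the entity exists.
    pub fn last_run_for(&self, entity_id: &str) -> Option<i64> {
        self.entities.get(entity_id).copied()
    }
}

/// Record one run in which a single job attempt produces one output file.
pub fn record_run(
    store: &mut EntityStore,
    run_id: i64,
    job_id: i64,
    attempt_id: i64,
    output_path: &str,
) {
    // Run-scoped and attempt-scoped ids differ per run, so these accumulate.
    store.upsert(&format!("#software-torc-server-run-{run_id}"), run_id);
    store.upsert(&format!("#job-{job_id}-attempt-{attempt_id}"), run_id);
    // The file id is just the path, so the same row is replaced in place.
    store.upsert(output_path, run_id);
}
```

After two calls to `record_run` for the same output path, the store holds two software entities, two job entities, and one file entity attributed to the second run.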