AD-324: Switch RO-Crate provenance export to a PROV-shaped model by arjlai221 · Pull Request #262 · NatLabRockies/torc

arjlai221 · 2026-04-09T16:43:15Z

Torc RO-Crate Provenance Change Rationale

Decision

Torc now uses a PROV-shaped RO-Crate format as the canonical export and generation
model. I chose the breaking-change path because the assignment explicitly allowed it and because a
translation layer would have kept two provenance models alive at once.

That would have increased long-term cost in three ways:

every generator change would need a matching mapper change
every export/import path would need dual-format tests
provenance bugs would become harder to diagnose because the stored model and exported model would differ

Using the target model directly keeps Torc's stored entities, auto-generated metadata, and exported
ro-crate-metadata.json aligned.

Core Modifications

1. File provenance now uses the PROV-facing shape

Generated file entities now use:

@type: ["File", "prov:Entity"]
prov:wasGeneratedBy
prov:wasAttributedTo
prov:wasDerivedFrom

Removed torc:run_id because it was Torc-specific bookkeeping, not a provenance relationship in
the requested model.

2. Job provenance is modeled as PROV activities

Generated job entities now use:

@type: ["CreateAction", "prov:Activity"]
prov:hadPlan
isPartOf
prov:used
prov:wasAssociatedWith

This makes job execution records describe both the workflow plan they follow and the inputs they
consume, instead of only pointing at outputs.

3. Workflow-level provenance entities were added

Torc now creates:

#torc-workflow
#torc-run-{run_id}

These entities are necessary because the requested model refers to a workflow plan and a workflow
run explicitly. Without them, prov:hadPlan and run attribution would point to synthetic IDs that
did not exist as entities.
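
The dangling-reference problem described above can be illustrated with a minimal sketch. It is not code from this PR: `run_entity_id` and `dangling_references` are hypothetical helpers, and the entity graph is reduced to a set of `@id` strings.

```rust
use std::collections::HashSet;

/// Hypothetical helper: builds the run entity ID in the `#torc-run-{run_id}` form.
fn run_entity_id(run_id: i64) -> String {
    format!("#torc-run-{}", run_id)
}

/// Returns every referenced ID that has no matching entity in the crate.
/// `entity_ids` holds the `@id` values present in the graph; `references`
/// holds the targets of properties such as `prov:hadPlan`.
fn dangling_references<'a>(
    entity_ids: &HashSet<String>,
    references: &[&'a str],
) -> Vec<&'a str> {
    references
        .iter()
        .copied()
        .filter(|target| !entity_ids.contains(*target))
        .collect()
}
```

With only job entities in the graph, both `#torc-workflow` and the run ID come back as dangling; once the two workflow-level entities exist, the result is empty.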

4. Software entities were aligned with the target model

Torc software records now use:

@type: ["SoftwareApplication", "prov:SoftwareAgent"]

That keeps Torc's own binaries compatible with both RO-Crate consumers and the data team's PROV
interpretation.

5. Export now preserves the richer stored metadata

The exporter no longer flattens stored metadata back to Torc's older shape. It now:

preserves stored @type arrays
keeps stored @id values when present
synthesizes #torc-workflow and #torc-run-{run_id} if older records do not already have them
adds localEvidenceGraph
emits a prov namespace in @context

This was important because switching the generators alone would not have been enough. The exported
crate had to look like the data team's example even when some metadata was entered manually or came
from older workflows.

6. Workflow export/import remapping still works

The import/export ID remapping logic was updated so job provenance references continue to remap when
entity IDs change. The key case here was switching from wasGeneratedBy to
prov:wasGeneratedBy.
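
As a rough sketch of the remapping idea (not the PR's implementation), the following assumes a simplified `FileEntity` type and a `#job-{id}` reference format, both of which are illustrative only. It rewrites `prov:wasGeneratedBy` targets from old IDs to new ones.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a stored file entity: only the reference that
/// needs remapping is modeled here.
#[derive(Debug, Clone, PartialEq)]
struct FileEntity {
    id: String,
    /// Target of `prov:wasGeneratedBy`, e.g. "#job-12" (assumed format).
    was_generated_by: Option<String>,
}

/// Rewrites `prov:wasGeneratedBy` targets using an old-ID -> new-ID map.
/// References with no mapping are left untouched.
fn remap_generated_by(entities: &mut [FileEntity], id_map: &HashMap<String, String>) {
    for entity in entities.iter_mut() {
        if let Some(target) = entity.was_generated_by.as_mut() {
            if let Some(new_id) = id_map.get(target.as_str()) {
                *target = new_id.clone();
            }
        }
    }
}
```

Leaving unmapped references alone is a design choice in this sketch; a stricter variant could report them as errors instead.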

Assumptions

These choices were made explicitly:

file lineage is derived from a job's declared input_file_ids
run attribution should be represented by #torc-run-{run_id}
the current Torc run_id is the right identifier to use for workflow-run provenance
workflow/run provenance entities should be created eagerly during input-file initialization and again during output generation so they stay present and current
software provenance should keep using Torc's existing binary discovery logic instead of adding a larger agent-model redesign
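
The first assumption, that lineage comes from a job's declared input_file_ids, could look roughly like the sketch below. The `Job` struct and `derived_from` function are hypothetical simplifications, and the path lookup is modeled as an in-memory map rather than a server call.

```rust
use std::collections::HashMap;

/// Simplified job record; `input_file_ids` mirrors the optional field named above.
struct Job {
    input_file_ids: Option<Vec<i64>>,
}

/// Resolves a job's declared inputs to the paths that an output file's
/// `prov:wasDerivedFrom` would reference. IDs with no known path are skipped.
fn derived_from(job: &Job, paths_by_file_id: &HashMap<i64, String>) -> Vec<String> {
    job.input_file_ids
        .as_deref()
        .unwrap_or_default()
        .iter()
        .filter_map(|file_id| paths_by_file_id.get(file_id).cloned())
        .collect()
}
```

A job with no declared inputs yields an empty lineage list, so its outputs would carry no `prov:wasDerivedFrom` links under this assumption.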

Why I Did Not Add a Mapping Layer

I did not keep the old storage model and export through a conversion layer because that would have
preserved internal semantics the data team explicitly does not want. A mapper would be useful only
if Torc still needed to support both formats as first-class outputs, and the assignment did not call
for that.

Why I Did Not Change the Database Schema

The database already stores RO-Crate metadata as JSON strings plus a few indexing fields
(workflow_id, file_id, entity_id, entity_type). That was already flexible enough for the
new model.

Changing the schema would not have improved provenance quality. It would only have increased risk
and migration cost for no practical gain.

Validation Status

Validated directly:

RO-Crate generator unit tests for file entities and CreateAction entities
workflow export/import unit tests for job-ID remapping
WSL build for the client/default-feature path

Partially blocked in this worktree:

full server-feature integration validation
end-to-end RO-Crate integration tests that require the feature-gated server binary path

Those failures were not caused by the RO-Crate logic itself. This workspace already has unrelated
server-feature build issues and test-harness assumptions about feature-gated binaries.

Known Follow-Ups

If this needs to be production-hardened further, the next useful follow-ups are:

decide whether workflow plan typing should remain SoftwareApplication + prov:Plan or move to a more domain-specific plan entity later
decide whether script-level agents should be auto-generated beyond Torc's own binaries

Adopt the data team's PROV-shaped RO-Crate metadata as Torc's canonical generation and export format. Update file, job, software, workflow, and run provenance entities to use the new relationships and type arrays. Adjust export/import remapping, refresh the RO-Crate docs, and add a rationale document covering the design choices and assumptions behind the change.


Copilot

Pull request overview

This PR switches Torc’s RO-Crate provenance generation/export to a PROV-shaped model so stored metadata and exported ro-crate-metadata.json align with the requested PROV interpretation.

Changes:

Update generated File/CreateAction/Software entities to use PROV properties and @type arrays (e.g., prov:wasGeneratedBy, prov:Activity, prov:SoftwareAgent).
Add workflow-level provenance entities (#torc-workflow, #torc-run-{run_id}) and ensure export can synthesize them when missing.
Update tests and documentation to reflect the PROV-shaped RO-Crate output and access-group naming changes.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.


File	Description
tests/test_workflow_export.rs	Updates job-id remapping test to use `prov:wasGeneratedBy` and PROV `@type` arrays.
tests/test_auto_ro_crate.rs	Adjusts auto-generation assertions for PROV-shaped metadata and synthetic workflow/run entities.
tests/test_access_groups.rs	Renames “data-team” to “analytics-team” in tests and helper setup.
tests/common.rs	Updates access-control fixture docs to reflect “Analytics team” naming.
src/server/api/ro_crate.rs	Updates server-side input-file entity generation to PROV `@type` arrays and adds hashing/size fields; adds workflow provenance entity upsert logic.
src/client/workflow_manager.rs	Creates workflow provenance entities during input-file initialization path.
src/client/ro_crate_utils.rs	Shifts client-side entity builders to PROV shape; adds workflow plan/run entity builders and provenance links (`prov:used`, `prov:wasDerivedFrom`, etc.).
src/client/commands/workflow_export.rs	Updates ID remapping tests/docs to remap `prov:wasGeneratedBy`.
src/client/commands/ro_crate.rs	Changes export assembly to preserve stored `@id/@type`, synthesize workflow/run entities, add `localEvidenceGraph`, and emit PROV context.
docs/src/specialized/admin/access-groups-tutorial.md	Renames “Data Team” to “Analytics Team” in the tutorial examples.
docs/src/core/how-to/ro-crate-metadata.md	Updates how-to to describe PROV-shaped export and context array.
docs/src/core/concepts/ro-crate.md	Updates conceptual docs for PROV-shaped entities and new provenance relationships.
.gitignore	Ignores `/tmp`.
.github/workflows/lint.yml	Ensures `DATABASE_URL` is set for the OpenAPI codegen parity test step.


Copilot · 2026-04-18T15:42:04Z

+    let run_entity =
+        build_workflow_run_entity(workflow_id, run_id, workflow_name, Utc::now(), None);
+    create_or_update_entity_by_entity_id(config, workflow_id, run_entity);


create_workflow_provenance_entities always builds the run entity with Utc::now() as startTime and then updates the existing #torc-run-{run_id} entity if present. Since this function is called more than once (e.g., from job completion), the run's startTime will drift forward over time and no longer represent when the run actually started. Preserve an existing startTime on update (only set it when inserting), or pass an explicit run start timestamp captured once at run start.

Suggested change

-    let run_entity =
-        build_workflow_run_entity(workflow_id, run_id, workflow_name, Utc::now(), None);
-    create_or_update_entity_by_entity_id(config, workflow_id, run_entity);
+    let run_entity_id = format!("#torc-run-{}", run_id);
+    if find_entity_by_entity_id(config, workflow_id, &run_entity_id).is_none() {
+        let run_entity =
+            build_workflow_run_entity(workflow_id, run_id, workflow_name, Utc::now(), None);
+        create_or_update_entity_by_entity_id(config, workflow_id, run_entity);
+    }

The start time is now read from the existing entity and falls back to Utc::now() only when the RO-Crate entry has no valid startTime datetime.

Copilot

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 3 comments.


arjlai221

Submitting the pending review so I can add inline responses on the existing threads.

arjlai221 · 2026-04-21T16:17:35Z

+    let run_entity =
+        build_workflow_run_entity(workflow_id, run_id, workflow_name, Utc::now(), None);
+    create_or_update_entity_by_entity_id(config, workflow_id, run_entity);


The start time is now read from the existing entity and falls back to Utc::now() only when the RO-Crate entry has no valid startTime datetime.

arjlai221 · 2026-04-30T22:47:36Z

@daniel-thom addressed your comments over the last 4 commits

Copilot

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

src/client/commands/ro_crate.rs:1

The Export command still accepts a format, but this change drops it at the call site and removes it from handle_export. That means ro-crate export --format json now always emits a full RO-Crate document instead of the raw entity list that callers previously requested.

use std::io::Read;


arjlai221 · 2026-05-05T00:25:34Z

+    let mut existing_ids: HashSet<String> = entities.iter().map(|e| e.entity_id.clone()).collect();
+    let run_entity_id = format!("#torc-run-{}", run_id);
+    let mut synthetic_entities: Vec<serde_json::Value> = Vec::new();
+
+    if !existing_ids.contains("#torc-workflow") {
+        synthetic_entities.push(serde_json::json!({
+            "@id": "#torc-workflow",
+            "@type": ["SoftwareApplication", "prov:Plan"],
+            "name": workflow_name.clone()
+        }));
+        existing_ids.insert("#torc-workflow".to_string());
+    }
+
+    if !existing_ids.contains(&run_entity_id) {
+        let run_entity = serde_json::json!({
+            "@id": run_entity_id.clone(),
+            "@type": ["CreateAction", "prov:Activity"],
+            "name": format!("{} Run {}", workflow_name, run_id),
+            "prov:hadPlan": { "@id": "#torc-workflow" },
+            "instrument": { "@id": format!("#software-torc-run-{}", run_id) },
+            "prov:wasAssociatedWith": [
+                { "@id": format!("#software-torc-run-{}", run_id) },
+                { "@id": format!("#software-torc-server-run-{}", run_id) }
+            ]
+        });
+        synthetic_entities.push(run_entity);
+    }
+
+    // Build user and synthetic entities first so hasPart can include the final set.
+    let mut graph_entities: Vec<serde_json::Value> = synthetic_entities;
+    for entity in &entities {
+        if let Ok(mut parsed) = serde_json::from_str::<serde_json::Value>(&entity.metadata) {
+            if let Some(obj) = parsed.as_object_mut() {
+                obj.entry("@id".to_string())
+                    .or_insert_with(|| serde_json::json!(entity.entity_id));


no action taken; this problem only surfaces when accounting for backwards compatibility

arjlai221 · 2026-05-05T00:34:14Z

+    let start_time = existing_run_entity
+        .as_ref()
+        .and_then(|entity| parse_entity_datetime(entity, "startTime"))
+        .unwrap_or_else(Utc::now);
+    let end_time = existing_run_entity
+        .as_ref()
+        .and_then(|entity| parse_entity_datetime(entity, "endTime"));
+    let run_entity =
+        build_workflow_run_entity(workflow_id, run_id, workflow_name, start_time, end_time);
+    create_or_update_entity_by_entity_id(config, workflow_id, run_entity);


reworked the run entity path to isolate structure creation from timing application and preserve existing timing when updating.

arjlai221 · 2026-05-05T00:26:21Z

+        if self.workflow.enable_ro_crate == Some(true) {
+            crate::client::ro_crate_utils::create_workflow_provenance_entities(
+                &self.config,
+                self.workflow_id,
+                self.run_id,
+                &self.workflow.name,
+            );
+        }
+        crate::client::ro_crate_utils::create_software_entities(
+            &self.config,
+            self.workflow_id,
+            self.run_id,
+        );


removed the per-job duplication and moved workflow/software provenance creation to worker startup instead of job completion

arjlai221 · 2026-05-05T00:26:53Z

+        let run_id = self.get_run_id().unwrap_or(0);
+        crate::client::ro_crate_utils::create_workflow_provenance_entities(
+            &self.config,
+            self.workflow_id,
+            run_id,
+            &workflow.name,
+        );


fixed; provenance creation is skipped when run_id lookup fails instead of silently fabricating run 0
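
A minimal sketch of the skip-on-failure behavior described in this reply, under the assumption that the run ID lookup is fallible. `ProvenanceOutcome` and `create_run_provenance` are hypothetical names used only for illustration, not the functions in this PR.

```rust
/// Outcome of the provenance step, so callers can log or test the decision.
#[derive(Debug, PartialEq)]
enum ProvenanceOutcome {
    Created { run_entity_id: String },
    Skipped { reason: String },
}

/// Creates run provenance only when the run ID lookup succeeded.
/// `lookup` stands in for the result of a fallible call such as `get_run_id()`.
fn create_run_provenance(lookup: Result<i64, String>) -> ProvenanceOutcome {
    match lookup {
        Ok(run_id) => ProvenanceOutcome::Created {
            run_entity_id: format!("#torc-run-{}", run_id),
        },
        // No fallback to run 0: a fabricated run would attribute files
        // to a run that never existed.
        Err(reason) => ProvenanceOutcome::Skipped { reason },
    }
}
```

Returning an explicit `Skipped` value instead of defaulting keeps the failure visible to the caller, which can then log a warning.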

arjlai221 · 2026-05-05T00:28:36Z

+    let run_id = match apis::workflows_api::get_workflow_status(config, workflow_id) {
+        Ok(status) => status.run_id,
+        Err(e) => {
+            print_error("getting workflow status", &e);
+            std::process::exit(1);
+        }
+    };


decided that successful workflow lookup was required for export; ignoring this edge case

arjlai221 · 2026-05-05T00:29:14Z

    let ro_crate = serde_json::json!({
        "@context": [
            "https://w3id.org/ro/crate/1.1/context",
-            {"torc": "https://github.com/NatLabRockies/torc/terms/"}
+            {
+                "prov": "http://www.w3.org/ns/prov#",
+                "torc": "https://github.com/NatLabRockies/torc/terms/"
+            }
        ],
        "@graph": graph


code is correct, ignoring for now

arjlai221 · 2026-05-05T00:30:25Z

+        let input_file_paths: Vec<String> = job
+            .input_file_ids
+            .clone()
+            .unwrap_or_default()
+            .into_iter()
+            .filter_map(|file_id| {
+                match self.send_with_retries(|| {
+                    Self::box_retry_error(apis::files_api::get_file(&self.config, file_id))
+                }) {
+                    Ok(file) => Some(file.path),
+                    Err(e) => {
+                        warn!(
+                            "Could not fetch input file {} for RO-Crate creation on job {}: {}",
+                            file_id, job_id, e
+                        );
+                        None
+                    }
+                }
+            })
+            .collect();


applied a small local optimization, but did not introduce a larger batched API/design change. A real fix would require batched file lookup or carrying resolved paths from earlier job setup. That is beyond the intended scope here. @daniel-thom Open a new issue for this problem?

The main problem here is that the code is sending get_file to the server for the same file over and over. Consider the fan-in case suggested by Copilot. If there are 100k jobs that all use the same input file, we are going to send this API command 100k times. We could cache the provenance information for the input files in the job_runner's memory.

A separate issue is whether we need a list_files(file_ids) API command. I'm not sure we do.
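
The in-memory caching idea suggested above might be sketched as follows. This is an illustrative assumption, not the cache that was later committed: `FilePathCache` is a hypothetical type, and the `fetch` closure stands in for the `get_file` server call.

```rust
use std::collections::HashMap;

/// In-memory cache of file paths keyed by file ID. The fetch closure
/// stands in for the server call; its signature here is an assumption.
struct FilePathCache {
    paths: HashMap<i64, String>,
}

impl FilePathCache {
    fn new() -> Self {
        Self { paths: HashMap::new() }
    }

    /// Returns the cached path, calling `fetch` only on the first request
    /// for a given file ID. Failed lookups are not cached, so they are retried.
    fn get_or_fetch<F>(&mut self, file_id: i64, fetch: F) -> Result<String, String>
    where
        F: FnOnce(i64) -> Result<String, String>,
    {
        if let Some(path) = self.paths.get(&file_id) {
            return Ok(path.clone());
        }
        let path = fetch(file_id)?;
        self.paths.insert(file_id, path.clone());
        Ok(path)
    }
}
```

In the fan-in case, every job that shares an input file after the first one would be served from the map, so the number of server calls scales with distinct files rather than with jobs.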

lai25 and others added 2 commits April 9, 2026 08:38

AD-324: fix RO-Crate Linux validation fallout

aa21ca9

arjlai221 requested a review from daniel-thom April 9, 2026 16:43

lai25 and others added 4 commits April 13, 2026 09:47

Merge branch 'main' into AD-324-ro-create-mods-for-naerm-data-team

a2f9e88

resolving merge conflicts

5dee39b

Merge branch 'AD-324-ro-create-mods-for-naerm-data-team' of https://g…

5d918b0

…ithub.com/NatLabRockies/torc into AD-324-ro-create-mods-for-naerm-data-team

AD-324: format RO-Crate utils after pull

be7f28e

arjlai221 changed the title ~~Ad 324 ro create mods for naerm data team~~ AD-324: Switch RO-Crate provenance export to a PROV-shaped model Apr 13, 2026

lai25 added 3 commits April 13, 2026 11:09

removing references to data team

a67836f

Merge branch 'AD-324-ro-create-mods-for-naerm-data-team' of https://g…

eb8c0ab

…ithub.com/NatLabRockies/torc into AD-324-ro-create-mods-for-naerm-data-team

AD-324: Stop tracking tmp artifacts

f2fe824

Remove the accidentally committed tmp workspace files from the index while keeping them on disk locally. Keep /tmp in .gitignore so future scratch notes and examples stay untracked by default.

daniel-thom requested a review from Copilot April 18, 2026 15:35

Copilot started reviewing on behalf of daniel-thom April 18, 2026 15:36

Copilot AI reviewed Apr 18, 2026

daniel-thom reviewed Apr 20, 2026

Comment thread docs/src/specialized/admin/access-groups-tutorial.md Outdated

Comment thread src/client/ro_crate_utils.rs Outdated

wip

3b000fb

daniel-thom reviewed Apr 21, 2026

Comment thread reviews/pr_262_comment_response_report.md Outdated

fixing pipeline failures

0581f58

arjlai221 force-pushed the AD-324-ro-create-mods-for-naerm-data-team branch from 19fb680 to 0581f58 Compare April 21, 2026 01:22

arjlai221 added 6 commits April 20, 2026 18:37

fix client test api compatibility

1c30d32

reverting unintentional AI mods

597db62

failing test

ceffce6

Refine RO-Crate export behavior

aad2da7

merging with main

0a94a9f

removing old run_id parameter, adding data flow design doc

4474bec

daniel-thom requested a review from Copilot April 21, 2026 23:28

Copilot started reviewing on behalf of daniel-thom April 21, 2026 23:29

Copilot AI reviewed Apr 21, 2026

Comment thread tests/common.rs Outdated

Comment thread tests/common.rs Outdated

Comment thread tests/common.rs

daniel-thom requested changes Apr 24, 2026

arjlai221 commented Apr 27, 2026

review mods

1d6f7f2

daniel-thom reviewed Apr 27, 2026

Comment thread docs/src/specialized/admin/access-groups-tutorial.md Outdated

Comment thread docs/src/specialized/design/ro-crate.md

Comment thread docs/src/specialized/design/ro-crate.md Outdated

Comment thread docs/src/specialized/design/ro-crate.md Outdated

arjlai221 added 2 commits April 28, 2026 09:11

docs update

febed43

format md file

63708ac

arjlai221 added 2 commits May 4, 2026 10:21

move provenance creation to workflow creation

05d41ac

Merge branch 'main' into AD-324-ro-create-mods-for-naerm-data-team

a98f4b3

daniel-thom requested a review from Copilot May 4, 2026 18:56

Copilot AI reviewed May 4, 2026

applying updates from main branch

84fe49e

Copilot started reviewing on behalf of daniel-thom May 4, 2026 20:20

arjlai221 added 4 commits May 4, 2026 13:32

fixing compilation errors

22ad32a

using #201 improvements for entity id filtering

f483f4f

refine RO-Crate provenance creation and run timing updates

b8666fe

implement cache for getting file model

0977f9a

daniel-thom approved these changes May 5, 2026

arjlai221 merged commit 547b8f6 into main May 5, 2026
9 checks passed
