AD-324: Switch RO-Crate provenance export to a PROV-shaped model #262
Merged
27 commits
c0d53b9
AD-324: Switch RO-Crate generation to PROV model
aa21ca9
AD-324: fix RO-Crate Linux validation fallout
a2f9e88
Merge branch 'main' into AD-324-ro-create-mods-for-naerm-data-team
5dee39b
resolving merge conflicts
5d918b0
Merge branch 'AD-324-ro-create-mods-for-naerm-data-team' of https://g…
be7f28e
AD-324: format RO-Crate utils after pull
a67836f
removing references to data team
eb8c0ab
Merge branch 'AD-324-ro-create-mods-for-naerm-data-team' of https://g…
f2fe824
AD-324: Stop tracking tmp artifacts
3b000fb
wip
0581f58
fixing pipeline failures
arjlai221 1c30d32
fix client test api compatibility
arjlai221 597db62
reverting unintentional AI mods
arjlai221 ceffce6
failing test
arjlai221 aad2da7
Refine RO-Crate export behavior
arjlai221 0a94a9f
merging with main
arjlai221 4474bec
removing old run_id parameter, adding data flow design doc
arjlai221 1d6f7f2
review mods
arjlai221 febed43
docs update
arjlai221 63708ac
format md file
arjlai221 05d41ac
move provenance creation to workflow creation
arjlai221 a98f4b3
Merge branch 'main' into AD-324-ro-create-mods-for-naerm-data-team
arjlai221 84fe49e
applying updates from main branch
arjlai221 22ad32a
fixing compiliation errors
arjlai221 f483f4f
using #201 improvements for entity id filtering
arjlai221 b8666fe
refine RO-Crate provenance creation and run timing updates
arjlai221 0977f9a
implement cache for getting file model
arjlai221
@@ -38,3 +38,4 @@ output
torc_output
.mcp.json
.dprint-cache/
/tmp
@@ -0,0 +1,138 @@
# RO-Crate Generation Design

This page describes how Torc creates and updates automatic RO-Crate provenance entities in the
current branch.

## Current Model

The important identity rules are:

- Workflow plan entity: one per workflow, `#torc-workflow`
- Workflow run entity: one per run, `#torc-run-{run_id}`
- Torc software entities: one per run, `#software-{binary_name}-run-{run_id}`
- Job execution entities: one per job attempt, `#job-{job_id}-attempt-{attempt_id}`
- File entities: one per file record/path, updated in place across runs

That last point is why `build_file_entity()` does not take `run_id`. Plain file entities are not
modeled as run-scoped records. Run-scoped provenance is attached through relationships:

- Output files link to the workflow run with `prov:wasAttributedTo`
- Output files link to the producing job with `prov:wasGeneratedBy`
- Job `CreateAction` entities link to the run with `isPartOf`
- Job `CreateAction` entities link to software agents with `instrument` and `prov:wasAssociatedWith`
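
As a rough illustration of the file-side relationships listed above, the sketch below assembles the run-scoped links for a single output file. The struct `OutputFileLinks` and the function `link_output_file` are hypothetical names chosen for this example and are not taken from the Torc code base; only the property names and id patterns come from this page, and the integer id types are an assumption.

```rust
/// Hypothetical container for the run-scoped links on one output file entity.
#[derive(Debug, PartialEq)]
struct OutputFileLinks {
    /// The file path is the entity id, so the id itself carries no run_id.
    entity_id: String,
    /// Target of `prov:wasAttributedTo` (the workflow run entity).
    was_attributed_to: String,
    /// Target of `prov:wasGeneratedBy` (the job attempt's CreateAction).
    was_generated_by: String,
}

/// Build the links for one output file produced by one job attempt in one run.
fn link_output_file(path: &str, run_id: i64, job_id: i64, attempt_id: i64) -> OutputFileLinks {
    OutputFileLinks {
        entity_id: path.to_string(),
        was_attributed_to: format!("#torc-run-{run_id}"),
        was_generated_by: format!("#job-{job_id}-attempt-{attempt_id}"),
    }
}
```

Calling `link_output_file` again for a later run yields the same `entity_id` with different link targets, which is the separation between stable identity and run-specific state that this section describes.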

If `run_id` were written directly into the base file entity metadata again, it would mix a stable
file identity with run-specific state. The current code instead keeps file identity stable and
updates the same file entity as a file moves from "input known at initialization" to "output with
provenance after job completion".

This design is also consistent with the multi-run behavior covered by
`test_auto_ro_crate_second_run_replaces_entities`: file entities are replaced in place, while
software and job execution entities accumulate across runs and attempts.
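
The identity rules listed under Current Model can be written down as small formatting helpers. This is only a sketch: the function and constant names are hypothetical, the `i64` id types are an assumption, and only the id patterns themselves come from this page.

```rust
/// Fixed id of the workflow plan entity (one per workflow).
const WORKFLOW_ENTITY_ID: &str = "#torc-workflow";

/// Workflow run entity: one per run.
fn run_entity_id(run_id: i64) -> String {
    format!("#torc-run-{run_id}")
}

/// Torc software entity: scoped to a binary name and a run.
fn software_entity_id(binary_name: &str, run_id: i64) -> String {
    format!("#software-{binary_name}-run-{run_id}")
}

/// Job execution entity: scoped to a job attempt.
fn job_entity_id(job_id: i64, attempt_id: i64) -> String {
    format!("#job-{job_id}-attempt-{attempt_id}")
}

/// File entity: keyed by path only. There is deliberately no run_id
/// parameter, mirroring the fact that `build_file_entity()` does not take one.
fn file_entity_id(path: &str) -> String {
    path.to_string()
}
```

Because `file_entity_id` has no run or attempt argument, two runs that touch the same path necessarily resolve to the same entity id, while the other helpers produce a new id for each run or attempt.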

## Entity Creation Flow

```mermaid
flowchart TD
A[Workflow initialize_jobs] --> B{enable_ro_crate?}
A --> C[Server creates<br/>#software-torc-server-run-N]
A --> D[Client attempts to create<br/>#software-torc-run-N<br/>and optional<br/>#software-torc-slurm-job-runner-run-N]

B -->|yes| E[Server upserts input File entities<br/>from DB rows with st_mtime]
B -->|yes| F[Client creates or updates<br/>#torc-workflow and #torc-run-N]
B -->|yes| G[Client creates or updates<br/>input File entities]
B -->|no| H[No automatic file provenance]

G --> I[Workflow execution]
E --> I
F --> I
C --> I
D --> I

I --> J[Job completes successfully]
J --> J2{Job has output files?}
J2 -->|yes| K[Client refreshes<br/>#torc-workflow and #torc-run-N]
J2 -->|yes| L[Client creates<br/>#job-job_id-attempt-attempt_id]
J2 -->|yes| M[Client creates or updates<br/>output File entity]
J2 -->|no| P[No additional automatic<br/>RO-Crate entities for this job]

L --> N[Job CreateAction metadata]
N --> N1[prov:hadPlan -> #torc-workflow]
N --> N2[isPartOf -> #torc-run-N]
N --> N3[instrument -> #software-torc-run-N]
N --> N4[prov:used -> input file paths]
N --> N5[result -> output file paths]

M --> O[Output File metadata]
O --> O1[prov:wasGeneratedBy -> job CreateAction]
O --> O2[prov:wasAttributedTo -> #torc-run-N]
O --> O3[prov:wasDerivedFrom -> input file paths]

classDef init fill:#dbeafe,stroke:#1d4ed8,color:#0f172a,stroke-width:2px;
classDef software fill:#dcfce7,stroke:#15803d,color:#0f172a,stroke-width:2px;
classDef input fill:#fef3c7,stroke:#b45309,color:#0f172a,stroke-width:2px;
classDef run fill:#ede9fe,stroke:#6d28d9,color:#0f172a,stroke-width:2px;
classDef job fill:#fee2e2,stroke:#b91c1c,color:#0f172a,stroke-width:2px;
classDef output fill:#cffafe,stroke:#0f766e,color:#0f172a,stroke-width:2px;
classDef disabled fill:#e5e7eb,stroke:#4b5563,color:#111827,stroke-dasharray: 5 3;

class A,I,J init;
class C,D software;
class E,G input;
class F,K run;
class L,N,N1,N2,N3,N4,N5 job;
class M,O,O1,O2,O3 output;
class H,P disabled;
```

## What Gets Created

### Torc binaries

- The server always creates `#software-torc-server-run-{run_id}` during `initialize_jobs()`
- The client attempts to create run-scoped software entities for `torc` and, on Linux,
  `torc-slurm-job-runner`
- Client-side software entities are skipped when the corresponding binary cannot be found next to
  the current executable or on `PATH`
- These are `SoftwareApplication` plus `prov:SoftwareAgent`
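
One way the client-side lookup described above could work is sketched below using only the Rust standard library. This is not the actual Torc implementation: `find_binary_in`, `find_binary`, `client_binaries`, and `software_entity_for` are hypothetical names, and the sketch ignores platform details such as the `.exe` suffix on Windows.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Look for `name` in `exe_dir` first, then in each directory of `path_dirs`.
fn find_binary_in(name: &str, exe_dir: Option<&Path>, path_dirs: &[PathBuf]) -> Option<PathBuf> {
    exe_dir
        .into_iter()
        .chain(path_dirs.iter().map(PathBuf::as_path))
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

/// Resolve `name` next to the running executable or on `PATH`.
fn find_binary(name: &str) -> Option<PathBuf> {
    let exe = env::current_exe().ok();
    let exe_dir = exe.as_deref().and_then(Path::parent);
    let path_dirs = env::var_os("PATH")
        .map(|p| env::split_paths(&p).collect::<Vec<PathBuf>>())
        .unwrap_or_default();
    find_binary_in(name, exe_dir, &path_dirs)
}

/// Client-side binaries to consider: `torc` always, the Slurm runner only on Linux.
fn client_binaries() -> Vec<&'static str> {
    let mut names = vec!["torc"];
    if cfg!(target_os = "linux") {
        names.push("torc-slurm-job-runner");
    }
    names
}

/// Return the entity id to create, or `None` when the binary is missing
/// and the software entity should be skipped.
fn software_entity_for(name: &str, run_id: i64) -> Option<String> {
    find_binary(name).map(|_| format!("#software-{name}-run-{run_id}"))
}
```

Splitting the search into `find_binary_in` keeps the directory logic testable without depending on the real `PATH` of the machine running the tests.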

### Jobs

- The client creates one `CreateAction` per successful job completion **that has at least one output
  file**
- The entity id is `#job-{job_id}-attempt-{attempt_id}`
- Jobs with no output files currently do not emit an automatic `CreateAction`
- When present, the job entity is the main join point between inputs, outputs, workflow run, and
  software agents
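
The "at least one output file" rule can be expressed as a builder that returns `Option`. The sketch below is an assumption-laden illustration, not Torc's code: `JobCreateAction` and `build_job_create_action` are hypothetical names, and the fields mirror the relationships shown in the flow diagram (`prov:hadPlan`, `isPartOf`, `instrument`, `prov:used`, `result`).

```rust
/// Hypothetical in-memory form of a job `CreateAction` entity.
#[derive(Debug, PartialEq)]
struct JobCreateAction {
    entity_id: String,
    had_plan: String,    // prov:hadPlan -> workflow plan entity
    is_part_of: String,  // isPartOf -> workflow run entity
    instrument: String,  // instrument -> client software entity
    used: Vec<String>,   // prov:used -> input file paths
    result: Vec<String>, // result -> output file paths
}

/// Build the CreateAction for a successful job attempt, or `None`
/// when the job produced no output files.
fn build_job_create_action(
    job_id: i64,
    attempt_id: i64,
    run_id: i64,
    inputs: &[String],
    outputs: &[String],
) -> Option<JobCreateAction> {
    if outputs.is_empty() {
        return None;
    }
    Some(JobCreateAction {
        entity_id: format!("#job-{job_id}-attempt-{attempt_id}"),
        had_plan: "#torc-workflow".to_string(),
        is_part_of: format!("#torc-run-{run_id}"),
        instrument: format!("#software-torc-run-{run_id}"),
        used: inputs.to_vec(),
        result: outputs.to_vec(),
    })
}
```

Returning `Option` makes the "no outputs, no entity" case explicit at the call site instead of relying on an empty `result` list.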

### Input files

- Input files are detected by `st_mtime IS NOT NULL`
- During initialization, both the server and the client currently upsert the same input file entity
- The entity is keyed by workflow and `file_id`, with `entity_id = file.path`
- Input file entities are expected to exist before jobs run, but the code does not rely on them
  being create-only; it is intentionally upsert-based
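
The detection and upsert rules above can be sketched with an in-memory map standing in for the database. Everything here is illustrative: `FileRecord`, `FileEntity`, `EntityStore`, and `upsert_input_files` are hypothetical names, and the field types are assumptions rather than the real schema.

```rust
use std::collections::HashMap;

/// Hypothetical file row; `st_mtime` is `None` for files not yet on disk.
#[derive(Debug, Clone, PartialEq)]
struct FileRecord {
    workflow_id: i64,
    file_id: i64,
    path: String,
    st_mtime: Option<f64>,
}

/// Hypothetical stored entity; `entity_id` is the file path.
#[derive(Debug, Clone, PartialEq)]
struct FileEntity {
    entity_id: String,
    st_mtime: f64,
}

/// Entities keyed by (workflow_id, file_id), as described above.
type EntityStore = HashMap<(i64, i64), FileEntity>;

/// Upsert one entity per input file, where "input" means `st_mtime` is set.
fn upsert_input_files(store: &mut EntityStore, files: &[FileRecord]) {
    for file in files {
        // Equivalent of `st_mtime IS NOT NULL`: skip rows without a timestamp.
        if let Some(mtime) = file.st_mtime {
            let entity = FileEntity { entity_id: file.path.clone(), st_mtime: mtime };
            // `insert` overwrites an existing key, which gives upsert semantics.
            store.insert((file.workflow_id, file.file_id), entity);
        }
    }
}
```

Because the operation is an overwrite on a fixed key, it does not matter whether the server or the client runs it first, or whether both do: the store ends up with a single entity per file either way.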

### Output files

- Output file entities are created or replaced after a job succeeds and the file record has been
  refreshed
- If a file already had an entity from initialization or a prior run, the same DB row is updated
  rather than creating a new file entity for each run
- Run-specific provenance is recorded in the metadata relationships, not by giving the file entity a
  run-specific identity
- The same successful-job path also refreshes `#torc-workflow`, refreshes `#torc-run-{run_id}`, and
  creates the job `CreateAction`, but only when there is at least one output file to process

## Important Asymmetries

- Software entities are run-scoped and accumulate across runs
- Job `CreateAction` entities are attempt-scoped and accumulate across attempts
- File entities are file-scoped and are replaced in place across runs

These asymmetries are intentional and match `tests/test_auto_ro_crate.rs`, especially
`test_auto_ro_crate_second_run_replaces_entities`, which expects:

- file entity count to stay stable across runs
- software entity count to grow across runs
- output file provenance to point at the newer `#torc-run-{run_id}`
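
The three expectations above follow directly from how the entity ids are scoped. The toy model below is not the real test and does not use Torc's API: `Crate`, `record_run`, and `count_with_prefix` are hypothetical names, and the crate is reduced to a map from entity id to the target of `prov:wasAttributedTo`.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory crate: entity id -> `prov:wasAttributedTo` target, if any.
type Crate = BTreeMap<String, Option<String>>;

/// Record the entities one run would write for a single output file.
fn record_run(entities: &mut Crate, run_id: i64, output_path: &str) {
    // Software ids embed the run id, so each run adds a new key.
    entities.insert(format!("#software-torc-server-run-{run_id}"), None);
    // File ids are just the path, so a later run overwrites the same key.
    entities.insert(output_path.to_string(), Some(format!("#torc-run-{run_id}")));
}

/// Count entities whose id starts with `prefix`.
fn count_with_prefix(entities: &Crate, prefix: &str) -> usize {
    entities.keys().filter(|id| id.starts_with(prefix)).count()
}
```

Running `record_run` for run 1 and then run 2 against the same output path leaves one file entry attributed to `#torc-run-2` and two software entries, which is the shape of the assertions listed above.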