
feat(scan): add opt-in transitive reference scanning #225

Open
rodboev wants to merge 6 commits into NVIDIA:main from rodboev:pr/transitive-external-reference-scanning

Conversation

rodboev commented Jun 29, 2026

Contributor

Summary

Adds opt-in transitive scanning for source references inside scanned skill content, including normalized trust-prefix checks, bounded traversal budgets, shared dependency result reuse across sibling roots, and source-aware reporting.

Closes #97

Root cause

The current scan pipeline resolves one input into one local directory, builds context from that tree, and writes the graph result from that single invocation. Recursive mode only repeats that one-hop scan for immediate local child skill directories, and report emitters assume every finding belongs to the directly requested source. The original transitive branch added follow-up scanning on top of that model, but it still split normalization, caching, and merged coverage across separate helpers. That split left four problems unresolved: path-trust escapes, unbounded fan-out, suppressed sibling results, and path-only report merging.

Diff Notes

  • Add a shared transitive traversal module that extracts source-like external references from file_cache, filters out adjacent non-source URLs, canonicalizes source identities with dot-segment and unreserved percent-decoding normalization, enforces allow or deny prefix checks on normalized path boundaries, and owns per-root visited-state mutation.
  • Add --transitive, --transitive-depth, and repeated allow or deny prefix controls to skillspector scan, then route both single-skill and recursive multi-skill entrypoints through a cached traversal helper.
  • Replace the breadth-only traversal with explicit per-scan budgets for target count, scanned bytes, and elapsed time, and surface truncation in the merged report metadata when traversal stops early.
  • Reuse cached child scan results across referring roots instead of sharing only a visited set, so sibling skills still receive findings and provenance from the same dependency without rescanning it.
  • Reuse InputHandler for approved external targets so existing host allowlists, SSRF checks, clone or download handling, and archive protections stay authoritative, including archive URLs on allowed Git hosts.
  • Add transitive_depth and source_url to Finding, preserve the root cleanup path through both the merged and zero-depth transitive paths, keep baseline fingerprints scoped by provenance, and merge component coverage with source-aware keys so same-named dependency files do not collapse or appear skipped.
  • Restore the report tests that had fallen out of collection, then add focused coverage for normalized prefix checks, bounded traversal truncation, sibling cache reuse, source-aware coverage, transitive failure isolation, and report provenance.
  • Replay the branch onto current main.
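
To illustrate the normalized prefix check from the first bullet, here is a minimal sketch using only the standard library. It is not the code in src/skillspector/transitive.py; the names `normalize_source`, `has_prefix`, and `is_allowed` are hypothetical, and the rule that deny takes precedence over allow is an assumption.

```python
import posixpath
import re
from urllib.parse import urlsplit

# RFC 3986 unreserved characters: safe to percent-decode before comparing.
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _decode_unreserved(path: str) -> str:
    """Decode %XX only when it encodes an unreserved character."""
    def repl(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(repl, path)


def normalize_source(url: str) -> str:
    """Hypothetical canonical identity: lowercased scheme and host,
    decoded unreserved characters, dot segments removed, no query or fragment."""
    parts = urlsplit(url)
    path = "/" + _decode_unreserved(parts.path).lstrip("/")
    path = posixpath.normpath(path)  # resolves "." and ".." segments
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def has_prefix(identity: str, prefix: str) -> bool:
    """Match only on a path boundary, so /org/trusted does not match /org/trusted-evil."""
    prefix = prefix.rstrip("/")
    return identity == prefix or identity.startswith(prefix + "/")


def is_allowed(url: str, allow: list[str], deny: tuple[str, ...] = ()) -> bool:
    """Assumption: deny prefixes win over allow prefixes."""
    identity = normalize_source(url)
    if any(has_prefix(identity, normalize_source(p)) for p in deny):
        return False
    return any(has_prefix(identity, normalize_source(p)) for p in allow)
```

Decoding happens before dot-segment removal so that an encoded `%2e%2e` segment is resolved like a literal `..` rather than slipping past a plain string comparison.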

Scope

This change is limited to source types already supported by InputHandler, normalized trust-prefix checks, traversal depth plus hard fan-out limits, allow or deny prefix controls, cache reuse across recursive siblings, and provenance in reports. It does not add a web crawler, new allowed hosts, MCP behavior, or any default behavior change when --transitive is absent.
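
As a rough illustration of the depth and fan-out limits in scope, the sketch below bounds a traversal by depth, target count, scanned bytes, and elapsed time. The `Budget` and `TraversalResult` types, their default values, and the `scan` callback signature are all hypothetical and not taken from the branch. Treating the depth limit as a silent stop rather than a reported truncation is also an assumption.

```python
from __future__ import annotations

import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Budget:
    # Hypothetical defaults chosen only for illustration.
    max_depth: int = 2
    max_targets: int = 20
    max_bytes: int = 50_000_000
    max_seconds: float = 120.0


@dataclass
class TraversalResult:
    scanned: list[str] = field(default_factory=list)
    truncated_reason: Optional[str] = None  # set when a budget stops the walk early


def traverse(
    root: str,
    scan: Callable[[str], tuple[int, list[str]]],
    budget: Budget,
    clock: Callable[[], float] = time.monotonic,
) -> TraversalResult:
    """Walk references from root; scan(target) returns (bytes_scanned, child_refs)."""
    started = clock()
    result = TraversalResult()
    queue = deque([(root, 0)])
    visited = {root}
    total_bytes = 0
    while queue:
        # Check every budget before doing more work.
        if len(result.scanned) >= budget.max_targets:
            result.truncated_reason = "target_count"
            break
        if total_bytes >= budget.max_bytes:
            result.truncated_reason = "scanned_bytes"
            break
        if clock() - started >= budget.max_seconds:
            result.truncated_reason = "elapsed_time"
            break
        target, depth = queue.popleft()
        size, children = scan(target)
        result.scanned.append(target)
        total_bytes += size
        if depth >= budget.max_depth:
            continue  # depth limit: scan this target but do not follow its references
        for child in children:
            if child not in visited:
                visited.add(child)
                queue.append((child, depth + 1))
    return result
```

Recording a `truncated_reason` lets a report say why traversal stopped instead of silently presenting a partial result as complete.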

Verification

  • python -m pytest tests/unit/test_transitive.py tests/unit/test_cli.py tests/nodes/test_report.py tests/nodes/test_sarif_rules_and_empty_findings.py -q - pass, 110 passed
  • uv run ruff check src/ tests/ - pass
  • uv run ruff format --check src/ tests/ - not clean locally; currently reports src/skillspector/nodes/analyzers/mcp_least_privilege.py
  • CI Lint & Test (Python 3.12), Lint & Test (Python 3.13), and DCO Check - pending maintainer approval and rerun after push

rng1995 left a comment

Collaborator

Requesting changes. Transitive prefix controls can be escaped through non-normalized URL paths, traversal has no fan-out budget, and shared visited state causes sibling skills to omit shared dependency findings. The merged report also misreports dependency coverage, several report tests are no longer collected, and the branch conflicts with current main. Please address the inline blockers and rebase while preserving current CLI, input-handler, and reporting behavior.
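
The coverage misreport mentioned above can be pictured with a small sketch. It assumes coverage is a mapping from file path to status, which may not match the real report structures; both function names and the sample URL are hypothetical.

```python
def merge_coverage_by_path(reports):
    """Path-only merge: same-named files from different sources collide."""
    merged = {}
    for _source_url, files in reports:
        for path, status in files.items():
            merged[path] = status  # a later source overwrites an earlier one
    return merged


def merge_coverage_source_aware(reports):
    """Source-aware merge: the key includes where the file came from."""
    merged = {}
    for source_url, files in reports:
        for path, status in files.items():
            merged[(source_url, path)] = status
    return merged


# Hypothetical input: a root skill and one dependency, both with a SKILL.md.
reports = [
    ("root", {"SKILL.md": "scanned"}),
    ("https://example.com/dep", {"SKILL.md": "skipped"}),
]
```

With the path-only merge, the root's SKILL.md ends up reported as skipped because the dependency's entry replaces it; the source-aware merge keeps both entries.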

Comment thread src/skillspector/transitive.py Outdated
Comment thread src/skillspector/cli.py
Comment thread src/skillspector/cli.py Outdated
Comment thread src/skillspector/cli.py
Comment thread tests/nodes/test_report.py
rodboev added 6 commits June 30, 2026 06:58
Signed-off-by: Rod Boev <rod.boev@gmail.com>
rodboev force-pushed the pr/transitive-external-reference-scanning branch from 47dfb07 to 4fb451c June 30, 2026 11:15
rodboev commented Jun 30, 2026

Contributor Author

Addressed the blockers on the transitive traversal path rather than patching each edge case in isolation. The branch now normalizes target identities before allow or deny checks, applies an explicit traversal budget, reuses cached child results per referring root instead of suppressing sibling findings behind one shared visited set, keeps dependency coverage source-aware in reporting, restores the report tests that had fallen out of pytest collection, and replays the whole change onto current main.
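
The difference between a shared visited set and cached child results can be sketched as follows. This is a simplified, one-hop illustration rather than the branch's traversal helper; the function names, the `refs` mapping, and the `scan` callback are hypothetical.

```python
def scan_with_shared_visited(roots, refs, scan):
    """Problem case: one visited set shared by every root."""
    visited = set()
    findings = {}
    for root in roots:
        findings[root] = []
        for dep in refs[root]:
            if dep in visited:
                continue  # a later sibling silently loses this dependency's findings
            visited.add(dep)
            findings[root].extend(scan(dep))
    return findings


def scan_with_result_cache(roots, refs, scan):
    """Fix sketch: cache results per dependency, track visits per root."""
    cache = {}
    findings = {}
    for root in roots:
        findings[root] = []
        seen_for_root = set()
        for dep in refs[root]:
            if dep in seen_for_root:
                continue  # avoid duplicates within a single root only
            seen_for_root.add(dep)
            if dep not in cache:
                cache[dep] = scan(dep)  # scanned once across all roots
            findings[root].extend(cache[dep])
    return findings
```

Both versions scan a shared dependency only once, but only the cached version attributes its findings to every root that references it.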

Development

Successfully merging this pull request may close these issues.

feat: transitive link scanning — follow external repos/URLs referenced inside skill files

2 participants