Skip to content

feat(ci): add nightly onboard trace timing summaries#5245

Merged
cv merged 22 commits into
mainfrom
5090-perfonboard-enable-nemoclaw-e2e-trace-timing-slack-summaries-in-gitlab-ci
Jun 13, 2026
Merged

feat(ci): add nightly onboard trace timing summaries#5245
cv merged 22 commits into
mainfrom
5090-perfonboard-enable-nemoclaw-e2e-trace-timing-slack-summaries-in-gitlab-ci

Conversation

@amata-human

@amata-human amata-human commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds nightly cloud-onboard-e2e trace collection, sanitized artifact upload, and Slack/GitHub scorecard timing summaries so maintainers can spot onboard timing regressions without exposing raw trace secrets.

Related Issue

Closes #5090

Changes

  • Enables NEMOCLAW_TRACE_DIR for nightly cloud-onboard-e2e and uploads sanitized trace artifacts as cloud-onboard-traces.
  • Adds compact Slack timing output with total onboard duration, prior-release comparison, and top matching phase changes.
  • Adds a detailed GitHub job summary phase timing table when current and baseline trace phase names match.
  • Adds trace artifact sanitization for sensitive-looking keys and common token formats before upload.
  • Documents nightly trace artifact paths, Slack webhook configuration, baseline behavior, and sanitization.
  • Adds workflow, Slack scorecard, and sanitizer tests.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Angel Mata amata@nvidia.com

Summary by CodeRabbit

  • New Features

    • Optional always-on artifact upload inputs for CI runs; nightly results now include trace timing, top phase changes, and a dedicated Trace section.
  • Documentation

    • Added E2E CI guidance covering trace collection, sanitized artifacts, scorecard reporting, and Slack expectations.
  • Security

    • Expanded trace sanitization with additional secret/value redaction and recursive artifact sanitization.
  • Tests

    • Added/updated tests for artifact uploads, trace-timing analysis, and Slack/GitHub summary output; removed some legacy redaction tests.

Signed-off-by: Angel Mata <amata@nvidia.com>
@amata-human amata-human linked an issue Jun 11, 2026 that may be closed by this pull request
2 tasks
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds recursive trace sanitization, always-on artifact inputs/uploads for E2E runs, a trace-timing analyzer that compares current vs prior nightly traces, scorecard integration to include traceTimingLine/traceSummaryLines, Slack block updates to render trace summaries, and tests/docs validating wiring and output.

Changes

Nightly Trace Timing and Slack Summaries

Layer / File(s) Summary
Trace sanitization and flush
src/lib/trace.ts, src/lib/trace.test.ts
Detects sensitive keys/values, applies regex-based value redaction, adds recursive sanitizer for nested objects/arrays, and sanitizes full trace artifacts before writing; tests adjusted accordingly.
Action inputs and reusable workflow wiring
.github/actions/run-e2e-script/action.yaml, .github/workflows/e2e-script.yaml
Add always-artifact-name / always-artifact-path inputs to the action and expose always_artifact_name / always_artifact_path in the reusable workflow; pass values through and adjust checkout sparse-checkout formatting.
Trace timing analysis module
scripts/scorecard/analyze-trace-timing.ts
New module selects onboard trace JSON, aggregates phase durations, resolves prior semver tag, finds the latest prior nightly run, downloads/parses its trace artifact, computes per-phase deltas and top changes, and returns markdown-ready traceTimingLine and traceSummaryLines with exported helpers/constants.
Nightly workflow scorecard integration
.github/workflows/nightly-e2e.yaml
Set NEMOCLAW_TRACE_DIR and always-upload cloud-onboard-traces; load and call the trace-timing analyzer from the scorecard generator; include traceTimingLine/traceSummaryLines in the generated scorecard markdown and scorecardData JSON.
Slack blocks, tests, and docs
scripts/scorecard/build-slack-blocks.ts, test/scorecard-blocks.test.ts, test/e2e-script-workflow.test.ts, test/e2e/README.md
ScorecardData gains traceTimingLine; buildBlocks conditionally appends a *Trace:* Slack section; tests assert trace rendering and workflow artifact wiring; README documents trace collection, sanitization, and Slack scorecard configuration.

Sequence Diagram(s)

sequenceDiagram
  participant ComponentA
  participant ComponentB
  ComponentA->>ComponentB: observable interaction
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

area: ci, feature, nightly-e2e, area: onboarding

Suggested reviewers

  • cv
  • jyaunches
  • prekshivyas

Poem

🐰 I hopped through traces, tidy and bright,
Redacting secrets before they take flight.
Phases counted, deltas neatly shown,
A Slack-bound carrot so teams are not alone.
Hoppy CI — timing carrots have grown!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.85% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding nightly onboard trace timing summaries to surface performance data in Slack and GitHub.
Linked Issues check ✅ Passed All PR changes comprehensively address the #5090 requirements: enabling NEMOCLAW_TRACE_DIR tracing, uploading sanitized trace artifacts, generating timing summaries with historical comparison, posting to Slack, documenting configuration, and ensuring secrets are redacted.
Out of Scope Changes check ✅ Passed All changes remain focused on trace timing visibility. No out-of-scope changes detected such as tracing framework modifications, orchestration changes, timeout tuning, or unrelated refactoring.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 5090-perfonboard-enable-nemoclaw-e2e-trace-timing-slack-summaries-in-gitlab-ci

Comment @coderabbitai help to get the list of available commands and usage tips.

@amata-human amata-human added the v0.0.64 Release target label Jun 11, 2026
@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: None
Optional Vitest E2E scenarios: None

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Top item: PR review advisor unavailable

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • PR review advisor unavailable: The automated advisor could not complete: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt
    • Recommendation: Re-run the PR Review Advisor or perform a manual review.
    • Evidence: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Add or identify targeted runtime/integration validation for the changed behavior; do not report external E2E job pass/fail here.. Runtime/sandbox/infrastructure paths need behavioral runtime validation: .github/actions/run-e2e-script/action.yaml, .github/actions/run-e2e-script/sanitize-trace-artifacts.py, .github/workflows/e2e-script.yaml, .github/workflows/nightly-e2e.yaml, scripts/scorecard/analyze-trace-timing.ts, scripts/scorecard/build-slack-blocks.ts, src/lib/trace.ts.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/ci/sanitize-trace-artifacts.js (1)

1-104: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: File must use TypeScript extension per CI guardrails.

The pipeline failure indicates this file violates the codebase growth guardrails policy. The project requires using .ts instead of .js/.cjs/.mjs for new Node.js code. Rename this file to sanitize-trace-artifacts.ts and add appropriate TypeScript type annotations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ci/sanitize-trace-artifacts.js` around lines 1 - 104, Rename the file
to sanitize-trace-artifacts.ts and convert it from CommonJS to TypeScript:
replace require(...) with TypeScript imports (e.g. import * as fs from
"node:fs"; import * as path from "node:path"), add explicit type annotations for
functions and variables (e.g. redactString(value: string): string,
sanitize(value: unknown, key = ""): unknown, listJsonFiles(directory: string):
string[], sanitizeTraceArtifacts(sourceDirectory: string, outputDirectory:
string): { files: number; outputDirectory: string }), cast JSON.parse results to
unknown/any before sanitizing, and update exports to ESNamed exports (export {
REDACTED, sanitize, sanitizeTraceArtifacts }); keep the CLI block but ensure
TypeScript accepts require.main by either using if (require.main === module)
with a top-level declare const require: any; or convert to an import.meta.url
check, and adjust process.argv typing as string[] so the script compiles under
tsconfig.

Source: Pipeline failures

🧹 Nitpick comments (6)
scripts/ci/sanitize-trace-artifacts.js (3)

73-73: ⚡ Quick win

Add error handling for malformed JSON.

The JSON.parse call will throw if a .json file contains invalid JSON. Consider wrapping this in a try-catch to provide a more informative error message that includes the filename.

🛡️ Proposed fix to add error handling
-    const parsed = JSON.parse(fs.readFileSync(file, "utf8"));
+    let parsed;
+    try {
+      parsed = JSON.parse(fs.readFileSync(file, "utf8"));
+    } catch (error) {
+      throw new Error(`Failed to parse JSON from ${file}: ${error.message}`);
+    }
     const sanitized = sanitize(parsed);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ci/sanitize-trace-artifacts.js` at line 73, The JSON.parse call that
creates the local variable parsed should be wrapped in a try-catch to handle
malformed JSON; locate the line using JSON.parse(fs.readFileSync(file, "utf8"))
and surround it with a try block, catch the thrown error, and produce an
informative message that includes the filename (file) and the parser error
(e.g., using processLogger.error or console.error), then either rethrow or
skip/continue depending on sanitizeTraceArtifacts' desired behavior; ensure the
catch preserves stack/error details for debugging.

69-71: 💤 Low value

Path validation is correct but document the security contract.

The path safety check correctly rejects .. (directory traversal), absolute paths, and paths that would escape the source root. This prevents writing sanitized output to arbitrary filesystem locations.

Consider adding a comment explaining the security rationale:

// Security: Reject paths that escape sourceRoot via "..", absolute paths,
// or would resolve outside the intended output directory.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ci/sanitize-trace-artifacts.js` around lines 69 - 71, Add a short
security comment above the path-safety check that documents the contract:
explain that the check using relativePath.startsWith("..") and
path.isAbsolute(relativePath) (and the thrown Error referencing file)
intentionally rejects directory traversal and absolute paths to prevent
sanitized traces from being written outside the intended output/source root;
keep it concise and mention the expected guarantee that resolved paths will
remain inside the intended output directory.

10-18: ⚖️ Poor tradeoff

Verify completeness of sensitive value patterns.

The current patterns cover common token formats (Bearer, Slack, NVIDIA API, GitHub), but consider whether additional patterns are needed:

  • AWS credentials (AKIA..., secret keys)
  • Azure tokens
  • Generic JWT patterns
  • Other cloud provider tokens that might appear in traces

Would you like me to search the codebase for additional credential patterns that should be included?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ci/sanitize-trace-artifacts.js` around lines 10 - 18,
SENSITIVE_VALUE_RES currently misses several common credential/token formats;
update the array used alongside SENSITIVE_KEY_RE to add regexes for AWS access
keys (AKIA/ASIA followed by 16 chars) and AWS secret access key-like base64
strings, generic JWT-looking tokens (three dot-separated Base64URL segments),
Azure/AD tokens (eyJ0eXAi... plus long Base64URL), and other common cloud
prefixes (e.g., GCP service account keys, long hex API keys); modify the
SENSITIVE_VALUE_RES constant to include these additional regex patterns and
ensure they are case-insensitive or global where appropriate so the sanitizer
will match and redact these tokens during trace processing (refer to
SENSITIVE_VALUE_RES and SENSITIVE_KEY_RE to locate and update the patterns).
.github/actions/run-e2e-script/action.yaml (1)

87-87: Document CI execution/build strategy for sanitize-trace-artifacts (future TS migration)

  • Right now .github/actions/run-e2e-script/action.yaml runs scripts/ci/sanitize-trace-artifacts.js via node (and the e2e workflow/test reference the same .js), so a TypeScript rename would require updating the action/workflow/tests to either use the repo’s TS runner (e.g., tsx) or rely on a compiled .js output.
  • Add a short note describing the intended CI run/compile strategy to prevent breakage during a future .js.ts migration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/run-e2e-script/action.yaml at line 87, Add a short
documentation note in the repository explaining how CI will run the
sanitize-trace-artifacts script so a future .js→.ts rename won't break the
workflow: reference the action invocation that currently runs node
"$GITHUB_ACTION_PATH/../../../scripts/ci/sanitize-trace-artifacts.js" and state
the chosen strategy (either keep calling a compiled .js artifact produced by CI
builds or switch the action/workflow/tests to use a TS runner such as tsx in
place of node), and include instructions for updating
.github/actions/run-e2e-script/action.yaml and any workflows/tests that call
sanitize-trace-artifacts.js when the migration happens.
.github/workflows/nightly-e2e.yaml (2)

633-2925: 🏗️ Heavy lift

Consider extracting the trace timing logic to a separate script.

The 2,292-line inline JavaScript block makes this workflow file difficult to maintain and test. The trace analysis logic (semver parsing, artifact download, phase extraction, delta computation) is complex enough to warrant extraction to a dedicated script file (e.g., scripts/scorecard/analyze-trace-timing.js or scripts/ci/trace-timing-report.js).

Benefits:

  • Easier to unit test the logic directly (similar to test/sanitize-trace-artifacts.test.ts)
  • Better code organization and reusability
  • Workflow file remains focused on orchestration
  • Easier to review and maintain the analysis logic

The current inline approach works but creates a maintenance burden as trace analysis requirements evolve.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-e2e.yaml around lines 633 - 2925, The scorecard
job embeds a very large inline JS trace analysis block (buildTraceTimingResult,
readTraceSummaryFromRun, resolvePriorReleaseTag, selectOnboardTrace,
extractPhaseDurations, formatTraceDelta, etc.), which makes the workflow hard to
maintain; move that logic into a dedicated script (e.g.,
scripts/scorecard/analyze-trace-timing.js) and have the workflow call the script
from the Generate nightly scorecard step (invoke node or run the script via
actions/github-script by loading its exported functions), preserving the
existing function names/behaviour and error handling so the scorecard step still
returns the same traceTimingLine and traceSummaryLines outputs.

2641-2652: 💤 Low value

Phase order knowledge is duplicated.

The ONBOARD_PHASE_ORDER array hardcodes the sequence of onboarding phases. If this ordering logic exists elsewhere in the product code (e.g., in the trace generation code), consider documenting the canonical source or extracting this to a shared constant file that both the tracer and the CI script can reference.

This prevents drift if phase names or ordering change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-e2e.yaml around lines 2641 - 2652, Replace the
hardcoded ONBOARD_PHASE_ORDER array with a single source of truth: import the
canonical phase ordering constant (e.g., export named constant like
onboardPhaseOrder or ONBOARD_PHASE_ORDER) from the product code that generates
onboarding traces; if that constant doesn't exist yet, add an exported constant
in the product module used for trace generation and update both the tracer code
and this CI workflow to import it, or at minimum add an inline comment pointing
to the canonical definition (trace generation module/function) so the ordering
isn't duplicated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/e2e-script.yaml:
- Around line 145-147: The sparse-checkout currently includes the sanitizer
script entry 'scripts/ci/sanitize-trace-artifacts.js' and the action directory
'.github/actions/run-e2e-script'; when the sanitizer is converted to TypeScript,
update this sparse-checkout entry to point to the correct executable artifact
(for example the compiled JS output under scripts/ci/dist or the TypeScript
source and adjust the execution step to use ts-node) so the sanitize step can
find and run the sanitizer; ensure the workflow step that invokes the sanitizer
(the sanitize step) is also updated to run the chosen target.

In `@test/sanitize-trace-artifacts.test.ts`:
- Around line 18-20: Update the test import to match the sanitizer's
TypeScript/ESM conversion: replace the createRequire(...) + require(...) usage
with a direct ESM import like `import { REDACTED, sanitizeTraceArtifacts } from
'../scripts/ci/sanitize-trace-artifacts.js'` (keep the .js extension even though
the source is .ts); if you must retain CommonJS, ensure the require path uses
the .js extension and adjust module interop accordingly for createRequire,
referencing the symbols REDACTED and sanitizeTraceArtifacts.

---

Outside diff comments:
In `@scripts/ci/sanitize-trace-artifacts.js`:
- Around line 1-104: Rename the file to sanitize-trace-artifacts.ts and convert
it from CommonJS to TypeScript: replace require(...) with TypeScript imports
(e.g. import * as fs from "node:fs"; import * as path from "node:path"), add
explicit type annotations for functions and variables (e.g. redactString(value:
string): string, sanitize(value: unknown, key = ""): unknown,
listJsonFiles(directory: string): string[],
sanitizeTraceArtifacts(sourceDirectory: string, outputDirectory: string): {
files: number; outputDirectory: string }), cast JSON.parse results to
unknown/any before sanitizing, and update exports to ESNamed exports (export {
REDACTED, sanitize, sanitizeTraceArtifacts }); keep the CLI block but ensure
TypeScript accepts require.main by either using if (require.main === module)
with a top-level declare const require: any; or convert to an import.meta.url
check, and adjust process.argv typing as string[] so the script compiles under
tsconfig.

---

Nitpick comments:
In @.github/actions/run-e2e-script/action.yaml:
- Line 87: Add a short documentation note in the repository explaining how CI
will run the sanitize-trace-artifacts script so a future .js→.ts rename won't
break the workflow: reference the action invocation that currently runs node
"$GITHUB_ACTION_PATH/../../../scripts/ci/sanitize-trace-artifacts.js" and state
the chosen strategy (either keep calling a compiled .js artifact produced by CI
builds or switch the action/workflow/tests to use a TS runner such as tsx in
place of node), and include instructions for updating
.github/actions/run-e2e-script/action.yaml and any workflows/tests that call
sanitize-trace-artifacts.js when the migration happens.

In @.github/workflows/nightly-e2e.yaml:
- Around line 633-2925: The scorecard job embeds a very large inline JS trace
analysis block (buildTraceTimingResult, readTraceSummaryFromRun,
resolvePriorReleaseTag, selectOnboardTrace, extractPhaseDurations,
formatTraceDelta, etc.), which makes the workflow hard to maintain; move that
logic into a dedicated script (e.g., scripts/scorecard/analyze-trace-timing.js)
and have the workflow call the script from the Generate nightly scorecard step
(invoke node or run the script via actions/github-script by loading its exported
functions), preserving the existing function names/behaviour and error handling
so the scorecard step still returns the same traceTimingLine and
traceSummaryLines outputs.
- Around line 2641-2652: Replace the hardcoded ONBOARD_PHASE_ORDER array with a
single source of truth: import the canonical phase ordering constant (e.g.,
export named constant like onboardPhaseOrder or ONBOARD_PHASE_ORDER) from the
product code that generates onboarding traces; if that constant doesn't exist
yet, add an exported constant in the product module used for trace generation
and update both the tracer code and this CI workflow to import it, or at minimum
add an inline comment pointing to the canonical definition (trace generation
module/function) so the ordering isn't duplicated.

In `@scripts/ci/sanitize-trace-artifacts.js`:
- Line 73: The JSON.parse call that creates the local variable parsed should be
wrapped in a try-catch to handle malformed JSON; locate the line using
JSON.parse(fs.readFileSync(file, "utf8")) and surround it with a try block,
catch the thrown error, and produce an informative message that includes the
filename (file) and the parser error (e.g., using processLogger.error or
console.error), then either rethrow or skip/continue depending on
sanitizeTraceArtifacts' desired behavior; ensure the catch preserves stack/error
details for debugging.
- Around line 69-71: Add a short security comment above the path-safety check
that documents the contract: explain that the check using
relativePath.startsWith("..") and path.isAbsolute(relativePath) (and the thrown
Error referencing file) intentionally rejects directory traversal and absolute
paths to prevent sanitized traces from being written outside the intended
output/source root; keep it concise and mention the expected guarantee that
resolved paths will remain inside the intended output directory.
- Around line 10-18: SENSITIVE_VALUE_RES currently misses several common
credential/token formats; update the array used alongside SENSITIVE_KEY_RE to
add regexes for AWS access keys (AKIA/ASIA followed by 16 chars) and AWS secret
access key-like base64 strings, generic JWT-looking tokens (three dot-separated
Base64URL segments), Azure/AD tokens (eyJ0eXAi... plus long Base64URL), and
other common cloud prefixes (e.g., GCP service account keys, long hex API keys);
modify the SENSITIVE_VALUE_RES constant to include these additional regex
patterns and ensure they are case-insensitive or global where appropriate so the
sanitizer will match and redact these tokens during trace processing (refer to
SENSITIVE_VALUE_RES and SENSITIVE_KEY_RE to locate and update the patterns).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bc202847-24b8-4ce2-bb33-5e23d0b0466b

📥 Commits

Reviewing files that changed from the base of the PR and between 33d1d08 and 4c2498b.

📒 Files selected for processing (10)
  • .github/actions/run-e2e-script/action.yaml
  • .github/workflows/e2e-script.yaml
  • .github/workflows/nightly-e2e.yaml
  • scripts/ci/sanitize-trace-artifacts.js
  • scripts/scorecard/build-slack-blocks.ts
  • test/e2e-script-workflow.test.ts
  • test/e2e/README.md
  • test/fixtures/sensitive-trace-artifact.json
  • test/sanitize-trace-artifacts.test.ts
  • test/scorecard-blocks.test.ts

Comment thread .github/workflows/e2e-script.yaml Outdated
Comment on lines +145 to +147
sparse-checkout: |
.github/actions/run-e2e-script
scripts/ci/sanitize-trace-artifacts.js

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Sparse checkout includes sanitizer script as required.

The sparse-checkout list correctly includes both the action directory and the sanitizer script path, ensuring the sanitizer is available when the sanitize step runs. This matches the coding guideline contract.

When the sanitizer is converted to TypeScript (per the pipeline failure), update this path to reference the compiled output or the TypeScript source with appropriate execution strategy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-script.yaml around lines 145 - 147, The
sparse-checkout currently includes the sanitizer script entry
'scripts/ci/sanitize-trace-artifacts.js' and the action directory
'.github/actions/run-e2e-script'; when the sanitizer is converted to TypeScript,
update this sparse-checkout entry to point to the correct executable artifact
(for example the compiled JS output under scripts/ci/dist or the TypeScript
source and adjust the execution step to use ts-node) so the sanitize step can
find and run the sanitizer; ensure the workflow step that invokes the sanitizer
(the sanitize step) is also updated to run the chosen target.

Sources: Coding guidelines, Pipeline failures

Comment thread test/sanitize-trace-artifacts.test.ts Outdated
@cv

cv commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

@amata-human mind addressing the PR feedback comments and CI failures, please?

Signed-off-by: Angel Mata <amata@nvidia.com>
Comment thread scripts/scorecard/analyze-trace-timing.ts Fixed
…le-nemoclaw-e2e-trace-timing-slack-summaries-in-gitlab-ci
Signed-off-by: Angel Mata <amata@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

2772-2823: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Remove the legacy prior-day Trend: block from the scorecard.

This workflow still computes and emits the older prior-day trend comparison, so the scorecard now carries two different baselines: the old day-over-day Trend: line and the new prior-release trace timing analysis. That makes the summary internally inconsistent and keeps pushing stale data into trendLine consumers.

As per coding guidelines, "Remove/avoid older “Trend” comparison logic inside scorecard and rely on the trace-timing analyzer output instead."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-e2e.yaml around lines 2772 - 2823, Remove the
entire legacy prior-day "Trend:" computation: delete the else branch that
contains the try/catch and loop (the block which populates priorRuns and sets
trendLine based on priorRun.conclusion) and keep only the selective-dispatch
short-circuit; instead initialize trendLine to an empty string (or leave it to
the trace-timing analyzer) and do not reference WORKFLOW_FILE/priorRuns or
e.message anywhere—i.e., remove the code around the symbols trendLine,
isSelectiveDispatch, priorRuns, WORKFLOW_FILE and the try/catch so the scorecard
relies solely on the new trace-timing analyzer output.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 2520-2527: The ternary that sets status incorrectly labels runs
with both passed and cancelled jobs as "✅ All requested jobs passed"; update the
conditional logic around the status assignment (the block that computes status
using failed, missingRequested, cancelled, passed, skipped) to detect mixed
outcomes—specifically ensure a branch like "cancelled.length > 0 &&
passed.length > 0" (or a combined condition checking cancelled.length > 0 &&
passed.length > 0 && failed.length === 0) returns a different headline (e.g.,
"⚠️ Some jobs cancelled — partial pass") before the final "All requested jobs
passed" case so mixed pass/cancel runs are not reported as all passed.

---

Outside diff comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 2772-2823: Remove the entire legacy prior-day "Trend:"
computation: delete the else branch that contains the try/catch and loop (the
block which populates priorRuns and sets trendLine based on priorRun.conclusion)
and keep only the selective-dispatch short-circuit; instead initialize trendLine
to an empty string (or leave it to the trace-timing analyzer) and do not
reference WORKFLOW_FILE/priorRuns or e.message anywhere—i.e., remove the code
around the symbols trendLine, isSelectiveDispatch, priorRuns, WORKFLOW_FILE and
the try/catch so the scorecard relies solely on the new trace-timing analyzer
output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e021c607-f78e-41e7-9aaa-52096d6adfb3

📥 Commits

Reviewing files that changed from the base of the PR and between dd7e42f and c579e6c.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/scorecard/analyze-trace-timing.ts
  • test/e2e-script-workflow.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e-script-workflow.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/nightly-e2e.yaml (1)

2772-2823: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Remove the legacy prior-day Trend: block from the scorecard.

This workflow still computes and emits the older prior-day trend comparison, so the scorecard now carries two different baselines: the old day-over-day Trend: line and the new prior-release trace timing analysis. That makes the summary internally inconsistent and keeps pushing stale data into trendLine consumers.

As per coding guidelines, "Remove/avoid older “Trend” comparison logic inside scorecard and rely on the trace-timing analyzer output instead."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-e2e.yaml around lines 2772 - 2823, Remove the
entire legacy prior-day "Trend:" computation: delete the else branch that
contains the try/catch and loop (the block which populates priorRuns and sets
trendLine based on priorRun.conclusion) and keep only the selective-dispatch
short-circuit; instead initialize trendLine to an empty string (or leave it to
the trace-timing analyzer) and do not reference WORKFLOW_FILE/priorRuns or
e.message anywhere—i.e., remove the code around the symbols trendLine,
isSelectiveDispatch, priorRuns, WORKFLOW_FILE and the try/catch so the scorecard
relies solely on the new trace-timing analyzer output.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 2520-2527: The ternary that sets status incorrectly labels runs
with both passed and cancelled jobs as "✅ All requested jobs passed"; update the
conditional logic around the status assignment (the block that computes status
using failed, missingRequested, cancelled, passed, skipped) to detect mixed
outcomes—specifically ensure a branch like "cancelled.length > 0 &&
passed.length > 0" (or a combined condition checking cancelled.length > 0 &&
passed.length > 0 && failed.length === 0) returns a different headline (e.g.,
"⚠️ Some jobs cancelled — partial pass") before the final "All requested jobs
passed" case so mixed pass/cancel runs are not reported as all passed.

---

Outside diff comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 2772-2823: Remove the entire legacy prior-day "Trend:"
computation: delete the else branch that contains the try/catch and loop (the
block which populates priorRuns and sets trendLine based on priorRun.conclusion)
and keep only the selective-dispatch short-circuit; instead initialize trendLine
to an empty string (or leave it to the trace-timing analyzer) and do not
reference WORKFLOW_FILE/priorRuns or e.message anywhere—i.e., remove the code
around the symbols trendLine, isSelectiveDispatch, priorRuns, WORKFLOW_FILE and
the try/catch so the scorecard relies solely on the new trace-timing analyzer
output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e021c607-f78e-41e7-9aaa-52096d6adfb3

📥 Commits

Reviewing files that changed from the base of the PR and between dd7e42f and c579e6c.

📒 Files selected for processing (3)
  • .github/workflows/nightly-e2e.yaml
  • scripts/scorecard/analyze-trace-timing.ts
  • test/e2e-script-workflow.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e-script-workflow.test.ts
🛑 Comments failed to post (1)
.github/workflows/nightly-e2e.yaml (1)

2520-2527: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't label mixed pass/cancel runs as “all passed.”

If one requested job passes and another is cancelled, this branch still emits ✅ All requested jobs passed. That headline contradicts the summary you added on Line 2536 and hides the fact that the run produced incomplete signal.

Suggested adjustment
             const status =
               failed.length > 0 || missingRequested.length > 0
                 ? '❌ Some jobs failed'
-                : cancelled.length > 0 && passed.length === 0
-                  ? '⚠️ Run cancelled — no signal'
+                : cancelled.length > 0
+                  ? passed.length === 0
+                    ? '⚠️ Run cancelled — no signal'
+                    : '⚠️ Some jobs were cancelled'
                   : skipped.length > 0 && passed.length === 0
                     ? '⚠️ No requested jobs ran'
                     : '✅ All requested jobs passed';

Also applies to: 2536-2536

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/nightly-e2e.yaml around lines 2520 - 2527, The ternary
that sets status incorrectly labels runs with both passed and cancelled jobs as
"✅ All requested jobs passed"; update the conditional logic around the status
assignment (the block that computes status using failed, missingRequested,
cancelled, passed, skipped) to detect mixed outcomes—specifically ensure a
branch like "cancelled.length > 0 && passed.length > 0" (or a combined condition
checking cancelled.length > 0 && passed.length > 0 && failed.length === 0)
returns a different headline (e.g., "⚠️ Some jobs cancelled — partial pass")
before the final "All requested jobs passed" case so mixed pass/cancel runs are
not reported as all passed.

Signed-off-by: Angel Mata <amata@nvidia.com>
@github-actions

This comment was marked as outdated.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27374885261
Target ref: 5090-perfonboard-enable-nemoclaw-e2e-trace-timing-slack-summaries-in-gitlab-ci
Requested jobs: cloud-onboard-e2e
Summary: 1 passed, 0 failed, 0 cancelled, 0 skipped

Job Result
cloud-onboard-e2e ✅ success

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27377814019
Target ref: 5090-perfonboard-enable-nemoclaw-e2e-trace-timing-slack-summaries-in-gitlab-ci
Requested jobs: cloud-onboard-e2e
Summary: 1 passed, 0 failed, 0 cancelled, 0 skipped

Job Result
cloud-onboard-e2e ✅ success

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@cv cv added v0.0.65 Release target and removed v0.0.64 Release target labels Jun 12, 2026
@wscurran wscurran added area: ci CI workflows, checks, release automation, or GitHub Actions area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow feature PR adds or expands user-visible functionality labels Jun 12, 2026
@wscurran

Copy link
Copy Markdown
Contributor

jyaunches and others added 7 commits June 12, 2026 13:22
Only upload trace artifacts after trusted sanitization succeeds.
Guard against target-created output symlinks.

Signed-off-by: Angel Mata <amata@nvidia.com>
Limit trace timing uploads and parsing to trusted summaries.
Add regression coverage for trace redaction behavior.

Signed-off-by: Angel Mata <amata@nvidia.com>
Upload only sanitizer-created trace summary files.
Cover trace timing fallback and symlink cases.

Signed-off-by: Angel Mata <amata@nvidia.com>
Comment thread test/e2e-script-workflow.test.ts Fixed
amata-human and others added 4 commits June 12, 2026 16:18
Terminate target-ref background processes before trusted trace upload.

Signed-off-by: Angel Mata <amata@nvidia.com>
…ort, function or class'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv merged commit b0a1a5f into main Jun 13, 2026
32 checks passed
@cv cv deleted the 5090-perfonboard-enable-nemoclaw-e2e-trace-timing-slack-summaries-in-gitlab-ci branch June 13, 2026 05:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: ci CI workflows, checks, release automation, or GitHub Actions area: onboarding Onboarding FSM, provider setup, sandbox launch, or first-run flow feature PR adds or expands user-visible functionality v0.0.65 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(onboard): Enable NemoClaw E2E trace timing Slack summaries in Github CI

5 participants