
Conversation


@nv-nmailhot nv-nmailhot commented Nov 21, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive build telemetry system capturing build duration, image size, and performance metrics.
    • Implemented layer-level metrics collection from Docker images.
    • Introduced automated build log parsing to track cached layers, warnings, errors, and pull operations.
    • Added artifact uploads for build metrics, layer metrics, and build logs.
  • Chores

    • Enhanced build output with improved formatting and structured sections for better visibility.

✏️ Tip: You can customize this high-level summary in your review settings.

@nv-nmailhot nv-nmailhot requested review from a team as code owners November 21, 2025 01:05
@github-actions github-actions bot added the feat label Nov 21, 2025

coderabbitai bot commented Nov 21, 2025

Walkthrough

This pull request introduces end-to-end build telemetry and metrics collection infrastructure across Docker build workflows. It adds layer-level metrics collection, build logging with statistics extraction, artifact uploads, and extends the metrics uploader to handle per-layer data alongside existing job-level metrics.

Changes

  • Build Telemetry & Logging (.github/actions/docker-build/action.yml): Introduces comprehensive observability: build start/end timestamps, per-run log files, command logging via tee, and a new "Capture Build Metrics" step that parses logs to extract cached layers, build steps, cache hit rate, warnings, errors, and pull operations. Adds a "Collect Layer Metrics" step wrapping a Python-based layer collector. Uploads artifacts for metrics and logs with conditional handling.
  • Layer Metrics Collection (.github/scripts/collect_layer_metrics.py): New Python script to gather Docker image layer-level metrics via docker history. Includes helper functions run_command, get_image_layers, parse_size_to_bytes, and get_layer_cache_info. The main collect_layer_metrics function orchestrates retrieval, parsing, cache annotation, and size aggregation, then exports a metrics JSON with per-layer details (ID, size, creation date, cache status) and collection metadata. A sketch of this flow follows this list.
  • Workflow Artifact Downloads (.github/workflows/container-validation-backends.yml): Adds two download-artifact steps to fetch layer-metrics and build-logs from prior jobs using wildcard patterns and merge-multiple mode. Propagates the LAYER_INDEX secret to the metrics upload step in both the container-validation and upload-metrics blocks.
  • Metrics Schema & Layer Upload Integration (.github/workflows/upload_complete_workflow_metrics.py): Extends field constants for layer metrics (cached_layers, total_build_steps, cache_hit_rate_percent, build_warnings, build_errors, pull_operations) and layer-specific data (ID, size, creation_by, cache_status, comment, total_layers, image_tag). Adds an _upload_layer_metrics method to WorkflowMetricsUploader to locate, parse, and upload per-layer JSON payloads. Initializes layer_index from the LAYER_INDEX environment variable and gates the upload on its presence. Wires the layer metrics upload into the _upload_single_job_metrics post-test workflow.
  • Build Script Output & Logging (container/build.sh): Expands show_image_options with structured, sectioned output: Docker build configuration, image details, TensorRT-LLM configuration (when applicable), sccache settings, and cache options, with emojis and clear labels. Adds early startup logs ("DYNAMO BUILD SCRIPT STARTING"). Introduces per-stage build logging (base-image-build.log, framework-*-build.log) with statistics extraction (cached layers, total steps, cache hit rate, layers pulled). Adds header dividers, architecture detection messages, and status confirmations throughout the build flow.
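
As context for the Layer Metrics Collection entry above, here is a minimal sketch of how such a collector can work. The helper names (run_command, get_image_layers, parse_size_to_bytes) come from the summary, but their exact signatures and output schema are not visible in this review, so everything below is an assumption: it reads docker history in JSON format and treats size strings as decimal units.

```python
import json
import re
import subprocess
import sys
from typing import Dict, List


def run_command(cmd: List[str]) -> str:
    """Run a command and return stdout, or an empty string on failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=60
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
        print(f"Command failed: {' '.join(cmd)}: {exc}", file=sys.stderr)
        return ""


def parse_size_to_bytes(size_str: str) -> int:
    """Convert a Docker size string such as '1.2GB' or '10kB' to bytes (decimal units assumed)."""
    size_str = size_str.strip().upper()
    if size_str in ("", "0", "0B"):
        return 0
    match = re.match(r"^([\d.]+)\s*([KMGT]?B)$", size_str)
    if not match:
        return 0
    value, unit = float(match.group(1)), match.group(2)
    multipliers = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    return int(value * multipliers[unit])


def get_image_layers(image_tag: str) -> List[Dict[str, str]]:
    """Return one dict per layer from docker history (newest layer first)."""
    output = run_command(
        ["docker", "history", "--no-trunc", "--format", "{{json .}}", image_tag]
    )
    return [json.loads(line) for line in output.splitlines() if line.strip()]


if __name__ == "__main__":
    layers = get_image_layers(sys.argv[1])
    total_bytes = sum(parse_size_to_bytes(layer.get("Size", "0B")) for layer in layers)
    print(json.dumps({"total_layers": len(layers), "total_size_bytes": total_bytes}, indent=2))
```

The script in the PR additionally annotates cache status from the build log and writes per-layer records with creation metadata; those details are omitted here.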

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Metrics field consistency: Verify all new field constants across upload_complete_workflow_metrics.py are correctly mapped from source metrics objects and consistently named.
  • Layer metrics parsing robustness: Review .github/scripts/collect_layer_metrics.py for edge cases in JSON parsing, size string conversion (e.g., malformed sizes, non-existent images), and timeout handling.
  • Workflow integration points: Confirm LAYER_INDEX secret is properly gated and that download-artifact merge-multiple and conditional uploads work as intended.
  • Build script log capture: Verify log file paths, tee piping, and statistics extraction regex patterns handle both single-stage and multi-stage builds correctly (an illustrative sketch of this kind of statistics extraction follows this list).
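
For the last checklist item, the statistics extraction in container/build.sh is done in shell against the captured build log, and its actual regex patterns are not visible in this review. The Python sketch below only illustrates the arithmetic a reviewer would be checking (cached steps versus total steps, and the derived cache hit rate), assuming BuildKit-style "#N ..." / "#N CACHED" log lines; it is not the PR's implementation.

```python
import re
from typing import Dict, Union


def extract_build_stats(log_text: str) -> Dict[str, Union[int, float]]:
    """Derive simple statistics from a build log (BuildKit plain-progress format assumed)."""
    # Each BuildKit vertex is prefixed "#<n>"; a cached vertex prints "#<n> CACHED".
    step_ids = set(re.findall(r"^#(\d+) ", log_text, flags=re.MULTILINE))
    cached_ids = set(re.findall(r"^#(\d+) CACHED\b", log_text, flags=re.MULTILINE))
    total = len(step_ids)
    return {
        "total_build_steps": total,
        "cached_layers": len(cached_ids),
        "cache_hit_rate_percent": round(100.0 * len(cached_ids) / total, 2) if total else 0.0,
        # The warning/error/pull patterns below are placeholders, not the PR's patterns.
        "build_warnings": len(re.findall(r"\bWARNING\b", log_text, flags=re.IGNORECASE)),
        "build_errors": len(re.findall(r"\bERROR\b", log_text, flags=re.IGNORECASE)),
        "pull_operations": len(re.findall(r"\bpull", log_text, flags=re.IGNORECASE)),
    }


if __name__ == "__main__":
    sample = (
        "#1 [internal] load build definition\n"
        "#2 [base 1/4] FROM docker.io/library/ubuntu:24.04\n"
        "#2 CACHED\n"
        "#3 [base 2/4] RUN apt-get update\n"
    )
    print(extract_build_stats(sample))
```

The field names mirror the constants added in upload_complete_workflow_metrics.py; whether the shell script computes them the same way is exactly what this checklist item asks reviewers to verify.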

Poem

🐰 A tale of metrics bright,
Layer by layer, logs take flight,
Build telemetry, oh what a sight!
Artifacts dance through GitHub's night, ⚡📊

Pre-merge checks

❌ Failed checks (1 warning)
Description check: ⚠️ Warning. The pull request description contains only template placeholders with no actual content describing the changes, objectives, or implementation details. Resolution: fill in the Overview, Details, and "Where should the reviewer start" sections with specific information about the changes made, and replace the placeholder issue reference with the actual related GitHub issue number.
✅ Passed checks (2 passed)
Title check: ✅ Passed. The title 'feat: add container layer metrics' clearly and specifically describes the main change, introducing layer metrics collection for containers.
Docstring Coverage: ✅ Passed. Docstring coverage is 80.00%, which meets the required threshold of 80.00%.

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
.github/scripts/collect_layer_metrics.py (1)

78-106: Move import statement to module level.

The re module is imported inside the function, which is inefficient if the function is called multiple times. Move the import to the top of the file with other imports.

Apply this diff:

 import json
 import subprocess
 import sys
+import re
 from datetime import datetime, timezone
 from typing import Any, Dict, List, Optional
 
 ...
 
 def parse_size_to_bytes(size_str: str) -> int:
     """
     Convert Docker size string (e.g., "1.2GB", "500MB", "10kB") to bytes.
     """
     size_str = size_str.strip().upper()
     
     if size_str == "0B" or size_str == "0":
         return 0
     
     # Extract number and unit
-    import re
     match = re.match(r"^([\d.]+)\s*([KMGT]?B)$", size_str)
container/build.sh (1)

529-529: Use quoted parameter expansion for logging.

Shellcheck warns about mixing string and array with $@. While this is safe in the echo context, it's better to use quoted forms for consistency.

Apply this diff:

-echo "Arguments: $@"
+echo "Arguments: $*"

.github/workflows/upload_complete_workflow_metrics.py (2)

1199-1206: Improve exception handling specificity.

The code catches blind Exception which makes debugging difficult. Consider catching more specific exceptions or at least logging the exception type and traceback.

Apply this diff:

                     try:
                         self.post_to_db(layer_index, layer_metric)
                         print(
                             f"✅ Uploaded layer {layer_idx}: {layer.get('size_human', '0B')}"
                         )
                         total_layers_processed += 1
-                    except Exception as e:
+                    except (requests.exceptions.RequestException, ValueError) as e:
                         print(f"❌ Failed to upload layer {layer_idx}: {e}")

Or keep Exception but add traceback logging for debugging:

+import traceback
+
                     try:
                         self.post_to_db(layer_index, layer_metric)
                         print(
                             f"✅ Uploaded layer {layer_idx}: {layer.get('size_human', '0B')}"
                         )
                         total_layers_processed += 1
                     except Exception as e:
                         print(f"❌ Failed to upload layer {layer_idx}: {e}")
+                        print(f"   Traceback: {traceback.format_exc()}", file=sys.stderr)

1208-1209: Improve exception handling specificity.

Similar to the previous comment, catching blind Exception makes debugging difficult when processing layer metrics files fails.

Apply this diff:

+import traceback
+
             except Exception as e:
                 print(f"❌ Failed to process layer metrics file {layer_file}: {e}")
+                print(f"   Traceback: {traceback.format_exc()}", file=sys.stderr)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e24123 and 670d87b.

📒 Files selected for processing (5)
  • .github/actions/docker-build/action.yml (4 hunks)
  • .github/scripts/collect_layer_metrics.py (1 hunks)
  • .github/workflows/container-validation-backends.yml (2 hunks)
  • .github/workflows/upload_complete_workflow_metrics.py (6 hunks)
  • container/build.sh (6 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • container/build.sh
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • container/build.sh
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • container/build.sh
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4523/merge) by nv-nmailhot.
.github/scripts/collect_layer_metrics.py

[error] 1-1: Black formatting check failed. reformatted this file by the hook (pre-commit).

container/build.sh

[error] 1-1: Trailing whitespace found and fixed by pre-commit hook.

.github/actions/docker-build/action.yml

[error] 1-1: Trailing whitespace found and fixed by pre-commit hook.

.github/workflows/container-validation-backends.yml

[error] 1-1: Command failed with exit code 1 during pre-commit checks: pre-commit run --show-diff-on-failure --color=always --all-files

.github/workflows/upload_complete_workflow_metrics.py

[error] 1-1: Black formatting check failed. reformatted this file by the hook (pre-commit).

🪛 Ruff (0.14.5)
.github/scripts/collect_layer_metrics.py

19-19: subprocess call: check for execution of untrusted input

(S603)

.github/workflows/upload_complete_workflow_metrics.py

1205-1205: Do not catch blind exception: Exception

(BLE001)


1208-1208: Do not catch blind exception: Exception

(BLE001)

🪛 Shellcheck (0.11.0)
container/build.sh

[error] 529-529: Argument mixes string and array. Use * or separate argument.

(SC2145)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
.github/workflows/container-validation-backends.yml (2)

507-522: LGTM! Artifact downloads configured correctly.

The layer metrics and build logs artifact download steps are properly configured with:

  • Correct artifact patterns matching the upload patterns in docker-build action
  • Appropriate continue-on-error: true to handle cases where artifacts may not exist
  • Proper merge-multiple configuration for combining artifacts from multiple jobs

540-540: LGTM! LAYER_INDEX propagation is correct.

The LAYER_INDEX environment variable is properly propagated to the metrics upload step, which is required by the upload_complete_workflow_metrics.py script to upload layer-level metrics.
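
For reference, the gating that comment describes usually reduces to reading the environment variable and skipping the upload when it is absent. The sketch below is an assumption-based illustration, not the PR's code: only the LAYER_INDEX variable, the WorkflowMetricsUploader class name, and the _upload_layer_metrics method name come from the walkthrough, while the artifact glob pattern and the post_to_db stub are hypothetical.

```python
import glob
import json
import os
from typing import Optional


class WorkflowMetricsUploader:
    """Sketch of the layer-metrics portion of the uploader (structure assumed)."""

    def __init__(self) -> None:
        # Layer uploads are gated on the LAYER_INDEX secret being present.
        self.layer_index: Optional[str] = os.environ.get("LAYER_INDEX")

    def post_to_db(self, index: str, payload: dict) -> None:
        # Stand-in for the real backend call.
        print(f"POST to {index}: {json.dumps(payload)[:80]}")

    def _upload_layer_metrics(self, artifacts_dir: str = ".") -> int:
        if not self.layer_index:
            print("LAYER_INDEX not set; skipping layer metrics upload")
            return 0
        uploaded = 0
        # "layer-metrics*.json" is a hypothetical pattern for the downloaded artifacts.
        for layer_file in glob.glob(os.path.join(artifacts_dir, "layer-metrics*.json")):
            with open(layer_file, encoding="utf-8") as fh:
                data = json.load(fh)
            for layer in data.get("layers", []):
                self.post_to_db(self.layer_index, layer)
                uploaded += 1
        return uploaded
```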

Comment on lines 16 to 29
def run_command(cmd: List[str]) -> str:
"""Run a command and return its output"""
try:
result = subprocess.run(
cmd, capture_output=True, text=True, check=True, timeout=60
)
return result.stdout.strip()
except subprocess.CalledProcessError as e:
print(f"Error running command {' '.join(cmd)}: {e}", file=sys.stderr)
print(f"STDERR: {e.stderr}", file=sys.stderr)
return ""
except subprocess.TimeoutExpired:
print(f"Command timed out: {' '.join(cmd)}", file=sys.stderr)
return ""

🛠️ Refactor suggestion | 🟠 Major

Add input validation for image tags.

The run_command function executes Docker commands with user-provided input (image_tag from command-line arguments). While using a list prevents shell injection, the image tag should be validated to ensure it follows Docker's naming conventions.

Apply this diff to add validation:

+def validate_image_tag(image_tag: str) -> bool:
+    """Validate that image_tag follows Docker naming conventions"""
+    import re
+    # Docker image tag pattern: [registry/]name[:tag]
+    pattern = r'^[\w][\w.-]{0,127}(?:/[\w][\w.-]{0,127})*(?::[\w][\w.-]{0,127})?$'
+    return bool(re.match(pattern, image_tag))
+
 def run_command(cmd: List[str]) -> str:
     """Run a command and return its output"""
     try:
         result = subprocess.run(
-            cmd, capture_output=True, text=True, check=True, timeout=60
+            cmd, capture_output=True, text=True, check=True, timeout=60, shell=False
         )

And in the main function:

     image_tag = sys.argv[1]
+    if not validate_image_tag(image_tag):
+        print(f"Error: Invalid image tag format: {image_tag}", file=sys.stderr)
+        sys.exit(1)
🧰 Tools
🪛 Ruff (0.14.5)

19-19: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
In .github/scripts/collect_layer_metrics.py around lines 16 to 29, the
run_command function is fine but the script accepts an image_tag from CLI
without validation; add an image tag validation function that uses a
conservative regex (e.g. allow lowercase repo/name components with optional
separators and an optional :tag consisting of letters, digits, dot, underscore,
dash), enforce max length (<=255), and call it in main before constructing
Docker commands; if validation fails, write an error to stderr and exit non‑zero
so invalid input is rejected before run_command is invoked.

Comment on lines 109 to 128
def get_layer_cache_info(build_log: Optional[str] = None) -> Dict[str, str]:
"""
Parse build log to determine which layers were cached.
Returns a dict mapping layer index to cache status.
"""
cache_info = {}

if not build_log:
return cache_info

# Parse build log for CACHED indicators
# This is best-effort as docker build output format can vary
layer_idx = 0
for line in build_log.split("\n"):
if "CACHED" in line:
cache_info[str(layer_idx)] = "cached"
elif line.strip().startswith("Step "):
layer_idx += 1

return cache_info

⚠️ Potential issue | 🟡 Minor

Cache information parsing may be inaccurate.

The get_layer_cache_info function assumes every Docker build step creates a layer, but this isn't always true (e.g., CMD, LABEL, and other instructions may not create new layers). This could cause cache status to be assigned to incorrect layers.

Since this is marked as "best-effort" and the function returns an empty dict if no build log is provided, this is acceptable for now. However, consider improving the parsing logic in a future iteration to match step numbers with actual layer creation.
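
One possible direction for that future iteration, sketched below under the assumption of BuildKit --progress=plain output (which differs from the legacy "Step N/M" lines the current function counts): key cache status by the instruction text from each vertex header instead of a positional index, so it can later be matched against the CreatedBy field reported by docker history. This is illustrative only, not a proposed diff.

```python
import re
from typing import Dict


def get_layer_cache_info_buildkit(build_log: str) -> Dict[str, str]:
    """Map instruction text to cache status, parsed from BuildKit plain-progress logs (assumed format)."""
    cache_info: Dict[str, str] = {}
    vertex_instructions: Dict[str, str] = {}  # vertex id -> instruction text
    for line in build_log.splitlines():
        # Vertex header, e.g. "#7 [stage-1 3/9] RUN pip install -r requirements.txt"
        header = re.match(r"^#(\d+) \[[^\]]+\] (.+)$", line)
        if header:
            vertex_instructions[header.group(1)] = header.group(2).strip()
            continue
        cached = re.match(r"^#(\d+) CACHED\b", line)
        if cached and cached.group(1) in vertex_instructions:
            cache_info[vertex_instructions[cached.group(1)]] = "cached"
    return cache_info
```

The resulting keys could then be compared against docker history's CreatedBy strings to annotate real layers, with a fallback for instructions that do not produce a layer.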
