
Conversation


@nv-nmailhot nv-nmailhot commented Nov 21, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • New Features

    • Added comprehensive build telemetry system capturing build duration, image size, and performance metrics.
    • Implemented layer-level metrics collection from Docker images.
    • Introduced automated build log parsing to track cached layers, warnings, errors, and pull operations.
    • Added artifact uploads for build metrics, layer metrics, and build logs.
  • Chores

    • Enhanced build output with improved formatting and structured sections for better visibility.

✏️ Tip: You can customize this high-level summary in your review settings.

@nv-nmailhot nv-nmailhot requested review from a team as code owners November 21, 2025 01:05
@github-actions github-actions bot added the feat label Nov 21, 2025

coderabbitai bot commented Nov 21, 2025

Walkthrough

This pull request introduces end-to-end build telemetry and metrics collection infrastructure across Docker build workflows. It adds layer-level metrics collection, build logging with statistics extraction, artifact uploads, and extends the metrics uploader to handle per-layer data alongside existing job-level metrics.

Changes

  • Build Telemetry & Logging (.github/actions/docker-build/action.yml): Introduces comprehensive observability: build start/end timestamps, per-run log files, command logging via tee, and a new "Capture Build Metrics" step that parses logs to extract cached layers, build steps, cache hit rate, warnings, errors, and pull operations. Adds a "Collect Layer Metrics" step wrapping a Python-based layer collector. Uploads artifacts for metrics and logs with conditional handling.
  • Layer Metrics Collection (.github/scripts/collect_layer_metrics.py): New Python script to gather Docker image layer-level metrics via docker history. Includes helper functions run_command, get_image_layers, parse_size_to_bytes, and get_layer_cache_info. The main collect_layer_metrics function orchestrates retrieval, parsing, cache annotation, and size aggregation, then exports a metrics JSON with per-layer details (ID, size, creation date, cache status) and collection metadata. A sketch of this flow follows this list.
  • Workflow Artifact Downloads (.github/workflows/container-validation-backends.yml): Adds two download-artifact steps to fetch layer-metrics and build-logs from prior jobs using wildcard patterns and merge-multiple mode. Propagates the LAYER_INDEX secret to the metrics upload step in both the container-validation and upload-metrics blocks.
  • Metrics Schema & Layer Upload Integration (.github/workflows/upload_complete_workflow_metrics.py): Extends field constants for layer metrics (cached_layers, total_build_steps, cache_hit_rate_percent, build_warnings, build_errors, pull_operations) and layer-specific data (ID, size, creation_by, cache_status, comment, total_layers, image_tag). Adds an _upload_layer_metrics method to WorkflowMetricsUploader to locate, parse, and upload per-layer JSON payloads. Initializes layer_index from the LAYER_INDEX environment variable and gates the upload on its presence. Wires the layer metrics upload into the _upload_single_job_metrics post-test workflow.
  • Build Script Output & Logging (container/build.sh): Expands show_image_options with structured, sectioned output: Docker build configuration, image details, TensorRT-LLM configuration (when applicable), sccache settings, and cache options, with emojis and clear labels. Adds early startup logs ("DYNAMO BUILD SCRIPT STARTING"). Introduces per-stage build logging (base-image-build.log, framework-*-build.log) with statistics extraction (cached layers, total steps, cache hit rate, layers pulled). Adds header dividers, architecture detection messages, and status confirmations throughout the build flow.
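
As context for the Layer Metrics Collection entry above, here is a minimal sketch of how such a collector can work. The helper names (run_command, get_image_layers, parse_size_to_bytes) come from the summary, but their exact signatures and output schema are not visible in this review, so everything below is an assumption: it reads docker history in JSON format and treats size strings as decimal units.

```python
import json
import re
import subprocess
import sys
from typing import Dict, List


def run_command(cmd: List[str]) -> str:
    """Run a command and return stdout, or an empty string on failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=60
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
        print(f"Command failed: {' '.join(cmd)}: {exc}", file=sys.stderr)
        return ""


def parse_size_to_bytes(size_str: str) -> int:
    """Convert a Docker size string such as '1.2GB' or '10kB' to bytes (decimal units assumed)."""
    size_str = size_str.strip().upper()
    if size_str in ("", "0", "0B"):
        return 0
    match = re.match(r"^([\d.]+)\s*([KMGT]?B)$", size_str)
    if not match:
        return 0
    value, unit = float(match.group(1)), match.group(2)
    multipliers = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    return int(value * multipliers[unit])


def get_image_layers(image_tag: str) -> List[Dict[str, str]]:
    """Return one dict per layer from docker history (newest layer first)."""
    output = run_command(
        ["docker", "history", "--no-trunc", "--format", "{{json .}}", image_tag]
    )
    return [json.loads(line) for line in output.splitlines() if line.strip()]


if __name__ == "__main__":
    layers = get_image_layers(sys.argv[1])
    total_bytes = sum(parse_size_to_bytes(layer.get("Size", "0B")) for layer in layers)
    print(json.dumps({"total_layers": len(layers), "total_size_bytes": total_bytes}, indent=2))
```

The script in the PR additionally annotates cache status from the build log and writes per-layer records with creation metadata; those details are omitted here.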

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Metrics field consistency: Verify all new field constants across upload_complete_workflow_metrics.py are correctly mapped from source metrics objects and consistently named.
  • Layer metrics parsing robustness: Review .github/scripts/collect_layer_metrics.py for edge cases in JSON parsing, size string conversion (e.g., malformed sizes, non-existent images), and timeout handling.
  • Workflow integration points: Confirm LAYER_INDEX secret is properly gated and that download-artifact merge-multiple and conditional uploads work as intended.
  • Build script log capture: Verify log file paths, tee piping, and statistics extraction regex patterns handle both single-stage and multi-stage builds correctly (an illustrative sketch of this kind of statistics extraction follows this list).
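
For the last checklist item, the statistics extraction in container/build.sh is done in shell against the captured build log, and its actual regex patterns are not visible in this review. The Python sketch below only illustrates the arithmetic a reviewer would be checking (cached steps versus total steps, and the derived cache hit rate), assuming BuildKit-style "#N ..." / "#N CACHED" log lines; it is not the PR's implementation.

```python
import re
from typing import Dict, Union


def extract_build_stats(log_text: str) -> Dict[str, Union[int, float]]:
    """Derive simple statistics from a build log (BuildKit plain-progress format assumed)."""
    # Each BuildKit vertex is prefixed "#<n>"; a cached vertex prints "#<n> CACHED".
    step_ids = set(re.findall(r"^#(\d+) ", log_text, flags=re.MULTILINE))
    cached_ids = set(re.findall(r"^#(\d+) CACHED\b", log_text, flags=re.MULTILINE))
    total = len(step_ids)
    return {
        "total_build_steps": total,
        "cached_layers": len(cached_ids),
        "cache_hit_rate_percent": round(100.0 * len(cached_ids) / total, 2) if total else 0.0,
        # The warning/error/pull patterns below are placeholders, not the PR's patterns.
        "build_warnings": len(re.findall(r"\bWARNING\b", log_text, flags=re.IGNORECASE)),
        "build_errors": len(re.findall(r"\bERROR\b", log_text, flags=re.IGNORECASE)),
        "pull_operations": len(re.findall(r"\bpull", log_text, flags=re.IGNORECASE)),
    }


if __name__ == "__main__":
    sample = (
        "#1 [internal] load build definition\n"
        "#2 [base 1/4] FROM docker.io/library/ubuntu:24.04\n"
        "#2 CACHED\n"
        "#3 [base 2/4] RUN apt-get update\n"
    )
    print(extract_build_stats(sample))
```

The field names mirror the constants added in upload_complete_workflow_metrics.py; whether the shell script computes them the same way is exactly what this checklist item asks reviewers to verify.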

Poem

🐰 A tale of metrics bright,
Layer by layer, logs take flight,
Build telemetry, oh what a sight!
Artifacts dance through GitHub's night, ⚡📊

Pre-merge checks

❌ Failed checks (1 warning)
Description check: ⚠️ Warning. The pull request description contains only template placeholders with no actual content describing the changes, objectives, or implementation details. Resolution: fill in the Overview, Details, and "Where should the reviewer start" sections with specific information about the changes made, and replace the placeholder issue reference with the actual related GitHub issue number.
✅ Passed checks (2 passed)
Title check: ✅ Passed. The title 'feat: add container layer metrics' clearly and specifically describes the main change, introducing layer metrics collection for containers.
Docstring Coverage: ✅ Passed. Docstring coverage is 80.00%, which meets the required threshold of 80.00%.

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
.github/scripts/collect_layer_metrics.py (1)

78-106: Move import statement to module level.

The re module is imported inside the function, which is inefficient if the function is called multiple times. Move the import to the top of the file with other imports.

Apply this diff:

 import json
 import subprocess
 import sys
+import re
 from datetime import datetime, timezone
 from typing import Any, Dict, List, Optional
 
 ...
 
 def parse_size_to_bytes(size_str: str) -> int:
     """
     Convert Docker size string (e.g., "1.2GB", "500MB", "10kB") to bytes.
     """
     size_str = size_str.strip().upper()
     
     if size_str == "0B" or size_str == "0":
         return 0
     
     # Extract number and unit
-    import re
     match = re.match(r"^([\d.]+)\s*([KMGT]?B)$", size_str)
container/build.sh (1)

529-529: Use quoted parameter expansion for logging.

Shellcheck warns about mixing string and array with $@. While this is safe in the echo context, it's better to use quoted forms for consistency.

Apply this diff:

-echo "Arguments: $@"
+echo "Arguments: $*"

.github/workflows/upload_complete_workflow_metrics.py (2)

1199-1206: Improve exception handling specificity.

The code catches blind Exception which makes debugging difficult. Consider catching more specific exceptions or at least logging the exception type and traceback.

Apply this diff:

                     try:
                         self.post_to_db(layer_index, layer_metric)
                         print(
                             f"✅ Uploaded layer {layer_idx}: {layer.get('size_human', '0B')}"
                         )
                         total_layers_processed += 1
-                    except Exception as e:
+                    except (requests.exceptions.RequestException, ValueError) as e:
                         print(f"❌ Failed to upload layer {layer_idx}: {e}")

Or keep Exception but add traceback logging for debugging:

+import traceback
+
                     try:
                         self.post_to_db(layer_index, layer_metric)
                         print(
                             f"✅ Uploaded layer {layer_idx}: {layer.get('size_human', '0B')}"
                         )
                         total_layers_processed += 1
                     except Exception as e:
                         print(f"❌ Failed to upload layer {layer_idx}: {e}")
+                        print(f"   Traceback: {traceback.format_exc()}", file=sys.stderr)

1208-1209: Improve exception handling specificity.

Similar to the previous comment, catching blind Exception makes debugging difficult when processing layer metrics files fails.

Apply this diff:

+import traceback
+
             except Exception as e:
                 print(f"❌ Failed to process layer metrics file {layer_file}: {e}")
+                print(f"   Traceback: {traceback.format_exc()}", file=sys.stderr)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e24123 and 670d87b.

📒 Files selected for processing (5)
  • .github/actions/docker-build/action.yml (4 hunks)
  • .github/scripts/collect_layer_metrics.py (1 hunks)
  • .github/workflows/container-validation-backends.yml (2 hunks)
  • .github/workflows/upload_complete_workflow_metrics.py (6 hunks)
  • container/build.sh (6 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • container/build.sh
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • container/build.sh
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • container/build.sh
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4523/merge) by nv-nmailhot.
.github/scripts/collect_layer_metrics.py

[error] 1-1: Black formatting check failed. reformatted this file by the hook (pre-commit).

container/build.sh

[error] 1-1: Trailing whitespace found and fixed by pre-commit hook.

.github/actions/docker-build/action.yml

[error] 1-1: Trailing whitespace found and fixed by pre-commit hook.

.github/workflows/container-validation-backends.yml

[error] 1-1: Command failed with exit code 1 during pre-commit checks: pre-commit run --show-diff-on-failure --color=always --all-files

.github/workflows/upload_complete_workflow_metrics.py

[error] 1-1: Black formatting check failed. reformatted this file by the hook (pre-commit).

🪛 Ruff (0.14.5)
.github/scripts/collect_layer_metrics.py

19-19: subprocess call: check for execution of untrusted input

(S603)

.github/workflows/upload_complete_workflow_metrics.py

1205-1205: Do not catch blind exception: Exception

(BLE001)


1208-1208: Do not catch blind exception: Exception

(BLE001)

🪛 Shellcheck (0.11.0)
container/build.sh

[error] 529-529: Argument mixes string and array. Use * or separate argument.

(SC2145)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
.github/workflows/container-validation-backends.yml (2)

507-522: LGTM! Artifact downloads configured correctly.

The layer metrics and build logs artifact download steps are properly configured with:

  • Correct artifact patterns matching the upload patterns in docker-build action
  • Appropriate continue-on-error: true to handle cases where artifacts may not exist
  • Proper merge-multiple configuration for combining artifacts from multiple jobs

540-540: LGTM! LAYER_INDEX propagation is correct.

The LAYER_INDEX environment variable is properly propagated to the metrics upload step, which is required by the upload_complete_workflow_metrics.py script to upload layer-level metrics.
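
For reference, the gating that comment describes usually reduces to reading the environment variable and skipping the upload when it is absent. The sketch below is an assumption-based illustration, not the PR's code: only the LAYER_INDEX variable, the WorkflowMetricsUploader class name, and the _upload_layer_metrics method name come from the walkthrough, while the artifact glob pattern and the post_to_db stub are hypothetical.

```python
import glob
import json
import os
from typing import Optional


class WorkflowMetricsUploader:
    """Sketch of the layer-metrics portion of the uploader (structure assumed)."""

    def __init__(self) -> None:
        # Layer uploads are gated on the LAYER_INDEX secret being present.
        self.layer_index: Optional[str] = os.environ.get("LAYER_INDEX")

    def post_to_db(self, index: str, payload: dict) -> None:
        # Stand-in for the real backend call.
        print(f"POST to {index}: {json.dumps(payload)[:80]}")

    def _upload_layer_metrics(self, artifacts_dir: str = ".") -> int:
        if not self.layer_index:
            print("LAYER_INDEX not set; skipping layer metrics upload")
            return 0
        uploaded = 0
        # "layer-metrics*.json" is a hypothetical pattern for the downloaded artifacts.
        for layer_file in glob.glob(os.path.join(artifacts_dir, "layer-metrics*.json")):
            with open(layer_file, encoding="utf-8") as fh:
                data = json.load(fh)
            for layer in data.get("layers", []):
                self.post_to_db(self.layer_index, layer)
                uploaded += 1
        return uploaded
```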

Comment on lines 16 to 29
def run_command(cmd: List[str]) -> str:
"""Run a command and return its output"""
try:
result = subprocess.run(
cmd, capture_output=True, text=True, check=True, timeout=60
)
return result.stdout.strip()
except subprocess.CalledProcessError as e:
print(f"Error running command {' '.join(cmd)}: {e}", file=sys.stderr)
print(f"STDERR: {e.stderr}", file=sys.stderr)
return ""
except subprocess.TimeoutExpired:
print(f"Command timed out: {' '.join(cmd)}", file=sys.stderr)
return ""

🛠️ Refactor suggestion | 🟠 Major

Add input validation for image tags.

The run_command function executes Docker commands with user-provided input (image_tag from command-line arguments). While using a list prevents shell injection, the image tag should be validated to ensure it follows Docker's naming conventions.

Apply this diff to add validation:

+def validate_image_tag(image_tag: str) -> bool:
+    """Validate that image_tag follows Docker naming conventions"""
+    import re
+    # Docker image tag pattern: [registry/]name[:tag]
+    pattern = r'^[\w][\w.-]{0,127}(?:/[\w][\w.-]{0,127})*(?::[\w][\w.-]{0,127})?$'
+    return bool(re.match(pattern, image_tag))
+
 def run_command(cmd: List[str]) -> str:
     """Run a command and return its output"""
     try:
         result = subprocess.run(
-            cmd, capture_output=True, text=True, check=True, timeout=60
+            cmd, capture_output=True, text=True, check=True, timeout=60, shell=False
         )

And in the main function:

     image_tag = sys.argv[1]
+    if not validate_image_tag(image_tag):
+        print(f"Error: Invalid image tag format: {image_tag}", file=sys.stderr)
+        sys.exit(1)
🧰 Tools
🪛 Ruff (0.14.5)

19-19: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
In .github/scripts/collect_layer_metrics.py around lines 16 to 29, the
run_command function is fine but the script accepts an image_tag from CLI
without validation; add an image tag validation function that uses a
conservative regex (e.g. allow lowercase repo/name components with optional
separators and an optional :tag consisting of letters, digits, dot, underscore,
dash), enforce max length (<=255), and call it in main before constructing
Docker commands; if validation fails, write an error to stderr and exit non‑zero
so invalid input is rejected before run_command is invoked.

Comment on lines 109 to 128
def get_layer_cache_info(build_log: Optional[str] = None) -> Dict[str, str]:
"""
Parse build log to determine which layers were cached.
Returns a dict mapping layer index to cache status.
"""
cache_info = {}

if not build_log:
return cache_info

# Parse build log for CACHED indicators
# This is best-effort as docker build output format can vary
layer_idx = 0
for line in build_log.split("\n"):
if "CACHED" in line:
cache_info[str(layer_idx)] = "cached"
elif line.strip().startswith("Step "):
layer_idx += 1

return cache_info

⚠️ Potential issue | 🟡 Minor

Cache information parsing may be inaccurate.

The get_layer_cache_info function assumes every Docker build step creates a layer, but this isn't always true (e.g., CMD, LABEL, and other instructions may not create new layers). This could cause cache status to be assigned to incorrect layers.

Since this is marked as "best-effort" and the function returns an empty dict if no build log is provided, this is acceptable for now. However, consider improving the parsing logic in a future iteration to match step numbers with actual layer creation.
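
One possible direction for that future iteration, sketched below under the assumption of BuildKit --progress=plain output (which differs from the legacy "Step N/M" lines the current function counts): key cache status by the instruction text from each vertex header instead of a positional index, so it can later be matched against the CreatedBy field reported by docker history. This is illustrative only, not a proposed diff.

```python
import re
from typing import Dict


def get_layer_cache_info_buildkit(build_log: str) -> Dict[str, str]:
    """Map instruction text to cache status, parsed from BuildKit plain-progress logs (assumed format)."""
    cache_info: Dict[str, str] = {}
    vertex_instructions: Dict[str, str] = {}  # vertex id -> instruction text
    for line in build_log.splitlines():
        # Vertex header, e.g. "#7 [stage-1 3/9] RUN pip install -r requirements.txt"
        header = re.match(r"^#(\d+) \[[^\]]+\] (.+)$", line)
        if header:
            vertex_instructions[header.group(1)] = header.group(2).strip()
            continue
        cached = re.match(r"^#(\d+) CACHED\b", line)
        if cached and cached.group(1) in vertex_instructions:
            cache_info[vertex_instructions[cached.group(1)]] = "cached"
    return cache_info
```

The resulting keys could then be compared against docker history's CreatedBy strings to annotate real layers, with a fallback for instructions that do not produce a layer.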
