feat: add container layer metrics #4523
Walkthrough

This pull request introduces end-to-end build telemetry and metrics collection infrastructure across Docker build workflows. It adds layer-level metrics collection, build logging with statistics extraction, artifact uploads, and extends the metrics uploader to handle per-layer data alongside existing job-level metrics.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
.github/scripts/collect_layer_metrics.py (1)
78-106: Move import statement to module level.

The `re` module is imported inside the function, which is inefficient if the function is called multiple times. Move the import to the top of the file with the other imports. Apply this diff:

```diff
 import json
 import subprocess
 import sys
+import re
 from datetime import datetime, timezone
 from typing import Any, Dict, List, Optional
 ...
 def parse_size_to_bytes(size_str: str) -> int:
     """
     Convert Docker size string (e.g., "1.2GB", "500MB", "10kB") to bytes.
     """
     size_str = size_str.strip().upper()
     if size_str == "0B" or size_str == "0":
         return 0
     # Extract number and unit
-    import re
     match = re.match(r"^([\d.]+)\s*([KMGT]?B)$", size_str)
```

container/build.sh (1)
529-529: Use quoted parameter expansion for logging.

Shellcheck warns about mixing a string and an array with `$@` inside double quotes. While this is safe in the echo context, the quoted `$*` form is more consistent. Apply this diff:

```diff
-echo "Arguments: $@"
+echo "Arguments: $*"
```
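The behavioral difference behind SC2145 can be shown with a tiny standalone snippet (illustrative, not part of build.sh):

```shell
# Why SC2145 fires: inside a double-quoted string, "$@" expands to multiple
# words and splits the surrounding text, while "$*" joins all positional
# parameters into a single word (separated by the first character of IFS).
set -- alpha beta gamma
joined="Arguments: $*"
echo "$joined"
```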
.github/workflows/upload_complete_workflow_metrics.py (2)
1199-1206: Improve exception handling specificity.

The code catches a blind `Exception`, which makes debugging difficult. Consider catching more specific exceptions, or at least log the exception type and traceback. Apply this diff:

```diff
         try:
             self.post_to_db(layer_index, layer_metric)
             print(
                 f"✅ Uploaded layer {layer_idx}: {layer.get('size_human', '0B')}"
             )
             total_layers_processed += 1
-        except Exception as e:
+        except (requests.exceptions.RequestException, ValueError) as e:
             print(f"❌ Failed to upload layer {layer_idx}: {e}")
```

Or keep `Exception` but add traceback logging for debugging:

```diff
+import traceback
+
         except Exception as e:
             print(f"❌ Failed to upload layer {layer_idx}: {e}")
+            print(f"   Traceback: {traceback.format_exc()}", file=sys.stderr)
```
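As a self-contained illustration of the pattern the diffs suggest (catch narrowly where possible, and log the traceback either way), here is a minimal sketch; `upload_with_logging` and its arguments are hypothetical names, not the script's actual API:

```python
import sys
import traceback


def upload_with_logging(upload_fn, item):
    """Call upload_fn(item), logging a full traceback on failure.

    Catches only the error types we expect from an upload path; anything
    else propagates, so genuine bugs are not silently swallowed.
    """
    try:
        upload_fn(item)
        return True
    except (ValueError, OSError) as e:
        print(f"Failed to upload {item!r}: {e}", file=sys.stderr)
        # The traceback pinpoints the failing line even when the message is vague.
        print(traceback.format_exc(), file=sys.stderr)
        return False
```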
1208-1209: Improve exception handling specificity.

As in the previous comment, catching a blind `Exception` makes debugging difficult when processing a layer metrics file fails. Apply this diff:

```diff
+import traceback
+
         except Exception as e:
             print(f"❌ Failed to process layer metrics file {layer_file}: {e}")
+            print(f"   Traceback: {traceback.format_exc()}", file=sys.stderr)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .github/actions/docker-build/action.yml (4 hunks)
- .github/scripts/collect_layer_metrics.py (1 hunk)
- .github/workflows/container-validation-backends.yml (2 hunks)
- .github/workflows/upload_complete_workflow_metrics.py (6 hunks)
- container/build.sh (6 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Applied to files:
container/build.sh
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
container/build.sh
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
container/build.sh
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4523/merge) by nv-nmailhot.
.github/scripts/collect_layer_metrics.py
[error] 1-1: Black formatting check failed. reformatted this file by the hook (pre-commit).
container/build.sh
[error] 1-1: Trailing whitespace found and fixed by pre-commit hook.
.github/actions/docker-build/action.yml
[error] 1-1: Trailing whitespace found and fixed by pre-commit hook.
.github/workflows/container-validation-backends.yml
[error] 1-1: Command failed with exit code 1 during pre-commit checks: pre-commit run --show-diff-on-failure --color=always --all-files
.github/workflows/upload_complete_workflow_metrics.py
[error] 1-1: Black formatting check failed. reformatted this file by the hook (pre-commit).
🪛 Ruff (0.14.5)
.github/scripts/collect_layer_metrics.py
19-19: subprocess call: check for execution of untrusted input
(S603)
.github/workflows/upload_complete_workflow_metrics.py
1205-1205: Do not catch blind exception: Exception
(BLE001)
1208-1208: Do not catch blind exception: Exception
(BLE001)
🪛 Shellcheck (0.11.0)
container/build.sh
[error] 529-529: Argument mixes string and array. Use * or separate argument.
(SC2145)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: sglang (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
.github/workflows/container-validation-backends.yml (2)
507-522: LGTM! Artifact downloads configured correctly.

The layer metrics and build logs artifact download steps are properly configured with:
- Correct artifact patterns matching the upload patterns in the docker-build action
- Appropriate `continue-on-error: true` to handle cases where artifacts may not exist
- Proper `merge-multiple` configuration for combining artifacts from multiple jobs
540-540: LGTM! LAYER_INDEX propagation is correct.

The LAYER_INDEX environment variable is properly propagated to the metrics upload step, which is required by the upload_complete_workflow_metrics.py script to upload layer-level metrics.
```python
def run_command(cmd: List[str]) -> str:
    """Run a command and return its output"""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=60
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        print(f"Error running command {' '.join(cmd)}: {e}", file=sys.stderr)
        print(f"STDERR: {e.stderr}", file=sys.stderr)
        return ""
    except subprocess.TimeoutExpired:
        print(f"Command timed out: {' '.join(cmd)}", file=sys.stderr)
        return ""
```
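Since the helper returns an empty string on any failure, callers can branch on truthiness. A runnable sketch (reusing the helper as shown above, with a harmless command in place of a Docker call):

```python
import subprocess
import sys
from typing import List


def run_command(cmd: List[str]) -> str:
    """Run a command and return its output (empty string on failure)."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=60
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        print(f"Error running command {' '.join(cmd)}: {e}", file=sys.stderr)
        return ""
    except subprocess.TimeoutExpired:
        print(f"Command timed out: {' '.join(cmd)}", file=sys.stderr)
        return ""


# A failed command yields "", so the caller can treat the result as a boolean:
out = run_command([sys.executable, "-c", "print('ok')"])
if out:
    print(f"command succeeded: {out}")
```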
🛠️ Refactor suggestion | 🟠 Major
Add input validation for image tags.
The run_command function executes Docker commands with user-provided input (image_tag from command-line arguments). While using a list prevents shell injection, the image tag should be validated to ensure it follows Docker's naming conventions.
Apply this diff to add validation:

```diff
+def validate_image_tag(image_tag: str) -> bool:
+    """Validate that image_tag follows Docker naming conventions"""
+    import re
+    # Docker image tag pattern: [registry/]name[:tag]
+    pattern = r'^[\w][\w.-]{0,127}(?:/[\w][\w.-]{0,127})*(?::[\w][\w.-]{0,127})?$'
+    return bool(re.match(pattern, image_tag))
+
 def run_command(cmd: List[str]) -> str:
     """Run a command and return its output"""
     try:
         result = subprocess.run(
-            cmd, capture_output=True, text=True, check=True, timeout=60
+            cmd, capture_output=True, text=True, check=True, timeout=60, shell=False
         )
```

And in the main function:

```diff
     image_tag = sys.argv[1]
+    if not validate_image_tag(image_tag):
+        print(f"Error: Invalid image tag format: {image_tag}", file=sys.stderr)
+        sys.exit(1)
```

🧰 Tools

🪛 Ruff (0.14.5)

19-19: subprocess call: check for execution of untrusted input

(S603)
🤖 Prompt for AI Agents
In .github/scripts/collect_layer_metrics.py around lines 16 to 29, the
run_command function is fine but the script accepts an image_tag from CLI
without validation; add an image tag validation function that uses a
conservative regex (e.g. allow lowercase repo/name components with optional
separators and an optional :tag consisting of letters, digits, dot, underscore,
dash), enforce max length (<=255), and call it in main before constructing
Docker commands; if validation fails, write an error to stderr and exit non‑zero
so invalid input is rejected before run_command is invoked.
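Following that prompt, a conservative standalone validator might look like the sketch below; the exact regex and the 255-character cap are assumptions about what a plausible Docker reference should look like here, not the project's actual helper:

```python
import re

# Conservative Docker reference check: lowercase name components separated by
# "/", each allowing internal ".", "_", or "-", plus an optional ":tag".
# Deliberately stricter than the full Docker grammar (e.g. no registry ports).
_TAG_RE = re.compile(
    r"^[a-z0-9]+(?:[._-][a-z0-9]+)*"        # first name component
    r"(?:/[a-z0-9]+(?:[._-][a-z0-9]+)*)*"   # optional further components
    r"(?::[\w][\w.-]{0,127})?$"             # optional tag
)


def validate_image_tag(image_tag: str) -> bool:
    """Return True if image_tag looks like a plausible Docker image reference."""
    return len(image_tag) <= 255 and bool(_TAG_RE.match(image_tag))
```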
```python
def get_layer_cache_info(build_log: Optional[str] = None) -> Dict[str, str]:
    """
    Parse build log to determine which layers were cached.
    Returns a dict mapping layer index to cache status.
    """
    cache_info = {}

    if not build_log:
        return cache_info

    # Parse build log for CACHED indicators
    # This is best-effort as docker build output format can vary
    layer_idx = 0
    for line in build_log.split("\n"):
        if "CACHED" in line:
            cache_info[str(layer_idx)] = "cached"
        elif line.strip().startswith("Step "):
            layer_idx += 1

    return cache_info
```
Cache information parsing may be inaccurate.

The get_layer_cache_info function assumes every Docker build step creates a layer, but this isn't always true (e.g., CMD, LABEL, and other metadata-only instructions do not create new layers). This can cause cache status to be assigned to the wrong layers.

Since this is marked as "best-effort" and the function returns an empty dict when no build log is provided, this is acceptable for now. However, consider improving the parsing logic in a future iteration to match step numbers with actual layer creation.
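One possible future direction, sketched under the assumption of classic `docker build` "Step N/M : INSTRUCTION" output: advance the layer index only for instructions that normally create filesystem layers (RUN, COPY, and ADD are illustrative choices), so cache status lands on the right layer.

```python
import re
from typing import Dict

# Instructions that normally produce a filesystem layer; metadata-only
# instructions (CMD, LABEL, ENV, EXPOSE, ...) are intentionally excluded.
_LAYER_CREATING = {"RUN", "COPY", "ADD"}


def get_layer_cache_info(build_log: str) -> Dict[str, str]:
    """Map layer index -> "cached", counting only layer-creating steps."""
    cache_info: Dict[str, str] = {}
    layer_idx = -1
    current_creates_layer = False
    for line in build_log.split("\n"):
        step = re.match(r"^Step \d+/\d+ : (\w+)", line.strip())
        if step:
            current_creates_layer = step.group(1).upper() in _LAYER_CREATING
            if current_creates_layer:
                layer_idx += 1
        elif "CACHED" in line and current_creates_layer and layer_idx >= 0:
            cache_info[str(layer_idx)] = "cached"
    return cache_info
```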