-
Notifications
You must be signed in to change notification settings - Fork 460
feat(metrics): add task completion latency tracking and reporting #1130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(metrics): add task completion latency tracking and reporting #1130
Conversation
Summary of ChangesHello @staryxchen, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces comprehensive task completion latency distribution metrics to the Transfer Engine. This enhancement moves beyond simple throughput measurements to provide critical insights into task-level performance, enabling better performance tuning, more accurate SLA monitoring, and improved identification of tail latencies and bottlenecks. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces valuable latency tracking for transfer tasks. The implementation is mostly solid, using histograms for distribution and a snapshot mechanism for interval-based reporting. However, I've identified a critical memory leak related to the cleanup of non-completed tasks, a potential data corruption issue in task key generation, and a couple of areas for code simplification and maintainability improvements. Addressing these points will make the feature more robust and easier to maintain.
f73c515 to
20c7764
Compare
- Implement task timing with start and completion tracking - Add histogram metrics for task completion latency - Enhance metrics reporting with detailed latency distribution - Include throughput and latency statistics in periodic logs Signed-off-by: staryxchen <[email protected]>
20c7764 to
2f4050b
Compare
|
👍 I will review it today. |
- Extract latency threshold to named constant for clarity - Simplify bucket range formatting with explicit overflow handling - Restructure conditional flow to reduce nesting Signed-off-by: staryxchen <[email protected]>
df6f60b to
013237a
Compare
Add Transfer Task Latency Distribution Metrics
Summary
This PR adds histogram-based latency distribution tracking for transfer tasks in the Transfer Engine metrics system.
Motivation
Understanding task completion latency distribution is crucial for:
Previously, only throughput metrics were available. This enhancement provides detailed visibility into task-level performance.
Changes
Core Implementation
Modified Files:
mooncake-transfer-engine/include/transfer_engine.hmooncake-transfer-engine/src/transfer_engine.cppKey Components:
Latency Tracking
getTransferStatus()callylt::metric::histogram_tfor efficient distribution trackingHistogram Buckets
Interval-based Reporting
Output Format
Features:
Usage
Design Decisions
histogram_tdoesn't support reset; snapshot-based interval calculation is standard practice (Prometheus)Performance Impact
WITH_METRICScompile flagTesting Instructions:
Expected output will show latency distribution every 5 seconds alongside throughput metrics.
Type of Change
How Has This Been Tested?
WITH_METRICSenabled/disabledChecklist