
Conversation

@staryxchen
Collaborator

Add Transfer Task Latency Distribution Metrics

Summary

This PR adds histogram-based latency distribution tracking for transfer tasks in the Transfer Engine metrics system.

Motivation

Understanding task completion latency distribution is crucial for:

  • Performance tuning and bottleneck identification
  • SLA monitoring and analysis
  • Detecting outliers and tail latencies

Previously, only throughput metrics were available. This enhancement provides detailed visibility into task-level performance.

Changes

Core Implementation

Modified Files:

  • mooncake-transfer-engine/include/transfer_engine.h
  • mooncake-transfer-engine/src/transfer_engine.cpp

Key Components:

  1. Latency Tracking

    • Record task start time on first getTransferStatus() call
    • Calculate and record completion latency when task completes
    • Use ylt::metric::histogram_t for efficient distribution tracking
  2. Histogram Buckets

    • Fine-grained buckets (10μs - 10s) covering sub-millisecond to multi-second latencies
    • 17 boundary values creating 18 buckets optimized for high-performance scenarios
  3. Interval-based Reporting

    • Snapshot mechanism for computing per-interval statistics
    • Avoids cumulative drift, shows actual distribution within each reporting period
    • Aligns with Prometheus-style metrics design (monotonic counters + rate calculation)
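The tracking described above can be sketched roughly as below. The PR only states the 10μs-10s range and the 17-boundary/18-bucket counts, so the concrete boundary values, and all names (`kBoundariesUs`, `bucket_index`, `record_latency`), are assumptions for illustration; the actual implementation uses `ylt::metric::histogram_t` inside transfer_engine.cpp.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical boundaries in microseconds -- 17 values giving 18 buckets,
// spanning 10us to 10s as the PR describes. The real values may differ.
static const std::array<int64_t, 17> kBoundariesUs = {
    10,     20,     50,     100,     200,      500,
    1000,   2000,   5000,   10000,   20000,    50000,
    100000, 200000, 500000, 1000000, 10000000};

// Cumulative per-bucket counters; relaxed atomic increments keep the
// hot path lock-free.
static std::array<std::atomic<uint64_t>, 18> g_latency_buckets{};

// The first boundary strictly greater than the sample selects the bucket;
// samples past the last boundary fall into the overflow bucket (index 17).
inline size_t bucket_index(int64_t latency_us) {
    auto it = std::upper_bound(kBoundariesUs.begin(), kBoundariesUs.end(),
                               latency_us);
    return static_cast<size_t>(it - kBoundariesUs.begin());
}

inline void record_latency(int64_t latency_us) {
    g_latency_buckets[bucket_index(latency_us)]
        .fetch_add(1, std::memory_order_relaxed);
}
```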

Output Format

[Metrics] Transfer Engine Stats (over last 5s): 
Throughput: 22480.90 MB/s | 
Latency Distribution (count=56215): 0-10μs:0.2%, 10-20μs:1.6%, 20-50μs:96.5%, 50-100μs:0.9%, 100-200μs:0.8%

Features:

  • Only shows buckets holding ≥0.1% of samples, to reduce noise
  • Displays task count and percentage distribution
  • Single-line format for easy parsing and monitoring
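A minimal sketch of the single-line formatter, assuming per-interval bucket counts and precomputed labels (the function and parameter names here are hypothetical, not the PR's actual code):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Buckets contributing less than 0.1% of the interval's samples are
// skipped so the log line stays short and parseable.
std::string format_distribution(const std::vector<std::string>& labels,
                                const std::vector<uint64_t>& counts) {
    uint64_t total = 0;
    for (uint64_t c : counts) total += c;
    char buf[64];
    std::snprintf(buf, sizeof(buf), "Latency Distribution (count=%llu):",
                  static_cast<unsigned long long>(total));
    std::string out = buf;
    bool first = true;
    for (size_t i = 0; i < counts.size() && total > 0; ++i) {
        double pct = 100.0 * static_cast<double>(counts[i]) / total;
        if (pct < 0.1) continue;  // noise threshold
        std::snprintf(buf, sizeof(buf), "%s%s:%.1f%%",
                      first ? " " : ", ", labels[i].c_str(), pct);
        out += buf;
        first = false;
    }
    return out;
}
```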

Usage

# Enable metrics (required)
export MC_TE_METRIC=1

# Optional: customize interval (default: 5s)
export MC_TE_METRIC_INTERVAL_SECONDS=10

Design Decisions

  1. Histogram over avg/max: a distribution gives the complete picture; an average can be skewed by tail latencies, and a max is unstable
  2. Snapshot mechanism: histogram_t does not support reset, so per-interval statistics are computed by diffing snapshots, which is standard practice in Prometheus-style systems
  3. Bucket granularity: fine-grained sub-millisecond bins to catch variations in high-performance paths, with coarser bins for the tail

Performance Impact

  • Minimal overhead: only enabled with WITH_METRICS compile flag
  • Lock-free histogram updates
  • Automatic cleanup of completed task timestamps

Testing Instructions:

# Build with metrics
cmake -DWITH_METRICS=ON ..
make

# Run and observe output
export MC_TE_METRIC=1
./your_test_binary

Expected output will show latency distribution every 5 seconds alongside throughput metrics.

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

  • ✅ Compiles with WITH_METRICS enabled/disabled
  • ✅ No memory leaks (completed tasks are cleaned up)
  • ✅ Thread-safe concurrent task tracking
  • ✅ Accurate distribution reporting under load

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Contributor

Summary of Changes

Hello @staryxchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces comprehensive task completion latency distribution metrics to the Transfer Engine. This enhancement moves beyond simple throughput measurements to provide critical insights into task-level performance, enabling better performance tuning, more accurate SLA monitoring, and improved identification of tail latencies and bottlenecks.

Highlights

  • Task Completion Latency Tracking: Implemented histogram-based latency distribution tracking for transfer tasks within the Transfer Engine metrics system.
  • Detailed Latency Metrics: Records task start times on the first getTransferStatus() call and calculates completion latency when a task finishes, utilizing ylt::metric::histogram_t for efficient distribution tracking.
  • Configurable Histogram Buckets: Uses 17 fine-grained boundary values (10μs to 10s) to cover sub-millisecond to multi-second latencies, optimized for high-performance scenarios.
  • Interval-based Reporting: Employs a snapshot mechanism to compute and report per-interval latency statistics, avoiding cumulative drift and aligning with Prometheus-style metrics design.
  • Enhanced Metrics Output: The metrics output now includes both throughput and a detailed latency distribution, displaying only buckets with >=0.1% to reduce noise and provide a clear, single-line format.

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces valuable latency tracking for transfer tasks. The implementation is mostly solid, using histograms for distribution and a snapshot mechanism for interval-based reporting. However, I've identified a critical memory leak related to the cleanup of non-completed tasks, a potential data corruption issue in task key generation, and a couple of areas for code simplification and maintainability improvements. Addressing these points will make the feature more robust and easier to maintain.

@staryxchen force-pushed the feat/add-task-latency-metrics branch 2 times, most recently from f73c515 to 20c7764 on November 27, 2025
- Implement task timing with start and completion tracking
- Add histogram metrics for task completion latency
- Enhance metrics reporting with detailed latency distribution
- Include throughput and latency statistics in periodic logs

Signed-off-by: staryxchen <[email protected]>
@stmatengss
Collaborator

👍 I will review it today.

- Extract latency threshold to named constant for clarity
- Simplify bucket range formatting with explicit overflow handling
- Restructure conditional flow to reduce nesting

Signed-off-by: staryxchen <[email protected]>
@staryxchen force-pushed the feat/add-task-latency-metrics branch from df6f60b to 013237a on December 5, 2025