
Conversation

@staryxchen
Collaborator

Add Transfer Task Latency Distribution Metrics

Summary

This PR adds histogram-based latency distribution tracking for transfer tasks in the Transfer Engine metrics system.

Motivation

Understanding task completion latency distribution is crucial for:

  • Performance tuning and bottleneck identification
  • SLA monitoring and analysis
  • Detecting outliers and tail latencies

Previously, only throughput metrics were available. This enhancement provides detailed visibility into task-level performance.

Changes

Core Implementation

Modified Files:

  • mooncake-transfer-engine/include/transfer_engine.h
  • mooncake-transfer-engine/src/transfer_engine.cpp

Key Components:

  1. Latency Tracking

    • Record task start time on first getTransferStatus() call
    • Calculate and record completion latency when task completes
    • Use ylt::metric::histogram_t for efficient distribution tracking
  2. Histogram Buckets

    • Fine-grained buckets (10μs - 10s) covering sub-millisecond to multi-second latencies
    • 17 boundary values creating 18 buckets optimized for high-performance scenarios
  3. Interval-based Reporting

    • Snapshot mechanism for computing per-interval statistics
    • Avoids cumulative drift, shows actual distribution within each reporting period
    • Aligns with Prometheus-style metrics design (monotonic counters + rate calculation)
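The tracking described above can be sketched roughly as below. The PR only states the 10μs-10s range and the 17-boundary/18-bucket counts, so the concrete boundary values, and all names (`kBoundariesUs`, `bucket_index`, `record_latency`), are assumptions for illustration; the actual implementation uses `ylt::metric::histogram_t` inside transfer_engine.cpp.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstdint>

// Hypothetical boundaries in microseconds -- 17 values giving 18 buckets,
// spanning 10us to 10s as the PR describes. The real values may differ.
static const std::array<int64_t, 17> kBoundariesUs = {
    10,     20,     50,     100,     200,      500,
    1000,   2000,   5000,   10000,   20000,    50000,
    100000, 200000, 500000, 1000000, 10000000};

// Cumulative per-bucket counters; relaxed atomic increments keep the
// hot path lock-free.
static std::array<std::atomic<uint64_t>, 18> g_latency_buckets{};

// The first boundary strictly greater than the sample selects the bucket;
// samples past the last boundary fall into the overflow bucket (index 17).
inline size_t bucket_index(int64_t latency_us) {
    auto it = std::upper_bound(kBoundariesUs.begin(), kBoundariesUs.end(),
                               latency_us);
    return static_cast<size_t>(it - kBoundariesUs.begin());
}

inline void record_latency(int64_t latency_us) {
    g_latency_buckets[bucket_index(latency_us)]
        .fetch_add(1, std::memory_order_relaxed);
}
```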

Output Format

[Metrics] Transfer Engine Stats (over last 5s): 
Throughput: 22480.90 MB/s | 
Latency Distribution (count=56215): 0-10μs:0.2%, 10-20μs:1.6%, 20-50μs:96.5%, 50-100μs:0.9%, 100-200μs:0.8%

Features:

  • Only shows buckets holding ≥0.1% of samples, to reduce noise
  • Displays task count and percentage distribution
  • Single-line format for easy parsing and monitoring
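A minimal sketch of the single-line formatter, assuming per-interval bucket counts and precomputed labels (the function and parameter names here are hypothetical, not the PR's actual code):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Buckets contributing less than 0.1% of the interval's samples are
// skipped so the log line stays short and parseable.
std::string format_distribution(const std::vector<std::string>& labels,
                                const std::vector<uint64_t>& counts) {
    uint64_t total = 0;
    for (uint64_t c : counts) total += c;
    char buf[64];
    std::snprintf(buf, sizeof(buf), "Latency Distribution (count=%llu):",
                  static_cast<unsigned long long>(total));
    std::string out = buf;
    bool first = true;
    for (size_t i = 0; i < counts.size() && total > 0; ++i) {
        double pct = 100.0 * static_cast<double>(counts[i]) / total;
        if (pct < 0.1) continue;  // noise threshold
        std::snprintf(buf, sizeof(buf), "%s%s:%.1f%%",
                      first ? " " : ", ", labels[i].c_str(), pct);
        out += buf;
        first = false;
    }
    return out;
}
```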

Usage

# Enable metrics (required)
export MC_TE_METRIC=1

# Optional: customize interval (default: 5s)
export MC_TE_METRIC_INTERVAL_SECONDS=10

Design Decisions

  1. Histogram over avg/max: a distribution gives the complete picture; an average can be skewed by tail latencies, and a max is unstable
  2. Snapshot mechanism: histogram_t does not support reset, so per-interval statistics are computed by diffing snapshots, which is standard practice in Prometheus-style systems
  3. Bucket granularity: fine-grained sub-millisecond bins to catch variations in high-performance paths, with coarser bins for the tail

Performance Impact

  • Minimal overhead: only enabled with WITH_METRICS compile flag
  • Lock-free histogram updates
  • Automatic cleanup of completed task timestamps

Testing Instructions:

# Build with metrics
cmake -DWITH_METRICS=ON ..
make

# Run and observe output
export MC_TE_METRIC=1
./your_test_binary

Expected output will show latency distribution every 5 seconds alongside throughput metrics.

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

  • ✅ Compiles with WITH_METRICS enabled/disabled
  • ✅ No memory leaks (completed tasks are cleaned up)
  • ✅ Thread-safe concurrent task tracking
  • ✅ Accurate distribution reporting under load

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Contributor

Summary of Changes

Hello @staryxchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces comprehensive task completion latency distribution metrics to the Transfer Engine. This enhancement moves beyond simple throughput measurements to provide critical insights into task-level performance, enabling better performance tuning, more accurate SLA monitoring, and improved identification of tail latencies and bottlenecks.

Highlights

  • Task Completion Latency Tracking: Implemented histogram-based latency distribution tracking for transfer tasks within the Transfer Engine metrics system.
  • Detailed Latency Metrics: Records task start times on the first getTransferStatus() call and calculates completion latency when a task finishes, utilizing ylt::metric::histogram_t for efficient distribution tracking.
  • Configurable Histogram Buckets: Uses 17 fine-grained boundary values (10μs to 10s) to cover sub-millisecond to multi-second latencies, optimized for high-performance scenarios.
  • Interval-based Reporting: Employs a snapshot mechanism to compute and report per-interval latency statistics, avoiding cumulative drift and aligning with Prometheus-style metrics design.
  • Enhanced Metrics Output: The metrics output now includes both throughput and a detailed latency distribution, displaying only buckets with >=0.1% to reduce noise and provide a clear, single-line format.

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces valuable latency tracking for transfer tasks. The implementation is mostly solid, using histograms for distribution and a snapshot mechanism for interval-based reporting. However, I've identified a critical memory leak related to the cleanup of non-completed tasks, a potential data corruption issue in task key generation, and a couple of areas for code simplification and maintainability improvements. Addressing these points will make the feature more robust and easier to maintain.

@staryxchen force-pushed the feat/add-task-latency-metrics branch 2 times, most recently from f73c515 to 20c7764 on November 27, 2025
- Implement task timing with start and completion tracking
- Add histogram metrics for task completion latency
- Enhance metrics reporting with detailed latency distribution
- Include throughput and latency statistics in periodic logs

Signed-off-by: staryxchen <[email protected]>
@stmatengss
Collaborator

👍 I will review it today.

- Extract latency threshold to named constant for clarity
- Simplify bucket range formatting with explicit overflow handling
- Restructure conditional flow to reduce nesting

Signed-off-by: staryxchen <[email protected]>
@staryxchen force-pushed the feat/add-task-latency-metrics branch from df6f60b to 013237a on December 5, 2025