fix: add metrics for wf engine #2076

Open
wants to merge 1 commit into
base: 02-22-fix_various_bug_fixes

MasterPtato
Changes

MasterPtato commented Feb 25, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

cloudflare-workers-and-pages bot commented Feb 25, 2025

Deploying rivet with Cloudflare Pages

Latest commit: dbc2e30
Status: 🚫 Build failed.

@greptile-apps greptile-apps bot left a comment

PR Summary

This PR adds comprehensive metrics tracking to the workflow engine. Here's a summary of the key changes:

  • Added timing metrics for workflow operations including signal dispatch, message handling, and sub-workflow execution using Instant::now() measurements
  • Added new metrics in metrics.rs for tracking workflow states, signal handling latencies, and message processing durations with proper labeling
  • Added SQLite performance testing module with worker tasks to measure database operation latencies
  • Changed SQLite synchronization mode from Normal to Full for better data consistency, trading some performance for reliability
  • Added proper error handling and metric recording across workflow operations with standardized metric naming conventions

The changes appear well-structured and improve observability of the workflow engine's performance. However, there are a few potential concerns:

  • The SQLite synchronization mode change could impact performance and should be carefully monitored
  • Some metric labels use empty strings which may affect metric cardinality and querying
  • The large number of changed files suggests this PR may be mixing multiple concerns beyond just metrics
  • Some debug logging statements look like temporary development code and should be cleaned up

64 file(s) reviewed, 19 comment(s)

Comment on lines 56 to 61
 [dev-dependencies]
 anyhow = "1.0.82"
-rand = "0.8"
\ No newline at end of file
+rand = "0.8"
+statrs = "0.18"
+dirs = "5.0.1"

style: Duplicate dependency anyhow defined in both dependencies and dev-dependencies sections. Consider removing from dev-dependencies since it's already available through main dependencies.

Suggested change
 [dev-dependencies]
-anyhow = "1.0.82"
 rand = "0.8"
 statrs = "0.18"
 dirs = "5.0.1"

Comment on lines +168 to +169
.with_label_values(&["", T::NAME])
.observe(dt);

style: Empty string label seems unnecessary and inconsistent. Consider removing the empty string or documenting why it's needed.
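One way to address this, sketched under assumptions: give the placeholder a name and build the label values in one place, so the empty string is documented rather than repeated inline. The PR does not show what the first label represents; `workflow_name`, `NO_WORKFLOW`, and `signal_labels` below are hypothetical names chosen for illustration.

```rust
/// Hypothetical label names; the PR excerpt does not show the real ones.
pub const SIGNAL_LABELS: [&str; 2] = ["workflow_name", "signal_name"];

/// Placeholder used when a signal is sent outside of any workflow (assumption).
pub const NO_WORKFLOW: &str = "";

/// Build the label values in the same order as `SIGNAL_LABELS`.
pub fn signal_labels<'a>(workflow_name: Option<&'a str>, signal_name: &'a str) -> [&'a str; 2] {
    [workflow_name.unwrap_or(NO_WORKFLOW), signal_name]
}
```

The returned array could then be passed as `.with_label_values(&signal_labels(None, T::NAME))`, which keeps the number of values aligned with the number of label names.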

Comment on lines +166 to 170
let dt = start_instant.elapsed().as_secs_f64();
metrics::SIGNAL_SEND_DURATION
.with_label_values(&["", T::NAME])
.observe(dt);
metrics::SIGNAL_PUBLISHED

logic: Duration metric should be recorded before incrementing the published counter in case of failures

Comment on lines +86 to +87
.with_label_values(&["", M::NAME])
.observe(dt);

style: Empty string label prefix seems incorrect. Should document why this empty string is needed or remove it if unnecessary.

Comment on lines +84 to 91
+let dt = start_instant.elapsed().as_secs_f64();
+metrics::MESSAGE_SEND_DURATION
+	.with_label_values(&["", M::NAME])
+	.observe(dt);
 metrics::MESSAGE_PUBLISHED
-	.with_label_values(&[M::NAME])
+	.with_label_values(&["", M::NAME])
 	.inc();


logic: Metrics are recorded even if message send fails. Consider moving metrics recording into the success path only.
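A minimal sketch of what "success path only" could look like, using plain counters instead of the Prometheus types. `SendMetrics` and `send_with_metrics` are hypothetical; whether failures should also be counted (as done here) is a design choice the PR does not settle.

```rust
use std::time::Instant;

/// Hypothetical in-memory stand-in for the published counter and duration histogram.
#[derive(Default, Debug)]
pub struct SendMetrics {
    pub published: u64,
    pub failed: u64,
    pub durations: Vec<f64>,
}

/// Run `send`, then record duration and the published count only if it succeeded.
pub fn send_with_metrics<E>(
    metrics: &mut SendMetrics,
    send: impl FnOnce() -> Result<(), E>,
) -> Result<(), E> {
    let start = Instant::now();
    let res = send();
    let dt = start.elapsed().as_secs_f64();

    match &res {
        Ok(()) => {
            // Success path: observe the duration first, then count the publish.
            metrics.durations.push(dt);
            metrics.published += 1;
        }
        // Failure path: keep the success metrics untouched.
        Err(_) => metrics.failed += 1,
    }
    res
}
```

An alternative would be to record the duration in both cases under a separate result label, which keeps failure latencies visible without inflating the published count.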

Comment on lines 4 to 6
 pub struct DbDataKey {
-	db_name_segment: Arc<Vec<u8>>
+	db_name_segment: Arc<Vec<u8>>,
 }

style: The struct field visibility is private but the struct itself is public. Consider documenting why this design choice was made or making the field public if it needs to be accessed externally.
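A private field on a public struct is a common Rust pattern when construction and access should go through methods. As an illustration only, the sketch below pairs the field with a constructor and a read-only accessor; `new` and the `db_name_segment()` getter are hypothetical and may not match the real `DbDataKey` API.

```rust
use std::sync::Arc;

pub struct DbDataKey {
    // Kept private so callers cannot swap the segment after construction.
    db_name_segment: Arc<Vec<u8>>,
}

impl DbDataKey {
    /// Hypothetical constructor; the real one is not shown in the PR excerpt.
    pub fn new(db_name_segment: Arc<Vec<u8>>) -> Self {
        Self { db_name_segment }
    }

    /// Read-only view of the segment bytes.
    pub fn db_name_segment(&self) -> &[u8] {
        &self.db_name_segment
    }
}
```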

root.server.as_mut().unwrap().foundationdb = Some(Default::default());
let config = rivet_config::Config::from_root(root);
let mut root = rivet_config::config::Root::default();
root.server.as_mut().unwrap().foundationdb = Some(Default::default());

logic: unwrap() on Option without error handling could panic if server is None

Suggested change
-root.server.as_mut().unwrap().foundationdb = Some(Default::default());
+root.server.get_or_insert_with(Default::default).foundationdb = Some(Default::default());
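To show why the suggestion avoids the panic, here is a self-contained sketch with simplified stand-in types. `Root`, `Server`, and `FoundationDb` below are hypothetical mirrors of the config structs, not the real `rivet_config` definitions; whether `Root::default()` actually leaves `server` as `None` is not shown in the PR.

```rust
/// Hypothetical stand-ins for the real config types.
#[derive(Default, Debug, PartialEq)]
pub struct FoundationDb;

#[derive(Default, Debug)]
pub struct Server {
    pub foundationdb: Option<FoundationDb>,
}

#[derive(Default, Debug)]
pub struct Root {
    pub server: Option<Server>,
}

/// Enable FoundationDB, creating a default `Server` first if none is present.
pub fn enable_foundationdb(root: &mut Root) {
    // `get_or_insert_with` returns `&mut Server` and never panics,
    // unlike `as_mut().unwrap()` when `server` is `None`.
    root.server
        .get_or_insert_with(Default::default)
        .foundationdb = Some(Default::default());
}
```

If `server` is already `Some`, the existing value is kept and only `foundationdb` is overwritten.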

Comment on lines +187 to +191
let date = if print_ts > 1 {
datetime.format("%Y-%m-%d %H:%M:%S%.3f")
} else {
datetime.format("%Y-%m-%d %H:%M:%S")
};

style: Consider extracting timestamp formatting logic into a helper function since it's repeated in multiple places
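A small sketch of such a helper, assuming `print_ts` is a small unsigned verbosity count (its real type is not shown in the diff). The function name `timestamp_format` is hypothetical; it only selects the format string, so it has no dependency on the date library.

```rust
/// Pick the timestamp format string for a given verbosity level.
/// Levels above 1 include milliseconds (`%.3f`).
pub fn timestamp_format(print_ts: u8) -> &'static str {
    if print_ts > 1 {
        "%Y-%m-%d %H:%M:%S%.3f"
    } else {
        "%Y-%m-%d %H:%M:%S"
    }
}
```

Each call site would then shrink to something like `datetime.format(timestamp_format(print_ts))`, and any future format change happens in one place.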


pub static ref INSERT_COMMANDS_ACQUIRE_DURATION: HistogramVec = register_histogram_vec_with_registry!(
"pegboard_client_insert_commands_acquire_duration",
"TODO REMOVE",

style: The description 'TODO REMOVE' suggests this metric is temporary but doesn't explain why. Either remove the metric now or document its actual purpose.

"pegboard_client_insert_commands_acquire_duration",
"TODO REMOVE",
&["workflow_id"],
BUCKETS.to_vec(),

style: BUCKETS.to_vec() creates a new vector allocation for each histogram. Consider using BUCKETS directly if possible to avoid unnecessary allocations.

@MasterPtato MasterPtato force-pushed the 02-22-fix_various_bug_fixes branch from ed680a5 to 2051a35 Compare February 26, 2025 03:29
@MasterPtato MasterPtato force-pushed the 02-25-fix_add_metrics_for_wf_engine branch from dba8d32 to dbc2e30 Compare February 26, 2025 03:29
@NathanFlurry NathanFlurry force-pushed the 02-22-fix_various_bug_fixes branch from 2051a35 to ed680a5 Compare February 26, 2025 06:18
cloudflare-workers-and-pages bot commented Feb 26, 2025

Deploying rivet-hub with Cloudflare Pages

Latest commit: dba8d32
Status: ✅ Deploy successful!
Preview URL: https://63951b92.rivet-hub-7jb.pages.dev
Branch Preview URL: https://02-25-fix-add-metrics-for-wf.rivet-hub-7jb.pages.dev

@NathanFlurry NathanFlurry force-pushed the 02-25-fix_add_metrics_for_wf_engine branch from dbc2e30 to dba8d32 Compare February 26, 2025 06:18
@MasterPtato MasterPtato force-pushed the 02-22-fix_various_bug_fixes branch from ed680a5 to 2051a35 Compare February 27, 2025 02:45
@MasterPtato MasterPtato force-pushed the 02-25-fix_add_metrics_for_wf_engine branch from dba8d32 to dbc2e30 Compare February 27, 2025 02:45
@NathanFlurry NathanFlurry force-pushed the 02-25-fix_add_metrics_for_wf_engine branch from dbc2e30 to dba8d32 Compare February 27, 2025 07:59
@NathanFlurry NathanFlurry force-pushed the 02-22-fix_various_bug_fixes branch from 2051a35 to ed680a5 Compare February 27, 2025 07:59
@MasterPtato MasterPtato force-pushed the 02-22-fix_various_bug_fixes branch from ed680a5 to 2051a35 Compare February 27, 2025 18:58
@MasterPtato MasterPtato force-pushed the 02-25-fix_add_metrics_for_wf_engine branch from dba8d32 to dbc2e30 Compare February 27, 2025 18:58
@NathanFlurry NathanFlurry force-pushed the 02-22-fix_various_bug_fixes branch from 2051a35 to ed680a5 Compare February 27, 2025 20:45
@NathanFlurry NathanFlurry force-pushed the 02-25-fix_add_metrics_for_wf_engine branch from dbc2e30 to dba8d32 Compare February 27, 2025 20:46
@MasterPtato MasterPtato force-pushed the 02-22-fix_various_bug_fixes branch from ed680a5 to 2051a35 Compare February 28, 2025 03:08
@MasterPtato MasterPtato force-pushed the 02-25-fix_add_metrics_for_wf_engine branch from dba8d32 to dbc2e30 Compare February 28, 2025 03:08