
feat(vector): implement spooling logic #229

Open · JoshFerge wants to merge 14 commits into main from jferg/vector-retries
Conversation

@JoshFerge JoshFerge (Member) commented Jan 31, 2025

Right now, when a vector request fails, we simply drop the uptime check result. This is not acceptable. This PR does the following:

  • Introduces two new config items, including retry_vector_errors_forever.

If retry_vector_errors_forever is set to true, we will retry the vector request for a batch indefinitely whenever it fails with a non-2XX response or another error. Messages will spool on the channel in the meantime. Given the current machine size and the requests per second a pod handles, this should give us roughly 2.5 days of spooling before the service dies from an OOM (assuming a 500-byte check result size and a 4 GB pod memory limit); see the back-of-envelope sketch below.

We also assume that we will produce results to vector faster than we perform uptime checks. I believe this is a safe assumption, although if vector is having problems, the queue could still grow even while we're producing to vector successfully.
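
For reference, here is a rough sketch of the 2.5-day arithmetic. The result size and memory limit are the numbers from above; the ~40 results/second per-pod throughput is an assumed figure chosen to match the estimate, not a value taken from this PR.

// Back-of-envelope spooling capacity (illustrative only, not code in this PR).
const RESULT_SIZE_BYTES: u64 = 500; // assumed average serialized check result
const POD_MEMORY_LIMIT_BYTES: u64 = 4 * 1024 * 1024 * 1024; // 4 GiB pod limit
const ASSUMED_RESULTS_PER_SECOND: u64 = 40; // hypothetical per-pod check rate

fn main() {
    // How many results fit in memory before the pod OOMs.
    let max_buffered = POD_MEMORY_LIMIT_BYTES / RESULT_SIZE_BYTES; // ~8.6M
    // How long we can spool at the assumed throughput.
    let seconds = max_buffered / ASSUMED_RESULTS_PER_SECOND;
    println!(
        "~{} results buffered, ~{:.1} days of spooling",
        max_buffered,
        seconds as f64 / 86_400.0 // ~2.5 days
    );
}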

  • Introduces metrics for the vector request worker queue size and the vector response time.

We can create monitors on both of these metrics; an aggressive queue-size monitor should give the on-call plenty of time to take action before the queue fills up and the process dies.
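
As an illustration only, the queue-depth metric could look something like the sketch below, mirroring the metrics-handle style already used in this module; the metric name, labels, and the receiver.len() call are assumptions rather than the PR's actual code.

// Hypothetical sketch: report the worker queue depth as a gauge so a monitor
// can alert well before the channel (and pod memory) fills up.
// `receiver` is assumed to be a tokio::sync::mpsc::Receiver of check results.
let queue_len = receiver.len();
metrics::gauge!(
    "vector_producer.worker_queue_size",
    "uptime_region" => config.region.clone()
)
.set(queue_len as f64);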


In a future PR, we could consider a strategy such as dropping all successful uptime checks if the queue gets too full. We can evaluate options with the team.

@JoshFerge JoshFerge requested a review from a team as a code owner January 31, 2025 22:20
@wedamija wedamija (Member) left a comment


To some extent I wonder if it's even worth having a max number of retries, given that we have days of spooling buffer available.

@JoshFerge JoshFerge force-pushed the jferg/vector-retries branch from d95dd3d to d215f4a on February 4, 2025 16:38
@JoshFerge JoshFerge changed the title feat(vector): implement spooling / retry logic feat(vector): implement spooling logic Feb 4, 2025
@JoshFerge JoshFerge force-pushed the jferg/vector-retries branch from 35f7731 to 2cff19b on February 11, 2025 18:36
Comment on lines +81 to +92
if let Err(e) = {
let start = std::time::Instant::now();
let result = send_batch(
batch_to_send,
client.clone(),
config.endpoint.clone(),
config.retry_vector_errors_forever,
)
.await;
metrics::histogram!("vector_producer.send_batch.duration", "uptime_region" => config.region.clone(), "histogram" => "timer").record(start.elapsed().as_secs_f64());
result
} {

nit: Having this whole block inside the if is getting a bit unwieldy; I think we could just do the if on the result itself:

if let Err(e) = result {
   // ...
}
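
For concreteness, the restructuring could look roughly like this, reusing the variables from the quoted diff and eliding the error-handling body:

// Time the send, record the duration, then branch on the result so the
// `if let` only wraps the error-handling path.
let start = std::time::Instant::now();
let result = send_batch(
    batch_to_send,
    client.clone(),
    config.endpoint.clone(),
    config.retry_vector_errors_forever,
)
.await;
metrics::histogram!(
    "vector_producer.send_batch.duration",
    "uptime_region" => config.region.clone(),
    "histogram" => "timer"
)
.record(start.elapsed().as_secs_f64());

if let Err(e) = result {
    // ... existing error handling ...
}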

@evanpurkhiser evanpurkhiser (Member) left a comment

Looks good. I think we could clean up some of the metrics and logs in this module afterwards to make them all more consistent.
