feat(vector): implement spooling logic #229
base: main
Conversation
To some extent I wonder if it's even worth having the max number of retries, if we have days of spooling buffer available
if let Err(e) = {
    let start = std::time::Instant::now();
    let result = send_batch(
        batch_to_send,
        client.clone(),
        config.endpoint.clone(),
        config.retry_vector_errors_forever,
    )
    .await;
    metrics::histogram!("vector_producer.send_batch.duration", "uptime_region" => config.region.clone(), "histogram" => "timer").record(start.elapsed().as_secs_f64());
    result
} {
nit: Having this whole block inside the if is getting a bit unwieldy. I think we could just do the if on the result itself:
if let Err(e) = result {
// ...
}
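Applied to the block above, the suggested restructuring might look roughly like this (same calls as in the diff, just hoisted out of the if; not necessarily the exact final shape):

let start = std::time::Instant::now();
let result = send_batch(
    batch_to_send,
    client.clone(),
    config.endpoint.clone(),
    config.retry_vector_errors_forever,
)
.await;
metrics::histogram!(
    "vector_producer.send_batch.duration",
    "uptime_region" => config.region.clone(),
    "histogram" => "timer"
)
.record(start.elapsed().as_secs_f64());

if let Err(e) = result {
    // handle the failed send here
}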
Looks good. I think we could clean up some of the metrics and logs in this module afterwards to make them all more consistent.
Right now, when a Vector request fails, we simply drop the uptime check result. This is not acceptable. This PR does the following:
retry_vector_errors_forever

If retry_vector_errors_forever is set to true, we will retry the Vector request for a batch forever if it fails due to a non-2XX response or another error. Messages will spool on the channel in the meantime. Given the current machine size and the requests per second a pod handles, this should give us roughly 2.5 days of spooling before the service fails from an OOM (assuming a 500-byte check result size and a 4GB pod memory limit).

We also assume, for the sake of this approach, that we will produce results to Vector faster than we produce uptime checks. I assume this is a safe assumption, although if Vector is having problems, the queue could still grow while we're producing to Vector successfully.
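For illustration, here is a minimal sketch of the retry-forever behaviour described above. The send_batch signature, error type, function name, and backoff values are assumptions for the example, not the code in this PR, and it assumes a tokio runtime.

use std::time::Duration;

// Hypothetical error type standing in for a failed Vector request
// (non-2XX response or transport error); not the type used in the PR.
#[derive(Debug)]
struct SendError(String);

// Hypothetical stand-in for the real send_batch in this module.
async fn send_batch(batch: &[Vec<u8>]) -> Result<(), SendError> {
    // ... POST the batch to the Vector endpoint, returning Err on a
    // non-2XX status or transport failure ...
    let _ = batch;
    Ok(())
}

// If retry_forever is set, keep retrying the same batch until it succeeds.
// While this loops, new check results keep accumulating ("spooling") on the
// channel feeding this task, which is where the days of buffer come from.
async fn send_with_spooling(batch: Vec<Vec<u8>>, retry_forever: bool) -> Result<(), SendError> {
    let mut attempt: u32 = 0;
    loop {
        match send_batch(&batch).await {
            Ok(()) => return Ok(()),
            Err(e) if retry_forever => {
                attempt += 1;
                // Capped exponential backoff between attempts (illustrative values only).
                let delay = Duration::from_secs(1u64 << attempt.min(5));
                eprintln!("vector send failed (attempt {attempt}): {e:?}, retrying in {delay:?}");
                tokio::time::sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}

As a back-of-envelope check on the 2.5-day figure: 4GB divided by 500 bytes per result is roughly 8 million buffered results, which over 2.5 days works out to about 35-40 results per second per pod.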
We can create monitors on both of these metrics, and an aggressive queue-size monitor should give the on-call plenty of time to take action before the queue fills up and the process dies.
In a future PR, we could consider a strategy like dropping all successful uptime checks if the queue is too full. We can evaluate options with the team.