Add an optional bandwidth cap to TieredMergePolicy
#14148
Comments
I wonder if this is something that could be implemented in the merge scheduler rather than in the merge policy. Thinking out loud: the merge policy's responsibility is to compute efficient merges that meet its constraints (merge factor, deletes, etc.). But then the merge scheduler is free to throttle these merges, or even to ignore some of them if they don't meet its own constraints?
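For illustration only, a rough sketch of that scheduler-side idea might look like the following: a custom MergeScheduler that runs merges serially and sleeps until an amortized bytes/sec budget allows the next one to start. Nothing here is an existing Lucene class or a concrete proposal; the class name `BandwidthCappedMergeScheduler` and the budget bookkeeping are invented, and it assumes the `MergeScheduler`/`MergeSource` API of recent Lucene versions plus `OneMerge.totalBytesSize()` as the size estimate.

```java
import java.io.IOException;
import org.apache.lucene.index.MergePolicy.OneMerge;
import org.apache.lucene.index.MergeScheduler;
import org.apache.lucene.index.MergeTrigger;

// Hypothetical sketch, not an existing Lucene class: run merges one at a time and
// sleep as needed so that, amortized since startup, merged bytes/sec stays under a cap.
public class BandwidthCappedMergeScheduler extends MergeScheduler {
  private final double maxBytesPerSec;
  private final long startNanos = System.nanoTime();
  private long bytesMerged = 0; // total bytes of merges started so far

  public BandwidthCappedMergeScheduler(double maxBytesPerSec) {
    this.maxBytesPerSec = maxBytesPerSec;
  }

  @Override
  public synchronized void merge(MergeSource mergeSource, MergeTrigger trigger) throws IOException {
    OneMerge merge;
    while ((merge = mergeSource.getNextMerge()) != null) {
      long mergeBytes = merge.totalBytesSize(); // estimated input bytes for this merge
      double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
      double budgetBytes = elapsedSec * maxBytesPerSec;
      if (bytesMerged + mergeBytes > budgetBytes) {
        // Over budget: sleep until the amortized rate would allow this merge to start.
        long sleepMs = (long) (1000 * (bytesMerged + mergeBytes - budgetBytes) / maxBytesPerSec);
        try {
          Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
      bytesMerged += mergeBytes;
      mergeSource.merge(merge); // run the merge on this thread (serial, like SerialMergeScheduler)
    }
  }

  @Override
  public void close() {}
}
```

SerialMergeScheduler has roughly this shape today, minus the budget; a concurrent variant would presumably hook the same check into ConcurrentMergeScheduler instead.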
I guess that's why (at least partly) @mikemccand suggested to change …
Doing this in … I guess "throttle during" is also an option, just like the IO rate limiter can do today ... in fact, the IO rate limiter is already one way to cap bandwidth.

But the big problem with doing this throttling late in the game (in …) …

But maybe that handicap to TMP would be fine in practice? Maybe the choices it makes based on stale index geometry are not so different from what it would make with "live" stock prices? Though, if it was a max-sized merge, and enough updates/deletes arrive in those ten minutes, then …

Or maybe even some combination of the two approaches?
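For reference, the "IO rate limiter" mentioned above is, as far as I can tell, the merge IO throttling in ConcurrentMergeScheduler. A minimal sketch of leaning on it (real CMS/IndexWriterConfig setters, assuming a Lucene version that still exposes `enableAutoIOThrottle`), rather than capping bandwidth inside the merge policy itself:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

public final class ThrottledMergeConfig {
  // Sketch of "throttle during": let ConcurrentMergeScheduler rate-limit merge IO
  // instead of having the merge policy limit how many merges it produces.
  public static IndexWriterConfig withThrottledMerges(Analyzer analyzer) {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.enableAutoIOThrottle();        // adaptively rate-limits merge writes (not a fixed cap)
    cms.setMaxMergesAndThreads(6, 2);  // also bound how many merges run concurrently
    return new IndexWriterConfig(analyzer).setMergeScheduler(cms);
  }
}
```

Note this throttling is adaptive rather than a fixed amortized cap, and it slows merges down after they have been selected rather than changing which merges get selected.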
Intuitively, I had thought of the "throttle at start" approach, where we would also give …

I guess it's similar to the idea you mentioned in your last paragraph, though I like that …
Description
TL;DR: `TieredMergePolicy` can create massive snapshots if you configure it for aggressive `deletesPctAllowed`, which can hurt searchers (cause page fault storms) in a near-real-time replication world. Maybe we could add an optional (off by default) "rate limit" on how many amortized bytes/sec TMP is merging? This is just an idea / brainstorming / design discussion so far ... no PR.

Full context:
At Amazon (Product Search team) we use near-real-time segment replication to efficiently distribute index updates to all searchers/replicas.

Since we have many searchers per indexer shard, to scale to very high QPS, we intentionally tune `TieredMergePolicy` (TMP) to very aggressively reclaim deletions. Burning extra CPU / bandwidth during indexing to save even a little bit of CPU during searching is a good tradeoff for us (and in general, for Lucene users with high QPS requirements).

But we have a "fun" problem occasionally: sometimes we have an update storm (an upstream team reindexes large-ish portions of Amazon's catalog through the real-time indexing stream), and this leads to lots and lots of merging and many large (max-sized 5 GB) segments being replicated out to searchers in short order, sometimes over links (e.g. cross-region) that are not as crazy-fast as within-region networking fabric, and our searchers fall behind a bit.
Falling behind is not the end of the world: the searchers simply skip some point-in-time snapshots and jump to the latest one, effectively sub-sampling checkpoints as best they can given the bandwidth constraints. Index-to-search latency is hurt a bit, but recovers once the indexer catches up on the update storm.
The bigger problem for us is that we size our shards, roughly, so that the important parts of the index (the parts hit often by query traffic) are fully "hot", i.e. so the OS has enough RAM to hold the hot parts of the index. But when it takes too long to copy and light (switch over to the new segments for searching) a given snapshot, and we skip the next one or two snapshots, the follow-on snapshot that we finally do load may have a sizable part of the index rewritten, its size may be a big percentage of the overall index, and copying/lighting it will stress the OS into a paging storm, hurting our long-pole latencies.
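For concreteness, the aggressive-deletes tuning described above might be configured roughly like this. These are real `TieredMergePolicy`/`IndexWriterConfig` setters, but the values are illustrative only (not our production settings), and the accepted range and default of `setDeletesPctAllowed` depend on the Lucene version.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public final class AggressiveDeletesConfig {
  // Illustrative only: push TMP to keep the live-document ratio high, trading extra
  // merge CPU/IO on the indexer for slightly cheaper searches on the replicas.
  public static IndexWriterConfig aggressiveDeletesReclaim(Analyzer analyzer) {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setDeletesPctAllowed(5.0);       // far below the default (20); minimum allowed varies by version
    tmp.setMaxMergedSegmentMB(5 * 1024); // the 5 GB max-sized segments mentioned above
    return new IndexWriterConfig(analyzer).setMergePolicy(tmp);
  }
}
```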
So one possible solution we thought of is to add an optional (off by default) `setMaxBandwidth` to TMP so that "on average" (amortized over some time window-ish) TMP would not produce so many merges that it exceeds that bandwidth cap. With such a cap, during an update storm (war time), the index delete percentage would necessarily increase beyond what we ideally want / configured with `setDeletesPctAllowed`, but then during peace time, TMP could again catch up and push the deletes back below the target.
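To make the proposal slightly more concrete, here is one rough sketch of the "amortized bandwidth cap" semantics, written as a hypothetical wrapper around TMP rather than an actual `setMaxBandwidth` setter on TMP. Remember there is no PR: the class name `BandwidthCappedMergePolicy` and its bookkeeping are invented, and the sketch assumes Lucene's `FilterMergePolicy`, the three-argument `findMerges`, and `OneMerge.totalBytesSize()`.

```java
import java.io.IOException;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

// Hypothetical sketch, not an existing class: cap the amortized bytes/sec of merges
// that TMP is allowed to hand back to the merge scheduler ("throttle at start").
public class BandwidthCappedMergePolicy extends FilterMergePolicy {
  private final double maxBytesPerSec;
  private final long startNanos = System.nanoTime();
  private long bytesAdmitted = 0; // bytes of merges admitted so far

  public BandwidthCappedMergePolicy(TieredMergePolicy in, double maxBytesPerSec) {
    super(in);
    this.maxBytesPerSec = maxBytesPerSec;
  }

  @Override
  public synchronized MergeSpecification findMerges(
      MergeTrigger mergeTrigger, SegmentInfos infos, MergeContext mergeContext) throws IOException {
    MergeSpecification spec = in.findMerges(mergeTrigger, infos, mergeContext);
    if (spec == null) {
      return null;
    }
    // A real implementation would likely amortize over a sliding time window; for
    // simplicity this sketch amortizes over the lifetime of the policy instance.
    double budgetBytes = ((System.nanoTime() - startNanos) / 1e9) * maxBytesPerSec;
    MergeSpecification capped = new MergeSpecification();
    for (OneMerge merge : spec.merges) {
      long mergeBytes = merge.totalBytesSize(); // estimated input bytes of this merge
      if (bytesAdmitted + mergeBytes > budgetBytes) {
        break; // over budget: defer the rest; TMP will propose them again on a later call
      }
      bytesAdmitted += mergeBytes;
      capped.add(merge);
    }
    return capped.merges.isEmpty() ? null : capped;
  }
}
```

The same bookkeeping could instead live inside TMP itself as the proposed `setMaxBandwidth`, where the remaining budget could also influence which merges TMP selects in the first place rather than just trimming its output.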