Fix application qps quota stalls. #14859

Open
wants to merge 4 commits into
base: master
bziobrowski
Contributor

@bziobrowski bziobrowski commented Jan 21, 2025

PR fixes #14852.
It removes slow & locking ZK queries from the hot path (query execution) and depends on background messaging to keep quotas in sync.
It changes the application quota logic slightly so that a non-positive quota value means the quota is disabled and can always be acquired.

While checking the logic I also found that:

  • quota values in the range (0.0, 1) should make it impossible to acquire quota, effectively blocking queries with the given application name
  • a similar thing can happen when the quota value is higher than 1 but smaller than the number of online brokers, because the quota is split evenly between online brokers.
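
To make the second point concrete, here is a minimal sketch of the arithmetic, assuming the per-broker share is simply the total quota divided by the number of online brokers, as described above. The class and method names are hypothetical and are not taken from the Pinot code base.

```java
// Hypothetical helper, not part of Pinot: illustrates the even split described above.
final class AppQuotaSplit {
  private AppQuotaSplit() {
  }

  // Per-broker share of a cluster-wide application QPS quota (assumed even split).
  static double perBrokerQuota(double totalQps, int onlineBrokers) {
    if (onlineBrokers <= 0) {
      throw new IllegalArgumentException("onlineBrokers must be positive");
    }
    return totalQps / onlineBrokers;
  }

  // True when a positive quota yields a per-broker share below 1 QPS,
  // which is the case the description says can block queries.
  static boolean shareBelowOne(double totalQps, int onlineBrokers) {
    return totalQps > 0 && perBrokerQuota(totalQps, onlineBrokers) < 1.0;
  }
}
```

For example, a quota of 3 QPS spread over 4 online brokers leaves each broker with a share of 0.75, which falls into the problematic range even though the configured value is above 1.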

@codecov-commenter

codecov-commenter commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 73.68421% with 10 lines in your changes missing coverage. Please review.

Project coverage is 63.74%. Comparing base (59551e4) to head (e4395e1).
Report is 1643 commits behind head on master.

Files with missing lines Patch % Lines
...quota/HelixExternalViewBasedQueryQuotaManager.java 82.35% 6 Missing ⚠️
...esources/PinotApplicationQuotaRestletResource.java 0.00% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14859      +/-   ##
============================================
+ Coverage     61.75%   63.74%   +1.98%     
- Complexity      207     1472    +1265     
============================================
  Files          2436     2709     +273     
  Lines        133233   151889   +18656     
  Branches      20636    23456    +2820     
============================================
+ Hits          82274    96816   +14542     
- Misses        44911    47806    +2895     
- Partials       6048     7267    +1219     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.68% <73.68%> (+1.97%) ⬆️
java-21 63.62% <73.68%> (+1.99%) ⬆️
skip-bytebuffers-false 63.70% <73.68%> (+1.95%) ⬆️
skip-bytebuffers-true 63.60% <73.68%> (+35.87%) ⬆️
temurin 63.74% <73.68%> (+1.98%) ⬆️
unittests 63.73% <73.68%> (+1.98%) ⬆️
unittests1 56.24% <ø> (+9.35%) ⬆️
unittests2 34.05% <73.68%> (+6.32%) ⬆️


@siddharthteotia
Contributor

siddharthteotia commented Jan 22, 2025

Do we have any existing tests that already exercise this path? Since we are changing code on the critical path, I suggest adding tests if there are none.

@@ -319,7 +319,7 @@ private void verifyQuotaUpdate(float quotaQps) {
 } catch (IOException e) {
 throw new RuntimeException(e);
 }
-}, 5000, "Failed to reflect query quota on rate limiter in 5s.");
+}, 10000, "Failed to reflect query quota on rate limiter in 5s.");
Contributor

Is this change expected?

Contributor Author

This change is not strictly necessary, so I'll revert it.

@bziobrowski
Contributor Author

Quotas are tested mainly in HelixExternalViewBasedQueryQuotaManagerTest and QueryQuotaClusterIntegrationTest. I'll have a look at line coverage today.

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

I feel the general logic introduced in #14226 needs to be improved. All quota updates should be done via the processApplicationQueryRateLimitingClusterConfigChange() callback, and the query path should call a read-only method that doesn't do any update logic.

@@ -74,6 +74,12 @@
* - broker added or removed from cluster
*/
public class HelixExternalViewBasedQueryQuotaManager implements ClusterChangeHandler, QueryQuotaManager {

// Minimum 'working' value for app quota. If actual value is less than this (e.g. 0.0), it is considered as disabled.
private static final double MIN_APP_QUOTA = Math.nextUp(0.0);
Contributor

I think the intention here is to treat 0 as disabled. Having this minimum double is not very readable; can we change the comparison sign instead (e.g. < to <=)?

Contributor Author

I changed the value and hid the logic behind the isDisabled() and isEnabled() methods.
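
As a rough illustration of what such helpers could look like, here is a minimal sketch that assumes the rule from the PR description (a non-positive quota means disabled). Only the method names isDisabled() and isEnabled() come from the comment above; the enclosing class name, the signatures, and the NaN handling are assumptions, and the actual implementation in the PR may differ.

```java
// Hypothetical sketch; the enclosing class and signatures are assumptions.
final class AppQuotaChecks {
  private AppQuotaChecks() {
  }

  // Assumption: any non-positive quota means "no limit".
  // Written as !(quota > 0) so that NaN is also treated as disabled.
  static boolean isDisabled(double quota) {
    return !(quota > 0);
  }

  static boolean isEnabled(double quota) {
    return !isDisabled(quota);
  }
}
```

Keeping the comparison in one place means call sites read as isDisabled(appQpsQuota) rather than repeating a threshold constant.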

@@ -130,9 +136,10 @@ private void initializeApplicationQpsQuotas() {

 String appName = entry.getKey();
 double appQpsQuota =
-entry.getValue() != null && entry.getValue() != -1.0d ? entry.getValue() : _defaultQpsQuotaForApplication;
+entry.getValue() != null && entry.getValue() >= MIN_APP_QUOTA ? entry.getValue()
Contributor

Not introduced in this PR, but we might want to allow overriding the default quota to disable throttling.

Suggested change
-entry.getValue() != null && entry.getValue() >= MIN_APP_QUOTA ? entry.getValue()
+entry.getValue() != null ? entry.getValue() : _defaultQpsQuotaForApplication;

Contributor Author

Applied.


-if (appQpsQuota < 0) {
+if (appQpsQuota < MIN_APP_QUOTA) {
Contributor

Same for other places

Suggested change
-if (appQpsQuota < MIN_APP_QUOTA) {
+if (appQpsQuota <= 0) {

Contributor Author

See comment above.

}

// Caller method need not worry about getting lock on _applicationRateLimiterMap
// as this method will do idempotent updates to the application rate limiters
-private synchronized void createOrUpdateApplicationRateLimiter(List<String> applicationNames) {
+private synchronized void createOrUpdateApplicationRateLimiter(List<String> applicationNames, double override) {
Contributor

Consider making override @Nullable and using null to represent no override.

Contributor Author

I added another method to hide the override when it's not needed, but kept it as a primitive because null is not necessarily more readable and would trigger boxing/unboxing.
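
One way to picture this trade-off is the sketch below: a convenience overload hides the extra parameter, while the underlying method keeps a primitive double. Everything here is hypothetical, including the class name, the NaN sentinel for "no override", and the plain maps standing in for stored quotas and rate limiters; the PR's actual convention is not shown in this thread.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an overload that hides a primitive override parameter.
final class OverrideOverloadSketch {
  // Assumed sentinel meaning "no override supplied"; not taken from the PR.
  private static final double NO_OVERRIDE = Double.NaN;

  // Stand-ins for stored per-app quotas and for the rate limiter map.
  private final Map<String, Double> _storedQuotas;
  private final Map<String, Double> _effectiveQuotas = new HashMap<>();

  OverrideOverloadSketch(Map<String, Double> storedQuotas) {
    _storedQuotas = new HashMap<>(storedQuotas);
  }

  // Convenience overload: callers that have no override never see the parameter.
  synchronized void createOrUpdate(List<String> appNames) {
    createOrUpdate(appNames, NO_OVERRIDE);
  }

  // Primitive parameter avoids boxing; the sentinel replaces null.
  synchronized void createOrUpdate(List<String> appNames, double override) {
    for (String app : appNames) {
      double quota = Double.isNaN(override) ? _storedQuotas.getOrDefault(app, 0.0) : override;
      _effectiveQuotas.put(app, quota);
    }
  }

  synchronized double effectiveQuota(String app) {
    return _effectiveQuotas.getOrDefault(app, 0.0);
  }
}
```

The cost of this design is that the sentinel has to be documented and checked consistently, which is the readability concern a @Nullable parameter would address differently.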

@bziobrowski
Contributor Author

I feel the general logic introduced in #14226 needs to be improved. All quota updates should be done via the processApplicationQueryRateLimitingClusterConfigChange() callback, and the query path should call a read-only method that doesn't do any update logic.

That callback is invoked only when the default value changes.
createOrUpdateApplicationRateLimiter() for a specific app name is called when a message is received by RefreshApplicationQpsQuotaMessageHandler.

There's one exception to the 'read-only-ness': when an unknown app name is detected and the default app quota is enabled. In that case we have to create the rate limiter on the spot, but without querying ZK.
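
The split described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the class and method names are hypothetical, a plain Double stands in for a real rate limiter, and the in-memory default is assumed to be kept current by the background callback.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: background updates write state, the query path only reads it,
// except for lazily creating an entry for an unknown app from the in-memory default.
final class AppQuotaReadPathSketch {
  // A Double stands in for a real rate limiter object.
  private final ConcurrentMap<String, Double> _limits = new ConcurrentHashMap<>();
  private volatile double _defaultQuota;

  AppQuotaReadPathSketch(double defaultQuota) {
    _defaultQuota = defaultQuota;
  }

  // Background path: would be driven by a refresh message for one application.
  void onQuotaRefreshed(String appName, double quota) {
    _limits.put(appName, quota);
  }

  // Background path: would be driven by a cluster config change of the default.
  void onDefaultChanged(double defaultQuota) {
    _defaultQuota = defaultQuota;
  }

  // Query path: no external lookups. Returns a non-positive value when unlimited.
  double quotaFor(String appName) {
    Double known = _limits.get(appName);
    if (known != null) {
      return known;
    }
    double dflt = _defaultQuota;
    if (dflt <= 0) {
      return dflt; // default disabled: nothing is created for unknown apps
    }
    // The one write on the query path: create the entry from in-memory state only.
    return _limits.computeIfAbsent(appName, k -> dflt);
  }

  boolean isTracked(String appName) {
    return _limits.containsKey(appName);
  }
}
```

With a default of 10, the first quotaFor("x") call creates the entry and returns 10; a later refresh for "x" replaces it without the query path doing any lookup of its own.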

Successfully merging this pull request may close these issues.

Query execution path is accessing metadata in zookeeper through Quota Managers