feat(upsampling) - issues-stats API error upsampling support #94835
Conversation
issues-stats will now return upsampled error counts for projects in the allowlist. This required opening a special case for the non-standard query in both SnubaTSDB and the Snuba SDK `aliased_query`. The chosen implementation avoids opening them up to be too flexible, which might introduce vulnerabilities, while staying as clean as possible.
Codecov Report: All modified and coverable lines are covered by tests ✅. All tests successful. No failed tests found.

```
@@ Coverage Diff @@
## master #94835 +/- ##
=========================================
  Coverage   87.93%   87.94%
=========================================
  Files       10381    10388      +7
  Lines      601411   603227   +1816
  Branches    23393    23393
=========================================
+ Hits       528860   530505   +1645
- Misses      72084    72255    +171
  Partials      467      467
```
@sentry review

On it! We are reviewing the PR and will provide feedback shortly.

@sentry generate-test

On it! Sentry is generating unit tests for this PR.
PR Description

This pull request introduces error upsampling, a mechanism to correct for client-side error sampling by adjusting the count of errors based on the client's sample rate. This ensures more accurate error reporting and analysis, especially when clients are configured to sample errors to reduce data volume.

Key Technical Changes: The key technical changes include: 1) A new function

Architecture Decisions: The architectural decisions involve modifying the query generation process in Snuba to conditionally use the

Dependencies and Interactions: This change depends on the

Risk Considerations: Potential risks include: 1) Performance impact of the

Notable Implementation Details: A notable implementation detail is the special handling of the
```python
def are_all_projects_error_upsampled(project_ids: Sequence[int]) -> bool:
    """
```
The function signature change removes the `organization` parameter, but this breaks the contract. Consider providing a deprecation period or keeping backward compatibility by making the parameter optional with a default value of `None`. This ensures existing code won't break immediately.
Suggested change:

```python
def are_all_projects_error_upsampled(
    project_ids: Sequence[int], organization: Organization | None = None
) -> bool:
```
This is an internal helper method; we can freely break its contract since we control all of its usages.
```python
from snuba_sdk import Column, DeleteQuery, Function, MetricsQuery, Request
from snuba_sdk.legacy import json_to_snql
```
Missing import for the `SelectableExpression` type. This should be added to the imports at the top of the file to avoid potential import errors.
Suggested change:

```python
from snuba_sdk import Column, DeleteQuery, Function, MetricsQuery, Request
from snuba_sdk.legacy import json_to_snql
from snuba_sdk.query import SelectableExpression
```
Missed the fact that the import does exist.
```python
    user=None,
    aggregation_override: str | None = None,
    **kwargs,
```
The `aggregation_override` parameter is added to the abstract method signature, but there's no validation to ensure the override is a valid aggregation type. Consider adding validation to prevent runtime errors.
Suggested change:

```python
def query_tsdb(
    self,
    item_list: Sequence[Group],
    query_params,
    conditions=None,
    environment_ids=None,
    user=None,
    aggregation_override: str | None = None,
    **kwargs,
):
    if aggregation_override and aggregation_override not in ["count", "upsampled_count"]:
        raise ValueError(f"Invalid aggregation_override: {aggregation_override}")
    pass
```
```python
# Check if all projects are allowlisted for error upsampling
is_upsampled = are_all_projects_error_upsampled(project_ids)
if is_upsampled:
    aggregations[0] = ["upsampled_count", "", "times_seen"]
```
The check for error upsampling is performed, but there's no error handling if the function call fails. Consider wrapping this in a try/except block to ensure robustness.
Suggested change:

```python
# Check if all projects are allowlisted for error upsampling
try:
    is_upsampled = are_all_projects_error_upsampled(project_ids)
    if is_upsampled:
        aggregations[0] = ["upsampled_count", "", "times_seen"]
except Exception as e:
    # Log the error but don't fail the entire query
    logger.warning("Error checking upsampling allowlist", exc_info=e)
```
```diff
 aggregated_as = "aggregate"
-aggregations: list[SelectableExpression] = [
-    Function(
-        aggregation,
-        [Column(model_aggregate)] if model_aggregate else [],
-        aggregated_as,
-    )
-]
+if aggregation == "upsampled_count":
+    aggregations: list[SelectableExpression] = [
+        get_upsampled_count_snql_with_alias(aggregated_as)
+    ]
+else:
+    aggregations = [
+        Function(
+            function=aggregation,
+            parameters=[Column(model_aggregate)] if model_aggregate else [],
+            alias=aggregated_as,
+        )
+    ]
```
The aggregation logic creates different list structures based on the aggregation type, but this inconsistency could lead to bugs. Consider using a consistent pattern for all aggregation types.
Suggested change:

```python
if aggregation == "upsampled_count":
    aggregations: list[SelectableExpression] = [
        get_upsampled_count_snql_with_alias(aggregated_as)
    ]
else:
    aggregations = [
        Function(
            function=aggregation,
            parameters=[Column(model_aggregate)] if model_aggregate else [],
            alias=aggregated_as,
        )
    ]
```
```python
for aggregation in aggregations:
    derived_columns.append(aggregation[2])

    if aggregation[0] == UPSAMPLED_ERROR_AGGREGATION:
        # Special-case: upsampled_count aggregation - this aggregation type
        # requires special handling to convert it into a selected column
        # with the appropriate SNQL function structure
        if selected_columns is None:
            selected_columns = []
        selected_columns.append(
            get_upsampled_count_snql_with_alias(
                aggregation[2]
                if len(aggregation) > 2 and aggregation[2] is not None
                else "upsampled_count"
            )
        )
```
The aggregation processing logic uses the hardcoded index `aggregation[2]` without checking if the aggregation tuple has enough elements. This could cause an IndexError. Add bounds checking.
Suggested change:

```python
if aggregation[0] == UPSAMPLED_ERROR_AGGREGATION:
    # Special-case: upsampled_count aggregation
    if selected_columns is None:
        selected_columns = []
    alias = (
        aggregation[2]
        if len(aggregation) > 2 and aggregation[2] is not None
        else "upsampled_count"
    )
    selected_columns.append(
        get_upsampled_count_snql_with_alias(alias)
    )
```
It does check the tuple.
```diff
 def test_are_all_projects_error_upsampled(self, mock_options: Mock) -> None:
     # Test when all projects are allowlisted
     mock_options.get.return_value = self.project_ids
-    assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is True
+    assert are_all_projects_error_upsampled(self.project_ids) is True

     # Test when some projects are not allowlisted
     mock_options.get.return_value = self.project_ids[:-1]
-    assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is False
+    assert are_all_projects_error_upsampled(self.project_ids) is False

     # Test when no projects are allowlisted
     mock_options.get.return_value = []
-    assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is False
+    assert are_all_projects_error_upsampled(self.project_ids) is False

     # Test when no project IDs provided
-    assert _are_all_projects_error_upsampled([], self.organization) is False
+    assert are_all_projects_error_upsampled([]) is False
```
The test is calling the old private function `_are_all_projects_error_upsampled`, but the diff shows this function was renamed to the public `are_all_projects_error_upsampled`. Update the test to use the new public function name.
Suggested change:

```python
@patch("sentry.api.helpers.error_upsampling.options")
def test_are_all_projects_error_upsampled(self, mock_options: Mock) -> None:
    # Test when all projects are allowlisted
    mock_options.get.return_value = self.project_ids
    assert are_all_projects_error_upsampled(self.project_ids) is True
    # Test when some projects are not allowlisted
    mock_options.get.return_value = self.project_ids[:-1]
    assert are_all_projects_error_upsampled(self.project_ids) is False
    # Test when no projects are allowlisted
    mock_options.get.return_value = []
    assert are_all_projects_error_upsampled(self.project_ids) is False
    # Test when no project IDs provided
    assert are_all_projects_error_upsampled([]) is False
```
There's a bug here: it suggests the same code that is already in the PR version, and mistakes the old version for something it is not.
"""Test that count is upsampled for allowlisted projects in group index stats.""" | ||
with self.options({"issues.client_error_sampling.project_allowlist": [self.project.id]}): | ||
project = self.project | ||
event_data = { | ||
"timestamp": before_now(seconds=30).isoformat(), | ||
"message": "Error event for upsampling", | ||
"contexts": {"error_sampling": {"client_sample_rate": 0.1}}, | ||
} | ||
event = self.store_event( | ||
data=event_data, | ||
project_id=project.id, | ||
) | ||
|
||
group = event.group | ||
self.login_as(user=self.user) | ||
|
||
with self.feature("organizations:error-upsampling"): | ||
response = self.get_response(query="is:unresolved", groups=[group.id]) | ||
assert response.status_code == 200 | ||
assert len(response.data) == 1 | ||
# Expect the count to be upsampled (1 / 0.1 = 10) - count is a string | ||
assert response.data[0]["count"] == "10" | ||
# Also check that lifetime stats are upsampled | ||
assert response.data[0]["lifetime"]["count"] == "10" | ||
# Also check that stats are upsampled, latest time bucket should contain upsampled event | ||
assert response.data[0]["stats"]["24h"][-1][1] == 10 |
The test stores an event with a client_sample_rate of 0.1 and expects the count to be upsampled to 10, but there's no verification that the upsampling logic actually uses the sample_weight field. Consider adding assertions to verify the sample_weight is correctly applied.
"""Test that count is upsampled for allowlisted projects in group index stats.""" | |
with self.options({"issues.client_error_sampling.project_allowlist": [self.project.id]}): | |
project = self.project | |
event_data = { | |
"timestamp": before_now(seconds=30).isoformat(), | |
"message": "Error event for upsampling", | |
"contexts": {"error_sampling": {"client_sample_rate": 0.1}}, | |
} | |
event = self.store_event( | |
data=event_data, | |
project_id=project.id, | |
) | |
group = event.group | |
self.login_as(user=self.user) | |
with self.feature("organizations:error-upsampling"): | |
response = self.get_response(query="is:unresolved", groups=[group.id]) | |
assert response.status_code == 200 | |
assert len(response.data) == 1 | |
# Expect the count to be upsampled (1 / 0.1 = 10) - count is a string | |
assert response.data[0]["count"] == "10" | |
# Also check that lifetime stats are upsampled | |
assert response.data[0]["lifetime"]["count"] == "10" | |
# Also check that stats are upsampled, latest time bucket should contain upsampled event | |
assert response.data[0]["stats"]["24h"][-1][1] == 10 | |
event_data = { | |
"timestamp": before_now(seconds=30).isoformat(), | |
"message": "Error event for upsampling", | |
"contexts": {"error_sampling": {"client_sample_rate": 0.1}}, | |
# Explicitly set sample_weight to verify upsampling logic | |
"extra": {"sample_weight": 10.0} | |
} |
Even if it's a reasonable suggestion to add an assertion for it (though not an accurate one), it suggests deleting the entire test and adding only that, which makes no sense.
```python
if self.stats_period and not self._collapse("stats"):
    aggregation_override = None
    if self.project_ids:
        is_upsampled = are_all_projects_error_upsampled(self.project_ids)
        if is_upsampled:
            aggregation_override = UPSAMPLED_ERROR_AGGREGATION
```
The project IDs are stored in `self.project_ids` in the constructor, but this could be None. Add null checking before using it in the `get_attrs` method.
Suggested change:

```python
aggregation_override = None
if self.project_ids:
    try:
        is_upsampled = are_all_projects_error_upsampled(self.project_ids)
        if is_upsampled:
            aggregation_override = UPSAMPLED_ERROR_AGGREGATION
    except Exception as e:
        # Log warning but don't fail the query
        logger.warning("Failed to check upsampling status", exc_info=e)
```
```python
def get_upsampled_count_snql_with_alias(alias: str) -> list[SelectableExpression]:
    return Function(
        function="toInt64",
        parameters=[
            Function(
                function="sum",
                parameters=[
                    Function(
                        function="ifNull",
                        parameters=[Column(name="sample_weight"), 1],
                        alias=None,
                    )
                ],
                alias=None,
            )
        ],
        alias=alias,
    )
```
Consider adding documentation for the complex nested function structure that builds the upsampled count SNQL. This function creates a deeply nested structure that would benefit from clearer documentation.
Suggested change:

```python
def get_upsampled_count_snql_with_alias(alias: str) -> SelectableExpression:
    """
    Creates a SnQL expression for upsampled count aggregation.

    This builds: toInt64(sum(ifNull(sample_weight, 1)))
    which sums the sample_weight column, defaulting to 1 if null,
    then converts the result to a 64-bit integer.

    Args:
        alias: The alias to assign to the resulting expression

    Returns:
        A SnQL Function expression for upsampled counting
    """
    return Function(
        function="toInt64",
        parameters=[
            Function(
                function="sum",
                parameters=[
                    Function(
                        function="ifNull",
                        parameters=[Column(name="sample_weight"), 1],
                        alias=None,
                    )
                ],
                alias=None,
            )
        ],
        alias=alias,
    )
```
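As a mental model for what the SnQL above computes, the ClickHouse expression `toInt64(sum(ifNull(sample_weight, 1)))` behaves like this Python reduction over event rows (the rows here are invented for illustration):

```python
def upsampled_count(rows: list[dict]) -> int:
    # ifNull(sample_weight, 1): a missing or NULL weight counts as 1, since
    # an unsampled event represents only itself; then sum and truncate to an
    # integer, mirroring toInt64(sum(...)).
    return int(sum(
        r["sample_weight"] if r.get("sample_weight") is not None else 1
        for r in rows
    ))

rows = [
    {"sample_weight": 10.0},  # event kept at client_sample_rate = 0.1
    {"sample_weight": None},  # weight column is NULL
    {},                       # weight column absent entirely
]
print(upsampled_count(rows))  # 12
```

Three stored rows thus report 12 upsampled errors: the sampled event carries its 10x weight while the two unweighted events count once each.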
Sentry has determined that unit tests are not necessary for this PR.
makes sense to me.
Suspect Issues: This pull request was deployed and Sentry observed the following issues:
Review note: notice that issues-stats returns both a total count and a graph of counts over a period of time, hence the two queries needed.
I had to improvise and get creative to open up paths to do what I needed here, and I am not 100% sure about this solution. I think it strikes a good balance between safety and cleanliness, but feel free to suggest alternative approaches I may have missed.
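To make the two-query shape concrete, here is a rough sketch (invented data and helper names, not the actual SnubaTSDB code) of computing both results from weighted events: a lifetime total plus a bucketed time series of upsampled counts.

```python
from collections import defaultdict

def upsampled_total_and_series(events, bucket_seconds=3600):
    """events is a list of (unix_timestamp, sample_weight) pairs."""
    total = 0.0
    series: dict[int, float] = defaultdict(float)
    for ts, weight in events:
        total += weight
        series[ts - ts % bucket_seconds] += weight  # floor to bucket start
    return int(total), {bucket: int(w) for bucket, w in sorted(series.items())}

events = [(100, 10.0), (3700, 1.0), (3900, 10.0)]
total, series = upsampled_total_and_series(events)
print(total, series)  # 21 {0: 10, 3600: 11}
```

In the real endpoint both aggregations run as separate Snuba queries over the same weight expression, which is why the special case had to be opened in two places.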