feat(upsampling) - issues-stats API error upsampling support #94835


Merged: 2 commits into master, Jul 8, 2025

Conversation

Member

@yuvmen yuvmen commented Jul 2, 2025

issues-stats will now return upsampled error counts for projects in the allowlist. This required opening a special case for the non-standard query in both SnubaTSDB and the snuba SDK `aliased_query`. The chosen implementation avoids making these paths overly flexible, which might introduce vulnerabilities, while staying as clean as possible.

Review note: issues-stats returns both a total count and a graph of counts over a period of time, hence the two queries needed. I had to get creative to open up the paths to do what I needed here, and I am not 100% sure about this solution. I think it strikes a good balance between safety and cleanliness, but feel free to suggest alternative approaches I may have missed.

@yuvmen yuvmen requested a review from a team July 2, 2025 22:48
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jul 2, 2025

codecov bot commented Jul 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #94835     +/-   ##
=========================================
  Coverage   87.93%   87.94%             
=========================================
  Files       10381    10388      +7     
  Lines      601411   603227   +1816     
  Branches    23393    23393             
=========================================
+ Hits       528860   530505   +1645     
- Misses      72084    72255    +171     
  Partials      467      467             

Member Author

yuvmen commented Jul 7, 2025

@sentry review

Contributor

seer-by-sentry bot commented Jul 7, 2025

On it! We are reviewing the PR and will provide feedback shortly.

Member Author

yuvmen commented Jul 7, 2025

@sentry generate-test

Contributor

seer-by-sentry bot commented Jul 7, 2025

On it! Sentry is generating unit tests for this PR.

Contributor

seer-by-sentry bot commented Jul 7, 2025

PR Description

This pull request introduces error upsampling, a mechanism to correct for client-side error sampling by adjusting the count of errors based on the client's sample rate. This ensures more accurate error reporting and analysis, especially when clients are configured to sample errors to reduce data volume.
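The correction itself is inverse-probability weighting. As a rough illustration of the idea (not the PR's actual code), each event kept at client-side sample rate r stands in for 1/r real events:

```python
def upsampled_count(client_sample_rates):
    """Estimate the true error count from the kept events' client-side
    sample rates. An event kept at rate r represents 1/r real events;
    events without a recorded rate count as themselves (weight 1)."""
    return round(sum(1.0 / r if r else 1.0 for r in client_sample_rates))
```

A single event stored with a client_sample_rate of 0.1, as in this PR's tests, therefore yields an upsampled count of 10.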


Key Technical Changes

The key technical changes include:

1. A new function `are_all_projects_error_upsampled` to check if all projects in a query are allowlisted for error upsampling.
2. Modification of the Snuba query generation to use the `upsampled_count` aggregation when error upsampling is enabled.
3. Introduction of `get_upsampled_count_snql_with_alias` to construct the SnQL query for calculating the upsampled count.
4. Updates to the Group and GroupStream serializers to incorporate the upsampled count in API responses.
5. Addition of a new constant `UPSAMPLED_ERROR_AGGREGATION` to define the aggregation type.
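A minimal sketch of the allowlist check, assuming the option key the PR's tests mock (`issues.client_error_sampling.project_allowlist`) and using a plain dict in place of Sentry's options store:

```python
from collections.abc import Sequence

# Plain-dict stand-in for Sentry's options store; the real code reads
# this key through the options system.
OPTIONS = {"issues.client_error_sampling.project_allowlist": [1, 2, 3]}

def are_all_projects_error_upsampled(project_ids: Sequence[int]) -> bool:
    # An empty query must not be treated as fully allowlisted.
    if not project_ids:
        return False
    allowlist = set(OPTIONS.get("issues.client_error_sampling.project_allowlist", []))
    return all(pid in allowlist for pid in project_ids)
```

Upsampling is applied only when every queried project is allowlisted, so mixed queries fall back to plain counts.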

Architecture Decisions

The architectural decisions involve modifying the query generation process in Snuba to conditionally use the upsampled_count aggregation. This approach minimizes code duplication and allows for a seamless integration of the upsampling logic. The use of SnQL functions ensures that the upsampling calculation is performed efficiently within Snuba.
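The conditional aggregation swap can be sketched as follows; the `["upsampled_count", "", "times_seen"]` triple matches the diff, while the default `count()` triple is an assumption for illustration:

```python
def choose_times_seen_aggregation(is_upsampled: bool) -> list[str]:
    # Snuba tsdb aggregations are (function, column, alias) triples.
    if is_upsampled:
        # Swapped in when every queried project is allowlisted.
        return ["upsampled_count", "", "times_seen"]
    # Assumed default aggregation; the diff only shows the override path.
    return ["count()", "", "times_seen"]
```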

Dependencies and Interactions

This change depends on the snuba_sdk for constructing SnQL queries. It interacts with the options system to determine the allowlisted projects. It also affects the Group and GroupStream serializers, which are used to generate API responses for issue data. The changes also interact with the feature flagging system to enable/disable the error upsampling feature.

Risk Considerations

Potential risks include:

1. Performance impact of the `upsampled_count` aggregation on Snuba queries.
2. Incorrect upsampling if the client-side sample rate is not accurately reported.
3. Data inconsistencies if error upsampling is enabled or disabled without proper migration.
4. The removal of the `organization` parameter from `are_all_projects_error_upsampled` could break existing code that relies on this parameter.

Notable Implementation Details

A notable implementation detail is the special handling of the upsampled_count aggregation in aliased_query_params within sentry/utils/snuba.py. This function is responsible for converting the aggregation into a selected column with the appropriate SnQL function structure. Also, the get_upsampled_count_snql_with_alias function constructs a complex nested SnQL function to calculate the upsampled count, which requires careful attention to ensure correctness.

Comment on lines +30 to 31
def are_all_projects_error_upsampled(project_ids: Sequence[int]) -> bool:
"""
Contributor


The function signature change removes the organization parameter, but this breaks the contract. Consider providing a deprecation period or keeping backward compatibility by making the parameter optional with a default value of None. This ensures existing code won't break immediately.

Suggested change
def are_all_projects_error_upsampled(project_ids: Sequence[int]) -> bool:
"""
def are_all_projects_error_upsampled(
project_ids: Sequence[int], organization: Organization | None = None
) -> bool:


Member Author


This is an internal helper method, we can freely break its contract since we control all of its usages

Comment on lines +26 to 27
from snuba_sdk import Column, DeleteQuery, Function, MetricsQuery, Request
from snuba_sdk.legacy import json_to_snql
Contributor


Missing import for SelectableExpression type. This should be added to the imports at the top of the file to avoid potential import errors.

Suggested change
from snuba_sdk import Column, DeleteQuery, Function, MetricsQuery, Request
from snuba_sdk.legacy import json_to_snql
from snuba_sdk.query import SelectableExpression


Member Author


Missed the fact that the import does exist.

Comment on lines 123 to 125
user=None,
aggregation_override: str | None = None,
**kwargs,
Contributor


The aggregation_override parameter is added to the abstract method signature, but there's no validation to ensure the override is a valid aggregation type. Consider adding validation to prevent runtime errors.

Suggested change
user=None,
aggregation_override: str | None = None,
**kwargs,
def query_tsdb(
self,
item_list: Sequence[Group],
query_params,
conditions=None,
environment_ids=None,
user=None,
aggregation_override: str | None = None,
**kwargs,
):
if aggregation_override and aggregation_override not in ['count', 'upsampled_count']:
raise ValueError(f"Invalid aggregation_override: {aggregation_override}")
pass


Comment on lines +1070 to +1073
# Check if all projects are allowlisted for error upsampling
is_upsampled = are_all_projects_error_upsampled(project_ids)
if is_upsampled:
aggregations[0] = ["upsampled_count", "", "times_seen"]
Contributor


The check for error upsampling is performed but there's no error handling if the function call fails. Consider wrapping this in a try-catch block to ensure robustness.

Suggested change
# Check if all projects are allowlisted for error upsampling
is_upsampled = are_all_projects_error_upsampled(project_ids)
if is_upsampled:
aggregations[0] = ["upsampled_count", "", "times_seen"]
# Check if all projects are allowlisted for error upsampling
try:
is_upsampled = are_all_projects_error_upsampled(project_ids)
if is_upsampled:
aggregations[0] = ["upsampled_count", "", "times_seen"]
except Exception as e:
# Log the error but don't fail the entire query
logger.warning("Error checking upsampling allowlist", exc_info=e)


Comment on lines 381 to 394
aggregated_as = "aggregate"
aggregations: list[SelectableExpression] = [
Function(
aggregation,
[Column(model_aggregate)] if model_aggregate else [],
aggregated_as,
)
]
if aggregation == "upsampled_count":
aggregations: list[SelectableExpression] = [
get_upsampled_count_snql_with_alias(aggregated_as)
]
else:
aggregations = [
Function(
function=aggregation,
parameters=[Column(model_aggregate)] if model_aggregate else [],
alias=aggregated_as,
)
]

Contributor


The aggregation logic creates different list structures based on the aggregation type, but this inconsistency could lead to bugs. Consider using a consistent pattern for all aggregation types.

Suggested change
aggregated_as = "aggregate"
aggregations: list[SelectableExpression] = [
Function(
aggregation,
[Column(model_aggregate)] if model_aggregate else [],
aggregated_as,
)
]
if aggregation == "upsampled_count":
aggregations: list[SelectableExpression] = [
get_upsampled_count_snql_with_alias(aggregated_as)
]
else:
aggregations = [
Function(
function=aggregation,
parameters=[Column(model_aggregate)] if model_aggregate else [],
alias=aggregated_as,
)
]
if aggregation == "upsampled_count":
aggregations: list[SelectableExpression] = [
get_upsampled_count_snql_with_alias(aggregated_as)
]
else:
aggregations = [
Function(
function=aggregation,
parameters=[Column(model_aggregate)] if model_aggregate else [],
alias=aggregated_as,
)
]


Comment on lines 1671 to +1685
for aggregation in aggregations:
derived_columns.append(aggregation[2])

if aggregation[0] == UPSAMPLED_ERROR_AGGREGATION:
# Special-case: upsampled_count aggregation - this aggregation type
# requires special handling to convert it into a selected column
# with the appropriate SNQL function structure
if selected_columns is None:
selected_columns = []
selected_columns.append(
get_upsampled_count_snql_with_alias(
aggregation[2]
if len(aggregation) > 2 and aggregation[2] is not None
else "upsampled_count"
)
Contributor


The aggregation processing logic uses hardcoded index aggregation[2] without checking if the aggregation tuple has enough elements. This could cause IndexError. Add bounds checking.

Suggested change
for aggregation in aggregations:
derived_columns.append(aggregation[2])
if aggregation[0] == UPSAMPLED_ERROR_AGGREGATION:
# Special-case: upsampled_count aggregation - this aggregation type
# requires special handling to convert it into a selected column
# with the appropriate SNQL function structure
if selected_columns is None:
selected_columns = []
selected_columns.append(
get_upsampled_count_snql_with_alias(
aggregation[2]
if len(aggregation) > 2 and aggregation[2] is not None
else "upsampled_count"
)
if aggregation[0] == UPSAMPLED_ERROR_AGGREGATION:
# Special-case: upsampled_count aggregation
if selected_columns is None:
selected_columns = []
alias = (
aggregation[2]
if len(aggregation) > 2 and aggregation[2] is not None
else "upsampled_count"
)
selected_columns.append(
get_upsampled_count_snql_with_alias(alias)
)


Member Author


It does check the tuple bounds.

Comment on lines 38 to +52
def test_are_all_projects_error_upsampled(self, mock_options: Mock) -> None:
# Test when all projects are allowlisted
mock_options.get.return_value = self.project_ids
assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is True
assert are_all_projects_error_upsampled(self.project_ids) is True

# Test when some projects are not allowlisted
mock_options.get.return_value = self.project_ids[:-1]
assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is False
assert are_all_projects_error_upsampled(self.project_ids) is False

# Test when no projects are allowlisted
mock_options.get.return_value = []
assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is False
assert are_all_projects_error_upsampled(self.project_ids) is False

# Test when no project IDs provided
assert _are_all_projects_error_upsampled([], self.organization) is False
assert are_all_projects_error_upsampled([]) is False
Contributor


The test is calling the old private function _are_all_projects_error_upsampled but the diff shows this function was renamed to are_all_projects_error_upsampled (public). Update the test to use the new public function name.

Suggested change
def test_are_all_projects_error_upsampled(self, mock_options: Mock) -> None:
# Test when all projects are allowlisted
mock_options.get.return_value = self.project_ids
assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is True
assert are_all_projects_error_upsampled(self.project_ids) is True
# Test when some projects are not allowlisted
mock_options.get.return_value = self.project_ids[:-1]
assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is False
assert are_all_projects_error_upsampled(self.project_ids) is False
# Test when no projects are allowlisted
mock_options.get.return_value = []
assert _are_all_projects_error_upsampled(self.project_ids, self.organization) is False
assert are_all_projects_error_upsampled(self.project_ids) is False
# Test when no project IDs provided
assert _are_all_projects_error_upsampled([], self.organization) is False
assert are_all_projects_error_upsampled([]) is False
@patch("sentry.api.helpers.error_upsampling.options")
def test_are_all_projects_error_upsampled(self, mock_options: Mock) -> None:
# Test when all projects are allowlisted
mock_options.get.return_value = self.project_ids
assert are_all_projects_error_upsampled(self.project_ids) is True
# Test when some projects are not allowlisted
mock_options.get.return_value = self.project_ids[:-1]
assert are_all_projects_error_upsampled(self.project_ids) is False
# Test when no projects are allowlisted
mock_options.get.return_value = []
assert are_all_projects_error_upsampled(self.project_ids) is False
# Test when no project IDs provided
assert are_all_projects_error_upsampled([]) is False


Member Author

@yuvmen yuvmen Jul 7, 2025


Some bug here: it suggests the same code as the PR version, and thinks the old version is something it is not.

Comment on lines +242 to +267
"""Test that count is upsampled for allowlisted projects in group index stats."""
with self.options({"issues.client_error_sampling.project_allowlist": [self.project.id]}):
project = self.project
event_data = {
"timestamp": before_now(seconds=30).isoformat(),
"message": "Error event for upsampling",
"contexts": {"error_sampling": {"client_sample_rate": 0.1}},
}
event = self.store_event(
data=event_data,
project_id=project.id,
)

group = event.group
self.login_as(user=self.user)

with self.feature("organizations:error-upsampling"):
response = self.get_response(query="is:unresolved", groups=[group.id])
assert response.status_code == 200
assert len(response.data) == 1
# Expect the count to be upsampled (1 / 0.1 = 10) - count is a string
assert response.data[0]["count"] == "10"
# Also check that lifetime stats are upsampled
assert response.data[0]["lifetime"]["count"] == "10"
# Also check that stats are upsampled, latest time bucket should contain upsampled event
assert response.data[0]["stats"]["24h"][-1][1] == 10
Contributor


The test stores an event with a client_sample_rate of 0.1 and expects the count to be upsampled to 10, but there's no verification that the upsampling logic actually uses the sample_weight field. Consider adding assertions to verify the sample_weight is correctly applied.

Suggested change
"""Test that count is upsampled for allowlisted projects in group index stats."""
with self.options({"issues.client_error_sampling.project_allowlist": [self.project.id]}):
project = self.project
event_data = {
"timestamp": before_now(seconds=30).isoformat(),
"message": "Error event for upsampling",
"contexts": {"error_sampling": {"client_sample_rate": 0.1}},
}
event = self.store_event(
data=event_data,
project_id=project.id,
)
group = event.group
self.login_as(user=self.user)
with self.feature("organizations:error-upsampling"):
response = self.get_response(query="is:unresolved", groups=[group.id])
assert response.status_code == 200
assert len(response.data) == 1
# Expect the count to be upsampled (1 / 0.1 = 10) - count is a string
assert response.data[0]["count"] == "10"
# Also check that lifetime stats are upsampled
assert response.data[0]["lifetime"]["count"] == "10"
# Also check that stats are upsampled, latest time bucket should contain upsampled event
assert response.data[0]["stats"]["24h"][-1][1] == 10
event_data = {
"timestamp": before_now(seconds=30).isoformat(),
"message": "Error event for upsampling",
"contexts": {"error_sampling": {"client_sample_rate": 0.1}},
# Explicitly set sample_weight to verify upsampling logic
"extra": {"sample_weight": 10.0}
}


Member Author


Even if it's a reasonable suggestion to add an assertion for this (though it's not accurate), it suggests deleting the entire test and adding only that, which makes no sense.

Comment on lines 395 to +401
if self.stats_period and not self._collapse("stats"):
aggregation_override = None
if self.project_ids:
is_upsampled = are_all_projects_error_upsampled(self.project_ids)
if is_upsampled:
aggregation_override = UPSAMPLED_ERROR_AGGREGATION

Contributor


The project_ids are stored in self.project_ids in the constructor but this could be None. Add null checking before using it in get_attrs method.

Suggested change
if self.stats_period and not self._collapse("stats"):
aggregation_override = None
if self.project_ids:
is_upsampled = are_all_projects_error_upsampled(self.project_ids)
if is_upsampled:
aggregation_override = UPSAMPLED_ERROR_AGGREGATION
aggregation_override = None
if self.project_ids:
try:
is_upsampled = are_all_projects_error_upsampled(self.project_ids)
if is_upsampled:
aggregation_override = UPSAMPLED_ERROR_AGGREGATION
except Exception as e:
# Log warning but don't fail the query
logger.warning("Failed to check upsampling status", exc_info=e)


Comment on lines +2084 to +2101
def get_upsampled_count_snql_with_alias(alias: str) -> list[SelectableExpression]:
return Function(
function="toInt64",
parameters=[
Function(
function="sum",
parameters=[
Function(
function="ifNull",
parameters=[Column(name="sample_weight"), 1],
alias=None,
)
],
alias=None,
)
],
alias=alias,
)
Contributor


Consider adding documentation for the complex nested function structure that builds the upsampled count SNQL. This function creates a deeply nested structure that would benefit from clearer documentation.

Suggested change
def get_upsampled_count_snql_with_alias(alias: str) -> list[SelectableExpression]:
return Function(
function="toInt64",
parameters=[
Function(
function="sum",
parameters=[
Function(
function="ifNull",
parameters=[Column(name="sample_weight"), 1],
alias=None,
)
],
alias=None,
)
],
alias=alias,
)
def get_upsampled_count_snql_with_alias(alias: str) -> SelectableExpression:
"""
Creates a SnQL expression for upsampled count aggregation.
This builds: toInt64(sum(ifNull(sample_weight, 1)))
Which sums the sample_weight column, defaulting to 1 if null,
then converts the result to a 64-bit integer.
Args:
alias: The alias to assign to the resulting expression
Returns:
A SnQL Function expression for upsampled counting
"""
return Function(
function="toInt64",
parameters=[
Function(
function="sum",
parameters=[
Function(
function="ifNull",
parameters=[Column(name="sample_weight"), 1],
alias=None,
)
],
alias=None,
)
],
alias=alias,
)


Contributor

seer-by-sentry bot commented Jul 7, 2025

Sentry has determined that unit tests are not necessary for this PR.

@yuvmen yuvmen marked this pull request as ready for review July 7, 2025 20:35
@yuvmen yuvmen requested review from a team as code owners July 7, 2025 20:35
Member

@JoshFerge JoshFerge left a comment


makes sense to me.

@yuvmen yuvmen merged commit 7b4c8df into master Jul 8, 2025
66 checks passed
@yuvmen yuvmen deleted the yuvmen/issues-stats-error-upsampling-support branch July 8, 2025 19:42

sentry-io bot commented Jul 9, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:


andrewshie-sentry pushed a commit that referenced this pull request Jul 14, 2025