fix(deletions): Prevent timeouts when deleting GroupHash records #101545
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #101545      +/-  ##
===========================================
- Coverage   80.98%    80.98%   -0.01%
===========================================
  Files        8706      8706
  Lines      387005    387142     +137
  Branches    24548     24548
===========================================
+ Hits       313413    313522     +109
- Misses      73245     73273      +28
  Partials      347       347
Force-pushed from efa76a4 to e7e5821.
EVENT_CHUNK_SIZE = 10000
GROUP_HASH_ITERATIONS = 10000
# Batch size for nullifying group_hash_metadata.seer_matched_grouphash_id references to avoid database timeouts
GROUP_HASH_METADATA_BATCH_SIZE = 10
We're seeing updates of 100 hashes take over 30 seconds, so this should be good enough.
iterations += 1

if iterations == GROUP_HASH_ITERATIONS:
This is a drive-by change.
Nit: I'd move this out of the loop so we're only checking it once (after the loop has finished). Or it might be clearer to have a did_break: bool flag that's set on break, with this check outside the loop guarded by if not did_break:.
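A minimal sketch of the did_break variant, assuming illustrative fetch/delete callables rather than the actual deletion task code:

import logging

logger = logging.getLogger(__name__)

GROUP_HASH_ITERATIONS = 10000

def delete_in_batches(fetch_next_chunk, delete_chunk):
    # did_break records whether we ran out of work before hitting the cap.
    did_break = False
    for _ in range(GROUP_HASH_ITERATIONS):
        chunk = fetch_next_chunk()
        if not chunk:
            did_break = True
            break
        delete_chunk(chunk)
    # One check after the loop instead of re-checking on every iteration.
    if not did_break:
        logger.warning(
            "Group hashes batch deletion reached the maximum number of iterations."
        )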
for i in range(0, len(hash_ids), GROUP_HASH_METADATA_BATCH_SIZE):
    batch = hash_ids[i : i + GROUP_HASH_METADATA_BATCH_SIZE]
    GroupHashMetadata.objects.filter(
        seer_matched_grouphash_id__in=batch, seer_matched_grouphash_id__isnull=False
Using seer_matched_grouphash_id__isnull=False reduces the number of rows that need updating.
How does this reduce further beyond what the seer_matched_grouphash_id__in=batch filter is already doing? Does batch contain None / null values?
register(
    "deletions.group-hashes-batch-size",
-   default=10000,
+   default=100,
Options Automator uses this value.
tests/sentry/deletions/test_group.py (outdated)
| "args": [self.project.id, error_group_hashes, 0] | ||
| } | ||
|
|
||
| def test_batch_nullify_seer_matched_grouphash_references(self) -> None: |
If you read the test, it looks like what I'm doing makes sense; however, I'm not entirely sure the way I'm associating the hashes and metadata is correct. I have some other changes locally which also don't convince me.
I would still like to get this in, as the code changes are obvious.
I will spend time talking with @lobsterkatie next week to see if what I'm doing makes sense.
The closest explanation I have is this:
#83081 (comment)
tests/sentry/deletions/test_group.py (outdated)
# Pretend that Seer tells us that grouphash B is similar to grouphash A
grouphash_b.metadata.seer_matched_grouphash = grouphash_a
grouphash_b.metadata.save()
This is the hack, instead of doing something like the following to accomplish the same thing:

with mock.patch(
    "sentry.grouping.ingest.seer.get_seer_similar_issues"
) as mock_get_seer_similar_issues:
    # Let seer similarity return that grouphash_b is similar to grouphash_a
    mock_get_seer_similar_issues.return_value = (0.01, grouphash_a)
tests/sentry/deletions/test_group.py (outdated)
# Grouphash B's metadata should still exist, but the reference to A should be nullified
metadata_b = GroupHashMetadata.objects.get(id=metadata_b_id)
assert metadata_b.seer_matched_grouphash is None
It is now None, whereas before the assertion was assert grouphash_b.metadata.seer_matched_grouphash == grouphash_a.
if iterations == GROUP_HASH_ITERATIONS:
    metrics.incr("deletions.group_hashes.max_iterations_reached", sample_rate=1.0)
    logger.warning(
        "Group hashes batch deletion reached the maximum number of iterations. "
        "Investigate if we need to change the GROUP_HASH_ITERATIONS value."
    )
Potential bug: The reduced batch size with an unchanged iteration limit can cause delete_group_hashes to silently fail, leaving orphaned data for projects with over 1M GroupHash records.
- Description: The batch size for GroupHash deletion was reduced from 10,000 to 100, but the iteration limit GROUP_HASH_ITERATIONS remains at 10,000. This lowers the maximum number of deletable hashes in a single run from 100 million to 1 million (see the quick arithmetic check after this comment). When this new, lower limit is reached, such as during the deletion of a large project, the function logs a warning and exits without raising an error. This silent failure leaves orphaned GroupHash records in the database, as the calling function is unaware the deletion was incomplete.
- Suggested fix: Increase the GROUP_HASH_ITERATIONS constant to compensate for the smaller batch size, for example to 1,000,000, to maintain the previous capacity. Alternatively, raise an exception when the iteration limit is reached to prevent silent failures and allow the caller to handle the incomplete deletion.
severity: 0.7, confidence: 0.95
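The capacity numbers in the report follow directly from the two constants; a quick back-of-the-envelope check (plain Python, not Sentry code):

# Per-run deletion capacity = batch size * iteration cap.
old_capacity = 10_000 * 10_000  # 100,000,000 hashes with the previous default
new_capacity = 100 * 10_000     # 1,000,000 hashes once the default drops to 100
assert old_capacity == 100_000_000
assert new_capacity == 1_000_000
assert old_capacity // new_capacity == 100  # a 100x drop in per-run capacity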
# and we need to nullify the seer_matched_grouphash_id field in the GroupHashMetadata model before deleting the GroupHash model
# to prevent the implicit ON DELETE SET NULL cascade from timing out.
# Process in small batches to avoid statement timeouts on high fan-out relationships
for i in range(0, len(hash_ids), GROUP_HASH_METADATA_BATCH_SIZE):
Wonder if it would be more performant (and worth changing) to switch the loop in delete_group_hashes from:

qs = GroupHash.objects.filter(project_id=project_id, group_id__in=group_ids).values_list(
    "id", "hash"
)[:hashes_batch_size]
hashes_chunk = list(qs)

to something more like the sketch below, where we could do one big query and then divvy it up over the iterative loop. (Not blocking, just wondering.)
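A rough sketch of that idea, assuming the same project_id, group_ids, and hashes_batch_size names as the snippet above; iter_hash_chunks is an illustrative helper, not the shipped implementation, and the unsliced query holds every matching row in memory at once:

from sentry.models.grouphash import GroupHash  # assumed import path

def iter_hash_chunks(project_id, group_ids, hashes_batch_size):
    # One unsliced query up front...
    all_hashes = list(
        GroupHash.objects.filter(project_id=project_id, group_id__in=group_ids).values_list(
            "id", "hash"
        )
    )
    # ...then slice the in-memory result into chunks for the existing loop.
    for i in range(0, len(all_hashes), hashes_batch_size):
        yield all_hashes[i : i + hashes_batch_size]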
Force-pushed from 18f02af to 06b7efa.
-    __repr__ = sane_repr("group_id", "hash")
+    __repr__ = sane_repr("group_id", "hash", "metadata")
     __str__ = __repr__
Bug: Circular Reference in GroupHash Representation
Adding metadata to GroupHash.__repr__ and seer_matched_grouphash to GroupHashMetadata.__repr__ creates a circular reference. This causes infinite recursion when these objects are string-represented, leading to stack overflow errors and unexpected database queries.
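A stripped-down illustration of the concern, using plain toy classes rather than Sentry's sane_repr helper (names are illustrative only):

class GroupHashToy:
    def __init__(self):
        self.metadata = None

    def __repr__(self):
        return f"<GroupHashToy metadata={self.metadata!r}>"

class GroupHashMetadataToy:
    def __init__(self, seer_matched_grouphash):
        self.seer_matched_grouphash = seer_matched_grouphash

    def __repr__(self):
        return f"<GroupHashMetadataToy seer_matched_grouphash={self.seer_matched_grouphash!r}>"

gh = GroupHashToy()
gh.metadata = GroupHashMetadataToy(seer_matched_grouphash=gh)  # the cycle points back at gh
# repr(gh) now raises RecursionError because each __repr__ formats the other.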
Fixes SENTRY-5ABJ.
The issue comes from this block:
sentry/src/sentry/deletions/defaults/group.py, lines 248 to 259 in a3a7717
The update is triggered by this on_delete:
sentry/src/sentry/models/grouphashmetadata.py, lines 116 to 118 in b1f684a
Currently, when we try to delete all the group hashes, we update the related group hash metadata first. This query ends up failing because it takes longer than 30 seconds:

SQL: UPDATE "sentry_grouphashmetadata" SET "seer_matched_grouphash_id" = NULL WHERE "sentry_grouphashmetadata"."seer_matched_grouphash_id" IN (%s, ..., %s)

This can be resolved by deleting the group hash metadata rows before trying to delete the group hash rows, which avoids the update statements altogether.
This fix was initially started in #101545; however, the solution has completely changed, so a new PR was opened.
The initial fix was generated by Seer, but the final fix takes a completely different approach.
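A hedged sketch of the metadata-first approach described above (illustrative function name, assumed import paths, and it presumes GroupHashMetadata's one-to-one field to GroupHash is named grouphash and that any referencing metadata rows belong to the same set of hashes being deleted):

from sentry.models.grouphash import GroupHash  # assumed import path
from sentry.models.grouphashmetadata import GroupHashMetadata  # assumed import path

def delete_group_hashes_metadata_first(project_id, group_ids, batch_size=100):
    hash_ids = list(
        GroupHash.objects.filter(project_id=project_id, group_id__in=group_ids).values_list(
            "id", flat=True
        )
    )
    for i in range(0, len(hash_ids), batch_size):
        chunk = hash_ids[i : i + batch_size]
        # Deleting the metadata rows first removes the foreign keys that would
        # otherwise be rewritten by the ON DELETE SET NULL cascade...
        GroupHashMetadata.objects.filter(grouphash_id__in=chunk).delete()
        # ...so deleting the hashes no longer triggers the slow UPDATE statement.
        GroupHash.objects.filter(id__in=chunk).delete()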