
Conversation


@seer-by-sentry seer-by-sentry bot commented Oct 15, 2025

Fixes SENTRY-5ABJ.

The issue comes from this block:

try:
    if seer_deletion:
        # Tell seer to delete grouping records for these groups
        # It's low priority to delete the hashes from seer, so we don't want
        # any network errors to block the deletion of the groups
        hash_values = [gh[1] for gh in hashes_chunk]
        may_schedule_task_to_delete_hashes_from_seer(project_id, hash_values)
except Exception:
    logger.warning("Error scheduling task to delete hashes from seer")
finally:
    hash_ids = [gh[0] for gh in hashes_chunk]
    GroupHash.objects.filter(id__in=hash_ids).delete()

The update is triggered because of this on_delete:

seer_matched_grouphash = FlexibleForeignKey(
    "sentry.GroupHash", related_name="seer_matchees", on_delete=models.SET_NULL, null=True
)

Currently, when we try to delete all the group hashes, the related group hash metadata is updated first. This query ends up failing because it takes longer than 30 seconds:

SQL: UPDATE "sentry_grouphashmetadata" SET "seer_matched_grouphash_id" = NULL WHERE "sentry_grouphashmetadata"."seer_matched_grouphash_id" IN (%s, ..., %s)

This can be resolved by deleting the group hash metadata rows before trying to delete the group hash rows. This will avoid the update statements altogether.
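A minimal sketch of that ordering, assuming GroupHashMetadata links back to GroupHash via a grouphash field and that the metadata rows referencing these hashes belong to the same set being deleted (as in the group-deletion path above); this is not the exact code of either PR:

from sentry.models.grouphash import GroupHash
from sentry.models.grouphashmetadata import GroupHashMetadata


def delete_hashes_chunk(hashes_chunk: list[tuple[int, str]]) -> None:
    # hashes_chunk is the same list of (id, hash) tuples used in the block above.
    hash_ids = [gh[0] for gh in hashes_chunk]

    # Deleting the metadata rows first leaves nothing for the implicit
    # ON DELETE SET NULL cascade to update when the hashes themselves are deleted.
    GroupHashMetadata.objects.filter(grouphash_id__in=hash_ids).delete()

    GroupHash.objects.filter(id__in=hash_ids).delete()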

This fix was initially generated by Seer; however, the final fix takes a completely different approach.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 15, 2025

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines               | Patch % | Lines
src/sentry/deletions/defaults/group.py | 81.81%  | 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #101545      +/-   ##
===========================================
- Coverage   80.98%    80.98%   -0.01%     
===========================================
  Files        8706      8706              
  Lines      387005    387142     +137     
  Branches    24548     24548              
===========================================
+ Hits       313413    313522     +109     
- Misses      73245     73273      +28     
  Partials      347       347              

EVENT_CHUNK_SIZE = 10000
GROUP_HASH_ITERATIONS = 10000
# Batch size for nullifying group_hash_metadata.seer_matched_grouphash_id references to avoid database timeouts
GROUP_HASH_METADATA_BATCH_SIZE = 10
Member

We're seeing updates for 100 hashes take over 30 seconds, so this should be good enough.
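
(Rough arithmetic, assuming the UPDATE cost scales roughly linearly with the number of referenced hashes: if ~100 hashes already exceed 30 seconds, batches of 10 should land well under the statement timeout.)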


iterations += 1

if iterations == GROUP_HASH_ITERATIONS:
Member

This is a drive-by change.

Contributor

Nit: I'd move this out of the loop so we're only checking it once (after the loop has finished). Or it might be clearer to have a did_break: bool flag that's set on break, with this check sitting outside the loop under if not did_break.
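
A minimal sketch of the flag-based variant being suggested here; fetch_next_chunk and delete_chunk are hypothetical stand-ins for the real query and deletion, not functions in the codebase:

import logging

logger = logging.getLogger(__name__)

GROUP_HASH_ITERATIONS = 10000  # same cap as in the diff above


def delete_group_hashes_sketch(fetch_next_chunk, delete_chunk) -> None:
    # fetch_next_chunk / delete_chunk stand in for the batched query and the
    # per-chunk deletion performed by the real loop body.
    did_break = False
    for _ in range(GROUP_HASH_ITERATIONS):
        hashes_chunk = fetch_next_chunk()
        if not hashes_chunk:
            did_break = True  # ran out of hashes before hitting the cap
            break
        delete_chunk(hashes_chunk)

    if not did_break:
        # Every allowed iteration found work, i.e. we hit the cap with hashes left over.
        logger.warning("Group hashes batch deletion reached the maximum number of iterations.")

Python's for/else would also express this, since the else branch only runs when the loop wasn't broken out of.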

for i in range(0, len(hash_ids), GROUP_HASH_METADATA_BATCH_SIZE):
    batch = hash_ids[i : i + GROUP_HASH_METADATA_BATCH_SIZE]
    GroupHashMetadata.objects.filter(
        seer_matched_grouphash_id__in=batch, seer_matched_grouphash_id__isnull=False
Member

Using seer_matched_grouphash_id__isnull=False reduces the number of rows that need updating.

Contributor

How does this reduce further beyond what the seer_matched_grouphash_id__in=batch filter is already doing? Does batch contain None / null values?

 register(
     "deletions.group-hashes-batch-size",
-    default=10000,
+    default=100,
Member

options automator uses this value.

"args": [self.project.id, error_group_hashes, 0]
}

def test_batch_nullify_seer_matched_grouphash_references(self) -> None:
Member

If you read the test, it looks like what I'm doing makes sense; however, I'm not entirely sure the way I'm associating the hashes and metadata is correct. I have some other changes locally which also don't convince me.

I would still like to get this in, as the code changes are obvious.

I will spend time talking with @lobsterkatie next week to see if what I'm doing makes sense.

Member

The closest explanation I have is this:
#83081 (comment)


# Pretend that Seer tells us that grouphash B is similar to grouphash A
grouphash_b.metadata.seer_matched_grouphash = grouphash_a
grouphash_b.metadata.save()
Member

This is a hack used instead of doing something like the following to accomplish the same thing:

with mock.patch(
    "sentry.grouping.ingest.seer.get_seer_similar_issues"
) as mock_get_seer_similar_issues:
    # Let seer similarity return that grouphash_b is similar to grouphash_a
    mock_get_seer_similar_issues.return_value = (0.01, grouphash_a)


# Grouphash B's metadata should still exist, but the reference to A should be nullified
metadata_b = GroupHashMetadata.objects.get(id=metadata_b_id)
assert metadata_b.seer_matched_grouphash is None
Member

It is now None, whereas before the assertion was assert grouphash_b.metadata.seer_matched_grouphash == grouphash_a.

@armenzg armenzg marked this pull request as ready for review October 16, 2025 19:40
@armenzg armenzg requested a review from a team as a code owner October 16, 2025 19:40
@armenzg armenzg added the Trigger: getsentry tests Once code is reviewed: apply label to PR to trigger getsentry tests label Oct 16, 2025
Comment on lines 285 to 290
if iterations == GROUP_HASH_ITERATIONS:
    metrics.incr("deletions.group_hashes.max_iterations_reached", sample_rate=1.0)
    logger.warning(
        "Group hashes batch deletion reached the maximum number of iterations. "
        "Investigate if we need to change the GROUP_HASH_ITERATIONS value."
    )
Contributor Author

Potential bug: The reduced batch size with an unchanged iteration limit can cause delete_group_hashes to silently fail, leaving orphaned data for projects with over 1M GroupHash records.
  • Description: The batch size for GroupHash deletion was reduced from 10,000 to 100, but the iteration limit GROUP_HASH_ITERATIONS remains at 10,000. This lowers the maximum number of deletable hashes in a single run from 100 million to 1 million. When this new, lower limit is reached, such as during the deletion of a large project, the function logs a warning and exits without raising an error. This silent failure leaves orphaned GroupHash records in the database, as the calling function is unaware the deletion was incomplete.

  • Suggested fix: Increase the GROUP_HASH_ITERATIONS constant to compensate for the smaller batch size, for example, to 1,000,000, to maintain the previous capacity. Alternatively, raise an exception when the iteration limit is reached to prevent silent failures and allow the caller to handle the incomplete deletion.
    severity: 0.7, confidence: 0.95

Did we get this right? 👍 / 👎 to inform future reviews.
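
A rough sketch of the second suggested fix (raising instead of only warning); the exception class is made up for illustration and does not exist in the codebase:

GROUP_HASH_ITERATIONS = 10000  # current value from the diff above


class GroupHashDeletionLimitReached(Exception):
    """Hypothetical error for when batched deletion exhausts its iteration cap."""


def check_iteration_limit(iterations: int) -> None:
    if iterations == GROUP_HASH_ITERATIONS:
        # Raising lets the caller retry or reschedule instead of silently leaving
        # orphaned GroupHash rows behind.
        raise GroupHashDeletionLimitReached(
            f"Gave up after {iterations} iterations; remaining group hashes were not deleted."
        )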


# and we need to nullify the seer_matched_grouphash_id field in the GroupHashMetadata model before deleting the GroupHash model
# to prevent the implicit ON DELETE SET NULL cascade from timing out.
# Process in small batches to avoid statement timeouts on high fan-out relationships
for i in range(0, len(hash_ids), GROUP_HASH_METADATA_BATCH_SIZE):
Contributor

Wondering if it would be more performant (and worth changing) to rewrite the loop in delete_group_hashes from:

        qs = GroupHash.objects.filter(project_id=project_id, group_id__in=group_ids).values_list(
            "id", "hash"
        )[:hashes_batch_size]
        hashes_chunk = list(qs)

to something more like this, where we could do one big query and then divvy it up over the iterative loop. (Not blocking, just wondering.)
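
For reference, a sketch of that idea (the commenter's suggestion sketched out, not the PR's code; it assumes the full id/hash list for the groups being deleted fits comfortably in memory, which is part of why this is posed as a question):

from sentry.models.grouphash import GroupHash


def iter_hash_chunks(project_id: int, group_ids: list[int], hashes_batch_size: int):
    # One query up front, then slice the result instead of re-querying each iteration.
    all_hashes = list(
        GroupHash.objects.filter(project_id=project_id, group_id__in=group_ids).values_list(
            "id", "hash"
        )
    )
    for start in range(0, len(all_hashes), hashes_batch_size):
        yield all_hashes[start : start + hashes_batch_size]

Each yielded chunk would then go through the same per-chunk deletion the existing loop body performs.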

@armenzg armenzg force-pushed the seer/fix-grouphash-deletion-timeout branch from 18f02af to 06b7efa Compare October 17, 2025 13:49
@armenzg armenzg requested a review from a team as a code owner October 17, 2025 13:49
@github-actions github-actions bot removed the Trigger: getsentry tests Once code is reviewed: apply label to PR to trigger getsentry tests label Oct 17, 2025

-    __repr__ = sane_repr("group_id", "hash")
+    __repr__ = sane_repr("group_id", "hash", "metadata")
     __str__ = __repr__
Contributor

Bug: Circular Reference in GroupHash Representation

Adding metadata to GroupHash.__repr__ and seer_matched_grouphash to GroupHashMetadata.__repr__ creates a circular reference. This causes infinite recursion when these objects are string-represented, leading to stack overflow errors and unexpected database queries.
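
A simplified, self-contained illustration of the kind of cycle the bot is describing; these are plain stand-in classes, not the real Django models or sane_repr:

class GroupHash:
    def __init__(self) -> None:
        self.metadata = None  # would be a GroupHashMetadata in the real model

    def __repr__(self) -> str:
        # Including `metadata` in the repr pulls in GroupHashMetadata.__repr__ ...
        return f"GroupHash(metadata={self.metadata!r})"


class GroupHashMetadata:
    def __init__(self, seer_matched_grouphash=None) -> None:
        self.seer_matched_grouphash = seer_matched_grouphash

    def __repr__(self) -> str:
        # ... which in turn reprs its seer_matched_grouphash, another GroupHash.
        return f"GroupHashMetadata(seer_matched_grouphash={self.seer_matched_grouphash!r})"


gh = GroupHash()
gh.metadata = GroupHashMetadata(seer_matched_grouphash=gh)
# repr(gh) would now recurse GroupHash -> metadata -> seer_matched_grouphash -> ...
# until Python raises RecursionError.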


armenzg added a commit that referenced this pull request Oct 17, 2025
The issue comes from this block:
https://github.com/getsentry/sentry/blob/a3a771719d4777bd747d98fb05eb77c20425e3d6/src/sentry/deletions/defaults/group.py#L248-L259

The update is triggered because of this `on_delete`:
https://github.com/getsentry/sentry/blob/b1f684a335128dbc74ad3a7fac1d7052df9e8f01/src/sentry/models/grouphashmetadata.py#L116-L118

Currently, when we try to delete all the group hashes, the related group hash metadata is updated first. This query ends up failing because it takes longer than 30 seconds:

> SQL: UPDATE "sentry_grouphashmetadata" SET "seer_matched_grouphash_id" = NULL WHERE "sentry_grouphashmetadata"."seer_matched_grouphash_id" IN (%s, ..., %s)

This can be resolved by deleting the group hash _metadata_ rows before trying to delete the group hash rows. This will avoid the update statement altogether.

This fix was initially started in #101545; however, the solution has completely changed, thus starting a new PR.

Fixes [SENTRY-5ABJ](https://sentry.io/organizations/sentry/issues/6930113529/).
armenzg added a commit that referenced this pull request Oct 17, 2025
…1720)

@armenzg armenzg closed this Oct 20, 2025
@armenzg armenzg deleted the seer/fix-grouphash-deletion-timeout branch October 20, 2025 13:01