Conversation

@tanishq-chugh (Contributor)

…e of HMS instance running initiator crash

What changes were proposed in this pull request?

Compactor cleaner fix to address duplicate directories created when multiple jobs run the same compaction

Why are the changes needed?

To address the race condition where multiple jobs running the same compaction lead to duplicate data

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing after reproducing the issue on a deployed cluster
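
To illustrate the race (the write-id range and visibility txn ids below match the test case added later in this PR; directory names are simplified rather than Hive's zero-padded form): when two workers pick up the same compaction, the partition can end up with two compacted deltas that cover the same write-id range and differ only in their visibility txn id, for example

    <table>/<partition>/
        delta_22_23_v24    <- written by the first worker
        delta_22_23_v25    <- written by a second worker that ran the same compaction

The visibilityTxnId tie-break in the comparator shown below gives such overlapping duplicates a deterministic order, which appears to be what allows the cleaner to deal with them consistently.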

        return 1;
      }
    }
    else if (visibilityTxnId != parsedDelta.visibilityTxnId) {
      return visibilityTxnId < parsedDelta.visibilityTxnId ? 1 : -1;
    }
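
Read on its own, this tie-break sorts the delta with the higher visibilityTxnId first whenever two deltas agree on everything else. Below is a minimal, self-contained sketch of that ordering; DeltaStub, DeltaOrderSketch and the minWriteId/maxWriteId comparison rules are illustrative assumptions rather than the actual Hive ParsedDelta code, and only the visibilityTxnId branch mirrors the change above.

import java.util.Arrays;

// Illustrative stand-in for a parsed compacted delta; only the fields needed for ordering.
final class DeltaStub implements Comparable<DeltaStub> {
    final long minWriteId;
    final long maxWriteId;
    final long visibilityTxnId;

    DeltaStub(long minWriteId, long maxWriteId, long visibilityTxnId) {
        this.minWriteId = minWriteId;
        this.maxWriteId = maxWriteId;
        this.visibilityTxnId = visibilityTxnId;
    }

    @Override
    public int compareTo(DeltaStub other) {
        if (minWriteId != other.minWriteId) {
            return Long.compare(minWriteId, other.minWriteId);   // assumption: lower minWriteId sorts first
        }
        if (maxWriteId != other.maxWriteId) {
            return Long.compare(other.maxWriteId, maxWriteId);   // assumption: wider delta sorts first
        }
        if (visibilityTxnId != other.visibilityTxnId) {
            // Mirrors the change above: the delta with the higher visibilityTxnId sorts first.
            return visibilityTxnId < other.visibilityTxnId ? 1 : -1;
        }
        return 0;
    }

    @Override
    public String toString() {
        return "delta_" + minWriteId + "_" + maxWriteId + "_v" + visibilityTxnId;
    }
}

public class DeltaOrderSketch {
    public static void main(String[] args) {
        // Two compacted deltas covering the same write-id range, produced by two
        // workers that ran the same compaction (same ids as in the test case below).
        DeltaStub[] deltas = {
            new DeltaStub(22L, 23L, 24L),
            new DeltaStub(22L, 23L, 25L)
        };
        Arrays.sort(deltas);
        // Prints [delta_22_23_v25, delta_22_23_v24]: a deterministic order for the duplicates.
        System.out.println(Arrays.toString(deltas));
    }
}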

@tanishq-chugh (Contributor Author)

Made this change in 8711dcd

Member

we shouldn't allow concurrent compaction with above properties

@kuczoram (Contributor)

Hi @tanishq-chugh ,
thanks a lot for this PR. It looks good, there is only one thing I am missing. Could you please add a test case for this scenario? I think in the TestCleaner tests you can define a directory structure and see if the cleaner properly cleans it up.

@tanishq-chugh (Contributor Author)

Hi @kuczoram
Thanks for pointing this out. I have added a new test case in TestCleaner for this scenario.
Added in commit - 611d9a7


@kuczoram left a comment

Thanks a lot @tanishq-chugh! It looks good to me.

@kuczoram merged commit 96cf347 into apache:master on Sep 29, 2025
2 checks passed
@deniskuzZ (Member) commented on Sep 29, 2025

hi @tanishq-chugh, how could we get into the situation with concurrent minor compactions? If that is related to compaction running on HMS, I already mentioned to you that we should just drop the support (it's already deprecated):
#6068 (comment)

    && next.minWriteId == prev.minWriteId
-   && next.statementId == prev.statementId) {
+   && next.statementId == prev.statementId
+   && (next.isDeleteDelta || prev.isDeleteDelta)) {
@deniskuzZ (Member), Sep 29, 2025

why is this change needed?

@tanishq-chugh (Contributor Author)

Hi @deniskuzZ
This is not related to compaction running on HMS, but happens when compaction workers are indeed running on HS2 itself.
I have added the detailed information regarding how it happens in the description of the Hive JIRA: HIVE-29210


// Overlapping compacted deltas with different visibilityTxnIDs simulating concurrent compaction from two workers
addDeltaFile(t, null, 22L, 23L, 2, 24);
addDeltaFile(t, null, 22L, 23L, 2, 25);

Member

I don't think we should allow this to happen in the first place

@deniskuzZ (Member)

> Hi @deniskuzZ This is not related to compaction running on HMS, but happens when compaction workers are indeed running on HS2 itself. I have added the detailed information regarding how it happens in the description of the Hive JIRA: HIVE-29210

@tanishq-chugh HIVE-29210 mentions HMS local workers

@tanishq-chugh (Contributor Author)

@deniskuzZ
> In a case with multiple HiveServer2 (HS2) instances, one of the HS2 instances may run on the same host as the Hive Metastore (HMS). In this setup, the initiator runs within HMS, while the compaction worker threads run within HS2.

The JIRA mentions the initiator in HMS, but the worker threads run within HS2.
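
For context, the topology described above roughly corresponds to settings along these lines (these are standard Hive compaction properties, but the concrete values are an assumption about such a deployment, not taken from the reported cluster):

    # on the node that hosts HMS (and, in this setup, also one of the HS2 instances)
    hive.compactor.initiator.on=true      # initiator/cleaner threads run in the metastore
    hive.metastore.runworker.in=hs2       # compaction workers run inside HS2 rather than in HMS
    hive.compactor.worker.threads=4       # number of compaction worker threads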

@deniskuzZ (Member)

@tanishq-chugh, revokeFromLocalWorkers was only needed when HMS had local workers. As we remove that support, it should just be dropped instead of adding workarounds in the code
