Core: Bulk deletion in RemoveSnapshots #11837
base: main
Conversation
Slack discussion about this: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1733215233582339
The current implementation uses the FileIO's deleteFile() even when the FileIO supports bulk operations. Although the user of the RemoveSnapshots API can provide a custom Consumer to perform bulk deletion, Iceberg itself can detect whether bulk deletion is possible on the FileIO and use it.
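A minimal sketch of that dispatch (not the PR's actual code; the BulkAwareDeleter helper is hypothetical, while SupportsBulkOperations and FileIO are existing Iceberg interfaces):

import java.util.List;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.SupportsBulkOperations;

class BulkAwareDeleter {
  // Use the FileIO's bulk API when available, otherwise fall back to per-file deletes.
  static void delete(FileIO io, List<String> pathsToDelete) {
    if (io instanceof SupportsBulkOperations) {
      // e.g. S3FileIO batches these paths into DeleteObjects requests
      ((SupportsBulkOperations) io).deleteFiles(pathsToDelete);
    } else {
      pathsToDelete.forEach(io::deleteFile);
    }
  }
}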
I'm going to suggest some tests of failure handling to see what happens there.
Hi @steveloughran ,
Hi @amogh-jahagirdar ,
@amogh-jahagirdar Would you mind taking a look? This came from a Slack discussion we had earlier. cc @pvary in case you have some capacity for this.
 * safe.
 */
public class BulkDeleteConsumer implements Consumer<String> {
  private final List<String> files = Lists.newArrayList();
I'm a bit afraid that this list could become quite big.
Could we "flush" the delete in batches?
could also make for a slow delete at the end. Ideally there'd be a page size for deletes, say 1000, and then kick off the delete in a separate thread.
Both S3A and S3FileIO have a configurable page size; s3a bulk delete is also rate limited per bucket.
+1 to flushing in batches, the list of files can be quite large for snapshot expiration. I think having a constant 1000 is fine to begin with.
I don't think it's really strictly required to kick off the delete in a separate thread, and would prefer to keep it simple at least for now. We are generally performing bulk deletes in maintenance operations which are already long running, and a good chunk of that time is spent in CPU/memory-bound computations of which files to delete rather than actually doing the deletion.
If it's a real issue I'd table that as an optimization for later.
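For concreteness, a batching variant of the consumer might look like the sketch below. This is an illustration of the suggestion, not the PR's code; the BATCH_SIZE constant of 1000 simply mirrors the number mentioned above (S3FileIO and S3A expose their own configurable page sizes for the underlying bulk request).

import java.util.List;
import java.util.function.Consumer;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

class BatchingBulkDeleteConsumer implements Consumer<String> {
  private static final int BATCH_SIZE = 1000; // constant suggested in the review thread

  private final SupportsBulkOperations ops;
  private final List<String> files = Lists.newArrayList();

  BatchingBulkDeleteConsumer(SupportsBulkOperations ops) {
    this.ops = ops;
  }

  @Override
  public void accept(String file) {
    files.add(file);
    // Flush eagerly so the buffer never grows beyond one batch.
    if (files.size() >= BATCH_SIZE) {
      flush();
    }
  }

  // Callers must invoke this once at the end to delete any remaining buffered paths.
  void flush() {
    if (files.isEmpty()) {
      return;
    }

    ops.deleteFiles(files);
    files.clear();
  }
}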
      return;
    }

    ops.deleteFiles(files);
Do we want to do retry and error handling?
Retry: no, they should do it themselves. If you add a layer of retry on top of their code, you simply double wrap the failures for exponential delays before giving up.
Do not try and be clever here. Look at the S3A connector policy, recognise how complicated it is, different policies for connectivity vs throttle vs other errors, what can be retried, how long to wait/backoff, etc etc.
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java
double wrapping retries is a real PITA, it's bad enough that the V2 SDK has taken to retrying some things (UnknownHostException) that it never used to...doing stuff in the app makes things work.
regarding error handling: what is there to do other than report an error?
The original delete path had retries. See:
Tasks.foreach(pathsToDelete)
    .executeWith(deleteExecutorService)
    .retry(3)
    .stopRetryOn(NotFoundException.class)
    .stopOnFailure()
    .suppressFailureWhenFinished()
    .onFailure(
        (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
    .run(deleteFunc::accept);
I think we should match the original behavior.
Yeah I think I agree with @pvary that to begin with we should probably mimic the existing delete retry behavior. In terms of error handling the deletion is all best effort. No operation should be impeded due to failure to physically delete a file off disk.
Though I understand @steveloughran's point that double wrapping retries is not good either, since we're essentially retrying 3 * num_sdk_retries on every retryable failure, which just keeps clients retrying for unnecessarily long.
I think there's a worthwhile discussion to be had in a follow-on about whether we want to tune these retry behaviors in their entirety to account for clients already performing retries.
I also don't know what the other clients such as Azure/GCS do in terms of automatic retries (since we want whatever is here to generalize across other systems).
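If we do mirror the original behavior, a hedged sketch of what that could look like for the bulk path is below: the buffered batch is treated as a single retryable unit with the same retry(3) / suppress-and-log settings as the per-file Tasks chain quoted above. The RetryingBulkFlush helper is hypothetical; Tasks and SupportsBulkOperations are the existing Iceberg utilities.

import java.util.Collections;
import java.util.List;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.util.Tasks;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RetryingBulkFlush {
  private static final Logger LOG = LoggerFactory.getLogger(RetryingBulkFlush.class);

  // Best effort: failures are logged and suppressed so expiration is never blocked
  // by a file that could not be physically deleted.
  static void flush(SupportsBulkOperations ops, List<String> files) {
    Tasks.foreach(Collections.singletonList(files))
        .retry(3)
        .suppressFailureWhenFinished()
        .onFailure(
            (batch, thrown) -> LOG.warn("Bulk delete failed for {} files", batch.size(), thrown))
        .run(ops::deleteFiles);
  }
}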
commented; no actual code suggestions
Sorry for the late follow-up, I'm taking a look!
 * Consumer class to collect file paths one by one and perform a bulk deletion on them. Not thread
 * safe.
 */
public class BulkDeleteConsumer implements Consumer<String> {
Does this need to be public? It'd be ideal if this can be package private and encapsulated in the core places where it's needed.