Core: Bulk deletion in RemoveSnapshots #11837

Open
gaborkaszab wants to merge 1 commit into main from main_bulk_delete_in_remove_snapshots
Conversation

gaborkaszab (Collaborator)

The current implementation uses the FileIO's deleteFile() even when the FileIO supports bulk operations. Although the user of the RemoveSnapshots API can provide a custom Consumer to perform bulk deletion, Iceberg itself can detect whether the FileIO supports bulk deletion and use it when available.
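A minimal sketch of the idea, assuming the standard org.apache.iceberg.io interfaces; the DeleteDispatch class and its deleteFiles helper below are illustrative only, not the PR's actual code:

import java.util.List;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.SupportsBulkOperations;

class DeleteDispatch {
  static void deleteFiles(FileIO io, List<String> pathsToDelete) {
    if (io instanceof SupportsBulkOperations) {
      // The FileIO advertises bulk support, so issue one bulk call;
      // implementations such as S3FileIO batch the request internally.
      ((SupportsBulkOperations) io).deleteFiles(pathsToDelete);
    } else {
      // Per-file fallback for FileIO implementations without bulk support.
      pathsToDelete.forEach(io::deleteFile);
    }
  }
}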

@github-actions github-actions bot added the core label Dec 20, 2024
@gaborkaszab gaborkaszab force-pushed the main_bulk_delete_in_remove_snapshots branch from 0c875f9 to aafd0fa Compare December 20, 2024 15:57
@gaborkaszab gaborkaszab force-pushed the main_bulk_delete_in_remove_snapshots branch 2 times, most recently from c3807ba to 87083d5 Compare January 13, 2025 10:28
@gaborkaszab gaborkaszab force-pushed the main_bulk_delete_in_remove_snapshots branch from 87083d5 to d0638e5 Compare January 13, 2025 11:10
@steveloughran (Contributor)

I'm going to suggest some tests of failure handling to see what happens there.

@gaborkaszab (Collaborator, Author)

> I'm going to suggest some tests of failure handling to see what happens there

Hi @steveloughran,
The expire snapshot tests don't exercise the differences between FileIO implementations. The main point of these tests is to verify that the desired interface is called with the desired parameters, but they use TestTableOperations with LocalFileIO.

@gaborkaszab (Collaborator, Author)

Hi @amogh-jahagirdar,
Would you mind taking a look? This PR came up in a conversation with you on the Iceberg Slack: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1733215233582339

@gaborkaszab (Collaborator, Author)

@amogh-jahagirdar Would you mind taking a look? This came from a Slack discussion we had earlier.

cc @pvary in case you have some capacity for this.
Thanks!

* safe.
*/
public class BulkDeleteConsumer implements Consumer<String> {
private final List<String> files = Lists.newArrayList();
Contributor:

I'm a bit afraid that this list could become quite big.
Could we "flush" the delete in batches?

Contributor:

could also make for a slow delete at the end. Ideally there'd be a page size for deletes, say 1000, and then kick off the delete in a separate thread.

Both S3A and S3FileIO have a configurable page size; s3a bulk delete is also rate limited per bucket.

Contributor:

+1 to flushing in batches, the list of files can be quite large for snapshot expiration. I think having a constant 1000 is fine to begin with.

I don't think it's strictly required to kick off the delete in a separate thread, and I would prefer to keep it simple at least for now. We generally perform bulk deletes in maintenance operations which are already long running, and a good chunk of that time is spent in CPU/memory-bound computation of which files to delete rather than actually doing the deletion.

If it's a real issue I'd table that as an optimization for later.
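A hedged sketch of the batching discussed above, assuming a fixed page size of 1000 and an underlying SupportsBulkOperations FileIO; the class name, fields, and flush() method are illustrative, not the PR's code:

import java.util.List;
import java.util.function.Consumer;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

class BatchingBulkDeleteConsumer implements Consumer<String> {
  private static final int BATCH_SIZE = 1000;

  private final SupportsBulkOperations io;
  private final List<String> files = Lists.newArrayList();

  BatchingBulkDeleteConsumer(SupportsBulkOperations io) {
    this.io = io;
  }

  @Override
  public void accept(String file) {
    files.add(file);
    if (files.size() >= BATCH_SIZE) {
      flush(); // bound memory by deleting in pages instead of one call at the end
    }
  }

  // Must be called once more after the last accept() to delete any remainder.
  void flush() {
    if (files.isEmpty()) {
      return;
    }

    io.deleteFiles(files);
    files.clear();
  }
}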

return;
}

ops.deleteFiles(files);
Contributor:

Do we want to do retry, error handling?

Contributor:

Retry: no, they should do it themselves. If you add a layer of retry on top of their code, you simply double wrap the failures for exponential delays before giving up.

Do not try and be clever here. Look at the S3A connector policy, recognise how complicated it is, different policies for connectivity vs throttle vs other errors, what can be retried, how long to wait/backoff, etc etc.
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java

Double wrapping retries is a real PITA; it's bad enough that the V2 SDK has taken to retrying some things (UnknownHostException) that it never used to... adding another layer of retries in the app only makes things worse.

regarding error handling: what is there to do other than report an error?

Contributor:

The original delete path had retries. See:

Tasks.foreach(pathsToDelete)
          .executeWith(deleteExecutorService)
          .retry(3)
          .stopRetryOn(NotFoundException.class)
          .stopOnFailure()
          .suppressFailureWhenFinished()
          .onFailure(
              (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
          .run(deleteFunc::accept);

I think we should match the original behavior.
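For the bulk path, a sketch of what matching that behavior could look like, assuming page-sized batches and a SupportsBulkOperations FileIO; the wrapper class, batching, and log message are illustrative only, not the PR's code:

import java.util.List;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.util.Tasks;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RetryingBulkDeletes {
  private static final Logger LOG = LoggerFactory.getLogger(RetryingBulkDeletes.class);

  static void deleteBatches(SupportsBulkOperations io, Iterable<List<String>> batches) {
    Tasks.foreach(batches)
        .retry(3)
        .stopOnFailure()
        .suppressFailureWhenFinished()
        .onFailure(
            (batch, thrown) -> LOG.warn("Bulk delete failed for {} files", batch.size(), thrown))
        .run(io::deleteFiles); // each retry re-issues the whole batch
  }
}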

Contributor:

Yeah I think I agree with @pvary that to begin with we should probably mimic the existing delete retry behavior. In terms of error handling the deletion is all best effort. No operation should be impeded due to failure to physically delete a file off disk.

Though I understand @steveloughran's point that double wrapping retries is not good either, since we're essentially retrying 3 * num_sdk_retries on every retryable failure, which just keeps clients up for unnecessarily long.

I think there's a worthwhile discussion to be had in a follow-on about whether we want to tune these retry behaviors in their entirety to account for clients already performing retries.

I also don't know what the other clients such as Azure/GCS do in terms of automatic retries (since we want whatever is here to generalize across other systems).

@steveloughran (Contributor) left a comment

commented; no actual code suggestions


@amogh-jahagirdar (Contributor)

Sorry for the late follow-up, I'm taking a look!

* Consumer class to collect file paths one by one and perform a bulk deletion on them. Not thread
* safe.
*/
public class BulkDeleteConsumer implements Consumer<String> {
Contributor:

Does this need to be public? It'd be ideal if this can be package private and encapsulated in the core places where it's needed.

