Core: Bulk deletion in RemoveSnapshots #11837
base: main
Conversation
Slack discussion about this: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1733215233582339
The current implementation uses the FileIO's deleteFile() even when the FileIO supports bulk operations. Although the user of the RemoveSnapshots API can provide a custom Consumer to perform bulk deletion, Iceberg itself can detect whether bulk deletion is possible on the FileIO and use it.
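A minimal sketch of that dispatch (not the PR's actual code; the BulkAwareDeleter helper is hypothetical, while SupportsBulkOperations and FileIO are existing Iceberg interfaces):

import java.util.List;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.SupportsBulkOperations;

class BulkAwareDeleter {
  // Use the FileIO's bulk API when available, otherwise fall back to per-file deletes.
  static void delete(FileIO io, List<String> pathsToDelete) {
    if (io instanceof SupportsBulkOperations) {
      // e.g. S3FileIO batches these paths into DeleteObjects requests
      ((SupportsBulkOperations) io).deleteFiles(pathsToDelete);
    } else {
      pathsToDelete.forEach(io::deleteFile);
    }
  }
}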
I'm going to suggest some tests of failure handling to see what happens there.
Hi @steveloughran ,
Hi @amogh-jahagirdar ,
@amogh-jahagirdar Would you mind taking a look? This came from a Slack discussion we had earlier. cc @pvary in case you have some capacity for this.
 * safe.
 */
public class BulkDeleteConsumer implements Consumer<String> {
  private final List<String> files = Lists.newArrayList();
I'm a bit afraid that this list could become quite big.
Could we "flush" the delete in batches?
could also make for a slow delete at the end. Ideally there'd be a page size for deletes, say 1000, and then kick off the delete in a separate thread.
Both S3A and S3FileIO have a configurable page size; s3a bulk delete is also rate limited per bucket.
+1 to flushing in batches, the list of files can be quite large for snapshot expiration. I think having a constant 1000 is fine to begin with.
I don't think it's really strictly required to kick off the delete in a separate thread, and would prefer to keep it simple at least for now. We are generally performing bulk deletes in maintenance operations which are already long running, and a good chunk of that time is spent in CPU/memory-bound computations of which files to delete rather than actually doing the deletion.
If it's a real issue I'd table that as an optimization for later.
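For concreteness, a batching variant of the consumer might look like the sketch below. This is an illustration of the suggestion, not the PR's code; the BATCH_SIZE constant of 1000 simply mirrors the number mentioned above (S3FileIO and S3A expose their own configurable page sizes for the underlying bulk request).

import java.util.List;
import java.util.function.Consumer;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

class BatchingBulkDeleteConsumer implements Consumer<String> {
  private static final int BATCH_SIZE = 1000; // constant suggested in the review thread

  private final SupportsBulkOperations ops;
  private final List<String> files = Lists.newArrayList();

  BatchingBulkDeleteConsumer(SupportsBulkOperations ops) {
    this.ops = ops;
  }

  @Override
  public void accept(String file) {
    files.add(file);
    // Flush eagerly so the buffer never grows beyond one batch.
    if (files.size() >= BATCH_SIZE) {
      flush();
    }
  }

  // Callers must invoke this once at the end to delete any remaining buffered paths.
  void flush() {
    if (files.isEmpty()) {
      return;
    }

    ops.deleteFiles(files);
    files.clear();
  }
}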
      return;
    }

    ops.deleteFiles(files);
Do we want to do retry and error handling?
Retry: no, they should do it themselves. If you add a layer of retry on top of their code, you simply double wrap the failures for exponential delays before giving up.
Do not try and be clever here. Look at the S3A connector policy, recognise how complicated it is, different policies for connectivity vs throttle vs other errors, what can be retried, how long to wait/backoff, etc etc.
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ARetryPolicy.java
double wrapping retries is a real PITA, it's bad enough that the V2 SDK has taken to retrying some things (UnknownHostException) that it never used to...doing stuff in the app makes things work.
regarding error handling: what is there to do other than report an error?
The original delete path had retries. See:
Tasks.foreach(pathsToDelete)
    .executeWith(deleteExecutorService)
    .retry(3)
    .stopRetryOn(NotFoundException.class)
    .stopOnFailure()
    .suppressFailureWhenFinished()
    .onFailure(
        (file, thrown) -> LOG.warn("Delete failed for {} file: {}", fileType, file, thrown))
    .run(deleteFunc::accept);
I think we should match the original behavior.
Yeah I think I agree with @pvary that to begin with we should probably mimic the existing delete retry behavior. In terms of error handling the deletion is all best effort. No operation should be impeded due to failure to physically delete a file off disk.
Though I understand @steveloughran's point that double wrapping retries is not good either, since we're essentially retrying 3 * num_sdk_retries on every retryable failure, which just keeps clients retrying for unnecessarily long.
I think there's a worthwhile discussion to be had in a follow-on about whether we want to tune these retry behaviors in their entirety to account for clients already performing retries.
I also don't know what the other clients such as Azure/GCS do in terms of automatic retries (since we want whatever is here to generalize across other systems).
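If we do mirror the original behavior, a hedged sketch of what that could look like for the bulk path is below: the buffered batch is treated as a single retryable unit with the same retry(3) / suppress-and-log settings as the per-file Tasks chain quoted above. The RetryingBulkFlush helper is hypothetical; Tasks and SupportsBulkOperations are the existing Iceberg utilities.

import java.util.Collections;
import java.util.List;
import org.apache.iceberg.io.SupportsBulkOperations;
import org.apache.iceberg.util.Tasks;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RetryingBulkFlush {
  private static final Logger LOG = LoggerFactory.getLogger(RetryingBulkFlush.class);

  // Best effort: failures are logged and suppressed so expiration is never blocked
  // by a file that could not be physically deleted.
  static void flush(SupportsBulkOperations ops, List<String> files) {
    Tasks.foreach(Collections.singletonList(files))
        .retry(3)
        .suppressFailureWhenFinished()
        .onFailure(
            (batch, thrown) -> LOG.warn("Bulk delete failed for {} files", batch.size(), thrown))
        .run(ops::deleteFiles);
  }
}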
commented; no actual code suggestions
Sorry for the late follow-up, I'm taking a look!
 * Consumer class to collect file paths one by one and perform a bulk deletion on them. Not thread
 * safe.
 */
public class BulkDeleteConsumer implements Consumer<String> {
Does this need to be public? It'd be ideal if this can be package private and encapsulated in the core places where it's needed.