feat: basic table scan planning #112

Open · wants to merge 24 commits into main

Conversation

@gty404 (Contributor) commented on May 27, 2025

Introduces a basic interface for scanning table data.

/// \param file_path Path to the manifest list file.
/// \return A Result containing the reader or an error.
Result<std::unique_ptr<ManifestListReader>> CreateManifestListReader(
const std::string& file_path) {
Member

I think this is not enough. At least we need extra parameters like table_format_version and file_io.
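
A rough sketch of how the extended signature might look; the parameter names and types (table_format_version, file_io) are assumptions based on this comment, not the final API:

Result<std::unique_ptr<ManifestListReader>> CreateManifestListReader(
    std::string_view file_path,          // manifest list location
    int32_t table_format_version,        // selects the v1 vs v2 manifest-list schema
    std::shared_ptr<FileIO> file_io);    // used to open file_path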

Member

BTW, should we use std::string_view for path?

Contributor Author

Should we provide a ManifestListReaderBuilder/ManifestReaderBuilder?

Member

I don't know yet. It depends on how we will use them. BTW @dongxiao1198 will work on manifest reading.

};

/// \brief Represents a task to scan a portion of a data file.
class ICEBERG_EXPORT FileScanTask : public ScanTask {
Member

Some thoughts about FileScanTask (a rough sketch follows the list):

  1. Should we remove the ScanTask abstraction above? If we remove it, we can use aggregate initialization to create a task directly; otherwise we may need to extend the constructor every time a new parameter is added.
  2. If we do (1), is it also possible to make it a simple struct by removing all member functions (they are all trivial accessors)?
  3. Should we add fields (e.g. spec and partition_value) from Java's PartitionScanTask to support partitioning? We can add them later, but a TODO comment is desirable.
  4. Should we combine start and length and wrap them in std::optional? I believe they are not required at all times.
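
A minimal sketch of points (1), (2) and (4); the field names are taken from the constructor in this PR, while the std::optional range is an assumption:

// Hypothetical sketch, not the agreed-upon design.
struct FileScanTask {
  std::shared_ptr<DataFile> data_file;
  std::vector<std::shared_ptr<DataFile>> delete_files;
  std::optional<int64_t> start;   // absent when scanning the whole file
  std::optional<int64_t> length;  // absent when scanning the whole file
  std::shared_ptr<Expression> residual;
  // TODO: spec and partition_value (point 3).
};

// Aggregate initialization then keeps working as fields are added:
// FileScanTask task{data_file, deletes, std::nullopt, std::nullopt, residual};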

Contributor Author

  1. I initially expected it to be just a struct, but since earlier comments suggested an abstraction, I followed the design in iceberg-java/iceberg-python.
  2. Partition spec and value can be obtained from DataFile and Snapshot; we can add these interfaces in a subsequent PR when needed.
  3. Sure, I will change them to optional, thanks.

Member

Partition spec and value can be obtained from DataFile and Snapshot

That's a good point

/// \brief Sets the schema to use for the scan.
/// \param schema The schema to use.
/// \return Reference to the builder.
TableScanBuilder& WithSchema(std::shared_ptr<Schema> schema);
Member

I think we don't need this. We just need the schema of a specific snapshot id, which can be obtained via table_metadata. Did I miss something?

Contributor Author

This is used to specify the projected schema without listing column names; I have renamed it to WithProjectedSchema.
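
For illustration, the builder could then expose both projection styles; these signatures are a sketch based on the discussion, not necessarily the merged code:

// Sketch only; exact signatures are assumptions.
TableScanBuilder& WithColumnNames(std::vector<std::string> column_names);
TableScanBuilder& WithProjectedSchema(std::shared_ptr<Schema> schema);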

/// \brief snapshot ID to scan, if specified.
std::optional<int64_t> snapshot_id_;
/// \brief Context for the scan, including snapshot, schema, and filter.
TableScanContext context_;
Member

In the Java version of TableScanContext, column_names_ and snapshot_id_ are also stored in it. Should we follow the same pattern? If we do this, it seems that TableScanBuilder is indeed a TableScanContextBuilder.

Contributor Author

I originally intended TableScanContext to hold only the context that remains after the input parameters have been resolved; anything not needed for the subsequent file-scanning process would be removed.

Member

Do you mean you will remove TableScanContext? It depends on whether you will reuse it to plan files for metadata tables.

Contributor Author

TableScanContext will still be relied on during subsequent file planning.

data_entry.sequence_number.value_or(TableMetadata::kInitialSequenceNumber);
for (auto it = sequence_index.lower_bound(data_sequence_number);
it != sequence_index.end(); ++it) {
// Additional filtering logic here
Member

What is the additional filtering logic? Did you mean to further check if the delete files can be filtered?

Contributor Author

A DataFile only needs to retain DeleteFiles with a sequence number greater than its own, right?

Contributor

This differs per equality and positional deletes. I think there is a pretty good overview here: https://iceberg.apache.org/spec/#scan-planning
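
Per the scan-planning section of the spec, the applicability rule differs by delete type; a minimal sketch of that check follows (the helper name, enum name, and field names are assumptions, not code from this PR):

// Positional deletes apply when delete_seq >= data_seq;
// equality deletes apply only when delete_seq > data_seq.
bool AppliesTo(const ManifestEntry& delete_entry, int64_t data_sequence_number) {
  const int64_t delete_seq = delete_entry.sequence_number.value_or(
      TableMetadata::kInitialSequenceNumber);
  if (delete_entry.data_file->content == DataFileContent::kPositionDeletes) {
    return delete_seq >= data_sequence_number;
  }
  return delete_seq > data_sequence_number;  // equality deletes
}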

return sizeInBytes;
}

int32_t FileScanTask::files_count() const {
Member

I'm not sure if we need to rename it to FilesCount(). @lidavidm, any suggestion?

/// \brief Plans the scan tasks by resolving manifests and data files.
/// \return A Result containing scan tasks or an error.
virtual Result<std::vector<std::shared_ptr<FileScanTask>>> PlanFiles() const = 0;
Member

Suggested change
virtual Result<std::vector<std::shared_ptr<FileScanTask>>> PlanFiles() const = 0;
virtual Result<std::vector<std::shared_ptr<ScanTask>>> PlanTasks() const = 0;

I remember that @lishuxu has commented to rename it to PlanTasks. Do we need to modify the signature as above?

Contributor Author

In iceberg-java, planFiles produces a one-to-one correspondence between whole files and FileScanTasks, while planTasks splits the planFiles result: each ScanTask may then contain multiple files, or only part of one file. I am not sure the split feature is needed yet, so I did not use PlanTasks.

};

/// \brief A scan that reads data files and applies delete files to filter rows.
class ICEBERG_EXPORT DataScan : public TableScan {
Member (@wgtmac, Jul 3, 2025)

I'm a little bit confused about the name of Scan and ScanTask across different implementations. Should this be DataTableScan which produces FileScanTask? For DataScan, I think it should produce a group of DataTask which contains rows of FileScanTask.

Simply put:
Scan -> ScanTask
TableScan -> FileScanTask
DataScan -> DataTask

DataScan inherits TableScan inherits Scan
DataTask inherits FileScanTask inherits ScanTask

WDYT? @gty404 @Fokko
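
A bare sketch of that proposed hierarchy (class bodies and members omitted; purely illustrative):

class ScanTask {};                         // produced by Scan
class FileScanTask : public ScanTask {};   // produced by TableScan
class DataTask : public FileScanTask {};   // produced by DataScan (rows)

class Scan {};                             // base scan
class TableScan : public Scan {};          // plans FileScanTask
class DataScan : public TableScan {};      // produces DataTask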

Member

BTW, I think we can constantly evolve this design because APIs can be unstable before the 1.0.0 release.

Contributor

At PyIceberg we (tried to) copy the Java structure, but in the end I think it is too much OOP for Python. Maybe good to start small in C++ as well. While we can change APIs until 1.0.0, I think it is important to get this one right pretty early on, since this is the main integration point for query engines.

Comment on lines 89 to 97
FileScanTask::FileScanTask(std::shared_ptr<DataFile> file,
                           std::vector<std::shared_ptr<DataFile>> delete_files,
                           int64_t start, int64_t length,
                           std::shared_ptr<Expression> residual)
    : data_file_(std::move(file)),
      delete_files_(std::move(delete_files)),
      start_(start),
      length_(length),
      residual_(std::move(residual)) {}
Contributor

It looks like this FileScanTask is inspired by PyIceberg, but I think it might be better to follow Java: https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/FileScanTask.java

The name FileScanTask implies to me that it will read a full file, in which case start and end do not make sense. They only matter when you want to split up the work by row group rather than by file.

Contributor Author

Yes, the start and end are not required, I will remove them from FileScanTask.

/// \param file_path Path to the manifest file.
/// \return A Result containing the reader or an error.
Result<std::unique_ptr<ManifestReader>> CreateManifestReader(
const std::string_view& file_path) {
Member

ditto


private:
/// \brief Index by sequence number for quick filtering
std::multimap<int64_t, ManifestEntry*> sequence_index;
Member

Suggested change
std::multimap<int64_t, ManifestEntry*> sequence_index;
std::multimap<int64_t, ManifestEntry*> sequence_index_;

Shouldn't private member variables have a trailing underscore? Besides, why not std::map<int64_t, std::vector<const ManifestEntry*>>?
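
For comparison, the grouped alternative might look like the following sketch (the build step is assumed to live in the constructor):

// Group delete entries by sequence number instead of using a flat multimap.
std::map<int64_t, std::vector<const ManifestEntry*>> sequence_index_;

// Assumed build step:
for (const auto& entry : entries) {
  const int64_t seq =
      entry->sequence_number.value_or(TableMetadata::kInitialSequenceNumber);
  sequence_index_[seq].push_back(entry.get());
}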

Contributor Author

It is for the convenience of searching by sequence as mentioned earlier.

Comment on lines +56 to +57
for (auto it = sequence_index.lower_bound(data_sequence_number);
it != sequence_index.end(); ++it) {
Member

Is this incorrect? It finds the lower bound and then traverses all sequence numbers from data_sequence_number upward.

Contributor Author

Yes, the meaning here is to find all DeleteFiles corresponding to this DataFile. Only those with a sequence number higher than the DataFile need to be read.

Contributor

Higher or equal for positional deletes: https://iceberg.apache.org/spec/#scan-planning

Comment on lines 134 to 135
column_names_.reserve(column_names.size());
column_names_ = std::move(column_names);
Member

We don't need reserve before the move. A move assignment usually just transfers the buffer from column_names to column_names_.
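
A small self-contained illustration of why the reserve is redundant; move assignment takes over the source vector's buffer rather than filling pre-reserved storage:

#include <string>
#include <utility>
#include <vector>

int main() {
  std::vector<std::string> column_names = {"id", "name", "ts"};
  std::vector<std::string> column_names_;
  // No reserve needed: the move assignment steals column_names' buffer and
  // leaves column_names in a valid but unspecified (typically empty) state.
  column_names_ = std::move(column_names);
  return 0;
}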

return InvalidArgument("Schema {} in snapshot {} is not found",
*snapshot->schema_id, snapshot->snapshot_id);
}
auto schema = *it;
Member

Suggested change
auto schema = *it;
const auto& schema = *it;

auto matched_deletes = GetMatchedDeletes(*data_entry, delete_file_index);
const auto& data_file = data_entry->data_file;
tasks.emplace_back(std::make_shared<FileScanTask>(
data_file, std::move(matched_deletes), std::move(residual)));
Member

Suggested change
data_file, std::move(matched_deletes), std::move(residual)));
data_file, std::move(matched_deletes), residual));

Seems that we cannot move residual because it's used multiple times in the loop?

Contributor Author

Yes, this has not been implemented yet, and it is expected that this residual will be shared among all tasks.
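
Assuming residual is a std::shared_ptr<Expression>, copying it per task is a cheap reference-count bump and keeps the expression shared across all tasks; a sketch of the loop under that assumption (loop variable names are illustrative):

for (const auto& data_entry : data_entries) {
  auto matched_deletes = GetMatchedDeletes(*data_entry, delete_file_index);
  // Copy the shared_ptr instead of moving it, so every task shares one residual.
  tasks.emplace_back(std::make_shared<FileScanTask>(
      data_entry->data_file, std::move(matched_deletes), residual));
}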

Member @wgtmac left a comment

I just have a comment w.r.t. the table scan name. Elsewhere LGTM.

};

/// \brief A scan that reads data files and applies delete files to filter rows.
class ICEBERG_EXPORT DataScan : public TableScan {
Member

Suggested change
class ICEBERG_EXPORT DataScan : public TableScan {
class ICEBERG_EXPORT DataTableScan : public TableScan {

I think DataScan is ambiguous and people may assume that it will return rows of data instead of the file list. DataTable better differentiates it from MetadataTable.


namespace {
/// \brief Use indexed data structures for efficient lookups
class DeleteFileIndex {
Member

Intuitively this could be built once (push all elements and sort them once), and then each query would just return a slice of it. But this also looks good to me.
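
A rough sketch of the build-once idea: collect (sequence number, entry) pairs, sort them once, and answer each query with a lower_bound slice (class and method names are assumptions):

#include <algorithm>
#include <cstdint>
#include <memory>
#include <span>
#include <utility>
#include <vector>

class SortedDeleteFileIndex {
 public:
  explicit SortedDeleteFileIndex(
      const std::vector<std::unique_ptr<ManifestEntry>>& entries) {
    for (const auto& entry : entries) {
      const int64_t seq =
          entry->sequence_number.value_or(TableMetadata::kInitialSequenceNumber);
      index_.emplace_back(seq, entry.get());
    }
    std::sort(index_.begin(), index_.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
  }

  // All entries whose sequence number is >= data_sequence_number.
  std::span<const std::pair<int64_t, const ManifestEntry*>> From(
      int64_t data_sequence_number) const {
    auto it = std::lower_bound(
        index_.begin(), index_.end(), data_sequence_number,
        [](const auto& pair, int64_t seq) { return pair.first < seq; });
    return {it, index_.end()};
  }

 private:
  std::vector<std::pair<int64_t, const ManifestEntry*>> index_;
};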

}

int64_t FileScanTask::SizeBytes() const {
int64_t sizeInBytes = data_file_->file_size_in_bytes;
Member

Suggested change
int64_t sizeInBytes = data_file_->file_size_in_bytes;
int64_t size_in_bytes = data_file_->file_size_in_bytes;

table_metadata->table_uuid);
}
auto iter = std::ranges::find_if(table_metadata->snapshots,
[&snapshot_id](const auto& snapshot) {
Member

Use the form below so that *snapshot_id is evaluated only once:

Suggested change
[&snapshot_id](const auto& snapshot) {
[id = *snapshot_id](const auto& snapshot) {

}

const auto& schemas = table_metadata->schemas;
const auto it = std::ranges::find_if(schemas, [&schema_id](const auto& schema) {
Member

ditto

// TODO(gty404): support case-insensitive column names
auto field_opt = schema->GetFieldByName(column_name);
if (!field_opt) {
return InvalidArgument("Column {} not found in schema", column_name);
Member

Should we add the schema to the error message?

@@ -107,4 +108,8 @@ const std::vector<SnapshotLogEntry>& Table::history() const {

const std::shared_ptr<FileIO>& Table::io() const { return io_; }

std::unique_ptr<TableScanBuilder> Table::NewScan() const {
return std::make_unique<TableScanBuilder>(metadata_, io_);
Contributor

How about passing in the Table instead? It has all the metadata and also the io.

class DeleteFileIndex {
public:
/// \brief Build the index from a list of manifest entries.
explicit DeleteFileIndex(const std::vector<std::unique_ptr<ManifestEntry>>& entries) {
Contributor

Do we want to add this right away, or defer this to a later PR? Previously at PyIceberg we threw an exception when we encountered a positional or equality delete.
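
If the decision is to defer, one option mirroring PyIceberg is to fail fast while planning when a delete entry is encountered; a hedged sketch (the NotImplemented error helper and the content enum are assumptions about this codebase):

// Inside a planning function returning Result<...>, before building the index:
for (const auto& entry : entries) {
  if (entry->data_file->content != DataFileContent::kData) {
    return NotImplemented("Delete files are not supported yet: {}",
                          entry->data_file->file_path);
  }
}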

Comment on lines +81 to +82
// TODO(gty404): check if the delete entry contains the data entry's file path
matched_deletes.emplace_back(delete_entry->data_file);
Contributor

Without this filter, it will likely explode the number of relevant entries for each of the data files.
