
Conversation

@ghareeb-falazi

@ghareeb-falazi ghareeb-falazi commented Oct 23, 2025

[figure: design diagram]

This PR aims to create library instrumentation for Apache Iceberg scan metrics. The figure above summarizes the planned solution (the green classes constitute the instrumentation).

The Apache Iceberg API currently emits two types of metrics: ScanMetrics, emitted when a table scan is executed, and CommitMetrics, emitted when modifications to the data or metadata are executed, e.g., insert, update, delete, drop column, time travel, etc.
ScanMetrics are straightforward to report using a custom reporter: we can programmatically inject such a reporter into an existing Scan object as described here. In contrast, CommitMetrics are difficult to report using a custom reporter because there is no programmatic interface for injecting one; only a configuration-based approach exists for this purpose, as described here. This PR focuses on ScanMetrics only and creates a library-based instrumentation for them.
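
For context, a scan-only reporter could look roughly like this (a minimal sketch against Iceberg 1.x's `MetricsReporter` interface, requiring iceberg-core on the classpath; the class name and forwarding logic are illustrative, not the PR's actual implementation):

```java
import org.apache.iceberg.metrics.MetricsReport;
import org.apache.iceberg.metrics.MetricsReporter;
import org.apache.iceberg.metrics.ScanReport;

// Sketch: a reporter that handles only scan reports.
public class ScanOnlyMetricsReporter implements MetricsReporter {
  @Override
  public void report(MetricsReport report) {
    // Commit reports are deliberately ignored: this PR covers scans only.
    if (report instanceof ScanReport) {
      ScanReport scanReport = (ScanReport) report;
      // Illustrative: forward the table name and scan metrics to the
      // telemetry backend of choice instead of printing them.
      System.out.println(
          "scan of " + scanReport.tableName() + ": " + scanReport.scanMetrics());
    }
  }
}
```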

CommitMetrics and agent-based instrumentation will be the subject of future PRs. Therefore, this PR partially solves #15113.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 23, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: ghareeb-falazi / name: Ghareeb Falazi (aee6f7b, b6ee75e)
  • ✅ login: jaydeluca / name: Jay DeLuca (aee6f7b)

@ghareeb-falazi ghareeb-falazi marked this pull request as ready for review November 4, 2025 13:57
@ghareeb-falazi ghareeb-falazi requested a review from a team as a code owner November 4, 2025 13:57
Comment on lines +7 to +11
implementation("org.apache.iceberg:iceberg-core:1.8.1") {
artifact {
classifier = "tests"
}
}
Member


why is this second implementation needed?

Author

@ghareeb-falazi ghareeb-falazi Nov 4, 2025


@jaydeluca In the abstract unit test class AbstractIcebergTest, I am using the public classes TestTables and TestTable that are defined in the src/test/java directory of the iceberg-core project, so the second implementation is meant to give me access to these classes in the unit tests of the testing project.
As a side note: I am not really a Gradle expert, so a better approach may well exist...

Member


This might be ok, but it might be brittle for us to have a dependency on an artifact that they are not publicly publishing. I don't know anything about Iceberg, but what if we replaced the test setup with Hadoop? If we add the implementation dependency here:

implementation("org.apache.hadoop:hadoop-common:3.4.2")

and then instead of using the test table, you do:

    // Assumes org.apache.hadoop.conf.Configuration, org.apache.iceberg.hadoop.HadoopTables,
    // java.io.File, and java.util.Collections are imported, and that SCHEMA, SPEC,
    // FORMAT_VERSION, and tableDir are defined by the test class.
    Configuration conf = new Configuration();
    this.tables = new HadoopTables(conf);
    String tableLocation = new File(tableDir, "test").toString();
    this.table =
        tables.create(
            SCHEMA,
            SPEC,
            Collections.singletonMap("format-version", String.valueOf(FORMAT_VERSION)),
            tableLocation);

Author


@jaydeluca yes this works, very good idea! thanks!

Author

@ghareeb-falazi ghareeb-falazi Nov 22, 2025


@jaydeluca It turns out that Hadoop won't work on a newer JDK without modifying some security settings (see the screenshot for the concrete error). Therefore, I added maxJavaVersionForTests.set(JavaVersion.VERSION_21) to build.gradle.kts, and with this, the tests work fine.
Alternatively, we can go back to depending on the Iceberg test library if you want. I don't see a problem with that, since the dependency is only used for testing, and we control the exact Iceberg version that we are testing against. What do you think?
[screenshot: error message]
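
For reference, the cap mentioned above would look roughly like this in the module's build.gradle.kts (a sketch; otelJava is assumed to be this repository's Gradle convention block exposing maxJavaVersionForTests):

```kotlin
otelJava {
  // Hadoop's test setup breaks on newer JDKs without extra security
  // settings, so cap the JVM version used for tests at 21.
  maxJavaVersionForTests.set(JavaVersion.VERSION_21)
}
```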

Member


I think using the test dependency is probably preferable to not testing against newer versions of Java, so it's probably best to switch back. Can you just include a small comment above the Gradle config saying something like "Test classes are not published by default", or something similar that explains why it is necessary?

Author


@jaydeluca I reverted to using the Iceberg test classes and added the required comment in the Gradle file.
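
The resulting declaration would look something like this (a sketch; the comment wording and version are illustrative):

```kotlin
dependencies {
  // Iceberg's TestTables helpers live in iceberg-core's test sources;
  // test classes are not published by default, only under the "tests"
  // classifier, so we request that artifact explicitly.
  implementation("org.apache.iceberg:iceberg-core:1.8.1") {
    artifact {
      classifier = "tests"
    }
  }
}
```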

Contributor

@laurit laurit left a comment

