Skip to content

Conversation

kokila-19
Copy link
Contributor

@kokila-19 kokila-19 commented Sep 23, 2025

What changes were proposed in this pull request?

Added Z-order support for Iceberg tables via CREATE TABLE DDL

Why are the changes needed?

To support zorder indexing which will improve data clustering and query performance on Iceberg tables.

Does this PR introduce any user-facing change?

Yes , new syntax support
CREATE TABLE test_zorder (
id int,
text string)
WRITE LOCALLY ZORDER by id, text
STORED BY iceberg
STORED As orc;

How was this patch tested?

qtest

@kokila-19 kokila-19 changed the title HIVE-29133: Support Z-order indexing for Iceberg tables via CREATE TA… HIVE-29133: Support Z-order for Iceberg tables via CREATE TABLE Sep 23, 2025
@zhangbutao zhangbutao requested a review from Copilot September 27, 2025 13:32
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds Z-order support for Iceberg tables via CREATE TABLE DDL syntax, enabling improved data clustering and query performance. The implementation introduces a new syntax "WRITE LOCALLY ZORDER BY" alongside supporting infrastructure for Z-order indexing.

  • Adds new DDL syntax "WRITE LOCALLY ZORDER BY" for creating Iceberg tables with Z-order sorting
  • Implements Z-order functionality through a custom UDF that uses Iceberg's ZOrderByteUtils
  • Extends the parser and analyzer to handle Z-order specifications in CREATE TABLE statements

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
CreateTableAnalyzer.java Adds Z-order AST processing and JSON serialization for CREATE TABLE
ZorderFields.java New data structure for holding Z-order field metadata
ZOrderFieldDesc.java New descriptor class for individual Z-order columns
HiveParser.g Adds grammar rules for ZORDER syntax parsing
GenericUDFIcebergZorder.java New UDF implementing Z-order value calculation using Iceberg utilities
HiveIcebergStorageHandler.java Integrates Z-order sorting into Iceberg write operations
BaseHiveIcebergMetaHook.java Handles Z-order metadata persistence in table properties

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +54 to +57
if (arguments.length < 2) {
throw new UDFArgumentException("iceberg_zorder requires at least 2 arguments");
Copy link
Preview

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum argument check should allow for 1 argument since Z-order can be applied to a single column. Change the condition to arguments.length < 1.

Suggested change
if (arguments.length < 2) {
throw new UDFArgumentException("iceberg_zorder requires at least 2 arguments");
if (arguments.length < 1) {
throw new UDFArgumentException("iceberg_zorder requires at least 1 argument");

Copilot uses AI. Check for mistakes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Z-Order is designed for multi-dimensional clustering. It can be applied for 1 column but it does not differ from normal sort. Calculating z-indices , ordering and rewriting files in case of compaction adds overhead without any value. So, it is not required to support for 1 column

Comment on lines 101 to 105
if (value == null) {
// For NULL values, we have primitive buffer size of 8 with values of 0
return ByteBuffer.wrap(new byte[ZOrderByteUtils.PRIMITIVE_BUFFER_SIZE]).array();
}
Copy link
Preview

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a new ByteBuffer and byte array for each null value is inefficient. Consider creating a static final byte array for null values to reuse across calls.

Copilot uses AI. Check for mistakes.

}

private boolean isZOrderJSON(String jsonString) {
return jsonString.contains("zorderFields");
Copy link
Preview

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using string contains for JSON detection is unreliable and could produce false positives. Consider parsing the JSON to check for the actual structure or use a more robust detection method.

Suggested change
return jsonString.contains("zorderFields");
try {
com.fasterxml.jackson.databind.JsonNode node = JSON_OBJECT_MAPPER.readTree(jsonString);
return node.has("zorderFields");
} catch (Exception e) {
return false;
}

Copilot uses AI. Check for mistakes.

set hive.optimize.shared.work.merge.ts.schema=true;

-- Validates z-order on CREATE via clause.
CREATE TABLE default.zorder_it_nulls (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we enable the z-order feature for an already created Iceberg table?
Can we use the ALTER TABLE syntax to enable z-order for an already created Iceberg table?

Copy link
Member

@deniskuzZ deniskuzZ Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao, what are your thoughts on #6095 (comment) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao Current PR implements end-to-end functionality for ZOrder, including support for CREATE and INSERT.
I will have two more commits coming in for alter and compaction zorder support.
Next PR will focus on adding alter table support for Zorder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao, what are your thoughts on #6095 (comment) ?

I am ok with the syntax.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao Current PR implements end-to-end functionality for ZOrder, including support for CREATE and INSERT. I will have two more commits coming in for alter and compaction zorder support. Next PR will focus on adding alter table support for Zorder.

@kokila-19 That's ok.
BTW, could you please supplement the SQL documentation as well? I don't want such great features to go unused due to a lack of documentation, as it's likely that many users aren't even aware of them.
To be honest, I've even forgotten a lot of the semantics we implemented for Iceberg, haha.
Perhaps it's about time we gradually improve our Hive documentation too. :)
Thanks.

Copy link
Member

@deniskuzZ deniskuzZ Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current PR implements end-to-end functionality for ZOrder, including support for CREATE and INSERT.

@kokila-19, what is the syntax for insert? or insert uses the schema/spec zorder?

INSERT INTO ice_orc
WRITE ORDERED BY ZORDER(id, text)
SELECT * FROM source_table;

Copy link
Contributor Author

@kokila-19 kokila-19 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ
Insert will use Z-Order spec stored in table properties sort.order and sort.columns which will be provided in INSERT/ALTER command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangbutao Certainly, documentation is a must. I've already created a ticket for it: HIVE-29135 and I plan to add the documentation once the feature is complete.

Perhaps it's about time we gradually improve our Hive documentation too. :)

I agree that it's time we improve Hive's documentation, and I'm happy to contribute to that effort.

CREATE TABLE default.zorder_it_nulls (
id int,
text string)
WRITE LOCALLY ZORDER by id, text
Copy link
Member

@deniskuzZ deniskuzZ Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kokila-19, syntax should be the following:

WRITE [LOCALLY] ORDERED BY zorder (id, text)

Copy link
Member

@deniskuzZ deniskuzZ Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • alter cmd
ALTER TABLE ice_orc 
SET WRITE [LOCALLY] ORDERED BY zorder (id, text);
  • compaction cmd:
OPTIMIZE TABLE ice_orc rewrite data 
zorder by (id, text); // optional, if not specified, should use zorder from table spec/schema

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can kill the LOCALLY or make it optional to be consistent with Spark

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using existing locally ordered by syntax implemented in #5541
If i have to remove locally keyword for zorder, I believe we should remove for this as well to be consistent in hive.
Or make LOCALLY optional for both orders

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OPTIMIZE TABLE ice_orc rewrite data 
zorder by (id, text); // optional, if not specified, should use zorder from table spec/schema

Note during implementation:
If the Zorder columns in table spec is different from spec provided in OPTIMIZE TABLE cmd for compaction, this should throw error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure. delta and spark iceberg support that. we can call optimize with zorder and later alter the write zorder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark only supports Z-order in rewrite_data_files (during compaction), it does not have Z-order support for DDL queries. So, they will not have this scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deniskuzZ
i think we can kill the LOCALLY or make it optional to be consistent with Spark
I will have this change in the follow up PR.

Copy link

sonarqubecloud bot commented Oct 3, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants