-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29133: Support Z-order for Iceberg tables via CREATE TABLE #6095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds Z-order support for Iceberg tables via CREATE TABLE DDL syntax, enabling improved data clustering and query performance. The implementation introduces a new syntax "WRITE LOCALLY ZORDER BY" alongside supporting infrastructure for Z-order indexing.
- Adds new DDL syntax "WRITE LOCALLY ZORDER BY" for creating Iceberg tables with Z-order sorting
- Implements Z-order functionality through a custom UDF that uses Iceberg's ZOrderByteUtils
- Extends the parser and analyzer to handle Z-order specifications in CREATE TABLE statements
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
CreateTableAnalyzer.java | Adds Z-order AST processing and JSON serialization for CREATE TABLE |
ZorderFields.java | New data structure for holding Z-order field metadata |
ZOrderFieldDesc.java | New descriptor class for individual Z-order columns |
HiveParser.g | Adds grammar rules for ZORDER syntax parsing |
GenericUDFIcebergZorder.java | New UDF implementing Z-order value calculation using Iceberg utilities |
HiveIcebergStorageHandler.java | Integrates Z-order sorting into Iceberg write operations |
BaseHiveIcebergMetaHook.java | Handles Z-order metadata persistence in table properties |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
if (arguments.length < 2) { | ||
throw new UDFArgumentException("iceberg_zorder requires at least 2 arguments"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The minimum argument check should allow for 1 argument since Z-order can be applied to a single column. Change the condition to arguments.length < 1
.
if (arguments.length < 2) { | |
throw new UDFArgumentException("iceberg_zorder requires at least 2 arguments"); | |
if (arguments.length < 1) { | |
throw new UDFArgumentException("iceberg_zorder requires at least 1 argument"); |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Z-Order is designed for multi-dimensional clustering. It can be applied for 1 column but it does not differ from normal sort. Calculating z-indices , ordering and rewriting files in case of compaction adds overhead without any value. So, it is not required to support for 1 column
if (value == null) { | ||
// For NULL values, we have primitive buffer size of 8 with values of 0 | ||
return ByteBuffer.wrap(new byte[ZOrderByteUtils.PRIMITIVE_BUFFER_SIZE]).array(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new ByteBuffer and byte array for each null value is inefficient. Consider creating a static final byte array for null values to reuse across calls.
Copilot uses AI. Check for mistakes.
} | ||
|
||
private boolean isZOrderJSON(String jsonString) { | ||
return jsonString.contains("zorderFields"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using string contains for JSON detection is unreliable and could produce false positives. Consider parsing the JSON to check for the actual structure or use a more robust detection method.
return jsonString.contains("zorderFields"); | |
try { | |
com.fasterxml.jackson.databind.JsonNode node = JSON_OBJECT_MAPPER.readTree(jsonString); | |
return node.has("zorderFields"); | |
} catch (Exception e) { | |
return false; | |
} |
Copilot uses AI. Check for mistakes.
set hive.optimize.shared.work.merge.ts.schema=true; | ||
|
||
-- Validates z-order on CREATE via clause. | ||
CREATE TABLE default.zorder_it_nulls ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we enable the z-order feature for an already created Iceberg table?
Can we use the ALTER TABLE
syntax to enable z-order for an already created Iceberg table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangbutao, what are your thoughts on #6095 (comment) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangbutao Current PR implements end-to-end functionality for ZOrder, including support for CREATE and INSERT.
I will have two more commits coming in for alter and compaction zorder support.
Next PR will focus on adding alter table support for Zorder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangbutao, what are your thoughts on #6095 (comment) ?
I am ok with the syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangbutao Current PR implements end-to-end functionality for ZOrder, including support for CREATE and INSERT. I will have two more commits coming in for alter and compaction zorder support. Next PR will focus on adding alter table support for Zorder.
@kokila-19 That's ok.
BTW, could you please supplement the SQL documentation as well? I don't want such great features to go unused due to a lack of documentation, as it's likely that many users aren't even aware of them.
To be honest, I've even forgotten a lot of the semantics we implemented for Iceberg, haha.
Perhaps it's about time we gradually improve our Hive documentation too. :)
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current PR implements end-to-end functionality for ZOrder, including support for CREATE and INSERT.
@kokila-19, what is the syntax for insert? or insert uses the schema/spec zorder?
INSERT INTO ice_orc
WRITE ORDERED BY ZORDER(id, text)
SELECT * FROM source_table;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ
Insert will use Z-Order spec stored in table properties sort.order
and sort.columns
which will be provided in INSERT/ALTER command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhangbutao Certainly, documentation is a must. I've already created a ticket for it: HIVE-29135 and I plan to add the documentation once the feature is complete.
Perhaps it's about time we gradually improve our Hive documentation too. :)
I agree that it's time we improve Hive's documentation, and I'm happy to contribute to that effort.
CREATE TABLE default.zorder_it_nulls ( | ||
id int, | ||
text string) | ||
WRITE LOCALLY ZORDER by id, text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kokila-19, syntax should be the following:
WRITE [LOCALLY] ORDERED BY zorder (id, text)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- alter cmd
ALTER TABLE ice_orc
SET WRITE [LOCALLY] ORDERED BY zorder (id, text);
- compaction cmd:
OPTIMIZE TABLE ice_orc rewrite data
zorder by (id, text); // optional, if not specified, should use zorder from table spec/schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we can kill the LOCALLY
or make it optional to be consistent with Spark
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was using existing locally ordered by syntax implemented in #5541
If i have to remove locally keyword for zorder, I believe we should remove for this as well to be consistent in hive.
Or make LOCALLY optional for both orders
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OPTIMIZE TABLE ice_orc rewrite data
zorder by (id, text); // optional, if not specified, should use zorder from table spec/schema
Note during implementation:
If the Zorder columns in table spec is different from spec provided in OPTIMIZE TABLE cmd for compaction, this should throw error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure. delta and spark iceberg support that. we can call optimize with zorder and later alter the write zorder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark only supports Z-order in rewrite_data_files (during compaction), it does not have Z-order support for DDL queries. So, they will not have this scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deniskuzZ
i think we can kill the LOCALLY or make it optional to be consistent with Spark
I will have this change in the follow up PR.
|
What changes were proposed in this pull request?
Added Z-order support for Iceberg tables via CREATE TABLE DDL
Why are the changes needed?
To support zorder indexing which will improve data clustering and query performance on Iceberg tables.
Does this PR introduce any user-facing change?
Yes , new syntax support
CREATE TABLE test_zorder (
id int,
text string)
WRITE LOCALLY ZORDER by id, text
STORED BY iceberg
STORED As orc;
How was this patch tested?
qtest