[Bug]: Managed IO of Iceberg appends NULL / Unprintable characters for string type columns. #33963
Comments
I also tried using GSON in place of @ahmedabu98 Do you have any inputs here? |
Hey @saathwik-tk, I'm having trouble reproducing this one. Here's what I tried:
    PartitionSpec partitionSpec =
        PartitionSpec.builderFor(ICEBERG_SCHEMA)
            .identity("bool")
            .hour("datetime")
            .truncate("str", "value_x".length())
            .build();
    BigqueryClient bqClient = new BigqueryClient(getClass().getSimpleName());
    String q =
        String.format(
            "SELECT * FROM `%s.%s` where str = 'value_123'", OPTIONS.getProject(), tableId());
    List<TableRow> rowList = bqClient.queryUnflattened(q, OPTIONS.getProject(), true, true);
As expected, I get one record where str = value_123. I tried the same but without partitioning on "str" and it also works. |
Can you try running a simple pipeline with one or more fixed rows? Something like:
    writePipeline
        .apply(Create.of(Row.withSchema(...).addValues(...).build()))
        .apply(Managed.write(Managed.ICEBERG).withConfig(...)); |
Also are you seeing this for just String types? |
Yes I see it only for string types. |
Just tried the above and got the same result, @ahmedabu98. |
Thanks for doing that. Can you also paste the Schema you're using? |
Ignore the fact that I made it nullable (it doesn't actually matter). |
What happened?
Data Ingested via
Issue is:
After the data is ingested, the queries below return no rows, even though field_name has many values equal to 'value1':
SELECT * FROM catalog_name.namespace.table_name WHERE field_name = 'value1';
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1';
However, the queries below do return the data:
SELECT * FROM catalog_name.namespace.table_name WHERE TRIM(field_name) = 'value1'
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1'
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1%'
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1%'
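The pattern above (exact match fails, TRIM and wildcard matches succeed) is consistent with extra invisible characters being stored next to the visible text. A minimal sketch for confirming that on a value read back from the table is to print every code point outside printable ASCII as a hex escape. The class name HiddenCharInspector, its method names, and the sample value "value1\u0000" are hypothetical and only for illustration; they are not part of Beam or Iceberg, and the actual stray character in this issue is not known.

```java
public final class HiddenCharInspector {

    private HiddenCharInspector() {}

    /**
     * Returns the input with every code point outside 0x21..0x7E written as a
     * hex escape. Spaces are escaped too, so leading or trailing padding is visible.
     */
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp > 0x20 && cp < 0x7F) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("\\u%04X", cp));
            }
        });
        return out.toString();
    }

    /** True if the input contains a space, a control character, or any non-ASCII code point. */
    public static boolean hasHiddenChars(String s) {
        return s.codePoints().anyMatch(cp -> cp <= 0x20 || cp >= 0x7F);
    }

    public static void main(String[] args) {
        String expected = "value1";
        // Hypothetical stored value: the real stray character is an assumption.
        String stored = "value1\u0000";

        System.out.println(escape(expected));
        System.out.println(escape(stored));
        // An exact comparison fails even though both print the same visible text.
        System.out.println(expected.equals(stored));
    }
}
```

Running escape on a value fetched from the affected column, and on the same value in the source data, would show whether the extra characters are introduced during the write.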
I also tried the approach below and saw the same issue; that approach rules out the source data as the cause.
However, the issue does not affect all string values: it shows up in a few and not in most. Examples of the strings involved include '1234567890' and '2025-02-12' (or any date of this form stored as a string).
NOTE:
Use the same reproduction as this
Beam Version: 2.62.0
Iceberg Version: 1.4.2
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components