[Bug]: Managed IO of Iceberg appends NULL / Unprintable characters for string type columns. #33963

saathwik-tk · 2025-02-12T07:38:22Z

What happened?

The data is ingested with the following pipeline:

pipeline.apply(<Source>)
              .apply(JsonToRow.withSchema(mySchema))
              .apply(Managed.write(Managed.ICEBERG).withConfig(myConfig))

The issue:
After the data is ingested, the queries below return no rows, even though field_name contains many rows with the value 'value1':

SELECT * FROM catalog_name.namespace.table_name WHERE field_name = 'value1';
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1';

However, the queries below do return the data:

SELECT * FROM catalog_name.namespace.table_name WHERE TRIM(field_name) = 'value1'
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1'
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like 'value1%'
SELECT * FROM catalog_name.namespace.table_name WHERE field_name like '%value1%'
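One way to check whether a value really carries invisible characters, as the issue title suggests, is to print its code points after reading it back. The sketch below is only illustrative: the class `StringInspector` and the helper `reveal` are hypothetical names, and the trailing NUL in the sample is an assumed example, not a value confirmed in this thread.

```java
final class StringInspector {
    // Hypothetical helper: keeps printable ASCII (0x20..0x7E) as-is and renders
    // every other code point as a <U+XXXX> marker so it becomes visible.
    static String reveal(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp >= 0x20 && cp <= 0x7E) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("<U+%04X>", cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String clean = "value1";
        String padded = "value1\0"; // assumed example: a trailing NUL character
        System.out.println(reveal(clean));        // value1
        System.out.println(reveal(padded));       // value1<U+0000>
        System.out.println(clean.equals(padded)); // false
    }
}
```

If the values read back from the table print without any markers, the stored strings are probably clean and the mismatch would have to come from somewhere other than the string contents.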

I also tried the approach below, which trims every string value before writing, and saw the same issue. This rules out the source data as the cause.

pipeline.apply(<Source>)
              .apply(JsonToRow.withSchema(mySchema)).setCoder(RowCoder.of(mySchema))
              .apply(ParDo.of(new DoFn<Row, Row>() {
                    @ProcessElement
                    public void processFn(@Element Row row, OutputReceiver<Row> out) {
                        // Trim every String value; pass all other values through unchanged.
                        List<Object> cleanedValues = mySchema.getFields().stream()
                                .map(field -> {
                                    Object value = row.getValue(field.getName());
                                    if (value instanceof String) {
                                        return ((String) value).trim();
                                    }
                                    return value;
                                })
                                .collect(Collectors.toList());
                        // Rebuild the row with the same schema and the trimmed values.
                        Row trimmedRow = Row.withSchema(mySchema)
                                .addValues(cleanedValues)
                                .build();
                        out.output(trimmedRow);
                    }
                })).setCoder(RowCoder.of(mySchema))
              .apply(Managed.write(Managed.ICEBERG).withConfig(myConfig));

The issue does not affect every string value: most values are fine and only a few are affected. Affected strings include '1234567890' and date-like strings such as '2025-02-12'.
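For context on what the trimming step above can and cannot rule out: Java's `String.trim()` removes leading and trailing characters with code points up to U+0020, which includes NUL and other control characters, while `String.strip()` (Java 11+) removes only Unicode whitespace. The sample values in this small sketch are assumptions for illustration, not data from the affected table.

```java
final class TrimBehavior {
    public static void main(String[] args) {
        String withNul = "value1\0";       // assumed example: trailing NUL
        String withNbsp = "value1\u00A0";  // assumed example: trailing no-break space (U+00A0)

        // trim() drops the NUL because its code point is below U+0020.
        System.out.println(withNul.trim().equals("value1"));  // true
        // strip() keeps the NUL because NUL is not Unicode whitespace.
        System.out.println(withNul.strip().equals("value1")); // false
        // trim() leaves U+00A0 in place because it is above U+0020.
        System.out.println(withNbsp.trim().length());         // 7
    }
}
```

So a `trim()` before the write should remove trailing NUL or control characters that come from the source, but it would not remove characters above U+0020.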

NOTE:
Use the same reproduction as this
Beam Version: 2.62.0
Iceberg Version: 1.4.2

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components


saathwik-tk · 2025-02-14T06:20:20Z

I also tried using GSON in place of JsonToRow.withSchema() to make sure that ObjectMapper isn't the cause.

@ahmedabu98 Do you have any inputs here?

ahmedabu98 · 2025-02-14T13:58:13Z

Hey @saathwik-tk, I'm having trouble reproducing this one. Here's what I tried:

Create an Iceberg table with BigQueryMetastoreCatalog using this partition spec:

PartitionSpec partitionSpec =
    PartitionSpec.builderFor(ICEBERG_SCHEMA)
        .identity("bool")
        .hour("datetime")
        .truncate("str", "value_x".length())
        .build();

Write rows to Iceberg table
Execute BQ query:

BigqueryClient bqClient = new BigqueryClient(getClass().getSimpleName());
String q =
    String.format(
        "SELECT * FROM `%s.%s` where str = 'value_123'", OPTIONS.getProject(), tableId());
List<TableRow> rowList = bqClient.queryUnflattened(q, OPTIONS.getProject(), true, true);

As expected, I get one record where str = value_123.

I tried the same but without partitioning on "str" and it also works.
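To make the partition spec above concrete: `"value_x".length()` is 7, so the `truncate` transform on "str" uses a width of 7. The Iceberg spec describes string truncation as keeping at most that many code points. The sketch below mimics that description with a hypothetical helper `truncate`; it is not Iceberg's own implementation.

```java
final class TruncateSketch {
    // Hypothetical helper: keeps at most `width` Unicode code points of `value`.
    static String truncate(String value, int width) {
        int codePoints = value.codePointCount(0, value.length());
        int end = value.offsetByCodePoints(0, Math.min(width, codePoints));
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        int width = "value_x".length();
        System.out.println(width);                        // 7
        System.out.println(truncate("value_123", width)); // value_1
        System.out.println(truncate("abc", width));       // abc
    }
}
```

Under this reading, a row with str = 'value_123' would fall into the partition for 'value_1', and the equality query on the full value would still have to match the stored string itself.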

ahmedabu98 · 2025-02-14T14:01:06Z

Can you try running a simple pipeline with one or more fixed rows? Something like:

writePipeline
        .apply(Create.of(Row.withSchema(...).addValues(...).build()))
        .apply(Managed.write(Managed.ICEBERG).withConfig(...));

ahmedabu98 · 2025-02-14T14:01:19Z

Also are you seeing this for just String types?

saathwik-tk · 2025-02-14T14:03:39Z

Yes I see it only for string types.

saathwik-tk · 2025-02-14T14:44:40Z

pipeline.apply(Create.of(Row.withSchema(schemaTest).addValues("2025-02-14",1).build())).setCoder(RowCoder.of(schemaTest))
             .apply(Managed.write(Managed.ICEBERG).withConfig(config));

I just tried the above and got the same result:
SELECT * FROM table where id=1 works
SELECT * FROM table where date='2025-02-14' doesn't return any data.

@ahmedabu98
Try strings of this particular kind, such as '2025-02-14' and '2025-02-13'.
Just FYI, I'm using Hive Catalog, with Trino as the query engine.

Schema schemaTest = Schema.builder()
                .addStringField("date")
                .addNullableInt32Field("id")
                .build();

ahmedabu98 · 2025-02-14T14:54:26Z

Thanks for doing that. Can you also paste the Schema you're using?

saathwik-tk · 2025-02-14T15:00:57Z

Schema schemaTest = Schema.builder()
                .addStringField("date")
                .addNullableInt32Field("id")
                .build();

Ignore the fact that I made the id field nullable; it doesn't actually matter.

saathwik-tk added awaiting triage bug labels Feb 12, 2025

github-actions bot added java io P2 labels Feb 12, 2025

liferoad assigned ahmedabu98 Feb 12, 2025

liferoad added IcebergIO IcebergIO: can only be used through ManagedIO and removed awaiting triage labels Feb 12, 2025
