ducklake_add_data_files fails to bind legacy Avro LIST layout (array) — expects canonical list/element #617

@danchubb

What happens?

Hi, I'm having trouble loading a Parquet file with complex structs that was produced from an Avro file. I've managed to reproduce the problem with a simple Parquet file, shown below.

I think this happens because CALL ducklake_add_data_files(...) fails on Parquet files that use the legacy Avro LIST layout, where the repeated child group is named array. It expects the canonical list/element naming and raises a type mismatch (e.g., expected INTEGER[] but found INTEGER, or expected STRUCT(…) but found VARCHAR), even though DuckDB itself reads the file without problems.
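To check whether a given file uses the legacy naming, a sketch along the following lines might help. It assumes that DuckDB's parquet_schema table function returns rows in depth-first order and exposes name, repetition_type, num_children and converted_type columns. The helper names find_legacy_lists and schema_rows are hypothetical, and the check is only a heuristic, not the full set of backward-compatibility rules from the Parquet specification.

```python
def find_legacy_lists(rows):
    """Return (list_field, repeated_child) pairs that lack canonical naming.

    rows: depth-first schema rows as (name, repetition_type, num_children,
    converted_type) tuples. Heuristic only: flags any LIST-annotated group
    whose single repeated child is not named "list".
    """
    legacy = []
    for i, (name, _rep, num_children, converted) in enumerate(rows):
        if converted != "LIST" or num_children != 1 or i + 1 >= len(rows):
            continue
        # In a depth-first listing, the first child directly follows its parent.
        child_name, child_rep = rows[i + 1][0], rows[i + 1][1]
        if child_rep == "REPEATED" and child_name != "list":
            legacy.append((name, child_name))
    return legacy


def schema_rows(path):
    """Fetch schema rows via DuckDB (assumes these parquet_schema columns exist)."""
    import duckdb  # imported lazily so find_legacy_lists works without DuckDB

    return duckdb.execute(
        "SELECT name, repetition_type, num_children, converted_type "
        "FROM parquet_schema(?)",
        [path],
    ).fetchall()
```

Calling find_legacy_lists(schema_rows("old_list_structure.parquet")) should then list any LIST groups whose repeated child is named something other than list.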

Could ducklake_add_data_files normalize legacy Avro LIST naming during binding (e.g., treat array as the canonical list/element), so that files written by older Avro/parquet-mr writers can be registered without being rewritten?

Thanks for the great work.

To Reproduce

The following script reproduces the error; the full traceback is included after it.

import duckdb
import requests
from duckdb_extensions import import_extension
import_extension("ducklake")


# example old parquet file
url = "https://raw.githubusercontent.com/apache/parquet-testing/master/data/old_list_structure.parquet"
output_file = "old_list_structure.parquet"

response = requests.get(url)
response.raise_for_status()

with open(output_file, "wb") as f:
    f.write(response.content)


con = duckdb.connect()

con.execute(
    f"""
    ATTACH 'ducklake:test.ducklake' AS lake
    """
)

con.execute(f"""
    CREATE TABLE lake.test AS
    SELECT * FROM read_parquet('old_list_structure.parquet')
    WITH NO DATA;
""")


con.execute(f"""
    CALL ducklake_add_data_files('lake', 'test', 'old_list_structure.parquet')
""")


ERROR MESSAGE:

---------------------------------------------------------------------------
InvalidInputException                     Traceback (most recent call last)
Cell In[1], line 33
     20 con.execute(
     21     f"""
     22     ATTACH 'ducklake:test.ducklake' AS lake
     23     """
     24 )
     26 con.execute(f"""
     27     CREATE TABLE lake.test AS
     28     SELECT * FROM read_parquet('old_list_structure.parquet')
     29     WITH NO DATA;
     30 """)
---> 33 con.execute(f"""
     34     CALL ducklake_add_data_files('lake', 'test', 'old_list_structure.parquet')
     35 """)

InvalidInputException: Invalid Input Error: Failed to map column "a.list" from file "old_list_structure.parquet" to the column in table "test"
* Expected type "INTEGER[]" but found type "INTEGER"

OS:

Ubuntu

DuckDB Version:

1.4.2

DuckLake Version:

77f2512

DuckDB Client:

77f2512 Python

Hardware:

No response

Full Name:

Daniel Chubb

Affiliation:

no affiliation

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
