Description
What happens?
Hi, I'm having trouble loading a Parquet file with complex structs that was produced from an Avro file. I've managed to reproduce the problem with a simpler Parquet file, shown below.
I think this happens because CALL ducklake_add_data_files(...) fails on Parquet files that use the legacy Avro LIST layout, where the repeated child group is named array. The function expects the list/element naming and throws a type mismatch (e.g., expected INTEGER[] but found INTEGER, or expected STRUCT(…) but found VARCHAR), even though DuckDB itself can read the file just fine.
Is there any way for ducklake_add_data_files to normalize legacy Avro LIST naming during binding (e.g., treat array as canonical list/element), so files written by older Avro/parquet-mr writers can be registered without rewrites?
Thanks for the great work.
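To make the requested normalization concrete, here is a minimal sketch of how a reader could locate the element type of a LIST group while tolerating legacy names. It follows a simplified reading of the backward-compatibility rules in the Parquet format specification. The `Node` class and `list_element` function are hypothetical names invented for this illustration; they are not part of DuckDB or DuckLake, and this is not how either project implements schema mapping.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, minimal model of one Parquet schema node."""
    name: str
    repetition: str                       # "required", "optional" or "repeated"
    physical_type: Optional[str] = None   # None means the node is a group
    children: List["Node"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.physical_type is None


def list_element(list_group: Node) -> Node:
    """Return the node that carries the element type of a LIST-annotated group."""
    if len(list_group.children) != 1 or list_group.children[0].repetition != "repeated":
        raise ValueError("a LIST group must contain exactly one repeated field")
    repeated = list_group.children[0]

    # Rule 1: a repeated primitive is itself the element (legacy 2-level list).
    if not repeated.is_group:
        return repeated
    # Rule 2: a repeated group with several fields is itself the element (a struct).
    if len(repeated.children) > 1:
        return repeated
    # Rule 3: a single-field repeated group named "array" or "<list name>_tuple"
    # is a legacy layout in which the group itself is the element.
    if repeated.name in ("array", f"{list_group.name}_tuple"):
        return repeated
    # Otherwise the repeated group is only a wrapper and its single child is the element.
    return repeated.children[0]


# Modern 3-level layout: a (LIST) -> repeated group list -> element
modern = Node("a", "optional", children=[
    Node("list", "repeated", children=[Node("element", "optional", "INT32")]),
])
# Legacy 2-level layout: a (LIST) -> repeated int32 array
legacy_primitive = Node("a", "optional", children=[Node("array", "repeated", "INT32")])
# Legacy Avro-style layout: a (LIST) -> repeated group array { one field }
legacy_struct = Node("a", "optional", children=[
    Node("array", "repeated", children=[Node("x", "required", "BYTE_ARRAY")]),
])

print(list_element(modern).name)
print(list_element(legacy_primitive).name)
print(list_element(legacy_struct).name)
```

With a resolution step like this, a column mapper could compare element types regardless of whether the file uses list/element or the legacy array naming.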
To Reproduce
The following code should reproduce the error (the full error message is provided below):
import duckdb
import requests
from duckdb_extensions import import_extension

import_extension("ducklake")

# example old parquet file
url = "https://raw.githubusercontent.com/apache/parquet-testing/master/data/old_list_structure.parquet"
output_file = "old_list_structure.parquet"

response = requests.get(url)
response.raise_for_status()

with open(output_file, "wb") as f:
    f.write(response.content)

con = duckdb.connect()
con.execute(
    f"""
    ATTACH 'ducklake:test.ducklake' AS lake
    """
)

con.execute(f"""
    CREATE TABLE lake.test AS
    SELECT * FROM read_parquet('old_list_structure.parquet')
    WITH NO DATA;
""")

con.execute(f"""
    CALL ducklake_add_data_files('lake', 'test', 'old_list_structure.parquet')
""")
ERROR MESSAGE:
---------------------------------------------------------------------------
InvalidInputException Traceback (most recent call last)
Cell In[1], line 33
20 con.execute(
21 f"""
22 ATTACH 'ducklake:test.ducklake' AS lake
23 """
24 )
26 con.execute(f"""
27 CREATE TABLE lake.test AS
28 SELECT * FROM read_parquet('old_list_structure.parquet')
29 WITH NO DATA;
30 """)
---> 33 con.execute(f"""
34 CALL ducklake_add_data_files('lake', 'test', 'old_list_structure.parquet')
35 """)
InvalidInputException: Invalid Input Error: Failed to map column "a.list" from file "old_list_structure.parquet" to the column in table "test"
* Expected type "INTEGER[]" but found type "INTEGER"
OS:
Ubuntu
DuckDB Version:
1.4.2
DuckLake Version:
77f2512
DuckDB Client:
Python
Hardware:
No response
Full Name:
Daniel Chubb
Affiliation:
no affiliation
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have