BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14922
Conversation
Force-pushed 02b45c6 to b41b4fb
@@ -2400,7 +2400,8 @@ async fn write_json_with_order() -> Result<()> {
#[tokio::test]
async fn write_table_with_order() -> Result<()> {
    let tmp_dir = TempDir::new()?;
    let ctx = SessionContext::new();
    let config = SessionConfig::new().with_parquet_force_view_metadata(false);
Why do we need to disable this?
I can explain:
- We select from a datasource or memory table whose columns are not forced to the utf8view type, but with force_view enabled we create the external table with utf8view fields.
- For the insert-into cases we check that the insert data type matches the table data type, so we need to make the data types consistent.
Related ticket to support casting to utf8view automatically:
#13408 (comment)
statement ok
create table t(a varchar) as values ('1'), ('2');
query T
select arrow_typeof(a) from t;
----
Utf8
Utf8
statement ok
drop table t
@@ -59,7 +59,8 @@ use tempfile::tempdir;
#[tokio::main]
async fn main() -> Result<()> {
    // The SessionContext is the main high level API for interacting with DataFusion
    let ctx = SessionContext::new();
    let config = SessionConfig::new().with_parquet_force_view_metadata(false);
Same here: we need to make the data type consistent for the insert-into case.
I think this is because SQL VARCHAR is being mapped to Utf8.
03)----Sort: aggregate_test_100.c1 ASC NULLS LAST
04)------TableScan: aggregate_test_100 projection=[c1]
physical_plan
01)DataSinkExec: sink=ParquetSink(file_groups=[])
02)--SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[false]
03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1], file_type=csv, has_header=true
02)--CoalescePartitionsExec
This is not related to this PR, but I am confused why adding CAST() to the logical plan causes the physical plan to change to add:
02)--CoalescePartitionsExec
03)----ProjectionExec: expr=[CAST(c1@0 AS Utf8View) as c1]
04)------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
@@ -144,7 +144,7 @@ EXPLAIN select b from t_pushdown where a = 'bar' order by b;
----
logical_plan
01)Sort: t_pushdown.b ASC NULLS LAST
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8("bar")]
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8View("bar")]
## Create table without pushdown
statement ok
CREATE EXTERNAL TABLE t_pushdown(a varchar, b int, c float) STORED AS PARQUET
LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';
This is expected: with this PR we now support Utf8View for external table creation.
Testing result:
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
l_orderkey BIGINT,
l_partkey BIGINT,
l_suppkey BIGINT,
l_linenumber INTEGER,
l_quantity DECIMAL(15, 2),
l_extendedprice DECIMAL(15, 2),
l_discount DECIMAL(15, 2),
l_tax DECIMAL(15, 2),
l_returnflag VARCHAR,
l_linestatus VARCHAR,
l_shipdate DATE,
l_commitdate DATE,
l_receiptdate DATE,
l_shipinstruct VARCHAR,
l_shipmode VARCHAR,
l_comment VARCHAR
) STORED AS parquet
LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem';
0 row(s) fetched.
Elapsed 0.006 seconds.
> select arrow_typeof(l_comment) from lineitem limit 1;
+----------------------------------+
| arrow_typeof(lineitem.l_comment) |
+----------------------------------+
| Utf8View |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.092 seconds.
@@ -456,13 +456,16 @@ explain insert into table_without_values select c1 from aggregate_test_100 order
----
logical_plan
01)Dml: op=[Insert Into] table=[table_without_values]
02)--Projection: aggregate_test_100.c1 AS c1
02)--Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1
This is also expected: for the SQL insert-into path, we automatically cast the type to match the table schema. See the code here:
datafusion/datafusion/sql/src/statement.rs
Line 1958 in 2011f52
.cast_to(target_field.data_type(), source.schema())?
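The auto-cast behavior described above can be sketched with a toy model. This is not DataFusion's actual planner API (the `Expr` enum and `plan_insert_projection` function here are illustrative assumptions): when an INSERT source column's type differs from the target table schema, the planner wraps the source expression in a cast so the types line up.

```rust
// Toy model of INSERT-path auto-casting: mirror of the `.cast_to(...)` call
// referenced above, over illustrative types (not DataFusion's real API).

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // Cast wraps an expression with the name of the target type.
    Cast(Box<Expr>, String),
}

/// For each (column, source type) paired with a target table type, emit the
/// column as-is when types match, otherwise wrap it in a Cast.
fn plan_insert_projection(source: &[(&str, &str)], target: &[&str]) -> Vec<Expr> {
    source
        .iter()
        .zip(target.iter())
        .map(|(&(name, src_ty), &tgt_ty)| {
            let col = Expr::Column(name.to_string());
            if src_ty == tgt_ty {
                col
            } else {
                Expr::Cast(Box::new(col), tgt_ty.to_string())
            }
        })
        .collect()
}

fn main() {
    // SELECT c1 (Utf8) inserted into a table whose column is Utf8View:
    let proj = plan_insert_projection(&[("c1", "Utf8")], &["Utf8View"]);
    assert_eq!(
        proj,
        vec![Expr::Cast(Box::new(Expr::Column("c1".into())), "Utf8View".into())]
    );
    // Matching types pass through unchanged:
    let proj = plan_insert_projection(&[("c2", "Int64")], &["Int64"]);
    assert_eq!(proj, vec![Expr::Column("c2".into())]);
    println!("ok");
}
```

This is why `Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1` appears in the plan for the SQL path, while the insert-into API path (which skips this planning step) needed the follow-up ticket.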
Created a follow-up ticket for it; we should support casting for the insert-into API as well, to be consistent with insert-into SQL:
#15015
@@ -377,6 +377,21 @@ impl FileFormat for ParquetFormat {
    Ok(Arc::new(schema))
}

async fn transform_schema(&self, schema: SchemaRef) -> Result<SchemaRef> {
Is this transformation rule adapted from some existing transformation? There is an additional rule, transform_binary_to_string, so I made such a guess.
If so, perhaps we should extract the common logic to keep them consistent, or at least have them reference each other.
Right, we call it at the end of infer_schema, but it's not limited to infer_schema:
let schema = if self.binary_as_string() {
transform_binary_to_string(&schema)
} else {
schema
};
let schema = if self.force_view_types() {
transform_schema_to_view(&schema)
} else {
schema
};
I will try to extract the common logic to make it consistent.
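The "common logic" worth extracting from the two transforms above can be sketched as follows. This is a simplified model over a toy DataType enum, not DataFusion's real `transform_binary_to_string`/`transform_schema_to_view` (which operate on arrow `Schema`s); it only illustrates the shared map-each-field structure:

```rust
// Toy model of the two schema transforms chained in infer_schema,
// showing the shared per-field-rewrite skeleton that could be extracted.

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    Utf8View,
    Binary,
    BinaryView,
    Int64,
}

/// The common skeleton: apply a per-field type rewrite across the schema.
fn transform_types(schema: &[DataType], f: impl Fn(DataType) -> DataType) -> Vec<DataType> {
    schema.iter().copied().map(f).collect()
}

/// binary_as_string rule: Binary-family types become their string counterparts.
fn transform_binary_to_string(schema: &[DataType]) -> Vec<DataType> {
    transform_types(schema, |dt| match dt {
        DataType::Binary => DataType::Utf8,
        DataType::BinaryView => DataType::Utf8View,
        other => other,
    })
}

/// force_view_types rule: Utf8/Binary become their View counterparts.
fn transform_schema_to_view(schema: &[DataType]) -> Vec<DataType> {
    transform_types(schema, |dt| match dt {
        DataType::Utf8 => DataType::Utf8View,
        DataType::Binary => DataType::BinaryView,
        other => other,
    })
}

fn main() {
    let inferred = vec![DataType::Binary, DataType::Utf8, DataType::Int64];
    // Same order as in infer_schema: binary_as_string first, then force_view_types.
    let schema = transform_schema_to_view(&transform_binary_to_string(&inferred));
    assert_eq!(
        schema,
        vec![DataType::Utf8View, DataType::Utf8View, DataType::Int64]
    );
    println!("{:?}", schema);
}
```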
Thank you @2010YOUY01 for the review; addressed the comments in the latest PR.
LGTM, thank you.
Thank you for this PR and report @zhuqi-lucas
It seems like the problem is that VARCHAR in SQL is treated as Utf8 (see my comment here: #14909 (comment)).
It seems to me like this PR is a workaround -- if we make this change, SQL VARCHAR will be treated as Utf8View for CREATE TABLE statements with parquet, but as Utf8 otherwise. I think this discrepancy is why you had to change the insert cases.
I think a better approach would be to change the mapping of VARCHAR to Utf8View everywhere (change this line:
datafusion/datafusion/sql/src/planner.rs
Line 561 in c0d53ad
_ => Ok(DataType::Utf8),
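The proposed mapping change can be sketched with toy types. This is not the real `datafusion/sql/src/planner.rs` code (the enum names and the `map_varchar_to_view` flag are illustrative assumptions); it only shows the fallback arm quoted above resolving to Utf8View instead of Utf8:

```rust
// Illustrative sketch of mapping SQL VARCHAR to Utf8View instead of Utf8.

#[derive(Debug, PartialEq)]
enum ArrowType {
    Utf8,
    Utf8View,
    Int32,
}

enum SqlType {
    Varchar,
    Int,
}

/// Resolve a SQL type to an arrow type; the flag models switching the
/// default VARCHAR mapping (the `_ => Ok(DataType::Utf8)` fallback above).
fn sql_to_arrow(t: &SqlType, map_varchar_to_view: bool) -> ArrowType {
    match t {
        SqlType::Int => ArrowType::Int32,
        SqlType::Varchar if map_varchar_to_view => ArrowType::Utf8View,
        SqlType::Varchar => ArrowType::Utf8,
    }
}

fn main() {
    assert_eq!(sql_to_arrow(&SqlType::Varchar, true), ArrowType::Utf8View);
    assert_eq!(sql_to_arrow(&SqlType::Varchar, false), ArrowType::Utf8);
    println!("ok");
}
```

Changing this one mapping affects every place a VARCHAR column's type flows, which is why the test fallout below (approx_distinct, topk aggregates, avro, JSON reader) is so broad.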
Thank you @alamb for the review and the explanation of the mapping.
@alamb I changed VARCHAR to utf8view now, but it seems to cause many disruptive test changes that need more digging. We may need to file more tickets...
3 5 19
4 5 23
5 5 14
DataFusion error: This feature is not implemented: Support for 'approx_distinct' for data type Utf8View is not implemented
approx_distinct does not seem to support Utf8View. We may need to file a ticket.
04)------CoalesceBatchesExec: target_batch_size=8192
05)--------RepartitionExec: partitioning=Hash([trace_id@0], 4), input_partitions=4
06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[max(traces.timestamp)], lim=[4]
07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[max(traces.timestamp)]
I am confused why lim=[4] is deleted when we change VARCHAR to utf8view.
I think it is because there is special topk functionality for aggregates: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/topk/mod.rs
This special implementation likely doesn't have support for Utf8View
Got it, thank you @alamb !
@@ -260,8 +260,8 @@ select
arrow_typeof(coalesce(c, arrow_cast('b', 'Dictionary(Int32, Utf8)')))
from t;
----
a Dictionary(Int32, Utf8)
b Dictionary(Int32, Utf8)
a Utf8View
Dictionary(Int32, Utf8) changes to Utf8View when we change VARCHAR to utf8view.
Check optimizer-specific invariants after optimizer rule: optimize_projections
caused by
Internal error: Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: "concat_ws(Utf8(\"\"),foo.a,foo.b,Utf8(\"c\"))", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: "concat_ws(Utf8(\"\"),foo.a,foo.b,Utf8(\"c\"))", data_type: Utf8View, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
It looks like concat_ws with mixed data types is no longer supported after this PR.
@@ -88,21 +88,6 @@ e 5
query TT
EXPLAIN SELECT t2.c1, t2.r FROM (SELECT c1, RANK() OVER (ORDER BY c1 DESC) AS r, c3, c9 FROM sink_table ORDER BY c1, c3 LIMIT 2) AS t2 ORDER BY t2.c1, t2.c3, t2.c9;
----
logical_plan
I am confused why it is deleted after this PR.
2010 21
DataFusion error: type_coercion
caused by
Internal error: Cannot run range queries on datatype: Utf8View.
It seems range queries do not support Utf8View.
SELECT approx_percentile_cont(c3, 0.95, c1) FROM aggregate_test_100
----
DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'approx_percentile_cont' function: coercion from [Int16, Float64, Utf8View] to the signature OneOf([Exact([Int8, Float64]), Exact([Int8, Float64, Int8]), Exact([Int8, Float64, Int16]), Exact([Int8, Float64, Int32]), Exact([Int8, Float64, Int64]), Exact([Int8, Float64, UInt8]), Exact([Int8, Float64, UInt16]), Exact([Int8, Float64, UInt32]), Exact([Int8, Float64, UInt64]), Exact([Int16, Float64]), Exact([Int16, Float64, Int8]), Exact([Int16, Float64, Int16]), Exact([Int16, Float64, Int32]), Exact([Int16, Float64, Int64]), Exact([Int16, Float64, UInt8]), Exact([Int16, Float64, UInt16]), Exact([Int16, Float64, UInt32]), Exact([Int16, Float64, UInt64]), Exact([Int32, Float64]), Exact([Int32, Float64, Int8]), Exact([Int32, Float64, Int16]), Exact([Int32, Float64, Int32]), Exact([Int32, Float64, Int64]), Exact([Int32, Float64, UInt8]), Exact([Int32, Float64, UInt16]), Exact([Int32, Float64, UInt32]), Exact([Int32, Float64, UInt64]), Exact([Int64, Float64]), Exact([Int64, Float64, Int8]), Exact([Int64, Float64, Int16]), Exact([Int64, Float64, Int32]), Exact([Int64, Float64, Int64]), Exact([Int64, Float64, UInt8]), Exact([Int64, Float64, UInt16]), Exact([Int64, Float64, UInt32]), Exact([Int64, Float64, UInt64]), Exact([UInt8, Float64]), Exact([UInt8, Float64, Int8]), Exact([UInt8, Float64, Int16]), Exact([UInt8, Float64, Int32]), Exact([UInt8, Float64, Int64]), Exact([UInt8, Float64, UInt8]), Exact([UInt8, Float64, UInt16]), Exact([UInt8, Float64, UInt32]), Exact([UInt8, Float64, UInt64]), Exact([UInt16, Float64]), Exact([UInt16, Float64, Int8]), Exact([UInt16, Float64, Int16]), Exact([UInt16, Float64, Int32]), Exact([UInt16, Float64, Int64]), Exact([UInt16, Float64, UInt8]), Exact([UInt16, Float64, UInt16]), Exact([UInt16, Float64, UInt32]), Exact([UInt16, Float64, UInt64]), Exact([UInt32, Float64]), Exact([UInt32, Float64, Int8]), Exact([UInt32, Float64, Int16]), Exact([UInt32, 
Float64, Int32]), Exact([UInt32, Float64, Int64]), Exact([UInt32, Float64, UInt8]), Exact([UInt32, Float64, UInt16]), Exact([UInt32, Float64, UInt32]), Exact([UInt32, Float64, UInt64]), Exact([UInt64, Float64]), Exact([UInt64, Float64, Int8]), Exact([UInt64, Float64, Int16]), Exact([UInt64, Float64, Int32]), Exact([UInt64, Float64, Int64]), Exact([UInt64, Float64, UInt8]), Exact([UInt64, Float64, UInt16]), Exact([UInt64, Float64, UInt32]), Exact([UInt64, Float64, UInt64]), Exact([Float32, Float64]), Exact([Float32, Float64, Int8]), Exact([Float32, Float64, Int16]), Exact([Float32, Float64, Int32]), Exact([Float32, Float64, Int64]), Exact([Float32, Float64, UInt8]), Exact([Float32, Float64, UInt16]), Exact([Float32, Float64, UInt32]), Exact([Float32, Float64, UInt64]), Exact([Float64, Float64]), Exact([Float64, Float64, Int8]), Exact([Float64, Float64, Int16]), Exact([Float64, Float64, Int32]), Exact([Float64, Float64, Int64]), Exact([Float64, Float64, UInt8]), Exact([Float64, Float64, UInt16]), Exact([Float64, Float64, UInt32]), Exact([Float64, Float64, UInt64])]) failed No function matches the given name and argument types 'approx_percentile_cont(Int16, Float64, Utf8View)'. You might need to add explicit type casts.
approx_percentile_cont also does not support utf8view.
One question for me: I ran cargo test --test sqllogictests locally and it succeeded, and I used cargo test --test sqllogictests -- --complete to generate the slt test files, but CI still failed for the PR. For example:
External error: statement is expected to fail with error:
(multiline) DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'approx_percentile_cont_with_weight' function: coercion from [Utf8View, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed No function matches the given name and argument types 'approx_percentile_cont_with_weight(Utf8View, Int8, Float64)'. You might need to add explicit type casts.
Candidate functions:
approx_percentile_cont_with_weight(Int8, Int8, Float64)
approx_percentile_cont_with_weight(Int16, Int16, Float64)
approx_percentile_cont_with_weight(Int32, Int32, Float64)
approx_percentile_cont_with_weight(Int64, Int64, Float64)
approx_percentile_cont_with_weight(UInt8, UInt8, Float64)
approx_percentile_cont_with_weight(UInt16, UInt16, Float64)
approx_percentile_cont_with_weight(UInt32, UInt32, Float64)
approx_percentile_cont_with_weight(UInt64, UInt64, Float64)
approx_percentile_cont_with_weight(Float32, Float32, Float64)
approx_percentile_cont_with_weight(Float64, Float64, Float64)
but got error:
DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'approx_percentile_cont_with_weight' function: coercion from [Utf8View, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed
[SQL] SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM aggregate_test_100
at test_files/aggregate.slt:135
Sometimes I have seen that this is related to the exact error message / matching syntax. You can shorten the expected error message; that is how I work around it.
Thank you for this work @zhuqi-lucas. I can try to file such a ticket later today / tomorrow if you don't have a chance to do so.
Thank you @alamb for confirming, I created the arrow-rs subtask now.
Thank you @alamb, after digging in I found the problem: cargo test --test sqllogictests --features backtrace -- --complete
The test results differ when backtrace is enabled for sqllogictest, and the current CI cargo test uses backtrace:
macos-aarch64:
name: cargo test (macos-aarch64)
runs-on: macos-14
steps:
- uses: actions/checkout@v4
with:
submodules: true
fetch-depth: 1
- name: Setup Rust toolchain
uses: ./.github/actions/setup-macos-aarch64-builder
- name: Run tests (excluding doctests)
shell: bash
run: cargo test --profile ci --lib --tests --bins --features avro,json,backtrace,integration-tests
But other places do not use backtrace. I created a ticket to make the CI consistent, and also submitted the fix PR.
Thanks @zhuqi-lucas -- I started organizing the work needed to switch the default VARCHAR type to Utf8View here: #15096
I am going to mark this PR as a draft as we work through the issues
@@ -3627,10 +3625,15 @@ physical_plan
07)------------StreamingTableExec: partition_sizes=1, projection=[a0, a, b, c, d], infinite_source=true, output_orderings=[[a@1 ASC NULLS LAST, b@2 ASC NULLS LAST], [c@3 ASC NULLS LAST]]

# CTAS with NTILE function
statement ok
statement error
this looks like a regression to me
@@ -1898,6 +1898,9 @@ mod tests {
    Ok(())
}

/// Note: We now default to Utf8View, but Utf8View is not yet supported in the JSON reader
The follow-on ticket is here: feat: Support Utf8View in JSON reader (arrow-rs#7244)
@@ -289,14 +289,8 @@ SELECT c2, approx_median(c5), approx_median(c11) FROM aggregate_test_100 GROUP B
5 593204320 0.5156586

# Test approx_distinct for varchar / int
query III
query error DataFusion error: This feature is not implemented: Support for 'approx_distinct' for data type Utf8View is not implemented
This would be needed too -- added a note to
@@ -124,69 +124,24 @@ STORED AS AVRO
LOCATION '../../testing/data/avro/simple_fixed.avro';

# test avro query
query IT
query error DataFusion error: Arrow error: Schema error: type Utf8View not supported
🤔 looks like we need to support Utf8View for avro as well
Yeah @alamb, I agree, we need to create more sub-tasks.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Which issue does this PR close?
Closes: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14909

Rationale for this change
schema_force_view_type configuration not working for CREATE EXTERNAL TABLE
Because we only use infer_schema to infer and transform the schema; for a provided schema, such as one given in CREATE EXTERNAL TABLE, we also need to transform the schema.
What changes are included in this PR?
Add a separate transform_schema function to do the transform.
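The shape of the fix can be sketched with a toy model. This is not DataFusion's actual FileFormat trait (the trait, struct, and field names here are illustrative assumptions): the point is that the same transform applied after schema inference must also be applied to a user-provided schema from CREATE EXTERNAL TABLE.

```rust
// Toy sketch of a transform_schema hook applied to provided schemas,
// not only inferred ones (names are illustrative, not DataFusion's API).

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    Utf8View,
    Int64,
}

trait FileFormat {
    /// Applied both after schema inference and to a provided schema.
    fn transform_schema(&self, schema: Vec<DataType>) -> Vec<DataType>;
}

struct ParquetFormat {
    force_view_types: bool,
}

impl FileFormat for ParquetFormat {
    fn transform_schema(&self, schema: Vec<DataType>) -> Vec<DataType> {
        if !self.force_view_types {
            return schema;
        }
        schema
            .into_iter()
            .map(|dt| if dt == DataType::Utf8 { DataType::Utf8View } else { dt })
            .collect()
    }
}

fn main() {
    // Provided schema from CREATE EXTERNAL TABLE t(a VARCHAR, b BIGINT):
    let provided = vec![DataType::Utf8, DataType::Int64];
    let fmt = ParquetFormat { force_view_types: true };
    assert_eq!(
        fmt.transform_schema(provided),
        vec![DataType::Utf8View, DataType::Int64]
    );
    println!("ok");
}
```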
Are these changes tested?
Yes
Are there any user-facing changes?