
BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE #14922


Draft
wants to merge 18 commits into main

Conversation

zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Feb 28, 2025

Which issue does this PR close?

Rationale for this change

schema_force_view_type configuration not working for CREATE EXTERNAL TABLE

Because we only use infer_schema to infer and transform the schema; for a user-provided schema, such as one given in CREATE EXTERNAL TABLE, we also need to apply the transformation.

What changes are included in this PR?

Add a separate transform_schema function to perform the transformation.
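The transformation can be sketched in plain Rust. This is a simplified model, not the actual DataFusion/arrow-rs API: the `DataType` enum and the function names below are local stand-ins for the real arrow_schema types.

```rust
// Simplified model of the schema transformation: when the
// schema_force_view_type option is enabled, string and binary columns
// are rewritten to their view-based equivalents. The real code operates
// on arrow_schema::Schema; this stand-in uses a local enum.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Binary,
    LargeBinary,
    BinaryView,
    Int64,
}

fn field_to_view_type(dt: &DataType) -> DataType {
    match dt {
        // All string variants collapse to Utf8View ...
        DataType::Utf8 | DataType::LargeUtf8 => DataType::Utf8View,
        // ... and all binary variants to BinaryView.
        DataType::Binary | DataType::LargeBinary => DataType::BinaryView,
        other => other.clone(),
    }
}

fn transform_schema_to_view(fields: &[DataType]) -> Vec<DataType> {
    fields.iter().map(field_to_view_type).collect()
}

fn main() {
    // A provided schema (as in CREATE EXTERNAL TABLE) gets the same
    // treatment an inferred schema already received.
    let provided = vec![DataType::Int64, DataType::Utf8, DataType::Binary];
    let transformed = transform_schema_to_view(&provided);
    assert_eq!(
        transformed,
        vec![DataType::Int64, DataType::Utf8View, DataType::BinaryView]
    );
    println!("{:?}", transformed);
}
```

The key point is that the mapping is applied uniformly, whether the schema came from infer_schema or from the user's DDL.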

Are these changes tested?

Yes

> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15, 2),
        l_extendedprice DECIMAL(15, 2),
        l_discount DECIMAL(15, 2),
        l_tax DECIMAL(15, 2),
        l_returnflag VARCHAR,
        l_linestatus VARCHAR,
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct VARCHAR,
        l_shipmode VARCHAR,
        l_comment VARCHAR
) STORED AS parquet
LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem';
0 row(s) fetched.
Elapsed 0.006 seconds.

> select arrow_typeof(l_comment) from lineitem limit 1;
+----------------------------------+
| arrow_typeof(lineitem.l_comment) |
+----------------------------------+
| Utf8View                         |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.092 seconds.

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Feb 28, 2025
@zhuqi-lucas zhuqi-lucas marked this pull request as draft February 28, 2025 03:56
@github-actions github-actions bot added execution Related to the execution crate sqllogictest SQL Logic Tests (.slt) labels Feb 28, 2025
@@ -2400,7 +2400,8 @@ async fn write_json_with_order() -> Result<()> {
#[tokio::test]
async fn write_table_with_order() -> Result<()> {
let tmp_dir = TempDir::new()?;
let ctx = SessionContext::new();
let config = SessionConfig::new().with_parquet_force_view_metadata(false);
Contributor Author

Why do we need to disable this?

I can explain:

  1. We select from a datasource or a memory table whose types are not forced to Utf8View, but when force_view is enabled we create the external table with Utf8View fields.
  2. For the insert-into case we check the insert data type against the table data type, so we need to keep the data types consistent.

Related ticket to support casting it to Utf8View automatically:
#13408 (comment)

statement ok
create table t(a varchar) as values ('1'), ('2');

query T
select arrow_typeof(a) from t;
----
Utf8
Utf8

statement ok
drop table t

@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review March 2, 2025 00:30
@@ -59,7 +59,8 @@ use tempfile::tempdir;
#[tokio::main]
async fn main() -> Result<()> {
// The SessionContext is the main high level API for interacting with DataFusion
let ctx = SessionContext::new();
let config = SessionConfig::new().with_parquet_force_view_metadata(false);
Contributor Author

Same here: we need to keep the data types consistent for the insert-into case.

Contributor

I think this is because SQL VARCHAR is being mapped to Utf8

03)----Sort: aggregate_test_100.c1 ASC NULLS LAST
04)------TableScan: aggregate_test_100 projection=[c1]
physical_plan
01)DataSinkExec: sink=ParquetSink(file_groups=[])
02)--SortExec: expr=[c1@0 ASC NULLS LAST], preserve_partitioning=[false]
03)----DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1], file_type=csv, has_header=true
02)--CoalescePartitionsExec
Contributor Author

@zhuqi-lucas zhuqi-lucas Mar 2, 2025

This is not related to this PR:

I am confused why adding CAST() in the logical plan causes the physical plan to add:

02)--CoalescePartitionsExec
03)----ProjectionExec: expr=[CAST(c1@0 AS Utf8View) as c1]
04)------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1

@@ -144,7 +144,7 @@ EXPLAIN select b from t_pushdown where a = 'bar' order by b;
----
logical_plan
01)Sort: t_pushdown.b ASC NULLS LAST
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8("bar")]
02)--TableScan: t_pushdown projection=[b], full_filters=[t_pushdown.a = Utf8View("bar")]
Contributor Author

@zhuqi-lucas zhuqi-lucas Mar 2, 2025

## Create table without pushdown
statement ok
CREATE EXTERNAL TABLE t_pushdown(a varchar, b int, c float) STORED AS PARQUET
LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';

This is expected: with this PR, external table creation now supports Utf8View.

@zhuqi-lucas
Contributor Author

Testing result:

> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem (
        l_orderkey BIGINT,
        l_partkey BIGINT,
        l_suppkey BIGINT,
        l_linenumber INTEGER,
        l_quantity DECIMAL(15, 2),
        l_extendedprice DECIMAL(15, 2),
        l_discount DECIMAL(15, 2),
        l_tax DECIMAL(15, 2),
        l_returnflag VARCHAR,
        l_linestatus VARCHAR,
        l_shipdate DATE,
        l_commitdate DATE,
        l_receiptdate DATE,
        l_shipinstruct VARCHAR,
        l_shipmode VARCHAR,
        l_comment VARCHAR
) STORED AS parquet
LOCATION '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf10/lineitem';
0 row(s) fetched.
Elapsed 0.006 seconds.

> select arrow_typeof(l_comment) from lineitem limit 1;
+----------------------------------+
| arrow_typeof(lineitem.l_comment) |
+----------------------------------+
| Utf8View                         |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.092 seconds.

@@ -456,13 +456,16 @@ explain insert into table_without_values select c1 from aggregate_test_100 order
----
logical_plan
01)Dml: op=[Insert Into] table=[table_without_values]
02)--Projection: aggregate_test_100.c1 AS c1
02)--Projection: CAST(aggregate_test_100.c1 AS Utf8View) AS c1
Contributor Author


This is also expected: for the SQL insert-into path, we automatically cast the type to match the table schema. See details:

The code here:

.cast_to(target_field.data_type(), source.schema())?
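The cast behavior above can be sketched in plain Rust. This is a hedged model: `Expr`, `DataType`, and `cast_to` below are local stand-ins for DataFusion's real types, and the real `cast_to` also validates that the coercion is possible.

```rust
// Model of how INSERT INTO planning reconciles source and target types:
// each source column expression is wrapped in a cast to the target
// field's type, so inserting Utf8 values into a Utf8View column still
// type-checks.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Utf8View,
}

#[derive(Debug, PartialEq)]
enum Expr {
    Column(&'static str, DataType),
    Cast(Box<Expr>, DataType),
}

impl Expr {
    fn data_type(&self) -> DataType {
        match self {
            Expr::Column(_, dt) => dt.clone(),
            Expr::Cast(_, dt) => dt.clone(),
        }
    }

    // Stand-in for the real cast_to: only add a cast when the source and
    // target types actually differ.
    fn cast_to(self, target: &DataType) -> Expr {
        if self.data_type() == *target {
            self
        } else {
            Expr::Cast(Box::new(self), target.clone())
        }
    }
}

fn main() {
    // A Utf8 source column inserted into a Utf8View table column picks
    // up a cast, producing a projection like CAST(c1 AS Utf8View) AS c1.
    let projected = Expr::Column("c1", DataType::Utf8).cast_to(&DataType::Utf8View);
    assert_eq!(projected.data_type(), DataType::Utf8View);

    // When types already match, no cast is added.
    let unchanged = Expr::Column("c1", DataType::Utf8View).cast_to(&DataType::Utf8View);
    assert_eq!(unchanged, Expr::Column("c1", DataType::Utf8View));
}
```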

Contributor Author


Created a follow-up ticket for it; we should also support casting in the insert-into API path, consistent with insert-into SQL.
#15015

@@ -377,6 +377,21 @@ impl FileFormat for ParquetFormat {
Ok(Arc::new(schema))
}

async fn transform_schema(&self, schema: SchemaRef) -> Result<SchemaRef> {
Contributor


Is this transformation rule adapted from some existing transformation? There is an additional rule here, transform_binary_to_string, which made me guess so.
If so, perhaps we should extract the common logic to keep it consistent, or at least have them reference each other.

Contributor Author


Right, we call it at the end of infer_schema, but it's not limited to infer_schema:

    let schema = if self.binary_as_string() {
        transform_binary_to_string(&schema)
    } else {
        schema
    };

    let schema = if self.force_view_types() {
        transform_schema_to_view(&schema)
    } else {
        schema
    };

I will try to extract the common logic to make it consistent.

@alamb alamb mentioned this pull request Mar 4, 2025
12 tasks
@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 5, 2025
@zhuqi-lucas
Contributor Author

Thank you @2010YOUY01 for the review, addressed the comments in the latest PR.

@2010YOUY01
Contributor

Thank you @2010YOUY01 for the review, addressed the comments in the latest PR.

LGTM, thank you.
I don't know this code well so let's wait for others to approve it.

Contributor

@alamb alamb left a comment

Thank you for this PR and report @zhuqi-lucas

It seems like the problem is that VARCHAR in SQL is treated as Utf8 (see my comment here #14909 (comment))

It seems to me like this PR is a workaround -- if we make this change SQL VARCHAR will be treated as Utf8View for create table statements with parquet but Utf8 otherwise. I think this discrepancy is why you had to change the insert cases

I think a better approach would be to change the mapping of VARCHAR to Utf8View everywhere (change this line

_ => Ok(DataType::Utf8),
) though this might be significantly more disruptive
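The suggested mapping change can be sketched as a single match arm. This is a hypothetical model: `SqlType`, `DataType`, and the flag name below are local stand-ins, not the real sqlparser/arrow types.

```rust
// Model of the planner's SQL-type-to-Arrow mapping discussed above: one
// match arm decides whether VARCHAR becomes Utf8 or Utf8View. Changing
// the default here affects every VARCHAR everywhere, which is why the
// change is disruptive.
#[derive(Debug, PartialEq)]
enum SqlType {
    Varchar,
    BigInt,
}

#[derive(Debug, PartialEq)]
enum DataType {
    Utf8,
    Utf8View,
    Int64,
}

fn sql_to_arrow(ty: &SqlType, map_varchar_to_utf8view: bool) -> DataType {
    match ty {
        SqlType::BigInt => DataType::Int64,
        // Proposed behavior: VARCHAR maps to Utf8View uniformly.
        SqlType::Varchar if map_varchar_to_utf8view => DataType::Utf8View,
        // Current behavior: VARCHAR maps to Utf8 everywhere.
        SqlType::Varchar => DataType::Utf8,
    }
}

fn main() {
    assert_eq!(sql_to_arrow(&SqlType::Varchar, false), DataType::Utf8);
    assert_eq!(sql_to_arrow(&SqlType::Varchar, true), DataType::Utf8View);
    assert_eq!(sql_to_arrow(&SqlType::BigInt, true), DataType::Int64);
}
```

With a uniform mapping there is no discrepancy between CREATE EXTERNAL TABLE and other statements, so the insert-path casts would no longer be needed.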

@zhuqi-lucas
Contributor Author

Thank you for this PR and report @zhuqi-lucas

It seems like the problem is that VARCHAR in SQL is treated as Utf8 (see my comment here #14909 (comment))

It seems to me like this PR is a workaround -- if we make this change SQL VARCHAR will be treated as Utf8View for create table statements with parquet but Utf8 otherwise. I think this discrepancy is why you had to change the insert cases

I think a better approach would be to change the mapping of VARCHAR to Utf8View everywhere (change this line

_ => Ok(DataType::Utf8),

) though this might be significantly more disruptive

Thank you @alamb for the review and explanation; mapping VARCHAR to Utf8View everywhere is indeed a better solution. I will try to implement this idea.

@github-actions github-actions bot added sql SQL Planner and removed datasource Changes to the datasource crate labels Mar 5, 2025
@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Mar 6, 2025

@alamb I changed VARCHAR to Utf8View now, but it seems to cause many disruptive test changes which need more investigation. We may need to file more tickets...

3 5 19
4 5 23
5 5 14
DataFusion error: This feature is not implemented: Support for 'approx_distinct' for data type Utf8View is not implemented
Contributor Author


approx_distinct does not seem to support Utf8View.

We may need to file a ticket.

04)------CoalesceBatchesExec: target_batch_size=8192
05)--------RepartitionExec: partitioning=Hash([trace_id@0], 4), input_partitions=4
06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[max(traces.timestamp)], lim=[4]
07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[max(traces.timestamp)]
Contributor Author


I am confused why lim=[4] is removed when we change VARCHAR to Utf8View.

Contributor


I think it is because there is special topk functionality for aggregates: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/topk/mod.rs

This special implementation likely doesn't have support for Utf8View

Contributor Author


Got it, thank you @alamb !

@@ -260,8 +260,8 @@ select
arrow_typeof(coalesce(c, arrow_cast('b', 'Dictionary(Int32, Utf8)')))
from t;
----
a Dictionary(Int32, Utf8)
b Dictionary(Int32, Utf8)
a Utf8View
Contributor Author


Dictionary(Int32, Utf8) changes to Utf8View when we change VARCHAR to Utf8View.

Check optimizer-specific invariants after optimizer rule: optimize_projections
caused by
Internal error: Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: "concat_ws(Utf8(\"\"),foo.a,foo.b,Utf8(\"c\"))", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: "concat_ws(Utf8(\"\"),foo.a,foo.b,Utf8(\"c\"))", data_type: Utf8View, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
Contributor Author


Looks like concat_ws with differing data types can't be supported after this PR.

@@ -88,21 +88,6 @@ e 5
query TT
EXPLAIN SELECT t2.c1, t2.r FROM (SELECT c1, RANK() OVER (ORDER BY c1 DESC) AS r, c3, c9 FROM sink_table ORDER BY c1, c3 LIMIT 2) AS t2 ORDER BY t2.c1, t2.c3, t2.c9;
----
logical_plan
Contributor Author


I am confused why it is deleted after this PR.

2010 21
DataFusion error: type_coercion
caused by
Internal error: Cannot run range queries on datatype: Utf8View.
Contributor Author


It seems range queries do not support Utf8View.

SELECT approx_percentile_cont(c3, 0.95, c1) FROM aggregate_test_100
----
DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'approx_percentile_cont' function: coercion from [Int16, Float64, Utf8View] to the signature OneOf([Exact([Int8, Float64]), Exact([Int8, Float64, Int8]), Exact([Int8, Float64, Int16]), Exact([Int8, Float64, Int32]), Exact([Int8, Float64, Int64]), Exact([Int8, Float64, UInt8]), Exact([Int8, Float64, UInt16]), Exact([Int8, Float64, UInt32]), Exact([Int8, Float64, UInt64]), Exact([Int16, Float64]), Exact([Int16, Float64, Int8]), Exact([Int16, Float64, Int16]), Exact([Int16, Float64, Int32]), Exact([Int16, Float64, Int64]), Exact([Int16, Float64, UInt8]), Exact([Int16, Float64, UInt16]), Exact([Int16, Float64, UInt32]), Exact([Int16, Float64, UInt64]), Exact([Int32, Float64]), Exact([Int32, Float64, Int8]), Exact([Int32, Float64, Int16]), Exact([Int32, Float64, Int32]), Exact([Int32, Float64, Int64]), Exact([Int32, Float64, UInt8]), Exact([Int32, Float64, UInt16]), Exact([Int32, Float64, UInt32]), Exact([Int32, Float64, UInt64]), Exact([Int64, Float64]), Exact([Int64, Float64, Int8]), Exact([Int64, Float64, Int16]), Exact([Int64, Float64, Int32]), Exact([Int64, Float64, Int64]), Exact([Int64, Float64, UInt8]), Exact([Int64, Float64, UInt16]), Exact([Int64, Float64, UInt32]), Exact([Int64, Float64, UInt64]), Exact([UInt8, Float64]), Exact([UInt8, Float64, Int8]), Exact([UInt8, Float64, Int16]), Exact([UInt8, Float64, Int32]), Exact([UInt8, Float64, Int64]), Exact([UInt8, Float64, UInt8]), Exact([UInt8, Float64, UInt16]), Exact([UInt8, Float64, UInt32]), Exact([UInt8, Float64, UInt64]), Exact([UInt16, Float64]), Exact([UInt16, Float64, Int8]), Exact([UInt16, Float64, Int16]), Exact([UInt16, Float64, Int32]), Exact([UInt16, Float64, Int64]), Exact([UInt16, Float64, UInt8]), Exact([UInt16, Float64, UInt16]), Exact([UInt16, Float64, UInt32]), Exact([UInt16, Float64, UInt64]), Exact([UInt32, Float64]), Exact([UInt32, Float64, Int8]), Exact([UInt32, Float64, Int16]), Exact([UInt32, 
Float64, Int32]), Exact([UInt32, Float64, Int64]), Exact([UInt32, Float64, UInt8]), Exact([UInt32, Float64, UInt16]), Exact([UInt32, Float64, UInt32]), Exact([UInt32, Float64, UInt64]), Exact([UInt64, Float64]), Exact([UInt64, Float64, Int8]), Exact([UInt64, Float64, Int16]), Exact([UInt64, Float64, Int32]), Exact([UInt64, Float64, Int64]), Exact([UInt64, Float64, UInt8]), Exact([UInt64, Float64, UInt16]), Exact([UInt64, Float64, UInt32]), Exact([UInt64, Float64, UInt64]), Exact([Float32, Float64]), Exact([Float32, Float64, Int8]), Exact([Float32, Float64, Int16]), Exact([Float32, Float64, Int32]), Exact([Float32, Float64, Int64]), Exact([Float32, Float64, UInt8]), Exact([Float32, Float64, UInt16]), Exact([Float32, Float64, UInt32]), Exact([Float32, Float64, UInt64]), Exact([Float64, Float64]), Exact([Float64, Float64, Int8]), Exact([Float64, Float64, Int16]), Exact([Float64, Float64, Int32]), Exact([Float64, Float64, Int64]), Exact([Float64, Float64, UInt8]), Exact([Float64, Float64, UInt16]), Exact([Float64, Float64, UInt32]), Exact([Float64, Float64, UInt64])]) failed No function matches the given name and argument types 'approx_percentile_cont(Int16, Float64, Utf8View)'. You might need to add explicit type casts.
Contributor Author


approx_percentile_cont also does not support Utf8View.

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Mar 6, 2025

One question from me:

I ran cargo test --test sqllogictests locally and it succeeded, and I used

cargo test --test sqllogictests -- --complete to generate the slt test files, but CI still fails for the PR?

For example:

External error: statement is expected to fail with error:
	(multiline) DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'approx_percentile_cont_with_weight' function: coercion from [Utf8View, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed No function matches the given name and argument types 'approx_percentile_cont_with_weight(Utf8View, Int8, Float64)'. You might need to add explicit type casts.
	Candidate functions:
	approx_percentile_cont_with_weight(Int8, Int8, Float64)
	approx_percentile_cont_with_weight(Int16, Int16, Float64)
	approx_percentile_cont_with_weight(Int32, Int32, Float64)
	approx_percentile_cont_with_weight(Int64, Int64, Float64)
	approx_percentile_cont_with_weight(UInt8, UInt8, Float64)
	approx_percentile_cont_with_weight(UInt16, UInt16, Float64)
	approx_percentile_cont_with_weight(UInt32, UInt32, Float64)
	approx_percentile_cont_with_weight(UInt64, UInt64, Float64)
	approx_percentile_cont_with_weight(Float32, Float32, Float64)
	approx_percentile_cont_with_weight(Float64, Float64, Float64)
but got error:
	DataFusion error: Error during planning: Failed to coerce arguments to satisfy a call to 'approx_percentile_cont_with_weight' function: coercion from [Utf8View, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed
[SQL] SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM aggregate_test_100
at test_files/aggregate.slt:135

@alamb
Contributor

alamb commented Mar 6, 2025

cargo test --test sqllogictests -- --complete to generate the slt testing files, but the CI still failed for the PR?

Sometimes I have seen that be related to the exact error message matching syntax. You can shorten the expected error message; that is how I work around it.

@alamb
Contributor

alamb commented Mar 6, 2025

Thank you for this work @zhuqi-lucas -- I see switching to use Utf8View is a non-trivial change and we should probably track it with its own epic, as it will require several other items to fully complete (such as JSON Utf8View support)

I can try to file such a ticket later today / tomorrow if you don't have a chance to do so

@zhuqi-lucas
Contributor Author

Thank you for this work @zhuqi-lucas -- I see switching to use Utf8View is a non-trivial change and we should probably track it with its own epic, as it will require several other items to fully complete (such as JSON Utf8View support)

I can try to file such a ticket later today / tomorrow if you don't have a chance to do so

Thank you @alamb for confirming, I created the arrow-rs subtask now:
apache/arrow-rs#7244

@github-actions github-actions bot added the development-process Related to development process of DataFusion label Mar 7, 2025
@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Mar 7, 2025

cargo test --test sqllogictests -- --complete to generate the slt testing files, but the CI still failed for the PR?

Sometimes I have seen that be related to the exact error message matching syntax. You can shorten the expected error message; that is how I work around it.

Thank you @alamb, after digging in, I found the problem:

cargo test --test sqllogictests --features backtrace -- --complete

The test results differ when backtrace is enabled for sqllogictest.

And the current cargo test CI job uses backtrace:

  macos-aarch64:
    name: cargo test (macos-aarch64)
    runs-on: macos-14
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true
          fetch-depth: 1
      - name: Setup Rust toolchain
        uses: ./.github/actions/setup-macos-aarch64-builder
      - name: Run tests (excluding doctests)
        shell: bash
        run: cargo test --profile ci --lib --tests --bins --features avro,json,backtrace,integration-tests

But other places do not use backtrace.

And I created a ticket to make the CI consistent:
#15058

Also submitted the fix PR:
#15059

Contributor

@alamb alamb left a comment

Thanks @zhuqi-lucas -- I started organizing the work needed to switch the default VARCHAR type to Utf8View here: #15096

I am going to mark this PR as a draft as we work through the issues

@@ -3627,10 +3625,15 @@ physical_plan
07)------------StreamingTableExec: partition_sizes=1, projection=[a0, a, b, c, d], infinite_source=true, output_orderings=[[a@1 ASC NULLS LAST, b@2 ASC NULLS LAST], [c@3 ASC NULLS LAST]]

# CTAS with NTILE function
statement ok
statement error
Contributor


this looks like a regression to me

@@ -1898,6 +1898,9 @@ mod tests {
Ok(())
}

/// Note: We now default to use Utf8View, but we don't support for Utf8View in JSON reader
Contributor


@@ -289,14 +289,8 @@ SELECT c2, approx_median(c5), approx_median(c11) FROM aggregate_test_100 GROUP B
5 593204320 0.5156586

# Test approx_distinct for varchar / int
query III
query error DataFusion error: This feature is not implemented: Support for 'approx_distinct' for data type Utf8View is not implemented
Contributor


This would be needed too -- added a note to

04)------CoalesceBatchesExec: target_batch_size=8192
05)--------RepartitionExec: partitioning=Hash([trace_id@0], 4), input_partitions=4
06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[max(traces.timestamp)], lim=[4]
07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], aggr=[max(traces.timestamp)]
Contributor


I think it is because there is special topk functionality for aggregates: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/topk/mod.rs

This special implementation likely doesn't have support for Utf8View

@@ -124,69 +124,24 @@ STORED AS AVRO
LOCATION '../../testing/data/avro/simple_fixed.avro';

# test avro query
query IT
query error DataFusion error: Arrow error: Schema error: type Utf8View not supported
Contributor


🤔 looks like we need to support Utf8View for avro as well

Contributor Author


Yeah @alamb, I agree, we need to create more sub-tasks.


github-actions bot commented May 8, 2025

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 8, 2025
Labels
core Core DataFusion crate development-process Related to development process of DataFusion execution Related to the execution crate physical-expr Changes to the physical-expr crates sql SQL Planner sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Development

Successfully merging this pull request may close these issues.

schema_force_view_type configuration not working for CREATE EXTERNAL TABLE
3 participants