Closed
Description
Is your feature request related to a problem or challenge?
Tracking ticket for next release, also a place to track desired inclusions
Previous release will be https://crates.io/crates/datafusion/45.0.0 (likely Feb 1, 2025) December 31, 2024 so next major release would be around March 1, 2025
Steps:
- Update version and changelog: Update version to 47.0.0, add CHANGELOG #15731
- Test with DataFusion Python:
- Test with DataFusion Comet - chore: Upgrade to datafusion 47.0.0-rc1 and arrow-rs 55.0.0 datafusion-comet#1563
- Test with delta.rs: feat: upgrade to DataFusion 47.0.0 delta-io/delta-rs#3378
- Test with SailHQ: Test DataFusion 47 lakehq/sail#434
- Write upgrade guide: Upgrade guide for DataFusion 47.0.0 #15707
- Test with parquet viewer
- Voting Thread: https://lists.apache.org/thread/zrq9x9gf51r8b6m9qokf2q75kh251rm6
- Create ticket for next release: Release DataFusion `48.0.0` (June 2025) #15771
Prior release tickets:
- `45.0.0`: Release DataFusion `45.0.0` #14008
- `46.0.0`: Release DataFusion `46.0.0` #14123
Changes to add to upgrade guide
These PRs made changes that deserve a mention in the upgrade guide
- chore: remove deprecated variants of UDF's invoke (invoke, invoke_no_args, invoke_batch) #15123
- chore: remove ScalarUDFImpl::return_type_from_exprs #15130
- Fix type coercion for unsigned and signed integers (`Int64` vs `UInt64`, etc) #15341
- Refactor: add `FileGroup` structure for `Vec<PartitionedFile>` #15379
- Add `downcast_to_source` method for `DataSourceExec` #15416
- Support computing statistics for FileGroup #15432
- chore: cleanup deprecated API since `version <= 40` #15027
- Need to handle DisplayFormatType::TreeRender for execution plans.
- map_partial_batch is removed from schema mapper
- page_pruning_predicate is removed from public api -- parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener #15561
Features to mention in the blog (if they make it)
- [EPIC] Complete `SQL EXPLAIN` Tree Rendering #14914
- [EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts #15271
- Change mapping of SQL `VARCHAR` from `Utf8` to `Utf8View` #15096
- feat: introduce `JoinSetTracer` trait for tracing context propagation in spawned tasks #14547
- Improve performance of `first_value` by implementing special `GroupsAccumulator` #15266
Bugs that would be good to fix
- Regression: eager evaluation of expressions inside CASE conditional expression #15384
- `panic` when evaluating trivial WHERE with a CTE #15386
- Fix: after repartitioning, the `PartitionedFile` and `FileGroup` statistics should be inexact/recomputed #15539
- Regression in `last_value` functionality #15676
- Remove waits from blocking threads reading spill files. #15654
Community Wishlist
- Comparison Operators for Decimals of Different Precisions and Scales #15174
- Format `Date32` to string given timestamp specifiers #15361
- Feature: support cast `date` to `timestamp` with tz #14638
- fix: update group by columns for merge phase after spill #15531
- Add `statistics_by_partition` API to `ExecutionPlan` #15495
- Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing #15537