Bad query plan produced when `all_horizontal` is used with `join_where` #21009

SebStrug · 2025-01-30T15:31:48Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import date, timedelta

import polars as pl

left_frame = pl.LazyFrame(
    {
        "left_match": ["A", "B", "C", "D", "E", "F"],
        "left_date_start": [date(2001, 1, 1) + timedelta(days=d) for d in range(1, 7)],
    }
)

right_frame = pl.LazyFrame(
    {
        "right_match": ["D", "E", "F", "G", "H", "I"],
        "right_date": [date(2001, 1, 6) + timedelta(days=d) for d in range(1, 7)],
    }
)

joined_fast = left_frame.join_where(
    right_frame,
    pl.col("left_match") == pl.col("right_match"),
    pl.col("right_date") >= pl.col("left_date_start"),
)

joined_slow = left_frame.join_where(
    right_frame,
    pl.all_horizontal(pl.col("left_match") == pl.col("right_match")),
    pl.col("right_date") >= pl.col("left_date_start"),
)


print("Fast evaluation:\n", joined_fast.explain(optimized=True))
print("\n\nSlow evalaution:\n", joined_slow.explain(optimized=True))

Log output

Fast evaluation:
 FILTER [(col("left_date_start")) <= (col("right_date"))] FROM
  INNER JOIN:
  LEFT PLAN ON: [col("left_match")]
    DF ["left_match", "left_date_start"]; PROJECT */2 COLUMNS
  RIGHT PLAN ON: [col("right_match")]
    DF ["right_match", "right_date"]; PROJECT */2 COLUMNS
  END INNER JOIN


Slow evalaution:
 FILTER [(col("left_match")) == (col("right_match"))].cast(Boolean) FROM
  IEJOIN JOIN:
  LEFT PLAN ON: [col("left_date_start")]
    DF ["left_match", "left_date_start"]; PROJECT */2 COLUMNS
  RIGHT PLAN ON: [col("right_date")]
    DF ["right_match", "right_date"]; PROJECT */2 COLUMNS
  END IEJOIN JOIN

Issue description

When using all_horizontal within the join_where the query plan shows that:

The filter is applied after the join.
An IE join is used instead of an inner join.

When I run the second expression/join example on larger dataframes, it is much slower than the first expression/join example, in some cases by 100x.

My specific use case is when I have multiple boolean comparisons I want to chain together with .all_horizontal((<first-comparison>).or_(<second-comparison>)). I have not included the use of .or_ in this minimal example of the change in query plan.

Expected behavior

Both expressions should have the same evaluation plan because there is a single boolean comparison. Both should then take the same time and consume the same amount of memory.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:              3.12.8 (main, Jan 14 2025, 22:49:14) [Clang 19.1.6 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          2.38.0
great_tables         <not installed>
matplotlib           3.10.0
numpy                2.2.2
openpyxl             <not installed>
pandas               <not installed>
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.37
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

The text was updated successfully, but these errors were encountered:

ritchie46 · 2025-01-30T15:58:39Z

I believe this is a missed optimization opportunity. Or do you observe wrong results?

SebStrug · 2025-01-30T16:18:49Z

The results are the same in both cases, yes it's an optimization issue. Thanks for fixing the labels.

SebStrug added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 30, 2025

ritchie46 added support enhancement New feature or an improvement of an existing feature and removed bug Something isn't working support labels Jan 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad query plan produced when `all_horizontal` is used with `join_where` #21009

Bad query plan produced when `all_horizontal` is used with `join_where` #21009

SebStrug commented Jan 30, 2025

ritchie46 commented Jan 30, 2025

SebStrug commented Jan 30, 2025 •

edited

Loading

Bad query plan produced when all_horizontal is used with join_where #21009

Bad query plan produced when all_horizontal is used with join_where #21009

Comments

SebStrug commented Jan 30, 2025

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

ritchie46 commented Jan 30, 2025

SebStrug commented Jan 30, 2025 • edited Loading

Bad query plan produced when `all_horizontal` is used with `join_where` #21009

Bad query plan produced when `all_horizontal` is used with `join_where` #21009

SebStrug commented Jan 30, 2025 •

edited

Loading