Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad query plan produced when all_horizontal is used with join_where #21009

Open
2 tasks done
SebStrug opened this issue Jan 30, 2025 · 2 comments
Open
2 tasks done

Bad query plan produced when all_horizontal is used with join_where #21009

SebStrug opened this issue Jan 30, 2025 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@SebStrug
Copy link

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import date, timedelta

import polars as pl

left_frame = pl.LazyFrame(
    {
        "left_match": ["A", "B", "C", "D", "E", "F"],
        "left_date_start": [date(2001, 1, 1) + timedelta(days=d) for d in range(1, 7)],
    }
)

right_frame = pl.LazyFrame(
    {
        "right_match": ["D", "E", "F", "G", "H", "I"],
        "right_date": [date(2001, 1, 6) + timedelta(days=d) for d in range(1, 7)],
    }
)

joined_fast = left_frame.join_where(
    right_frame,
    pl.col("left_match") == pl.col("right_match"),
    pl.col("right_date") >= pl.col("left_date_start"),
)

joined_slow = left_frame.join_where(
    right_frame,
    pl.all_horizontal(pl.col("left_match") == pl.col("right_match")),
    pl.col("right_date") >= pl.col("left_date_start"),
)


print("Fast evaluation:\n", joined_fast.explain(optimized=True))
print("\n\nSlow evalaution:\n", joined_slow.explain(optimized=True))

Log output

Fast evaluation:
 FILTER [(col("left_date_start")) <= (col("right_date"))] FROM
  INNER JOIN:
  LEFT PLAN ON: [col("left_match")]
    DF ["left_match", "left_date_start"]; PROJECT */2 COLUMNS
  RIGHT PLAN ON: [col("right_match")]
    DF ["right_match", "right_date"]; PROJECT */2 COLUMNS
  END INNER JOIN


Slow evalaution:
 FILTER [(col("left_match")) == (col("right_match"))].cast(Boolean) FROM
  IEJOIN JOIN:
  LEFT PLAN ON: [col("left_date_start")]
    DF ["left_match", "left_date_start"]; PROJECT */2 COLUMNS
  RIGHT PLAN ON: [col("right_date")]
    DF ["right_match", "right_date"]; PROJECT */2 COLUMNS
  END IEJOIN JOIN

Issue description

When using all_horizontal within the join_where the query plan shows that:

  1. The filter is applied after the join.
  2. An IE join is used instead of an inner join.

When I run the second expression/join example on larger dataframes, it is much slower than the first expression/join example, in some cases by 100x.

My specific use case is when I have multiple boolean comparisons I want to chain together with .all_horizontal((<first-comparison>).or_(<second-comparison>)). I have not included the use of .or_ in this minimal example of the change in query plan.

Expected behavior

Both expressions should have the same evaluation plan because there is a single boolean comparison. Both should then take the same time and consume the same amount of memory.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:              3.12.8 (main, Jan 14 2025, 22:49:14) [Clang 19.1.6 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          2.38.0
great_tables         <not installed>
matplotlib           3.10.0
numpy                2.2.2
openpyxl             <not installed>
pandas               <not installed>
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.37
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@SebStrug SebStrug added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 30, 2025
@ritchie46 ritchie46 added support enhancement New feature or an improvement of an existing feature and removed bug Something isn't working support labels Jan 30, 2025
@ritchie46
Copy link
Member

I believe this is a missed optimization opportunity. Or do you observe wrong results?

@SebStrug
Copy link
Author

SebStrug commented Jan 30, 2025

The results are the same in both cases, yes it's an optimization issue. Thanks for fixing the labels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

2 participants