Closed
Labels
bug: incorrect result, bug: it raises an error but shouldn't, high priority, pandas-like, pyarrow
Description
Expected results
Same as polars:
import polars as pl
data = {"a": [1, 1, None, 2, 2], "b": [1, 3, 3, 2, 3], "i": [0, 1, 2, 3, 4]}
b = pl.col("b")
df = pl.DataFrame(data)
df.select(
"i",
b_min=b.min().over("a"),
b_mean=b.mean().over("a"),
b_first=b.first().over("a"),
b_last=b.last().over("a"),
).sort("i").drop("i")
shape: (5, 4)
┌───────┬────────┬─────────┬────────┐
│ b_min ┆ b_mean ┆ b_first ┆ b_last │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═══════╪════════╪═════════╪════════╡
│ 1 ┆ 2.0 ┆ 1 ┆ 3 │
│ 1 ┆ 2.0 ┆ 1 ┆ 3 │
│ 3 ┆ 3.0 ┆ 3 ┆ 3 │
│ 2 ┆ 2.5 ┆ 2 ┆ 3 │
│ 2 ┆ 2.5 ┆ 2 ┆ 3 │
└───────┴────────┴─────────┴────────┘
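To spell out the semantics behind the expected table: rows whose partition key is None form a group of their own, and each row receives its own group's aggregate. The following is a minimal pure-Python sketch of that behaviour. The helper name `over` is hypothetical and only for illustration; it is not how polars or narwhals implement this.

```python
from collections import defaultdict

data = {"a": [1, 1, None, 2, 2], "b": [1, 3, 3, 2, 3], "i": [0, 1, 2, 3, 4]}


def over(values, keys, agg):
    """Hypothetical helper: aggregate `values` per key and broadcast back to every row.

    None is a valid dict key, so null keys end up in their own group.
    """
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    results = {key: agg(vals) for key, vals in groups.items()}
    return [results[key] for key in keys]


b_min = over(data["b"], data["a"], min)
b_mean = over(data["b"], data["a"], lambda vals: sum(vals) / len(vals))
b_first = over(data["b"], data["a"], lambda vals: vals[0])
b_last = over(data["b"], data["a"], lambda vals: vals[-1])
```

The row with `a=None` (i == 2) gets min, mean, first and last all computed from its own single `b` value, which is what the third row of the table shows.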
Repro
I discovered in #3295 that when we join back, the presence of None in the partition_by key(s) causes trouble:
import narwhals as nw
b = nw.col("b")
df = nw.from_dict(data, backend="pyarrow")
df.select(
"i",
b_min=b.min().over("a"),
b_mean=b.mean().over("a"),
b_first=b.first().over("a"),
b_last=b.last().over("a"),
).sort("i").drop("i").to_polars()
shape: (5, 4)
┌───────┬────────┬─────────┬────────┐
│ b_min ┆ b_mean ┆ b_first ┆ b_last │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═══════╪════════╪═════════╪════════╡
│ 1 ┆ 2.0 ┆ 1 ┆ 3 │
│ 1 ┆ 2.0 ┆ 1 ┆ 3 │
│ 2 ┆ 2.5 ┆ 2 ┆ 3 │
│ 2 ┆ 2.5 ┆ 2 ┆ 3 │
│ null ┆ null ┆ null ┆ null │
└───────┴────────┴─────────┴────────┘
Our pandas impl raises on the same query:
df = nw.from_dict(data, backend="pandas")
df.select(
"i",
b_min=b.min().over("a"),
b_mean=b.mean().over("a"),
b_first=b.first().over("a"),
b_last=b.last().over("a"),
).sort("i").drop("i")
ShapeError: Expected object of length 5, got length: 4
The error can be avoided by removing first and last, but the results for the others are still incorrect:
df = nw.from_dict(data, backend="pandas")
df.select("i", b_min=b.min().over("a"), b_mean=b.mean().over("a")).sort("i").drop(
"i"
).to_polars()
shape: (5, 2)
┌───────┬────────┐
│ b_min ┆ b_mean │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═══════╪════════╡
│ 1.0 ┆ 2.0 │
│ 1.0 ┆ 2.0 │
│ null ┆ null │
│ 2.0 ┆ 2.5 │
│ 2.0 ┆ 2.5 │
└───────┴────────┘
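For context, the null in the third row is consistent with how plain pandas treats null group keys by default: `groupby` uses `dropna=True`, so rows with a null key are excluded from every group. The sketch below uses pandas directly (assuming pandas >= 1.5) and is not a claim about where the narwhals code path goes wrong; it only shows the default and the `dropna=False` alternative.

```python
import pandas as pd

data = {"a": [1, 1, None, 2, 2], "b": [1, 3, 3, 2, 3], "i": [0, 1, 2, 3, 4]}
pdf = pd.DataFrame(data)

# Default (dropna=True): the null key is dropped, so only two groups are aggregated
agg_default = pdf.groupby("a")["b"].min()

# dropna=False keeps the null key as a third group
agg_keep = pdf.groupby("a", dropna=False)["b"].min()

# Broadcasting back to the original rows with transform:
# with the default, the row whose key is null has no group and comes back as NaN
min_default = pdf.groupby("a")["b"].transform("min")
# with dropna=False, the null-key row gets the aggregate of its own group
min_keep = pdf.groupby("a", dropna=False)["b"].transform("min")
```

With `dropna=False` the broadcast minimum lines up with the b_min column that polars returns for the same data.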
We can get the correct result for that part with duckdb:
df = nw.from_dict(data, backend="polars")
df.lazy("duckdb").select("i", b_min=b.min().over("a"), b_mean=b.mean().over("a")).sort(
"i"
).drop("i").collect("polars").to_polars()
shape: (5, 2)
┌───────┬────────┐
│ b_min ┆ b_mean │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═══════╪════════╡
│ 1 ┆ 2.0 │
│ 1 ┆ 2.0 │
│ 3 ┆ 3.0 │
│ 2 ┆ 2.5 │
│ 2 ┆ 2.5 │
└───────┴────────┘
And by adding some order_bys, we can do the other two:
df = nw.from_dict(data, backend="polars")
df.lazy("duckdb").select(
"i",
b_min=b.min().over("a"),
b_mean=b.mean().over("a"),
b_first=b.first().over("a", order_by="i"),
b_last=b.last().over("a", order_by="i"),
).sort("i").drop("i").collect("polars").to_polars()
shape: (5, 4)
┌───────┬────────┬─────────┬────────┐
│ b_min ┆ b_mean ┆ b_first ┆ b_last │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═══════╪════════╪═════════╪════════╡
│ 1 ┆ 2.0 ┆ 1 ┆ 3 │
│ 1 ┆ 2.0 ┆ 1 ┆ 3 │
│ 3 ┆ 3.0 ┆ 3 ┆ 3 │
│ 2 ┆ 2.5 ┆ 2 ┆ 3 │
│ 2 ┆ 2.5 ┆ 2 ┆ 3 │
└───────┴────────┴─────────┴────────┘