Incorrect results/errors for partitioned over() with nulls #3300

@dangotbanned

Expected results

Same as polars:

import polars as pl
data = {"a": [1, 1, None, 2, 2], "b": [1, 3, 3, 2, 3], "i": [0, 1, 2, 3, 4]}

b = pl.col("b")
df = pl.DataFrame(data)

df.select(
    "i",
    b_min=b.min().over("a"),
    b_mean=b.mean().over("a"),
    b_first=b.first().over("a"),
    b_last=b.last().over("a"),
).sort("i").drop("i")
shape: (5, 4)
┌───────┬────────┬─────────┬────────┐
│ b_min ┆ b_mean ┆ b_first ┆ b_last │
│ ---   ┆ ---    ┆ ---     ┆ ---    │
│ i64   ┆ f64    ┆ i64     ┆ i64    │
╞═══════╪════════╪═════════╪════════╡
│ 1     ┆ 2.0    ┆ 1       ┆ 3      │
│ 1     ┆ 2.0    ┆ 1       ┆ 3      │
│ 3     ┆ 3.0    ┆ 3       ┆ 3      │
│ 2     ┆ 2.5    ┆ 2       ┆ 3      │
│ 2     ┆ 2.5    ┆ 2       ┆ 3      │
└───────┴────────┴─────────┴────────┘
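
For reference, the expected semantics can be written without any dataframe library. This is a minimal sketch, assuming a null key forms its own partition (which is what the polars output above shows); `over_reference` is a hypothetical helper name and is not part of narwhals or polars.

```python
from statistics import fmean

data = {"a": [1, 1, None, 2, 2], "b": [1, 3, 3, 2, 3], "i": [0, 1, 2, 3, 4]}


def over_reference(keys, values):
    """Compute (min, mean, first, last) per partition, broadcast back to every row.

    None is an ordinary dict key here, so rows with a null key form their own
    partition instead of being dropped.
    """
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)
    stats = {
        key: (min(vs), fmean(vs), vs[0], vs[-1]) for key, vs in groups.items()
    }
    # Look each row's key up again: the "join back" step, with None matching None.
    return [stats[key] for key in keys]


rows = over_reference(data["a"], data["b"])
for row in rows:
    print(row)
```

The third row belongs to the null partition and gets the aggregates of its single value, rather than nulls.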

Repro

I discovered in #3295 that when we join the aggregated result back, None values in the partition_by key(s) cause trouble:

import narwhals as nw

b = nw.col("b")
df = nw.from_dict(data, backend="pyarrow")
df.select(
    "i",
    b_min=b.min().over("a"),
    b_mean=b.mean().over("a"),
    b_first=b.first().over("a"),
    b_last=b.last().over("a"),
).sort("i").drop("i").to_polars()

shape: (5, 4)
┌───────┬────────┬─────────┬────────┐
│ b_min ┆ b_mean ┆ b_first ┆ b_last │
│ ---   ┆ ---    ┆ ---     ┆ ---    │
│ i64   ┆ f64    ┆ i64     ┆ i64    │
╞═══════╪════════╪═════════╪════════╡
│ 1     ┆ 2.0    ┆ 1       ┆ 3      │
│ 1     ┆ 2.0    ┆ 1       ┆ 3      │
│ 2     ┆ 2.5    ┆ 2       ┆ 3      │
│ 2     ┆ 2.5    ┆ 2       ┆ 3      │
│ null  ┆ null   ┆ null    ┆ null   │
└───────┴────────┴─────────┴────────┘

Our pandas impl raises on the same query:

df = nw.from_dict(data, backend="pandas")
df.select(
    "i",
    b_min=b.min().over("a"),
    b_mean=b.mean().over("a"),
    b_first=b.first().over("a"),
    b_last=b.last().over("a"),
).sort("i").drop("i")
ShapeError: Expected object of length 5, got length: 4

The error can be avoided by removing first and last, but we still get incorrect results for the remaining columns:

df = nw.from_dict(data, backend="pandas")
df.select("i", b_min=b.min().over("a"), b_mean=b.mean().over("a")).sort("i").drop(
    "i"
).to_polars()

shape: (5, 2)
┌───────┬────────┐
│ b_min ┆ b_mean │
│ ---   ┆ ---    │
│ f64   ┆ f64    │
╞═══════╪════════╡
│ 1.0   ┆ 2.0    │
│ 1.0   ┆ 2.0    │
│ null  ┆ null   │
│ 2.0   ┆ 2.5    │
│ 2.0   ┆ 2.5    │
└───────┴────────┘
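
The null row above is consistent with the null-keyed partition being lost before the join back. That is an assumption about the mechanism, not something confirmed here. The pure-Python sketch below only illustrates how dropping null keys during aggregation produces this shape of result; `aggregate` and `join_back` are hypothetical names, not narwhals internals.

```python
from statistics import fmean

data = {"a": [1, 1, None, 2, 2], "b": [1, 3, 3, 2, 3], "i": [0, 1, 2, 3, 4]}


def aggregate(keys, values, drop_null_keys):
    """Group values by key and return {key: (min, mean)}."""
    groups = {}
    for key, value in zip(keys, values):
        if key is None and drop_null_keys:
            continue  # assumed failure mode: the null partition never gets aggregated
        groups.setdefault(key, []).append(value)
    return {key: (min(vs), fmean(vs)) for key, vs in groups.items()}


def join_back(keys, aggregated):
    """Left-join the aggregates onto the original rows; unmatched rows get nulls."""
    return [aggregated.get(key, (None, None)) for key in keys]


dropped = join_back(data["a"], aggregate(data["a"], data["b"], drop_null_keys=True))
kept = join_back(data["a"], aggregate(data["a"], data["b"], drop_null_keys=False))

print(dropped)  # third row is (None, None)
print(kept)     # third row is (3, 3.0)
```

Keeping the null key as a partition of its own is enough to recover the expected values in this toy version.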

duckdb returns the correct result for those two columns:

df = nw.from_dict(data, backend="polars")
df.lazy("duckdb").select("i", b_min=b.min().over("a"), b_mean=b.mean().over("a")).sort(
    "i"
).drop("i").collect("polars").to_polars()

shape: (5, 2)
┌───────┬────────┐
│ b_min ┆ b_mean │
│ ---   ┆ ---    │
│ i64   ┆ f64    │
╞═══════╪════════╡
│ 1     ┆ 2.0    │
│ 1     ┆ 2.0    │
│ 3     ┆ 3.0    │
│ 2     ┆ 2.5    │
│ 2     ┆ 2.5    │
└───────┴────────┘

Adding order_by to the first and last calls gives the correct result for the other two columns as well:

df = nw.from_dict(data, backend="polars")
df.lazy("duckdb").select(
    "i",
    b_min=b.min().over("a"),
    b_mean=b.mean().over("a"),
    b_first=b.first().over("a", order_by="i"),
    b_last=b.last().over("a", order_by="i"),
).sort("i").drop("i").collect("polars").to_polars()

shape: (5, 4)
┌───────┬────────┬─────────┬────────┐
│ b_min ┆ b_mean ┆ b_first ┆ b_last │
│ ---   ┆ ---    ┆ ---     ┆ ---    │
│ i64   ┆ f64    ┆ i64     ┆ i64    │
╞═══════╪════════╪═════════╪════════╡
│ 1     ┆ 2.0    ┆ 1       ┆ 3      │
│ 1     ┆ 2.0    ┆ 1       ┆ 3      │
│ 3     ┆ 3.0    ┆ 3       ┆ 3      │
│ 2     ┆ 2.5    ┆ 2       ┆ 3      │
│ 2     ┆ 2.5    ┆ 2       ┆ 3      │
└───────┴────────┴─────────┴────────┘
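
To illustrate why an explicit order_by pins down first and last, here is a rough sketch in plain Python: within each partition the rows are ordered by i before the ends are taken, so the answer does not depend on the order in which rows arrive. `first_last_over` is a hypothetical helper, and treating None as its own partition is the same assumption as in the expected results.

```python
import random

data = {"a": [1, 1, None, 2, 2], "b": [1, 3, 3, 2, 3], "i": [0, 1, 2, 3, 4]}


def first_last_over(keys, values, order):
    """Return (first, last) per partition, with rows ordered by `order`."""
    groups = {}
    for key, value, position in zip(keys, values, order):
        groups.setdefault(key, []).append((position, value))
    ends = {}
    for key, members in groups.items():
        members.sort()  # order each partition by position (values of i are unique)
        ends[key] = (members[0][1], members[-1][1])
    return [ends[key] for key in keys]


# Shuffle the rows to mimic a backend that gives no row-order guarantee.
rows = list(zip(data["a"], data["b"], data["i"]))
random.Random(0).shuffle(rows)
a, b, i = (list(column) for column in zip(*rows))

result = first_last_over(a, b, i)
# Restore the original row order by i, like .sort("i") in the queries above.
restored = [pair for _, pair in sorted(zip(i, result))]
print(restored)
```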
