
Add Expr.statistics method based on Statistics #84


Closed
wants to merge 12 commits

Conversation

rjzamora (Member)

Alternate take on https://github.com/mrocklin/dask-expr/pull/40

  • Adds dask_expr.statistics module (defining a simple dataclass structure for statistics)
  • Adds Expr.statistics (which utilizes dask_expr.statistics)
  • Adds FrameBase.__len__ (which utilizes Expr.statistics when possible)

Rationale for requiring the values of the Expr.statistics dict to be dask_expr.statistics.Statistics instances: I think we will want to add and leverage many different kinds of "statistics." Therefore, I think we will need a class structure like this to simplify and isolate the logic that dictates if/how a specific Expr "parent" can assume a specific Statistics object from its child.
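For concreteness, here is a minimal sketch of the idea. The RowCountStatistics name, the "row-count" key, and the frame_len helper are illustrative stand-ins, not necessarily the exact code in this PR:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Statistics:
    """Container for one kind of statistics attached to an Expr."""

    data: Any


@dataclass
class RowCountStatistics(Statistics):
    """Exact total row count, known without computing the graph."""

    data: int


def frame_len(frame) -> int:
    """What FrameBase.__len__ can do when a row count is already known."""
    stat = frame.expr.statistics.get("row-count")  # illustrative key name
    if isinstance(stat, RowCountStatistics):
        return stat.data  # no compute needed
    # Otherwise fall back to actually computing the length
    return int(frame.map_partitions(len).sum().compute())
```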

rjzamora (Member, Author)

I'm interested to hear your thoughts on the general approach here, @mrocklin.

rjzamora closed this May 16, 2023
rjzamora reopened this May 16, 2023
rjzamora marked this pull request as ready for review May 16, 2023 20:54
@@ -0,0 +1,74 @@
from __future__ import annotations
Member

I could use some help understanding this module.

I think that in general we have yet to define what kinds of statistics we're going to capture, and how we plan to encode those. There are lots of options here.

I think what I'm seeing here is that your response is "we'll just make different classes for all the different kinds of things that people might want to encode". Is that correct? If so, I'm not totally bought into this just yet.

I think that the question of "how do we encode dataframe-level or partition-level statistics" is a big open one. I'm ok with us not having a clear answer on this before we move forward, but I want the level of sophistication of our solution to be correlated with our confidence. This feels like a somewhat sophisticated/specific solution (a few classes with some specific method APIs) but I don't have confidence that it's correct (or at least I don't know enough to be confident). Can you help me understand here?

Member Author

Hmmm. We may need to have a real-time chat about this one. My primary goal here was to keep things very simple, and so it worries me a bit that you see something sophisticated.

The general approach here is: “Adopt the same statistics approach suggested in #40, but use a simple data class as a container for the statistics so that we know if/how it should be passed from child to parent.” I only added the simple class structure to the mix after I started experimenting with row-count and min/max column statistics, and felt that there was unnecessary _statistics logic polluting several non-IO Expr classes. Since I know the statistics representation/framework is likely to evolve (or be replaced completely) in the future, I was hoping to keep the logic isolated. In the end, I decided to focus on the simple row-count case and propose a class structure that I expect to be relevant to all statistics: we need to hold some kind of statistics “data”, and we need to expose a mechanism that allows a specific kind of statistics to be passed between child and parent.
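To make that hand-off concrete, here is a rough sketch. The propagate method and the _preserves_row_count flag are invented for illustration:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Statistics:
    data: Any

    def propagate(self, parent) -> Statistics | None:
        """Return self if `parent` preserves this statistic, else None."""
        # Conservative default: unknown operations drop statistics
        return None


@dataclass
class RowCountStatistics(Statistics):
    data: int

    def propagate(self, parent) -> Statistics | None:
        # Row counts survive operations that never add or remove rows
        # (e.g. column projection or elementwise ops); the flag below
        # is invented for illustration
        if getattr(parent, "_preserves_row_count", False):
            return self
        return None
```

The point is that the if/how-to-pass decision lives on the Statistics subclass itself, so non-IO Expr classes never need statistics-specific branches.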

I suppose you are saying that you would prefer not to introduce classes until we know that those classes will capture some of the other kinds of statistics we will want to track (e.g. min/max/null column statistics, and “shuffled-by” information)? That request is perfectly fair. I’ll admit that part of the reason I didn’t include min/max column statistics in this PR is that I hadn’t decided on the best way to represent partition-wise column statistics.

Aside: My favorite column-statistics approach I’ve played with so far is to track a ColumnStatistics(Statistics) object for each column, where the data of that object is a ColumnMaxima(PartitionStatistics) object whose data is a tuple of {‘min’: …, ‘max’: …} dicts.
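Written out with hypothetical classes (following the naming above), that layout would look roughly like:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Statistics:
    data: Any


@dataclass
class PartitionStatistics(Statistics):
    """Per-partition statistics: `data` holds one entry per partition."""


@dataclass
class ColumnStatistics(Statistics):
    """Statistics attached to a single column."""


@dataclass
class ColumnMaxima(PartitionStatistics):
    data: tuple  # one {'min': ..., 'max': ...} dict per partition


# One ColumnStatistics object per column, nesting the per-partition maxima
x_stats = ColumnStatistics(
    data=ColumnMaxima(data=({"min": 0, "max": 499}, {"min": 500, "max": 999}))
)
```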



Another consideration is whether this design will allow us to push down “requests” for missing statistics into a ReadParquet expression at optimization time. I think the answer is “yes,” but this question is another reason I’d like to keep the statistics logic isolated in the meantime.
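As a very rough sketch of that push-down, where the traversal and the _request_statistics hook are entirely hypothetical:

```python
def push_down_statistics_requests(expr, wanted: set) -> None:
    """Hypothetical optimization pass: walk the expression tree and ask
    IO leaves to materialize statistics that downstream expressions need."""
    for dep in expr.dependencies():  # assuming an accessor for child Expr operands
        push_down_statistics_requests(dep, wanted)
    if type(expr).__name__ == "ReadParquet":  # avoid a hard import here
        # e.g. pull row counts or min/max from parquet footer metadata
        expr._request_statistics(wanted)  # hypothetical hook, not a real API
```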

Member Author

One thing I don't like about the design in this PR is that it still uses the dict approach from #40 (as-is) for tracking all known statistics. Whatever design we ultimately go with, we will probably need to enforce explicit rules for key names and collisions. I didn't deal with this yet, but it was certainly on my mind.

mrocklin (Member)

mrocklin commented May 17, 2023 via email
