
Add Expr.statistics method based on Statistics #84


Closed
wants to merge 12 commits

Conversation

rjzamora (Member)

Alternate take on https://github.com/mrocklin/dask-expr/pull/40

  • Adds dask_expr.statistics module (defining a simple dataclass structure for statistics)
  • Adds Expr.statistics (which utilizes dask_expr.statistics)
  • Adds FrameBase.__len__ (which utilizes Expr.statistics when possible)

Rationale for requiring the values of the Expr.statistics dict to be dask_expr.statistics.Statistics instances: I think we will want to add and leverage many different kinds of "statistics." Therefore, I think we will need a class structure like this to simplify and isolate the logic that dictates if/how a specific Expr "parent" can assume a specific Statistics object from its child.
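For concreteness, here is a minimal sketch of the idea. The RowCountStatistics name, the "row-count" key, and the frame_len helper are illustrative stand-ins, not necessarily the exact code in this PR:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Statistics:
    """Container for one kind of statistics attached to an Expr."""

    data: Any


@dataclass
class RowCountStatistics(Statistics):
    """Exact total row count, known without computing the graph."""

    data: int


def frame_len(frame) -> int:
    """What FrameBase.__len__ can do when a row count is already known."""
    stat = frame.expr.statistics.get("row-count")  # illustrative key name
    if isinstance(stat, RowCountStatistics):
        return stat.data  # no compute needed
    # Otherwise fall back to actually computing the length
    return int(frame.map_partitions(len).sum().compute())
```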

rjzamora (Member, Author)

I'm interested to hear your thoughts on the general approach here, @mrocklin.

rjzamora closed this May 16, 2023
rjzamora reopened this May 16, 2023
rjzamora marked this pull request as ready for review May 16, 2023 20:54
@@ -0,0 +1,74 @@
from __future__ import annotations
Member

I could use some help understanding this module.

I think that in general we have yet to define what kinds of statistics we're going to capture, and how we plan to encode those. There are lots of options here.

I think what I'm seeing here is that your response is "we'll just make different classes for all the different kinds of things that people might want to encode". Is that correct? If so, I'm not totally bought into this just yet.

I think that the question of "how do we encode dataframe-level or partition-level statistics" is a big open one. I'm ok with us not having a clear answer on this before we move forward, but I want the level of sophistication of our solution to be correlated with our confidence. This feels like a somewhat sophisticated/specific solution (a few classes with some specific method APIs) but I don't have confidence that it's correct (or at least I don't know enough to be confident). Can you help me understand here?

Member Author

Hmmm. We may need to have a real-time chat about this one. My primary goal here was to keep things very simple, and so it worries me a bit that you see something sophisticated.

The general approach here is: “Adopt the same statistics approach suggested in #40, but use a simple data class as a container for the statistics so that we know if/how it should be passed from child to parent.” I only added the simple class structure to the mix after I started experimenting with row-count and min/max column statistics, and felt that there was unnecessary _statistics logic polluting several non-IO Expr classes. Since I know the statistics representation/framework is likely to evolve (or be replaced completely) in the future, I was hoping to keep the logic isolated. In the end, I decided to focus on the simple row-count case and propose a class structure that I expect to be relevant to all statistics: we need to hold some kind of statistics “data”, and we need to expose a mechanism that allows a specific kind of statistics to be passed between child and parent.
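To make that hand-off concrete, here is a rough sketch. The propagate method and the _preserves_row_count flag are invented for illustration:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Statistics:
    data: Any

    def propagate(self, parent) -> Statistics | None:
        """Return self if `parent` preserves this statistic, else None."""
        # Conservative default: unknown operations drop statistics
        return None


@dataclass
class RowCountStatistics(Statistics):
    data: int

    def propagate(self, parent) -> Statistics | None:
        # Row counts survive operations that never add or remove rows
        # (e.g. column projection or elementwise ops); the flag below
        # is invented for illustration
        if getattr(parent, "_preserves_row_count", False):
            return self
        return None
```

The point is that the if/how-to-pass decision lives on the Statistics subclass itself, so non-IO Expr classes never need statistics-specific branches.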

I suppose you are saying that you would prefer not to introduce classes until we know that those classes will capture some of the other kinds of statistics we will want to track (e.g. min/max/null column statistics, and “shuffled-by” information)? That request is perfectly fair. I’ll admit that part of the reason I didn’t include min/max column statistics in this PR is that I hadn’t decided on the best way to represent partition-wise column statistics.

Aside: My favorite column-statistics approach I’ve played with so far is to track a ColumnStatistics(Statistics) object for each column, where the data of that object is a ColumnMaxima(PartitionStatistics) object whose data is a tuple of {‘min’: …, ‘max’: …} dicts.
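Written out with hypothetical classes (following the naming above), that layout would look roughly like:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Statistics:
    data: Any


@dataclass
class PartitionStatistics(Statistics):
    """Per-partition statistics: `data` holds one entry per partition."""


@dataclass
class ColumnStatistics(Statistics):
    """Statistics attached to a single column."""


@dataclass
class ColumnMaxima(PartitionStatistics):
    data: tuple  # one {'min': ..., 'max': ...} dict per partition


# One ColumnStatistics object per column, nesting the per-partition maxima
x_stats = ColumnStatistics(
    data=ColumnMaxima(data=({"min": 0, "max": 499}, {"min": 500, "max": 999}))
)
```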



Another consideration is whether this design will allow us to push down “requests” for missing statistics into a ReadParquet expression at optimization time. I think the answer is “yes,” but this question is another reason I’d like to keep the statistics logic isolated in the meantime.
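As a very rough sketch of that push-down, where the traversal and the _request_statistics hook are entirely hypothetical:

```python
def push_down_statistics_requests(expr, wanted: set) -> None:
    """Hypothetical optimization pass: walk the expression tree and ask
    IO leaves to materialize statistics that downstream expressions need."""
    for dep in expr.dependencies():  # assuming an accessor for child Expr operands
        push_down_statistics_requests(dep, wanted)
    if type(expr).__name__ == "ReadParquet":  # avoid a hard import here
        # e.g. pull row counts or min/max from parquet footer metadata
        expr._request_statistics(wanted)  # hypothetical hook, not a real API
```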

Member Author

One thing I don't like about the design in this PR is that it still uses the dict approach from #40 (as-is) for tracking all known statistics. Whatever design we ultimately go with, we will probably need to enforce explicit rules for key names and collisions. I didn't deal with this yet, but it was certainly on my mind.

mrocklin (Member)

mrocklin commented May 17, 2023 via email
