-
-
Notifications
You must be signed in to change notification settings - Fork 27
Add Expr.statistics
method based on Statistics
#84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm interested to know your thoughts on the general approach here @mrocklin |
@@ -0,0 +1,74 @@ | |||
from __future__ import annotations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could use some help understanding this module.
I think that in general we have yet to define what kinds of statistics we're going to capture, and how we plan to encode those. There are lots of options here.
I think what I'm seeing here is that your response is "we'll just make different classes for all the different kinds of things that people might want to encode". Is that correct? If so, I'm not totally bought into this just yet.
I think that the question of "how do we encode dataframe-level or partition-level statistics" is a big open one. I'm ok with us not having a clear answer on this before we move forward, but I want the level of sophistication of our solution to be correlated with our confidence. This feels like a somewhat sophisticated/specific solution (a few classes with some specific method APIs) but I don't have confidence that it's correct (or at least I don't know enough to be confident). Can you help me understand here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. We may need to have a real-time chat about this one. My primary goal here was to keep things very simple, and so it worries me a bit that you see something sophisticated.
The general approach here is: “Adopt the same statistics
approach suggested in #40, but use a simple data class as a container for the statistics so that we know if/how it should be passed from child to parent.” I only added the simple class structure to the mix after I started experimenting with row-count and min/max column statistics, and felt that there was unnecessary _statistics
logic polluting several non-IO Expr
classes. Since I know the statistics representation/framework is likely to evolve (or be replaced completely) in the future, I was hoping to keep the logic isolated. In the end, I decided to focus on the simple row-count case, and propose a class structure that I expect to be relevant to all statistics: We need hold some kind of statistics “data”, and we need to expose a mechanism to allow the passing of a specific kind of statistics between child and parent.
I suppose you are probably saying that that you would prefer not to introduce classes until we know that those classes will capture some of the other kinds of statistics we will want to track (e.g. min/max/null column statistics, and “shuffled-by” information)? This request is perfectly fair. I’ll admit that part of the reason I didn’t include min/max column statistics in this PR is that I hadn’t decided on the best way to represent partition-wise column statistics.
Aside: My favorite column-statistics approach I’ve played with so far is to track a ColumnStatistics(Statistics)
object for each column, and for the data
of that object to be a ColumnMaxima(PartitionStatistics)
object where data
is a tuple of {‘min’: …, ‘max’: …}
dicts.
Another consideration is whether this design will allow us to push down “requests” for missing statistics into a ReadParquet
expression at optimization time. I think the answer is “yes,” but this question is another reason I’d like to keep the statistics logic isolated in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I don't like about the design in this PR is that it still uses the dict
approach (as is) from #40 for tracking all known statistics. Whatever design we ultimately go with, we will probably need to enforce explicit rules for key names and collisions. I didn't bother to deal with this yet, but it was certainly on my mind.
Happy with a real-time chat any time. My afternoon is pretty open after
1pm. (although I might want to check out a bit earlier today)
…On Wed, May 17, 2023 at 8:53 AM Richard (Rick) Zamora < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In dask_expr/statistics.py
<https://github.com/mrocklin/dask-expr/pull/84#discussion_r1196561344>:
> @@ -0,0 +1,74 @@
+from __future__ import annotations
One thing I *don't* like about the design in this PR is that it still
uses the dict approach (as is) from #40
<https://github.com/mrocklin/dask-expr/pull/40> for tracking all known
statistics. Whatever design we ultimately go with, we will probably need to
enforce explicit rules for key names and collisions. I didn't bother to
deal with this yet, but it was certainly on my mind.
—
Reply to this email directly, view it on GitHub
<https://github.com/mrocklin/dask-expr/pull/84#discussion_r1196561344>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTDMKPUFYAXDYB36OL3XGTJ6FANCNFSM6AAAAAAYEFRQ3I>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Alternate take on https://github.com/mrocklin/dask-expr/pull/40
dask_expr.statistics
module (defining a simpledataclass
structure for statistics)Expr.statistics
(which utilizesdask_expr.statistics
)FrameBase.__len__
(which utilizesExpr.statistics
when possible)Rational for requiring
Expr.statistics: dict
to bedask_expr.statistics.Statistics
instances: It think we will be wanting to add/leverage many different kinds of "statistics." Therefore, I think we will need a class structure like this to simplify and isolate the logic that dictates if/how a specificExpr
"parent" can assume a specificStatistics
object from its child.