StatisticsV2: initial statistics framework redesign #14699
Conversation
nope, it will seem like
cardinality is a term related to intervals, and we already have a function for cardinality calculations as a method of `Interval`
This framework provides Uniform, Exponential, Gaussian, Bernoulli, and Unknown distributions. The first four variants represent well-known probability distributions, while the Unknown variant serves as a fallback option where the exact distribution type is unspecified. However, key statistical parameters such as mean, median, variance, and range can still be provided there (as these parameters are already meaningful for optimization and decision-making processes). If you require specific details about these distribution types or their parameters, you can refer to the links provided in the docstrings. Additionally, if you're interested in further exploring their interactions -- PDF computations -- I can suggest Wolfram Mathematica. BTW, you can also define your own or other known distribution types easily: just define its parameters and implement the computations with the other types.
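For a rough feel of the construction API (the `new_bernoulli` constructor appears elsewhere in this PR's diff; the other names and exact signatures shown here are assumptions for illustration):

```rust
use datafusion_common::ScalarValue;
use datafusion_expr::interval_arithmetic::Interval;

// A boolean-valued estimate: Bernoulli with p = probability of `true`
// (constructor appears in this diff; exact signature assumed).
let filter_stats = StatisticsV2::new_bernoulli(ScalarValue::Float64(Some(0.25)))?;

// A fallback estimate where only summary statistics are known
// (constructor shape taken from elsewhere in this diff).
let mean = ScalarValue::Float64(Some(10.0));
let median = ScalarValue::Float64(Some(9.5));
let variance = ScalarValue::Float64(Some(4.0));
let range = Interval::make(Some(0.0_f64), Some(20.0_f64))?;
let fallback = UnknownDistribution::try_new(mean, median, variance, range)
    .map(StatisticsV2::Unknown)?;
```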
I guess what I am saying (though I am not really sure about it) is that maybe Postgres (and Oracle, https://docs.oracle.com/en/database/oracle/oracle-database/19/tgsql/histograms.html) use histograms because most data doesn't follow a "known probability distribution", but I am not sure, honestly. It's just "stuff that I was working on recently".
@edmondop, maybe I can offer some clarification here. What we want is a computational framework that gives us how statistical quantities transform under functions defined by expressions. Once we have the machinery that does this, we can build all sorts of layers on top of it for answering column-level and table-level statistical questions. So how do we go about doing this? There are four cases in "forward" mode:
Cases 1, 2 and 3 are quite common. Case 4 happens rarely, with special types of expressions. There is also the "reverse" mode, where we have information about the statistics of the result (e.g. when we have a filter that forces a composite expression to be true), which enables us to update our information about the distributions of constituent expressions by recursively applying the Bayes rule.

With this general explanation out of the way, let's go back to the specifics of your question. In this light, your question about histograms basically boils down to how we represent unknown distributions. Histograms are one way of doing this. Moments are another. In the initial implementation, we represent unknown distributions using various summary statistics. If this turns out to be insufficient, we can add an attribute to the unknown distribution variant of the enum to store histogram information as well. If we do this, the entire machinery will stay the same -- we will only need to update the encapsulated code that handles how unknown distributions are updated. So it would actually be a small-ish PR to do this 🙂

I hope this helps. Thanks for helping with reviewing 🚀
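As a toy instance of the "reverse" mode described above (numbers invented purely for illustration): suppose a filter forces the predicate $x > 5$ to be true, and we previously modeled $x \sim \mathrm{Uniform}(0, 10)$. Conditioning via the Bayes rule gives

$$
f_{X \mid X > 5}(x) = \frac{f_X(x)\,\mathbf{1}[x > 5]}{P(X > 5)} = \frac{(1/10)\,\mathbf{1}[5 < x \le 10]}{1/2} = \frac{1}{5}, \qquad 5 < x \le 10,
$$

i.e. the propagated distribution is $\mathrm{Uniform}(5, 10)$, and the same kind of update recurses into the children of the filtered expression.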
Thank you for the review @xudong963. Here are my thoughts on your questions:
Indeed, column and table statistics will be built on top of this framework. So, as @berkaysynnada mentions, we expect to have one such object per estimated quantity.

Coming back to the information flow from data sources: if the user supplies distributional information, it will be used by the leaf nodes as we evaluate/propagate statistics in expression graphs. Otherwise, we will fall back on the unknown distribution for leaf nodes, whose defining summary statistics can be automatically generated.

In this context, your suggestion about sampling makes a lot of sense. There is no reason why we can't use statistical tests to "recognize" distributions and use recognized distributions instead of directly falling back to unknown distributions in such cases. Actually, thinking about it, doing this would be a fantastic follow-up project once we have the basics in place 🙂
Do you mean things like distinct counts? I think we will be able to see how well we estimate such things probabilistically once we finalize this and rework column/table stats with the new framework. In the worst case, all the calculus will work through unknown distributions and we will not be in a worse position than where we were before (sans bugs). In cases where we can avoid loss of statistical information, we will end up with better estimations.
I'm not sure. I don't know of any that describes exactly the same thing as what we are doing here, but the approach is somewhat similar to how belief propagation in probabilistic graphical models works (though not the same). It may be an interesting idea to write something up once we finalize all the details.
FYI @clflushopt, as I think this may be related to this as well.
Thank you @Fly-Style @ozankabak and @berkaysynnada -- I think this is a very cool idea and brings some much needed rigor to the handling of statistics.
I have some concerns about the specifics of how the distributions are encoded, but the general idea of encapsulating the details of a distribution behind an API / interface is really, really nice.
```rust
/// statistics accordingly. The default implementation simply creates an
/// unknown output distribution by combining input ranges. This logic loses
/// distribution information, but is a safe default.
fn evaluate_statistics(&self, children: &[&StatisticsV2]) -> Result<StatisticsV2> {
```
This is very cool -- I love this as a building block
One suggestion in terms of API design is that `&[&StatisticsV2]` pretty much requires using `Vec`s. I recommend adding some structure like `TableStatisticsV2` or `RelationStatisticsV2` that encapsulates the notion of a collection. Something like:

```rust
struct RelationStatisticsV2 {
    ...
}

impl RelationStatisticsV2 {
    /// Return statistics for column `idx`
    fn column(&self, idx: usize) -> &StatisticsV2 { ... }
}
```
That would make it easier to avoid copying / change underlying representations
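For illustration, a caller could then write something like (hypothetical, following the sketch above):

```rust
// Borrow the statistics of column 0 without materializing a Vec of references.
let col0_stats: &StatisticsV2 = relation_stats.column(0);
```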
The `i`th element of the `children` slice here denotes the statistics of the `i`th child expression. It follows the same pattern as how `evaluate_bounds` works.

Once this PR merges and we have the machinery to calculate statistics of scalar values defined by an expression tree, we will indeed move on to things like `TableStatistics`, `ColumnStatistics` and others, which will be built on top of this machinery. So stay tuned 🚀
I did some more research into the current code, which has:

- `Statistics`, which has table-level statistics, such as statistics for columns and the row count and distinct count
- `ColumnStatistics`, which has column-level statistics

In this PR:

- `[&StatisticsV2]` is equivalent to `Statistics` (distribution of multiple columns)
- `StatisticsV2` is equivalent to `ColumnStatistics` (distribution of a single column)

In order to have the names be consistent, I recommend:

- Renaming `StatisticsV2` to `ColumnStatisticsV2`
- Introducing a `StatisticsV2` that holds a set of column statistics

UPDATE -- I think calling this `Distribution` might more accurately describe what it is trying to do.
Or maybe I misunderstand what `StatisticsV2` is for -- if it is only meant to represent distributions of values, perhaps we should call it `Distribution` instead?
> if it is only meant to represent distributions of values

This is indeed the case. It will replace `Precision` in the current code.

The hierarchy we had in mind was:

1. `Statistics(V2)`: Represents statistical information (e.g. distribution, mean, variance) of a single (possibly unknown) value or an estimate. This is the focus of this PR, which provides the baseline mechanism to evaluate this for arbitrary expressions.
2. `ColumnStatistics`: It will collect a bunch of `Statistics(V2)` objects that represent estimations about the population of values in a column; e.g. its maximum value, average etc.
3. `TableStatistics`: Similar to 2, but for relations.
Revamp of the current implementations of 2 and 3, based on 1, will be the focus of subsequent PRs.
That makes sense. Therefore I recommend renaming `StatisticsV2` to `Distribution`. This seems more consistent given that all the variants are already named "XYZDistribution", such as `UniformDistribution`, `ExponentialDistribution`, etc.

```rust
pub enum Distribution {
    Uniform(UniformDistribution),
    Exponential(ExponentialDistribution),
    Gaussian(GaussianDistribution),
    Bernoulli(BernoulliDistribution),
    Unknown(UnknownDistribution),
}
```
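One nice property of this shape is that distribution-specific math stays local to each variant. As a sketch of how a summary statistic could dispatch over the enum (method names here are illustrative, not necessarily this PR's exact API):

```rust
impl Distribution {
    pub fn mean(&self) -> Result<ScalarValue> {
        match self {
            // Midpoint of the range for a uniform distribution.
            Distribution::Uniform(u) => u.mean(),
            // offset + 1/rate for an exponential distribution.
            Distribution::Exponential(e) => e.mean(),
            // Stored directly as a distribution parameter.
            Distribution::Gaussian(g) => g.mean(),
            // The mean of Bernoulli(p) is p itself.
            Distribution::Bernoulli(b) => b.mean(),
            // Falls back on the stored summary statistic, if any.
            Distribution::Unknown(u) => u.mean(),
        }
    }
}
```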
Reasonable - let's do it
```rust
} else {
    ScalarValue::try_from(&dt)
}?;
StatisticsV2::new_bernoulli(p)
```
I don't understand why this would assume something about the distribution of the values (as in, why does it assume a boolean variable has a Bernoulli distribution 🤔)
Because there is no other choice :) The only applicable distribution for a boolean variable is the Bernoulli distribution. The Bernoulli distribution is just the stats term for a boolean variable with a parameter for the "probability of being true".
I think a Bernoulli describes the distribution of expected outcomes of a binary random variable, rather than the distribution of values within an (existing) population.

In the context of database systems, I don't think it is common to model the distribution of values in a column as though they were the output of a random variable.

I would expect the output distribution of a boolean expression to be something like:

- Uniform (all values equally likely)
- Skewed (e.g. 25% of values expected to be true, 50% expected to be false, 25% expected to be NULL)
I think the main confusion here is that we are not at the column level yet -- column level work will come after we have a calculus for single values/estimations
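To make the single-value view concrete with a toy calculation (a simplified `f64` sketch, not the PR's code): comparing a uniformly distributed value against a constant yields a Bernoulli whose `p` is the fraction of the range satisfying the predicate.

```rust
/// Toy estimate of P(x < c) for x ~ Uniform(lo, hi), clamped to [0, 1].
fn lt_selectivity(lo: f64, hi: f64, c: f64) -> f64 {
    ((c - lo) / (hi - lo)).clamp(0.0, 1.0)
}

fn main() {
    // x ~ Uniform(0, 10): the predicate `x < 4` holds with probability 0.4,
    // so the comparison's statistics are Bernoulli(p = 0.4).
    assert_eq!(lt_selectivity(0.0, 10.0, 4.0), 0.4);
}
```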
```rust
use datafusion_common::rounding::alter_fp_rounding_mode;
use datafusion_common::{internal_err, not_impl_err, Result, ScalarValue};

/// New, enhanced `Statistics` definition, represents five core statistical
```
While these 5 distributions are very cool, I am not sure I have ever run into them being used in a practical database system (as real world data often doesn't neatly follow any existing distribution)
As I understand it, typically statistics estimation is done via:
- Assume a uniform distribution (often not particularly accurate, but very simple to implement and reason about)
- Use equi-height histograms measured across the data
- Some sort of sketch for distinct values and correlation between columns
That being said, I am not a statistics expert
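For reference, the equi-height idea mentioned above in miniature (a generic sketch, not DataFusion code): bucket boundaries are chosen so each bucket covers roughly the same number of rows, which adapts to skew automatically.

```rust
/// Pick equi-height bucket boundaries from a sorted sample.
fn equi_height_bounds(sorted: &[f64], buckets: usize) -> Vec<f64> {
    (0..=buckets)
        .map(|i| sorted[i * (sorted.len() - 1) / buckets])
        .collect()
}

fn main() {
    // Skewed data: boundaries crowd where values are dense, so the
    // outlier 100.0 does not waste most of the buckets.
    let data = [1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 100.0];
    assert_eq!(equi_height_bounds(&data, 3), vec![1.0, 1.0, 2.0, 100.0]);
}
```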
So I guess in my mind I see the following challenges:
- I am not sure about the practical use of several of these distributions
- There doesn't seem to be an easy (aka not having to change DataFusion's code) way to add other methods of statistic calculation
Instead of an enum-type approach, what would you think about a trait-style one? This would allow users to encode arbitrary information about their distributions without changes to the core.
Something like:
```rust
/// Describes how data is distributed
pub trait Distribution {
    /// Return the mean of this distribution
    fn mean(&self) -> Result<ScalarValue>;
    /// Return the range of this distribution
    fn range(&self) -> Result<Interval>;
    fn data_type(&self) -> DataType;
    ...
}

/// DataFusion provides some built-in distributions
impl Distribution for UnknownDistribution {
    ...
}

impl Distribution for UniformDistribution {
    ...
}
...
```
I think the challenge of the above is to figure out how the API looks to compute the distributions for different physical exprs (as the calculation is going to be different for different types of input distributions 🤔)
Yes, that is indeed the real challenge -- evaluation and propagation procedures need to match distribution types. For example, adding two normally distributed (Gaussian) quantities also results in a normally distributed quantity. During the design phase, @berkaysynnada and I discussed whether we could use a trait-based approach, but couldn't find a way to do this without excessive downcasting.
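For instance, a minimal sketch of the Gaussian-addition rule mentioned above (simplified `f64` fields; the PR's types use `ScalarValue`):

```rust
/// Simplified stand-in for the PR's GaussianDistribution.
struct Gaussian {
    mean: f64,
    variance: f64,
}

/// For *independent* X ~ N(m1, v1) and Y ~ N(m2, v2),
/// X + Y ~ N(m1 + m2, v1 + v2).
fn add(x: &Gaussian, y: &Gaussian) -> Gaussian {
    Gaussian {
        mean: x.mean + y.mean,
        variance: x.variance + y.variance,
    }
}
```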
What I would love to support is a "bring your own statistics / cost model" approach -- where, following DataFusion's other features, it:

- includes a solid, though basic, implementation of statistics (e.g. min/max and maybe distinct value counts)
- has APIs that help people implement fancier statistics / models if they wanted

Some fancier models I imagine people would build are:

- Histograms (this is likely the first thing people would do)
- Multi-column sketches / samples / fancier research stuff

So basically my ideal outcome would be if we knew how `StatisticsV2` would allow such a thing to be implemented, and I think that will require a trait in some form. It could end up being another variant of `StatisticsV2`, like `StatisticsV2::User` or something (like we did for `LogicalPlan`).
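A rough sketch of what such a variant could look like (entirely hypothetical, mirroring the `LogicalPlan::Extension` pattern):

```rust
use std::sync::Arc;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::interval_arithmetic::Interval;

/// Hypothetical extension hook for user-defined distributions.
pub trait UserDistribution: std::fmt::Debug + Send + Sync {
    fn mean(&self) -> Result<ScalarValue>;
    fn range(&self) -> Result<Interval>;
}

pub enum StatisticsV2 {
    // ... built-in variants elided ...
    /// Escape hatch carrying a user-provided implementation.
    User(Arc<dyn UserDistribution>),
}
```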
Hmm, I will think about how we can make a `User` variant work (if possible). Or maybe this belongs at the column/table level instead of the single-value level.
```rust
Exponential(ExponentialDistribution),
Gaussian(GaussianDistribution),
Bernoulli(BernoulliDistribution),
Unknown(UnknownDistribution),
```
The `Unknown` name might be misleading; it looks like it is based on mean, median, and variance:

```rust
UnknownDistribution::try_new(mean, median, variance, range).map(Unknown)
```

I would think "unknown" means no distribution, or an unusable one, but here it is likely usable.
Unknown distribution is indeed usable. It just means that we don't have exact information about how the random variable is distributed, and we are resorting to summary statistics to describe a generic distribution about which we have no special knowledge.
If the naming trips up more people, maybe we can change it to something like `GenericDistribution`. I guess we will have enough eyes on this to know before we make a release with this code.
I like it: `GenericDistribution`, or maybe `MetricsDistribution` or `AvgDistribution`, to show the user the inner nature of this sort of distribution.
I can review and make the necessary changes to #14735 once this gets merged, if that is necessary @alamb. I skimmed this PR earlier and just made a small pass through; this is some great work and will bring much of the necessary background formalization that we might need for further work on cardinality estimation and join ordering. Thanks!
This makes a lot of sense, thanks for your clear explanation. Now I understand how the distribution works and the difference between the current statistics model and the original min/max/nv.

Yes,

Makes sense; after StatisticsV2, the worst case is falling back to the original behavior.

Thanks, looking forward!
Thanks for all the comments and questions. I've incorporated the naming suggestion by @alamb (and updated many comments and variable names accordingly), switching to `Distribution`.

I think this is good to go -- we can proceed with the follow-up tasks, starting with the higher-level column/table statistics built on this framework. I will wait for a day or so for more feedback in case there is any that we missed.
+1, I understand this PR is the base to go forward from.
Thank you -- I think this is a very nice place to start from.
Thank you all for the reviews and the discussions! Let's keep the momentum going and build a versatile statistics framework for DataFusion 🚀
* StatisticsV2: initial definition and validation method implementation
* Implement mean, median and standard deviation extraction for StatsV2
* Move stats_v2 to `physical-expr` package
* Introduce `ExprStatisticGraph` and `ExprStatisticGraphNode`
* Split the StatisticsV2 and statistics graph locations, prepare the infrastructure for stats top-down propagation and final bottom-up calculation
* Calculate variance instead of std_dev
* Create a skeleton for statistics bottom-up evaluation
* Introduce high-level test for `evaluate_statistics()`
* Refactor result distribution computation during the statistics evaluation phase; add compute_range function
* Always produce Unknown distribution in non-mentioned combination cases, todos for the future
* Introduce Bernoulli distribution to be used as result of comparisons and inequations distribution combinations
* Implement initial statistics propagation of Uniform and Unknown distributions with known ranges
* Implement evaluate_statistics for logical not and unary negation operator
* Fix and add tests; make fmt happy
* Add integration test, implement conversion into Bernoulli distribution for Eq and NotEq
* Finish test, small cleanup
* minor improvements
* Update stats.rs
* Addressing review comments
* Implement median computation for Gaussian-Gaussian pair
* Update stats_v2.rs
* minor improvements
* Addressing second review comments, part 1
* Return true in other cases
* Finish addressing review requests, part 2
* final clean-up
* bug fix
* final clean-up
* apply reverse logic in stats framework as well
* Update cp_solver.rs
* revert data.parquet
* Apply suggestions from code review
* Update datafusion/physical-expr-common/src/stats_v2.rs
* Update datafusion/physical-expr-common/src/stats_v2.rs
* Apply suggestions from code review | Fix links
* Fix compilation issue
* Fix mean/median formula for exponential distribution
* casting + exp dir + remove opt's + is_valid refactor
* Update stats_v2_graph.rs
* remove inner mod
* last todo: bernoulli propagation
* Apply suggestions from code review
* Apply suggestions from code review
* prop_stats in binary
* Update binary.rs
* rename intervals
* block explicit construction
* test updates
* Update binary.rs
* revert renaming
* impl range methods as well
* Apply suggestions from code review
* Apply suggestions from code review
* Update datafusion/physical-expr-common/src/stats_v2.rs
* Update stats_v2.rs
* fmt
* fix bernoulli or eval
* fmt
* Review
* Review Part 2
* not propagate
* clean-up
* Review Part 3
* Review Part 4
* Review Part 5
* Review Part 6
* Review Part 7
* Review Part 8
* Review Part 9
* Review Part 10
* Review Part 11
* Review Part 12
* Review Part 13
* Review Part 14
* Review Part 15 | Fix equality comparisons between uniform distributions
* Review Part 16 | Remove unnecessary temporary file
* Review Part 17 | Leave TODOs for real-valued summary statistics
* Review Part 18
* Review Part 19 | Fix variance calculations
* Review Part 20 | Fix range calculations
* Review Part 21
* Review Part 22
* Review Part 23
* Review Part 24 | Add default implementations for evaluate_statistics and propagate_statistics
* Review Part 25 | Improve docs, refactor statistics graph code
* Review Part 26
* Review Part 27
* Review Part 28 | Remove get_zero/get_one, simplify propagation in statistics graph
* Review Part 29
* Review Part 30 | Move statistics-combining functions to core module, polish tests
* Review Part 31
* Review Part 32 | Module reorganization
* Review Part 33
* Add tests for bernoulli and gaussians combination
* Incorporate community feedback
* Fix merge issue

---------

Co-authored-by: Sasha Syrotenko <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Rationale for this change
The Statistics framework in DataFusion is a foundational component for query planning and execution. It provides metadata about datasets, enabling optimization decisions and influencing runtime behavior. This patch comprehensively redesigns the statistics representation by transitioning to an enum-based structure that supports multiple distribution types, offering greater flexibility and expressiveness.
Worth mentioning: this first PR only introduces the statistics framework; it does not integrate it into the existing infrastructure. There are also TODOs present, which will be addressed in smaller follow-up PRs so as not to overload the scope of this one.
What changes are included in this PR?
This patch presents a Statistics v2 framework with the following main points:

- Gaussian distributions are described by `mean` and `variance`, and Exponential distributions by `rate` and `offset`.
- Unknown distributions carry `mean`, `median`, `variance`, and `range` properties.
- Statistics evaluation is implemented for the `negate` and logical `not` operators. In general, forward evaluation (i.e. combining expressions) falls back on `interval_arithmetic` methods where no distribution-specific rule applies (see the sketch after this list).
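As a rough illustration of that interval fallback (using DataFusion's existing `Interval` type; the statistics-level wrapping is per this PR):

```rust
use datafusion_expr::interval_arithmetic::Interval;

fn main() -> datafusion_common::Result<()> {
    // With no distribution information for `a` and `b`, the statistics of
    // `a + b` degrade to an Unknown distribution whose range comes from
    // interval arithmetic: [0, 10] + [5, 7] = [5, 17].
    let a = Interval::make(Some(0_i64), Some(10_i64))?;
    let b = Interval::make(Some(5_i64), Some(7_i64))?;
    assert_eq!(a.add(&b)?, Interval::make(Some(5_i64), Some(17_i64))?);
    Ok(())
}
```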
Plan for the future changes:

- Deprecate the `Precision` enum and replace its usages with the new `Statistics`.
- `Uniform`: for datasets with known bounds and even distributions.
- `Exponential`: for datasets with high skew.
- `Gaussian`: for datasets with sufficient samples approximating a normal distribution.
- `Unknown`: fallback option when no specific distribution can be inferred.

Are these changes tested?
Yes, these changes are tested mostly with unit tests, and also with one integration test.
P.S. Although I am the one opening the PR, there was a huge effort from @berkaysynnada and @ozankabak to shape this change. I want to express huge gratitude to them.