
StatisticsV2: initial statistics framework redesign #14699

Merged
100 commits merged into apache:main on Feb 24, 2025

Conversation

Fly-Style
Contributor

@Fly-Style Fly-Style commented Feb 16, 2025

Rationale for this change

The Statistics framework in DataFusion is a foundational component for query planning and execution. It provides metadata about datasets, enabling optimization decisions and influencing runtime behaviors. This patch comprehensively redesigns the Statistics representation by transitioning to an enum-based structure that supports multiple distribution types, offering greater flexibility and expressiveness.

Worth mentioning: this first PR only introduces the statistics framework and does not integrate it into the existing infrastructure. There are also TODOs present, which will be addressed in smaller follow-up PRs so as not to overload the scope of this PR.

What changes are included in this PR?

This patch presents a Statistics v.2 framework with the following main points:

  • introduces an enum-based structure to support multiple distribution types, which initially include:
    • Uniform distribution (interval)
    • Gaussian distribution, parametrized with mean and variance
    • Exponential distribution, parametrized with rate and offset
    • Bernoulli distribution, which holds a probability and is used as the resulting distribution of comparison operators,
    • Unknown distribution, which abstracts any non-represented distribution or serves as a fallback option. It is parametrized with mean, median, variance, and range properties.
  • revamps the tree-based interval evaluation and propagation machinery for the new statistics framework (while keeping the old statistics in the codebase), with support for the most useful binary operators, unary negation, and logical NOT. In general, forward evaluation (i.e. combining expressions $X$ and $Y$ to create $Z = f(X, Y)$) involves assuming that $X$ and $Y$ are independent and calculating the probability distribution of the output $Z$ under that assumption. Even though this is not the case in general, it gives “conservative” distributions by preserving “all outcomes”. On the other hand, propagation (i.e. reflecting a change in $Z$ back to the inputs $X$ and $Y$) involves applying Bayes’ rule to update the distributions of $X$ and $Y$; see the illustrative formulas after this list.
  • introduces new interval_arithmetic methods and extends existing ones.
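
For concreteness, an illustrative textbook formulation of the two directions (not code in this PR): for $Z = X + Y$ with $X$ and $Y$ assumed independent, forward evaluation amounts to convolving the input densities,

$$f_Z(z) = \int f_X(x)\, f_Y(z - x)\, dx,$$

while propagation applies Bayes' rule to update an input given new information about the output:

$$f_{X \mid Z}(x \mid z) = \frac{f_{Z \mid X}(z \mid x)\, f_X(x)}{f_Z(z)}.$$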

Plan for future changes:

  • Remove the Precision enum and replace its usages with new Statistics.
  • Enhance operators to infer statistics:
    • Uniform: for datasets with known bounds and even distributions.
    • Exponential: for datasets with high skew.
    • Gaussian: for datasets with sufficient samples approximating a normal distribution.
    • Unknown: fallback option, when no specific distribution can be inferred.

Are these changes tested?

Yes, these changes are tested mostly with unit tests, and also with one integration test.

P.S. Although I opened the PR, @berkaysynnada and @ozankabak put in a huge effort to shape this change. I want to express my deep gratitude to them.

Sasha Syrotenko and others added 30 commits January 8, 2025 18:46
…frastructure for stats top-down propagation and final bottom-up calculation
@berkaysynnada
Contributor

berkaysynnada commented Feb 20, 2025

so the min/max/ndv of ColumnStatistic will be removed, and the ColumnStatistics will look like this: pub struct ColumnStatistics{stat: StatisticsV2}, right?

Nope, it will look like:

pub struct ColumnStatistics {
    /// Number of null values on column
    pub null_count: StatisticsV2<usize>,
    /// Maximum value of column
    pub max_value: StatisticsV2<ScalarValue>,
    /// Minimum value of column
    pub min_value: StatisticsV2<ScalarValue>,
    /// Sum value of a column
    pub sum_value: StatisticsV2<ScalarValue>,
    /// Number of distinct values
    pub distinct_count: StatisticsV2<usize>,
}

Have we done any work on the accuracy of the new statistical information during cardinality estimation?

Cardinality is a term related to intervals, and we already have a function for cardinality calculations as a method of the Interval struct.

Are there certain papers that describe this statistical information framework in more detail?

This framework provides Uniform, Exponential, Gaussian, Bernoulli, and Unknown distributions. The first four variants represent well-known probability distributions, while the Unknown variant serves as a fallback option where the exact distribution type is unspecified. However, key statistical parameters such as mean, median, variance, and range can still be provided there (as these parameters are already meaningful for optimization and decision-making processes).

If you require specific details about these distribution types or their parameters, you can refer to the links provided in the docstrings. Additionally, if you're interested in further exploring their interactions (PDF computations), I can suggest Wolfram Mathematica.

BTW, you can also easily define your own or other known distribution types. Just define their parameters and implement the computations with the other types.

@edmondop
Contributor

I left some comments (I will still go through a second round of review).
One side question: this design is different from the one in Postgres (https://www.postgresql.org/docs/current/view-pg-stats.html), which uses histograms. Are histograms more suitable for OLTP workloads than OLAP? I really don't know much, but I was curious about this choice.

What I see is that those PG stats are table- and column-level statistics at the user level. What we're building here is a foundational statistics infrastructure that serves as a basis for other statistical concepts. It is designed to satisfy various computational requirements and parameters (and they are extensible). It is built to be robust and resistant to errors. If you prefer displaying a stat as a histogram at some point, you can easily convert these new distributions into histograms using a few converter functions.

I guess what I am saying (but I am not really sure about it) is that maybe Postgres (and Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/tgsql/histograms.html) use histograms because most data doesn't follow a "known probability distribution", but I am not sure honestly. It's just "stuff that I was working on recently".

@ozankabak
Contributor

ozankabak commented Feb 21, 2025

@edmondop, maybe I can offer some clarification here. What we want is a computational framework that gives us how statistical quantities transform under functions defined by expressions. Once we have the machinery that does this, we can build all sorts of layers on top of it for answering column-level and table-level statistical questions.

So how do we go about doing this? There are four cases in "forward" mode:

  1. Statistical quantity with a known/estimated distribution ----> expression ----> New statistical quantity with a known/estimated distribution.
  2. Statistical quantity with a known/estimated distribution ----> expression ----> New statistical quantity with an unknown distribution.
  3. Statistical quantity with an unknown distribution ----> expression ----> New statistical quantity with an unknown distribution.
  4. Statistical quantity with an unknown distribution ----> expression ----> New statistical quantity with a known/estimated distribution.

Cases 1, 2 and 3 are quite common. Case 4 happens rarely with special types of expressions. There is also the "reverse" mode where we have information about the statistics of the result (e.g. when we have a filter that forces a composite expression to be true), which enables us to update our information about the distributions of constituent expressions by recursively applying the Bayes rule.
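
As a concrete illustration (a standard example, not drawn from this PR): if $X \sim \mathrm{Uniform}[0, 100]$ and a filter forces the predicate $X > 90$ to be true, applying Bayes' rule reduces to truncating the prior, giving $X \mid (X > 90) \sim \mathrm{Uniform}[90, 100]$ and an estimated selectivity of $P(X > 90) = 0.1$.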

With this general explanation out of the way, let's go back to the specifics of your question. In this light, your question about histograms basically boils down to how we represent unknown distributions. Histograms are one way of doing this. Moments are another. In the initial implementation, we represent unknown distributions using various summary statistics. If this turns out to be insufficient, we can add an attribute to the unknown distribution variant of the enum to store histogram information as well. If we do this, the entire machinery will stay the same -- we will only need to update the encapsulated code that handles how unknown distributions are updated. So it would actually be a small-ish PR to do this 🙂

I hope this helps. Thanks for helping with reviewing 🚀

@ozankabak
Contributor

ozankabak commented Feb 21, 2025

Thank you for the review @xudong963. Here are my thoughts on your questions:

  1. As the summary says, StatisticsV2 will replace the usage of Precision, so the min/max/ndv of ColumnStatistic will be removed, and the ColumnStatistics will look like this: pub struct ColumnStatistics{stat: StatisticsV2}, right? What information do we expect from the user's TableProvider to build the Statistics? (For the old statistics, it's better to know accurate min/max/ndv to do cardinality estimation.)
    My understanding is that the user needs to know their data distribution, e.g. if their data distribution is uniform they need to provide the interval; if skewed, they need to provide the information needed for the Exponential distribution. Or can DataFusion do sampling to decide?

Indeed, column and table statistics will be built on top of StatisticsV2, which simply encapsulates our information about a random variable. Treating a value drawn from a column as a random variable, the maximum/minimum/average etc. of a column also becomes a random variable (with a distinct statistical distribution). Specifically, if we let X_i denote the value of ith row of column X, the maximum value for the column would be M = max(X_1, ..., X_N) with N being the number of rows. Given probabilistic information on the possible values of an arbitrary X_i, we can also make a probabilistic guess on what M can be.
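
(For illustration, under an i.i.d. assumption this is standard probability: $P(M \le m) = \prod_{i=1}^{N} P(X_i \le m) = F_X(m)^N$, so a distribution for each $X_i$ immediately yields a distribution for the column maximum $M$.)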

So, like how @berkaysynnada mentions, we expect to have one StatisticsV2 object for each piece of information like maximum/minimum etc. This will enable us to express certain and uncertain information about things like the maximum of a column in a single unified framework.

Coming back to the information flow from data sources: If the user supplies distributional information, it will be used by the leaf nodes as we evaluate/propagate statistics in expression graphs. Otherwise, we will fall back on the unknown distribution for leaf nodes, whose defining summary statistics can be automatically generated.

In this context, your suggestion about sampling makes a lot of sense. There is no reason why we can't use statistical tests to "recognize" distributions and use recognized distributions instead of directly falling back to unknown distributions in such cases. Actually, thinking about it, doing this would be a fantastic follow-up project once we have the basics in place 🙂

  2. Have we done any work on the accuracy of the new statistical information during cardinality estimation?

Do you mean things like distinct counts? I think we will be able to see how well we estimate such things probabilistically once we finalize this and rework column/table stats with the new framework. In the worst case, all the calculus will work through unknown distributions and we will not be in a worse position than where we were before (sans bugs). In cases where we can avoid loss of statistical information, we will end up with better estimations.

  3. Are there certain papers that describe this statistical information framework in more detail?

I'm not sure. I don't know of any paper that describes exactly the same thing as what we are doing here, but the approach is somewhat similar to how belief propagation in probabilistic graphical models works (though not the same). It may be an interesting idea to write something up once we finalize all the details.

@alamb alamb added the api change Changes the API exposed to users of the crate label Feb 21, 2025
@alamb
Contributor

alamb commented Feb 21, 2025

FYI @clflushopt as I think this may be related to this as well

Contributor

@alamb alamb left a comment

Thank you @Fly-Style @ozankabak and @berkaysynnada -- I think this is a very cool idea and brings some much needed rigor to the handling of statistics.

I have some concerns about the specifics of how the distributions are encoded, but the general idea of encapsulating the details of a distribution behind an API / interface is really, really nice.

/// statistics accordingly. The default implementation simply creates an
/// unknown output distribution by combining input ranges. This logic loses
/// distribution information, but is a safe default.
fn evaluate_statistics(&self, children: &[&StatisticsV2]) -> Result<StatisticsV2> {
Contributor

This is very cool -- I love this as a building block

One suggestion in terms of API design is that &[&StatisticsV2] pretty much requires using Vecs.

I recommend adding some structure like TableStatisticsV2 or RelationStatisticsV2 that encapsulates the notion of a collection. Something like:

struct RelationStatisticsV2 {
...
}

impl RelationStatisticsV2 {
    /// Return statistics for column idx
    fn column(&self, idx: usize) -> &StatisticsV2 { ... }
}

That would make it easier to avoid copying and to change underlying representations.

Contributor

The ith element of the children slice here denotes the statistics of the ith child expression. It follows the same pattern as how evaluate_bounds works.

Once this PR merges and we have the machinery to calculate statistics of scalar values defined by an expression tree, we will indeed move on to things like TableStatistics, ColumnStatistics and others which will be built on top of this machinery. So stay tuned 🚀

Contributor

@alamb alamb Feb 22, 2025

I did some more research into the current code, which has:

  1. Statistics has table level statistics, such as statistics for columns and the row count and distinct count
  2. ColumnStatistics which has column level statistics

In this PR

  1. [&StatisticsV2] is equivalent to Statistics (distribution of multiple columns)
  2. StatisticsV2 is equivalent to ColumnStatistics (distribution of a single column)

In order to have the names be consistent, I recommend:

  1. Renaming StatisticsV2 to ColumnStatisticsV2
  2. Introducing StatisticsV2 that holds a set of column statistics

UPDATE -- I think calling this Distribution might more accurately describe what it is trying to do

Contributor

Or maybe I misunderstand what StatisticsV2 is for -- if it is only meant to represent distributions of values, perhaps we should call it Distribution instead ?

Contributor

if it is only meant to represent distributions of values

This is indeed the case. It will replace Precision in the current code.

The hierarchy we had in mind was

  1. Statistics(V2): Represents statistical information (e.g. distribution, mean, variance) of a single (possibly unknown) value or an estimate. This is the focus of this PR, which provides the baseline mechanism to evaluate this for arbitrary expressions.
  2. ColumnStatistics: It will collect a bunch of Statistics(V2) objects that represent estimations about the population of values in a column; e.g. its maximum value, average etc.
  3. TableStatistics: Similar to 2, but for relations.

Revamp of the current implementations of 2 and 3, based on 1, will be the focus of subsequent PRs.

Contributor

@alamb alamb Feb 22, 2025

That makes sense

Therefore I recommend renaming StatisticsV2 to Distribution

This seems more consistent given that all the variants are already named "XYZDistribution" such as UniformDistribution, ExponentialDistribution, etc.

pub enum Distribution {
    Uniform(UniformDistribution),
    Exponential(ExponentialDistribution),
    Gaussian(GaussianDistribution),
    Bernoulli(BernoulliDistribution),
    Unknown(UnknownDistribution),
}

Contributor

Reasonable - let's do it

} else {
    ScalarValue::try_from(&dt)
}?;
StatisticsV2::new_bernoulli(p)
Contributor

I don't understand why this would assume something about the distribution of the values (as in, why does it assume a boolean variable has a Bernoulli distribution 🤔)

Contributor

Because there is no other choice :) The only applicable distribution in case of a boolean variable is the Bernoulli distribution. Bernoulli distribution is just the stats term for a boolean variable with a parameter for "probability of being true".
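
(For reference: a Bernoulli variable takes the value true with probability $p$ and false with probability $1 - p$; its mean is $p$ and its variance is $p(1 - p)$. A comparison such as $a < b$ therefore yields a Bernoulli output whose $p$ corresponds to the estimated selectivity of the predicate.)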

Contributor

I think a Bernoulli describes the distribution of expected outcomes of a binary random variable, rather than the distribution of values within an (existing) population.

In the context of database systems, I don't think it is common to model the distribution of values in a column as though they were the output of a random variable.

I would expect the output distribution of a boolean expression to be something like:

  1. Uniform (all values equally likely)
  2. Skewed (e.g. 25% values expected to be true, 50% values expected to be false, 25% values expected to be NULL)

Contributor

I think the main confusion here is that we are not at the column level yet -- column level work will come after we have a calculus for single values/estimations

use datafusion_common::rounding::alter_fp_rounding_mode;
use datafusion_common::{internal_err, not_impl_err, Result, ScalarValue};

/// New, enhanced `Statistics` definition, represents five core statistical
Contributor

While these 5 distributions are very cool, I am not sure I have ever run into them being used in a practical database system (as real world data often doesn't neatly follow any existing distribution)

As I understand it, typically statistics estimation is done via:

  1. Assume a uniform distribution (often not particularly accurate, but very simple to implement and reason about)
  2. Use equi-height histograms measured across the data
  3. Some sort of sketch for distinct values and correlation between columns

That being said, I am not a statistics expert

Contributor

So I guess in my mind I see the following challenges:

  1. I am not sure about the practical use of several of these distributions
  2. There doesn't seem to be an easy (aka not having to change DataFusion's code) way to add other methods of statistic calculation

Instead of an enum-type approach, what would you think about a trait-style one? This would allow users to encode arbitrary information about their distributions without changes to the core.

Something like:

/// Describes how data is distributed across
pub trait Distribution {
    /// return the mean of this distribution
    fn mean(&self) -> Result<ScalarValue>;
    /// return the range of this distribution
    fn range(&self) -> Result<Interval>;
    fn data_type(&self) -> DataType;
...
}

/// DataFusion provides some built in distributions 
impl Distribution for UnknownDistribution {
...
}

impl Distribution for UniformDistribution {
...
}
...

Contributor

I think the challenge of the above is to figure out how the API looks to compute the distributions for different physical exprs (as the calculation is going to be different for different types of input distributions 🤔)

Contributor

Yes, that is indeed the real challenge -- evaluation and propagation procedures need to be matched to distribution types. For example, adding two normally distributed (Gaussian) quantities results in a quantity that is also normally distributed. During the design phase, @berkaysynnada and I discussed whether we could use a trait-based approach but couldn't find a way to do this without excessive downcasting.
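
For illustration only, here is a minimal sketch of why the enum approach keeps such pairwise rules simple; the Dist type and its f64 parameters are hypothetical stand-ins, not DataFusion's actual types:

// Simplified sketch: hypothetical f64-based variants, not DataFusion's API.
enum Dist {
    Gaussian { mean: f64, variance: f64 },
    Generic { mean: f64, variance: f64 },
}

impl Dist {
    fn mean(&self) -> f64 {
        match self {
            Dist::Gaussian { mean, .. } | Dist::Generic { mean, .. } => *mean,
        }
    }
    fn variance(&self) -> f64 {
        match self {
            Dist::Gaussian { variance, .. } | Dist::Generic { variance, .. } => *variance,
        }
    }
}

/// Combine two independent quantities under addition.
fn add(lhs: &Dist, rhs: &Dist) -> Dist {
    match (lhs, rhs) {
        // Sum of independent Gaussians stays Gaussian: means and variances add.
        (
            Dist::Gaussian { mean: m1, variance: v1 },
            Dist::Gaussian { mean: m2, variance: v2 },
        ) => Dist::Gaussian { mean: m1 + m2, variance: v1 + v2 },
        // Any other pairing falls back to a generic distribution that only tracks
        // summary statistics (the mean and variance of a sum still add under independence).
        _ => Dist::Generic {
            mean: lhs.mean() + rhs.mean(),
            variance: lhs.variance() + rhs.variance(),
        },
    }
}

A trait-object design would have to recover both concrete types at runtime to apply such a pair-specific rule, which is where the excessive downcasting mentioned above comes from.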

Contributor

What I would love to support is a "bring your own statistics / cost model" approach -- where following DataFusion's other features, it

  1. includes a solid, though basic implementation of statistics (e.g. min/max and maybe distinct value counts)
  2. has APIs that help people implement fancier statistics / models if they wanted

Some fancier models I imagine people would do are:

  1. Histograms (this is likely the first thing people would do)
  2. Multi-column sketches / samples / fancier research stuff

So basically my ideal outcome would be if we knew how StatisticsV2 would allow such a thing to be implemented, and I think that will require a trait in some form.

It could end up being another variant of StatisticsV2, like StatisticsV2::User or something (like we did for LogicalPlan)

Contributor

Hmm, I will think about how we can make a User variant work (if possible). Or maybe this belongs to the column/table level instead of the single value level.

Exponential(ExponentialDistribution),
Gaussian(GaussianDistribution),
Bernoulli(BernoulliDistribution),
Unknown(UnknownDistribution),
Contributor

The Unknown name might be misleading; it looks like it is based on mean, avg, and variance:

UnknownDistribution::try_new(mean, median, variance, range).map(Unknown)

I would think unknown means no distribution, or one that is not usable, but here it is likely usable.

Contributor

Unknown distribution is indeed usable. It just means that we don't have exact information about how the random variable is distributed, and we are resorting to summary statistics to describe a generic distribution about which we have no special knowledge.

Contributor

If the naming trips up more people, maybe we can change it to something like GenericDistribution. I guess we will have enough eyes on this to know before we make a release with this code.

Contributor

I like it: GenericDistribution, or maybe MetricsDistribution or AvgDistribution, to show the user the inner nature of this sort of distribution.

@clflushopt
Contributor

I can review and make the necessary changes to #14735 once it gets merged if that is necessary @alamb.

I skimmed this PR earlier and just made a small pass through; this is some great work and will bring much of the necessary background formalization that we might need for further work on cardinality estimation and join ordering.

Thanks !

@xudong963
Member

Specifically, if we let X_i denote the value of ith row of column X, the maximum value for the column would be M = max(X_1, ..., X_N) with N being the number of rows. Given probabilistic information on the possible values of an arbitrary X_i, we can also make a probabilistic guess on what M can be.

This makes a lot of sense, thanks for your clear explanation. Now I understand how the distribution works and the difference between the current statistics model and the original min/max/ndv.

There is no reason why we can't use statistical tests to "recognize" distributions and use recognized distributions instead of directly falling back to unknown distributions in such cases.

Yes, sampling does have great significance when statistics are lacking. From my experience building a whole optimizer, the annoying problem is that statistics are often inaccurate due to frequent data growth, or missing for unstructured data; in that context, sampling will show its muscle.

In the worst case, all the calculus will work through unknown distributions and we will not be in a worse position than where we were before (sans bugs)

Makes sense; after StatisticsV2, the worst situation is to fall back to the original case.

It may be an interesting idea to write something up once we finalize all the details.

Thanks, looking forward!

@github-actions github-actions bot added sql SQL Planner core Core DataFusion crate execution Related to the execution crate labels Feb 23, 2025
@github-actions github-actions bot removed sql SQL Planner core Core DataFusion crate execution Related to the execution crate labels Feb 23, 2025
@ozankabak
Contributor

Thanks for all the comments and questions. I've incorporated the naming suggestion by @alamb (and updated many comments and variable names accordingly). I also switched to GenericDistribution instead of UnknownDistribution per our discussion with @comphead, just to make sure we avoid any confusion.

I think this is good to go -- we can proceed with the follow-up tasks, which are: (1) higher-level ColumnStatistics/TableStatistics revamps, (2) adding sampling support, (3) incrementally supporting more distributions and their interactions, (4) integrations with the optimizer code.

I will wait for a day or so for more feedback in case there is any that we missed.

@xudong963
Member

I think this is good to go

+1, I understand this PR is the base from which to go forward.

Contributor

@alamb alamb left a comment

Thank you -- I think this is a very nice place to start from.

@ozankabak ozankabak merged commit 0fbd20c into apache:main Feb 24, 2025
24 checks passed
@ozankabak
Contributor

Thank you all for the reviews and the discussions! Let's keep the momentum going and build a versatile statistics framework for DataFusion 🚀

ozankabak added a commit to synnada-ai/datafusion-upstream that referenced this pull request Feb 25, 2025
* StatisticsV2: initial definition and validation method implementation

* Implement mean, median and standard deviation extraction for StatsV2

* Move stats_v2 to `physical-expr` package

* Introduce `ExprStatisticGraph` and `ExprStatisticGraphNode`

* Split the StatisticsV2 and statistics graph locations, prepare the infrastructure for stats top-down propagation and final bottom-up calculation

* Calculate variance instead of std_dev

* Create a skeleton for statistics bottom-up evaluation

* Introduce high-level test for 'evaluate_statistics()'

* Refactor result distribution computation during the statistics evaluation phase; add compute_range function

* Always produce Unknown distribution in non-mentioned combination cases, todos for the future

* Introduce Bernoulli distribution to be used as the result of comparison and inequality distribution combinations

* Implement initial statistics propagation of Uniform and Unknown distributions with known ranges

* Implement evaluate_statistics for logical not and unary negation operator

* Fix and add tests; make fmt happy

* Add integration test, implement conversion into Bernoulli distribution for Eq and NotEq

* Finish test, small cleanup

* minor improvements

* Update stats.rs

* Addressing review comments

* Implement median computation for Gaussian-Gaussian pair

* Update stats_v2.rs

* minor improvements

* Addressing second review comments, part 1

* Return true in other cases

* Finish addressing review requests, part 2

* final clean-up

* bug fix

* final clean-up

* apply reverse logic in stats framework as well

* Update cp_solver.rs

* revert data.parquet

* Apply suggestions from code review

* Update datafusion/physical-expr-common/src/stats_v2.rs

* Update datafusion/physical-expr-common/src/stats_v2.rs

* Apply suggestions from code review

Fix links

* Fix compilation issue

* Fix mean/median formula for exponential distribution

* casting + exp dir + remove opt's + is_valid refactor

* Update stats_v2_graph.rs

* remove inner mod

* last todo: bernoulli propagation

* Apply suggestions from code review

* Apply suggestions from code review

* prop_stats in binary

* Update binary.rs

* rename intervals

* block explicit construction

* test updates

* Update binary.rs

* revert renaming

* impl range methods as well

* Apply suggestions from code review

* Apply suggestions from code review

* Update datafusion/physical-expr-common/src/stats_v2.rs

* Update stats_v2.rs

* fmt

* fix bernoulli or eval

* fmt

* Review

* Review Part 2

* not propagate

* clean-up

* Review Part 3

* Review Part 4

* Review Part 5

* Review Part 6

* Review Part 7

* Review Part 8

* Review Part 9

* Review Part 10

* Review Part 11

* Review Part 12

* Review Part 13

* Review Part 14

* Review Part 15 | Fix equality comparisons between uniform distributions

* Review Part 16 | Remove unnecessary temporary file

* Review Part 17 | Leave TODOs for real-valued summary statistics

* Review Part 18

* Review Part 19 | Fix variance calculations

* Review Part 20 | Fix range calculations

* Review Part 21

* Review Part 22

* Review Part 23

* Review Part 24 | Add default implementations for evaluate_statistics and propagate_statistics

* Review Part 25 | Improve docs, refactor statistics graph code

* Review Part 26

* Review Part 27

* Review Part 28 | Remove get_zero/get_one, simplify propagation in statistics graph

* Review Part 29

* Review Part 30 | Move statistics-combining functions to core module, polish tests

* Review Part 31

* Review Part 32 | Module reorganization

* Review Part 33

* Add tests for bernoulli and gaussians combination

* Incorporate community feedback

* Fix merge issue

---------

Co-authored-by: Sasha Syrotenko <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>