StatisticsV2: initial statistics framework redesign #14699
Conversation
nope, it will seem like
cardinality is a term related to intervals, and we already have a function for cardinality calculations as a method of `Interval`
This framework provides Uniform, Exponential, Gaussian, Bernoulli, and Unknown distributions. The first four variants represent well-known probability distributions, while the Unknown variant serves as a fallback option where the exact distribution type is unspecified. However, key statistical parameters such as mean, median, variance, and range can still be provided there (as these parameters are already meaningful for optimization and decision-making processes). If you require specific details about these distribution types or their parameters, you can refer to the links provided in the docstrings. Additionally, if you're interested in further exploring their interactions -- PDF computations -- I can suggest Wolfram Mathematica. BTW, you can also define your own or other known distribution types easily: just define its parameters and implement the computations with the other types.
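For a rough feel of the construction API (the `new_bernoulli` constructor appears elsewhere in this PR's diff; the other names and exact signatures shown here are assumptions for illustration):

```rust
use datafusion_common::ScalarValue;
use datafusion_expr::interval_arithmetic::Interval;

// A boolean-valued estimate: Bernoulli with p = probability of `true`
// (constructor appears in this diff; exact signature assumed).
let filter_stats = StatisticsV2::new_bernoulli(ScalarValue::Float64(Some(0.25)))?;

// A fallback estimate where only summary statistics are known
// (constructor shape taken from elsewhere in this diff).
let mean = ScalarValue::Float64(Some(10.0));
let median = ScalarValue::Float64(Some(9.5));
let variance = ScalarValue::Float64(Some(4.0));
let range = Interval::make(Some(0.0_f64), Some(20.0_f64))?;
let fallback = UnknownDistribution::try_new(mean, median, variance, range)
    .map(StatisticsV2::Unknown)?;
```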
I guess what I am saying (though I am not really sure about it) is that maybe Postgres (and Oracle, https://docs.oracle.com/en/database/oracle/oracle-database/19/tgsql/histograms.html) use histograms because most data doesn't follow a "known probability distribution", but I am not sure, honestly. It's just "stuff that I was working on recently".
@edmondop, maybe I can offer some clarification here. What we want is a computational framework that gives us how statistical quantities transform under functions defined by expressions. Once we have the machinery that does this, we can build all sorts of layers on top of it for answering column-level and table-level statistical questions. So how do we go about doing this? There are four cases in "forward" mode:
Cases 1, 2 and 3 are quite common. Case 4 happens rarely, with special types of expressions. There is also the "reverse" mode, where we have information about the statistics of the result (e.g. when we have a filter that forces a composite expression to be true), which enables us to update our information about the distributions of constituent expressions by recursively applying the Bayes rule.

With this general explanation out of the way, let's go back to the specifics of your question. In this light, your question about histograms basically boils down to how we represent unknown distributions. Histograms are one way of doing this. Moments are another. In the initial implementation, we represent unknown distributions using various summary statistics. If this turns out to be insufficient, we can add an attribute to the unknown distribution variant of the enum to store histogram information as well. If we do this, the entire machinery will stay the same -- we will only need to update the encapsulated code that handles how unknown distributions are updated. So it would actually be a small-ish PR to do this 🙂

I hope this helps. Thanks for helping with reviewing 🚀
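As a toy instance of the "reverse" mode described above (numbers invented purely for illustration): suppose a filter forces the predicate $x > 5$ to be true, and we previously modeled $x \sim \mathrm{Uniform}(0, 10)$. Conditioning via the Bayes rule gives

$$
f_{X \mid X > 5}(x) = \frac{f_X(x)\,\mathbf{1}[x > 5]}{P(X > 5)} = \frac{(1/10)\,\mathbf{1}[5 < x \le 10]}{1/2} = \frac{1}{5}, \qquad 5 < x \le 10,
$$

i.e. the propagated distribution is $\mathrm{Uniform}(5, 10)$, and the same kind of update recurses into the children of the filtered expression.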
Thank you for the review @xudong963. Here are my thoughts on your questions:
Indeed, column and table statistics will be built on top of this framework. So, as @berkaysynnada mentions, we expect to have one such object per estimated quantity.

Coming back to the information flow from data sources: if the user supplies distributional information, it will be used by the leaf nodes as we evaluate/propagate statistics in expression graphs. Otherwise, we will fall back on the unknown distribution for leaf nodes, whose defining summary statistics can be automatically generated.

In this context, your suggestion about sampling makes a lot of sense. There is no reason why we can't use statistical tests to "recognize" distributions and use recognized distributions instead of directly falling back to unknown distributions in such cases. Actually, thinking about it, doing this would be a fantastic follow-up project once we have the basics in place 🙂
Do you mean things like distinct counts? I think we will be able to see how well we estimate such things probabilistically once we finalize this and rework column/table stats with the new framework. In the worst case, all the calculus will work through unknown distributions and we will not be in a worse position than where we were before (sans bugs). In cases where we can avoid loss of statistical information, we will end up with better estimations.
I'm not sure. I don't know of any that describes exactly the same thing as what we are doing here, but the approach is somewhat similar to how belief propagation in probabilistic graphical models works (though not the same). It may be an interesting idea to write something up once we finalize all the details.
FYI @clflushopt, as I think this may be related to this as well.
Thank you @Fly-Style @ozankabak and @berkaysynnada -- I think this is a very cool idea and brings some much needed rigor to the handling of statistics.
I have some concerns about the specifics of how the distributions are encoded, but the general idea of encapsulating the details of a distribution behind an API / interface is really, really nice.
```rust
/// statistics accordingly. The default implementation simply creates an
/// unknown output distribution by combining input ranges. This logic loses
/// distribution information, but is a safe default.
fn evaluate_statistics(&self, children: &[&StatisticsV2]) -> Result<StatisticsV2> {
```
This is very cool -- I love this as a building block
One suggestion in terms of API design is that `&[&StatisticsV2]` pretty much requires using `Vec`s. I recommend adding some structure like `TableStatisticsV2` or `RelationStatisticsV2` that encapsulates the notion of a collection. Something like:

```rust
struct RelationStatisticsV2 {
    ...
}

impl RelationStatisticsV2 {
    /// Return statistics for column `idx`
    fn column(&self, idx: usize) -> &StatisticsV2 { ... }
}
```
That would make it easier to avoid copying / change underlying representations
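For illustration, a caller could then write something like (hypothetical, following the sketch above):

```rust
// Borrow the statistics of column 0 without materializing a Vec of references.
let col0_stats: &StatisticsV2 = relation_stats.column(0);
```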
The `i`th element of the `children` slice here denotes the statistics of the `i`th child expression. It follows the same pattern as how `evaluate_bounds` works.

Once this PR merges and we have the machinery to calculate statistics of scalar values defined by an expression tree, we will indeed move on to things like `TableStatistics`, `ColumnStatistics` and others, which will be built on top of this machinery. So stay tuned 🚀
I did some more research into the current code, which has:

- `Statistics`, which has table-level statistics, such as statistics for columns and the row count and distinct count
- `ColumnStatistics`, which has column-level statistics

In this PR:

- `[&StatisticsV2]` is equivalent to `Statistics` (distribution of multiple columns)
- `StatisticsV2` is equivalent to `ColumnStatistics` (distribution of a single column)

In order to have the names be consistent, I recommend:

- Renaming `StatisticsV2` to `ColumnStatisticsV2`
- Introducing a `StatisticsV2` that holds a set of column statistics

UPDATE -- I think calling this `Distribution` might more accurately describe what it is trying to do.
Or maybe I misunderstand what `StatisticsV2` is for -- if it is only meant to represent distributions of values, perhaps we should call it `Distribution` instead?
> if it is only meant to represent distributions of values

This is indeed the case. It will replace `Precision` in the current code.

The hierarchy we had in mind was:

1. `Statistics(V2)`: Represents statistical information (e.g. distribution, mean, variance) of a single (possibly unknown) value or an estimate. This is the focus of this PR, which provides the baseline mechanism to evaluate this for arbitrary expressions.
2. `ColumnStatistics`: It will collect a bunch of `Statistics(V2)` objects that represent estimations about the population of values in a column; e.g. its maximum value, average etc.
3. `TableStatistics`: Similar to 2, but for relations.
Revamp of the current implementations of 2 and 3, based on 1, will be the focus of subsequent PRs.
That makes sense. Therefore I recommend renaming `StatisticsV2` to `Distribution`. This seems more consistent given that all the variants are already named "XYZDistribution", such as `UniformDistribution`, `ExponentialDistribution`, etc.

```rust
pub enum Distribution {
    Uniform(UniformDistribution),
    Exponential(ExponentialDistribution),
    Gaussian(GaussianDistribution),
    Bernoulli(BernoulliDistribution),
    Unknown(UnknownDistribution),
}
```
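One nice property of this shape is that distribution-specific math stays local to each variant. As a sketch of how a summary statistic could dispatch over the enum (method names here are illustrative, not necessarily this PR's exact API):

```rust
impl Distribution {
    pub fn mean(&self) -> Result<ScalarValue> {
        match self {
            // Midpoint of the range for a uniform distribution.
            Distribution::Uniform(u) => u.mean(),
            // offset + 1/rate for an exponential distribution.
            Distribution::Exponential(e) => e.mean(),
            // Stored directly as a distribution parameter.
            Distribution::Gaussian(g) => g.mean(),
            // The mean of Bernoulli(p) is p itself.
            Distribution::Bernoulli(b) => b.mean(),
            // Falls back on the stored summary statistic, if any.
            Distribution::Unknown(u) => u.mean(),
        }
    }
}
```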
Reasonable - let's do it
```rust
} else {
    ScalarValue::try_from(&dt)
}?;
StatisticsV2::new_bernoulli(p)
```
I don't understand why this would assume something about the distribution of the values (as in, why does it assume a boolean variable has a Bernoulli distribution 🤔)
Because there is no other choice :) The only applicable distribution for a boolean variable is the Bernoulli distribution. The Bernoulli distribution is just the stats term for a boolean variable with a parameter for the "probability of being true".
I think a Bernoulli describes the distribution of expected outcomes of a binary random variable, rather than the distribution of values within an (existing) population.

In the context of database systems, I don't think it is common to model the distribution of values in a column as though they were the output of a random variable.

I would expect the output distribution of a boolean expression to be something like:

- Uniform (all values equally likely)
- Skewed (e.g. 25% of values expected to be true, 50% expected to be false, 25% expected to be NULL)
I think the main confusion here is that we are not at the column level yet -- column level work will come after we have a calculus for single values/estimations
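To make the single-value view concrete with a toy calculation (a simplified `f64` sketch, not the PR's code): comparing a uniformly distributed value against a constant yields a Bernoulli whose `p` is the fraction of the range satisfying the predicate.

```rust
/// Toy estimate of P(x < c) for x ~ Uniform(lo, hi), clamped to [0, 1].
fn lt_selectivity(lo: f64, hi: f64, c: f64) -> f64 {
    ((c - lo) / (hi - lo)).clamp(0.0, 1.0)
}

fn main() {
    // x ~ Uniform(0, 10): the predicate `x < 4` holds with probability 0.4,
    // so the comparison's statistics are Bernoulli(p = 0.4).
    assert_eq!(lt_selectivity(0.0, 10.0, 4.0), 0.4);
}
```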
```rust
use datafusion_common::rounding::alter_fp_rounding_mode;
use datafusion_common::{internal_err, not_impl_err, Result, ScalarValue};

/// New, enhanced `Statistics` definition, represents five core statistical
```
While these 5 distributions are very cool, I am not sure I have ever run into them being used in a practical database system (as real world data often doesn't neatly follow any existing distribution)
As I understand it, typically statistics estimation is done via:
- Assume a uniform distribution (often not particularly accurate, but very simple to implement and reason about)
- Use equi-height histograms measured across the data
- Some sort of sketch for distinct values and correlation between columns
That being said, I am not a statistics expert
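For reference, the equi-height idea mentioned above in miniature (a generic sketch, not DataFusion code): bucket boundaries are chosen so each bucket covers roughly the same number of rows, which adapts to skew automatically.

```rust
/// Pick equi-height bucket boundaries from a sorted sample.
fn equi_height_bounds(sorted: &[f64], buckets: usize) -> Vec<f64> {
    (0..=buckets)
        .map(|i| sorted[i * (sorted.len() - 1) / buckets])
        .collect()
}

fn main() {
    // Skewed data: boundaries crowd where values are dense, so the
    // outlier 100.0 does not waste most of the buckets.
    let data = [1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 100.0];
    assert_eq!(equi_height_bounds(&data, 3), vec![1.0, 1.0, 2.0, 100.0]);
}
```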
So I guess in my mind I see the following challenges:
- I am not sure about the practical use of several of these distributions
- There doesn't seem to be an easy (aka not having to change DataFusion's code) way to add other methods of statistic calculation
Instead of an enum-type approach, what would you think about a trait-style one? This would allow users to encode arbitrary information about their distributions without changes to the core.
Something like:
```rust
/// Describes how data is distributed
pub trait Distribution {
    /// Return the mean of this distribution
    fn mean(&self) -> Result<ScalarValue>;
    /// Return the range of this distribution
    fn range(&self) -> Result<Interval>;
    fn data_type(&self) -> DataType;
    ...
}

/// DataFusion provides some built-in distributions
impl Distribution for UnknownDistribution {
    ...
}

impl Distribution for UniformDistribution {
    ...
}
...
```
I think the challenge of the above is to figure out how the API looks to compute the distributions for different physical exprs (as the calculation is going to be different for different types of input distributions 🤔)
Yes, that is indeed the real challenge -- evaluation and propagation procedures need to match distribution types. For example, adding two normally distributed (Gaussian) quantities also results in a normally distributed quantity. During the design phase, @berkaysynnada and I discussed whether we could use a trait-based approach, but couldn't find a way to do this without excessive downcasting.
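For instance, a minimal sketch of the Gaussian-addition rule mentioned above (simplified `f64` fields; the PR's types use `ScalarValue`):

```rust
/// Simplified stand-in for the PR's GaussianDistribution.
struct Gaussian {
    mean: f64,
    variance: f64,
}

/// For *independent* X ~ N(m1, v1) and Y ~ N(m2, v2),
/// X + Y ~ N(m1 + m2, v1 + v2).
fn add(x: &Gaussian, y: &Gaussian) -> Gaussian {
    Gaussian {
        mean: x.mean + y.mean,
        variance: x.variance + y.variance,
    }
}
```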
What I would love to support is a "bring your own statistics / cost model" approach -- where, following DataFusion's other features, it:

- includes a solid, though basic, implementation of statistics (e.g. min/max and maybe distinct value counts)
- has APIs that help people implement fancier statistics / models if they wanted

Some fancier models I imagine people would build are:

- Histograms (this is likely the first thing people would do)
- Multi-column sketches / samples / fancier research stuff

So basically my ideal outcome would be if we knew how `StatisticsV2` would allow such a thing to be implemented, and I think that will require a trait in some form. It could end up being another variant of `StatisticsV2`, like `StatisticsV2::User` or something (like we did for `LogicalPlan`).
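A rough sketch of what such a variant could look like (entirely hypothetical, mirroring the `LogicalPlan::Extension` pattern):

```rust
use std::sync::Arc;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::interval_arithmetic::Interval;

/// Hypothetical extension hook for user-defined distributions.
pub trait UserDistribution: std::fmt::Debug + Send + Sync {
    fn mean(&self) -> Result<ScalarValue>;
    fn range(&self) -> Result<Interval>;
}

pub enum StatisticsV2 {
    // ... built-in variants elided ...
    /// Escape hatch carrying a user-provided implementation.
    User(Arc<dyn UserDistribution>),
}
```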
Hmm, I will think about how we can make a `User` variant work (if possible). Or maybe this belongs at the column/table level instead of the single-value level.
```rust
Exponential(ExponentialDistribution),
Gaussian(GaussianDistribution),
Bernoulli(BernoulliDistribution),
Unknown(UnknownDistribution),
```
The `Unknown` name might be misleading; it looks like it is based on mean, median, and variance:

```rust
UnknownDistribution::try_new(mean, median, variance, range).map(Unknown)
```

I would think "unknown" means no distribution, or an unusable one, but here it is likely usable.
Unknown distribution is indeed usable. It just means that we don't have exact information about how the random variable is distributed, and we are resorting to summary statistics to describe a generic distribution about which we have no special knowledge.
If the naming trips up more people, maybe we can change it to something like `GenericDistribution`. I guess we will have enough eyes on this to know before we make a release with this code.
I like it: `GenericDistribution`, or maybe `MetricsDistribution` or `AvgDistribution`, to show the user the inner nature of this sort of distribution.
I can review and make the necessary changes to #14735 once this gets merged, if that is necessary @alamb. I skimmed this PR earlier and just made a small pass through; this is some great work and will bring much of the necessary background formalization that we might need for further work on cardinality estimation and join ordering. Thanks!
This makes a lot of sense, thanks for your clear explanation. Now I understand how the distribution works and the difference between the current statistics model and the original min/max/nv.

Yes,

Makes sense; after StatisticsV2, the worst case is falling back to the original behavior.

Thanks, looking forward!
Thanks for all the comments and questions. I've incorporated the naming suggestion by @alamb (and updated many comments and variable names accordingly), switching to `Distribution`.

I think this is good to go -- we can proceed with the follow-up tasks, starting with the higher-level column/table statistics built on this framework. I will wait for a day or so for more feedback in case there is any that we missed.
+1, I understand this PR is the base to go forward from.
Thank you -- I think this is a very nice place to start from.
Thank you all for the reviews and the discussions! Let's keep the momentum going and build a versatile statistics framework for DataFusion 🚀
* StatisticsV2: initial definition and validation method implementation
* Implement mean, median and standard deviation extraction for StatsV2
* Move stats_v2 to `physical-expr` package
* Introduce `ExprStatisticGraph` and `ExprStatisticGraphNode`
* Split the StatisticsV2 and statistics graph locations, prepare the infrastructure for stats top-down propagation and final bottom-up calculation
* Calculate variance instead of std_dev
* Create a skeleton for statistics bottom-up evaluation
* Introduce high-level test for `evaluate_statistics()`
* Refactor result distribution computation during the statistics evaluation phase; add compute_range function
* Always produce Unknown distribution in non-mentioned combination cases, todos for the future
* Introduce Bernoulli distribution to be used as result of comparisons and inequations distribution combinations
* Implement initial statistics propagation of Uniform and Unknown distributions with known ranges
* Implement evaluate_statistics for logical not and unary negation operator
* Fix and add tests; make fmt happy
* Add integration test, implement conversion into Bernoulli distribution for Eq and NotEq
* Finish test, small cleanup
* minor improvements
* Update stats.rs
* Addressing review comments
* Implement median computation for Gaussian-Gaussian pair
* Update stats_v2.rs
* minor improvements
* Addressing second review comments, part 1
* Return true in other cases
* Finish addressing review requests, part 2
* final clean-up
* bug fix
* final clean-up
* apply reverse logic in stats framework as well
* Update cp_solver.rs
* revert data.parquet
* Apply suggestions from code review
* Update datafusion/physical-expr-common/src/stats_v2.rs
* Update datafusion/physical-expr-common/src/stats_v2.rs
* Apply suggestions from code review | Fix links
* Fix compilation issue
* Fix mean/median formula for exponential distribution
* casting + exp dir + remove opt's + is_valid refactor
* Update stats_v2_graph.rs
* remove inner mod
* last todo: bernoulli propagation
* Apply suggestions from code review
* Apply suggestions from code review
* prop_stats in binary
* Update binary.rs
* rename intervals
* block explicit construction
* test updates
* Update binary.rs
* revert renaming
* impl range methods as well
* Apply suggestions from code review
* Apply suggestions from code review
* Update datafusion/physical-expr-common/src/stats_v2.rs
* Update stats_v2.rs
* fmt
* fix bernoulli or eval
* fmt
* Review
* Review Part 2
* not propagate
* clean-up
* Review Part 3
* Review Part 4
* Review Part 5
* Review Part 6
* Review Part 7
* Review Part 8
* Review Part 9
* Review Part 10
* Review Part 11
* Review Part 12
* Review Part 13
* Review Part 14
* Review Part 15 | Fix equality comparisons between uniform distributions
* Review Part 16 | Remove unnecessary temporary file
* Review Part 17 | Leave TODOs for real-valued summary statistics
* Review Part 18
* Review Part 19 | Fix variance calculations
* Review Part 20 | Fix range calculations
* Review Part 21
* Review Part 22
* Review Part 23
* Review Part 24 | Add default implementations for evaluate_statistics and propagate_statistics
* Review Part 25 | Improve docs, refactor statistics graph code
* Review Part 26
* Review Part 27
* Review Part 28 | Remove get_zero/get_one, simplify propagation in statistics graph
* Review Part 29
* Review Part 30 | Move statistics-combining functions to core module, polish tests
* Review Part 31
* Review Part 32 | Module reorganization
* Review Part 33
* Add tests for bernoulli and gaussians combination
* Incorporate community feedback
* Fix merge issue

---------

Co-authored-by: Sasha Syrotenko <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Rationale for this change
The Statistics framework in DataFusion is a foundational component for query planning and execution. It provides metadata about datasets, enabling optimization decisions and influencing runtime behavior. This patch comprehensively redesigns the statistics representation by transitioning to an enum-based structure that supports multiple distribution types, offering greater flexibility and expressiveness.
Worth mentioning: this first PR only introduces the statistics framework; it does not integrate it into the existing infrastructure. There are also TODOs present, which will be addressed in smaller follow-up PRs so as not to overload the scope of this one.
What changes are included in this PR?
This patch presents a Statistics v2 framework with the following main points:

- Gaussian distributions are described by `mean` and `variance`, and Exponential distributions by `rate` and `offset`.
- Unknown distributions carry `mean`, `median`, `variance`, and `range` properties.
- Statistics evaluation is implemented for the `negate` and logical `not` operators. In general, forward evaluation (i.e. combining expressions) falls back on `interval_arithmetic` methods where no distribution-specific rule applies (see the sketch after this list).
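As a rough illustration of that interval fallback (using DataFusion's existing `Interval` type; the statistics-level wrapping is per this PR):

```rust
use datafusion_expr::interval_arithmetic::Interval;

fn main() -> datafusion_common::Result<()> {
    // With no distribution information for `a` and `b`, the statistics of
    // `a + b` degrade to an Unknown distribution whose range comes from
    // interval arithmetic: [0, 10] + [5, 7] = [5, 17].
    let a = Interval::make(Some(0_i64), Some(10_i64))?;
    let b = Interval::make(Some(5_i64), Some(7_i64))?;
    assert_eq!(a.add(&b)?, Interval::make(Some(5_i64), Some(17_i64))?);
    Ok(())
}
```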
Plan for the future changes:

- Deprecate the `Precision` enum and replace its usages with the new `Statistics`.
- `Uniform`: for datasets with known bounds and even distributions.
- `Exponential`: for datasets with high skew.
- `Gaussian`: for datasets with sufficient samples approximating a normal distribution.
- `Unknown`: fallback option when no specific distribution can be inferred.

Are these changes tested?
Yes, these changes are tested mostly with unit tests, and also with one integration test.
P.S. Although I am the one opening the PR, there was a huge effort from @berkaysynnada and @ozankabak to shape this change. I want to express huge gratitude to them.