Commit 61c01d2

feat(docs): supplement the guide with an overview of boundary analysis
This change adds a short section in the Query Optimizer page of the library guide that gives a brief overview of boundary analysis and cardinality estimation and their role during query optimization.
1 parent bd64441 commit 61c01d2

File tree

1 file changed: +107 -0 lines changed

docs/source/library-user-guide/query-optimizer.md

@@ -388,3 +388,110 @@ In the following example, the `type_coercion` and `simplify_expressions` passes
```

[df]: https://crates.io/crates/datafusion

## Thinking about Query Optimization

Query optimization in DataFusion uses a cost-based model. The cost-based model
relies on table- and column-level statistics to estimate selectivity; selectivity
estimates are a key input to the cost analysis of filters and projections, as they
allow the optimizer to estimate the cost of operators such as filters and joins.

An important piece of building these estimates is *boundary analysis*, which uses
interval arithmetic to take an expression such as `a > 2500 AND a <= 5000` and
build an accurate selectivity estimate that can then be used to find more efficient
plans.

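As a concrete illustration of the idea, here is a minimal sketch, not DataFusion's implementation: the `Interval` and `selectivity` names below are invented for this example. If column values are assumed to be uniformly distributed over the column's known range, the overlap between that range and the predicate's range yields a selectivity estimate:

```rust
/// A closed interval [lo, hi] of possible values (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lo: f64,
    hi: f64,
}

impl Interval {
    /// Intersect two intervals; `None` means the predicate can never match.
    fn intersect(self, other: Interval) -> Option<Interval> {
        let lo = self.lo.max(other.lo);
        let hi = self.hi.min(other.hi);
        (lo <= hi).then_some(Interval { lo, hi })
    }

    fn width(self) -> f64 {
        self.hi - self.lo
    }
}

/// Fraction of rows expected to satisfy a predicate that restricts a column
/// to `predicate`, given the column's statistics interval `column`.
fn selectivity(column: Interval, predicate: Interval) -> f64 {
    match column.intersect(predicate) {
        Some(overlap) if column.width() > 0.0 => overlap.width() / column.width(),
        Some(_) => 1.0, // single-valued column that lies inside the predicate
        None => 0.0,    // disjoint ranges: no row can match
    }
}

fn main() {
    // Column statistics say `a` ranges over [0, 10000]; the predicate
    // `a > 2500 AND a <= 5000` restricts it to roughly [2500, 5000].
    let estimate = selectivity(
        Interval { lo: 0.0, hi: 10000.0 },
        Interval { lo: 2500.0, hi: 5000.0 },
    );
    println!("estimated selectivity: {estimate}"); // 0.25
}
```

Note the uniformity assumption: real data is rarely uniform, which is one reason the estimates carry precision information (see below).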
### `AnalysisContext` API

The `AnalysisContext` serves as a shared knowledge base during expression evaluation
and boundary analysis. Think of it as a dynamic repository that maintains:

1. The currently known boundaries for columns and expressions
2. Statistics that have been gathered or inferred
3. Mutable state that can be updated as the analysis progresses

What makes `AnalysisContext` particularly powerful is its ability to propagate information
through the expression tree. As each node in the tree is analyzed, it can both read from
and write to this shared context, allowing for sophisticated boundary analysis and inference.

### `ColumnStatistics` for Cardinality Estimation

Column statistics form the foundation of optimization decisions. Rather than just tracking
simple metrics, DataFusion's `ColumnStatistics` provides a rich set of information, including:

* Null value counts
* Maximum and minimum values
* Value sums (for numeric columns)
* Distinct value counts

Each of these statistics is wrapped in a `Precision` type that indicates whether the value is
exact or estimated, allowing the optimizer to make informed decisions about the reliability
of its cardinality estimates.

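The idea behind such a wrapper can be sketched with a simplified stand-in (hypothetical code, not DataFusion's actual `Precision` type): combining an exact statistic with an estimated one must demote the result to an estimate, so exactness never overstates itself.

```rust
/// Simplified model of a precision-tracking wrapper (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    /// The value is known exactly (e.g. read from file metadata).
    Exact(T),
    /// The value is an estimate and may be off.
    Inexact(T),
    /// No information is available.
    Absent,
}

impl<T> Precision<T> {
    /// Combine two statistics by addition: the result is exact only if
    /// both inputs are exact; any missing input makes the result absent.
    fn add(self, other: Precision<T>) -> Precision<T>
    where
        T: std::ops::Add<Output = T>,
    {
        use Precision::*;
        match (self, other) {
            (Exact(a), Exact(b)) => Exact(a + b),
            (Exact(a), Inexact(b)) | (Inexact(a), Exact(b)) | (Inexact(a), Inexact(b)) => {
                Inexact(a + b)
            }
            _ => Absent,
        }
    }
}

fn main() {
    // Null counts from two partitions: one exact, one estimated.
    let total = Precision::Exact(10u64).add(Precision::Inexact(5));
    println!("{total:?}"); // Inexact(15)
}
```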
### Boundary Analysis Flow

The boundary analysis process flows through several stages, with each stage building
upon the information gathered in previous stages. The `AnalysisContext` is continuously
updated as the analysis progresses through the expression tree.

#### Expression Boundary Analysis

When analyzing expressions, DataFusion runs boundary analysis using interval arithmetic.
Consider a simple predicate like `age > 18 AND age <= 25`. The analysis flows as follows:

1. Context Initialization
   * Begin with known column statistics
   * Set up initial boundaries based on column constraints
   * Initialize the shared analysis context

2. Expression Tree Walk
   * Analyze each node in the expression tree
   * Propagate boundary information upward
   * Allow child nodes to influence parent boundaries

3. Boundary Updates
   * Each expression can update the shared context
   * Changes flow through the entire expression tree
   * Final boundaries inform optimization decisions

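The three stages above can be sketched with simplified, hypothetical types (this is not DataFusion's real API; `Expr`, `Bounds`, and the context struct here are invented for illustration). Each node of the predicate reads the shared context and writes tightened bounds back into it:

```rust
/// Known bounds [lo, hi] for a single integer column (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    lo: i64,
    hi: i64,
}

/// Predicates over a single column, kept minimal for illustration.
enum Expr {
    Gt(i64),   // col > n
    LtEq(i64), // col <= n
    And(Box<Expr>, Box<Expr>),
}

/// Shared analysis context: the currently known bounds for the column.
struct AnalysisContext {
    bounds: Bounds,
}

/// Walk the tree; each node reads the context and writes tightened bounds back.
fn analyze(expr: &Expr, ctx: &mut AnalysisContext) {
    match expr {
        Expr::Gt(n) => ctx.bounds.lo = ctx.bounds.lo.max(n + 1),
        Expr::LtEq(n) => ctx.bounds.hi = ctx.bounds.hi.min(*n),
        Expr::And(l, r) => {
            // Each child sees the bounds its sibling already established.
            analyze(l, ctx);
            analyze(r, ctx);
        }
    }
}

fn main() {
    // 1. Context initialization from column statistics: age in [14, 79].
    let mut ctx = AnalysisContext { bounds: Bounds { lo: 14, hi: 79 } };
    // 2-3. Walk `age > 18 AND age <= 25`, updating the shared context.
    let expr = Expr::And(Box::new(Expr::Gt(18)), Box::new(Expr::LtEq(25)));
    analyze(&expr, &mut ctx);
    println!("final bounds: {:?}", ctx.bounds); // Bounds { lo: 19, hi: 25 }
}
```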
### Working with the Analysis API

The following example shows how you can run an analysis pass on a physical expression
to infer the expression's selectivity and the space of possible values it can take.

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::stats::Precision;
use datafusion::common::{ColumnStatistics, DFSchema, ScalarValue};
use datafusion::error::Result;
use datafusion::physical_expr::analysis::{analyze, AnalysisContext, ExprBoundaries};
use datafusion::prelude::*;

fn analyze_filter_example() -> Result<()> {
    // Create a schema with an 'age' column
    let schema = Arc::new(Schema::new(vec![Field::new(
        "age",
        DataType::Int64,
        false,
    )]));

    // Define column statistics
    let column_stats = ColumnStatistics {
        null_count: Precision::Exact(0),
        max_value: Precision::Exact(ScalarValue::Int64(Some(79))),
        min_value: Precision::Exact(ScalarValue::Int64(Some(14))),
        distinct_count: Precision::Absent,
        sum_value: Precision::Absent,
    };

    // Create expression: age > 18 AND age <= 25
    let expr = col("age")
        .gt(lit(18i64))
        .and(col("age").lt_eq(lit(25i64)));

    // Initialize analysis context
    let initial_boundaries =
        vec![ExprBoundaries::try_from_column(&schema, &column_stats, 0)?];
    let context = AnalysisContext::new(initial_boundaries);

    // Analyze expression
    let df_schema = DFSchema::try_from(schema)?;
    let physical_expr = SessionContext::new().create_physical_expr(expr, &df_schema)?;
    let analysis = analyze(&physical_expr, context, df_schema.as_ref())?;

    Ok(())
}
```
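As a back-of-the-envelope check (an assumption added for this guide, not output produced by the `analyze` call above): with integer ages uniformly distributed over the statistics range `[14, 79]`, the predicate `age > 18 AND age <= 25` keeps the 7 values `19..=25` out of 66 possible, so one would expect a selectivity of roughly `7/66 ≈ 0.106`. The hypothetical helper below performs that arithmetic:

```rust
// Hypothetical helper, not part of the DataFusion API: discrete uniform
// selectivity of a range predicate [lo, hi] over a column in [min, max].
fn uniform_selectivity(min: i64, max: i64, lo: i64, hi: i64) -> f64 {
    let matching = (hi.min(max) - lo.max(min) + 1).max(0) as f64;
    let total = (max - min + 1) as f64;
    matching / total
}

fn main() {
    // 7 matching values (19..=25) out of 66 possible (14..=79).
    let s = uniform_selectivity(14, 79, 19, 25);
    println!("selectivity ~= {s:.3}"); // 0.106
}
```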
