@@ -388,3 +388,110 @@ In the following example, the `type_coercion` and `simplify_expressions` passes
```

[df]: https://crates.io/crates/datafusion
+
+ ## Thinking about Query Optimization
+
+ Query optimization in DataFusion uses a cost-based model. The model relies on
+ table- and column-level statistics to estimate selectivity, the fraction of rows
+ an expression is expected to keep. Selectivity estimates are an important input
+ to cost analysis because they allow estimating the cost of joins and filters.
+
+ An important part of building these estimates is *boundary analysis*, which uses
+ interval arithmetic to take an expression such as `a > 2500 AND a <= 5000` and
+ derive bounds on the values `a` can take. Those bounds yield a selectivity
+ estimate that can then be used to find more efficient plans.
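
To make the arithmetic concrete, here is a minimal sketch of how bounds can turn into a selectivity estimate. It is not DataFusion's implementation: the function `estimate_selectivity`, the column range `1..=10_000`, and the uniform-distribution assumption are all hypothetical and chosen only for illustration.

```rust
/// Estimate the fraction of integer values in [col_min, col_max] that satisfy
/// `a > lower_exclusive AND a <= upper_inclusive`, assuming a uniform distribution.
/// Hypothetical helper for illustration; not part of DataFusion.
fn estimate_selectivity(
    col_min: i64,
    col_max: i64,
    lower_exclusive: i64,
    upper_inclusive: i64,
) -> f64 {
    // Invalid statistics: nothing can be estimated, treat as empty.
    if col_min > col_max {
        return 0.0;
    }
    // `a > lower` on integers is the same as `a >= lower + 1`.
    // If `lower + 1` overflows, no i64 value can satisfy the predicate.
    let Some(lo) = lower_exclusive.checked_add(1) else {
        return 0.0;
    };
    // Intersect the predicate interval with the column's known range.
    let lo = lo.max(col_min);
    let hi = upper_inclusive.min(col_max);
    if lo > hi {
        return 0.0; // empty intersection
    }
    // Count values in floating point to avoid integer overflow on wide ranges.
    let matching = hi as f64 - lo as f64 + 1.0;
    let total = col_max as f64 - col_min as f64 + 1.0;
    matching / total
}

fn main() {
    // Assumed statistics: `a` takes integer values in 1..=10_000.
    let selectivity = estimate_selectivity(1, 10_000, 2500, 5000);
    println!("estimated selectivity: {selectivity}");
}
```

Under these assumed statistics the predicate keeps the 2,500 values `2501..=5000` out of 10,000, so the estimate is 0.25. Real data is rarely uniform, which is one reason estimates like this are treated as approximations.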
+
+ #### `AnalysisContext` API
+
+ The `AnalysisContext` serves as a shared knowledge base during expression evaluation
+ and boundary analysis. Think of it as a dynamic repository that maintains information about:
+
+ 1. Current known boundaries for columns and expressions
+ 2. Statistics that have been gathered or inferred
+ 3. A mutable state that can be updated as analysis progresses
+
+ The key property of `AnalysisContext` is that it propagates information through the
+ expression tree. As each node in the expression tree is analyzed, it can both read from
+ and write to this shared context, so a constraint found at one node can tighten the
+ bounds used when analyzing the others.
+
+ #### `ColumnStatistics` for Cardinality Estimation
+
+ Column statistics form the foundation of optimization decisions. DataFusion's
+ `ColumnStatistics` provides the following information for each column:
+
+ * Null value counts
+ * Maximum and minimum values
+ * Value sums (for numeric columns)
+ * Distinct value counts
+
+ Each of these statistics is wrapped in a `Precision` type that indicates whether the value is
+ exact or estimated, allowing the optimizer to make informed decisions about the reliability
+ of its cardinality estimates.
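
The idea behind the `Precision` wrapper can be sketched with a simplified stand-in. The enum below is defined locally for illustration and only mimics the exact / estimated / absent distinction described above; the `add` method is an assumption about how one might combine two counts, not a description of DataFusion's own code.

```rust
/// Simplified, locally defined stand-in for a precision wrapper.
/// It records whether a statistic is exact, estimated, or unknown.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl Precision<usize> {
    /// Sum two counts. The result is only as reliable as the weaker input:
    /// exact + exact stays exact, anything involving an estimate becomes an
    /// estimate, and a missing input makes the sum unknown.
    fn add(self, other: Self) -> Self {
        use Precision::*;
        match (self, other) {
            (Exact(a), Exact(b)) => Exact(a + b),
            (Exact(a) | Inexact(a), Exact(b) | Inexact(b)) => Inexact(a + b),
            _ => Absent,
        }
    }
}

fn main() {
    // Hypothetical null counts from two files of the same table.
    let file_a_nulls: Precision<usize> = Precision::Exact(10);
    let file_b_nulls: Precision<usize> = Precision::Inexact(5);

    println!("{:?}", file_a_nulls.add(file_b_nulls)); // Inexact(15)
    println!("{:?}", file_a_nulls.add(Precision::Absent)); // Absent
}
```

Carrying the reliability alongside the value lets a consumer decide, for example, to trust an exact row count but fall back to a conservative plan when the count is only an estimate.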
+
+ ### Boundary Analysis Flow
+
+ The boundary analysis process flows through several stages, with each stage building
+ upon the information gathered in previous stages. The `AnalysisContext` is continuously
+ updated as the analysis progresses through the expression tree.
+
+ #### Expression Boundary Analysis
+
+ When analyzing expressions, DataFusion runs boundary analysis using interval arithmetic.
+ Consider a simple predicate like `age > 18 AND age <= 25`. The analysis flows as follows:
+
+ 1. Context Initialization
+    * Begin with known column statistics
+    * Set up initial boundaries based on column constraints
+    * Initialize the shared analysis context
+
+ 2. Expression Tree Walk
+    * Analyze each node in the expression tree
+    * Propagate boundary information upward
+    * Allow child nodes to influence parent boundaries
+
+ 3. Boundary Updates
+    * Each expression can update the shared context
+    * Changes flow through the entire expression tree
+    * Final boundaries inform optimization decisions
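
The three stages above can be sketched with a toy model. Everything here (`Interval`, `Expr`, `Context`, `narrow_bounds`) is hypothetical and defined locally; it only illustrates the shape of the flow under the assumption of integer columns and closed intervals, and is much simpler than what DataFusion does.

```rust
use std::collections::HashMap;

/// Closed integer interval [lo, hi].
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lo: i64,
    hi: i64,
}

/// Tiny predicate language: a column compared against a literal, and AND.
enum Expr {
    Gt(&'static str, i64),
    LtEq(&'static str, i64),
    And(Box<Expr>, Box<Expr>),
}

/// Shared state holding the current bounds per column
/// (a toy stand-in for an analysis context).
struct Context {
    bounds: HashMap<&'static str, Interval>,
}

/// Stage 2 and 3: walk the tree and let every node narrow the shared bounds.
fn narrow_bounds(expr: &Expr, ctx: &mut Context) {
    match expr {
        Expr::Gt(col, v) => {
            if let Some(b) = ctx.bounds.get_mut(col) {
                // `col > v` on integers means `col >= v + 1`.
                // (Sketch only: the `v == i64::MAX` edge case is not handled.)
                b.lo = b.lo.max(v.saturating_add(1));
            }
        }
        Expr::LtEq(col, v) => {
            if let Some(b) = ctx.bounds.get_mut(col) {
                b.hi = b.hi.min(*v);
            }
        }
        Expr::And(left, right) => {
            // Both children write to the same context, so the second child
            // starts from the bounds already narrowed by the first.
            narrow_bounds(left, ctx);
            narrow_bounds(right, ctx);
        }
    }
}

fn main() {
    // Stage 1: initialize the context from known column statistics.
    let mut ctx = Context {
        bounds: HashMap::from([("age", Interval { lo: 14, hi: 79 })]),
    };

    // age > 18 AND age <= 25
    let expr = Expr::And(
        Box::new(Expr::Gt("age", 18)),
        Box::new(Expr::LtEq("age", 25)),
    );

    narrow_bounds(&expr, &mut ctx);
    println!("{:?}", ctx.bounds["age"]); // Interval { lo: 19, hi: 25 }
}
```

Starting from statistics that place `age` in `[14, 79]`, the walk narrows the bounds to `[19, 25]`. Keeping the bounds in one mutable context is what lets the two sides of the `AND` refine the same column without passing intervals back and forth explicitly.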
+
+ ### Working with the analysis API
+
+ The following example shows how you can run an analysis pass on a physical expression
+ to infer the selectivity of the expression and the space of possible values it can
+ take.
+
+ ```rust
+ // Import paths follow recent DataFusion releases and may differ between versions.
+ use std::sync::Arc;
+
+ use datafusion::arrow::datatypes::{DataType, Field, Schema};
+ use datafusion::common::stats::Precision;
+ use datafusion::common::{ColumnStatistics, DFSchema, ScalarValue};
+ use datafusion::error::Result;
+ use datafusion::physical_expr::{analyze, AnalysisContext, ExprBoundaries};
+ use datafusion::prelude::*;
+
+ fn analyze_filter_example() -> Result<()> {
+     // Create a schema with an 'age' column (assumed non-nullable here)
+     let age = Field::new("age", DataType::Int64, false);
+     let schema = Arc::new(Schema::new(vec![age]));
+
+     // Define column statistics
+     let column_stats = ColumnStatistics {
+         null_count: Precision::Exact(0),
+         max_value: Precision::Exact(ScalarValue::Int64(Some(79))),
+         min_value: Precision::Exact(ScalarValue::Int64(Some(14))),
+         distinct_count: Precision::Absent,
+         sum_value: Precision::Absent,
+     };
+
+     // Create expression: age > 18 AND age <= 25
+     let expr = col("age")
+         .gt(lit(18i64))
+         .and(col("age").lt_eq(lit(25i64)));
+
+     // Initialize analysis context with the boundaries of column 0 ('age')
+     let initial_boundaries = vec![ExprBoundaries::try_from_column(
+         &schema, &column_stats, 0)?];
+     let context = AnalysisContext::new(initial_boundaries);
+
+     // Analyze expression
+     let df_schema = DFSchema::try_from(schema)?;
+     let physical_expr = SessionContext::new().create_physical_expr(expr, &df_schema)?;
+     // The returned context holds the refined boundaries and selectivity estimate
+     let analysis = analyze(&physical_expr, context, df_schema.as_ref())?;
+
+     Ok(())
+ }
+ ```