
Refactor: Modular architecture and Ray integration for distributed feature engineering#5

Open
taha2samy wants to merge 4 commits into DataSystemsGroupUT:master from taha2samy:master

Conversation

@taha2samy

1. Monolithic to Modular Package Refactor

Replaced the single, hard-to-maintain bigfeat_base.py with a specialized package structure. This separation of concerns allows for easier debugging and independent scaling of generation and selection modules.

Architecture Evolution:

# BEFORE (Monolith)                # AFTER (Modular Package)
BigFeat-master/                    BigFeat-master/
└── bigfeat/                       └── bigfeat/
    └── bigfeat_base.py (1.5k LOC)     ├── __init__.py
                                       ├── base.py (Orchestrator)
                                       ├── generator.py (Logic)
                                       ├── importance.py (Models)
                                       ├── selection.py (Filtering)
                                       └── distributed_tasks.py (Ray)

2. Distributed Architecture Integration (Ray)

The legacy model executed iterations sequentially. I have integrated Ray to enable parallel execution of feature crossing and importance scoring across multiple CPU cores or remote clusters.

Code Shift:

  • Legacy: for i in range(gen_size): self.feat_with_depth(...) (Blocking)
  • New: remote_generate_batch.remote(...) (Asynchronous/Distributed)
  • Connectivity: Added config.py to support ray:// addresses for Kubernetes or local auto-scaling.
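The shift from a blocking loop to asynchronous dispatch can be sketched with the standard library's `ThreadPoolExecutor` as a stand-in for Ray's task API; `generate_batch` below is a hypothetical placeholder for one batch of feature crossing (the real code submits `remote_generate_batch.remote(...)` to a Ray cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_batch(batch_index):
    # hypothetical placeholder for one batch of feature crossing
    return [batch_index * 10 + i for i in range(3)]

gen_size = 4

# Legacy pattern: blocking, one batch at a time
sequential = [generate_batch(i) for i in range(gen_size)]

# New pattern: dispatch every batch up front, then gather results
# (analogous to submitting .remote() tasks and calling ray.get)
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(generate_batch, i) for i in range(gen_size)]
    parallel = [f.result() for f in futures]
```

With Ray the executor is replaced by the cluster scheduler, so the same fan-out pattern scales past a single machine.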

3. Optimized Memory Management (Shared Object Store)

Passing large NumPy matrices directly to class methods in a distributed environment causes massive memory overhead. I implemented Ray Object Store references.

Implementation Hook:

# Instead of passing the actual matrix X, we store it once:
x_ref = ray.put(x_scaled) 

# Workers now receive only a lightweight object reference (x_ref)
# into shared memory, preventing OOM (Out of Memory) crashes.
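The saving is easy to see by comparing serialized payload sizes. The sketch below uses `pickle` and a plain dict as a stand-in for Ray's object store (the dict and the `"x_ref"` key are illustrative, not the actual API):

```python
import pickle
import numpy as np

x_scaled = np.random.default_rng(0).random((10_000, 50))  # ~4 MB of float64

# Shipping the matrix to every worker serializes all of it, every time:
full_payload = len(pickle.dumps(x_scaled))

# Shipping a handle into a shared store is a few bytes, regardless of size:
object_store = {"x_ref": x_scaled}   # stand-in for ray.put / the object store
handle_payload = len(pickle.dumps("x_ref"))
```

With Ray the same asymmetry holds: `ray.put` stores the matrix once in shared memory and each task receives only the `ObjectRef`.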

4. Conversion to Stateless Pure Functions

Refactored logic out of the BigFeat class into standalone Pure Functions in generator.py. This was critical because class methods carry the self context, which is often too heavy to "pickle" and send over the network.

Logic Refactoring:

# BEFORE (Class Bound)
def feat_with_depth(self, X, ...):
    op = self.rng.choice(self.operators) # Bound to instance state

# AFTER (Stateless/Pure)
def feat_with_depth(X, ..., rng, operators): 
    # Can be easily serialized and sent to any remote worker
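A runnable sketch of the stateless shape (the body is hypothetical; the real function builds deeper operator trees):

```python
import numpy as np

def feat_with_depth(X, rng, operators):
    """Hypothetical stateless sketch: everything the function needs arrives
    as an argument, so it pickles cheaply and runs on any remote worker."""
    op = operators[rng.integers(len(operators))]
    i, j = rng.choice(X.shape[1], size=2, replace=False)
    return op(X[:, i], X[:, j])

rng = np.random.default_rng(42)
X = rng.random((100, 5))
new_feature = feat_with_depth(X, rng, operators=[np.add, np.multiply])
```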

5. Distributed Importance Logic & LightGBM Fix

Importance calculations now happen "close to the data" on remote workers. I also identified and fixed a silent bug in the original LightGBM implementation where parameters were being reset.

The Fix:

  • Bug: The original code had param = {} immediately following parameter definitions, erasing num_leaves and objective.
  • Correction: Proper parameter mapping is now maintained for both classification and regression tasks within importance.py.
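A minimal reproduction of the re-binding bug, with the corrected factory shape (`lgbm_params` and the specific values are illustrative, not the actual BigFeat code):

```python
# The bug: re-binding the name erases the settings defined just above,
# so training silently falls back to library defaults.
param = {"objective": "binary", "num_leaves": 31}
param = {}                       # BUG: wipes objective and num_leaves

# The fix: build the mapping once per task type and never re-bind it.
def lgbm_params(task):
    base = {"num_leaves": 31}
    if task == "classification":
        return {**base, "objective": "binary"}
    return {**base, "objective": "regression"}
```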

6. Numerical Stability (Epsilon Patch)

To prevent the system from crashing during massive feature generation, I injected a small epsilon constant (`1e-9`) into all probability and importance calculations.

Code Protection:

# Prevents ZeroDivisionError if a feature has zero gain
ig_vector = importance_sum / (importance_sum.sum() + 1e-9)
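The failure mode this guards against is a batch where every candidate feature has zero gain:

```python
import numpy as np

importance_sum = np.zeros(4)     # every candidate feature scored zero gain
ig_vector = importance_sum / (importance_sum.sum() + 1e-9)
# Without the epsilon this is 0/0, which yields NaNs (and a RuntimeWarning)
# that poison every downstream probability calculation.
```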

7. Parallel-Safe Randomness (Modern NumPy)

Swapped the old np.random.RandomState for the modern np.random.default_rng. In a distributed system, using the old method often leads to "collisions" (workers generating the exact same features).

Seeding Strategy:

  • Implemented a dynamic seed generator: rng_seed = random_state + batch_index + iteration.
  • This ensures that every parallel worker explores a different part of the feature space, maximizing entropy and feature diversity.
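The seeding strategy as stated, in a runnable form (`batch_rng` is a hypothetical wrapper name):

```python
import numpy as np

random_state = 7                 # user-supplied base seed

def batch_rng(batch_index, iteration):
    # Each (batch, iteration) pair gets its own deterministic stream,
    # so parallel workers do not replay identical feature sequences.
    return np.random.default_rng(random_state + batch_index + iteration)

draws_a = batch_rng(batch_index=0, iteration=0).integers(0, 1_000_000, size=5)
draws_b = batch_rng(batch_index=1, iteration=0).integers(0, 1_000_000, size=5)
```

Because the seed is derived rather than shared, runs stay reproducible: the same `(random_state, batch_index, iteration)` triple always yields the same stream.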

8. Mathematical Utility Refactoring & Modernization

Decoupled mathematical operators into local_utils.py and modernized them to ensure compatibility with modern scientific stacks.

  • Operator Standardization: Refactored unary transformations including unary_cube, unary_multinv, unary_sqrtabs, and unary_logabs for better integration into the crossing logic.
  • Modern SciPy Patching: Updated the mode function to include keepdims=True, preventing runtime failures and warnings in SciPy 1.11+ environments.
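The SciPy patch in a runnable form (assumes SciPy ≥ 1.9, where `keepdims` is accepted; in 1.11 the legacy default output shape was removed, which is what broke the old call sites):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 2, 3])

# Explicit keepdims pins the output shape across SciPy versions,
# so downstream indexing like result.mode[0] keeps working.
result = stats.mode(x, keepdims=True)
mode_value = result.mode[0]
count_value = result.count[0]
```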

9. Stateful & Isolated Feature Selection

Transitioned the feature selection logic into a dedicated, stateful selection.py module.

  • Pipeline Consistency: The selector object (e.g., fAnova) now preserves its fitted state within the class. This ensures that the transform phase applies the exact same statistical filtering and feature mapping determined during the fit phase on new data.
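A minimal sketch of the fit/transform contract, using column variance as a stand-in score (the real module uses ANOVA F-statistics; the class and attribute names here are illustrative):

```python
import numpy as np

class AnovaLikeSelector:
    """Hypothetical minimal stateful selector: fit() records which columns
    survive, transform() replays exactly that mask on any new data."""

    def __init__(self, k):
        self.k = k
        self.support_ = None

    def fit(self, X, y=None):
        scores = X.var(axis=0)           # stand-in for the ANOVA F score
        self.support_ = np.argsort(scores)[-self.k:]
        return self

    def transform(self, X):
        if self.support_ is None:
            raise RuntimeError("call fit() before transform()")
        return X[:, self.support_]
```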

10. Unified Model Evaluation Engine

Consolidated model selection and scoring logic into a centralized evaluation.py module.

  • Explicit Growth Control: Enforced explicit hyper-parameters for tree-based models (e.g., min_samples_leaf=1, max_features='sqrt'). This ensures trees are fully grown to extract high-variance, high-precision importance scores during feature generation.
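A hedged sketch of what such a factory looks like (the commit log mentions a `select_estimator` helper; the exact parameter values here are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def select_estimator(task, random_state=0):
    # Growth-related hyper-parameters are pinned explicitly rather than
    # left to library defaults, so importance scores stay comparable.
    params = dict(
        n_estimators=100,
        min_samples_leaf=1,      # let trees grow to purity
        max_features="sqrt",     # decorrelate trees across the ensemble
        random_state=random_state,
    )
    if task == "classification":
        return RandomForestClassifier(**params)
    return RandomForestRegressor(**params)
```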

11. Dynamic Load Balancing (Ray Orchestration)

Overhauled the orchestrator in base.py to support high-throughput distributed execution.

  • Intelligent Batching: Implemented a load-balancing mechanism that polls ray.cluster_resources(). It dynamically scales the batch_size based on available CPU cores, eliminating idle worker time and optimizing cluster utilization.
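The batching arithmetic can be sketched independently of Ray (`dynamic_batch_size` is a hypothetical name; the real orchestrator reads core counts from `ray.cluster_resources()`):

```python
import math

def dynamic_batch_size(total_candidates, cluster_cpus, min_batch=8):
    # Split the candidate pool so each CPU gets roughly one batch,
    # but never shrink batches below a floor that keeps task overhead low.
    per_cpu = math.ceil(total_candidates / max(cluster_cpus, 1))
    return max(per_cpu, min_batch)
```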

12. Tree Path Optimization & Filtering

Optimized the tree-utility logic within tree_utils.py to handle large-scale importance scoring.

  • Redundancy Filtering: Enhanced get_paths with a unique-path filter. By discarding redundant tree branches, the system drastically reduces the computational load required to aggregate split-based feature importances.
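The filter itself reduces to order-preserving deduplication of root-to-leaf paths (a sketch; the real `get_paths` extracts these paths from fitted trees):

```python
def unique_paths(paths):
    """Hypothetical sketch of the redundancy filter: keep the first
    occurrence of each root-to-leaf path, preserving traversal order."""
    seen, out = set(), []
    for path in paths:
        key = tuple(path)
        if key not in seen:
            seen.add(key)
            out.append(path)
    return out
```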

13. Experimental Testing Infrastructure (Alpha)

Introduced the project's first automated testing suite, currently in an experimental/alpha stage.

  • Distributed Lifecycle: Includes conftest.py to manage the automated setup and teardown of local/mock Ray clusters.
  • Operator Validation: Includes test_utils.py to rigorously verify the mathematical integrity and output consistency of each feature operator.

taha2samy and others added 4 commits April 14, 2026 03:21

* feat: implement distributed feature generation and importance calculation using Ray

* refactor: modernize random number generation and standardize parameter naming in distributed tasks

* refactor: update random sampling to use Generator and standardize Random Forest hyperparameters in importance calculations

* refactor: add docstrings to utility functions and modernize random sampling in group_by

* feat: add random_state parameter to select_estimator and configure default estimator hyperparameters

* refactor: modularize feature importance calculation and optimize fit loop logic in BigFeat

* refactor: standardize variable naming conventions to lowercase in transform and select_estimator methods

* refactor: modularize fit method by extracting helper functions for importance calculation, batch generation, and weight updates

* fix: set fixed seed for random aggregation selection in local_utils.py

* test: add unit tests for mathematical utility functions in local_utils

* refactor: remove redundant imports and module docstring in distributed_tasks.py

* ci: add dependency installation and coverage report generation to SonarCloud workflow

* feat: include JUnit XML report in test execution and SonarCloud analysis

* chore: configure fetch-depth to 0 in SonarCloud workflow to enable full history analysis

* ci: configure SonarCloud coverage settings via .coveragerc and simplify workflow steps

* feat: add SonarQube configuration file for project analysis

* refactor: remove SonarCloud configuration files and update xunit report path in workflow

* chore: set sonar working directory to root in sonarcloud workflow

* chore: remove sonar.working.directory configuration from SonarCloud workflow

* chore: add tests directory to SonarCloud configuration

* chore: upgrade SonarCloud GitHub action to v7

* chore: update SonarQube action, adjust coverage report path, and include tests in Sonar analysis

* fix: update pytest command and resolve absolute paths for SonarCloud coverage and xunit reports

* fix: update pytest junit format and simplify SonarQube report paths while disabling SCM exclusions

* fix: use python -m pytest to ensure correct environment execution in SonarCloud workflow

* chore: simplify pytest command by removing python -m prefix in sonarcloud workflow

* chore: update SonarCloud configuration to include absolute paths and fix report path keys

* chore: add placeholder debug string to base orchestrator class

* refactor: remove junk debug code and comments from base orchestrator