Refactor: Modular architecture and Ray integration for distributed feature engineering #5

Open

taha2samy wants to merge 4 commits into DataSystemsGroupUT:master from
* feat: implement distributed feature generation and importance calculation using Ray
* refactor: modernize random number generation and standardize parameter naming in distributed tasks
* refactor: update random sampling to use Generator and standardize Random Forest hyperparameters in importance calculations
* refactor: add docstrings to utility functions and modernize random sampling in group_by
* feat: add random_state parameter to select_estimator and configure default estimator hyperparameters
* refactor: modularize feature importance calculation and optimize fit loop logic in BigFeat
* refactor: standardize variable naming conventions to lowercase in transform and select_estimator methods
* refactor: modularize fit method by extracting helper functions for importance calculation, batch generation, and weight updates
* fix: set fixed seed for random aggregation selection in local_utils.py
* test: add unit tests for mathematical utility functions in local_utils
* refactor: remove redundant imports and module docstring in distributed_tasks.py
* ci: add dependency installation and coverage report generation to SonarCloud workflow
* feat: include JUnit XML report in test execution and SonarCloud analysis
* chore: configure fetch-depth to 0 in SonarCloud workflow to enable full history analysis
* ci: configure SonarCloud coverage settings via .coveragerc and simplify workflow steps
* feat: add SonarQube configuration file for project analysis
* refactor: remove SonarCloud configuration files and update xunit report path in workflow
* chore: set sonar working directory to root in sonarcloud workflow
* chore: remove sonar.working.directory configuration from SonarCloud workflow
* chore: add tests directory to SonarCloud configuration
* chore: upgrade SonarCloud GitHub action to v7
* chore: update SonarQube action, adjust coverage report path, and include tests in Sonar analysis
* fix: update pytest command and resolve absolute paths for SonarCloud coverage and xunit reports
* fix: update pytest junit format and simplify SonarQube report paths while disabling SCM exclusions
* fix: use python -m pytest to ensure correct environment execution in SonarCloud workflow
* chore: simplify pytest command by removing python -m prefix in sonarcloud workflow
* chore: update SonarCloud configuration to include absolute paths and fix report path keys
* chore: add placeholder debug string to base orchestrator class
* refactor: remove junk debug code and comments from base orchestrator
1. Monolithic to Modular Package Refactor
Replaced the single, hard-to-maintain `bigfeat_base.py` with a specialized package structure. This separation of concerns allows for easier debugging and independent scaling of the generation and selection modules.

Architecture Evolution:
2. Distributed Architecture Integration (Ray)
The legacy model executed iterations sequentially. I have integrated Ray to enable parallel execution of feature crossing and importance scoring across multiple CPU cores or remote clusters.
Code Shift:
- Before: `for i in range(gen_size): self.feat_with_depth(...)` (blocking)
- After: `remote_generate_batch.remote(...)` (asynchronous/distributed)
- Extended `config.py` to support `ray://` addresses for Kubernetes or local auto-scaling.

3. Optimized Memory Management (Shared Object Store)
Passing large NumPy matrices directly to class methods in a distributed environment causes massive memory overhead. I implemented Ray Object Store references.
Implementation Hook:
4. Conversion to Stateless Pure Functions
Refactored logic out of the `BigFeat` class into standalone pure functions in `generator.py`. This was critical because class methods carry the `self` context, which is often too heavy to "pickle" and send over the network.

Logic Refactoring:
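The shape of such a refactor, sketched with a hypothetical `cross_features` helper (the name and body are illustrative, not the actual `generator.py` contents):

```python
import numpy as np

# A stateless pure function in the spirit of generator.py: everything it
# needs arrives as an argument, so Ray can pickle it cheaply instead of
# serializing an entire BigFeat instance through `self`.
def cross_features(X, col_a, col_b, op):
    return op(X[:, col_a], X[:, col_b])

X = np.arange(12, dtype=float).reshape(4, 3)
crossed = cross_features(X, 0, 2, np.multiply)  # column 0 * column 2
```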
5. Distributed Importance Logic & LightGBM Fix
Importance calculations now happen "close to the data" on remote workers. I also identified and fixed a silent bug in the original LightGBM implementation where parameters were being reset.
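A hypothetical reconstruction of this class of bug (not the exact original code):

```python
# Parameters are configured for the importance model...
param = {
    "objective": "binary",
    "num_leaves": 31,
}
# ...but in the buggy version a stray re-initialization followed:
#
#     param = {}   # silently discards num_leaves and objective
#
# so LightGBM fell back to library defaults without any warning.
# With the reset removed, the configured values reach the model:
assert param == {"objective": "binary", "num_leaves": 31}
```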
The Fix:
- The buggy code re-initialized `param = {}` immediately following the parameter definitions, erasing `num_leaves` and `objective`; that reset is now removed.
- The corrected parameters apply to both `classification` and `regression` tasks within `importance.py`.

6. Numerical Stability (Epsilon Patch)
To prevent the system from crashing during massive feature generation, I injected an epsilon constant (`1e-9`) into all probability and importance calculations.
Code Protection:
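A minimal sketch of the guard, assuming a hypothetical `to_probabilities` helper (the actual function names in the PR may differ):

```python
import numpy as np

EPS = 1e-9  # the epsilon constant described above

def to_probabilities(importances):
    """Turn raw importance scores into a sampling distribution.

    Without EPS, an all-zero score vector would divide by zero,
    producing NaNs that crash feature sampling mid-run.
    """
    scores = np.asarray(importances, dtype=float) + EPS
    return scores / scores.sum()

p = to_probabilities([0.0, 0.0, 0.0])  # degenerate input: still a valid distribution
```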
7. Parallel-Safe Randomness (Modern NumPy)
Swapped the old `np.random.RandomState` for the modern `np.random.default_rng`. In a distributed system, using the old method often leads to "collisions" (workers generating the exact same features).

Seeding Strategy:
Each worker derives a distinct seed: `rng_seed = random_state + batch_index + iteration`.

8. Mathematical Utility Refactoring & Modernization
Decoupled mathematical operators into `local_utils.py` and modernized them to ensure compatibility with modern scientific stacks.

- Standardized `unary_cube`, `unary_multinv`, `unary_sqrtabs`, and `unary_logabs` for better integration into the crossing logic.
- Updated the `mode` function to include `keepdims=True`, preventing runtime failures and warnings in SciPy 1.11+ environments.

9. Stateful & Isolated Feature Selection
Transitioned the feature selection logic into a dedicated, stateful `selection.py` module. The `selector` object (e.g., fAnova) now preserves its fitted state within the class. This ensures that the `transform` phase applies the exact same statistical filtering and feature mapping determined during the `fit` phase on new data.

10. Unified Model Evaluation Engine
Consolidated model selection and scoring logic into a centralized `evaluation.py` module. Default estimator hyperparameters are now configured explicitly (`min_samples_leaf=1`, `max_features='sqrt'`). This ensures trees are fully grown to extract high-variance, high-precision importance scores during feature generation.

11. Dynamic Load Balancing (Ray Orchestration)
Overhauled the orchestrator in `base.py` to support high-throughput distributed execution. The orchestrator now inspects `ray.cluster_resources()` and dynamically scales the `batch_size` based on available CPU cores, eliminating idle worker time and optimizing cluster utilization.

12. Tree Path Optimization & Filtering
Optimized the tree-utility logic within `tree_utils.py` to handle large-scale importance scoring. `get_paths` now applies a unique-path filter: by discarding redundant tree branches, the system drastically reduces the computational load required to aggregate split-based feature importances.

13. Experimental Testing Infrastructure (Alpha)
Introduced the project's first automated testing suite, currently in an experimental/alpha stage.
- Added `conftest.py` to manage the automated setup and teardown of local/mock Ray clusters.
- Added `test_utils.py` to rigorously verify the mathematical integrity and output consistency of each feature operator.