-
Notifications
You must be signed in to change notification settings - Fork 14
[Refactor] faster get_leafs method for xbooster/lgb_constructor.py
#13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's guide (collapsed on small PRs)Reviewer's GuideRefactors the margin-output branch of get_leafs in lgb_constructor to batch LightGBM per-tree predictions into a NumPy column stack before constructing the DataFrame, eliminating per-column assignment for better performance. Class diagram for updated get_leafs method in lgb_constructorclassDiagram
class LGBConstructor {
+model
+get_leafs(X, output_type, n_trees)
}
class LightGBMModel {
+predict(X, raw_score, start_iteration, num_iteration)
}
LGBConstructor --> LightGBMModel : uses
Flow diagram for optimized margin branch in get_leafsflowchart LR
A[start] --> B["Call get_leafs with X, output_type, n_trees"]
B --> C{output_type == margin}
C -->|no| D["Execute non margin logic (unchanged)"]
D --> Z[return df_leafs]
C -->|yes| E["Initialize tree_results as empty list"]
E --> F["Set i = 0"]
F --> G{ i < n_trees }
G -->|no| H["Stack tree_results with np.column_stack"]
H --> I["Construct df_leafs = DataFrame(stacked, index=X.index, columns=_colnames)"]
I --> Z[return df_leafs]
G -->|yes| J["res = model.predict(X, raw_score=True, start_iteration=i, num_iteration=1)"]
J --> K["Append res to tree_results"]
K --> L["i = i + 1"]
L --> G
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've left some high level feedback:
- The new
DataFrameconstruction relies on_colnames, which doesn’t appear in this snippet; double-check that_colnamesis defined in this branch and that its length always matchesn_treesto avoid runtime errors or misaligned columns. - Using
index=X.indexchanges behavior compared to the previous implementation (which used the default RangeIndex); ensureXis guaranteed to be a pandas object with anindexattribute, or guard against numpy arrays to avoid attribute errors.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `DataFrame` construction relies on `_colnames`, which doesn’t appear in this snippet; double-check that `_colnames` is defined in this branch and that its length always matches `n_trees` to avoid runtime errors or misaligned columns.
- Using `index=X.index` changes behavior compared to the previous implementation (which used the default RangeIndex); ensure `X` is guaranteed to be a pandas object with an `index` attribute, or guard against numpy arrays to avoid attribute errors.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
Great work on these performance optimizations! All your changes from PRs #10, #11, #13, and #14 have been merged into the What's included:
Impact:
The RC branch ( Thank you for these excellent contributions! |
Description
This PR optimizes the
get_leafsmethod (specifically for output_type="margin") by replacing iterative DataFrame column assignment with the Collect-then-Construct pattern usingnp.column_stack.Key Changes
np.column_stack.Performance Benchmark
Measured using
%%timeiton the example ipynb:Summary by Sourcery
Enhancements:
np.column_stack-based DataFrame construction to speed up margin leaf retrieval.