```julia
using Pkg
Pkg.add(PackageSpec(url = "https://github.com/nredell/ShapML.jl"))
```

## Documentation and Vignettes

* **[Docs](https://nredell.github.io/ShapML.jl/dev/)** (including examples from [MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/))

* **[Consistency with TreeSHAP](https://nredell.github.io/ShapML.jl/dev/vignettes/consistency/)**

* **[Speed - Julia vs Python vs R](https://nredell.github.io/docs/julia_speed)**

## Examples

### Random Forest regression model - Non-parallel

* We'll explain the impact of 13 features from the Boston Housing dataset on the
predicted outcome `MedV`--the median value of owner-occupied homes in $1000s--using predictions
from a trained Random Forest regression model and stochastic Shapley values.

* We'll explain a subset of 300 instances and then assess global feature importance
by aggregating the instance-level Shapley values for each feature.

```julia
using ShapML
using RDatasets
using DataFrames
using MLJ  # Machine learning.
using Gadfly  # Plotting.

# Load data.
boston = RDatasets.dataset("MASS", "Boston")
# ------------------------------------------------------------------------------
# Train a machine learning model; currently limited to single-outcome regression and binary classification.
outcome_name = "MedV"

# Data prep.
y, X = MLJ.unpack(boston, ==(Symbol(outcome_name)), colname -> true)

# Instantiate an ML model; choose any single-outcome ML model from any package.
random_forest = @load RandomForestRegressor pkg = "DecisionTree"
model = MLJ.machine(random_forest, X, y)

# Train the model.
fit!(model)

# Create a wrapper function that takes the following positional arguments: (1) a
# trained ML model from any Julia package, (2) a DataFrame of model features. The
# function should return a 1-column DataFrame of predictions--column names do not matter.
function predict_function(model, data)
    data_pred = DataFrame(y_pred = predict(model, data))
    return data_pred
end
# ------------------------------------------------------------------------------
# ShapML setup.
explain = copy(boston[1:300, :])  # Compute Shapley feature-level predictions for 300 instances.
explain = select(explain, Not(Symbol(outcome_name)))  # Remove the outcome column.

reference = copy(boston)  # An optional reference population to compute the baseline prediction.
reference = select(reference, Not(Symbol(outcome_name)))

sample_size = 60  # Number of Monte Carlo samples.
# ------------------------------------------------------------------------------
# Compute stochastic Shapley values.
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = model,
                        predict_function = predict_function,
                        sample_size = sample_size,
                        seed = 1
                        )

show(data_shap, allcols = true)
```
<p align="center">
  <img src="./tools/shap_output.PNG" alt="shap_output">
</p>
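
* As a quick sanity check (a minimal sketch, not part of the package API; it assumes the `index` column shown in the output above identifies each explained instance), the baseline `intercept` plus the sum of an instance's `shap_effect` values should approximately reconstruct the model's prediction for that instance, up to Monte Carlo sampling error.

```julia
# Sanity-check sketch: baseline + sum of Shapley effects ≈ model prediction.
# Assumes `data_shap` has an `index` column identifying each explained instance.
instance = data_shap[data_shap.index .== 1, :]  # All feature effects for instance 1.
reconstructed = instance.intercept[1] + sum(instance.shap_effect)
predicted = predict_function(model, explain[1:1, :]).y_pred[1]
println(reconstructed, " vs. ", predicted)  # Close, but not exact, due to sampling.
```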

* Now we'll create several plots that summarize the Shapley results for our Random Forest model.
These plots will eventually be refined and incorporated into `ShapML`.

* **Global feature importance**
    + Because Shapley values represent deviations from the average or baseline prediction,
    plotting their average absolute value for each feature gives a sense of the magnitude with which
    they affect model predictions across all explained instances.

```julia
using Statistics  # Provides `mean()`.

data_plot = DataFrames.by(data_shap, [:feature_name],
                          mean_effect = [:shap_effect] => x -> mean(abs.(x.shap_effect)))

data_plot = sort(data_plot, order(:mean_effect, rev = true))

baseline = round(data_shap.intercept[1], digits = 1)

p = plot(data_plot, y = :feature_name, x = :mean_effect, Coord.cartesian(yflip = true),
         Scale.y_discrete, Geom.bar(position = :dodge, orientation = :horizontal),
         Theme(bar_spacing = 1mm),
         Guide.xlabel("|Shapley effect| (baseline = $baseline)"), Guide.ylabel(nothing),
         Guide.title("Feature Importance - Mean Absolute Shapley Value"))
p
```
<p align="center">
  <img src="./tools/feature_importance_example.png" alt="feature_importance">
</p>
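
* The same output also supports local explanations. A sketch (again assuming the `index` column noted earlier): rank a single instance's features by absolute Shapley effect to see what drove that one prediction.

```julia
# Local explanation sketch: rank features by |Shapley effect| for instance 1.
instance = data_shap[data_shap.index .== 1, :]
instance = sort(instance, order(:shap_effect, by = abs, rev = true))
instance[:, [:feature_name, :feature_value, :shap_effect]]
```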

* **Global feature effects**
    + The plot below shows how changing the value of the `Rm` feature--the most influential feature overall--affects
    model predictions (holding the other features constant). Each point represents 1 of our 300 explained instances.
    The black line is a loess line of best fit to summarize the effect.

```julia
data_plot = data_shap[data_shap.feature_name .== "Rm", :]  # Select 1 feature for ease of plotting.

baseline = round(data_shap.intercept[1], digits = 1)

p_points = layer(data_plot, x = :feature_value, y = :shap_effect, Geom.point())
p_line = layer(data_plot, x = :feature_value, y = :shap_effect, Geom.smooth(method = :loess, smoothing = 0.5),
               style(line_width = 0.75mm), Theme(default_color = "black"))
p = plot(p_line, p_points, Guide.xlabel("Feature value"), Guide.ylabel("Shapley effect (baseline = $baseline)"),
         Guide.title("Feature Effect - $(data_plot.feature_name[1])"))
p
```
<p align="center">
  <img src="./tools/feature_effects.png" alt="feature_effects">
</p>

***

### Random Forest regression model - Parallel

* We'll explain the same dataset with the same model, but this time we'll compute
the Shapley values in parallel across cores using the distributed computing built into
`ShapML`, which uses `Distributed.pmap()` internally.

* The stochastic Shapley values will be computed in parallel over 6 cores on the same machine.

* With the same seed set, **non-parallel and parallel computation will return the same results**
(see the check after the output at the end of this example).

```julia
using Distributed
addprocs(6)  # Add 6 worker processes (cores).
```
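
* Optionally, confirm that the workers are available (`nworkers()` is from the standard library's `Distributed` module):

```julia
nworkers()  # Should return 6.
```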

* The `@everywhere` block of code will load the relevant packages on each core. If
you use another ML package, you would swap it in for `using MLJ`.

```julia
@everywhere begin
    using ShapML
    using DataFrames
    using MLJ
end
```

```julia
using RDatasets

# Load data.
boston = RDatasets.dataset("MASS", "Boston")
# ------------------------------------------------------------------------------
# Train a machine learning model; currently limited to single-outcome regression and binary classification.
outcome_name = "MedV"

# Data prep.
y, X = MLJ.unpack(boston, ==(Symbol(outcome_name)), colname -> true)

# Instantiate an ML model; choose any single-outcome ML model from any package.
random_forest = @load RandomForestRegressor pkg = "DecisionTree"
model = MLJ.machine(random_forest, X, y)

# Train the model.
fit!(model)
```

* `@everywhere` is needed to properly initialize the `predict_function()` wrapper on each worker.

```julia
# Create a wrapper function that takes the following positional arguments: (1) a
# trained ML model from any Julia package, (2) a DataFrame of model features. The
# function should return a 1-column DataFrame of predictions--column names do not matter.
@everywhere function predict_function(model, data)
    data_pred = DataFrame(y_pred = predict(model, data))
    return data_pred
end
```

* Notice that we've set `ShapML.shap(parallel = :samples)` to perform the computation
in parallel across our 60 Monte Carlo samples.

```julia
# ShapML setup.
explain = copy(boston[1:300, :])  # Compute Shapley feature-level predictions for 300 instances.
explain = select(explain, Not(Symbol(outcome_name)))  # Remove the outcome column.

reference = copy(boston)  # An optional reference population to compute the baseline prediction.
reference = select(reference, Not(Symbol(outcome_name)))

sample_size = 60  # Number of Monte Carlo samples.
# ------------------------------------------------------------------------------
# Compute stochastic Shapley values.
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = model,
                        predict_function = predict_function,
                        sample_size = sample_size,
                        parallel = :samples,  # Parallel computation over "sample_size".
                        seed = 1
                        )

show(data_shap, allcols = true)
```
<p align="center">
  <img src="./tools/shap_output.PNG" alt="shap_output">
</p>
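
* To verify the reproducibility claim above, re-run `ShapML.shap()` without `parallel = :samples` but with the same `seed`; per the claim, it should return identical Shapley values. A minimal sketch using the variables already in scope:

```julia
# Reproducibility sketch: serial run with the same seed for comparison.
data_shap_serial = ShapML.shap(explain = explain, reference = reference,
                               model = model, predict_function = predict_function,
                               sample_size = sample_size, seed = 1)

data_shap_serial.shap_effect == data_shap.shap_effect  # Expected: true.
```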