
Commit 4f61478

remove examples from README.md as duplication of Documenter docs

1 parent: 2d7937b

1 file changed: README.md (+1, -205 lines)

@@ -38,212 +38,8 @@ Pkg.add(PackageSpec(url = "https://github.com/nredell/ShapML.jl"))

## Documentation and Vignettes

-* **[Docs](https://nredell.github.io/ShapML.jl/dev/)**
+* **[Docs](https://nredell.github.io/ShapML.jl/dev/)** (including examples from [MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/))

* **[Consistency with TreeSHAP](https://nredell.github.io/ShapML.jl/dev/vignettes/consistency/)**

* **[Speed - Julia vs Python vs R](https://nredell.github.io/docs/julia_speed)**

## Examples

### Random Forest regression model - Non-parallel

* We'll explain the impact of 13 features from the Boston Housing dataset on the predicted outcome `MedV`--or the median value of owner-occupied homes in $1000's--using predictions from a trained Random Forest regression model and stochastic Shapley values.

* We'll explain a subset of 300 instances and then assess global feature importance by aggregating the unique feature importances for each of these instances.

``` julia
using ShapML
using RDatasets
using DataFrames
using MLJ # Machine learning
using Gadfly # Plotting

# Load data.
boston = RDatasets.dataset("MASS", "Boston")
#------------------------------------------------------------------------------
# Train a machine learning model; currently limited to single outcome regression and binary classification.
outcome_name = "MedV"

# Data prep.
y, X = MLJ.unpack(boston, ==(Symbol(outcome_name)), colname -> true)

# Instantiate an ML model; choose any single-outcome ML model from any package.
random_forest = @load RandomForestRegressor pkg = "DecisionTree"
model = MLJ.machine(random_forest, X, y)

# Train the model.
fit!(model)

# Create a wrapper function that takes the following positional arguments: (1) a
# trained ML model from any Julia package, (2) a DataFrame of model features. The
# function should return a 1-column DataFrame of predictions--column names do not matter.
function predict_function(model, data)
    data_pred = DataFrame(y_pred = predict(model, data))
    return data_pred
end
#------------------------------------------------------------------------------
# ShapML setup.
explain = copy(boston[1:300, :]) # Compute Shapley feature-level predictions for 300 instances.
explain = select(explain, Not(Symbol(outcome_name))) # Remove the outcome column.

reference = copy(boston) # An optional reference population to compute the baseline prediction.
reference = select(reference, Not(Symbol(outcome_name)))

sample_size = 60 # Number of Monte Carlo samples.
#------------------------------------------------------------------------------
# Compute stochastic Shapley values.
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = model,
                        predict_function = predict_function,
                        sample_size = sample_size,
                        seed = 1)

show(data_shap, allcols = true)
```
<p align="center">
<img src="./tools/shap_output.PNG" alt="shap_output">
</p>

* Now we'll create several plots that summarize the Shapley results for our Random Forest model. These plots will eventually be refined and incorporated into `ShapML`.

* **Global feature importance**
    + Because Shapley values represent deviations from the average or baseline prediction, plotting their average absolute value for each feature gives a sense of the magnitude with which they affect model predictions across all explained instances (see the formula below).
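
In symbols, the importance plotted below for feature $j$ is the mean absolute Shapley value over the $n = 300$ explained instances,

$$\mathrm{importance}_j = \frac{1}{n}\sum_{i=1}^{n}\left|\phi_{ij}\right|,$$

where $\phi_{ij}$ is the stochastic Shapley value of feature $j$ for instance $i$ (the `shap_effect` column in `data_shap`).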

``` julia
using Statistics # for mean()

data_plot = DataFrames.by(data_shap, [:feature_name],
                          mean_effect = [:shap_effect] => x -> mean(abs.(x.shap_effect)))

data_plot = sort(data_plot, order(:mean_effect, rev = true))

baseline = round(data_shap.intercept[1], digits = 1)

p = plot(data_plot, y = :feature_name, x = :mean_effect, Coord.cartesian(yflip = true),
         Scale.y_discrete, Geom.bar(position = :dodge, orientation = :horizontal),
         Theme(bar_spacing = 1mm),
         Guide.xlabel("|Shapley effect| (baseline = $baseline)"), Guide.ylabel(nothing),
         Guide.title("Feature Importance - Mean Absolute Shapley Value"))
p
```
<p align="center">
<img src="./tools/feature_importance_example.png" alt="feature_importance">
</p>

* **Global feature effects**
    + The plot below shows how changing the value of the `Rm` feature--the most influential feature overall--affects model predictions (holding the other features constant). Each point represents 1 of our 300 explained instances. The black line is a loess line of best fit to summarize the effect.

``` julia
data_plot = data_shap[data_shap.feature_name .== "Rm", :] # Selecting 1 feature for ease of plotting.

baseline = round(data_shap.intercept[1], digits = 1)

p_points = layer(data_plot, x = :feature_value, y = :shap_effect, Geom.point())
p_line = layer(data_plot, x = :feature_value, y = :shap_effect, Geom.smooth(method = :loess, smoothing = 0.5),
               style(line_width = 0.75mm), Theme(default_color = "black"))
p = plot(p_line, p_points, Guide.xlabel("Feature value"), Guide.ylabel("Shapley effect (baseline = $baseline)"),
         Guide.title("Feature Effect - $(data_plot.feature_name[1])"))
p
```
<p align="center">
<img src="./tools/feature_effects.png" alt="feature_effects">
</p>

***
163-
164-
### Random Forest regression model - Parallel
165-
166-
* We'll explain the same dataset with the same model, but this time we'll compute
167-
the Shapley values in parallel across cores using the built-in distributed computing
168-
in `ShapML` which implements `Distributed.pmap()` internally.
169-
170-
* The stochastic Shapley values will be computed in parallel over 6 cores on the same machine.
171-
172-
* With the same seed set, **non-parallel and parallel computation will return the same results**.
173-

``` julia
using Distributed
addprocs(6) # 6 cores.
```

* The `@everywhere` block of code will load the relevant packages on each core. If you use another ML package, you would swap it in for `using MLJ`.

``` julia
@everywhere begin
    using ShapML
    using DataFrames
    using MLJ
end
```

``` julia
using RDatasets

# Load data.
boston = RDatasets.dataset("MASS", "Boston")
#------------------------------------------------------------------------------
# Train a machine learning model; currently limited to single outcome regression and binary classification.
outcome_name = "MedV"

# Data prep.
y, X = MLJ.unpack(boston, ==(Symbol(outcome_name)), colname -> true)

# Instantiate an ML model; choose any single-outcome ML model from any package.
random_forest = @load RandomForestRegressor pkg = "DecisionTree"
model = MLJ.machine(random_forest, X, y)

# Train the model.
fit!(model)
```

* `@everywhere` is needed to properly initialize the `predict()` wrapper function.

``` julia
# Create a wrapper function that takes the following positional arguments: (1) a
# trained ML model from any Julia package, (2) a DataFrame of model features. The
# function should return a 1-column DataFrame of predictions--column names do not matter.
@everywhere function predict_function(model, data)
    data_pred = DataFrame(y_pred = predict(model, data))
    return data_pred
end
```

* Notice that we've set `ShapML.shap(parallel = :samples)` to perform the computation in parallel across our 60 Monte Carlo samples.

``` julia
# ShapML setup.
explain = copy(boston[1:300, :]) # Compute Shapley feature-level predictions for 300 instances.
explain = select(explain, Not(Symbol(outcome_name))) # Remove the outcome column.

reference = copy(boston) # An optional reference population to compute the baseline prediction.
reference = select(reference, Not(Symbol(outcome_name)))

sample_size = 60 # Number of Monte Carlo samples.
#------------------------------------------------------------------------------
# Compute stochastic Shapley values.
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = model,
                        predict_function = predict_function,
                        sample_size = sample_size,
                        parallel = :samples, # Parallel computation over "sample_size".
                        seed = 1)

show(data_shap, allcols = true)
```
<p align="center">
<img src="./tools/shap_output.PNG" alt="shap_output">
</p>
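
* To verify the claim above, run the non-parallel and parallel examples with the same `seed` and compare the two results. A minimal sketch follows; the names `data_shap_serial` and `data_shap_parallel` are placeholders for the two saved outputs and are not created by the code above.

``` julia
# Save the output of the non-parallel run as `data_shap_serial` and the output
# of the parallel run as `data_shap_parallel`, both computed with `seed = 1`.
data_shap_serial.shap_effect == data_shap_parallel.shap_effect                # exact match expected
all(isapprox.(data_shap_serial.shap_effect, data_shap_parallel.shap_effect))  # tolerant floating-point check
```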
