Tim/privacy experiments ces22 #180
base: privacy-analysis-main-branch
Conversation
…he model directories
This reverts commit 5c9d026.
…dict.py" This reverts commit a948aec.
This reverts commit e31f08d.
…ictions.csv" This reverts commit 2f7cbaa.
This reverts commit bed5f93.
This reverts commit 06277a9.
This reverts commit 258b294.
This reverts commit 2dbd04e.
…n predict.py" This reverts commit 3c7c4c2.
…y from the model directories" This reverts commit ead6749.
This reverts commit 298bd15.
…ataset" This reverts commit e8f3700.
… target value. The evaluate-classifier-roc.py and evaluate-classifier-statistics.py scripts generate an ROC curve and generic metrics (respectively) for the predictions in data/predictions.csv. Three open questions: (1) Is the use of classes_ in scripts/predict.py correct? (2) Even if the answer to (1) is "Yes," is use of that vector acceptable given that there is apparently no other way to map the elements returned by predict_proba() to values in the predicted column? (3) Is the vector yt[] computed in evaluate-classifier-roc.py correct as the first argument to roc_curve()?
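Regarding question (1): scikit-learn documents that the columns returned by `predict_proba()` are ordered to match the fitted model's `classes_` attribute, so indexing into `classes_` is the supported way to recover the mapping. A minimal, self-contained sketch on toy data (not the project's pipeline):

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["no", "no", "yes", "yes"])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)  # shape (n_samples, n_classes), columns ordered as clf.classes_
preds = clf.predict(X)

for i, p in enumerate(preds):
    j = list(clf.classes_).index(p)  # column holding the predicted class
    print(p, probs[i, j])            # probability the model assigned to its own prediction
```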
This looks mostly good. But we should fix the inline comments.
scripts/evaluate-classifier-roc.py:

```python
# Create a new binary column that is 1 IFF the classifier prediction matches
# the true value. This is what roc_curve() seems to want, but I'm not 100%
# sure.
yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
```
Ah, yt in sklearn is normally short for "y-test", meaning the test label you want to predict, i.e. tv.
So either delete it or turn it into a binary integer vector.
```diff
- yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
+ yt = [int(v) for v in tv]
```
scripts/evaluate-classifier-roc.py:

```python
yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(yt, yp)
```
Alternatively, if tv is already binary you could just put tv here.
```diff
- fpr, tpr, thresholds = roc_curve(yt, yp)
+ fpr, tpr, thresholds = roc_curve(tv, yp)
```
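For reference, `roc_curve()` expects the true binary labels as its first argument and the scores for the positive class as its second. A toy example with made-up numbers:

```python
from sklearn.metrics import roc_curve

tv = [0, 1, 0, 1, 1, 0]              # true labels, already binary
yp = [0.1, 0.8, 0.3, 0.7, 0.6, 0.4]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(tv, yp)
print(fpr, tpr, thresholds)
```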
```python
# https://scikit-learn.org/stable/modules/generated/sklearn.
# linear_model.LogisticRegression.html#sklearn.linear_model.
# LogisticRegression.predict_proba
probabilities = ml_model.predict_proba(X_test)
```
So, we need to decide which one of the two binary classes we want to predict and always grab the corresponding vector. Assuming the labels are encoded as 0 and 1, you could do the following:
```diff
  probabilities = ml_model.predict_proba(X_test)
+ j = list(ml_model.classes_).index(1)
```
Alternatively, if your true value is encoded as "true-value", you could run `j = list(ml_model.classes_).index("true-value")`.
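Put together, the per-row lookup collapses to a single column selection. A sketch reusing `ml_model` and `X_test` from the diff above, with the positive label 1 just as the running example:

```python
# Pick one positive class up front and always take that column of
# predict_proba(), instead of the probability of whatever class was
# predicted row by row.
probabilities = ml_model.predict_proba(X_test)
j = list(ml_model.classes_).index(1)  # or .index("true-value") for string labels
yp = probabilities[:, j]              # positive-class scores, row-aligned with X_test
```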
Maybe for greatest generality we should specify the value to be predicted in params.yaml, then use that last method to create a new vector saying whether or not the predicted value equals that? That way we could handle dataset options like yes/no, true/false, without any preprocessing?
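A rough sketch of that idea; the `positive_label` key name is hypothetical, and `ml_model`, `X_test`, and `tv` are the names used elsewhere in this PR:

```python
from sklearn.metrics import roc_curve
import yaml

# params.yaml would gain an entry such as:
#   positive_label: "yes"
with open("params.yaml") as f:
    params = yaml.safe_load(f)
positive_label = params["positive_label"]

# Score column for the chosen label, via the classes_ mapping discussed above.
j = list(ml_model.classes_).index(positive_label)
yp = ml_model.predict_proba(X_test)[:, j]

# Binary ground truth: 1 iff the true value equals the chosen label. This
# handles yes/no, true/false, etc. without preprocessing the dataset.
yt = [1 if v == positive_label else 0 for v in tv]

fpr, tpr, thresholds = roc_curve(yt, yp)
```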
scripts/predict.py:

```python
# LogisticRegression.predict_proba
probabilities = ml_model.predict_proba(X_test)
for i in range(len(probabilities)):
    j = list(ml_model.classes_).index(results["prediction"][i])
```
.... and then you can delete this line.
This will become important when we add DVC stages for reproducibility (see following commit).
It's not needed if we save the resulting figure to file.
Will help make this part of a reproducible pipeline.
CAVEAT: Users need to change the positive label entry in params.yaml
This PR reverts a bunch of previous commits and in the end produces three net changes:

(1) Extend `scripts/predict.py` to add a column to `data/predictions.csv` with the probability of the predicted value for the target feature.
(2) Add `scripts/evaluate-classifier-statistics.py` to compute generic metrics for the classifier output in `data/predictions.csv`.
(3) Add `scripts/evaluate-classifier-roc.py` to generate a graphical ROC curve for the classifier output in `data/predictions.csv`.

There are some open questions about the correctness and morality of using the `classes_[]` vector in `scripts/predict.py`, and the correctness of the computed vector `yt[]` in `scripts/evaluate-classifier-roc.py` as an input to the `roc_curve()` function.