
Tim/privacy experiments ces22 #180


Open
wants to merge 33 commits into privacy-analysis-main-branch

Conversation

neeshjaa (Collaborator)

This PR reverts a number of previous commits and ends up producing three net changes:

(1) Extend scripts/predict.py to add a column to data/predictions.csv with the probability of the predicted value for the target feature
(2) Add scripts/evaluate-classifier-statistics.py to compute generic metrics for the classifier output in data/predictions.csv
(3) Add scripts/evaluate-classifier-roc.py to generate a graphical ROC curve for the classifier output in data/predictions.csv

There are some open questions about the correctness and morality of using the classes_[] vector in scripts/predict.py, and about the correctness of the computed vector yt[] in scripts/evaluate-classifier-roc.py as an input to the roc_curve() function.
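
For concreteness, a rough, self-contained sketch of the kind of change (1) describes; the tiny synthetic dataset and the column names ("prediction", "probability") are illustrative stand-ins, not necessarily what predict.py actually uses:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the real pipeline.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]
X_test = [[0.5], [2.5]]

ml_model = LogisticRegression().fit(X_train, y_train)
results = pd.DataFrame({"prediction": ml_model.predict(X_test)})

# predict_proba() returns one probability column per class, ordered as in
# ml_model.classes_, so each predicted label has to be mapped back to a column index.
probabilities = ml_model.predict_proba(X_test)
classes = list(ml_model.classes_)

# Probability the model assigned to the value it actually predicted, per row.
results["probability"] = [
    probabilities[i][classes.index(results["prediction"][i])]
    for i in range(len(probabilities))
]
results.to_csv("data/predictions.csv", index=False)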

…y from the model directories"

This reverts commit ead6749.
… target value

The evaluate-classifier-roc.py and evaluate-classifier-statistics.py scripts
generate an ROC curve and generic metrics (respectively) for the predictions
in data/predictions.csv

Three open questions:

(1) Is the use of classes_ in scripts/predict.py correct?
(2) Even if the answer to (1) is "Yes," is use of that vector acceptable
    given that there is apparently no other way to map the elements returned
    by predict_proba() to values in the predicted column?
(3) Is the vector yt[] computed in evaluate-classifier-roc.py correct as the
    first argument to roc_curve()?
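
For reference on question (3): sklearn.metrics.roc_curve() expects the true binary labels as its first argument and the score/probability of the positive class as its second. A minimal example (the values are made up):

from sklearn.metrics import roc_curve

tv = [0, 0, 1, 1]           # true labels for each row, encoded as 0/1
yp = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(tv, yp)
print(fpr, tpr, thresholds)
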
@Schaechtle (Contributor) left a comment


This looks mostly good, but we should fix the points raised in the inline comments.

# Create a new binary column that is 1 IFF the classifier prediction matches
# the true value. This is what roc_curve() seems to want, but I'm not 100%
# sure.
yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
Contributor

Ah, yt in sklearn is normally short for "y-test", meaning the test labels you want to predict, i.e. tv.

Contributor

So either delete it or turn it into a binary integer vector.

Suggested change
yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
yt = [int(v) for v in tv]

yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(yt, yp)
Contributor

Alternatively, if tv is already binary, you could just put tv here.

Suggested change
fpr, tpr, thresholds = roc_curve(yt, yp)
fpr, tpr, thresholds = roc_curve(tv, yp)

# https://scikit-learn.org/stable/modules/generated/sklearn.
# linear_model.LogisticRegression.html#sklearn.linear_model.
# LogisticRegression.predict_proba
probabilities = ml_model.predict_proba(X_test)
Contributor

So, we need to decide which of the two binary classes we want to predict and always grab the corresponding probability vector. Assuming the labels are encoded as 0 and 1, you could do the following:

Suggested change
probabilities = ml_model.predict_proba(X_test)
probabilities = ml_model.predict_proba(X_test)
j = list(ml_model.classes_).index(1)

Alternatively, if your positive value is encoded as "true-value", you could run j = list(ml_model.classes_).index("true-value").
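
Continuing from that suggestion, a small sketch of how j would then be used: index into the probability matrix to get the positive-class probability for every row (yp being the vector that would eventually feed roc_curve()):

probabilities = ml_model.predict_proba(X_test)
j = list(ml_model.classes_).index(1)

# Probability of the positive class for every row, regardless of which
# class the model actually predicted.
yp = probabilities[:, j]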

Collaborator (Author)

Maybe for greatest generality we should specify the value to be predicted in params.yaml, then use that last method to create a new vector saying whether or not the predicted value equals it? That way we could handle dataset label options like yes/no or true/false without any preprocessing.
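
A rough sketch of that idea, assuming a params.yaml key named something like positive_label (the key name is hypothetical) and tv holding the true target values:

import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)
positive_label = params["positive_label"]  # e.g. "yes", "true", or 1

# Binary ground-truth vector: 1 where the true value equals the chosen
# positive label, 0 otherwise. Works for yes/no, true/false, etc.
yt = [1 if v == positive_label else 0 for v in tv]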

# LogisticRegression.predict_proba
probabilities = ml_model.predict_proba(X_test)
for i in range(len(probabilities)):
    j = list(ml_model.classes_).index(results["prediction"][i])
Contributor

... and then you can delete this line, since j no longer needs to be recomputed inside the loop.
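
For illustration, with j hoisted out of the loop the per-row lookup disappears; what the loop body does with each probability is not shown in this hunk, so the column assignment below is only a guess at the intent:

probabilities = ml_model.predict_proba(X_test)
j = list(ml_model.classes_).index(1)

for i in range(len(probabilities)):
    # probability assigned to the chosen positive class for row i
    results.loc[i, "probability"] = probabilities[i][j]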

This will become important when we add DVC stages for
reproducibility (see following commit).
It's not needed if we save the resulting figure to file.
Will help make this part of a reproducible pipeline.
CAVEAT: Users need to change the positive label entry in params.yaml.