
Tim/privacy experiments ces22 #180


Open
wants to merge 33 commits into privacy-analysis-main-branch

Conversation

neeshjaa (Collaborator)

This PR reverts a number of previous commits and ends up producing three net changes:

(1) Extend scripts/predict.py to add a column to data/predictions.csv with the probability of the predicted value for the target feature
(2) Add scripts/evaluate-classifier-statistics.py to compute generic metrics for the classifier output in data/predictions.csv
(3) Add scripts/evaluate-classifier-roc.py to generate a graphical ROC curve for the classifier output in data/predictions.csv

There are some open questions about the correctness and morality of using the classes_[] vector in scripts/predict.py, and about the correctness of the computed vector yt[] in scripts/evaluate-classifier-roc.py as an input to the roc_curve() function.
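
For concreteness, a rough, self-contained sketch of the kind of change (1) describes; the tiny synthetic dataset and the column names ("prediction", "probability") are illustrative stand-ins, not necessarily what predict.py actually uses:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the real pipeline.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]
X_test = [[0.5], [2.5]]

ml_model = LogisticRegression().fit(X_train, y_train)
results = pd.DataFrame({"prediction": ml_model.predict(X_test)})

# predict_proba() returns one probability column per class, ordered as in
# ml_model.classes_, so each predicted label has to be mapped back to a column index.
probabilities = ml_model.predict_proba(X_test)
classes = list(ml_model.classes_)

# Probability the model assigned to the value it actually predicted, per row.
results["probability"] = [
    probabilities[i][classes.index(results["prediction"][i])]
    for i in range(len(probabilities))
]
results.to_csv("data/predictions.csv", index=False)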

…y from the model directories"

This reverts commit ead6749.
… target value

The evaluate-classifier-roc.py and evaluate-classifier-statistics.py scripts
generate an ROC curve and generic metrics (respectively) for the predictions
in data/predictions.csv

Three open questions:

(1) Is the use of classes_ in scripts/predict.py correct?
(2) Even if the answer to (1) is "Yes," is use of that vector acceptable
    given that there is apparently no other way to map the elements returned
    by predict_proba() to values in the predicted column?
(3) Is the vector yt[] computed in evaluate-classifier-roc.py correct as the
    first argument to roc_curve()?
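
For reference on question (3): sklearn.metrics.roc_curve() expects the true binary labels as its first argument and the score/probability of the positive class as its second. A minimal example (the values are made up):

from sklearn.metrics import roc_curve

tv = [0, 0, 1, 1]           # true labels for each row, encoded as 0/1
yp = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(tv, yp)
print(fpr, tpr, thresholds)
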
@Schaechtle (Contributor) left a comment


This looks mostly good, but we should fix the points raised in the inline comments.

# Create a new binary column that is 1 IFF the classifier prediction matches
# the true value. This is what roc_curve() seems to want, but I'm not 100%
# sure.
yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
Contributor

Ah, yt in sklearn is normally short for "y-test", meaning the test labels you want to predict, i.e. tv.

Contributor

So either delete it or turn it into a binary integer vector.

Suggested change
yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
yt = [int(v) for v in tv]

yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(yt, yp)
Contributor

Alternatively, if tv is already binary, you could just put tv here.

Suggested change
fpr, tpr, thresholds = roc_curve(yt, yp)
fpr, tpr, thresholds = roc_curve(tv, yp)

# https://scikit-learn.org/stable/modules/generated/sklearn.
# linear_model.LogisticRegression.html#sklearn.linear_model.
# LogisticRegression.predict_proba
probabilities = ml_model.predict_proba(X_test)
Contributor

So, we need to decide which of the two binary classes we want to predict and always grab the corresponding probability vector. Assuming the labels are encoded as 0 and 1, you could do the following:

Suggested change
probabilities = ml_model.predict_proba(X_test)
probabilities = ml_model.predict_proba(X_test)
j = list(ml_model.classes_).index(1)

Alternatively, if your positive value is encoded as "true-value", you could run j = list(ml_model.classes_).index("true-value").
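
Continuing from that suggestion, a small sketch of how j would then be used: index into the probability matrix to get the positive-class probability for every row (yp being the vector that would eventually feed roc_curve()):

probabilities = ml_model.predict_proba(X_test)
j = list(ml_model.classes_).index(1)

# Probability of the positive class for every row, regardless of which
# class the model actually predicted.
yp = probabilities[:, j]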

Collaborator (Author)

Maybe for greatest generality we should specify the value to be predicted in params.yaml, then use that last method to create a new vector saying whether or not the predicted value equals it? That way we could handle dataset label options like yes/no or true/false without any preprocessing.
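
A rough sketch of that idea, assuming a params.yaml key named something like positive_label (the key name is hypothetical) and tv holding the true target values:

import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)
positive_label = params["positive_label"]  # e.g. "yes", "true", or 1

# Binary ground-truth vector: 1 where the true value equals the chosen
# positive label, 0 otherwise. Works for yes/no, true/false, etc.
yt = [1 if v == positive_label else 0 for v in tv]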

# LogisticRegression.predict_proba
probabilities = ml_model.predict_proba(X_test)
for i in range(len(probabilities)):
    j = list(ml_model.classes_).index(results["prediction"][i])
Contributor

... and then you can delete this line, since j no longer needs to be recomputed inside the loop.
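
For illustration, with j hoisted out of the loop the per-row lookup disappears; what the loop body does with each probability is not shown in this hunk, so the column assignment below is only a guess at the intent:

probabilities = ml_model.predict_proba(X_test)
j = list(ml_model.classes_).index(1)

for i in range(len(probabilities)):
    # probability assigned to the chosen positive class for row i
    results.loc[i, "probability"] = probabilities[i][j]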

This will become important when we add DVC stages for
reproducibility (see following commit).
It's not needed if we save the resulting figure to file.
Will help make this part of a reproducible pipeline.
CAVEAT: Users need to change the positive label entry in params.yaml.