
Tim/privacy experiments ces22 #180


Open · wants to merge 33 commits into privacy-analysis-main-branch

Commits (33):
e8f3700
update: Initial commit with privacy experiments on CES 2022 dataset
neeshjaa Feb 13, 2024
298bd15
update: Improvement to the documentation
neeshjaa Feb 13, 2024
ead6749
update: Generate CSV files with binary target feature directly from t…
neeshjaa Feb 13, 2024
3c7c4c2
update: Update documentation regarding outstanding question on predic…
neeshjaa Feb 13, 2024
2dbd04e
update: More documentation tweaks
neeshjaa Feb 13, 2024
258b294
fix: Must make sure Python lint likes my indentation
neeshjaa Feb 14, 2024
06277a9
fix: Yet another indentation issue (tabs aren't allowed?)
neeshjaa Feb 14, 2024
bed5f93
fix: Lint pedantry needed to make the code ugly
neeshjaa Feb 14, 2024
2f7cbaa
fix: Generalize selection of predictive probabilities in predictions.csv
neeshjaa Feb 15, 2024
e31f08d
fix: Clean up some temporary versions
neeshjaa Feb 15, 2024
a948aec
docs: Update privacy README to remove open question about predict.py
neeshjaa Feb 15, 2024
5c9d026
fix: Replace y_true[] input to roc_curve with synthetic column
neeshjaa Feb 15, 2024
ebea79f
Revert "fix: Replace y_true[] input to roc_curve with synthetic column"
neeshjaa Feb 19, 2024
4d656ec
Revert "docs: Update privacy README to remove open question about pre…
neeshjaa Feb 19, 2024
1387d39
Revert "fix: Clean up some temporary versions"
neeshjaa Feb 19, 2024
0597366
Revert "fix: Generalize selection of predictive probabilities in pred…
neeshjaa Feb 19, 2024
9aad228
Revert "fix: Lint pedantry needed to make the code ugly"
neeshjaa Feb 19, 2024
4cdeb7f
Revert "fix: Yet another indentation issue (tabs aren't allowed?)"
neeshjaa Feb 19, 2024
6bc3dc2
Revert "fix: Must make sure Python lint likes my indentation"
neeshjaa Feb 19, 2024
2212bef
Revert "update: More documentation tweaks"
neeshjaa Feb 19, 2024
e6a92a9
Revert "update: Update documentation regarding outstanding question o…
neeshjaa Feb 19, 2024
aeaff38
Revert "update: Generate CSV files with binary target feature directl…
neeshjaa Feb 19, 2024
9a7825e
Revert "update: Improvement to the documentation"
neeshjaa Feb 19, 2024
cc94798
Revert "update: Initial commit with privacy experiments on CES 2022 d…
neeshjaa Feb 19, 2024
6f928d7
feat: Add column to predictions.csv with probability of the predicted…
neeshjaa Feb 19, 2024
8aefa67
fix: Python lint
neeshjaa Feb 19, 2024
944b082
fix: Don't choose target for prediction at random
Schaechtle Feb 22, 2024
c90d756
fix: Ensure predictions.csv is pointing to data/directory
Schaechtle Feb 22, 2024
451d357
chore: Remove plot.show()
Schaechtle Feb 22, 2024
289276f
feat: Save results to disk
Schaechtle Feb 22, 2024
f500f2b
feat: Add pipeline stages for Tim's analysis
Schaechtle Feb 22, 2024
6ca7b00
chore: Treat roc.png as pipeline output
Schaechtle Feb 22, 2024
c603f57
feat: Make the "positive" label for the ROC curve explicit throughout
Schaechtle Feb 22, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -24,3 +24,4 @@ devenv.local.nix
 pom.xml
 pom.xml.asc
 sum-product-dsl/
+/roc.png
1 change: 1 addition & 0 deletions data/.gitignore
@@ -39,3 +39,4 @@
 /predictions.csv
 /synthetic-data-iql.csv
 /db.edn
+/ml-metrics.csv
18 changes: 18 additions & 0 deletions dvc.yaml
@@ -547,3 +547,21 @@ stages:
       - data/xcat
     outs:
       - data/db.edn
+
+  roc-curve:
+    cmd: >
+      python scripts/evaluate-classifier-roc.py
+    deps:
+      - data/predictions.csv
+    params:
+      - synthetic_data_evaluation.positive_label
+    outs:
+      - roc.png
+
+  ml-metrics:
+    cmd: >
+      python scripts/evaluate-classifier-statistics.py
+    deps:
+      - data/predictions.csv
+    outs:
+      - data/ml-metrics.csv
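
With these two stages in place, DVC should be able to rebuild both new outputs by stage name, rerunning them whenever data/predictions.csv or the tracked synthetic_data_evaluation.positive_label parameter changes (a usage note, not part of the diff):

    dvc repro roc-curve ml-metrics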
3 changes: 2 additions & 1 deletion params.yaml
@@ -66,7 +66,8 @@ mi:
 # health_status: ["c) Average", "b) Below average"]
 synthetic_data_evaluation:
   # If target is not specified, a random target is chosen for prediction.
-  #target: Apogee_km
+  target: CC22_320a
+  positive_label: "yes" # XXX
   predictor: Random_forest # One of "Random_forest" or "GLM"
   #N: 10000 # Subsample held-out dataframe with 10000 samples
 database:
39 changes: 39 additions & 0 deletions scripts/evaluate-classifier-roc.py
@@ -0,0 +1,39 @@
#!/usr/bin/python3

import matplotlib.pyplot as plt
import pandas as pd
import yaml
from sklearn.metrics import auc, roc_curve

# Generate a generic ROC curve for the results found in 'data/predictions.csv'.

df = pd.read_csv("data/predictions.csv", header=0)
yp = df["predictive-probability"]
tv = df["true_value"]

with open("params.yaml", "r") as f:
    params = yaml.safe_load(f.read())

# Get the held-out evaluation configuration.
pos_label = params["synthetic_data_evaluation"]["positive_label"]

# Compute the ROC curve and the area under it.
fpr, tpr, thresholds = roc_curve(tv, yp, pos_label=pos_label)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve.
plt.figure()
plt.plot(fpr, tpr, color="darkorange", lw=2, label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.savefig("roc.png")
33 changes: 33 additions & 0 deletions scripts/evaluate-classifier-statistics.py
@@ -0,0 +1,33 @@
#!/usr/bin/python3

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Show some generic metrics for the results found in 'data/predictions.csv'.

df = pd.read_csv("data/predictions.csv", header=0)

y_true = df["true_value"]
y_pred = df["prediction"]

# sklearn's metric functions take y_true first, then y_pred.
print("Accuracy...: %f" % accuracy_score(y_true, y_pred))
print("Precision..: %f" % precision_score(y_true, y_pred, average="macro"))
print("Recall.....: %f" % recall_score(y_true, y_pred, average="macro"))
print("F1.........: %f" % f1_score(y_true, y_pred, average="macro"))

# Also save to disk. Helpful for tracking the result with DVC.
result = pd.DataFrame(
    {
        "metric": ["Accuracy", "Precision", "Recall", "F1"],
        "score": [
            accuracy_score(y_true, y_pred),
            precision_score(y_true, y_pred, average="macro"),
            recall_score(y_true, y_pred, average="macro"),
            f1_score(y_true, y_pred, average="macro"),
        ],
    }
)
result.to_csv("data/ml-metrics.csv", index=False)
14 changes: 13 additions & 1 deletion scripts/predict.py
@@ -174,6 +174,7 @@ def main():
         "training_data": [],
         "test_data": [],
         "prediction": [],
+        "predictive-probability": [],
         "true_value": [],
     }
     for train_dataset_path in args.training:
@@ -193,8 +194,19 @@
         # Need to call NP.array.flatten() here because CatBoost decides to
         # wrap prediction into a separate list.
         results["prediction"].extend((ml_model.predict(X_test).flatten().tolist()))
-        results["true_value"].extend(y_test.tolist())
+
+        # Add a new column with the probability of the predicted value.
+        # Although it looks evil, use of 'classes_' is documented here:
+        # https://scikit-learn.org/stable/modules/generated/sklearn.
+        # linear_model.LogisticRegression.html#sklearn.linear_model.
+        # LogisticRegression.predict_proba
+        probabilities = ml_model.predict_proba(X_test)
[Review thread on: probabilities = ml_model.predict_proba(X_test)]

Contributor:

    So, we need to decide which one of the two binary classes we want to
    predict and always grab the corresponding vector. Assuming the labels
    are encoded as 0 and 1, you could do the following:

    Suggested change:
        probabilities = ml_model.predict_proba(X_test)
        j = list(ml_model.classes_).index(1)

    Alternatively, if your true value is encoded as "true-value", you could
    run j = list(ml_model.classes_).index("true-value").

Collaborator (Author):

    Maybe for greatest generality we should specify the value to be
    predicted in params.yaml, then use that last method to create a new
    vector saying whether or not the predicted value equals it? That way we
    could handle dataset options like yes/no or true/false without any
    preprocessing.

+        pos_label = config["positive_label"]
+        j = list(ml_model.classes_).index(pos_label)
+        for i in range(len(probabilities)):
+            results["predictive-probability"].append(probabilities[i][j])
+
+        results["true_value"].extend(y_test.tolist())
         n_test_datapoints = y_test.shape[0]
         results["target"].extend([target] * n_test_datapoints)
         results["training_data"].extend([train_dataset_path] * n_test_datapoints)