PRIVACY-PRESERVING RECORD LINKAGE METHODS FOR HOMELESSNESS DATA

Felipecastanog/final_project_ENEL645

Final Project ENEL645

Abstract

This study evaluated threshold-based classification and XGBoost for privacy-preserving record linkage (PPRL) of homelessness data. The threshold method, which combines Bloom filters with the Dice coefficient, achieved precision of up to 85% and accuracy of 82.6%, but required significant computational resources, making full-scale implementation challenging. XGBoost, enhanced with feature engineering and ADASYN for class balancing, achieved precision and recall above 80% with 72.8% overall accuracy, while being more efficient on larger datasets. Threshold methods are suitable for resource-limited settings, while XGBoost provides robust performance where computational capacity allows. Together, these approaches demonstrate the potential for unifying fragmented homelessness data, improving policy-making and resource allocation while maintaining privacy.

Repository Structure

Data

Contains synthetic datasets used in this study for PPRL experiments.

Codes

Contains the following files:

  • PrivacyPreserving_and_Threshold.ipynb: Preprocesses the data, applies Bloom filters, and implements the threshold-based method for 500 subjects.
  • 3000threshold.py: Same pipeline as the notebook above, but run on 3,000 subjects (the maximum allowed by the computational resources available in TALC).
  • 3000threshold.slurm: Slurm batch script needed to run 3000threshold.py in TALC.
  • 3000threshold_31254.out: Output file generated by running 3000threshold.py; the complete results are in output_updated_metrics_3000threshold.csv.
  • Final_ML_XGBoost.ipynb: Preprocesses the data, then trains and evaluates an XGBoost machine learning model.
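
The jump from 500 to 3,000 subjects can be put in perspective with a rough count of record comparisons. The sketch below assumes that every record in one dataset is compared with every record in a second dataset of the same size, with no blocking step. The README does not state this, so both the assumption and the helper name `candidate_pairs` are illustrative only.

```python
def candidate_pairs(n_a, n_b):
    """Number of record pairs when every record in A is compared with every record in B.

    Assumption: exhaustive pairwise comparison, no blocking or indexing.
    """
    return n_a * n_b


for n in (500, 3000):
    print(f"{n} subjects per dataset -> {candidate_pairs(n, n):,} comparisons")

# 500 subjects per dataset -> 250,000 comparisons
# 3000 subjects per dataset -> 9,000,000 comparisons
```

Under that assumption, a sixfold increase in subjects means 36 times as many comparisons, which may help explain why the larger run was moved to a cluster.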

PDFs

Contains the final project report and presentation.

Key Contributions

  • Threshold-based PPRL: Demonstrates the effectiveness of Bloom filters combined with the Dice coefficient for resource-constrained environments.
  • XGBoost for PPRL: Highlights the potential of machine learning in achieving efficient and accurate record linkage in privacy-sensitive contexts.
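
The threshold-based idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in this repository. The bigram tokenization, the filter size, the number of hash functions, the HMAC-based hashing, the key, and the 0.8 threshold are all assumptions chosen for the example, and the names `bloom_encode`, `dice`, and `is_match` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice the linking parties would agree on one.
SECRET_KEY = b"example-key"


def bigrams(value):
    """Split a string into a set of padded character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=1024, num_hashes=10):
    """Encode a string as the set of bit positions set to 1 in a Bloom filter."""
    positions = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            message = f"{seed}:{gram}".encode("utf-8")
            digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
            positions.add(int(digest, 16) % size)
    return positions


def dice(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of set-bit positions."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def is_match(value_a, value_b, threshold=0.8):
    """Classify a pair as a match when the Dice coefficient reaches the threshold."""
    return dice(bloom_encode(value_a), bloom_encode(value_b)) >= threshold


print(dice(bloom_encode("jonathan smith"), bloom_encode("jonathon smith")))
print(dice(bloom_encode("jonathan smith"), bloom_encode("maria garcia")))
```

Only the encoded bit positions are compared, never the plaintext values. A spelling variant shares most bigrams with the original and so scores much higher than an unrelated name, which is what lets a single threshold separate likely matches from non-matches.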
