This study evaluated threshold-based classification and XGBoost for privacy-preserving record linkage (PPRL) of homelessness data. The threshold method, which combines Bloom filters with the Dice coefficient, achieved precision of up to 85% and accuracy of 82.6%, but required significant computational resources, making full-scale implementation challenging. XGBoost, enhanced with feature engineering and ADASYN for class balancing, achieved precision and recall above 80% with 72.8% overall accuracy, while being more efficient on larger datasets. Threshold methods are suitable for resource-limited settings, while XGBoost provides robust performance where computational capacity allows. Together, these approaches demonstrate the potential for unifying fragmented homelessness data, improving policy-making and resource allocation while maintaining privacy.
Contains synthetic datasets used in this study for PPRL experiments.
Contains the code used in this study:
- PrivacyPreserving_and_Threshold.ipynb: Preprocesses the data, applies Bloom filters, and implements the threshold-based method for 500 subjects.
- 3000threshold.py: Same pipeline as the notebook above, but run for 3,000 subjects (the maximum allowed by the computational resources available in TALC).
- 3000threshold.slurm: Bash script needed to run 3000threshold.py in TALC.
- 3000threshold_31254.out: Output file generated by 3000threshold.py; the complete results are in output_updated_metrics_3000threshold.csv.
- Final_ML_XGBoost.ipynb: Preprocesses the data, then trains and evaluates an XGBoost machine learning model.
Contains the final project report and presentation.
- Threshold-based PPRL: Demonstrates the effectiveness of Bloom filters combined with the Dice coefficient for resource-constrained environments.
- XGBoost for PPRL: Highlights the potential of machine learning in achieving efficient and accurate record linkage in privacy-sensitive contexts.
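
To make the threshold-based approach concrete, here is a minimal sketch of Bloom-filter encoding followed by Dice-coefficient matching. It is not the code from the notebooks in this repository. The function names, the use of character bigrams, the filter size (1000 bits), the number of hash functions (20), the shared secret, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib
import hmac


def bigrams(text):
    """Split a normalized string into padded character bigrams (assumed tokenization)."""
    padded = f"_{text.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text, size=1000, num_hashes=20, secret=b"shared-secret"):
    """Return the set of bit positions switched on in the Bloom filter.

    size, num_hashes, and secret are illustrative values, not the study's settings.
    """
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            # One keyed hash per (hash index, bigram) pair.
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient between two Bloom filters given as sets of set bits."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def is_match(value_a, value_b, threshold=0.8):
    """Classify a pair as a match when the Dice score reaches the threshold."""
    return dice(bloom_encode(value_a), bloom_encode(value_b)) >= threshold


if __name__ == "__main__":
    for other in ("john smith", "jon smith", "maria garcia"):
        score = dice(bloom_encode("john smith"), bloom_encode(other))
        print(f"john smith vs {other}: {score:.3f}")
```

Each party would encode its identifiers with the same secret and parameters and share only the bit patterns, so similarity can be scored without exchanging the raw values. Comparing every record pair this way grows quadratically with the number of subjects, which is consistent with the computational limits noted for the threshold method.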