BENCHMARKING THE EFFECTIVENESS AND EFFICIENCY OF MACHINE LEARNING ALGORITHMS FOR RECORD LINKAGE

Record linkage, the identification of the same entities across several databases in the absence of a unique identifier, is a crucial step in data integration. In this research, we study the effectiveness and efficiency of different machine learning algorithms (SVM, random forest, and neural networks) for linking databases in a controlled experiment. We control for the percentage of heterogeneity in the data and the size of the training dataset. We evaluate the algorithms on (1) linkage quality, measured by metrics such as F1 score under a one-threshold model, and (2) the size of the uncertain region that needs manual review under a two-threshold model. We find that random forests performed very well both on traditional metrics such as F1 score (99.2% to 95.9%) and on manual review set size (7.1% to 21%) for error rates from 0% to 60%. Although the algorithms (random forests, SVMs, and neural networks) fared similarly in terms of F1 score, random forests outperformed the next best model by 28% on average in terms of the percentage of pairs that need manual review.
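The two evaluation views described above can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the code from this repository. The function names, the toy scores, and the threshold values are all hypothetical, and it assumes each candidate record pair carries a match score in [0, 1] plus a known true label. The exact definitions used in the study may differ.

```python
def f1_at_threshold(scores, labels, threshold):
    """One-threshold model: pairs scoring >= threshold are declared matches."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def review_fraction(scores, lower, upper):
    """Two-threshold model: pairs scoring strictly between lower and upper
    are treated as uncertain and sent to manual review."""
    if not scores:
        return 0.0
    uncertain = sum(1 for s in scores if lower < s < upper)
    return uncertain / len(scores)


# Toy data (hypothetical): model scores and true match labels for 8 pairs.
scores = [0.95, 0.90, 0.60, 0.55, 0.40, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

f1 = f1_at_threshold(scores, labels, threshold=0.5)
share = review_fraction(scores, lower=0.3, upper=0.7)
print(f"F1 at 0.5: {f1:.3f}, manual review share: {share:.3f}")
```

In this sketch a model can score well on F1 yet still leave many pairs between the two thresholds, which is why the abstract reports the two measures separately.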
