Probabilistic Record Linkage and Address Standardization
Host
Lulu Kang
Speaker
Sou-Cheng Choi
Senior Statistician, NORC at the University of Chicago; Research Assistant Professor, Applied Mathematics, IIT
Description
Probabilistic record linkage (PRL) is the process of matching records from different data sources, such as database tables with missing values in the primary key. It can be applied to join or deduplicate records or to impute missing data, improving data quality in each case. An important subproblem in PRL is parsing, or standardizing, a text field such as an address into its component fields, e.g., street number, street name, city, state, zip code, and country. Modern data analysis techniques such as natural language processing and machine learning are often employed in both PRL and address standardization to achieve high linkage or prediction accuracy. In this study, we compare the performance of a few widely used open-source PRL packages, namely FRIL, Link Plus, R RecordLinkage, and SERF. In addition, we evaluate the baseline performance and sensitivity of a number of address-parsing web services, including the U.S. address parser, Google Maps APIs, Geocoder.us, and Data Science Toolkit. We will present the strengths and limitations of the software and services we have evaluated. This is joint work with Edward Mulrow, NORC at the University of Chicago.
Event Topic
Data Science