Page 337 - Special Topic Session (STS) - Volume 2
STS497 Maria D.M.P.
Data Assessment. Before any data cleaning can be done, it is imperative
to assess whether the dataset is in a language the analyst can read well
enough to evaluate data quality, and whether its format is compatible with
the software. If not, the dataset must first be translated and/or converted
into a compatible format. Once the dataset is understandable, the next step
is to identify the variables or fields to be used for blocking and for record
comparison within and between the datasets. The dataset must contain
variables pertinent to data matching and must be compatible with the chosen
software so that the succeeding steps of the Record Linkage framework can be
implemented.
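As an illustration of the assessment step, the following Python sketch checks
whether a dataset carries the fields needed for blocking and comparison. It is
a minimal example under stated assumptions: the field names (name, city) and
the sample rows are hypothetical and not taken from the paper, which performs
this step in R.

```python
import csv
import io


def missing_fields(csv_text, required):
    """Return the required field names absent from the CSV header, sorted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return sorted(set(required) - header)


# Hypothetical extracts from two sources; only the headers matter here.
tax_csv = "tax_id,name,city\n001,ACME CORP,MANILA\n"
registry_csv = "reg_no,name,address\nR-9,Acme Corporation,Manila\n"

# Assumed blocking field (city) and comparison field (name).
required = ["name", "city"]

print(missing_fields(tax_csv, required))       # [] : all fields present
print(missing_fields(registry_csv, required))  # ['city'] : needs conversion
```

A dataset that reports missing fields would need to be converted or
supplemented before it can enter the later steps of the framework.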
Data Cleaning. Raw data normally contain many typographical errors and
variations, missing or out-of-date values, and different coding schemes.
String comparison algorithms are highly sensitive to these, even to slight
variations such as capitalization and one-character differences. Phonetic
comparison algorithms are less sensitive, but they report no similarity at
all between two records if a slight misspelling changes the phonetic code.
These issues are addressed by applying parsing, standardization, and data
transformation techniques to the dataset.
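The effect of these sensitivities can be sketched in Python. The example below
is illustrative only: difflib's ratio stands in for whichever string
comparator is actually used, the Soundex function is one common phonetic
coding and not necessarily the one applied in the paper, and the sample names
are hypothetical.

```python
import re
from difflib import SequenceMatcher


def standardize(value):
    """Uppercase, drop punctuation, and collapse repeated whitespace."""
    value = re.sub(r"[^\w\s]", "", value.upper())
    return re.sub(r"\s+", " ", value).strip()


def similarity(a, b):
    """String similarity in [0, 1]; difflib ratio used as a stand-in comparator."""
    return SequenceMatcher(None, a, b).ratio()


def soundex(name):
    """American Soundex code: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so repeated codes can recur after them
    return (name[0] + "".join(digits) + "000")[:4]


# Capitalization and punctuation lower the raw score; standardization fixes it.
raw = similarity("Acme Corp.", "ACME CORP")
clean = similarity(standardize("Acme Corp."), standardize("ACME CORP"))
print(raw < 1.0, clean == 1.0)

# A phonetic code tolerates some variants but is all-or-nothing on others.
print(soundex("Smith") == soundex("Smyth"))          # same code
print(soundex("Catherine") == soundex("Katherine"))  # first letter differs
```

Standardizing before comparison removes differences that carry no
information, leaving the comparator to score genuine discrepancies.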
Deduplication within each Dataset and Record linkage between
Datasets. An additional step to data cleaning is to remove redundant records,
or duplicates, within each dataset. This reduces the chance of duplicated
record pairs during record linkage and ultimately enables smoother data
integration into a central database. Clerical review and classic data cleaning
techniques, such as smart functions, filtering, and conditional formatting, can
be used to perform deduplication. Another option is to use the record linkage
package in R, since deduplication likewise aims to identify similar strings.
The first step is to identify a categorical variable that can group the records
into broad subsets (blocks). The similarity of records within the same subset
is then quantified using a string comparator. Using weights estimated by the
EM (expectation-maximization) algorithm, the software produces a ranking of
record pairs by their likelihood of being true matches. Record linkage between
datasets follows the same steps as deduplication, but is implemented between
two different data sources.
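The blocking and ranking steps can be sketched as follows. This is a
simplified Python illustration and not the R implementation used in the paper:
the raw difflib similarity score stands in for EM-estimated weights, which are
not computed here, and the records and field names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def rank_duplicate_pairs(records, block_field, compare_field):
    """Compare records only within the same block; return pairs sorted by score."""
    # Group record indices by the value of the blocking variable.
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[record[block_field]].append(index)

    scored = []
    for indices in blocks.values():
        for i, j in combinations(indices, 2):
            score = SequenceMatcher(
                None, records[i][compare_field], records[j][compare_field]
            ).ratio()
            scored.append((score, i, j))
    # Highest similarity first: the most likely duplicates head the list.
    return sorted(scored, reverse=True)


# Hypothetical, already standardized records.
records = [
    {"name": "ACME CORP", "city": "MANILA"},
    {"name": "ACME CORPORATION", "city": "MANILA"},
    {"name": "ZENITH TRADING", "city": "MANILA"},
    {"name": "ACME CORP", "city": "CEBU"},
]

for score, i, j in rank_duplicate_pairs(records, "city", "name"):
    print(f"{score:.2f}", records[i]["name"], "|", records[j]["name"])
```

Note that the first and last records are never compared, although their names
are identical, because they fall into different blocks; the choice of blocking
variable therefore trades fewer comparisons against possibly missed pairs. For
linkage between two sources, the same scoring could be applied to pairs drawn
from two lists (for example with itertools.product) instead of pairs within
one list.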
Extract and Review Record Pairs. The results of deduplication and record
linkage can be exported from R and reviewed by the researcher. While the
algorithm can shortlist potential record matches and eliminate
record-by-record clerical review, accurate validation of record matches
requires a subject-matter expert familiar with the datasets.
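One way to prepare such an export is sketched below in Python. The column
names, the blank is_match column for the reviewer's verdict, and the sample
pairs are all assumptions made for illustration; the paper exports its results
from R.

```python
import csv
import io


def pairs_to_csv(pairs):
    """Write ranked pairs to CSV text with a blank column for the reviewer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["score", "record_a", "record_b", "is_match"])
    for score, a, b in pairs:
        # is_match is left empty for the subject-matter expert to fill in.
        writer.writerow([f"{score:.2f}", a, b, ""])
    return buffer.getvalue()


# Hypothetical shortlist of candidate pairs, highest score first.
pairs = [
    (0.72, "ACME CORP", "ACME CORPORATION"),
    (0.31, "ACME CORP", "ZENITH TRADING"),
]
print(pairs_to_csv(pairs))
```

Keeping the score next to each pair lets the reviewer start with the most
likely matches and stop once the scores become implausibly low.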
3. Results
Data Assessment. The framework was tested on Tax data and Company
Registry data. Both are in English and in a format that can be imported into
RStudio, so neither translation nor file conversion was needed before data
preprocessing.
326 | ISI WSC 2019