Page 340 - Special Topic Session (STS) - Volume 2
STS497 Maria D.M.P.
4. Discussion and Conclusion
Data Cleaning. This study compares data matching results between datasets
that have gone through data cleaning or preprocessing and datasets that have
not. Results show that missing characters can decrease the match rate from 1
to as low as 0.875, missing words from 1 to 0.625, wrong capitalization from
1 to 0.625, and minor misspellings from 1 to 0.875. Records that would easily
have been detected as similar entities through manual observation are
demoted to a lower match rate simply because of minor typographical errors.
This supports the need to clean and standardize the datasets by removing
punctuation and special characters, removing extra spaces, capitalizing all
characters, and spelling out common abbreviations.
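The four cleaning steps listed above can be sketched as a single
standardization function. This is a minimal illustration, not the study's
actual preprocessing code: the function name standardize_name and the
ABBREVIATIONS table are hypothetical, since the paper does not list which
abbreviations were spelled out.

```python
import re

# Hypothetical abbreviation table (assumption): the study does not list
# the abbreviations it expanded.
ABBREVIATIONS = {
    "CORP": "CORPORATION",
    "INC": "INCORPORATED",
    "CO": "COMPANY",
    "LTD": "LIMITED",
}


def standardize_name(name: str) -> str:
    """Apply the four cleaning steps to one business name."""
    # 1. Replace punctuation and special characters with a space.
    cleaned = re.sub(r"[^\w\s]|_", " ", name)
    # 2. Capitalize all characters.
    cleaned = cleaned.upper()
    # 3. Splitting on whitespace drops leading, trailing, and repeated spaces.
    tokens = cleaned.split()
    # 4. Spell out common abbreviations, token by token.
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(tokens)


raw = "  Alpha   Trading Corp. "
clean = "ALPHA TRADING CORPORATION"
print(standardize_name(raw) == clean)
```

After standardization, both spellings reduce to the same string, so a
string comparator would no longer penalize the pair for spacing,
capitalization, or abbreviation differences.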
Deduplication. Deduplication showed that the company registry data contains
56 duplicates out of 216 records and that the tax data contains 77 out of
40,038. This means that the software has 56 fewer records to compare with
the rest of the records in the company registry data and 77 fewer records
for the tax data, significantly decreasing the number of duplicate record
pairs and, eventually, the amount of time required for manual validation.
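As a rough illustration of the deduplication step, the sketch below keeps the
first record for each distinct key and sets the repeats aside. It assumes
exact matching on already standardized fields, which is a simplification: the
matching software used in the study may detect duplicates with approximate
comparison instead. The function name, field names, and sample records are
hypothetical.

```python
def deduplicate(records, key_fields):
    """Split records into first occurrences and exact repeats.

    Assumption: two records are duplicates when all key_fields are equal.
    """
    seen = set()
    unique, duplicates = [], []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            unique.append(record)
    return unique, duplicates


# Hypothetical sample records, not taken from the study's data.
registry = [
    {"name": "ALPHA TRADING CORPORATION", "district": "NORTH"},
    {"name": "BETA FOODS INCORPORATED", "district": "SOUTH"},
    {"name": "ALPHA TRADING CORPORATION", "district": "NORTH"},
]
unique, duplicates = deduplicate(registry, ["name", "district"])
print(len(unique), "unique,", len(duplicates), "duplicate")
```

Applied to the counts reported above, removing the duplicates would leave
216 - 56 = 160 company registry records and 40,038 - 77 = 39,961 tax records
for the matching step.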
Blocking Variable. A dataset with more than 1,000 records is likely to
produce more than 1,000,000 record pairs since, without blocking, all records
are automatically compared. Given a large dataset, and for practical
purposes, it is preferable to assign one to three blocking variables. For the
available datasets, categorical variables are ideal for blocking. Variables
that refer to unique characteristics may produce subsets that are too narrow,
and numerical variables may produce subsets that are too wide. In this study,
the best options were establishment role, reporting unit, and location data
(entity district and city).
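The effect of blocking on the number of comparisons can be sketched as
follows. Two files of 1,000 records each yield 1,000 x 1,000 = 1,000,000
pairs when fully cross-compared; blocking restricts comparison to pairs that
agree on every blocking variable. The function candidate_pairs, the field
name "city", and the sample records are hypothetical and only illustrate the
idea, not the software used in the study.

```python
from collections import defaultdict


def candidate_pairs(left, right, block_keys):
    """Return (left, right) record pairs that agree on all blocking variables."""
    blocks = defaultdict(list)
    for record in right:
        blocks[tuple(record[k] for k in block_keys)].append(record)
    pairs = []
    for record in left:
        key = tuple(record[k] for k in block_keys)
        for match in blocks.get(key, []):
            pairs.append((record, match))
    return pairs


# Hypothetical records: three registry entries, four tax entries.
registry = [
    {"name": "ALPHA TRADING", "city": "CITY A"},
    {"name": "BETA FOODS", "city": "CITY B"},
    {"name": "GAMMA MOTORS", "city": "CITY C"},
]
tax = [
    {"name": "ALPHA TRADING CORPORATION", "city": "CITY A"},
    {"name": "ALFA TRADING", "city": "CITY A"},
    {"name": "BETA FOODS INCORPORATED", "city": "CITY B"},
    {"name": "DELTA TEXTILES", "city": "CITY D"},
]

full_pairs = len(registry) * len(tax)
blocked_pairs = candidate_pairs(registry, tax, ["city"])
print(full_pairs, "pairs without blocking,", len(blocked_pairs), "with blocking")
```

A blocking variable that is nearly unique per record would leave almost no
candidate pairs, while one with very few distinct values would barely reduce
the full cross product, which is why mid-granularity categorical variables
are a reasonable choice.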
Comparison Variable. Comparison variables are ideally unique
characteristics of each record, such as business names and tax identification
numbers. Based on manual observation of the record pairs between the tax
data and the company registry data, it is recommended to assign a weight to
each comparison variable based on its likelihood of uniqueness. Business
names, for example, would carry a larger weight than license number or tax
ID, since the latter two may share the same first characters by default due
to a standard naming convention. These shared characters can produce a high
cumulative match rate between record pairs even when the business names are
starkly different.
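The weighting recommendation can be illustrated with a small sketch that
compares an unweighted and a weighted cumulative score. Several assumptions
are made here: the standard-library difflib.SequenceMatcher ratio stands in
for whatever string comparator the matching software uses, the weights (0.6
for the business name, 0.2 each for tax ID and license number) are
hypothetical, and the sample records and identifier formats are invented for
illustration.

```python
from difflib import SequenceMatcher

# Hypothetical weights (assumption): the study recommends weighting but
# does not report specific values.
WEIGHTS = {"name": 0.6, "tax_id": 0.2, "license_no": 0.2}


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for the software's comparator."""
    return SequenceMatcher(None, a, b).ratio()


def match_score(rec_a, rec_b, weights):
    """Weighted average of per-field similarities, normalized by total weight."""
    total = sum(weights.values())
    weighted = sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return weighted / total


# Two different businesses whose identifiers share a conventional prefix.
rec_a = {"name": "ALPHA TRADING", "tax_id": "TIN-000123", "license_no": "LIC-2019-001"}
rec_b = {"name": "ZENITH BAKERY", "tax_id": "TIN-000456", "license_no": "LIC-2019-002"}

equal_weights = {field: 1.0 for field in WEIGHTS}
unweighted = match_score(rec_a, rec_b, equal_weights)
weighted = match_score(rec_a, rec_b, WEIGHTS)
print(f"unweighted: {unweighted:.3f}  weighted: {weighted:.3f}")
```

Because the identifiers share most of their leading characters, equal
weighting lets them inflate the cumulative score; giving the business name
the larger weight pulls the score of this non-matching pair down.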
329 | ISI WSC 2019