Page 341 - Special Topic Session (STS)

Page 341 - Special Topic Session (STS) - Volume 2


STS497 Maria D.M.P.
Dataset           BusinessName                  TIN       License Number  Match Rate
Company registry  Army Welfare Project Limited  AAC00013  W2002686
Tax               A B ENTERPRISE                AAP00434  W2002313
                  0.53                          0.80      0.85            0.73
Table 4.1: Example of match rate result
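
The overall match rate of 0.73 in Table 4.1 is consistent with an unweighted mean of the three per-variable rates (0.53, 0.80, 0.85), although the text does not state the aggregation rule. A minimal sketch under that assumption follows; the function name `overall_match_rate` and the dictionary keys are hypothetical.

```python
from statistics import mean


def overall_match_rate(variable_rates):
    """Unweighted mean of per-variable match rates (an assumed aggregation rule)."""
    return mean(variable_rates.values())


# Per-variable match rates as reported in Table 4.1.
rates = {"BusinessName": 0.53, "TIN": 0.80, "LicenseNumber": 0.85}

print(round(overall_match_rate(rates), 2))  # 0.73
```

A weighted mean would be a natural variation if some comparison variables are considered more reliable than others.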

String Comparison Algorithms. Given a large enough dataset, both
algorithms produce approximately normally distributed match rates, with the
Levenshtein distributions relatively more skewed to the right. As observed in the
minimum and maximum match rates and in the match rate frequency distribution,
the Levenshtein algorithm is more sensitive to differences between strings. The
choice between the two depends on the objective of the researcher. To obtain
more potential matches when there is ample time and bandwidth for a
comprehensive clerical review, the Jaro-Winkler algorithm is preferable. If the
main objective is to mitigate the risk of false-positive matches and of deleting
potentially unique records, the Levenshtein algorithm is preferable. Another
alternative is to average the results of both algorithms, but this would still
require manual assessment to ensure that the system captures the correct record
pairs.
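
The comparison above can be sketched in plain Python. This is an illustration, not the implementation used in the study: it assumes the Levenshtein match rate is normalised as 1 - distance / (length of the longer string), and it uses the conventional Jaro-Winkler settings (prefix scale 0.1, prefix capped at 4 characters). All function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def jaro_similarity(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match only if equal and within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for ch1, ch2 in zip(s1[:4], s2[:4]):
        if ch1 != ch2:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)


def average_similarity(s1: str, s2: str) -> float:
    """Simple average of the two measures, as suggested in the text."""
    return (levenshtein_similarity(s1, s2) + jaro_winkler_similarity(s1, s2)) / 2


# TIN and license number pairs taken from Table 4.1.
for left, right in [("AAC00013", "AAP00434"), ("W2002686", "W2002313")]:
    print(left, right,
          round(levenshtein_similarity(left, right), 3),
          round(jaro_winkler_similarity(left, right), 3),
          round(average_similarity(left, right), 3))
```

Under these settings the Jaro-Winkler scores for the TIN and license number pairs come out at 0.80 and 0.85, in line with Table 4.1, which suggests (but does not confirm) that the table was produced with Jaro-Winkler. The Levenshtein similarity for the same license number pair is 0.625, which illustrates its greater sensitivity to differences between strings.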
Match Rate Threshold. The match rate threshold defined for this study is
100%; it will eventually be refined if the results show that lower match rates
can correspond to truly matching records. This study shows, however, that
records detected as 100% similar on the comparison variables may still be
different records. Among record pairs with a 100% match rate, the probability
of these being actual pairs is 0% to 51%, at least based on the results of this
study. This suggests that manual validation is still a crucial step to ensure
that detected record pairs with relatively high match rates are indeed similar.
Furthermore, most of the duplicates found during deduplication are records with
consecutive record numbers. In the company registry dataset, for example, the 56
duplicates were all consecutive records. This implies that these records may
pertain to different entities, but the variable that would differentiate them is
simply not available in the given information. This further underscores the need
for manual validation or assessment of the record pairs, at least during the
initial stages of record linkage implementation for SBR datasets.
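
The thresholding step and the observation about consecutive record numbers could be combined into a simple review queue, sketched below. The record numbers and match rates are made up for illustration, and the names `CandidatePair` and `select_for_review` are hypothetical; the study does not describe its tooling at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class CandidatePair:
    record_a: int      # record number of the first record
    record_b: int      # record number of the second record
    match_rate: float  # overall match rate in [0, 1]


def select_for_review(pairs, threshold=1.0):
    """Return (pair, is_consecutive) for every pair at or above the threshold.

    Pairs with consecutive record numbers are flagged so that a clerical
    reviewer can check whether they are true duplicates or distinct entities
    that the available variables cannot tell apart.
    """
    selected = []
    for pair in pairs:
        if pair.match_rate >= threshold:
            consecutive = abs(pair.record_a - pair.record_b) == 1
            selected.append((pair, consecutive))
    return selected


# Made-up candidate pairs, for illustration only.
pairs = [
    CandidatePair(101, 102, 1.0),
    CandidatePair(7, 250, 1.0),
    CandidatePair(12, 13, 0.73),
]

review_queue = select_for_review(pairs)
for pair, consecutive in review_queue:
    print(pair.record_a, pair.record_b, pair.match_rate, consecutive)
```

Lowering `threshold` widens the review queue, which mirrors the trade-off between catching more potential matches and the clerical effort needed to validate them.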

330 | ISI WSC 2019
