Page 341 - Special Topic Session (STS) - Volume 2

STS497 Maria D.M.P.
             Dataset           BusinessName                  TIN       License Number  Match Rate
             Company registry  Army Welfare Project Limited  AAC00013  W2002686
             Tax               A B ENTERPRISE                AAP00434  W2002313
                               0.53                          0.80      0.85            0.73
                                 Table 4.1: Example of match rate result
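
The overall figure in Table 4.1 appears consistent with an unweighted mean of the three field-level match rates. This is an assumption inferred from the numbers, not a rule stated in the text. A minimal sketch of that arithmetic, with hypothetical variable names:

```python
# Field-level match rates copied from Table 4.1.
field_match_rates = {"BusinessName": 0.53, "TIN": 0.80, "LicenseNumber": 0.85}

# Assumption: the overall match rate is the unweighted mean of the field rates.
overall_match_rate = sum(field_match_rates.values()) / len(field_match_rates)

print(round(overall_match_rate, 2))  # 0.73
```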

                String Comparison Algorithms. Given a large enough dataset, both
            algorithms produce match rates that are approximately normally distributed, with
            the Levenshtein distributions relatively more skewed to the right. As the minimum
            and maximum match rates and the match rate frequency distributions show, the
            Levenshtein algorithm is more sensitive to differences between strings. The
            choice between the two depends on the objective of the researcher. To obtain
            more potential matches, and if there is ample time and bandwidth for a
            comprehensive clerical review, the Jaro-Winkler algorithm is preferable. If the
            main objective is to mitigate the risk of false-positive matches and of deleting
            potentially unique records, the Levenshtein algorithm is preferable. Another
            alternative is to average the results of both algorithms, but this would still
            require manual assessment to ensure that the system captures the correct record
            pairs.
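
To make the difference in sensitivity concrete, the following is a minimal, self-contained sketch of both similarity measures, applied to the identifier fields from Table 4.1. It is not the implementation used in the study. The function names are hypothetical, and three choices are assumptions: the Levenshtein distance is normalized by the length of the longer string, the Winkler prefix weight is 0.1, and the common prefix is capped at four characters. Under these assumptions the Jaro-Winkler values for TIN and License Number come out at 0.80 and 0.85, the same figures shown in Table 4.1, while the normalized Levenshtein similarity is lower for both fields.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def jaro_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)


# Identifier fields from Table 4.1.
fields = {"TIN": ("AAC00013", "AAP00434"),
          "License Number": ("W2002686", "W2002313")}
for name, (left, right) in fields.items():
    lev = levenshtein_similarity(left, right)
    jw = jaro_winkler_similarity(left, right)
    averaged = (lev + jw) / 2  # the averaging alternative mentioned in the text
    print(f"{name}: Levenshtein={lev:.3f} Jaro-Winkler={jw:.3f} mean={averaged:.3f}")
```

Because Jaro-Winkler rewards a shared prefix and tolerates unmatched characters more readily, it tends to return higher scores for near-identical identifiers, which fits the recommendation to use it when more candidate pairs are wanted for clerical review.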
                Match Rate Threshold. The match rate threshold defined for this study is
            100%; it will be refined later if the results show that lower match rates can
            also correspond to genuinely matching records. This study shows, however, that
            records detected as 100% similar on the comparison variables may still be
            different records. Among record pairs with a 100% match rate, the probability
            of being actual pairs ranges from 0% to 51%, at least based on the results of
            this study. This suggests that manual validation remains a crucial step to
            ensure that detected record pairs with relatively high match rates are indeed
            similar. Furthermore, most of the duplicates found during deduplication are
            records with consecutive record numbers. In the company registry dataset, for
            example, the 56 duplicates were all consecutive records. This implies that
            these records may pertain to different entities, but the characteristic or
            variable that differentiates them is simply not available in the given
            information. This further underscores the need for manual validation or
            assessment of the record pairs, at least during the initial stages of record
            linkage implementation for SBR datasets.
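
The thresholding and review step described above could be sketched as follows. This is an illustration only, not the study's procedure: the `CandidatePair` type, the function name, and the sample pairs are all hypothetical, and the consecutive-number flag is simply one way to surface the pattern noted in the text for a clerical reviewer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CandidatePair:
    record_a: int      # record number of the first record
    record_b: int      # record number of the second record
    match_rate: float  # overall match rate, from 0.0 to 1.0


def select_for_clerical_review(
    pairs: List[CandidatePair], threshold: float = 1.0
) -> List[Tuple[CandidatePair, bool]]:
    """Return (pair, is_consecutive) for every pair at or above the threshold."""
    review_queue = []
    for pair in pairs:
        if pair.match_rate >= threshold:
            # Consecutive record numbers may indicate distinct entities whose
            # distinguishing variable is missing, so flag them for the reviewer.
            is_consecutive = abs(pair.record_a - pair.record_b) == 1
            review_queue.append((pair, is_consecutive))
    return review_queue


# Hypothetical candidate pairs, invented for illustration only.
candidates = [
    CandidatePair(101, 102, 1.00),
    CandidatePair(205, 870, 1.00),
    CandidatePair(310, 311, 0.73),
]
for pair, is_consecutive in select_for_clerical_review(candidates):
    print(pair.record_a, pair.record_b, is_consecutive)
```

Keeping the threshold as a parameter makes it straightforward to lower it later, as the text anticipates, while every selected pair still goes through manual validation rather than being removed automatically.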








                                                               330 | ISI WSC 2019