Page 339 - Special Topic Session (STS) - Volume 2

STS497 Maria D.M.P.
            distribution, with the Jaro-Winkler string comparator showing more optimistic
            match rates than the Levenshtein string comparator.

















            Figure 3.2: Sample frequency distribution of string comparator results of
            Company Registry Data

                Record Linkage between Datasets. Now that both datasets have gone
            through preprocessing and deduplication, a record is less likely to be
            paired with multiple candidates. The tax data initially had 40,038 records.
            After removing 77 confirmed duplicates, 39,961 records remain to be compared
            with the company registry dataset. The company registry's record count
            dropped from 219 to 163 after deduplication, with 56 exact duplicates
            within the same dataset confirmed through manual checking. The blocking
            variables are District and City, and the comparison variables are Business
            Name, Tax Identification Number, and License Number, since these variables
            exist and follow similar formats in both datasets.
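
            The blocking step described above can be sketched as follows. This is only
            an illustration under stated assumptions: the field names, the sample
            records, and the choice to average the per-variable similarities into one
            score are hypothetical and are not taken from the paper, which does not
            show its implementation.

```python
from collections import defaultdict

# Hypothetical field names standing in for the paper's variables.
BLOCK_KEYS = ("district", "city")
COMPARE_KEYS = ("business_name", "tin", "license_no")


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity: 1 - edit_distance / longer length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def candidate_pairs(left, right):
    """Yield only the record pairs that agree on every blocking variable."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[tuple(rec[k] for k in BLOCK_KEYS)].append(rec)
    for rec in left:
        for other in blocks.get(tuple(rec[k] for k in BLOCK_KEYS), []):
            yield rec, other


def score_pair(a, b):
    """Assumption: average the similarity over the comparison variables."""
    sims = [levenshtein_similarity(a[k], b[k]) for k in COMPARE_KEYS]
    return sum(sims) / len(sims)


# Made-up records, for illustration only.
tax = [
    {"district": "D1", "city": "C1", "business_name": "ACME TRADING",
     "tin": "123456789", "license_no": "L-001"},
    {"district": "D2", "city": "C2", "business_name": "BETA FOODS",
     "tin": "987654321", "license_no": "L-002"},
]
registry = [
    {"district": "D1", "city": "C1", "business_name": "ACME TRADING CO",
     "tin": "123456789", "license_no": "L-001"},
]

pairs = list(candidate_pairs(tax, registry))
score = score_pair(*pairs[0])
print(len(pairs), round(score, 3))
```

            Blocking keeps the comparison space small: only records sharing the same
            District and City are compared, so the second tax record above is never
            paired with the registry record.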
                Extract and Review Record Pairs. No exact matches were found between
            the tax data and the company registry data. This is largely due to the small
            sample size of the company registry data. The maximum match rates between
            the two datasets were 0.763 for Jaro-Winkler and 0.499 for Levenshtein.
            The minimum match rates were 0.113 for Jaro-Winkler and 0 for Levenshtein.
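
            To illustrate why the two comparators can report different rates for the
            same pair, the sketch below implements textbook versions of both. It
            assumes the standard Jaro-Winkler definition (prefix scale 0.1, prefix
            capped at 4 characters) and a Levenshtein similarity normalized by the
            longer string; the paper does not state which variants or parameters it
            used, and the sample strings are the usual textbook pair, not the study
            data.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity: 1 - edit_distance / longer length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


jw = jaro_winkler("MARTHA", "MARHTA")
lev = levenshtein_similarity("MARTHA", "MARHTA")
print(round(jw, 3), round(lev, 3))  # 0.961 0.667
```

            Jaro-Winkler treats the swapped letters as one transposition and rewards
            the shared prefix, while Levenshtein counts two substitutions, which is
            consistent with Jaro-Winkler giving the higher rates reported above.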















            Figure 3.3: Sample frequency distribution of string comparator results between
            Tax Data and Company Registry Data
                                                               328 | ISI WSC 2019