Page 339 - Special Topic Session (STS) - Volume 2

STS497 Maria D.M.P.
distribution, with the Jaro-Winkler string comparator showing more optimistic
match rates than the Levenshtein string comparator.
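
To illustrate why the two comparators can disagree, the following is a minimal Python sketch of both measures, not the implementation used in this study. It assumes the common definitions: Jaro-Winkler with a prefix scaling factor of 0.1 over at most four leading characters, and Levenshtein similarity normalized as one minus the edit distance divided by the length of the longer string. The function names and the example strings are illustrative.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def jaro(s1, s2):
    """Jaro similarity based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed p=0.1, 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


a, b = "MARTHA", "MARHTA"  # illustrative strings, not from the study data
jw = jaro_winkler(a, b)
lev = levenshtein_similarity(a, b)
print(f"Jaro-Winkler: {jw:.3f}  Levenshtein: {lev:.3f}")
```

For this pair, the swapped characters count as a single transposition under Jaro-Winkler but as two edits under Levenshtein, so the scores come out at about 0.96 and 0.67 respectively.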

Figure 3.2: Sample frequency distribution of string comparator results of Company
Registry Data

Record Linkage between Datasets. Now that both datasets have gone
through preprocessing and deduplication, a record is less likely to appear
in multiple pairs. The tax data initially had 40,038 records; after removing
77 confirmed duplicates, 39,961 records remain for comparison with the company
registry dataset. The company registry's record count dropped from 219 to 163
after deduplication, with 56 exact duplicates within the dataset confirmed
through manual checking. The blocking variables are District and City, and
the comparison variables are Business Name, Tax Identification Number, and
License Number, since these variables exist and follow similar formats in
both datasets.
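
The blocking and comparison step described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the field names (`district`, `city`, `business_name`, `tin`, `license_no`), the sample records, and the use of a normalized Levenshtein similarity for every comparison variable are all hypothetical.

```python
from collections import defaultdict
from itertools import product

BLOCK_KEYS = ("district", "city")                        # blocking variables
COMPARE_KEYS = ("business_name", "tin", "license_no")    # comparison variables


def levenshtein_similarity(a, b):
    """1 - edit distance / length of the longer string (assumed normalization)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


def block(records):
    """Group records by their (district, city) key, ignoring case and spacing."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec[k].strip().upper() for k in BLOCK_KEYS)
        blocks[key].append(rec)
    return blocks


def candidate_pairs(left, right):
    """Yield only the pairs whose blocking keys agree in both datasets."""
    left_blocks, right_blocks = block(left), block(right)
    for key in left_blocks.keys() & right_blocks.keys():
        yield from product(left_blocks[key], right_blocks[key])


def compare(a, b):
    """Similarity score per comparison variable for one candidate pair."""
    return {k: levenshtein_similarity(a[k], b[k]) for k in COMPARE_KEYS}


# Made-up records for illustration only.
tax = [
    {"id": "T1", "district": "North", "city": "Alpha",
     "business_name": "ACME TRADING", "tin": "123456789", "license_no": "L-001"},
    {"id": "T2", "district": "South", "city": "Beta",
     "business_name": "BLUE OCEAN FOODS", "tin": "987654321", "license_no": "L-002"},
]
registry = [
    {"id": "R1", "district": "north", "city": "alpha",
     "business_name": "ACME TRADING CO", "tin": "123456780", "license_no": "L-001"},
    {"id": "R2", "district": "East", "city": "Gamma",
     "business_name": "BLUE OCEAN FOOD", "tin": "987654321", "license_no": "L-002"},
]

results = [(a["id"], b["id"], compare(a, b)) for a, b in candidate_pairs(tax, registry)]
for left_id, right_id, scores in results:
    print(left_id, right_id, scores)
```

Blocking keeps the number of comparisons manageable: only records sharing a District and City are paired, so T2 and R2 are never compared here even though their identifiers agree. That trade-off is why blocking variables need to be reliable in both datasets.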
Extract and Review Record Pairs. No exact duplicates were found between
the tax data and the company registry data. This is largely due to the small
sample size of the company registry data. The maximum match rates between
the two datasets were 0.763 for Jaro-Winkler and 0.499 for Levenshtein;
the minimum match rates were 0.113 and 0, respectively.
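
As a sketch of how such a review might be summarized, the snippet below takes a list of similarity scores for candidate pairs and reports the minimum, the maximum, and a frequency distribution over equal-width bins. The scores are made up for illustration and are not the study's results; the function name and the choice of ten bins are assumptions.

```python
from collections import Counter


def frequency_distribution(scores, n_bins=10):
    """Count scores in n_bins equal-width bins over [0, 1].

    A score of exactly 1.0 is placed in the last bin.
    """
    counts = Counter(min(int(s * n_bins), n_bins - 1) for s in scores)
    return [counts.get(i, 0) for i in range(n_bins)]


# Hypothetical similarity scores for six candidate pairs.
scores = [0.12, 0.25, 0.31, 0.38, 0.47, 0.74]

lowest, highest = min(scores), max(scores)
hist = frequency_distribution(scores)

print(f"min={lowest}, max={highest}")
for i, count in enumerate(hist):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}): {count}")
```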

Figure 3.3: Sample frequency distribution of string comparator results between Tax
Data and Company Registry Data
328 | ISI WSC 2019