Page 340 - Special Topic Session (STS) - Volume 2

STS497 Maria D.M.P.
                  4.  Discussion and Conclusion
                      Data Cleaning. This study compares data matching results between
                  datasets that went through data cleaning or preprocessing and datasets
                  that did not. Results show that missing characters can decrease the
                  match rate from 1 to as low as 0.875, missing words from 1 to 0.625, wrong
                  capitalization from 1 to 0.625, and minor misspellings from 1 to 0.875. Records
                  that would easily have been detected as similar entities through manual
                  observation are demoted to a lower match rate simply because of minor
                  typographical errors. This further supports the need to clean and
                  standardize the datasets by removing punctuation and special
                  characters, removing extra spaces, capitalizing all characters, and spelling
                  out common abbreviations.
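
The four cleaning steps listed above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the `standardize` function and the `ABBREVIATIONS` table are hypothetical, and the pattern assumes names written in unaccented Latin letters.

```python
import re

# Hypothetical abbreviation table; the paper does not list the
# abbreviations it actually expanded.
ABBREVIATIONS = {
    "CORP": "CORPORATION",
    "INC": "INCORPORATED",
    "CO": "COMPANY",
    "LTD": "LIMITED",
}


def standardize(name: str) -> str:
    """Apply the four cleaning steps from the text to one name."""
    text = name.upper()                        # capitalize all characters
    text = re.sub(r"[^A-Z0-9\s]", " ", text)   # punctuation and special characters become spaces
    words = text.split()                       # splitting on whitespace drops extra spaces
    words = [ABBREVIATIONS.get(word, word) for word in words]  # spell out abbreviations
    return " ".join(words)


if __name__ == "__main__":
    # Two spellings of the same (made-up) business reduce to one string.
    print(standardize("  Acme  Trading co., Inc. "))
    print(standardize("ACME TRADING COMPANY INCORPORATED"))
```

After standardization the two made-up spellings are identical strings, so an exact or fuzzy comparison no longer penalizes them for capitalization, spacing, or punctuation.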
                      Deduplication. Deduplication showed that the company registry data
                  has 56 duplicates out of 216 records and that the tax data has 77 out of
                  40,038. This means that the software will have 56 fewer records
                  to compare with the rest of the records in the company registry data and 77
                  fewer records for the tax data, significantly decreasing the number of
                  duplicate record pairs and, eventually, the amount of time required
                  for manual validation.
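
As a rough illustration of the deduplication step, the sketch below keeps the first record seen for each key and discards later repeats. It assumes exact duplicates on a caller-chosen key; the `deduplicate` helper and the field names `name` and `city` are hypothetical, and the software used in the study may detect duplicates differently (for example, with fuzzy matching).

```python
from typing import Callable, Hashable, Iterable


def deduplicate(records: Iterable[dict], key: Callable[[dict], Hashable]) -> list[dict]:
    """Keep the first record for each key value; drop later repeats."""
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


if __name__ == "__main__":
    # Toy data, not taken from the study.
    registry = [
        {"name": "ACME TRADING", "city": "X"},
        {"name": "ACME TRADING", "city": "X"},   # exact repeat of the first record
        {"name": "BETA FOODS", "city": "Y"},
    ]
    cleaned = deduplicate(registry, key=lambda r: (r["name"], r["city"]))
    print(len(registry), "records before,", len(cleaned), "after")
```

Every record removed here is one fewer record to pair against the other file, which is where the saving in manual validation time comes from.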
                      Blocking Variable. A dataset with more than 1,000 records is likely to have
                  more than 1,000,000 record pairs, since all records will be automatically
                  compared. Given a large dataset, and for practical purposes, it is preferable to
                  assign one to three blocking variables. For the available datasets, categorical
                  variables would be ideal for blocking. Variables that refer to
                  unique characteristics may produce too narrow a subset, and numerical variables
                  may produce too wide a subset. In this study, the best options were establishment
                  role, reporting unit, and location data (entity district and city).
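
To make the effect of blocking concrete, the sketch below pairs records from two files only when they agree on every blocking variable. It is a minimal illustration under assumed field names (`city` is hypothetical), not the blocking routine of the software used in the study. Without blocking, two files of m and n records produce m × n candidate pairs.

```python
from collections import defaultdict


def candidate_pairs(left: list[dict], right: list[dict], block_keys: list[str]) -> list[tuple[dict, dict]]:
    """Return only the pairs whose records agree on all blocking variables."""
    # Index the right-hand file by its blocking-key values.
    blocks = defaultdict(list)
    for record in right:
        blocks[tuple(record[k] for k in block_keys)].append(record)

    # Pair each left-hand record with the right-hand records in the same block.
    pairs = []
    for record in left:
        block = tuple(record[k] for k in block_keys)
        for other in blocks.get(block, []):
            pairs.append((record, other))
    return pairs


if __name__ == "__main__":
    # Toy data, not taken from the study.
    tax = [{"name": "A", "city": "X"}, {"name": "B", "city": "Y"}]
    registry = [
        {"name": "A1", "city": "X"},
        {"name": "C", "city": "X"},
        {"name": "D", "city": "Z"},
    ]
    blocked = candidate_pairs(tax, registry, ["city"])
    print(len(tax) * len(registry), "pairs without blocking,", len(blocked), "with blocking on city")
```

In this toy case blocking on `city` leaves 2 of the 6 possible pairs. A key that is nearly unique per record would leave almost none, which is the "too narrow" case described above.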
                      Comparison Variable. Comparison variables are ideally unique
                  characteristics of each record, such as business names and tax identification
                  numbers. Based on manual observation of the record pairs between the tax
                  data and the company registry data, it is recommended to assign
                  weights to each comparison variable based on its likelihood of uniqueness.
                  Business names, for example, would have a larger weight than license
                  number or tax ID, since the latter two may share the same first characters by
                  default due to a standard naming convention. Without such weights, these shared
                  characters result in a high cumulative match rate between record pairs even
                  if the business names are starkly different.
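
The weighting recommendation can be illustrated with a weighted average of per-field similarities. This is a sketch under stated assumptions, not the scoring used in the study: the field names, the weight values, and the choice of `difflib.SequenceMatcher` as the string similarity measure are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical weights: the business name counts more than the two identifiers.
NAME_HEAVY_WEIGHTS = {"business_name": 0.6, "tax_id": 0.2, "license_no": 0.2}
# Baseline for comparison: every field counts the same.
EQUAL_WEIGHTS = {field: 1.0 for field in NAME_HEAVY_WEIGHTS}


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def weighted_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in weights.items()) / total


if __name__ == "__main__":
    # Toy records: different businesses whose identifiers share a prefix.
    a = {"business_name": "ALPHA BAKERY", "tax_id": "TX-0001", "license_no": "LIC-0001"}
    b = {"business_name": "ZENITH MOTORS", "tax_id": "TX-0002", "license_no": "LIC-0002"}
    print("equal weights:     ", round(weighted_score(a, b, EQUAL_WEIGHTS), 3))
    print("name-heavy weights:", round(weighted_score(a, b, NAME_HEAVY_WEIGHTS), 3))
```

With equal weights the near-identical identifiers pull the score of this unrelated pair upward; giving the business name the larger weight lowers it, which is the behavior the paragraph argues for.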

                                                                     329 | ISI WSC 2019