Page 338 - Special Topic Session (STS) - Volume 2
P. 338

STS497 Maria D.M.P.
                     Data Cleaning. The Tax data originally had a total of 40,038 records and
                  20 fields. Fields that are only 0%-50% populated were deleted; decreasing the
                  number of fields to 17. The Company Registry data originally had a total of
                  219  records  and  143  fields.  Fields  that  are  only  0%-50%  populated  were
                  deleted and information in date format (DAY/MONTH/YEAR) were separated
                  into date day, date month and date year; decreasing the number of fields to
                  29.  For  both  datasets,  all  strings  were  capitalized,  abbreviations  were
                  elongated, and unnecessary punctuations and extra spaces were removed.
                     Deduplication  with  a  Dataset.  Finding  potential  duplicates  within
                  datasets can only be done in RStudio. The tax data was split into eight sections
                  (5,000 records per file) due to the processing limit of the hardware. There will
                  be two rounds of deduplication. The first one will be within each dataset and
                  the second will be between each dataset. The blocking variables used for the
                  company  registry  data  would  be  District  and  City.  Business  TIN,  Business
                  Name, and Business Unit Name were used for comparison. 77 exact duplicates
                  were detected through deduplication, 72 within datasets and 5 between the
                  datasets.  The  new  record  count  is  now  39,961.  The  maximum  match  rate
                  (measure  of  similarity)  is  1  for  both  the  Levenshtein  and  Jarowinkler
                  algorithms. The minimum match rate for Jarowinkler vary between 0.111 to
                  0.194. For Levenshtein, the minimum match rate does not deviate from 0.833.
                  The frequency distribution of the match rates follow a normal distribution. The
                  Levenshtein match rates are slightly more skewed to the right, indicating the
                  increased likelihood of having lower match rates compared to Jarowinkler.















                    Figure 3.1: Sample frequency distribution of string comparator results of Tax Data

                     The blocking variables used for the company registry data were the Location
                  Code,  the  Establishment  Role,  and  the  Reporting  Unit.  Business  Name,
                  Registration Number, Registered Name, and Tax Identification Number were
                  used  for  comparison.  Both  string  comparator  algorithms  (Levenshtein  and
                  Jarowinkler) were used for comparison purposes. 1115 out of 3234 record pairs
                  were 100% matched; however, only 56 were validated to be true duplicates.
                  The frequency distribution follows the same trend as the tax data frequency


                                                                     327 | I S I   W S C   2 0 1 9
   333   334   335   336   337   338   339   340   341   342   343