Page 336 - Special Topic Session (STS) - Volume 2

STS497 Maria D.M.P.
                  computing and graphics. Apart from its customizability and transparency, R’s
                  RecordLinkage package includes commonly used string comparators and
                  employs probabilistic matching algorithms. The string comparator algorithms
                  tested are limited to those available in the PHP programming language,
                  mainly because the existing ADB SBR system is built on PHP. Actual SBR data
                  sources were used for testing. For confidentiality reasons, this paper
                  reports only the overarching trends observed when testing the record
                  linkage framework on these datasets.
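
                  This page does not say which comparators were tested. PHP's built-in string
                  functions include levenshtein(), similar_text(), soundex(), and metaphone(),
                  so an edit-distance comparator is a plausible candidate. The Python sketch
                  below illustrates that idea only; the function names and the normalization
                  to a 0-1 similarity score are assumptions made for illustration, not the
                  paper's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

                  A score like this lets two spellings of the same business name be compared
                  against a threshold instead of requiring an exact match.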
                     Framework. The outcome of a data matching implementation depends on
                  the source of the data and the year of documentation. There are four possible
                  scenarios for Statistical Business Register (SBR) data: deduplication, time
                  series, compilation, and consolidation. Datasets that come from the same data
                  source and were documented in the same year only require identifying and
                  deleting duplicate records, an outcome commonly termed deduplication. Since
                  most datasets contain numerous duplicate records, deduplication is a
                  prerequisite step before applying record linkage methods to other sources.
                  After deduplication, record linkage between datasets can produce a time
                  series that shows how businesses changed over time, a compilation of
                  initially localized information about each business, or a consolidation of
                  business information coming from various sources across multiple years. All
                  datasets will undergo the proposed data matching framework detailed below to
                  produce a cleaner and more accurate version of the latter three outcomes.
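
                  The four scenarios can be read as a lookup on two questions: do the datasets
                  share a source, and do they share a year? The sketch below encodes that
                  reading. The text states the deduplication and consolidation cases directly;
                  mapping "same source, different year" to time series and "different source,
                  same year" to compilation is an inference from the descriptions above, and
                  the function name is hypothetical.

```python
def linkage_scenario(source_a: str, year_a: int, source_b: str, year_b: int) -> str:
    """Classify a pair of datasets by source and documentation year."""
    same_source = source_a == source_b
    same_year = year_a == year_b
    if same_source and same_year:
        return "deduplication"
    if same_source:
        return "time series"    # inferred: one source followed across years
    if same_year:
        return "compilation"    # inferred: several sources for one year
    return "consolidation"      # several sources across several years
```

                  For example, linking a 2016 tax dataset to a 2016 company registry dataset
                  falls under compilation in this reading.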

                  [Figure: flowchart. Main stages, in order: Data assessment; Data cleaning;
                  Deduplication within each dataset; Record linkage between datasets; Extract
                  and review record pairs. Data cleaning comprises Parsing, Standardization,
                  and Data Transformation. Deduplication and Record linkage each comprise
                  Create blocks, Compare record pairs, Set EM weight, and Review record pairs.]

                       Figure 2.1: Proposed record linkage or data matching framework for ADB SBR data
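
                  The "Create blocks", "Compare record pairs", and "Set EM weight" steps in
                  the figure can be sketched in a few lines. The sketch assumes that "EM
                  weight" refers to expectation-maximization estimates of the agreement
                  probabilities m and u in a Fellegi-Sunter style model with conditionally
                  independent fields; the figure does not spell this out. The field names,
                  starting values, and sample records are invented for illustration, and
                  exact equality stands in for a string comparator.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

EPS = 1e-6  # keeps probabilities away from 0 and 1 so logs and divisions stay defined


def clamp(x):
    return min(max(x, EPS), 1 - EPS)


def create_blocks(records, block_key):
    """Group record indices by a blocking field; only records in one block are compared."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[block_key]].append(idx)
    return blocks


def compare_record_pairs(records, blocks, fields):
    """Map each within-block pair (i, j) to a tuple of 1/0 exact-agreement flags."""
    patterns = {}
    for members in blocks.values():
        for i, j in combinations(members, 2):
            patterns[(i, j)] = tuple(int(records[i][f] == records[j][f]) for f in fields)
    return patterns


def em_weights(patterns, iters=50, p=0.1, m0=0.9, u0=0.1):
    """Estimate m (agreement | match), u (agreement | non-match), and match share p."""
    vectors = list(patterns)
    n, n_fields = len(vectors), len(vectors[0])
    m, u = [m0] * n_fields, [u0] * n_fields
    for _ in range(iters):
        # E-step: posterior probability that each pair is a true match
        g = []
        for v in vectors:
            pm, pu = p, 1 - p
            for k, agree in enumerate(v):
                pm *= m[k] if agree else 1 - m[k]
                pu *= u[k] if agree else 1 - u[k]
            g.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the posteriors
        total = sum(g)
        p = clamp(total / n)
        for k in range(n_fields):
            agree_match = sum(gi * v[k] for gi, v in zip(g, vectors))
            agree_non = sum((1 - gi) * v[k] for gi, v in zip(g, vectors))
            m[k] = clamp(agree_match / max(total, EPS))
            u[k] = clamp(agree_non / max(n - total, EPS))
    return m, u, p


def match_weight(pattern, m, u):
    """Sum of log2 likelihood ratios; higher values favor a match."""
    return sum(
        log2(m[k] / u[k]) if agree else log2((1 - m[k]) / (1 - u[k]))
        for k, agree in enumerate(pattern)
    )


# Made-up records; "zip" is the blocking field, "name" and "city" are compared.
records = [
    {"name": "acme trading", "city": "manila", "zip": "1000"},
    {"name": "acme trading", "city": "manila", "zip": "1000"},
    {"name": "apex foods", "city": "cebu", "zip": "1000"},
    {"name": "bayan motors", "city": "davao", "zip": "8000"},
    {"name": "bayan motors", "city": "davao", "zip": "8000"},
    {"name": "blue reef", "city": "iloilo", "zip": "8000"},
]
blocks = create_blocks(records, "zip")
patterns = compare_record_pairs(records, blocks, ("name", "city"))
m, u, p = em_weights(patterns.values())
weights = {pair: match_weight(v, m, u) for pair, v in patterns.items()}
```

                  Blocking cuts the fifteen possible pairs among six records down to six
                  candidate pairs. The "Review record pairs" step would then inspect pairs
                  whose weight falls between a match and a non-match threshold.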

                     Methodology. Given that the 2016 Tax data and 2016 Company Registry
                  data are the only datasets available, this paper covers only deduplication
                  and compilation. The resulting trends will be validated using randomly
                  generated datasets. Deduplication is a necessary step for all datasets
                  before proceeding with record linkage, because it ensures that no duplicate
                  records remain to produce multiple, equally matched record pairs between
                  datasets.
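
                  A toy example shows why deduplication comes first. It uses exact matching
                  on a single field as a simplification of the probabilistic matching the
                  paper describes; the records and function names are invented.

```python
def link_exact(left, right, key):
    """Return (left_index, right_index) pairs whose key field agrees exactly."""
    index = {}
    for j, rec in enumerate(right):
        index.setdefault(rec[key], []).append(j)
    return [(i, j) for i, rec in enumerate(left) for j in index.get(rec[key], [])]


def deduplicate(records, key):
    """Keep the first record seen for each key value."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


# Made-up datasets: the first contains one duplicate record.
tax = [{"name": "acme trading"}, {"name": "acme trading"}, {"name": "bayan motors"}]
registry = [{"name": "acme trading"}, {"name": "bayan motors"}]

pairs_before = link_exact(tax, registry, "name")
pairs_after = link_exact(deduplicate(tax, "name"), registry, "name")
```

                  Without deduplication, the single registry record for "acme trading" is
                  linked to two tax records with equal strength; after deduplication each
                  registry record has exactly one partner.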


                                                                     325 | ISI WSC 2019