Page 335 - Special Topic Session (STS) - Volume 2

STS497 Maria D.M.P.



                      Record linkage for statistical business register data
                                       Maria Denise M. Peña
                                        Asian Development Bank

            Abstract
            Data sources for Statistical Business Registers typically differ in
            structure and contain typographical errors, which puts the integrity of
            the database at risk. Organizations can address this challenge by
            implementing record linkage techniques. These techniques aim to minimize
            duplicate records and to identify similar entities across datasets,
            enabling smoother data integration. This study explores record linkage
            methods and preferred specifications for data cleaning, deduplication,
            data matching, and validation of record pairs in Statistical Business
            Register data using R or RStudio.

            Keywords
            Fuzzy match; Deduplication; Entity resolution; Data matching; Data
            deduplication

            1.  Introduction
                The ADB Statistical Business Register (SBR) serves as a central database
            for national statistics offices to store and retrieve historical and current
            information on businesses. This information supports evidence-based
            decision- and policy-making in a given territory, so the comprehensiveness
            and accuracy of the stored data are essential. Because the information
            comes from various sources, the main challenges to data quality are
            varying data collection formats, varying naming conventions, and data
            entry errors.
                Government agencies may allocate resources to clean the data manually,
            but this method can be time-consuming and susceptible to human error.
            This study uses recent advances in software and programming techniques
            to automate, or at least expedite, the process of addressing data quality
            issues while keeping the outcomes reasonably accurate. The primary
            objective is to determine an extensive framework for cleaning data and
            identifying similar records between different datasets, specifically for
            the ADB SBR system.
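
                As a rough illustration of the kind of automation described above, the
            sketch below standardizes business names and links them across two
            sources using only base R. It is not the framework developed in this
            study: the sample names, the clean_name() helper, the suffix rules, and
            the 0.2 distance threshold are all hypothetical choices made for
            illustration.

```r
# Hypothetical business names from two sources (illustrative data only).
source_a <- c("ABC Trading Corp.", "Sunrise Bakery, Inc.", "Delta Hardware")
source_b <- c("A.B.C. TRADING CORPORATION", "Sunrise Bakery Inc", "Omega Foods")

# Hypothetical helper: upper-case the name, strip punctuation, shorten
# common legal-form suffixes, and collapse repeated whitespace.
clean_name <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("\\bCORPORATION\\b", "CORP", x, perl = TRUE)
  x <- gsub("\\bINCORPORATED\\b", "INC", x, perl = TRUE)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

a_clean <- clean_name(source_a)
b_clean <- clean_name(source_b)

# Levenshtein distance between every pair of cleaned names, scaled by the
# length of the longer name so that 0 means identical.
d <- adist(a_clean, b_clean)
norm_d <- d / outer(nchar(a_clean), nchar(b_clean), pmax)

# For each record in source A, keep the closest candidate in source B and
# flag it as a match only if it falls under the assumed threshold.
threshold <- 0.2
best <- apply(norm_d, 1, which.min)
best_dist <- norm_d[cbind(seq_along(a_clean), best)]

result <- data.frame(
  name_a      = source_a,
  candidate_b = source_b[best],
  distance    = round(best_dist, 2),
  matched     = best_dist <= threshold
)
print(result)
```

                With these illustrative inputs, the first two names become identical
            after cleaning and are flagged as matches, while the third has no close
            counterpart. In practice the threshold would need to be tuned and the
            flagged pairs validated, since a single cutoff trades false matches
            against missed ones.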

            2.  Methodology
                Scope and Data. The chosen software is R (optionally run through the
            RStudio IDE), a programming language and free software environment for
            statistical


                                                               324 | ISI WSC 2019