Page 335 - Special Topic Session (STS) - Volume 2
STS497 Maria D.M.P.
Record linkage for statistical business register data
Maria Denise M. Peña
Asian Development Bank
Abstract
Data sources for Statistical Business Registers typically differ in structure
and contain typographical errors, which puts the integrity of the database at
risk. Organizations can address this challenge by implementing record
linkage techniques. These techniques aim to minimize duplicate records
and to identify similar entities across different datasets, enabling smoother
data integration. This study explores record linkage methods and preferred
specifications for data cleaning, deduplication, data matching, and validation
of record pairs in Statistical Business Register data, using R in RStudio.
Keywords
Fuzzy match; Deduplication; Entity resolution; Data matching; Data
deduplication
1. Introduction
The ADB Statistical Business Register (SBR) serves as a central database for
national statistics offices to store and retrieve historical and current
information on businesses. This information supports evidence-based
decision- and policy-making for a given territory, so the comprehensiveness
and accuracy of the stored data are important. Because the information comes
from various sources, a crucial challenge for data quality is the variation in
data collection formats and naming conventions, together with data entry
errors.
Government agencies may allocate resources to clean the data manually,
but this method can be unnecessarily time-consuming and is susceptible to
human error. This study uses recent advances in software and programming
techniques to automate, or at least expedite, the process of addressing data
quality issues with reasonably accurate outcomes. The primary objective is to
develop a comprehensive framework for cleaning data and identifying similar
records between different datasets, specifically for the ADB SBR system.
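To make the objective concrete, the following is a minimal sketch in base R of
what cleaning names and identifying similar records between two datasets can
look like. It is an illustration only and not the framework developed in this
study: the helper names clean_name and link_names, the list of legal suffixes,
the 0.2 distance threshold, and the sample business names are all assumptions
introduced here. The sketch relies on adist from the utils package, which
computes a generalized Levenshtein (edit) distance.

```r
# Hypothetical helper: standardise a business name before comparison.
clean_name <- function(x) {
  x <- toupper(x)                       # ignore case differences
  x <- gsub("[[:punct:]]", " ", x)      # replace punctuation with spaces
  # Drop a few common legal suffixes (illustrative list, not exhaustive).
  x <- gsub("\\b(INCORPORATED|INC|CORPORATION|CORP|COMPANY|CO|LIMITED|LTD)\\b",
            " ", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)             # collapse repeated whitespace
  trimws(x)
}

# Hypothetical helper: for each name in `a`, find the closest name in `b`
# by edit distance scaled by the longer cleaned name, and keep the match
# only if the scaled distance is at most `max_dist` (assumed threshold).
link_names <- function(a, b, max_dist = 0.2) {
  ca <- clean_name(a)
  cb <- clean_name(b)
  d <- adist(ca, cb)                          # edit distance matrix
  len <- outer(nchar(ca), nchar(cb), pmax)    # longer length in each pair
  nd <- d / pmax(len, 1)                      # scale to the range [0, 1]
  best <- apply(nd, 1, which.min)             # closest candidate per row
  best_dist <- nd[cbind(seq_along(a), best)]
  data.frame(
    name_a = a,
    name_b = ifelse(best_dist <= max_dist, b[best], NA_character_),
    distance = round(best_dist, 3),
    stringsAsFactors = FALSE
  )
}

# Invented sample records from two hypothetical sources.
registry_a <- c("Acme Trading Co., Inc.", "Sunrise Bakery", "Delta Hardware Ltd")
registry_b <- c("ACME TRADNG INC", "Sunrize Bakery Corp.", "Northern Logistics")

res <- link_names(registry_a, registry_b)
print(res)
```

In this sketch the first two names are linked despite differences in case,
punctuation, legal suffix, and spelling, while the third has no close
counterpart and is left unmatched (NA). In practice the threshold would need to
be tuned and the candidate pairs validated, since a fixed cut-off can both miss
true matches and accept false ones.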
2. Methodology
Scope and Data. The chosen software is R (run in RStudio), a
programming language and free software environment for statistical
324 | ISI WSC 2019