Page 336 - Special Topic Session (STS) - Volume 2
STS497 Maria D.M.P.
computing and graphics. Apart from its customizability and transparency, R’s
Record Linkage package includes commonly used string comparators and
employs probabilistic matching algorithms. The string comparator algorithms
tested are limited to those available in the PHP programming
language, mainly because the existing ADB SBR system is built on PHP. Actual
SBR data sources were used for testing. For confidentiality reasons, this
paper reports only the overarching trends observed when testing the record
linkage framework on these datasets.
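To make the idea of a string comparator concrete, the sketch below implements Levenshtein edit distance and a normalized similarity score. This is an illustration only: the paper does not list at this point which comparators were tested, and Levenshtein is chosen here simply because PHP ships a built-in `levenshtein()` function. The sketch is written in Python rather than PHP or R, and the function name `similarity` and its normalization by the longer string's length are assumptions, not the paper's definitions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings,
    approaching 0.0 as more edits are required."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3, so two business names that differ by a single dropped letter would score close to 1.0 under this normalization.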
Framework. The outcome of a data matching implementation depends on
the source of the data and the year of documentation. There are four possible
scenarios for Statistical Business Register data: deduplication, time-series,
compilation, and consolidation. Datasets that come from the same data source
and were documented in the same year only require identifying and deleting
duplicate records. This outcome is commonly termed deduplication. Since
most datasets have numerous duplicate records, deduplication is a
prerequisite step before implementing record linkage methods with
other sources. After deduplication, the record linkage results between
datasets can produce a time-series that shows how businesses changed
through time, a compilation of initially localized information about each
business, or a consolidation of business information coming from various
sources across multiple years. All datasets will undergo the proposed data
matching framework detailed below to produce a cleaner and more accurate
version of the latter three outcomes.
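The four scenarios can be read as a lookup on two questions: do the datasets share a source, and do they share a year? The sketch below encodes one such reading. Only the deduplication rule (same source, same year) is stated explicitly in the text; the mapping of the other three scenarios is an assumption inferred from their descriptions, and the function name `linkage_scenario` is hypothetical.

```python
def linkage_scenario(source_a: str, year_a: int,
                     source_b: str, year_b: int) -> str:
    """Name the expected outcome of matching two datasets.

    Assumed mapping (only the first rule is stated in the text):
      same source, same year            -> deduplication
      same source, different years      -> time-series
      different sources, same year      -> compilation
      different sources, different years -> consolidation
    """
    same_source = source_a == source_b
    same_year = year_a == year_b
    if same_source and same_year:
        return "deduplication"
    if same_source:
        return "time-series"
    if same_year:
        return "compilation"
    return "consolidation"
```

Under this reading, matching a 2016 tax dataset against a 2016 company registry dataset, for instance `linkage_scenario("tax", 2016, "registry", 2016)`, falls under compilation.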
[Flowchart, stages in order:
1. Data assessment
2. Data cleaning: parsing, standardization, data transformation
3. Deduplication within each dataset: create blocks, compare record pairs, set EM weight, review record pairs
4. Record linkage between datasets: create blocks, compare record pairs, set EM weight, review record pairs
5. Extract and review record pairs]
Figure 2.1: Proposed record linkage or data matching framework for ADB SBR data
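The "create blocks" and "compare record pairs" steps of the framework can be sketched as follows for the deduplication case. This is a minimal illustration under stated assumptions, not the paper's implementation: the blocking rule (first character of the name), the use of Python's `difflib.SequenceMatcher` as the comparator, the fixed threshold standing in for EM-estimated weights, and the sample records are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Hypothetical blocking rule: first character of the upper-cased name.
    return record["name"].strip().upper()[:1]


def candidate_pairs(records: list):
    """Yield index pairs, but only for records that share a block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def name_similarity(a: str, b: str) -> float:
    # Stand-in comparator; the paper's comparators are not specified here.
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def flag_for_review(records: list, threshold: float = 0.85) -> list:
    """Return (i, j, score) for pairs scoring at or above the threshold.

    A fixed threshold replaces the EM weighting step of the framework;
    flagged pairs would still go to clerical review.
    """
    flagged = []
    for i, j in candidate_pairs(records):
        score = name_similarity(records[i]["name"], records[j]["name"])
        if score >= threshold:
            flagged.append((i, j, round(score, 3)))
    return flagged


# Made-up records for illustration only.
records = [
    {"name": "Acme Trading Co"},
    {"name": "ACME Trading Co."},
    {"name": "Apex Holdings"},
    {"name": "Bayview Foods"},
]
likely_duplicates = flag_for_review(records)
```

Blocking keeps the number of comparisons manageable: the record in block "B" is never compared with those in block "A", and only the two "Acme" spellings are flagged for review.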
Methodology. Given that the 2016 Tax data and 2016 Company Registry
data are the only datasets available, this paper covers only deduplication
and compilation. The resulting trends will be validated against randomly generated
datasets. Deduplication is a necessary step for all datasets before
proceeding with record linkage. This ensures that no duplicate
records remain that could produce multiple, equally matched record pairs between
datasets.
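The effect of skipping deduplication can be shown with a small example. The sketch below uses exact matching on a normalized name rather than the probabilistic matching discussed in the paper, purely to keep the illustration short; the helper names and the sample records are hypothetical.

```python
def normalize(name: str) -> str:
    # Hypothetical cleaning: upper-case, drop punctuation, collapse spaces.
    kept = "".join(ch for ch in name.upper() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def link(left: list, right: list) -> list:
    """Pair every left record with every right record of the same normalized name."""
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if normalize(a) == normalize(b)
    ]


def deduplicate(names: list) -> list:
    """Keep the first record seen for each normalized name."""
    seen, unique = set(), []
    for name in names:
        key = normalize(name)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


# Made-up records: the registry lists the same business twice.
tax = ["Acme Trading Co."]
registry = ["ACME TRADING CO", "Acme Trading Co", "Bayview Foods"]

pairs_before = link(tax, registry)              # two equally good matches
pairs_after = link(tax, deduplicate(registry))  # a single match
```

Before deduplication the single tax record matches two registry records equally well, and nothing in the scores says which to keep; after deduplication the link is one-to-one.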
325 | ISI WSC 2019