    Data Assessment. Before any data cleaning can be done, it is imperative to
assess whether the language of the dataset can be understood for data quality
evaluation and whether the dataset format is compatible with the software.
Otherwise, the dataset will have to be translated and/or converted into a
compatible format. Once the dataset is understandable, the next step is to
identify the variables or fields for blocking and record comparison within and
between the datasets. The dataset must contain pertinent variables for data
matching and must be compatible with the chosen software to implement the
succeeding steps of the Record Linkage framework.
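    As a minimal sketch of this assessment step, the dataset can be read into R
and its fields inspected before any cleaning is attempted; the file name and the
checks below are illustrative assumptions, not the actual study data.

    # Minimal sketch of an initial compatibility check; the file name and
    # fields are illustrative assumptions.
    tax <- read.csv("tax_data.csv", stringsAsFactors = FALSE)

    str(tax)      # confirm the file imports and fields are correctly typed
    names(tax)    # check that variables usable for blocking and matching exist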
    Data Cleaning. Raw data typically contain many typographical errors and
variations, missing or out-of-date values, and inconsistent coding schemes.
String comparison algorithms are highly sensitive to these, even to slight
variations like capitalization and one-character differences. While phonetic
comparison algorithms are less sensitive, they can miss the similarity between
two records entirely if a slight misspelling changes the phonetic encoding.
These issues are addressed by applying parsing, standardization, and data
transformation techniques to the dataset.
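    The sketch below illustrates this sensitivity and a basic standardization
pass, using the jarowinkler() string comparator from the RecordLinkage package;
the standardize() helper and the example strings are hypothetical.

    # Jaro-Winkler similarity is case- and punctuation-sensitive, so
    # unstandardized strings score below 1 even for the same entity.
    library(RecordLinkage)

    jarowinkler("DELA CRUZ, JUAN", "dela Cruz Juan")    # reduced similarity

    # standardize() is a hypothetical helper illustrating a basic cleaning pass
    standardize <- function(x) {
      x <- toupper(x)                       # uniform capitalization
      x <- gsub("[[:punct:]]", "", x)       # strip punctuation
      trimws(gsub("\\s+", " ", x))          # collapse repeated whitespace
    }

    jarowinkler(standardize("DELA CRUZ, JUAN"), standardize("dela Cruz Juan"))  # 1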
    Deduplication within each Dataset and Record Linkage between Datasets.
An additional data cleaning step is to remove redundant records, or duplicates,
within each dataset. This mitigates the chances of having duplicated record
pairs upon record linkage and enables smoother data integration into a central
database. Clerical review and classic data cleaning techniques, such as smart
functions, filtering, and conditional formatting, can be used to perform
deduplication. Another option is to use the RecordLinkage package in R, since
the process aims to identify similar strings. The first step is to identify a
categorical variable that can group the records into broad subsets. The
similarity of records within the same subset is then quantified using a string
comparator. By computing EM (expectation-maximization) weights, the software
produces a ranking of record pairs based on their likelihood of being matched
pairs. Record linkage between datasets follows the same steps as deduplication
but is implemented between two different data sources.
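    A minimal sketch of these steps with the RecordLinkage package is shown
below. The data frames tax and registry (assumed to share the same column
layout), the choice of blocking column, and the classification thresholds are
assumptions for illustration, not the study's actual settings.

    library(RecordLinkage)

    # Deduplication within one dataset: block on a categorical variable
    # (column 1 here, by assumption) and score the remaining fields with
    # the Jaro-Winkler string comparator.
    dedup_pairs <- compare.dedup(tax,
                                 blockfld  = 1,
                                 strcmp    = TRUE,
                                 strcmpfun = jarowinkler)
    dedup_pairs <- emWeights(dedup_pairs)    # EM-based matching weights
    summary(dedup_pairs)                     # inspect the weight distribution

    # Record linkage between two datasets follows the same steps,
    # applied across the two data sources.
    link_pairs <- compare.linkage(tax, registry,
                                  blockfld  = 1,
                                  strcmp    = TRUE,
                                  strcmpfun = jarowinkler)
    link_pairs <- emWeights(link_pairs)

    # Classify pairs into links / possible links / non-links;
    # the thresholds are illustrative and must be tuned per dataset.
    link_result <- emClassify(link_pairs, threshold.upper = 11, threshold.lower = 5)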
    Extract and Review Record Pairs. The results of deduplication and record
linkage can be exported from R and reviewed by the researcher. While the
algorithm can shortlist potential record matches and eliminate record-by-record
clerical review, accurate validation of record matches still requires a
subject-matter expert familiar with the datasets.
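    Continuing the hypothetical objects from the sketch above, the candidate
pairs can be extracted and written to a file for review; the output file name
is an assumption.

    # Extract the pairs classified as links (one row per pair) and export
    # them for clerical review by a subject-matter expert.
    links <- getPairs(link_result, show = "links", single.rows = TRUE)
    write.csv(links, "candidate_matches.csv", row.names = FALSE)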

3. Results
    Data Assessment. The framework was tested on Tax data and Company Registry
data. Both are in English and in a format that can be imported into RStudio,
indicating that no translation or file conversion is needed before data
preprocessing.
