Page 269 - Special Topic Session (STS) - Volume 2
STS493 Irene S.
methodologies are developed to use several (new) sources to distinguish, for
example, enterprises involved in the internet economy (e.g. Oostrom et al., 2016).
In order to be useful for statistical purposes, as for many other Big Data
investigations, the data needs to be linked to statistical information. In this case,
the characteristics of websites needed to be linked to the businesses behind
the website. Two key pieces of information were therefore used: the websites
as recorded in the SBR, and the businesses’ CoC registration numbers as
published on the websites. These identifiers provide the basis on which
websites can be linked to their respective businesses. Once a website has been
successfully linked, the SBR facilitates further links to a variety of data sources
available at CBS (see 2.2). Clearly, the future lies in extensively linking a
multitude of data sources, in which the international dimension in characterizing
enterprises (including small and medium-sized ones) remains as important as ever. As Timothy
Sturgeon (2013) stated: “Clearly, the assumptions behind current data regimes
have changed and statistical systems are struggling to catch up. While it will be
exceedingly difficult to fill data gaps without new data, and progress that relies
only on existing data resources will always be limited, the most efficient
approach will be to develop systematic links between key existing data,
supplemented with a few additional variables, with data on enterprise
characteristics drawn from administrative sources, all tied together by
enterprise identifiers that make ownership clear, even when it extends across
borders.”
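The website-to-business linkage described above can be illustrated with a minimal sketch. It is not the procedure used at CBS. It only assumes that each scraped website yields a URL and, where published, a CoC number, and that each SBR record holds a CoC number and a recorded website. The field names (`coc_number`, `website`, `url`, `coc_on_page`), the helper functions and the sample records are all hypothetical.

```python
from urllib.parse import urlsplit


def normalize_domain(url):
    """Reduce a URL to a bare lowercase host name, without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only fills in the host when a scheme is present
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def link_websites(scraped_sites, sbr_records):
    """Link each scraped site to an SBR record: CoC number first, then domain."""
    by_coc = {r["coc_number"]: r for r in sbr_records}
    by_domain = {
        normalize_domain(r["website"]): r for r in sbr_records if r.get("website")
    }
    links = []
    for site in scraped_sites:
        record, method = None, "unlinked"
        coc = site.get("coc_on_page")
        if coc in by_coc:
            # Identifier published on the website matches a registered business
            record, method = by_coc[coc], "coc_number"
        else:
            # Fall back on the website as recorded in the register
            domain = normalize_domain(site["url"])
            if domain in by_domain:
                record, method = by_domain[domain], "sbr_website"
        links.append((site["url"], record["coc_number"] if record else None, method))
    return links


# Hypothetical sample data, for illustration only
sbr = [
    {"coc_number": "12345678", "website": "www.bakery.example"},
    {"coc_number": "87654321", "website": "https://bikes.example/shop"},
]
scraped = [
    {"url": "https://bakery.example/contact", "coc_on_page": None},
    {"url": "http://new-bikes-domain.example", "coc_on_page": "87654321"},
    {"url": "http://unknown.example", "coc_on_page": None},
]

links = link_websites(scraped, sbr)
for url, coc_number, method in links:
    print(url, coc_number, method)
```

In this sketch the CoC number takes precedence over the domain, on the assumption that a published registration number is the more reliable identifier. Once a website carries a CoC number, it can be joined to any other source keyed on the same business identifier.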
2.5 Methodology on combining data
In our ambition to increase our statistical output, an important prerequisite
for CBS is to make as much data, and as diverse data, as possible available. CBS does
this by combining administrative data from registers, registrations, Big Data (e.g.
sensor data), private data and survey data. An adequate combination of
sources can be decisive for the outcome, which implies that the approach and
way of working need to be adjusted. In fact, when combining multiple data
sources from multiple modes, the challenge is to develop methodology that
helps to deal with issues concerning the matching of data sources. Specific issues
that can occur when matching various sources are: the units to be matched do not
equal the source units (persons, businesses); the sources do not contain overlapping
units, yet one wants to estimate the correlation between variables
occurring in both sources; matching errors result in bias of an estimator, for which
one wants to correct; and the coherence between statistics has to be assured. General
techniques such as probabilistic matching, matching with supervised machine
learning and synthetic matching are extended and/or combined to solve these
issues. In addition, combining registers and survey data comes with its own specific
challenges: variables can occur in multiple sources with different measurement
errors, for which methods are developed to arrive at consistent estimates. When
258 | ISI WSC 2019