Page 88 - Invited Paper Session (IPS) - Volume 1
IPS98 Luciana D. V. et al.
schema mapping, record linkage and data fusion and identify a range of open
problems in this research area. Chakraborty et al. (2015) define a novel
approach to integrate diverse data types, such as historic data, survey data,
management planning data, expert knowledge and incomplete data, by
converting data into Bayesian probability forms. Dalla Valle (2014 and 2017a)
and Dalla Valle and Kenett (2015) introduced an innovative approach to
integrate survey data with official statistics data based on calibration using
copulas and nonparametric Bayesian Networks (BNs). For an overview of
copulas and their applications to finance, see Dalla Valle (2017b and 2017c)
and references therein. For an introduction to BNs see, for example, Pearl
(2009), Jensen (2001), Ben Gal (2007), Koski and Noble (2009) and Pourret et
al. (2008). In this paper, we propose a novel methodology that calibrates social
media information with online review data via resampling and performs
integration using BNs. This approach allows businesses and organizations to
correctly analyze the sentiments of online users on social media, facilitating an
accurate evaluation of the satisfaction of their customers. Such an integration,
combining different overlapping data sources, enhances the information
quality of the data analytic work in four dimensions: Data Structure, Data
Integration, Temporal Relevance, and Chronology of Data and Goal (Kenett
and Shmueli, 2016).
2. Methodology
The methodology proposed in this paper aims at achieving data
integration of traditional customer satisfaction survey data with social media
data via resampling using BNs, expanding the approach presented in Dalla
Valle and Kenett (2015). We perform data integration with an emphasis on
blog-type data, a typical big data source. However, our approach is
scalable to other social media and big data sources. As mentioned above,
properly handling data integration is a key dimension in achieving high
information quality (Kenett and Shmueli, 2016). The proposed data integration
methodology aggregates customer survey data with information extracted
from social media, performing calibration of different datasets. The idea is in
the same spirit as external benchmarking used in small area estimation
(Pfeffermann, 2013). In small area estimation, benchmarking robustifies the
inference by forcing the model-based predictors to agree with a design-based
estimator. Similarly, our methodology is based on qualitative data calibration
performed via resampling, where the variable levels are balanced and
customer survey estimates are updated to agree with more timely social media
data estimates. Calibration is implemented by altering the class distribution
of customers’ reviews in one of the datasets to obtain a re-balanced sample,
which reflects the distribution of the second dataset.
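
The re-balancing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that calibration amounts to sampling survey records with replacement within each level of a single categorical variable, so that the resampled level proportions match those observed in the social media data. The function names (`class_proportions`, `rebalance`), the `satisfaction` variable and the toy data are hypothetical and serve only as an illustration.

```python
import random
from collections import Counter


def class_proportions(labels):
    """Return the relative frequency of each level in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {level: n / total for level, n in counts.items()}


def rebalance(records, key, target_props, n_out, seed=0):
    """Resample records with replacement so that the levels of `key`
    follow the proportions in `target_props` (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the records by the level of the calibration variable.
    by_level = {}
    for rec in records:
        by_level.setdefault(rec[key], []).append(rec)

    resampled = []
    for level, prop in target_props.items():
        pool = by_level.get(level)
        if not pool:
            continue  # a level absent from `records` cannot be resampled
        k = round(prop * n_out)  # number of draws for this level
        resampled.extend(rng.choices(pool, k=k))

    rng.shuffle(resampled)
    return resampled


# Toy survey data (assumed): satisfaction skewed towards "high".
levels = ["low"] * 2 + ["medium"] * 3 + ["high"] * 5
survey = [{"id": i, "satisfaction": s} for i, s in enumerate(levels)]

# Toy social media sentiment labels (assumed): skewed towards "low".
social_labels = ["low"] * 4 + ["medium"] * 4 + ["high"] * 2

target = class_proportions(social_labels)
calibrated = rebalance(survey, "satisfaction", target, n_out=len(survey))
print(Counter(rec["satisfaction"] for rec in calibrated))
```

In this sketch the survey sample is the dataset whose class distribution is altered, in line with the statement that survey estimates are updated to agree with the more timely social media estimates. Rounding the per-level counts can make the output size differ slightly from `n_out` when the target proportions do not divide evenly.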
ISI WSC 2019