Page 87 - Invited Paper Session (IPS) - Volume 1
IPS98 Luciana D. V. et al.
analytic work in four of the InfoQ dimensions, namely Data Structure, Data
Integration, Temporal Relevance, and Chronology of Data and Goal.
Keywords
Bayesian networks; Big Data; Information quality (InfoQ); Resampling
techniques
1. Introduction
The growing availability of abundant masses of data in every sector,
including business, government and health care, is posing new analytic and
statistical challenges. In recent years, the literature on big data analysis has
advanced significantly. Amongst recent contributions to sentiment
analysis, Stander et al. (2016a) and Stander et al. (2016b) extracted Facebook
data to analyze sentiment scores and voting patterns about the June 2016 EU
referendum in the UK. Zhang et al. (2011) used sentiment analysis techniques
to predict stock market indicators using Twitter data. Asur and Huberman
(2010) predicted box-office movie revenues, performing an analysis of
sentiments from comments posted on social media. However, big data
analysis and social media mining may be challenging. The main issues are
related to the quality of data collected and reported and to the integration of
multiple datasets. In particular, social media big data often contain biased
information, especially online blogs describing opinions and sentiments about
specific products and services. Indeed, online reviews generally include overly
negative comments and feedback, since users tend to feel freer to
express their dissatisfaction online than in other contexts. On the other
hand, traditional reviews generally include overly positive comments, since
people tend not to feel comfortable voicing their opinions in surveys and may
not be completely honest about their discontent. In both cases, the levels of
the variables expressing customers’ views are (sometimes strongly)
unbalanced, preventing a correct evaluation of customer satisfaction. In
handling these challenges, data integration is key, especially where data come
in both structured and unstructured formats and need to be integrated from
disparate sources stored in systems managed by different departments. In
most cases, the efficient aggregation and correlation of multiple large
datasets can be very complex (Daniel, 2015). Foresti et al.
(2012) agree that data aggregation from multiple information sources is key
for decision-makers and describe a regression-based data integration
methodology applied to public and private financial databases. Dalla Valle
(2016) illustrates a different approach for blending information from official
statistics and organizational data, based on a generalization of Heckman’s
method in which inference is performed within the Bayesian framework.
Dong and Srivastava (2015) describe the big data integration techniques of
76 | ISI WSC 2019