Page 87 - Invited Paper Session (IPS) - Volume 1

IPS98 Luciana D. V. et al.
            analytic work in four of the InfoQ dimensions, namely Data Structure, Data
            Integration, Temporal Relevance, and Chronology of Data and Goal.

            Keywords
            Bayesian networks; Big Data; Information quality (InfoQ); Resampling
            techniques

            1.  Introduction
                The growing availability of abundant masses of data in every sector,
            including business, government and health care, is posing new analytic and
            statistical challenges. In recent years, the literature on big data
            analysis has advanced significantly. Amongst recent contributions to sentiment
            analysis, Stander et al. (2016a) and Stander et al. (2016b) extracted Facebook
            data to analyze sentiment scores and voting patterns about the June 2016 EU
            referendum in the UK. Zhang et al. (2011) used sentiment analysis techniques
            to predict stock market indicators using Twitter data. Asur and Huberman
            (2010) predicted box-office movie revenues by analyzing the
            sentiment of comments posted on social media. However, big data
            analysis and social media mining can be challenging. The main issues are
            related to the quality of data collected and reported and to the integration of
            multiple datasets. In particular, social media big data often contain biased
            information, especially online blogs describing opinions and sentiments about
            specific products and services. Indeed, online reviews generally include overly
            negative comments and feedback, since users tend to feel freer to
            express their dissatisfaction online than in other contexts. On the other
            hand, traditional reviews generally include overly positive comments, since
            people tend not to feel comfortable voicing their opinions in surveys and may
            not be completely honest about their discontent. In both cases, the levels of
            the variables expressing customers’ views are (sometimes strongly)
            unbalanced, preventing a correct evaluation of customer satisfaction. In
            handling these challenges, data integration is key, especially where data come
            in both structured and unstructured formats and need to be integrated from
            disparate sources stored in systems managed by different departments. In
            most cases, the efficient aggregation and correlation of multiple datasets of
            considerable size can be very complex (Daniel, 2015). Foresti et al.
            (2012) agree that data aggregation from multiple information sources is key
            for decision-makers and describe a regression-based data integration
            methodology applied to public and private financial databases. Dalla Valle
            (2016) illustrates a different approach for blending information from official
            statistics and organizational data, based on a generalization of Heckman’s
            method in which inference is performed within the Bayesian framework.
            Dong and Srivastava (2015) describe the big data integration techniques of

                                                               76 | ISI WSC 2019