Page 51 - Contributed Paper Session (CPS) - Volume 6

CPS1490 Nehall Ahmed Farouk Mohamed
            of the outcome variables. The two main techniques based on the
            methodology are regression techniques and machine learning (ML)
            techniques. The techniques based on the type of outcome variable
            depend on whether the variable is continuous or discrete, for example
            linear regression and random forest, respectively.
               Several characteristics of big data need investigation when
            dealing with predictive analytics. The first is heterogeneity,
            which arises from different data sources or different
            populations. It can be overcome by exploiting the huge size of the
            data, which might almost represent the population, and by developing
            sophisticated techniques. In this author's view, one way to
            overcome heterogeneity is to divide the huge data set into strata
            and to create a pattern for each stratum. Because the
            characteristics of the target groups within each stratum are similar,
            each stratum will be homogeneous. The second characteristic is error accumulation, where
            simultaneous estimation of many patterns occurs. As a consequence, some
            parameters that affect the model might be treated as accumulated
            error. This might be overcome by dealing with each pattern
            separately before the simultaneous estimation, in order to
            identify the significant variables for each model, although this might be quite
            difficult. The third characteristic is spurious correlation, where
            independent variables appear to be correlated as the size of the data
            increases (Fan and Lv 2008). However, classifying each stream of data
            (granularity) and analysing each class separately might
            solve this. For example, the analysis of traffic loop detection data worked on
            data grouped by time of day: a crowded hour in the
            morning or evening was assumed to have the same characteristics and
            analysis (Daas et al. 2015). The fourth is incidental
            endogeneity, which “refers to a genuine relationship between variables
            and the error term” (Gandomi and Haider 2015).
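
            The stratum-based approach to heterogeneity described above can be
            illustrated with a minimal sketch. It is not the author's
            implementation: the function names (fit_line, fit_per_stratum), the
            stratum labels and the toy data are all hypothetical, and a simple
            least-squares line stands in for whatever "pattern" would be fitted
            in practice.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_stratum(records):
    """Group (stratum, x, y) records by stratum and fit one line per group."""
    strata = defaultdict(list)
    for stratum, x, y in records:
        strata[stratum].append((x, y))
    return {name: fit_line(pts) for name, pts in strata.items()}


# Hypothetical toy data: two populations with opposite trends.
records = (
    [("urban", x, 2.0 * x + 1.0) for x in range(5)]
    + [("rural", x, -1.0 * x + 10.0) for x in range(5)]
)

models = fit_per_stratum(records)                    # one pattern per stratum
pooled = fit_line([(x, y) for _, x, y in records])   # strata ignored

for name, (intercept, slope) in models.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.2f} x")
print(f"pooled: y = {pooled[0]:.2f} + {pooled[1]:.2f} x")
```

            In this toy case the pooled fit yields a slope that describes
            neither population, while each per-stratum fit recovers the trend
            of its own group. How the strata are defined is left open here, as
            it is in the text.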
            2.3. Section 3: Machine learning and big data predictive analytics
               These days NSOs and governments do not think only about the current
            picture of the situation in different fields; there is also a strong trend towards
            predicting the future. Predicting the future is important, whether it is based on
            structured or unstructured data, big data or data from statistics. The
            previous section presented the main challenges in big data predictive
            analytics and suggested different ways to overcome them. Briefly,
            the ways to overcome these challenges are: using many patterns,
            granularity, and parallelism. Going in depth in using machine learning in big data


                                                                40 | I S I   W S C   2 0 1 9