Page 252 - Contributed Paper Session (CPS) - Volume 8

CPS2274 Nadiah M. et al.
                  characteristics of the general population (Kontaki et al., 2016). One of the most
                  widely used definitions of an outlier is the distance-based one: an object x is
                  considered an outlier if there are fewer than k objects within a distance of at
                  most R from x, excluding x itself. Otherwise, x is characterized as an inlier.
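
                  The distance-based definition above can be sketched in a few lines of Python.
                  This is only an illustration: the function names and the sample points are
                  hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import math

def count_neighbours(x, others, R):
    """Count the objects in `others` that lie within Euclidean distance R of x."""
    return sum(1 for y in others if math.dist(x, y) <= R)

def is_outlier(index, points, k, R):
    """True if points[index] has fewer than k neighbours within R, excluding itself."""
    x = points[index]
    others = points[:index] + points[index + 1:]  # exclude x itself
    return count_neighbours(x, others, R) < k

# Hypothetical 2-D points, for illustration only (not data from the paper)
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
labels = ["outlier" if is_outlier(i, points, k=3, R=4) else "inlier"
          for i in range(len(points))]
print(labels)
```

                  With k=3 and R=4, the four clustered points each have three neighbours and are
                  inliers, while the isolated point has none and is an outlier.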
                      Kontaki et al. (2016) stated that the fundamental characteristic of the
                  majority of the proposed algorithms is that they operate in a static fashion:
                  the algorithm must be executed from scratch whenever the underlying data
                  objects change, which degrades performance when updates are frequent.
                  Kontaki et al. (2016) focus on the sliding window method, which is one of
                  several streaming techniques. Since the stream is continuously updated with
                  fresh data, it is impossible to keep all of it in main memory. Therefore, a
                  window keeps track of the most recent data, and all mining tasks are
                  performed on what is “visible” through the window. As reported in Gupta et
                  al. (2013), most window-based models are currently offline. The most relevant
                  research works are Angiulli & Fassetti (2007) and Yang, Rundensteiner, & Ward
                  (2009), both of which considered the problem of continuous outlier detection
                  in window-based data streams, without limiting their techniques to
                  multi-dimensional data. However, both methods still have some serious
                  limitations.
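
                  The window mechanism described above can be sketched as follows, assuming a
                  count-based window that holds the W most recent objects and slides by one
                  object at a time. The function name and the stream of object ids are
                  hypothetical.

```python
from collections import deque

def sliding_windows(stream, W):
    """Yield the contents of a count-based window of size W, sliding by one object."""
    window = deque(maxlen=W)  # the oldest object is dropped automatically once full
    for obj in stream:
        window.append(obj)
        if len(window) == W:
            yield list(window)  # snapshot of what is currently "visible"

# Hypothetical stream of 12 object ids
stream = range(1, 13)
windows = list(sliding_windows(stream, W=10))
print(windows[0])  # objects 1 to 10
print(windows[1])  # objects 2 to 11
```

                  Only W objects are held in memory at any time, which is the point of the
                  window: older objects expire as new ones arrive.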

                  3.  Result
                      In this research, we use water quality data that provide information on
                  Dissolved Oxygen (DO) and Biochemical Oxygen Demand (BOD). Figure 1 shows
                  the steps used to identify each DO and BOD observation as an outlier or an
                  inlier. We use the Euclidean distance formula, computed in R software, to
                  find the distances between the data points and to identify the outliers and
                  inliers. The result may vary depending on the number of members within the
                  window (W), the radius (R) and the number of neighbours (k). The values k=3
                  and R=4 are used, based on Kontaki et al. (2016), and W is set to 10. Figure
                  2 shows an example of a 1-sliding window on a probabilistic data stream for
                  windows 1 and 2. Table 1 shows the result for each window. In window 1,
                  point 4 is not a safe inlier because it becomes an outlier in window 2.
                  However, all other inliers in window 1 are safe inliers because they remain
                  inliers in window 2. In window 4, point 4 in Figure 3 is an outlier because
                  it has three neighbours. However, in window 5 in Figure 4, point 4 is an
                  inlier because it has five neighbours.
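
                  The window-by-window comparison described above can be illustrated with a
                  minimal Python sketch (the study itself used R software). The (DO, BOD)
                  readings below are hypothetical and are not the study's data. Following the
                  usage in the text, an inlier is reported as persisting if it is still an
                  inlier in the next window. This is a simplification and may differ from the
                  formal safe-inlier definition in Kontaki et al. (2016). The parameters W=10,
                  k=3 and R=4 are taken from the text.

```python
import math

def label_window(window, k, R):
    """Map each point id in the window to 'inlier' or 'outlier'."""
    labels = {}
    for pid, p in window:
        # Neighbours are the other points in the window within Euclidean distance R
        neighbours = sum(
            1 for qid, q in window
            if qid != pid and math.dist(p, q) <= R
        )
        labels[pid] = "outlier" if neighbours < k else "inlier"
    return labels

def persisting_inliers(labels_now, labels_next):
    """Ids that are inliers now and are still inliers in the next window."""
    return sorted(
        pid for pid, lab in labels_now.items()
        if lab == "inlier" and labels_next.get(pid) == "inlier"
    )

# Hypothetical (DO, BOD) readings, ids 1 to 11, for illustration only
readings = [
    (6.0, 2.0), (6.5, 2.5), (7.0, 2.0), (6.2, 3.0),            # ids 1-4
    (3.0, 12.0), (3.5, 12.5), (2.8, 13.0), (3.2, 11.5),        # ids 5-8
    (2.5, 12.2), (3.8, 13.1), (3.1, 12.8),                     # ids 9-11
]
stream = list(enumerate(readings, start=1))

W, k, R = 10, 3, 4
window1 = stream[0:W]        # ids 1 to 10
window2 = stream[1:W + 1]    # ids 2 to 11 (id 1 has expired)

labels1 = label_window(window1, k, R)
labels2 = label_window(window2, k, R)
print(persisting_inliers(labels1, labels2))
print(sorted(pid for pid, lab in labels2.items() if lab == "outlier"))
```

                  In this made-up example, points 2 to 4 are inliers in the first window but
                  lose a neighbour when point 1 expires, so they become outliers in the second
                  window, while points 5 to 10 remain inliers.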













                                                                     241 | ISI WSC 2019