Page 83 - Contributed Paper Session (CPS) - Volume 5

CPS1060 Taik Guan T. et al.
            problem for the poor and unreliable performance of machine learning and
            data mining techniques [4]. This paper proposes using GA to generate more
            training data, with the model obtained from the curve-fitting software
            serving as the fitness function. GA can generate a large number of virtual
            samples when only a small amount of data is available for machine learning
            [5]. The training data is used as input for the SVM to learn the patterns in
            the data. The ability to identify these patterns allows better understanding
            and optimization of the learning process [6]. The SVM model is often
            preferred for its high computational efficiency and good generalization
            theory, which prevents over-fitting through the control of hyperplane
            margins and Structural Risk Minimization [7].
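
            For reference, the margin control mentioned above is commonly written as
            the standard soft-margin SVM problem. The notation is textbook notation
            and an assumption here, not taken from this paper: x_i are feature vectors,
            y_i in {-1, +1} are class labels, xi_i are slack variables, and C is a
            user-chosen trade-off parameter. Minimizing the norm of w widens the
            margin 2/||w||, while the slack term penalizes training errors.

```latex
\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1,\dots,n
\end{aligned}
\]
```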

            2.  Methodology
                This paper proposes a research methodology that is implemented in
            four stages: data collection, machine learning, machine testing and machine
            application.
            a.  Data Collection
                There are four steps to collect the data required in this stage:
               a)  Defining data: Firstly, feature sets are defined, and the feature response
                   is labeled. The geographical features obtained from the World Bank’s
                   database and the Department of Statistics Malaysia (DOSM) are
                   categorized according to the PESTEL model. PESTEL denotes political,
                   economic, social, technological, environmental and legal.
               b)  Collecting data: Secondly, data is collected according to the feature sets
                   and the predefined feature response. The feature sets are the PESTEL
                   factors, whereas the feature response is the socioeconomic status in
                   response to those factors. For each PESTEL category, the relevant
                   features of each country are extracted from the World Bank’s National
                   Accounts data. The country’s GNI per capita is taken as the response to
                   the geographical features.
               c)  Generating data correlation: Thirdly, the collected data and labeled
                   response are provided to curve-fitting software to build an empirical
                   model, which serves as a fitness function that correlates the selected
                   features to the labeled response. Curve-fitting is a form of statistical
                   modeling. The Design Expert (DEX) curve-fitting software by Stat-Ease
                   Inc. is used to implement this step. The curve-fitting modeling in DEX
                   is based on Analysis of Variance (ANOVA). During implementation, DEX
                   generalizes the collected data.
               d)  Simulating large virtual data for machine training: Lastly, the equation
                   generated by DEX is used as the fitness function for GA to simulate
                   more data, which is then used to train the SVM.
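
            Step d) can be sketched as follows. This is a minimal illustration under
            stated assumptions, not the authors' implementation: the function
            `fitness` is a hypothetical stand-in for the equation exported from DEX
            (its coefficients are invented for the example), the two features are
            assumed to be normalized to [0, 1], and the GA operators (tournament
            selection, blend crossover, Gaussian mutation) are common choices that the
            paper does not specify. Every individual evaluated by the GA is recorded,
            together with its fitness value, as one virtual sample.

```python
import random

# Hypothetical stand-in for the empirical model produced by the curve-fitting step.
# The coefficients are illustrative only; they are not values from the paper.
def fitness(x):
    x1, x2 = x
    return 2.0 + 1.5 * x1 - 0.8 * x2 + 0.3 * x1 * x2 - 0.5 * x1 ** 2

# Assumed (normalized) range of each feature.
BOUNDS = [(0.0, 1.0), (0.0, 1.0)]

def clip(value, low, high):
    return max(low, min(high, value))

def random_individual(rng):
    return [rng.uniform(low, high) for low, high in BOUNDS]

def tournament(population, scores, rng, k=3):
    # Pick k random individuals and return the one with the highest fitness.
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: scores[i])
    return population[best]

def crossover(a, b, rng):
    # Blend crossover: a random convex combination of the two parents.
    w = rng.random()
    return [w * ai + (1 - w) * bi for ai, bi in zip(a, b)]

def mutate(individual, rng, rate=0.2, sigma=0.1):
    # Perturb each gene with probability `rate`, keeping it inside its bounds.
    return [
        clip(gene + rng.gauss(0, sigma), low, high) if rng.random() < rate else gene
        for gene, (low, high) in zip(individual, BOUNDS)
    ]

def generate_virtual_samples(pop_size=20, generations=10, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    samples = []
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        # Record every evaluated individual as a (features, response) pair.
        samples.extend((tuple(ind), s) for ind, s in zip(population, scores))
        population = [
            mutate(
                crossover(
                    tournament(population, scores, rng),
                    tournament(population, scores, rng),
                    rng,
                ),
                rng,
            )
            for _ in range(pop_size)
        ]
    return samples

virtual_samples = generate_virtual_samples()
```

            With these defaults the sketch yields 200 (features, response) pairs.
            Such pairs could then be passed to any SVM implementation as training
            data; the choice of library is left open here.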


                                                                72 | I S I   W S C   2 0 1 9