Page 83 - Contributed Paper Session (CPS) - Volume 5
CPS1060 Taik Guan T. et al.
problem for the poor and unreliable performance of machine learning and
data mining techniques [4]. This paper proposes using GA to generate more
training data, with the model obtained from curve-fitting software serving as
the GA fitness function. GA can generate a large number of virtual samples when
only a small amount of data is available for machine learning [5]. The training
data is used as input for the SVM to learn the patterns in the data. The ability to identify the
patterns of the data allows better understanding and optimization of the
learning process [6]. The SVM model is often preferred due to its high
computational efficiency and good generalization theory, which prevents
over-fitting through the control of hyperplane margins and Structural Risk
Minimization [7].
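For reference, the margin control mentioned above is usually written as the soft-margin SVM problem. The paper does not state its own formulation, so the textbook classification form is assumed here: the symbols $w$, $b$, $\xi_i$ and $C$, and the labels $y_i \in \{-1, +1\}$, are not taken from the source.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n.
```

Minimizing $\lVert w\rVert^{2}$ widens the margin $2/\lVert w\rVert$, while $C$ sets how strongly training errors $\xi_i$ are penalized. This trade-off is the mechanism by which over-fitting is limited.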
2. Methodology
The proposed research methodology is implemented in four stages: data
collection, machine learning, machine testing and machine
application.
a. Data Collection
There are four steps to collect the data required in this stage:
a) Defining data: Firstly, feature sets are defined, and the feature response
is labeled. The geographical features obtained from the World Bank’s
database and the Department of Statistics Malaysia (DOSM) are
categorized according to the PESTEL model. PESTEL denotes political,
economic, social, technological, environmental and legal factors.
b) Collecting data: Secondly, data is collected according to the feature sets and
predefined feature response. The feature sets are the PESTEL factors, whereas
the feature response is the socioeconomic status in response to those
factors. For each PESTEL category, the relevant features of each country
are extracted from the World Bank’s National Accounts data. The
country’s GNI per capita is taken as the response to the geographical
features.
c) Generating data correlation: Thirdly, the collected data and labeled
response are provided to curve-fitting software to build an empirical
model, which serves as a fitness function that correlates the selected
features to the labeled response. Curve fitting is a form of statistical modeling.
The Design Expert (DEX) curve-fitting software by Stat-Ease Inc. is used
to implement this step. The curve-fitting modeling in DEX is based on
Analysis of Variance (ANOVA). During the implementation, DEX
generalizes the collected data.
d) Simulating large virtual data for machine training: Lastly, the equation
generated by DEX is used as the fitness function for GA to simulate
additional data for training the SVM.
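Steps c) and d) can be illustrated with a minimal sketch, written under several assumptions. The paper does not give the DEX equation or the GA settings. The feature names, bounds, coefficients, population size and GA operators below are therefore all hypothetical, and `fitness` is only a stand-in for the fitted equation. The sketch also assumes that every individual evaluated by the GA is recorded as one virtual sample, which is one possible reading of step d).

```python
import random

# Hypothetical PESTEL-style features with assumed (min, max) bounds.
# Names and ranges are illustrative only; they are not from the paper.
BOUNDS = {
    "political_stability": (-2.5, 2.5),
    "inflation_pct": (0.0, 20.0),
    "life_expectancy": (50.0, 85.0),
}
NAMES = list(BOUNDS)


def fitness(x):
    """Stand-in for the empirical equation produced by curve fitting.

    The coefficients are made up. Returns a predicted response
    (for example GNI per capita) for the feature vector x.
    """
    political, inflation, life = x
    return (2000.0 + 3000.0 * political - 150.0 * inflation
            + 120.0 * life + 40.0 * political * life)


def random_individual(rng):
    # Draw each feature uniformly inside its assumed bounds.
    return [rng.uniform(*BOUNDS[name]) for name in NAMES]


def crossover(a, b, rng):
    # Uniform crossover: each gene comes from either parent with equal chance.
    return [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]


def mutate(x, rng, rate=0.2, scale=0.1):
    # Gaussian mutation, clipped so every feature stays inside its bounds.
    child = []
    for value, name in zip(x, NAMES):
        low, high = BOUNDS[name]
        if rng.random() < rate:
            value += rng.gauss(0.0, scale * (high - low))
        child.append(max(low, min(high, value)))
    return child


def tournament(pop, scores, rng, k=3):
    # Pick k random individuals and return the one with the highest fitness.
    best = max(rng.sample(range(len(pop)), k), key=lambda idx: scores[idx])
    return pop[best]


def generate_virtual_samples(pop_size=30, generations=10, seed=0):
    """Run a simple GA and record every evaluated individual as a sample."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    samples = []
    for _ in range(generations):
        scores = [fitness(x) for x in pop]
        samples.extend((tuple(x), s) for x, s in zip(pop, scores))
        pop = [
            mutate(crossover(tournament(pop, scores, rng),
                             tournament(pop, scores, rng), rng), rng)
            for _ in range(pop_size)
        ]
    return samples


virtual = generate_virtual_samples()
```

Each entry of `virtual` pairs a feature vector with its predicted response and could serve as one training row for the SVM. Because selection favors high fitness, the samples drift toward high predicted responses over the generations. Whether that bias is acceptable depends on how the authors actually draw samples from the GA, which the text does not specify.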
72 | ISI WSC 2019