Page 360 - Special Topic Session (STS) - Volume 2
STS500 Neo S.K. et al.
instance, in the selected classifier, the word ‘Shop’ had a feature importance
score of 0.044, more than seven times the average feature importance score of
0.006. The word ‘Shop’ was therefore far more relevant to the classification
than a feature of average importance. Scores of this kind gave a summary
insight into the classifier’s predictions. Feature words with notable feature
importance are highlighted in Table 3.
Finally, the Random Forest Classifier was trained on the training dataset and
then applied to the enterprise URLs to assign each of them to one of the
internet usage categories (B1, B2, C1 or C2).
Table 2: Results of algorithms explored
Algorithm                    Test Set Accuracy
Random Forest                79%
Gradient Boosting Machine    77%
Voting Classifier            77%
Logistic Regression          72%
Neural Network               71%
AdaBoost                     70%
Support Vector Machine       68%
Naïve Bayes (Baseline)       57%
Table 3: Feature importance of selected words
Feature Words    Feature Importance
Shop             0.044
Cart             0.041
Price            0.027
Facebook         0.021
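The comparison with the average importance score can be checked with a short calculation. The sketch below uses only the scores in Table 3 and the average of 0.006 quoted in the text. The function and variable names are illustrative and do not come from the paper.

```python
# Importance scores copied from Table 3; the average is the 0.006 quoted in the text.
AVERAGE_IMPORTANCE = 0.006
table3 = {"Shop": 0.044, "Cart": 0.041, "Price": 0.027, "Facebook": 0.021}


def ratio_to_average(score, average=AVERAGE_IMPORTANCE):
    """Return how many times larger a score is than the average score."""
    return score / average


# Ratio of each highlighted word's importance to the average importance.
ratios = {word: ratio_to_average(score) for word, score in table3.items()}

for word, ratio in ratios.items():
    print(f"{word}: {ratio:.1f}x the average")
```

For ‘Shop’ the ratio is 0.044 / 0.006, or roughly 7.3, which matches the "more than seven times" figure in the text.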
4. Results
Of the enterprise URLs obtained to date, 14% point to websites that generate
income directly online (Figure 2). One caveat is that the enterprises
classified under ‘Category C1/C2: Income generated directly online’ might not
generate their income wholly online; the online platform could be one of
several revenue streams.
349 | ISI WSC 2019