Page 174 - Contributed Paper Session (CPS) - Volume 3
CPS1983 Chong N. et al.
Diagram 3: Process flow of Data Cleaning
Subsequently, feature extraction is used to convert the textual data into
vectors. Term frequency-inverse document frequency (tf-idf) is an information
retrieval model used to weigh the relevance of a term to a document
(Manning, Raghavan & Schütze, 2008).
The term frequency (tf) of a term $t$ in a document $d$ is defined by the
frequency of occurrence $f_{t,d}$ of the term in the document as
$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}.$$
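The term frequency can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes whitespace tokenisation, and the function name `term_frequency` and the sample job title are hypothetical.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty document has no term frequencies
    return {term: n / total for term, n in counts.items()}

# Hypothetical document: "account" occurs 2 times out of 4 tokens.
tf = term_frequency("senior account executive account".split())
print(tf["account"])  # 0.5
```

Because each count is divided by the document length, the values for one document sum to 1.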
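The inverse document frequency (idf) of a term $t$ is defined as
$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t},$$
where $N$ is the total number of documents in the collection and
$\mathrm{df}_t$ is the number of documents that contain $t$.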
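As a rough illustration of the idf, the sketch below counts the number of documents containing each term and applies the logarithm. The natural logarithm is an assumption, since the base is not stated here, and the function name and the three sample documents are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Return {term: log(N / df_t)} for a list of tokenised documents."""
    n_docs = len(documents)
    df = {}
    for tokens in documents:
        for term in set(tokens):  # count each document at most once per term
            df[term] = df.get(term, 0) + 1
    # Assumption: natural logarithm; another base only rescales the weights.
    return {term: math.log(n_docs / count) for term, count in df.items()}

# Hypothetical collection: "senior" is in 2 of 3 documents, "chef" in 1 of 3.
docs = [["senior", "accountant"], ["senior", "chef"], ["waitress"]]
idf = inverse_document_frequency(docs)
print(idf["chef"] > idf["senior"])  # True: the rarer term gets the larger idf
```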
The idf ensures that the weight of rare terms increases while it decreases for
commonly used terms (Manning et al., 2008). The tf-idf weighting gives the
term $t$ a weight in document $d$ by
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t.$$
Past data that has been labelled with SSOC is used to train the model.
Cosine similarity is then used to score and rank the relevance between two
documents $d_1$ and $d_2$. To offset the effect of differences in document
length, the cosine similarity of their vector representations $\vec{V}(d_1)$
and $\vec{V}(d_2)$ is used for computation:
$$\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}.$$
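The length normalisation in the cosine similarity can be seen in a small sketch. It assumes sparse vectors stored as `{term: weight}` dictionaries, which is an illustrative choice, and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

short = {"chef": 1.0, "pastry": 1.0}
longer = {"chef": 3.0, "pastry": 3.0}  # same direction, three times the length
unrelated = {"accountant": 2.0}

print(cosine_similarity(short, longer))     # approximately 1.0
print(cosine_similarity(short, unrelated))  # 0.0, no shared terms
```

Dividing by the two vector lengths means that a document which repeats the same terms more often scores the same as its shorter counterpart.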
The most suitable output is the one with the highest cosine similarity score
for the term.
Job Title     Job Title     SSOC
Accountant    Accountant    24111
Chef          Chef          34341
Waitress      Waitress      51312
Diagram 4: The output based on the highest cosine similarity score.
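The matching step can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' system. The three labelled titles are those shown in Diagram 4, while the helper names, the whitespace tokeniser, the natural logarithm, the handling of unseen terms and the query strings are all hypothetical.

```python
import math
from collections import Counter

# Labelled titles from Diagram 4; a real training set would be far larger.
LABELLED = [("accountant", "24111"), ("chef", "34341"), ("waitress", "51312")]

def tokenize(text):
    return text.lower().split()

def fit_idf(titles):
    """idf_t = log(N / df_t) over the labelled titles (natural log assumed)."""
    df = Counter(term for title in titles for term in set(tokenize(title)))
    return {term: math.log(len(titles) / count) for term, count in df.items()}

def vectorize(text, idf):
    """tf-idf vector; terms unseen in the labelled data are dropped (assumption)."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    total = sum(counts.values())
    return {t: (n / total) * idf[t] for t, n in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def best_match(query, labelled=LABELLED):
    """Return (title, ssoc, score) with the highest cosine similarity, or None."""
    idf = fit_idf([title for title, _ in labelled])
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(title, idf)), title, code)
              for title, code in labelled]
    score, title, code = max(scored)
    return (title, code, score) if score > 0 else None

print(best_match("Senior Chef"))  # matches "chef" with SSOC 34341
print(best_match("Plumber"))      # None: no term in common with the labelled data
```

Returning `None` when every score is zero is a design choice made for this sketch, so that a title with no overlapping terms is not assigned an arbitrary code.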
163 | ISI WSC 2019