Page 174 - Contributed Paper Session (CPS) - Volume 3

CPS1983 Chong N. et al.
Diagram 3: Process flow of Data Cleaning
Subsequently, feature extraction is used to convert the textual data into vectors.
The term frequency-inverse document frequency (tf-idf) is an information
retrieval model used to weigh the relevance of a term in a document
(Manning, Raghavan & Schütze, 2008).

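The term frequency (tf) of a term $t$ in a document $d$ is defined by the
frequency of occurrence of the term in the document as

$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},$$

where $f_{t,d}$ is the number of times $t$ occurs in $d$.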
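As a minimal sketch of the term frequency definition, the snippet below divides each term's count by the total number of tokens in one document. The function name `term_frequencies` and the sample tokens are illustrative assumptions, not taken from the paper, and the document is assumed to be already cleaned and tokenised.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {term: count / total tokens} for one tokenised document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # An empty document has no terms to weigh.
        return {}
    return {term: count / total for term, count in counts.items()}


# Hypothetical tokens from a cleaned job description.
tf = term_frequencies(["senior", "account", "executive", "account"])
```

With this normalisation the frequencies within a document sum to one, so a long document does not get larger tf values simply because it contains more words.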

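The inverse document frequency (idf) of term $t$ is defined as

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t},$$

where $N$ is the total number of documents and $\mathrm{df}_t$ is the number
of documents that contain $t$ (Manning et al., 2008).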
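The inverse document frequency can be sketched in the same way. The snippet assumes the common form log(N / df), where N is the number of documents and df is the number of documents containing the term. The natural logarithm is an assumption here, since the text does not state a base; a different base only rescales every weight by the same constant. The function name and the toy documents are hypothetical.

```python
import math


def inverse_document_frequencies(documents):
    """Return {term: log(N / df)} over a list of tokenised documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Count each term once per document, however often it occurs there.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy collection of three tokenised job titles.
docs = [["account", "executive"], ["account", "manager"], ["head", "chef"]]
idf = inverse_document_frequencies(docs)
```

In this toy collection "chef" appears in one document and "account" in two, so "chef" receives the larger idf.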

The idf ensures that the weight of rare terms increases while it decreases for
commonly used terms (Manning et al., 2008). The tf-idf weighting gives the
term $t$ a weight in document $d$ by

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t.$$
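Putting the two factors together, a tf-idf vector for each document might be computed as in the sketch below. It assumes length-normalised term counts for tf and log(N / df) with the natural logarithm for idf. The function name `tfidf_vectors`, the sparse dictionary representation, and the toy documents are illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: tf * idf} dictionary per tokenised document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy collection of three tokenised job titles.
docs = [["account", "executive"], ["account", "manager"], ["head", "chef"]]
vectors = tfidf_vectors(docs)
```

In the first document both terms have the same tf, but "executive" is rarer across the collection than "account", so it ends up with the larger weight.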

Past data that has been labelled with SSOC is used to train the model.

Cosine similarity is then used to score and rank the relevance between two
documents $d_1$ and $d_2$. To offset the effect of differences in document
length, the cosine similarity of their vector representations $\vec{V}(d_1)$
and $\vec{V}(d_2)$ is used for computation:

$$\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}.$$
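The cosine similarity of two weight vectors can be sketched as follows. The vectors are assumed to be sparse dictionaries mapping terms to weights; that representation, the function name, and the sample vectors are illustrative assumptions. Returning 0.0 when either vector is all zeros is a convention chosen here to avoid dividing by zero.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(weight * weight for weight in v1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical vectors: same direction but different length, and a disjoint one.
short_doc = {"chef": 1.0, "head": 1.0}
long_doc = {"chef": 3.0, "head": 3.0}
other_doc = {"waitress": 2.0}

same_direction = cosine_similarity(short_doc, long_doc)
no_overlap = cosine_similarity(short_doc, other_doc)
```

Because both vectors are divided by their lengths, `short_doc` and `long_doc` score close to 1 even though one has three times the weight of the other, while vectors with no shared terms score 0.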

The most suitable output is the one with the highest cosine similarity for
the term.

Job Title     Job Title     SSOC
Accountant    Accountant    24111
Chef          Chef          34341
Waitress      Waitress      51312

Diagram 4: The output based on the highest cosine similarity score.
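The selection step can be sketched end to end: vectorise the labelled job titles with tf-idf, vectorise the incoming title with the same idf weights, and return the labelled entry with the highest cosine similarity. The three labelled pairs are the ones shown in Diagram 4 and are used only as toy data. The helper names, the whitespace tokeniser, the query "Senior Chef", and the choice to drop query terms not seen in the labelled data are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic stand-in for the data cleaning step: lowercase, split on spaces.
    return text.lower().split()


def fit_idf(token_lists):
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}


def vectorize(tokens, idf):
    # Terms never seen in the labelled data carry no weight and are dropped.
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def best_match(query, labelled):
    """Return (title, code, score) of the labelled entry most similar to query."""
    token_lists = [tokenize(title) for title, _ in labelled]
    idf = fit_idf(token_lists)
    doc_vectors = [vectorize(tokens, idf) for tokens in token_lists]
    query_vector = vectorize(tokenize(query), idf)
    scores = [cosine(query_vector, v) for v in doc_vectors]
    best = max(range(len(labelled)), key=scores.__getitem__)
    title, code = labelled[best]
    return title, code, scores[best]


# Toy labelled data taken from Diagram 4.
labelled = [("Accountant", "24111"), ("Chef", "34341"), ("Waitress", "51312")]
title, code, score = best_match("Senior Chef", labelled)
```

When no labelled title shares a term with the query, every score is 0 and the first entry is returned, so a caller would want to check `score` before trusting the assigned code.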

163 | ISI WSC 2019
