Page 174 - Contributed Paper Session (CPS) - Volume 3
P. 174

CPS1983 Chong N. et al.
                                   Diagram 3: Process flow of Data Cleaning
                  Subsequently, feature extraction is used to convert textual data into vectors.
                  The term frequency-inverse document frequency (tf − idf) is an information
                  retrieval  model  used  to  weigh  the  relevance  of  the  term  in  a  document
                  (Manning, Raghavan & Schütze, 2008).

                  The term frequency (tf) is defined by the frequency of the occurrence of a
                  term  in the document  as
                                                         
                                                tf   =    ,  .
                                                  ,
                                                       ∑  ,
                                                         

                  The inverse document frequency (idf) of term  is defined as

                                                           
                                                idf = log     .
                                                   
                                                            

                  The idf ensures that the weight of rare terms increases while it decreases for
                  commonly used terms (Manning et al., 2008). The tf − idf  weighting gives the

                  term  a weight in document  by

                                           −  ,   =   ×  .
                                                          ,
                                                                    

                  Past data that has been labelled with SSOC is used to train the model.

                  Cosine similarity is then used to score and rank the relevance between two
                  documents   and  . To offset the effect of difference in document length,
                               1
                                      2
                  cosine  similarity  of  their  vector  representation ( ) and ( ) is  used  for
                                                                            ⃗
                                                                 ⃗
                                                                     1
                                                                               2
                  computation
                                          ( ,  ) =   ⃗⃗ ( 1 )∙ ⃗⃗ ( 2 )   .
                                                1
                                                    2
                                                         | ⃗⃗ ( 1  )|| ⃗⃗ ( 2 )|

                  The most suitable output is derived based on the highest cosine similarity for
                  the term

                           Job Title                 Job Title       SSOC
                           Accountant                Accountant      24111
                           Chef                      Chef            34341
                           Waitress                  Waitress        51312

                      Diagram 4: The output based on the highest cosine similarity score.





                                                                     163 | I S I   W S C   2 0 1 9
   169   170   171   172   173   174   175   176   177   178   179