Page 15 - Contributed Paper Session (CPS) - Volume 6

CPS1465 Claude Macchi et al.
            motorcycles", 46 "Wholesale trade, except of motor vehicles and motorcycles"
            and 47 "Retail trade, except of motor vehicles and motorcycles". We can thus
            see how the keywords "workshop", "car body" and "automobile", which
            should identify only position 45, have also been linked with companies
            codified in positions 46 and 47 of the classification.

                             Figure 2: Cross-reference keywords NOGA 45, 46 and 47
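
The kind of cross-reference shown in Figure 2 can be sketched in a few lines. This is a minimal illustration, not NOGAuto's actual implementation: the toy company records and the function names `cross_reference` and `ambiguous_keywords` are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical toy data: (NOGA division, keywords from the activity description).
companies = [
    ("45", ["workshop", "automobile", "repair"]),
    ("45", ["car body", "automobile"]),
    ("46", ["wholesale", "automobile", "parts"]),
    ("47", ["retail", "workshop", "accessories"]),
    ("47", ["retail", "clothing"]),
]

def cross_reference(records):
    """Count, for every keyword, how many companies use it in each division."""
    table = defaultdict(lambda: defaultdict(int))
    for division, keywords in records:
        for keyword in set(keywords):  # count a keyword once per company
            table[keyword][division] += 1
    return {kw: dict(divs) for kw, divs in table.items()}

def ambiguous_keywords(table):
    """Return the keywords linked to more than one division."""
    return {kw: sorted(divs) for kw, divs in table.items() if len(divs) > 1}

table = cross_reference(companies)
ambiguous = ambiguous_keywords(table)
for keyword, divisions in sorted(ambiguous.items()):
    print(keyword, divisions)
```

In this toy data, "automobile" and "workshop" are flagged because they occur in companies outside division 45, while "car body" is not.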

                These cross-references also make it possible to discover elements that
            were not completely cleaned during the NLP operations (words in other
            languages, isolated letters, etc.). Such findings trigger a new language
            detection and stemming phase to correct errors in the keywords. This loop
            is repeated systematically until the keywords are judged good enough to
            launch the next phase of the process.
                This continuous feedback to the previous phase of the process is a key
            element of NOGAuto's philosophy, which is based on the correction and
            continuous improvement of the outcomes of the previous steps.
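
The feedback loop described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the cleaning rules or the quality criterion, so the toy word list `FOREIGN_WORDS` (a stand-in for real language detection), the value of `QUALITY_THRESHOLD` and all function names are hypothetical.

```python
import re

# Hypothetical stand-ins: neither the rules nor the threshold come from the paper.
FOREIGN_WORDS = {"und", "der", "les", "des"}  # toy list replacing language detection
QUALITY_THRESHOLD = 0.05                      # assumed maximum share of suspect keywords

def is_suspect(keyword):
    """Flag isolated letters and words from the toy foreign-word list."""
    return len(keyword) <= 1 or keyword in FOREIGN_WORDS

def suspect_share(keywords):
    """Share of keywords that still look uncleaned."""
    if not keywords:
        return 0.0
    return sum(is_suspect(k) for k in keywords) / len(keywords)

def clean_pass(keywords):
    """One pass: drop keywords flagged now, then normalise the remainder."""
    kept = [k for k in keywords if not is_suspect(k)]
    normalised = [re.sub(r"[^a-z ]", "", k.lower()).strip() for k in kept]
    return [k for k in normalised if k]  # discard keywords that became empty

def refine(keywords, threshold=QUALITY_THRESHOLD, max_rounds=10):
    """Repeat cleaning passes until the suspect share is at or below the threshold."""
    rounds = 0
    while suspect_share(keywords) > threshold and rounds < max_rounds:
        keywords = clean_pass(keywords)
        rounds += 1
    return keywords, rounds

raw = ["Workshop", "car body", "a", "Und", "automobile", "B-"]
cleaned, rounds = refine(raw)
print(cleaned, rounds)
```

Normalisation in one pass can expose new problems ("Und" becomes "und", "B-" becomes "b") that are only caught on the next pass, which is why the check is repeated rather than run once.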

            The « Modelling » phase
                The companies to be codified are linked in a single matrix to (1) the
            keywords extracted from the description of the economic activity in the
            “Preparation” phase and (2) variables from the SBER (address, jobs, turnover,
            legal form, etc.), which are intended to provide additional input for
            determining the NOGA code to be assigned. From this matrix a
            prediction model will be generated that defines the keywords and concepts to

                                                                 4 | ISI WSC 2019