Page 73 - Invited Paper Session (IPS) - Volume 1

IPS57 Gerardo L. et al.
            implementation of machine learning techniques to the interpretation of the
            emotion that underlies the messages that are published on Twitter.
                Machine learning techniques make it possible to train a computer to
            replicate the human criterion for identifying the emotional load of each
            tweet, either negative or positive. This, in turn, makes it possible to
            classify the tweets and build an indicator that relates the number of tweets
            associated with a positive emotional charge (positive tweets) to the number
            associated with a negative emotional charge (negative tweets). We call this
            indicator the "positivity ratio" and define it as the number of positive
            tweets divided by the number of negative tweets for a given geographic area
            over a given period of time.
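The definition of the positivity ratio can be illustrated with a minimal sketch. It assumes that each classified tweet is available as an (area, period, label) tuple; that record layout, the function name `positivity_ratios` and the choice to return `None` when an area has no negative tweets are assumptions made for illustration, not details taken from the project.

```python
from collections import Counter, defaultdict


def positivity_ratios(tweets):
    """Compute positive/negative tweet ratios per (area, period).

    `tweets` is an iterable of (area, period, label) tuples, where label is
    "positive" or "negative" (hypothetical record layout).
    """
    counts = defaultdict(Counter)
    for area, period, label in tweets:
        counts[(area, period)][label] += 1

    ratios = {}
    for key, c in counts.items():
        # The ratio is undefined when there are no negative tweets;
        # returning None in that case is an assumption of this sketch.
        ratios[key] = c["positive"] / c["negative"] if c["negative"] else None
    return ratios
```

For instance, three positive and two negative tweets in the same state and month give a ratio of 1.5, meaning 1.5 positive tweets for each negative one.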

            2.  Methodology
                To build the Mood of the Tweeters, it was necessary to download hundreds
            of millions of tweets georeferenced to the Mexican territory, available through
            the free Twitter API. We started by manually labelling a relatively small set of
            training tweets. Human labellers (or taggers) were asked to gauge the
            underlying emotional charge (positive, negative or neutral) of each tweet. The
            resulting set of tagged tweets was then used to train a computer to replicate
            the human criterion needed to classify tweets into two categories: positive and
            negative. The training set was composed of about 54,000 standardized tweets,
            tagged by 5,000 students of University TecMilenio and 5,000 of our INEGI
            co-workers. Both sets of taggers were spread across the 32 federal entities
            (states) of the country. Taggers from each state received tweets generated
            in that same state, to facilitate the proper interpretation of regionalisms.
            Each tweet was assessed by several taggers, and each tagger evaluated the same
            tweet several times, so that ratings from inconsistent taggers could be
            identified and discarded. In addition, because each tweet was systematically
            rated as positive, negative or neutral by different people, its underlying
            emotional charge could be established more robustly.
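The consistency check described above could be sketched roughly as follows. The text does not say how inconsistency was measured or how the remaining ratings were combined, so this sketch assumes a self-agreement score per tagger, a hypothetical cut-off of 0.8, and a simple majority vote among the retained taggers. The function name and the (tagger, tweet, label) layout are likewise illustrative.

```python
from collections import Counter, defaultdict


def consensus_labels(ratings, min_agreement=0.8):
    """Drop self-inconsistent taggers, then take a majority label per tweet.

    `ratings` is a list of (tagger_id, tweet_id, label) tuples; the same
    tagger may rate the same tweet several times. `min_agreement` is a
    hypothetical threshold, not a value reported in the text.
    """
    by_pair = defaultdict(list)
    for tagger, tweet, label in ratings:
        by_pair[(tagger, tweet)].append(label)

    # Self-agreement: share of a tagger's repeated ratings of one tweet
    # that match that tagger's own most frequent label for it.
    agreement = defaultdict(list)
    for (tagger, _tweet), labels in by_pair.items():
        modal_count = Counter(labels).most_common(1)[0][1]
        agreement[tagger].append(modal_count / len(labels))
    reliable = {
        tagger
        for tagger, scores in agreement.items()
        if sum(scores) / len(scores) >= min_agreement
    }

    # Each reliable tagger casts one vote per tweet (their most frequent label).
    votes = defaultdict(Counter)
    for (tagger, tweet), labels in by_pair.items():
        if tagger in reliable:
            votes[tweet][Counter(labels).most_common(1)[0][0]] += 1

    # Ties are resolved in favour of the label seen first (an arbitrary choice).
    return {tweet: c.most_common(1)[0][0] for tweet, c in votes.items()}
```

A tagger who rates the same tweet "positive" once and "negative" once has a self-agreement of 0.5 and is excluded under the assumed threshold.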
                The tweets classified by humans were then used to train the computer. For
            this, it was necessary to incorporate an ensemble of classifiers developed
            with the support of data science experts from INFOTEC and Centro Geo, two
            Mexican research centres. After comparing many alternatives, we adopted an
            ensemble of 33 different Support Vector Machine (SVM) estimation procedures,
            each one run with a different 80% of the human-tagged tweets available for
            training and validation.
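The subsampling idea behind the 33 estimation procedures can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say how the 33 outputs were combined, so a majority vote is assumed here, and the base learner is a deliberately trivial word-count classifier (`train_word_counter`, a hypothetical stand-in) rather than an SVM, so that the example runs with the standard library alone. In practice an SVM trainer would be passed as `train_fn`.

```python
import random
from collections import Counter


def train_ensemble(samples, labels, train_fn, n_models=33, fraction=0.8, seed=0):
    """Train n_models classifiers, each on a different random 80% subset."""
    rng = random.Random(seed)
    n = len(samples)
    k = int(fraction * n)
    models = []
    for _ in range(n_models):
        idx = rng.sample(range(n), k)  # sample k indices without replacement
        models.append(train_fn([samples[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict_majority(models, sample):
    """Combine the members by majority vote (assumed combination rule)."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


def train_word_counter(texts, labels):
    """Stand-in base learner (NOT an SVM): counts words seen under each label."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())

    def classify(text):
        words = text.lower().split()
        pos = sum(counts["positive"][w] for w in words)
        neg = sum(counts["negative"][w] for w in words)
        # Ties default to "positive", an arbitrary choice for this sketch.
        return "positive" if pos >= neg else "negative"

    return classify
```

With two classes, an odd number of members such as 33 guarantees that a majority vote never ties.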
                The classification was not run over words; instead, we used q-grams of
            different lengths. We found it optimal to use q-grams of order 3, 4, 5,
            and 7. We also normalized the text, gave polarity values to emoticons,
            implemented number, URL and user substitution, and reduced words to their
            basic linguistic roots. A TF-IDF scheme was used to achieve vectorization
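A rough sketch of the preprocessing and vectorization steps named above follows. It assumes that the q-grams are character q-grams, that substitution replaces each URL, user mention and number with a placeholder token (`_url`, `_usr` and `_num` are hypothetical names), and that TF-IDF is computed as relative term frequency times log(N / df). The text does not specify any of these details. Emoticon polarity and stemming are omitted for brevity.

```python
import math
import re
from collections import Counter

Q_ORDERS = (3, 4, 5, 7)  # q-gram orders reported in the text


def normalize(text):
    """Lowercase and replace URLs, user mentions and numbers with placeholders."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "_url", text)
    text = re.sub(r"@\w+", "_usr", text)
    text = re.sub(r"\d+(?:[.,]\d+)*", "_num", text)
    return re.sub(r"\s+", " ", text).strip()


def qgrams(text, orders=Q_ORDERS):
    """Return all character q-grams of the given orders (assumed to be character-level)."""
    return [text[i:i + q] for q in orders for i in range(len(text) - q + 1)]


def tfidf(documents):
    """Return one {q-gram: weight} dict per document, using tf * log(N / df)."""
    bags = [Counter(qgrams(normalize(doc))) for doc in documents]
    # Document frequency: number of documents containing each q-gram.
    df = Counter(gram for bag in bags for gram in bag)
    n_docs = len(bags)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in bag.items()}
        )
    return vectors
```

Under this weighting, a q-gram that occurs in every document receives a weight of zero, so only q-grams that distinguish one tweet from another contribute to the vectors.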

                                                               62 | ISI WSC 2019