Page 73 - Invited Paper Session (IPS) - Volume 1
IPS57 Gerardo L. et al.
implementation of machine learning techniques to the interpretation of the
emotion that underlies the messages that are published on Twitter.
Machine learning techniques make it possible to train a computer to replicate
the human criterion for identifying the emotional charge of each tweet, either
negative or positive. This, in turn, makes it possible to classify the tweets
and build an indicator that gives the number of tweets associated with a
positive emotional charge (positive tweets) for each tweet associated with a
negative emotional charge (negative tweets). We call this indicator the
"positivity ratio" and define it as the number of positive tweets over the
number of negative tweets for a given geographic area over a given period of
time.
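As a minimal illustration of this definition, the sketch below computes the
positivity ratio from a list of already classified tweets for one area and one
period. The function name, the string labels and the sample data are
hypothetical; returning None when there are no negative tweets is also an
assumption, since the text does not say how that case is handled.

```python
from collections import Counter


def positivity_ratio(labels):
    """Number of positive tweets divided by number of negative tweets.

    `labels` holds one "positive" or "negative" string per classified tweet
    (hypothetical encoding) for a single area and period.
    Returns None when there are no negative tweets (ratio undefined).
    """
    counts = Counter(labels)
    if counts["negative"] == 0:
        return None
    return counts["positive"] / counts["negative"]


# Hypothetical classified tweets for one state on one day.
sample = ["positive", "positive", "negative", "positive", "negative"]
ratio = positivity_ratio(sample)  # 3 positive tweets over 2 negative tweets
```

A ratio above 1 means that positive tweets outnumber negative ones in that
area and period.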
2. Methodology
To build the Mood of the Tweeters it was necessary to download hundreds
of millions of tweets georeferenced to the Mexican territory, available through
the free Twitter API. We started by manually labelling a relatively small set of
training tweets. Human labellers (or taggers) were asked to gauge the
underlying emotional charge (positive, negative or neutral) of each tweet. The
resulting set of tagged tweets was then used to train a computer to replicate
the human criterion needed to classify tweets into two categories: positive
and negative. The training set was composed of about 54,000 standardized
tweets, tagged by 5,000 students of University TecMilenio and 5,000 of our
INEGI co-workers. Both sets of taggers were spread across the 32 federal
entities (states) of the country. Human taggers from each state received tweets
generated in the same state, to facilitate the proper interpretation of
regionalisms. Each tweet was assessed by several taggers, and each tagger
evaluated the same tweet several times, so that labels from inconsistent
taggers could be identified and discarded. Additionally, because each tweet
was systematically rated by different people as positive, negative or neutral,
its underlying emotional charge could be identified more robustly.
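The tagging quality control described above can be sketched as follows. This
is only an illustration under stated assumptions: the text does not give the
actual consistency rule or how labels from different taggers were combined, so
the self-agreement threshold (min_agreement), the one-vote-per-tagger majority
rule, the function names and the sample annotations are all hypothetical.

```python
from collections import Counter, defaultdict


def group_judgments(annotations):
    """Group labels by (tagger, tweet); annotations are (tagger, tweet, label)."""
    grouped = defaultdict(list)
    for tagger, tweet, label in annotations:
        grouped[(tagger, tweet)].append(label)
    return grouped


def consistent_taggers(annotations, min_agreement=0.8):
    """Keep taggers whose repeated judgments of a tweet mostly agree.

    Self-agreement is the share of a tagger's repeated judgments that match
    their own most frequent label for the same tweet (assumed rule).
    """
    matched, total = Counter(), Counter()
    for (tagger, _tweet), labels in group_judgments(annotations).items():
        if len(labels) < 2:
            continue  # a single judgment says nothing about consistency
        matched[tagger] += Counter(labels).most_common(1)[0][1]
        total[tagger] += len(labels)
    taggers = {tagger for tagger, _, _ in annotations}
    return {t for t in taggers
            if total[t] == 0 or matched[t] / total[t] >= min_agreement}


def majority_labels(annotations, keep):
    """One vote per kept tagger and tweet; tweets without a clear winner are dropped."""
    votes = defaultdict(Counter)
    for (tagger, tweet), labels in group_judgments(annotations).items():
        if tagger in keep:
            votes[tweet][Counter(labels).most_common(1)[0][0]] += 1
    result = {}
    for tweet, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            result[tweet] = ranked[0][0]
    return result


# Hypothetical annotations: tagger "b" contradicts their own earlier label.
annotations = [
    ("a", 1, "positive"), ("a", 1, "positive"),
    ("b", 1, "positive"), ("b", 1, "negative"),
    ("c", 1, "positive"), ("c", 1, "positive"),
    ("d", 1, "negative"),
    ("a", 2, "negative"), ("a", 2, "negative"),
    ("d", 2, "negative"),
]
keep = consistent_taggers(annotations)
labels = majority_labels(annotations, keep)
```

In this toy data, tagger "b" is discarded for self-contradiction, and the
remaining votes label tweet 1 as positive and tweet 2 as negative.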
The tweets classified by humans were then used to train the computer, for
which it was necessary to incorporate an ensemble of classifiers developed
with the support of data science experts from INFOTEC and Centro Geo, two
Mexican research centres. After comparing many alternatives, we found that
the best option was an ensemble of 33 different Support Vector Machine (SVM)
estimation procedures, each one run with a different 80% of the human-tagged
tweets available for training and validation.
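The idea of training several SVMs, each on a different 80% of the labelled
data, can be sketched in plain Python as below. This is not the authors'
implementation: the linear, bias-free SVM trained with a Pegasos-style
stochastic subgradient method, the random choice of each 80% subset, the
majority vote used to combine the 33 models, and the toy two-dimensional data
are all assumptions made for illustration. In practice the inputs would be
vectorized tweets rather than hand-made points.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, epochs=20, lam=0.01, seed=0):
    """Linear SVM without bias (hinge loss + L2 penalty), Pegasos-style SGD.

    X: list of feature lists; y: labels in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)          # decreasing step size
            margin = y[i] * dot(w, X[i])   # computed before the update
            w = [(1.0 - 1.0 / t) * wj for wj in w]  # shrink: 1 - eta * lam
            if margin < 1:                 # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def train_ensemble(X, y, n_models=33, fraction=0.8, seed=0):
    """Train n_models SVMs, each on a different random subset of the data."""
    rng = random.Random(seed)
    n_sub = int(fraction * len(X))
    models = []
    for m in range(n_models):
        idx = rng.sample(range(len(X)), n_sub)
        models.append(train_linear_svm([X[i] for i in idx],
                                       [y[i] for i in idx], seed=m))
    return models


def predict(models, x):
    """Majority vote over the ensemble; returns +1 (positive) or -1 (negative)."""
    votes = sum(1 if dot(w, x) >= 0 else -1 for w in models)
    return 1 if votes > 0 else -1


# Toy, linearly separable data standing in for vectorized tweets.
X = [[2, 1], [1, 2], [3, 2], [2, 3], [1, 1],
     [-2, -1], [-1, -2], [-3, -2], [-2, -3], [-1, -1]]
y = [1] * 5 + [-1] * 5
models = train_ensemble(X, y)
```

Using an odd number of models, as with the 33 reported here, avoids ties in a
two-class majority vote.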
The classification was not run over words; instead, we used q-grams of
different lengths. In fact, we found it optimal to use q-grams of order 3, 4,
5, and 7. We also normalized the text, gave polarity values to emoticons,
implemented number, URL and user substitution, and reduced words to their
basic linguistic roots. A TF-IDF scheme was used to achieve vectorization
62 | I S I W S C 2 0 1 9