(Open Access) Discovering explanatory models to identify relevant tweets on Zika (2017) | RoopTeja Muppalla

Wright State University

CORE Scholar

Kno.e.sis Publications

The Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis)

7-2017

Discovering Explanatory Models to Identify Relevant Tweets on Zika

RoopTeja Muppalla, Wright State University - Main Campus, muppalla.4@wright.edu

Michele Miller, Wright State University - Main Campus

Tanvi Banerjee, Wright State University - Main Campus, tanvi.banerjee@wright.edu

William L. Romine, Wright State University - Main Campus, william.romine@wright.edu

Follow this and additional works at: https://corescholar.libraries.wright.edu/knoesis

Part of the Bioinformatics Commons, Communication Technology and New Media Commons, Databases and Information Systems Commons, OS and Networks Commons, and the Science and Technology Studies Commons

Repository Citation

Muppalla, R., Miller, M., Banerjee, T., & Romine, W. L. (2017). Discovering Explanatory Models to Identify Relevant Tweets on Zika.

https://corescholar.libraries.wright.edu/knoesis/1130

This Conference Proceeding is brought to you for free and open access by The Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis) at CORE Scholar. It has been accepted for inclusion in Kno.e.sis Publications by an authorized administrator of CORE Scholar. For more information, please contact library-corescholar@wright.edu.

Discovering Explanatory Models to Identify Relevant Tweets on Zika

Roopteja Muppalla, Michele Miller, Tanvi Banerjee and William Romine

Abstract: Zika virus has caught the world's attention, and has led people to share their opinions and concerns on social media like Twitter. Using text-based features, extracted with the help of Parts of Speech (POS) taggers and N-grams, a classifier was built to detect Zika related tweets from Twitter. With a simple logistic classifier, the system was successful in detecting Zika related tweets from Twitter with a 92% accuracy. Moreover, key features were identified that provide deeper insights on the content of tweets relevant to Zika. This system can be leveraged by domain experts to perform sentiment analysis, and understand the temporal and spatial spread of Zika.

I. INTRODUCTION

Zika has been around for decades, but the current outbreak that started in 2015 has sparked significant concern. This is the first outbreak of Zika associated with microcephaly and Guillain-Barre syndrome, so management is still an important challenge [1]. The three ways to become infected with the Zika virus are: (i) an infected mother passing the virus to her fetus, (ii) being bitten by an infected Aedes mosquito, and (iii) through sexual contact. There are currently no medications or vaccines to prevent or treat the Zika virus. As of January 18, 2017, 4,900 Zika related cases had been reported in the United States. With the growing number of cases, it is important for health officials to recognize Zika virus hot spots and spread the necessary information to the public in real time.

Public health organizations often depend on traditional

survey based methods to gather information about a disease

outbreak. Though these methods are useful, they take a long

time to recognize an outbreak. This is a major roadblock

when trying to detect the rapid spread of a disease. However,

social media can reduce this time lag while also allowing

for studying public opinions on health issues. People often

have health related conversations on social media and openly

discuss diseases. Platforms like Twitter make it easy to share

personal experiences so that people can empathize with each

other. In particular, public opinion mining has been studied in

the past for exploration of public views on important social

issues such as gender-based violence [2], as well as to mine

health related beliefs [3], [4]. Studies like the one about

a cholera outbreak after the earthquake in Haiti [5] have

demonstrated that Twitter data may represent a potential way

to track diseases faster in future events.

Department of Computer Science and Engineering, Wright State University, Dayton, OH 45435, USA. roopteja, tanvi@knoesis.org

Department of Biological Sciences, Wright State University, Dayton, OH 45435, USA. miller.1232, william.romine@wright.edu

http://www.webcitation.org/6mhnTZk4b

https://www.cdc.gov/zika/geo/united-states.html

Since Twitter has become a common platform for discus-

sions about disease, researchers can have a greater under-

standing of the disease and can communicate and address

any issues in real time. This study ﬁnds the best features

which can be used to build a classiﬁer to detect Zika related

tweets. A quicker detection of a disease through social media

can give more time to prepare the response team and contain

the disease spread. Such a system can provide health ofﬁcials

a collective view of the public’s health and also detect any

future Zika outbreaks.

In this study, techniques from natural language processing

(NLP) were used in combination with machine learning

techniques to build models to classify Zika related tweets. In

particular, we used data science techniques to not only build

a strong classiﬁer to identify relevant tweets on Zika, but

also extract features that best discriminate the two categories:

relevant and not relevant. The main focus of this study is a

shift from black box methods that achieve high classiﬁcation

rates but cannot explain the results, to simpler and more

explanatory models that provide deeper insights into the

model performance. Through exploration of multiple models,

we gain a deeper understanding of the content of tweets

pertaining to Zika.

II. RELATED WORK

Multiple studies have used Twitter for exploration of pub-

lic health issues [6], [7], [8]. One study focused on the spread

of inﬂuenza from November 2008 to June 2010 and collected

300 million tweets [6]. Tweets were identiﬁed as relevant to

inﬂuenza based on their inﬂuenza corpus using a support

vector machine (SVM) based classiﬁer. Pearson correlation

was used to compare estimated values and annotations. Their

method performed well in detecting inﬂuenza epidemics with

high correlation (0.89 correlation).

Alvaro et al. [7] obtained a random sample of tweets

over a 12 month period to analyse ﬁrst-hand experience

with selective serotonin reuptake inhibitors or cognitive

enhancers. The ground truth consisted of 100 annotated

tweets for 15 categories which were then compared to crowd

sourced annotators by calculating Kappa values for each of

the categories. Using URLs, hashtags, and N-grams from the

tweets, Bayesian Generalized Linear Modeling was found to

be the best technique for interpretation. In this study, we

followed a similar approach in collecting and preparing the

data for the classiﬁers.

Recently, we performed an exploratory study using Zika

related tweets to determine what people were tweeting about

Zika [8]. Tweets were collected over a period of 2 months. A

two-stage classiﬁer was used to ﬁnd the Zika related tweets

and to further classify the tweets into subcategories. The

classiﬁers used the whole tweet to generate a set of features

whereas in this study, we extracted features based on natural

language techniques to build a simple model with a fewer

number of features.

While these different studies highlight the utility of using social media to monitor people's thoughts regarding a specific disease outbreak, they did not discuss the role of features and their significance in classification. This study focuses on extracting features using Part of Speech tagging and N-gram techniques, and on identifying, through model selection, the set of features which will improve the classification results. Models which are simple and interpretable will help researchers to quickly classify Zika related tweets to address public concerns and misconceptions, similar to the research done on other diseases [9], [10], [11].

III. DATA COLLECTION

Tweets were collected over a period of two months (a total of 1,234,605 tweets) from Twitter based on the keywords 'zika' and 'zika virus' using the Twitter streaming application programming interface (API). Although all of these tweets contain the word 'zika', not all of them were relevant to our study. For example, a tweet such as 'if you need me I’ll b contracting the zika virus to avoid my ap test' contains the word 'zika', but uses it in a humorous context. Because this study focused on bridging the gap between the public and health organizations in order to tackle the disease, such tweets are considered irrelevant in our context. Classification is therefore needed to remove such irrelevant tweets.

For this, we took a random sample of 1,467 tweets from

the dataset for analysis. This dataset was then annotated by

three microbiology and immunology experts as to whether

the tweets were relevant to Zika or not (1,137 tweets were

labeled relevant). Inter-rater reliability among the annotators

was found using Fleiss Kappa [12]. We calculated a Kappa

value of 0.71 which indicates substantial agreement among

the raters [13].
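The agreement statistic reported above can be computed in a few lines. The following is a minimal Python sketch of Fleiss' kappa for a table of per-tweet category counts. The function name `fleiss_kappa` and the toy ratings are illustrative assumptions, not the study's annotation data or code (the authors report working in R).

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for counts[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters (at least two),
    and more than one category must actually be used.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in shares)

    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 4 tweets, 3 raters, columns = [relevant, not relevant].
toy_ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(toy_ratings), 3))
```

With three annotators and two categories, each row of the input would hold the number of annotators who marked a tweet relevant and not relevant, respectively.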

IV. FEATURE EXTRACTION

Once the data were collected, we needed features to help the learning algorithms or classifiers predict whether or not a tweet was relevant. A simple approach to extract features from the text is to use a bag-of-words model, where each word is considered a feature. But this results in a large number of features, which makes the data difficult for the learning algorithm to process. Therefore, we made use of the following two ways to extract features from tweets:

A. Parts of Speech (POS) Features

Features were extracted from the tweets with the help of Stanford NLP POS tagger [14]. First, a feature vector with all the 25 POS tags was created. Then the tagger annotated each tweet, and the count of each tag in the tweet was recorded as the feature value. For example, some of the features generated by the POS tagger for the tweet, 'RT @nationalpost: Canada confirms its first case of sexually transmitted Zika virus, in Ontario', are shown in Table 1. From these 25 POS features, two features were excluded, namely 'existential verbal' and 'proper noun verbal', as none of the tweets contained those two features.

TABLE I
POS TAG FEATURES FOR A TWEET

Tag (feature)          Count (feature value)   Sample from the tweet
discourse marker       2                       RT, :
at-mention (@)         1                       @nationalpost
proper noun            2                       Canada, Zika, Ontario
verb                   2                       confirms, transmitted
nominal and verb       1                       its
punctuation            1                       ,
adjective              1                       first
pre- or post-position  2                       of, in
common noun            2                       case, virus
adverb                 1                       sexually
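The counting step described above can be sketched as follows, assuming the tweet has already been tagged. The tag subset in `TAGS`, the function name `pos_count_features`, and the hand-assigned tags are hypothetical and serve only as illustration; they are not output from the tagger used in the study.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Tag inventory for the feature vector (a hypothetical subset of the 25 tags).
TAGS = ["proper noun", "common noun", "verb", "adjective", "adverb"]


def pos_count_features(tagged: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count how often each tag in TAGS occurs in one tagged tweet."""
    counts = Counter(tag for _, tag in tagged)
    return {tag: counts.get(tag, 0) for tag in TAGS}


# Hand-assigned tags for illustration only; a real pipeline would call a tagger.
tagged_tweet = [
    ("zika", "proper noun"),
    ("makes", "verb"),
    ("americans", "common noun"),
    ("rethink", "verb"),
    ("travel", "common noun"),
]
features = pos_count_features(tagged_tweet)
print(features)
```

Tags that never occur in a tweet keep a count of zero, so every tweet maps to a vector of the same length.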

B. N-gram features

Features were also extracted with the help of n-grams. An n-gram [15] is a sequence of n words from a given text which is treated as a single unit. As part of pre-processing, URLs, hashtags, and stopwords were removed from the tweets, as these terms appear commonly in tweets and do not help the classifier learn to distinguish Zika related tweets. For the tweet 'zika makes americans rethink travel', the features generated by n-grams are shown in Table 2.

TABLE II
N-GRAM FEATURES FOR A TWEET

N-gram     Feature
Unigrams   zika, makes, americans, rethink, travel
Bi-gram    zika makes, makes americans, americans rethink, rethink travel

Several studies use the entire unigram corpus to investigate text content in datasets such as tweets [8]. However, this study was performed by taking the top 10 occurring unigrams and bigrams in the dataset, after pre-processing, as the features for our analysis. For every selected n-gram, the count was increased each time that unigram or bigram occurred in the tweet. For example, if a selected unigram like 'zika' is a feature and it occurred 2 times in a tweet, then the count of the occurrences was recorded as 2. Higher-order n-grams were not considered, since their frequency was far lower due to the tweet length constraint of 140 characters.
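The n-gram extraction and counting described in this section can be sketched in Python as follows. The helper names (`ngrams`, `top_ngrams`, `count_features`) and the three-tweet mini-corpus are assumptions made for illustration; only the first tweet comes from the paper, and pre-processing is assumed to have been applied already.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(corpus: List[List[str]], n: int, k: int) -> List[str]:
    """Return the k most frequent n-grams across a tokenized corpus."""
    totals = Counter(g for tokens in corpus for g in ngrams(tokens, n))
    return [g for g, _ in totals.most_common(k)]


def count_features(tokens: List[str], selected: List[str], n: int) -> Dict[str, int]:
    """Count occurrences of each selected n-gram in one tweet."""
    counts = Counter(ngrams(tokens, n))
    return {g: counts.get(g, 0) for g in selected}


tweet = "zika makes americans rethink travel".split()
print(ngrams(tweet, 1))
print(ngrams(tweet, 2))

# Hypothetical mini-corpus (already pre-processed) to show top-k selection.
corpus = [tweet, "zika causes microcephaly".split(), "zika zika funding".split()]
selected = top_ngrams(corpus, n=1, k=1)
print(count_features(corpus[2], selected, n=1))
```

The study keeps the top 10 unigrams and the top 10 bigrams; the sketch uses `k=1` only to keep the toy output short.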

V. ANALYSIS

Combining the POS, selected unigram, and selected bigram features gave a total of 43 features. Examples of features include: 'at mention', 'birth defects', 'cdc', 'emoticon', 'fight', 'funding', 'hashtag', 'microcephaly', 'pregnant women', 'pronoun', 'public health', 'symptoms', 'treatment', 'URL'.

https://dev.twitter.com/basics/counting-characters

Using the R programming language, we created a simple logistic model considering all 43 features. To estimate the relative quality of the model containing all of these features in relation to simpler models which contain subsets of these features, we used Akaike's information criterion (AIC) [16]. This gave a value of 957.78 for the full model (All Features in Table 4). Then we performed forward/backward stepwise model selection, in which features are added one at a time and, at each step, the procedure tests whether removing a previously added feature would improve the AIC. This process yielded a model with 27 features (Stepwise in Table 4) with an AIC value of 934.99.
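The forward/backward procedure can be sketched generically in Python. This is a minimal illustration under stated assumptions, not the authors' R code: `stepwise_select` and `toy_aic` are hypothetical names, and `toy_aic` is a made-up stand-in for fitting a logistic model and reading off its AIC (it charges 2 per feature and subtracts an invented fit gain).

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def stepwise_select(
    candidates: Iterable[str],
    aic_of: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy forward/backward selection that accepts only AIC-lowering moves."""
    pool = frozenset(candidates)
    selected: FrozenSet[str] = frozenset()
    best = aic_of(selected)
    improved = True
    while improved:
        improved = False
        # Forward step: add the single feature that lowers AIC the most.
        adds = [(aic_of(selected | {f}), f) for f in sorted(pool - selected)]
        if adds and min(adds)[0] < best:
            best, feature = min(adds)
            selected = selected | {feature}
            improved = True
        # Backward step: drop a previously added feature if that lowers AIC.
        drops = [(aic_of(selected - {f}), f) for f in sorted(selected)]
        if drops and min(drops)[0] < best:
            best, feature = min(drops)
            selected = selected - {feature}
            improved = True
    return selected, best


# Hypothetical fit gains per feature; real gains come from the fitted model.
GAINS = {"microcephaly": 10.0, "cdc": 5.0, "adverb": 1.0}


def toy_aic(features: FrozenSet[str]) -> float:
    """Toy AIC: baseline 100, penalty 2 per feature, minus invented gains."""
    return 100.0 + 2.0 * len(features) - sum(GAINS[f] for f in features)


chosen, aic = stepwise_select(GAINS, toy_aic)
print(sorted(chosen), aic)
```

Because every accepted move strictly lowers the AIC and there are finitely many feature subsets, the loop always terminates. In the toy run, 'adverb' is left out because its gain does not cover the per-feature penalty.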

Principal Component Analysis (PCA) was used to further

reduce the number of features and to test whether the

model with the reduced features gave us better results. After

performing PCA, a scree plot (Figure 1) was used to select

2 components [17]. Since there is a low correlation between

the individual features, it is highly unlikely for the features

to have high correlation with the principal components.

Therefore, a low cut off (0.2) was used to determine which

features were associated with the principal components as

shown in Table 3.

Fig. 1. Scree plot of factor eigenvalues.

Component 1 comprised topical features generated by n-grams, such as 'birth defects', 'cdc', and 'microcephaly', whereas component 2 comprised lexical features generated by the POS tagger, such as 'adverb', 'pronoun', and 'verb'. Together, these two components explained a total of 16 features based on the cutoff value of 0.2.

TABLE III
STRUCTURE MATRIX OF FEATURES ONTO COMPONENTS 1 AND 2.

Feature               Component 1: Topical   Component 2: Lexical
adjective              0.01                  -0.13
adverb                -0.01                  -0.33
birth defects         -0.34                  -0.10
causes microcephaly   -0.43                  -0.11
cdc                   -0.34                  -0.08
fight                  0.03                   0.05
verb                   0.11                  -0.36
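Applying the 0.2 cutoff to the loadings shown in Table III can be sketched as below. The sketch assumes the cutoff is applied to the absolute value of each loading and treats a loading exactly at the cutoff as associated; the paper does not state either detail, and the function name `assign_components` is hypothetical.

```python
from typing import Dict, List, Tuple

CUTOFF = 0.2  # cutoff stated in the text

# (component 1, component 2) loadings copied from the rows shown in Table III.
loadings: Dict[str, Tuple[float, float]] = {
    "adjective": (0.01, -0.13),
    "adverb": (-0.01, -0.33),
    "birth defects": (-0.34, -0.10),
    "causes microcephaly": (-0.43, -0.11),
    "cdc": (-0.34, -0.08),
    "fight": (0.03, 0.05),
    "verb": (0.11, -0.36),
}


def assign_components(
    table: Dict[str, Tuple[float, float]], cutoff: float = CUTOFF
) -> Dict[int, List[str]]:
    """List, per component, the features whose absolute loading meets the cutoff."""
    assigned: Dict[int, List[str]] = {1: [], 2: []}
    for feature, values in table.items():
        for component, value in enumerate(values, start=1):
            if abs(value) >= cutoff:
                assigned[component].append(feature)
    return assigned


groups = assign_components(loadings)
print(groups)
```

For the rows shown, the n-gram features fall under component 1 and the POS features under component 2, while 'adjective' and 'fight' load on neither.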

https://www.r-project.org/

The model built using just these two principal components had an AIC value of 1412.59. Therefore, a model (All-2-PC) was built using the 2 components and the remaining features that did not load onto these components. Similarly, another model (19-Stepwise-2-PC) was built using these 2 principal components and the remaining features present in the Stepwise model that the two principal components did not explain. Finally, the Stepwise model was chosen as the best model based on the Akaike weights (w), which give the relative likelihood of each model among the candidate models within a probabilistic framework [18]. Based on this, we observed that PCA did not help in improving the model.

TABLE IV
AIC VALUES FOR DIFFERENT MODELS

Model              AIC       w
All                957.78    1.1E-5
Stepwise           934.99    0.99
PC                 1412.59   0
19-Stepwise-2-PC   1035.50   0
All-2-PC           1009.03   0
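The Akaike weights in Table IV follow from the AIC values alone. A minimal Python sketch, using the standard formula w_i = exp(-d_i/2) / sum_j exp(-d_j/2), where d_i is the difference between model i's AIC and the lowest AIC, is given below. The function name `akaike_weights` is an assumption; the AIC values are those listed in the table.

```python
import math
from typing import Dict


def akaike_weights(aic: Dict[str, float]) -> Dict[str, float]:
    """Convert AIC values into Akaike weights that sum to 1."""
    best = min(aic.values())
    # Relative likelihood of each model given its AIC gap to the best model.
    relative = {name: math.exp(-(value - best) / 2) for name, value in aic.items()}
    total = sum(relative.values())
    return {name: r / total for name, r in relative.items()}


# AIC values as listed in Table IV.
aic_values = {
    "All": 957.78,
    "Stepwise": 934.99,
    "PC": 1412.59,
    "19-Stepwise-2-PC": 1035.50,
    "All-2-PC": 1009.03,
}
weights = akaike_weights(aic_values)
for name, w in weights.items():
    print(f"{name}: {w:.2g}")
```

The Stepwise model receives almost all of the weight, and the full model receives a weight on the order of 1E-5, in line with the w column.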

We used the Stepwise logistic model to classify Zika

related tweets from the dataset. Table 5 shows the confusion

matrix, which gives the performance of the model in clas-

sifying the data. The results for this analysis (F measure of

0.92) were considerably better than the results generated by

multiple classiﬁers for the 1000 unigram features extracted

from the dataset through Weka [8], with F measures ranging

from 0.82 to 0.89. These results show that the Stepwise

model has more distinguishing features and is able to achieve

high accuracy even with a simple logistic model, as opposed

to more complex models such as SVMs, despite using

relatively few features as compared to our earlier study [8].

TABLE V
CONFUSION MATRIX FOR STEPWISE MODEL

                      Predicted relevant   Predicted not relevant
Actual relevant       1071 (94%)           66 (6%)
Actual not relevant   108 (33%)            222 (67%)
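The summary measures can be recomputed from the counts in Table V. The short Python sketch below assumes that 'relevant' is the positive class and that the reported F measure is the usual harmonic mean of precision and recall; the variable names are illustrative.

```python
# Counts from Table V, with 'relevant' treated as the positive class (assumption).
tp, fn = 1071, 66   # actual relevant: predicted relevant / predicted not relevant
fp, tn = 108, 222   # actual not relevant: predicted relevant / predicted not relevant

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F measure={f_measure:.3f} accuracy={accuracy:.3f}")
```

Under these assumptions the F measure rounds to 0.92, matching the value quoted in the text, while the overall accuracy computed from the same counts is about 0.88.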

The Stepwise model contains topical features (as shown

in Table 6) such as ’microcephaly’, ’funding’, ’ﬁght’, ’treat-

ment’, ’symptoms’, ’health’ (part of the top 12 n-grams),

which were able to classify Zika related tweets well. This

sheds light on topics people tweet the most regarding Zika.

We also observed that the Stepwise model contains 15 POS

tag features, indicating that lexical components were useful

in discriminating between the relevant and non-relevant

tweets.

TABLE VI
FEATURES IN STEPWISE MODEL

Topical features: treatment, symptoms, microcephaly, first, fight, health, puerto rico, cdc, new, funding, health officials, white house

Lexical features: URL, hashtag, discourse marker, coordinating conjunction, interjection, at mention, punctuation, common noun, determiner, emoticon, numeral, verb, verb particle, existential, nominal possessive

From the n-gram features together with lexical features such as 'hashtag', 'at mention', and 'URL', we observe that most of the tweets could potentially originate from a news source or be retweets of such information. For example, 'Health Tech Forum' retweeted the following message, 'CDCgov: The best way to prevent #Zika is to prevent mosquito bites. URL', which was indeed tweeted by the CDC through its official Twitter handle. These lexical features could also help researchers in analyzing public sentiment [19], [20] regarding Zika.

VI. CONCLUSION AND FUTURE WORK

In this study, we not only improved the performance of the relevance classifier compared to the state-of-the-art classifiers [8], but also extracted meaningful and explanatory features for classification, in contrast to the complete set of unigram features (1000 total) used for classification in our earlier study [8]. This not only allows us to better analyze system performance, but also reduces the computation time and resources needed to build a high accuracy, real-time classification system for Zika-related tweets. For our next steps, we want to explore explanatory features for the next stage of classification to categorize the relevant tweets into sub-groups of treatment, symptoms, transmission, and prevention. Future work will also involve using sentiment-based features that can classify public sentiment regarding a specific topic within the tweets. Such a system will enable public health organizations to employ real-time decision making for epidemics like Zika and help address public concerns in a faster and more efficient manner.

VII. ACKNOWLEDGEMENTS

Banerjee and Muppalla would like to acknowledge funding

from NIH project 1K01LM012439-01. Romine and Miller

were partially supported by Department of Education I3

project U411C140081 and Institute of Educational Sciences

(IES) award R305A150364.

REFERENCES

[1] S. S.-Y. Wong, R. W.-S. Poon, and S. C.-Y. Wong, “Zika virus infection - the next wave after dengue?,” Journal of the Formosan Medical Association, vol. 115, no. 4, pp. 226–242, 2016.

[2] H. Purohit, T. Banerjee, A. Hampton, V. L. Shalin, N. Bhandutia, and A. P. Sheth, “Gender-based violence in 140 characters or fewer: A #bigdata case study of twitter,” arXiv preprint arXiv:1503.02086, 2015.

[3] M. J. Paul and M. Dredze, “You are what you tweet: Analyzing twitter for public health,” ICWSM, vol. 20, pp. 265–272, 2011.

[4] S. Bhattacharya, H. Tran, and P. Srinivasan, “Discovering health beliefs in twitter,” in AAAI Fall Symposium: Information Retrieval and Knowledge Discovery in Biomedical Text, 2012.

[5] C. W. Schmidt, “Using social media to predict and track disease outbreaks,” Environmental health perspectives, vol. 120, no. 1, p. A31, 2012.

[6] E. Aramaki, S. Maskawa, and M. Morita, “Twitter catches the flu: detecting influenza epidemics using twitter,” in Proceedings of the conference on empirical methods in natural language processing, pp. 1568–1576, Association for Computational Linguistics, 2011.

[7] N. Alvaro, M. Conway, S. Doan, C. Lofi, J. Overington, and N. Collier, “Crowdsourcing twitter annotations to identify first-hand experiences of prescription drug use,” Journal of biomedical informatics, vol. 58, pp. 280–287, 2015.

[8] M. Miller, T. Banerjee, R. Muppalla, W. Romine, and A. Sheth, “What are people tweeting about zika? an exploratory study concerning symptoms, treatment, transmission, and prevention,” accepted, Journal of Medical Internet Research, 2017.

[9] A. Signorini, A. M. Segre, and P. M. Polgreen, “The use of twitter to track levels of disease activity and public concern in the us during the influenza a h1n1 pandemic,” PloS one, vol. 6, no. 5, p. e19467, 2011.

[10] A. Robertson, “Harnessing twitter, partnerships and the power of influence to stop stigma and spread awareness,” in 2013 National Conference on Health Communication, Marketing, and Media, CDC, 2013.

[11] X. Ji, S. A. Chun, and J. Geller, “Monitoring public health concerns using twitter sentiment classifications,” in Healthcare Informatics (ICHI), 2013 IEEE International Conference on, pp. 335–344, IEEE, 2013.

[12] M. L. McHugh, “Interrater reliability: the kappa statistic,” Biochemia medica, vol. 22, no. 3, pp. 276–282, 2012.

[13] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” biometrics, pp. 159–174, 1977.

[14] K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith, “Part-of-speech tagging for twitter: Annotation, features, and experiments,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pp. 42–47, Association for Computational Linguistics, 2011.

[15] R. J. Solomonoff, “An inductive inference machine,” in IRE Convention Record, Section on Information Theory, vol. 2, pp. 56–62, 1957.

[16] D. Anderson and K. Burnham, “Model selection and multi-model inference,” Second. NY: Springer-Verlag, 2004.

[17] R. B. Cattell, “The scree test for the number of factors,” Multivariate behavioral research, vol. 1, no. 2, pp. 245–276, 1966.

[18] E.-J. Wagenmakers and S. Farrell, “Aic model selection using akaike weights,” Psychonomic bulletin & review, vol. 11, no. 1, pp. 192–196, 2004.

[19] L. Barbosa and J. Feng, “Robust sentiment detection on twitter from biased and noisy data,” in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 36–44, Association for Computational Linguistics, 2010.

[20] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, “Sentiment analysis of twitter data,” in Proceedings of the workshop on languages in social media, pp. 30–38, Association for Computational Linguistics, 2011.
