
A Method of Automated Nonparametric Content
Analysis for Social Science
Daniel J. Hopkins
Georgetown University
Gary King
Harvard University
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.
American Journal of Political Science, Vol. 54, No. 1, January 2010, Pp. 229–247. © 2010, Midwest Political Science Association.

Daniel J. Hopkins is Assistant Professor of Government, Georgetown University, 681 Intercultural Center, Washington, DC 20057 (dhopkins@iq.harvard.edu, http://www.danhopkins.org). Gary King is Albert J. Weatherhead III University Professor, Harvard University, Institute for Quantitative Social Science, 1737 Cambridge St., Cambridge, MA 02138 (king@harvard.edu, http://gking.harvard.edu). Replication materials are available at Hopkins and King (2009); see http://hdl.handle.net/1902.1/12898. Our special thanks to our indefatigable undergraduate coders Sam Caporal, Katie Colton, Nicholas Hayes, Grace Kim, Matthew Knowles, Katherine McCabe, Andrew Prokop, and Keneshia Washington. Each coded numerous blogs, dealt with the unending changes we made to our coding schemes, and made many important suggestions that greatly improved our work. Matthew Knowles also helped us track down and understand the many scholarly literatures that intersected with our work, and Steven Melendez provided invaluable computer science wizardry; both are coauthors of the open source and free computer program that implements the methods described herein (ReadMe: Software for Automated Content Analysis; see http://gking.harvard.edu/readme). We thank Ying Lu for her wisdom and advice, Stuart Shieber for introducing us to the relevant computer science literature, and http://Blogpulse.com for getting us started with more than a million blog URLs. Thanks to Ken Benoit, Doug Bond, Justin Grimmer, Matt Hindman, Dan Ho, Pranam Kolari, Mark Kantrowitz, Lillian Lee, Will Lowe, Andrew Martin, Burt Monroe, Stephen Purpura, Phil Schrodt, Stuart Shulman, and Kevin Quinn for helpful suggestions or data. Thanks also to the Library of Congress (PA#NDP03-1), the Center for the Study of American Politics at Yale University, the Multidisciplinary Program on Inequality and Social Policy, and the Institute for Quantitative Social Science for research support.

Efforts to systematically categorize text documents date to the late 1600s, when the Church tracked the proportion of printed texts which were nonreligious (Krippendorff 2004). Similar techniques were used by earlier generations of social scientists, including Waples, Berelson, and Bradshaw (1940, which apparently includes the first use of the term “content analysis”) and Berelson and de Grazia (1947). Content analyses like these have spread to a vast array of fields, with automated methods now joining projects based on hand coding, and have increased at least sixfold from 1980 to 2002 (Neuendorf 2002). The recent explosive increase in web pages, blogs, emails, digitized books and articles, transcripts, and electronic versions of government documents (Lyman and Varian 2003) suggests the potential for many new applications. Given the infeasibility of much larger scale human-based coding, the need for automated methods is growing fast. Indeed, large-scale projects based solely on hand coding have stopped altogether in some fields (King and Lowe 2003, 618).
This article introduces new methods of automated
content analysis designed to estimate the primary quan-
tity of interest in many social science applications. These
new methods take as data a potentially large set of
text documents, of which a small subset is hand coded
into an investigator-chosen set of mutually exclusive and
exhaustive categories.[1] As output, the methods give ap-
proximately unbiased and statistically consistent esti-
mates of the proportion of all documents in each category.
Accurate estimates of these document category proportions
have not been a goal of most work in the classification lit-
erature, which has focused instead on increasing the accu-
racy of classification into individual document categories.
Unfortunately, methods tuned to maximize the percent
of documents correctly classified can still produce sub-
stantial biases in the aggregate proportion of documents
within each category. This poses no problem for the task
for which these methods were designed, but it suggests
that a new approach may be of use for many social science
applications.
When social scientists use formal content analysis, it
is typically to make generalizations using document cat-
egory proportions. Consider examples as far-ranging as
Mayhew (1991, chap. 3), Gamson (1992, chaps. 3, 6, 7,
and 9), Zaller (1992, chap. 9), Gerring (1998, chaps. 3–7),
Mutz (1998, chap. 8), Gilens (1999, chap. 5), Mendel-
berg (2001, chap. 5), Rudalevige (2002, chap. 4), Kellstedt
(2003, chap. 2), Jones and Baumgartner (2005, chaps.
3–10), and Hillygus and Shields (2008, chap. 6). In all
these cases and many others, researchers conducted con-
tent analyses to learn about the distribution of classifi-
cations in a population, not to assert the classification
of any particular document (which would be easy to do
through a close reading of the document in question). For
example, the manager of a congressional office would find
useful an automated method of sorting individual con-
stituent letters by policy area so they can be routed to the
most informed staffer to draft a response. In contrast, po-
litical scientists would be interested primarily in tracking
the proportion of mail (and thus constituent concerns)
in each policy area. Policy makers or computer scientists
may be interested in finding the needle in the haystack (such as a potential terrorist threat or the right web page to display from a search), but social scientists are more commonly interested in characterizing the haystack. Certainly, individual document classifications, when available, provide additional information to social scientists, since they enable one to aggregate in unanticipated ways, serve as variables in regression-type analyses, and help guide deeper qualitative inquiries into the nature of specific documents. But they do not usually (as in Benoit and Laver 2003) constitute the ultimate quantities of interest.

[1] Although some excellent content analysis methods are able to delegate to the computer both the choice of the categorization scheme and the classification of documents into the chosen categories, our applications require methods where the social scientist chooses the questions and the data provide the answers. The former so-called “unsupervised learning methods” are versions of cluster analysis and have the great advantage of requiring fewer startup costs, since no theoretical choices about categories are necessary ex ante and no hand coding is required (Quinn et al. 2009; Simon and Xenos 2004). In contrast, the latter so-called “supervised learning methods,” which require a choice of categories and a sample of hand-coded documents, have the advantage of letting the social scientist, rather than the computer program, determine the most theoretically interesting questions (Kolari, Finin, and Joshi 2006; Laver, Benoit, and Garry 2003; Pang, Lee, and Vaithyanathan 2002). These approaches, and others such as dictionary-based methods (Gerner et al. 1994; King and Lowe 2003), accomplish somewhat different tasks and so can often be productively used together, such as for discovering a relevant set of categories in part from the data.
Automated content analysis is a new field and is newer
still within political science. We thus begin in the second
section with a concrete example to help fix ideas and de-
fine key concepts, including an analysis of expressed opinion through blog posts about Senator John Kerry. We next
explain how to represent unstructured text as structured
variables amenable to statistical analysis. The following
section discusses problems with existing methods. We
introduce our methods in the fifth section along with
empirical verification from several data sets in the sixth
section. The last section concludes. The appendix pro-
vides intercoder reliability statistics and offers a method
for coping with errors in hand-coded documents.
Measuring Political Opinions in
Blogs: A Running Example
Although our methodology works for any unstructured
text, we use blogs as our running example. Blogs (or “web
logs”) are periodic web postings usually listed in reverse
chronological order.[2] For present purposes, we define our
inferential target as expressed sentiment about each can-
didate in the 2008 American presidential election. Mea-
suring the national conversation in this way is not the
only way to define the population of interest, but it seems
to be of considerable public interest and may also be of
interest to political scientists studying activists (Verba,
Schlozman, and Brady 1995), the media (Drezner and
Farrell 2004), public opinion (Gamson 1992), social net-
works (Adamic and Glance 2005; Huckfeldt and Sprague
1995), or elite influence (Grindle 2005; Hindman, Tsiout-
siouliklis, and Johnson 2003; Zaller 1992). We attempted
to collect all English-language blog posts from highly
political people who blog about politics all the time, as well as others who normally blog about gardening or their love lives, but choose to join the national conversation about the presidency for one or more posts. Bloggers’ opinions get counted when they post and not otherwise, just as in earlier centuries when public opinion was synonymous with visible public expressions rather than attitudes and nonattitudes expressed in survey responses (Ginsberg 1986).[3]

[2] Eight percent of U.S. Internet users (about 12 million people) claim to have their own blog (Lenhart and Fox 2006). The growth worldwide has been explosive, from essentially none in 2000 to estimates today that range up to 185.62 million worldwide. Blogs are a remarkably democratic technology, with 72.82 million in China and at least 700,000 in Iran (Helmond 2008).
Our specific goal is to compute the proportion of blogs each day or week in each of seven categories, including extremely negative (−2), negative (−1), neutral (0), positive (1), extremely positive (2), no opinion (NA), and not a blog (NB).[4]
Although the first five categories are logically ordered, the set of all seven categories is not (which rules out innovative approaches like Wordscores, which presently requires a single dimension; Laver, Benoit, and Garry 2003). Bloggers write to express opinions and so category 0 is not common, although it and NA occur commonly if the blogger is writing primarily about something other than our subject of study. Category NB ensures that the category list is exhaustive. This coding scheme represents a difficult test case because of the mixed data types, because “sentiment categorization is more difficult than topic classification” (Pang, Lee, and Vaithyanathan 2002, 83), and because the language used ranges from the Queen’s English to “my crunchy gf thinks dubya hid the wmd’s, :)!!”[5]
[3] We obtained our list of blogs by beginning with eight public blog directories and two other sources we obtained privately, including www.globeofblogs.com, http://truthlaidbear.com, www.nycbloggers.com, http://dir.yahoo.com/Computers_and_Internet/Internet/, www.bloghop.com/highrating.htm, http://www.blogrolling.com/top.phtml, a list of blogs provided by blogrolling.com, and 1.3 million additional blogs made available to us by Blogpulse.com. We then continuously crawl out from the links or “blogroll” on each of these blogs, adding seeds along the way from Google and other sources, to identify our target population.

[4] Our specific instructions to coders read as follows: “Below is one entry in our database of blog posts. Please read the entire entry. Then, answer the questions at the bottom of this page: (1) indicate whether this entry is in fact a blog posting that contains an opinion about a national political figure. If an opinion is being expressed, (2) use the scale from −2 (extremely negative) to 2 (extremely positive) to summarize the opinion of the blog’s author about the figure.”

[5] Using hand coding to track opinion change in the blogosphere in real time is infeasible and even after the fact would be an enormously expensive task. Using unsupervised learning methods to answer the questions posed is also usually infeasible. Applied to blogs, these methods often pick up topics rather than sentiment or irrelevant features such as the informality of the text.

FIGURE 1: Blogosphere Responses to Kerry’s Botched Joke. Notes: Each line gives a time series of estimates of the proportion of all English-language blog posts in categories ranging from −2 (extremely negative, colored red) to 2 (extremely positive, colored blue). The spike in the −2 category immediately followed Kerry’s joke. Results were estimated with our nonparametric method in Section 5.2.

We now preview the type of empirical results we seek. To do this, we apply the nonparametric method described below to blogosphere opinions about John Kerry before, during, and after the botched joke in the 2006 election cycle, which was said to have caused him to not enter the 2008 contest (“You know, education—if you make the most of it ... you can do well. If you don’t, you get stuck in Iraq”). Figure 1 gives a time-series plot of the proportion of blog posts in each of the opinion categories over time. The sharp increase in the extremely negative (−2) category occurred immediately following Kerry’s joke. Note also that the concomitant drop in other categories occurred primarily from the −1 category, but even the proportion in the positive categories dropped to some degree. Although the media portrayed this joke as his motivation for not entering the race, this figure suggests that his high negatives before and after this event may have been even more relevant.
These results come from an analysis of word patterns
in 10,000 blog posts, of which only 442 from five days
in early November were actually read and hand coded
by the researchers. In other words, the method outlined
in this article recovers a highly plausible pattern for sev-
eral months using word patterns contained in a small,
nonrandom subset of just a few days when anti-Kerry
sentiment was at its peak. This was one incident in the
run-up to the 2008 campaign, but it gives a sense of the
widespread applicability of the methods. Although we do
not offer these in this article, one could easily imagine
many similar analyses of political or social events where
scale or resource constraints make it impossible to con-
tinuously read and manually categorize texts. We offer
more formal validation of our methods below.
Representing Text Statistically
We now explain how to represent unstructured text as
structured variables amenable to statistical analysis, first
by coding variables and then via statistical notation.
Coding Variables
To analyze text statistically, we represent natural language
as numerical variables following standard procedures
(Joachims 1998; Kolari, Finin, and Joshi 2006; Manning and Schütze 1999; Pang, Lee, and Vaithyanathan 2002). For example, for our key variable, we summarize a document (a blog post) with its category. Other variables are
computed from the text in three additional steps, each of
which works without human input, and all of which are
designed to reduce the complexity of text.
First, we drop non-English-language blogs (Cavnar
and Trenkle 1994), as well as spam blogs (with a tech-
nology we do not share publicly; for another, see Ko-
lari, Finin, and Joshi 2006). For the purposes of this
article, we focus on blog posts about President George
W. Bush (which we define as those that use the terms
“Bush,” “George W.,” “Dubya,” or “King George”) and
similarly for each of the 2008 presidential candidates. We
develop specific filters for each person of interest, en-
abling us to exclude others with similar names, such as to
avoid confusing Bill and Hillary Clinton. For our present
methodological purposes, we focus on 4,303 blog posts
about President Bush collected February 1–5, 2006, and
6,468 posts about Senator Hillary Clinton collected Au-
gust 26–30, 2006. Our method works without filtering
(and in foreign languages), but filters help focus the lim-
ited time of human coders on the categories of interest.
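To make the filtering step concrete, the following sketch (in Python, which the article itself does not use) applies the kind of keyword filter just described to select posts about President Bush. The pattern reproduces the terms listed above; the function name and the simple word-boundary handling are our own illustrative simplifications rather than the authors' production filters.

```python
import re

# Keyword filter for posts about President Bush, using the terms named in the
# text ("Bush," "George W.," "Dubya," "King George"). A hypothetical sketch:
# the authors' actual filters are more elaborate, e.g., excluding other people
# with similar names.
BUSH_PATTERN = re.compile(r"\bbush\b|\bgeorge w\.|\bdubya\b|\bking george\b",
                          re.IGNORECASE)

def mentions_bush(post_text):
    """Return True if a blog post matches the Bush keyword filter."""
    return bool(BUSH_PATTERN.search(post_text))

posts = ["Dubya gave a speech on the economy today.",
         "My garden is finally blooming!"]
print([mentions_bush(p) for p in posts])  # [True, False]
```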
Second, we preprocess the text within each docu-
ment by converting to lowercase, removing all punctua-
tion, and stemming by, for example, reducing “consist,”
consisted,” consistency,” consistent,” consistently,”
consisting,” and “consists” to their stem, which is con-
sist.” Preprocessing text strips out information, in addi-
tion to reducing complexity, but experience in this liter-
ature is that the trade-off is well worth it (Porter 1980;
Quinn et al. 2009).
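As an illustration of this preprocessing step, the sketch below lowercases a string, strips punctuation, and stems the remaining tokens. We use the off-the-shelf Porter stemmer from NLTK as a stand-in for the Porter (1980) algorithm cited above; the wrapper function is our own simplified example, not the authors' pipeline.

```python
import re
from nltk.stem.porter import PorterStemmer  # off-the-shelf Porter (1980) stemmer

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, remove punctuation, and stem each remaining token."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [stemmer.stem(tok) for tok in text.split()]

print(preprocess("Consistency, consistently consisting: CONSISTS."))
# ['consist', 'consist', 'consist', 'consist']
```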
Finally, we summarize the preprocessed text as dichotomous variables, one type for the presence or absence of each word stem (or “unigram”), a second type for each word pair (or “bigram”), a third type for each word triplet (or “trigram”), and so on to all “n-grams.” This definition is not limited to dictionary words. In our application, we measure only the presence or absence of stems rather than counts (the second time the word “awful” appears in a blog post does not provide as much information as the first). Even so, the number of variables remaining is enormous. For example, our sample of 10,771 blog posts about President Bush and Senator Clinton includes 201,676 unique unigrams, 2,392,027 unique bigrams, and 5,761,979 unique trigrams. The usual choice to simplify further is to consider only dichotomous stemmed unigram indicator variables (the presence or absence of each of a list of word stems), which we have found to work well. We also delete stemmed unigrams appearing in fewer than 1% or greater than 99% of all documents, which results in 3,672 variables. These procedures effectively group the infinite range of possible blog posts to “only” 2^(3,672) distinct types. This makes the problem feasible but still represents a huge number (larger than the number of elementary particles in the universe).
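The sketch below illustrates this feature construction on a toy scale: it builds presence/absence indicators for stemmed unigrams and drops stems appearing in fewer than 1% or more than 99% of documents. The variable names and the dense list-of-lists representation are ours; a corpus of the size described here would call for a sparse matrix.

```python
from collections import Counter

def stem_profiles(docs, min_share=0.01, max_share=0.99):
    """Dichotomous stemmed-unigram indicators (presence/absence, not counts),
    keeping only stems used in between 1% and 99% of documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # presence per document, not raw counts
    vocab = sorted(s for s, df in doc_freq.items()
                   if min_share <= df / n_docs <= max_share)
    profiles = []
    for doc in docs:
        present = set(doc)
        profiles.append([1 if stem in present else 0 for stem in vocab])
    return vocab, profiles

# Here `docs` would be the stemmed token lists produced by the preprocessing step.
```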
Researchers interested in similar problems in com-
puter science commonly find that “bag of words” sim-
plifications like this are highly effective (e.g., Pang, Lee,
and Vaithyanathan 2002; Sebastiani 2002), and our anal-
ysis reinforces that finding. This seems counterintuitive
at first, since it is easy to write text whose meaning is
lost when word order is discarded (e.g., “I hate Clinton.
I love Obama”). But empirically, most text sources make
the same point in enough different ways that representing
the needed information abstractly is usually sufficient. As
an analogy, when channel surfing for something to watch
on television, pausing for only a few hundred milliseconds
on a channel is typically sufficient; similarly, the negative
content of a vitriolic post about President Bush is usu-
ally easy to spot after only a sentence or two. When the
bag of words approach is not a sufficient representation,
many procedures are available: we can code where each
word stem appears in a document, tag each word with
its part of speech, or include selective bigrams, such as
by replacing “white house” with “white_house” (Das and Chen 2001). We can also use counts of variables or code
variables to represent meta-data, such as the URL, title,
blogroll, or whether the post links to known liberal or
conservative sites (Thomas, Pang, and Lee 2006). Many
other similar tricks suggested in the computer science

A METHOD OF AUTOMATED NONPARAMETRIC CONTENT ANALYSIS FOR SOCIAL SCIENCE 233
literature may be useful for some problems (Pang and
Lee 2008), and all can be included in the methodology
described below, but we have not found them necessary
forthemanyapplicationswehavetriedtodate.
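Where a selective bigram is worth keeping, one option is to join the word pair into a single token before building the unigram indicators, as in the “white_house” example above. The helper below is a hypothetical illustration of that idea, not code from the article.

```python
SELECTED_BIGRAMS = frozenset({("white", "house")})

def join_selective_bigrams(tokens, bigrams=SELECTED_BIGRAMS):
    """Replace chosen word pairs with one joined token so a unigram
    representation retains their combined meaning."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(join_selective_bigrams("the white house press corps".split()))
# ['the', 'white_house', 'press', 'corps']
```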
Notation and Quantities of Interest
Our procedures require two sets of text documents. The first is a small labeled set, for which each document i (i = 1, ..., n) is labeled with one of the given categories, usually by reading and hand coding (we discuss how large n needs to be in the sixth section, and what to do if hand coders are not sufficiently reliable in the appendix). We denote the Document category variable as D_i, which in general takes on the value D_i = j, for possible categories j = 1, ..., J.[6] (In our running example, D_i takes on the potential values {−2, −1, 0, 1, 2, NA, NB}.) We denote the second, larger population set of documents as the inferential target, in which each document ℓ (for ℓ = 1, ..., L) has an unobserved classification D_ℓ. Sometimes the labeled set is a sample from the population and so the two overlap; more often it is a nonrandom sample from a different source than the population, such as from earlier in time.

[6] This notation is from King and Lu (2008), who use related methods applied to unrelated substantive applications that do not involve coding text, and different mnemonic associations.

All other information is computed directly from the documents. To define these variables for the labeled set, denote S_ik as equal to 1 if word Stem k (k = 1, ..., K) is used at least once in document i (for i = 1, ..., n) and 0 otherwise (and similarly for the population set, substituting index i with index ℓ). This makes our abstract summary of the text of document i the set of these variables, {S_i1, ..., S_iK}, which we summarize as the K × 1 vector of word stem variables S_i. We refer to S_i as a word stem profile since it provides a summary of all the word stems (or other information) used in a document.

The quantity of interest in most of the supervised learning literature is the set of individual classifications for all documents in the population, {D_1, ..., D_L}. In contrast, the quantity of interest for most content analyses in social science is the aggregate proportion of all (or a subset of all) of these population documents that fall into each category: P(D) = {P(D = 1), ..., P(D = J)}, where P(D) is a J × 1 vector, each element of which is a proportion computed by direct tabulation:

    P(D = j) = (1/L) Σ_{ℓ=1}^{L} 1(D_ℓ = j),   (1)

where 1(a) = 1 if a is true and 0 otherwise. Document category D_i is one variable with many possible values, whereas word profile S_i constitutes a set of dichotomous variables. This means that P(D) is a multinomial distribution with J possible values and P(S) is a multinomial distribution with 2^K possible values, each of which is a word stem profile.
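In code, equation (1) is a direct tabulation over document categories. The sketch below computes P(D) for the seven categories of the running example; it assumes every document's category is known, which is exactly what is unobserved in the population and what the estimation methods discussed below are designed to handle.

```python
from collections import Counter

CATEGORIES = ["-2", "-1", "0", "1", "2", "NA", "NB"]  # the J = 7 categories

def category_proportions(labels, categories=CATEGORIES):
    """Equation (1): P(D = j) is the share of documents falling in category j."""
    counts = Counter(labels)
    total = len(labels)
    return {j: counts[j] / total for j in categories}

print(category_proportions(["-2", "-2", "1", "NB", "-2", "0"]))
# e.g., P(D = "-2") comes out to 0.5 here
```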
Issues with Existing Approaches
This section discusses problems with two common meth-
ods that arise when they are used to estimate social aggre-
gates rather than individual classifications.
Existing Approaches
A simple way of estimating P(D) is direct sampling: identify a well-defined population of interest, draw a random sample from the population, hand code all the documents in the sample, and count the documents in each category. This method requires basic sampling theory, no abstract numerical summaries of any text, and no classifications of individual documents in the unlabeled population.
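A minimal sketch of direct sampling follows: draw a simple random sample of document ids, have humans code them (represented here by a hypothetical hand_code function), and tabulate the sample proportions, with the usual binomial standard error for each category. All names are illustrative.

```python
import math
import random

def direct_sampling_estimate(population_ids, hand_code, categories, n=500, seed=1):
    """Estimate P(D) by hand coding a simple random sample of n documents.
    hand_code stands in for a human coder mapping a document id to a category."""
    random.seed(seed)
    sample = random.sample(population_ids, n)
    labels = [hand_code(doc_id) for doc_id in sample]
    est = {j: labels.count(j) / n for j in categories}
    ses = {j: math.sqrt(p * (1 - p) / n) for j, p in est.items()}
    return est, ses
```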
The second approach to estimating P(D), the aggregation of individual document classifications, is standard in the supervised learning literature. The idea is to first use the labeled sample to estimate a functional relationship between document category D and word features S. Typically, D serves as a multicategory dependent variable and is predicted with a set of explanatory variables {S_i1, ..., S_iK}, using some statistical, machine learning, or rule-based method (such as multinomial logit, regression, discriminant analysis, radial basis functions, CART, random forests, neural networks, support vector machines, maximum entropy, or others). Then the coefficients of the model are estimated, and both the coefficients and the data-generating process are assumed the same in the labeled sample as in the population. The coefficients are then used with the features measured in the population, S_ℓ, to predict the classification for each population document D_ℓ. Social scientists then aggregate the individual classifications via equation (1) to estimate their quantity of interest, P(D).
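For contrast with what follows, here is the standard classify-then-count recipe just described, sketched with a multinomial logit from scikit-learn on placeholder data. This is the aggregation approach whose bias the article goes on to document, not the authors' corrective method; the arrays stand in for word-stem profiles and hand-coded labels.

```python
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: word-stem profiles (0/1) and hand-coded categories for the
# labeled set, plus unlabeled profiles for the population of interest.
rng = np.random.default_rng(0)
X_labeled = rng.integers(0, 2, size=(300, 50))
y_labeled = rng.choice(["-2", "-1", "0", "1", "2", "NA", "NB"], size=300)
X_population = rng.integers(0, 2, size=(5000, 50))

# Step 1: fit a classifier for D given S on the labeled set (multinomial logit).
clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Step 2: predict an individual category for every population document.
predicted = clf.predict(X_population)

# Step 3: aggregate the individual predictions into proportions via equation (1).
counts = Counter(predicted)
print({j: counts[j] / len(predicted) for j in clf.classes_})
```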
Problems
Unfortunately, as Hand (2006) points out, the standard
supervised learning approach to individual document

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In European Conference on Machine Learning (ECML).
Krippendorff, Klaus. 2004. Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage.
Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
since they are optimized for a different purpose, computer science methods often produce biased estimates of these category proportions.