
A Method of Automated Nonparametric Content
Analysis for Social Science
Daniel J. Hopkins
Georgetown University
Gary King
Harvard University
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.
American Journal of Political Science, Vol. 54, No. 1, January 2010, Pp. 229–247. © 2010, Midwest Political Science Association.

Daniel J. Hopkins is Assistant Professor of Government, Georgetown University, 681 Intercultural Center, Washington, DC 20057 (dhopkins@iq.harvard.edu, http://www.danhopkins.org). Gary King is Albert J. Weatherhead III University Professor, Harvard University, Institute for Quantitative Social Science, 1737 Cambridge St., Cambridge, MA 02138 (king@harvard.edu, http://gking.harvard.edu). Replication materials are available at Hopkins and King (2009); see http://hdl.handle.net/1902.1/12898. Our special thanks to our indefatigable undergraduate coders Sam Caporal, Katie Colton, Nicholas Hayes, Grace Kim, Matthew Knowles, Katherine McCabe, Andrew Prokop, and Keneshia Washington. Each coded numerous blogs, dealt with the unending changes we made to our coding schemes, and made many important suggestions that greatly improved our work. Matthew Knowles also helped us track down and understand the many scholarly literatures that intersected with our work, and Steven Melendez provided invaluable computer science wizardry; both are coauthors of the open source and free computer program that implements the methods described herein (ReadMe: Software for Automated Content Analysis; see http://gking.harvard.edu/readme). We thank Ying Lu for her wisdom and advice, Stuart Shieber for introducing us to the relevant computer science literature, and http://Blogpulse.com for getting us started with more than a million blog URLs. Thanks to Ken Benoit, Doug Bond, Justin Grimmer, Matt Hindman, Dan Ho, Pranam Kolari, Mark Kantrowitz, Lillian Lee, Will Lowe, Andrew Martin, Burt Monroe, Stephen Purpura, Phil Schrodt, Stuart Shulman, and Kevin Quinn for helpful suggestions or data. Thanks also to the Library of Congress (PA#NDP03-1), the Center for the Study of American Politics at Yale University, the Multidisciplinary Program on Inequality and Social Policy, and the Institute for Quantitative Social Science for research support.

Efforts to systematically categorize text documents date to the late 1600s, when the Church tracked the proportion of printed texts which were nonreligious (Krippendorff 2004). Similar techniques were used by earlier generations of social scientists, including Waples, Berelson, and Bradshaw (1940, which apparently includes the first use of the term “content analysis”) and Berelson and de Grazia (1947). Content analyses like these have spread to a vast array of fields, with automated methods now joining projects based on hand coding, and have increased at least sixfold from 1980 to 2002 (Neuendorf 2002). The recent explosive increase in web pages, blogs, emails, digitized books and articles, transcripts, and electronic versions of government documents (Lyman and Varian 2003) suggests the potential for many new applications. Given the infeasibility of much larger scale human-based coding, the need for automated methods is growing fast. Indeed, large-scale projects based solely on hand coding have stopped altogether in some fields (King and Lowe 2003, 618).
This article introduces new methods of automated
content analysis designed to estimate the primary quan-
tity of interest in many social science applications. These
new methods take as data a potentially large set of
text documents, of which a small subset is hand coded
into an investigator-chosen set of mutually exclusive and
exhaustive categories.[1] As output, the methods give ap-
proximately unbiased and statistically consistent esti-
mates of the proportion of all documents in each category.
Accurate estimates of these document category proportions
have not been a goal of most work in the classification lit-
erature, which has focused instead on increasing the accu-
racy of classification into individual document categories.
Unfortunately, methods tuned to maximize the percent
of documents correctly classified can still produce sub-
stantial biases in the aggregate proportion of documents
within each category. This poses no problem for the task
for which these methods were designed, but it suggests
that a new approach may be of use for many social science
applications.
When social scientists use formal content analysis, it
is typically to make generalizations using document cat-
egory proportions. Consider examples as far-ranging as
Mayhew (1991, chap. 3), Gamson (1992, chaps. 3, 6, 7,
and 9), Zaller (1992, chap. 9), Gerring (1998, chaps. 3–7),
Mutz (1998, chap. 8), Gilens (1999, chap. 5), Mendel-
berg (2001, chap. 5), Rudalevige (2002, chap. 4), Kellstedt
(2003, chap. 2), Jones and Baumgartner (2005, chaps.
3–10), and Hillygus and Shields (2008, chap. 6). In all
these cases and many others, researchers conducted con-
tent analyses to learn about the distribution of classifi-
cations in a population, not to assert the classification
of any particular document (which would be easy to do
through a close reading of the document in question). For
example, the manager of a congressional office would find
useful an automated method of sorting individual con-
stituent letters by policy area so they can be routed to the
most informed staffer to draft a response. In contrast, po-
litical scientists would be interested primarily in tracking
the proportion of mail (and thus constituent concerns)
in each policy area. Policy makers or computer scientists
may be interested in finding the needle in the haystack (such as a potential terrorist threat or the right web page to display from a search), but social scientists are more commonly interested in characterizing the haystack. Certainly, individual document classifications, when available, provide additional information to social scientists, since they enable one to aggregate in unanticipated ways, serve as variables in regression-type analyses, and help guide deeper qualitative inquiries into the nature of specific documents. But they do not usually (as in Benoit and Laver 2003) constitute the ultimate quantities of interest.

[1] Although some excellent content analysis methods are able to delegate to the computer both the choice of the categorization scheme and the classification of documents into the chosen categories, our applications require methods where the social scientist chooses the questions and the data provide the answers. The former so-called “unsupervised learning methods” are versions of cluster analysis and have the great advantage of requiring fewer startup costs, since no theoretical choices about categories are necessary ex ante and no hand coding is required (Quinn et al. 2009; Simon and Xenos 2004). In contrast, the latter so-called “supervised learning methods,” which require a choice of categories and a sample of hand-coded documents, have the advantage of letting the social scientist, rather than the computer program, determine the most theoretically interesting questions (Kolari, Finin, and Joshi 2006; Laver, Benoit, and Garry 2003; Pang, Lee, and Vaithyanathan 2002). These approaches, and others such as dictionary-based methods (Gerner et al. 1994; King and Lowe 2003), accomplish somewhat different tasks and so can often be productively used together, such as for discovering a relevant set of categories in part from the data.
Automated content analysis is a new field and is newer
still within political science. We thus begin in the second
section with a concrete example to help fix ideas and de-
fine key concepts, including an analysis of expressed opinion through blog posts about Senator John Kerry. We next
explain how to represent unstructured text as structured
variables amenable to statistical analysis. The following
section discusses problems with existing methods. We
introduce our methods in the fifth section along with
empirical verification from several data sets in the sixth
section. The last section concludes. The appendix pro-
vides intercoder reliability statistics and offers a method
for coping with errors in hand-coded documents.
Measuring Political Opinions in
Blogs: A Running Example
Although our methodology works for any unstructured
text, we use blogs as our running example. Blogs (or “web
logs”) are periodic web postings usually listed in reverse
chronological order.[2] For present purposes, we define our
inferential target as expressed sentiment about each can-
didate in the 2008 American presidential election. Mea-
suring the national conversation in this way is not the
only way to define the population of interest, but it seems
to be of considerable public interest and may also be of
interest to political scientists studying activists (Verba,
Schlozman, and Brady 1995), the media (Drezner and
Farrell 2004), public opinion (Gamson 1992), social net-
works (Adamic and Glance 2005; Huckfeldt and Sprague
1995), or elite influence (Grindle 2005; Hindman, Tsiout-
siouliklis, and Johnson 2003; Zaller 1992). We attempted
to collect all English-language blog posts from highly
political people who blog about politics all the time, as well as others who normally blog about gardening or their love lives, but choose to join the national conversation about the presidency for one or more posts. Bloggers’ opinions get counted when they post and not otherwise, just as in earlier centuries when public opinion was synonymous with visible public expressions rather than attitudes and nonattitudes expressed in survey responses (Ginsberg 1986).[3]

[2] Eight percent of U.S. Internet users (about 12 million people) claim to have their own blog (Lenhart and Fox 2006). The growth worldwide has been explosive, from essentially none in 2000 to estimates today that range up to 185.62 million worldwide. Blogs are a remarkably democratic technology, with 72.82 million in China and at least 700,000 in Iran (Helmond 2008).
Our specific goal is to compute the proportion of blogs each day or week in each of seven categories, including extremely negative (−2), negative (−1), neutral (0), positive (1), extremely positive (2), no opinion (NA), and not a blog (NB).[4]
Although the first five categories are logically ordered, the set of all seven categories is not (which rules out innovative approaches like Wordscores, which presently requires a single dimension; Laver, Benoit, and Garry 2003). Bloggers write to express opinions and so category 0 is not common, although it and NA occur commonly if the blogger is writing primarily about something other than our subject of study. Category NB ensures that the category list is exhaustive. This coding scheme represents a difficult test case because of the mixed data types, because “sentiment categorization is more difficult than topic classification” (Pang, Lee, and Vaithyanathan 2002, 83), and because the language used ranges from the Queen’s English to “my crunchy gf thinks dubya hid the wmd’s, :)!!”[5]
[3] We obtained our list of blogs by beginning with eight public blog directories and two other sources we obtained privately, including www.globeofblogs.com, http://truthlaidbear.com, www.nycbloggers.com, http://dir.yahoo.com/Computers_and_Internet/Internet/, www.bloghop.com/highrating.htm, http://www.blogrolling.com/top.phtml, a list of blogs provided by blogrolling.com, and 1.3 million additional blogs made available to us by Blogpulse.com. We then continuously crawl out from the links or “blogroll” on each of these blogs, adding seeds along the way from Google and other sources, to identify our target population.

[4] Our specific instructions to coders read as follows: “Below is one entry in our database of blog posts. Please read the entire entry. Then, answer the questions at the bottom of this page: (1) indicate whether this entry is in fact a blog posting that contains an opinion about a national political figure. If an opinion is being expressed, (2) use the scale from −2 (extremely negative) to 2 (extremely positive) to summarize the opinion of the blog’s author about the figure.”

[5] Using hand coding to track opinion change in the blogosphere in real time is infeasible and even after the fact would be an enormously expensive task. Using unsupervised learning methods to answer the questions posed is also usually infeasible. Applied to blogs, these methods often pick up topics rather than sentiment or irrelevant features such as the informality of the text.

FIGURE 1: Blogosphere Responses to Kerry’s Botched Joke. Notes: Each line gives a time series of estimates of the proportion of all English-language blog posts in categories ranging from −2 (extremely negative, colored red) to 2 (extremely positive, colored blue). The spike in the −2 category immediately followed Kerry’s joke. Results were estimated with our nonparametric method in Section 5.2.

We now preview the type of empirical results we seek. To do this, we apply the nonparametric method described below to blogosphere opinions about John Kerry before, during, and after the botched joke in the 2006 election cycle, which was said to have caused him to not enter the 2008 contest (“You know, education—if you make the most of it ... you can do well. If you don’t, you get stuck in Iraq”). Figure 1 gives a time-series plot of the proportion of blog posts in each of the opinion categories over time. The sharp increase in the extremely negative (−2) category occurred immediately following Kerry’s joke. Note also that the concomitant drop in other categories occurred primarily from the −1 category, but even the proportion in the positive categories dropped to some degree. Although the media portrayed this joke as his motivation for not entering the race, this figure suggests that his high negatives before and after this event may have been even more relevant.
These results come from an analysis of word patterns
in 10,000 blog posts, of which only 442 from five days
in early November were actually read and hand coded
by the researchers. In other words, the method outlined
in this article recovers a highly plausible pattern for sev-
eral months using word patterns contained in a small,
nonrandom subset of just a few days when anti-Kerry
sentiment was at its peak. This was one incident in the
run-up to the 2008 campaign, but it gives a sense of the
widespread applicability of the methods. Although we do
not offer these in this article, one could easily imagine
many similar analyses of political or social events where
scale or resource constraints make it impossible to con-
tinuously read and manually categorize texts. We offer
more formal validation of our methods below.
Representing Text Statistically
We now explain how to represent unstructured text as
structured variables amenable to statistical analysis, first
by coding variables and then via statistical notation.
Coding Variables
To analyze text statistically, we represent natural language
as numerical variables following standard procedures
(Joachims 1998; Kolari, Finin, and Joshi 2006; Manning and Schütze 1999; Pang, Lee, and Vaithyanathan 2002). For example, for our key variable, we summarize a document (a blog post) with its category. Other variables are
computed from the text in three additional steps, each of
which works without human input, and all of which are
designed to reduce the complexity of text.
First, we drop non-English-language blogs (Cavnar
and Trenkle 1994), as well as spam blogs (with a tech-
nology we do not share publicly; for another, see Ko-
lari, Finin, and Joshi 2006). For the purposes of this
article, we focus on blog posts about President George
W. Bush (which we define as those that use the terms
“Bush,” “George W.,” “Dubya,” or “King George”) and
similarly for each of the 2008 presidential candidates. We
develop specific filters for each person of interest, en-
abling us to exclude others with similar names, such as to
avoid confusing Bill and Hillary Clinton. For our present
methodological purposes, we focus on 4,303 blog posts
about President Bush collected February 1–5, 2006, and
6,468 posts about Senator Hillary Clinton collected Au-
gust 26–30, 2006. Our method works without filtering
(and in foreign languages), but filters help focus the lim-
ited time of human coders on the categories of interest.
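To make the filtering step concrete, the following sketch (in Python, which the article itself does not use) applies the kind of keyword filter just described to select posts about President Bush. The pattern reproduces the terms listed above; the function name and the simple word-boundary handling are our own illustrative simplifications rather than the authors' production filters.

```python
import re

# Keyword filter for posts about President Bush, using the terms named in the
# text ("Bush," "George W.," "Dubya," "King George"). A hypothetical sketch:
# the authors' actual filters are more elaborate, e.g., excluding other people
# with similar names.
BUSH_PATTERN = re.compile(r"\bbush\b|\bgeorge w\.|\bdubya\b|\bking george\b",
                          re.IGNORECASE)

def mentions_bush(post_text):
    """Return True if a blog post matches the Bush keyword filter."""
    return bool(BUSH_PATTERN.search(post_text))

posts = ["Dubya gave a speech on the economy today.",
         "My garden is finally blooming!"]
print([mentions_bush(p) for p in posts])  # [True, False]
```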
Second, we preprocess the text within each docu-
ment by converting to lowercase, removing all punctua-
tion, and stemming by, for example, reducing “consist,”
consisted,” consistency,” consistent,” consistently,”
consisting,” and “consists” to their stem, which is con-
sist.” Preprocessing text strips out information, in addi-
tion to reducing complexity, but experience in this liter-
ature is that the trade-off is well worth it (Porter 1980;
Quinn et al. 2009).
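As an illustration of this preprocessing step, the sketch below lowercases a string, strips punctuation, and stems the remaining tokens. We use the off-the-shelf Porter stemmer from NLTK as a stand-in for the Porter (1980) algorithm cited above; the wrapper function is our own simplified example, not the authors' pipeline.

```python
import re
from nltk.stem.porter import PorterStemmer  # off-the-shelf Porter (1980) stemmer

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, remove punctuation, and stem each remaining token."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [stemmer.stem(tok) for tok in text.split()]

print(preprocess("Consistency, consistently consisting: CONSISTS."))
# ['consist', 'consist', 'consist', 'consist']
```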
Finally, we summarize the preprocessed text as dichotomous variables, one type for the presence or absence of each word stem (or “unigram”), a second type for each word pair (or “bigram”), a third type for each word triplet (or “trigram”), and so on to all “n-grams.” This definition is not limited to dictionary words. In our application, we measure only the presence or absence of stems rather than counts (the second time the word “awful” appears in a blog post does not provide as much information as the first). Even so, the number of variables remaining is enormous. For example, our sample of 10,771 blog posts about President Bush and Senator Clinton includes 201,676 unique unigrams, 2,392,027 unique bigrams, and 5,761,979 unique trigrams. The usual choice to simplify further is to consider only dichotomous stemmed unigram indicator variables (the presence or absence of each of a list of word stems), which we have found to work well. We also delete stemmed unigrams appearing in fewer than 1% or greater than 99% of all documents, which results in 3,672 variables. These procedures effectively group the infinite range of possible blog posts to “only” 2^(3,672) distinct types. This makes the problem feasible but still represents a huge number (larger than the number of elementary particles in the universe).
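The sketch below illustrates this feature construction on a toy scale: it builds presence/absence indicators for stemmed unigrams and drops stems appearing in fewer than 1% or more than 99% of documents. The variable names and the dense list-of-lists representation are ours; a corpus of the size described here would call for a sparse matrix.

```python
from collections import Counter

def stem_profiles(docs, min_share=0.01, max_share=0.99):
    """Dichotomous stemmed-unigram indicators (presence/absence, not counts),
    keeping only stems used in between 1% and 99% of documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # presence per document, not raw counts
    vocab = sorted(s for s, df in doc_freq.items()
                   if min_share <= df / n_docs <= max_share)
    profiles = []
    for doc in docs:
        present = set(doc)
        profiles.append([1 if stem in present else 0 for stem in vocab])
    return vocab, profiles

# Here `docs` would be the stemmed token lists produced by the preprocessing step.
```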
Researchers interested in similar problems in com-
puter science commonly find that “bag of words” sim-
plifications like this are highly effective (e.g., Pang, Lee,
and Vaithyanathan 2002; Sebastiani 2002), and our anal-
ysis reinforces that finding. This seems counterintuitive
at first, since it is easy to write text whose meaning is
lost when word order is discarded (e.g., “I hate Clinton.
I love Obama”). But empirically, most text sources make
the same point in enough different ways that representing
the needed information abstractly is usually sufficient. As
an analogy, when channel surfing for something to watch
on television, pausing for only a few hundred milliseconds
on a channel is typically sufficient; similarly, the negative
content of a vitriolic post about President Bush is usu-
ally easy to spot after only a sentence or two. When the
bag of words approach is not a sufficient representation,
many procedures are available: we can code where each
word stem appears in a document, tag each word with
its part of speech, or include selective bigrams, such as
by replacing “white house” with “white_house” (Das and Chen 2001). We can also use counts of variables or code
variables to represent meta-data, such as the URL, title,
blogroll, or whether the post links to known liberal or
conservative sites (Thomas, Pang, and Lee 2006). Many
other similar tricks suggested in the computer science

A METHOD OF AUTOMATED NONPARAMETRIC CONTENT ANALYSIS FOR SOCIAL SCIENCE 233
literature may be useful for some problems (Pang and
Lee 2008), and all can be included in the methodology
described below, but we have not found them necessary
forthemanyapplicationswehavetriedtodate.
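Where a selective bigram is worth keeping, one option is to join the word pair into a single token before building the unigram indicators, as in the “white_house” example above. The helper below is a hypothetical illustration of that idea, not code from the article.

```python
SELECTED_BIGRAMS = frozenset({("white", "house")})

def join_selective_bigrams(tokens, bigrams=SELECTED_BIGRAMS):
    """Replace chosen word pairs with one joined token so a unigram
    representation retains their combined meaning."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(join_selective_bigrams("the white house press corps".split()))
# ['the', 'white_house', 'press', 'corps']
```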
Notation and Quantities of Interest
Our procedures require two sets of text documents. The first is a small labeled set, for which each document i (i = 1, ..., n) is labeled with one of the given categories, usually by reading and hand coding (we discuss how large n needs to be in the sixth section, and what to do if hand coders are not sufficiently reliable in the appendix). We denote the Document category variable as D_i, which in general takes on the value D_i = j, for possible categories j = 1, ..., J.[6] (In our running example, D_i takes on the potential values {−2, −1, 0, 1, 2, NA, NB}.) We denote the second, larger population set of documents as the inferential target, in which each document ℓ (for ℓ = 1, ..., L) has an unobserved classification D_ℓ. Sometimes the labeled set is a sample from the population and so the two overlap; more often it is a nonrandom sample from a different source than the population, such as from earlier in time.

[6] This notation is from King and Lu (2008), who use related methods applied to unrelated substantive applications that do not involve coding text, and different mnemonic associations.

All other information is computed directly from the documents. To define these variables for the labeled set, denote S_ik as equal to 1 if word Stem k (k = 1, ..., K) is used at least once in document i (for i = 1, ..., n) and 0 otherwise (and similarly for the population set, substituting index i with index ℓ). This makes our abstract summary of the text of document i the set of these variables, {S_i1, ..., S_iK}, which we summarize as the K × 1 vector of word stem variables S_i. We refer to S_i as a word stem profile since it provides a summary of all the word stems (or other information) used in a document.

The quantity of interest in most of the supervised learning literature is the set of individual classifications for all documents in the population, {D_1, ..., D_L}. In contrast, the quantity of interest for most content analyses in social science is the aggregate proportion of all (or a subset of all) of these population documents that fall into each category: P(D) = {P(D = 1), ..., P(D = J)}, where P(D) is a J × 1 vector, each element of which is a proportion computed by direct tabulation:

    P(D = j) = (1/L) Σ_{ℓ=1}^{L} 1(D_ℓ = j),   (1)

where 1(a) = 1 if a is true and 0 otherwise. Document category D_i is one variable with many possible values, whereas word profile S_i constitutes a set of dichotomous variables. This means that P(D) is a multinomial distribution with J possible values and P(S) is a multinomial distribution with 2^K possible values, each of which is a word stem profile.
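In code, equation (1) is a direct tabulation over document categories. The sketch below computes P(D) for the seven categories of the running example; it assumes every document's category is known, which is exactly what is unobserved in the population and what the estimation methods discussed below are designed to handle.

```python
from collections import Counter

CATEGORIES = ["-2", "-1", "0", "1", "2", "NA", "NB"]  # the J = 7 categories

def category_proportions(labels, categories=CATEGORIES):
    """Equation (1): P(D = j) is the share of documents falling in category j."""
    counts = Counter(labels)
    total = len(labels)
    return {j: counts[j] / total for j in categories}

print(category_proportions(["-2", "-2", "1", "NB", "-2", "0"]))
# e.g., P(D = "-2") comes out to 0.5 here
```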
Issues with Existing Approaches
This section discusses problems with two common meth-
ods that arise when they are used to estimate social aggre-
gates rather than individual classifications.
Existing Approaches
A simple way of estimating P(D) is direct sampling: identify a well-defined population of interest, draw a random sample from the population, hand code all the documents in the sample, and count the documents in each category. This method requires basic sampling theory, no abstract numerical summaries of any text, and no classifications of individual documents in the unlabeled population.
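A minimal sketch of direct sampling follows: draw a simple random sample of document ids, have humans code them (represented here by a hypothetical hand_code function), and tabulate the sample proportions, with the usual binomial standard error for each category. All names are illustrative.

```python
import math
import random

def direct_sampling_estimate(population_ids, hand_code, categories, n=500, seed=1):
    """Estimate P(D) by hand coding a simple random sample of n documents.
    hand_code stands in for a human coder mapping a document id to a category."""
    random.seed(seed)
    sample = random.sample(population_ids, n)
    labels = [hand_code(doc_id) for doc_id in sample]
    est = {j: labels.count(j) / n for j in categories}
    ses = {j: math.sqrt(p * (1 - p) / n) for j, p in est.items()}
    return est, ses
```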
The second approach to estimating P(D), the aggregation of individual document classifications, is standard in the supervised learning literature. The idea is to first use the labeled sample to estimate a functional relationship between document category D and word features S. Typically, D serves as a multicategory dependent variable and is predicted with a set of explanatory variables {S_i1, ..., S_iK}, using some statistical, machine learning, or rule-based method (such as multinomial logit, regression, discriminant analysis, radial basis functions, CART, random forests, neural networks, support vector machines, maximum entropy, or others). Then the coefficients of the model are estimated, and both the coefficients and the data-generating process are assumed the same in the labeled sample as in the population. The coefficients are then used with the features measured in the population, S_ℓ, to predict the classification for each population document D_ℓ. Social scientists then aggregate the individual classifications via equation (1) to estimate their quantity of interest, P(D).
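For contrast with what follows, here is the standard classify-then-count recipe just described, sketched with a multinomial logit from scikit-learn on placeholder data. This is the aggregation approach whose bias the article goes on to document, not the authors' corrective method; the arrays stand in for word-stem profiles and hand-coded labels.

```python
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: word-stem profiles (0/1) and hand-coded categories for the
# labeled set, plus unlabeled profiles for the population of interest.
rng = np.random.default_rng(0)
X_labeled = rng.integers(0, 2, size=(300, 50))
y_labeled = rng.choice(["-2", "-1", "0", "1", "2", "NA", "NB"], size=300)
X_population = rng.integers(0, 2, size=(5000, 50))

# Step 1: fit a classifier for D given S on the labeled set (multinomial logit).
clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Step 2: predict an individual category for every population document.
predicted = clf.predict(X_population)

# Step 3: aggregate the individual predictions into proportions via equation (1).
counts = Counter(predicted)
print({j: counts[j] / len(predicted) for j in clf.classes_})
```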
Problems
Unfortunately, as Hand (2006) points out, the standard
supervised learning approach to individual document

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Joachims, Thorsten. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” In European Conference on Machine Learning (ECML).
Krippendorff, Klaus. 2004. Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage.
Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
since they are optimized for a different purpose, computer science methods often produce biased estimates of these category proportions.