A Method of Automated Nonparametric Content Analysis for Social Science
Citations
The Nature and Origins of Mass Opinion.
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
Sentiment strength detection in short informal text
How Censorship in China Allows Government Criticism but Silences Collective Expression
References
Content analysis: an introduction to its methodology
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Foundations of Statistical Natural Language Processing
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
Related Papers (5)
Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
Frequently Asked Questions (11)
Q2. What future work do the authors propose in "A Method of Automated Nonparametric Content Analysis for Social Science"?
With the explosion of numerous types and huge quantities of text available to researchers on the web and elsewhere, the authors hope social scientists will begin to use these methods, and develop others, to harvest this new information and to improve their knowledge of the political, social, cultural, and economic worlds.
Q3. What can the authors use to represent the content of a post?
The authors can also use counts of variables or code variables to represent meta-data, such as the URL, title, blogroll, or whether the post links to known liberal or conservative sites (Thomas, Pang, and Lee 2006).
Q4. What types of variables are used to summarize the preprocessed text?
The authors summarize the preprocessed text as dichotomous variables: one type for the presence or absence of each word stem (or “unigram”), a second type for each word pair (or “bigram”), a third type for each word triplet (or “trigram”), and so on to all “n-grams.”
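As an illustration of the dichotomous n-gram coding described above, here is a minimal sketch. It assumes the text has already been preprocessed into a list of word stems; the function names, the toy stems, and the small vocabulary are hypothetical and not taken from the paper.

```python
def ngram_presence(stems, max_n=3):
    """Return the set of n-grams (as tuples) that occur in a list of word stems."""
    present = set()
    for n in range(1, max_n + 1):
        for i in range(len(stems) - n + 1):
            present.add(tuple(stems[i:i + n]))
    return present


def to_indicator_vector(stems, vocabulary, max_n=3):
    """Code each vocabulary n-gram as 1 (present) or 0 (absent) for one document."""
    present = ngram_presence(stems, max_n)
    return [1 if gram in present else 0 for gram in vocabulary]


# Toy document, already reduced to stems (assumed preprocessing).
stems = ["senat", "vote", "against", "bill"]

# Hypothetical vocabulary mixing a unigram, a bigram, a trigram, and an absent unigram.
vocabulary = [("vote",), ("senat", "vote"), ("vote", "against", "bill"), ("tax",)]

vector = to_indicator_vector(stems, vocabulary)
print(vector)
```

Each entry records only presence or absence, so a stem that appears many times in a post is coded the same as one that appears once.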
Q5. What is the common method of estimating P(D)?
A simple way of estimating P(D) is direct sampling: identify a well-defined population of interest, draw a random sample from the population, hand code all the documents in the sample, and count the documents in each category.
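The direct sampling procedure can be sketched in a few lines. This is an illustrative example only: the `hand_code` callable stands in for a human coder, and the toy population with its known labels is invented so the sketch can run.

```python
import random
from collections import Counter


def direct_sampling_estimate(population, hand_code, sample_size, seed=0):
    """Estimate category proportions P(D) from a hand-coded random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)      # simple random sample
    counts = Counter(hand_code(doc) for doc in sample)  # hand code and count
    return {category: n / sample_size for category, n in counts.items()}


# Toy population: each document is an (id, category) pair, so the "coder"
# below can simply look the category up. Real hand coding is done by people.
population = [(i, "negative") for i in range(700)] + \
             [(i, "positive") for i in range(700, 1000)]

estimate = direct_sampling_estimate(population, lambda doc: doc[1], sample_size=200)
print(estimate)
```

The estimate varies from sample to sample; larger samples reduce that variation at the cost of more hand coding.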
Q6. How many word stems should be used to avoid sparseness bias?
In practice, the number of word stems to choose to avoid sparseness bias mainly seems to be a function of the number of unique word stems in the documents.
Q7. How can the authors determine the optimal number of words to use per subset?
The optimal number of words to use per subset is application-specific, but can be determined empirically through cross-validation within the labeled set.
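One way such a cross-validation could be organized is sketched below. It is not the authors' implementation: `estimator` is a hypothetical callable with the signature `estimator(train_docs, train_labels, test_docs, categories, size)` that returns estimated category proportions for the held-out documents, and mean absolute error in those proportions is an assumed choice of loss.

```python
import random


def proportions(labels, categories):
    """Share of each category in a list of labels, in the order of `categories`."""
    return [sum(1 for y in labels if y == c) / len(labels) for c in categories]


def choose_subset_size(docs, labels, candidate_sizes, estimator, k=5, seed=0):
    """Pick the subset size with the lowest k-fold cross-validated error
    in the estimated category proportions of the held-out fold."""
    categories = sorted(set(labels))
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds

    mean_error = {}
    for size in candidate_sizes:
        errors = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in idx if i not in held_out]
            estimated = estimator(
                [docs[i] for i in train],
                [labels[i] for i in train],
                [docs[i] for i in fold],
                categories,
                size,
            )
            truth = proportions([labels[i] for i in fold], categories)
            # Mean absolute error across categories for this fold.
            errors.append(
                sum(abs(e - t) for e, t in zip(estimated, truth)) / len(categories)
            )
        mean_error[size] = sum(errors) / len(errors)

    best = min(mean_error, key=mean_error.get)
    return best, mean_error
```

Because the held-out folds come from the labeled set, their true proportions are known, which is what makes the comparison possible.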
Q8. What is the key advantage of estimating P(D) without the intermediate step?
A key advantage of estimating P(D) without the intermediate step of computing the individual classifications is that the required assumptions are much less restrictive.
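To make the idea of skipping individual classification concrete, here is a minimal two-category sketch. It assumes the accounting identity P(S) = Σ_j P(S | D = j) P(D = j), where S is a document's profile of word-stem indicators, with P(S | D) tabulated from the labeled set and P(S) from the unlabeled set. Solving for P(D) by one-dimensional least squares, the function names, and the toy data are illustrative assumptions, not the paper's code.

```python
from collections import Counter


def profile_distribution(vectors):
    """Relative frequency of each word-stem profile (tuple of 0/1 indicators)."""
    counts = Counter(tuple(v) for v in vectors)
    return {profile: n / len(vectors) for profile, n in counts.items()}


def estimate_share(labeled, unlabeled, first, second):
    """Estimate the share of category `first` in the unlabeled set without
    classifying any individual document.

    Solves P(S) = p * P(S|first) + (1 - p) * P(S|second) for p by least squares.
    """
    p_s_first = profile_distribution([v for v, c in labeled if c == first])
    p_s_second = profile_distribution([v for v, c in labeled if c == second])
    p_s = profile_distribution(unlabeled)

    numerator = denominator = 0.0
    for s in set(p_s) | set(p_s_first) | set(p_s_second):
        b = p_s_first.get(s, 0.0) - p_s_second.get(s, 0.0)
        a = p_s.get(s, 0.0) - p_s_second.get(s, 0.0)
        numerator += a * b
        denominator += b * b
    if denominator == 0.0:
        raise ValueError("profiles do not distinguish the two categories")
    return min(1.0, max(0.0, numerator / denominator))  # keep p in [0, 1]


# Toy labeled set: profiles over two word stems, with hand-coded categories.
labeled = ([((1, 0), "pos")] * 3 + [((1, 1), "pos")] +
           [((0, 1), "neg")] * 3 + [((1, 1), "neg")])

# Toy unlabeled set built as an exact 25% / 75% mixture of the two categories.
unlabeled = [(1, 0)] * 3 + [(0, 1)] * 9 + [(1, 1)] * 4

share_pos = estimate_share(labeled, unlabeled, "pos", "neg")
print(round(share_pos, 4))
```

Note that the labeled set here is half positive while the unlabeled set is one quarter positive; the sketch only relies on P(S | D) being the same in both sets, not on P(D) matching.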
Q9. How can the opinion expressed in a blog be tracked?
If an opinion is being expressed (2) use the scale from −2 (extremely negative) to 2 (extremely positive) to summarize the opinion of the blog’s author about the figure.” Using hand coding to track opinion change in the blogosphere in real time is infeasible, and even after the fact it would be an enormously expensive task.
Q10. What is the criterion for success in the classification literature?
The criterion for success in the classification literature, the percent correctly classified in a test set, is obviously appropriate for individual-level classification, but it can be seriously misleading when characterizing document populations.
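A small invented example shows how a respectable percent correctly classified can coexist with a badly biased category proportion. The labels and predictions below are made up for illustration: the hypothetical classifier gets every negative document right but mislabels 20 of the 50 positive ones.

```python
def accuracy(truth, predicted):
    """Percent correctly classified, as a fraction."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def share(labels, category):
    """Proportion of documents assigned to `category`."""
    return sum(1 for y in labels if y == category) / len(labels)


truth = ["pos"] * 50 + ["neg"] * 50
# Errors all run in one direction: 20 positives are labeled negative.
predicted = ["pos"] * 30 + ["neg"] * 20 + ["neg"] * 50

acc = accuracy(truth, predicted)
true_share = share(truth, "pos")
estimated_share = share(predicted, "pos")
print(acc, true_share, estimated_share)
```

Here 80% of documents are classified correctly, yet the aggregated classifications put the positive share at 30% when the true share is 50%, because the misclassifications do not cancel out.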
Q11. Why are computer science methods often biased?
Since they are optimized for a different purpose, computer science methods often produce biased estimates of these category proportions.