Open Access, Proceedings Article

Dealing with big data: The case of Twitter

A.P.J. van den Bosch, +1 more
- Vol. 3, pp 121-134
TLDR
This paper shows how the data was collected and stored, and how the usefulness of this tweet analysis resource was determined: relating word frequency to real-life events, finding words related to a topic, and gathering information about conversations.
Abstract
As data sets keep growing, computational linguists are experiencing more big data problems: challenging demands on storage and processing caused by very large data sets. An example of this is dealing with social media data: including metadata, the messages of the social media site Twitter in 2012 comprise more than 250 terabytes of structured text. Handling data volumes like this requires parallel computing architectures with appropriate software tools. In this paper we present our experiences in working with such a big data set, a collection of two billion Dutch tweets. We show how we collected and stored the data. Next we deal with searching in the data using the Hadoop framework and visualizing search results. In order to determine the usefulness of this tweet analysis resource, we have performed three case studies based on the data: relating word frequency to real-life events, finding words related to a topic, and gathering information about conversations. The three case studies are presented in this paper. Access to this current and expanding tweet data set is offered via the website twiqs.nl.
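The first case study, relating word frequency to real-life events, rests on counting how often a term occurs per day. The following is a minimal sketch of that idea in a map/reduce style, written in plain Python rather than on Hadoop. It is not the authors' implementation: the record layout (`date` and `text` fields), the whitespace tokenization, the function names, and the sample tweets are all assumptions made for illustration.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample records; the real collection's storage format is not
# described in the abstract, so these field names are assumptions.
SAMPLE_TWEETS = [
    {"date": "2012-01-01", "text": "Gelukkig nieuwjaar allemaal"},
    {"date": "2012-01-01", "text": "nieuwjaar vieren met vuurwerk"},
    {"date": "2012-01-02", "text": "Weer aan het werk"},
    {"date": "2012-01-02", "text": "Nog een laat nieuwjaar feest"},
]


def map_tweet(tweet, term):
    """Map step: emit (day, 1) if the tweet contains the term as a token."""
    # Naive lowercase whitespace tokenization, an assumption for this sketch.
    tokens = tweet["text"].lower().split()
    if term in tokens:
        yield tweet["date"], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per day."""
    counts = Counter()
    for day, n in pairs:
        counts[day] += n
    return dict(counts)


def daily_frequency(tweets, term):
    """Count, per day, the number of tweets that mention the term."""
    term = term.lower()
    pairs = chain.from_iterable(map_tweet(t, term) for t in tweets)
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(daily_frequency(SAMPLE_TWEETS, "nieuwjaar"))
```

On a parallel framework the map step would run independently over partitions of the collection and the reduce step would merge the per-day counts, which is why the two steps are kept as separate functions here.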

