Proceedings ArticleDOI

You are where you tweet: a content-based approach to geo-locating twitter users

26 Oct 2010, pp. 759-768
TL;DR: A probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, which can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on.
Abstract: We propose and evaluate a probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, even in the absence of any other geospatial cues. By augmenting the massive human-powered sensing capabilities of Twitter and related microblogging services with content-derived location information, this framework can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on. Three of the key features of the proposed approach are: (i) its reliance purely on tweet content, meaning no need for user IP information, private login information, or external knowledge bases; (ii) a classification component for automatically identifying words in tweets with a strong local geo-scope; and (iii) a lattice-based neighborhood smoothing model for refining a user's location estimate. The system estimates k possible locations for each user in descending order of confidence. On average, we find that the location estimates converge quickly (needing just 100s of tweets), placing 51% of Twitter users within 100 miles of their actual location.
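A minimal sketch of the kind of estimator the abstract describes: per-city word distributions score a user's tweet words in a naive-Bayes fashion, and a simplified neighborhood-smoothing pass blends probability mass across nearby cities. The word statistics, toy city set, backoff floor, and smoothing weight mu are all illustrative assumptions, not the paper's exact model.

```python
import math

# Toy per-city word distributions p(w | city). In the paper these are
# estimated from training tweets of users with known home cities, after a
# classifier filters for words with a strong local geo-scope; the numbers
# here are made up for illustration.
WORD_CITY_PROB = {
    "howdy":  {"houston": 8e-3, "austin": 6e-3, "boston": 1e-4},
    "redsox": {"boston": 9e-3, "houston": 2e-4, "austin": 2e-4},
}
FLOOR = 1e-6  # backoff probability for unseen (word, city) pairs

def city_posteriors(tweet_words, cities):
    """Naive-Bayes-style scoring with a uniform prior over cities."""
    logp = {c: sum(math.log(WORD_CITY_PROB.get(w, {}).get(c, FLOOR))
                   for w in tweet_words) for c in cities}
    z = max(logp.values())
    unnorm = {c: math.exp(v - z) for c, v in logp.items()}
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

def neighborhood_smooth(probs, neighbors, mu=0.3):
    """Simplified stand-in for lattice-based smoothing: blend each city's
    mass with the average mass of its grid neighbors."""
    return {c: (1 - mu) * p + mu * (sum(probs[n] for n in neighbors[c]) /
                                    max(len(neighbors[c]), 1))
            for c, p in probs.items()}

NEIGHBORS = {"houston": ["austin"], "austin": ["houston"], "boston": []}
post = city_posteriors(["howdy", "redsox"], ["houston", "austin", "boston"])
ranked = sorted(neighborhood_smooth(post, NEIGHBORS).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)  # k candidate cities in descending order of confidence
```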


Citations
Posted Content
TL;DR: Data collected using Twitter's sampled API service is compared with data collected using the full, albeit costly, Firehose stream that includes every single published tweet to help researchers and practitioners understand the implications of using the Streaming API.
Abstract: Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a "Streaming API" which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.
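As a rough illustration of the kind of comparison the abstract describes, one can measure how far the sampled stream's topic distribution drifts from the Firehose's, for example with Jensen-Shannon divergence over top hashtags. The tweet representation (dicts with a "hashtags" field), the choice of metric, and the support set are assumptions of this sketch, not the authors' exact methodology.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def hashtag_counts(tweets):
    """Count hashtag occurrences; tweets are assumed to be dicts with a
    'hashtags' list (a simplifying assumption for this sketch)."""
    return Counter(tag for t in tweets for tag in t["hashtags"])

def sample_vs_firehose_jsd(sample_tweets, firehose_tweets, k=100):
    """Jensen-Shannon divergence between the sampled stream's and the
    Firehose's distributions over the Firehose's top-k hashtags."""
    full = hashtag_counts(firehose_tweets)
    samp = hashtag_counts(sample_tweets)
    support = [tag for tag, _ in full.most_common(k)]
    p = np.array([full[t] for t in support], dtype=float)
    q = np.array([samp.get(t, 0) for t in support], dtype=float)
    p /= p.sum()
    q /= q.sum() if q.sum() > 0 else 1.0
    return jensenshannon(p, q) ** 2  # scipy returns the JS *distance*

# Usage (with hypothetical datasets):
# jsd = sample_vs_firehose_jsd(streaming_api_tweets, firehose_tweets)
```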

848 citations


Cites background from "You are where you tweet: a content-..."

  • ...Geolocation is an important part of a tweet, and the study of the location of content and users is currently an active area of research (Cheng, Caverlee, and Lee 2010; Wakamiya, Lee, and Sumiya 2011)....


Journal ArticleDOI
TL;DR: A comprehensive introduction to a large body of research, more than 200 key references, is provided, with the aim of supporting the further development of recommender systems exploiting information beyond the U-I matrix.
Abstract: Over the past two decades, a large amount of research effort has been devoted to developing algorithms that generate recommendations. The resulting research progress has established the importance of the user-item (U-I) matrix, which encodes the individual preferences of users for items in a collection, for recommender systems. The U-I matrix provides the basis for collaborative filtering (CF) techniques, the dominant framework for recommender systems. Currently, new recommendation scenarios are emerging that offer promising new information that goes beyond the U-I matrix. This information can be divided into two categories related to its source: rich side information concerning users and items, and interaction information associated with the interplay of users and items. In this survey, we summarize and analyze recommendation scenarios involving information sources and the CF algorithms that have been recently developed to address them. We provide a comprehensive introduction to a large body of research, more than 200 key references, with the aim of supporting the further development of recommender systems exploiting information beyond the U-I matrix. On the basis of this material, we identify and discuss what we see as the central challenges lying ahead for recommender system technology, both in terms of extensions of existing techniques as well as of the integration of techniques and technologies drawn from other research areas.
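For readers unfamiliar with the U-I matrix the survey builds on, here is a minimal collaborative-filtering sketch: a toy rating matrix factorized into user and item factors by gradient descent on observed entries. The matrix values, dimensions, and hyperparameters are illustrative; the side-information extensions the survey covers are omitted.

```python
import numpy as np

# Toy U-I rating matrix (0 = unobserved); values are illustrative.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
k, lr, reg, epochs = 2, 0.01, 0.1, 2000
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

for _ in range(epochs):  # gradient descent on observed entries only
    err = mask * (R - U @ V.T)
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 1))  # predicted ratings, incl. unobserved cells
```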

777 citations


Cites background from "You are where you tweet: a content-..."

  • ...Similarly, the geotags of a user’s tweets may be used to trace the location of the user [Cheng et al. 2010]....


Proceedings Article
05 Jul 2011
TL;DR: It is found that LSS users follow the “Levy Flight” mobility pattern and adopt periodic behaviors; that while geographic and economic constraints affect mobility patterns, so does individual social status; and that content- and sentiment-based analysis of posts associated with checkins can provide a rich source of context for better understanding how users engage with these services.
Abstract: Location sharing services (LSS) like Foursquare, Gowalla, and Facebook Places support hundreds of millions of user-driven footprints (i.e., "checkins"). Those global-scale footprints provide a unique opportunity to study the social and temporal characteristics of how people use these services and to model patterns of human mobility, which are significant factors for the design of future mobile+location-based services, traffic forecasting, urban planning, as well as epidemiological models of disease spread. In this paper, we investigate 22 million checkins across 220,000 users and report a quantitative assessment of human mobility patterns by analyzing the spatial, temporal, social, and textual aspects associated with these footprints. We find that: (i) LSS users follow the “Levy Flight” mobility pattern and adopt periodic behaviors; (ii) While geographic and economic constraints affect mobility patterns, so does individual social status; and (iii) Content and sentiment-based analysis of posts associated with checkins can provide a rich source of context for better understanding how users engage with these services.
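A sketch of one measurement behind finding (i): compute consecutive-checkin displacements and estimate the power-law exponent of their distribution from a log-log histogram fit. The input format, binning, and distance cutoff are assumptions of this sketch rather than the paper's exact procedure.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2 +
         np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def displacement_exponent(checkins):
    """Estimate the power-law tail exponent of consecutive-checkin
    displacements via a log-log histogram fit. `checkins` is assumed to
    be a time-ordered sequence of (lat, lon) pairs."""
    d = np.array([haversine_km(*a, *b)
                  for a, b in zip(checkins, checkins[1:])])
    d = d[d > 1.0]  # ignore sub-kilometer jitter
    hist, edges = np.histogram(d, bins=np.logspace(0, 4, 30), density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers
    keep = hist > 0
    slope, _ = np.polyfit(np.log(centers[keep]), np.log(hist[keep]), 1)
    return -slope  # Levy-flight-like mobility: exponent roughly in (1, 3)
```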

742 citations


Cites background from "You are where you tweet: a content-..."

  • ...(Cheng, Caverlee, and Lee 2010) modeled the spatial distribution of words in Twitter’s user-generated content to predict the user’s location....


Journal ArticleDOI
TL;DR: This article surveys the state of the art regarding computational methods to process social media messages, highlighting both their contributions and shortcomings, and methodically examines a series of key subproblems ranging from the detection of events to the creation of actionable and useful summaries.
Abstract: Social media platforms provide active communication channels during mass convergence and emergency events such as disasters caused by natural hazards. As a result, first responders, decision makers, and the public can use this information to gain insight into the situation as it unfolds. In particular, many social media messages communicated during emergencies convey timely, actionable information. Processing social media messages to obtain such information, however, involves solving multiple challenges including: parsing brief and informal messages, handling information overload, and prioritizing different types of information found in messages. These challenges can be mapped to classical information processing operations such as filtering, classifying, ranking, aggregating, extracting, and summarizing. We survey the state of the art regarding computational methods to process social media messages and highlight both their contributions and shortcomings. In addition, we examine their particularities, and methodically examine a series of key subproblems ranging from the detection of events to the creation of actionable and useful summaries. Research thus far has, to a large extent, produced methods to extract situational awareness information from social media. In this survey, we cover these various approaches, and highlight their benefits and shortcomings. We conclude with research challenges that go beyond situational awareness, and begin to look at supporting decision making and coordinating emergency-response actions.
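To make the filtering, classifying, and ranking operations concrete, here is a toy pipeline that scores incoming messages by estimated informativeness. The training examples, feature extraction, and model choice are illustrative assumptions, not a method taken from the survey.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples standing in for labeled crisis tweets; label 1 marks a
# message carrying actionable, situational information.
texts = ["bridge collapsed on 5th ave, avoid the area",
         "need water and blankets at the shelter",
         "thoughts and prayers for everyone",
         "omg cant believe this is happening"]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

incoming = ["road blocked near the bridge, send help"]
ranked = sorted(zip(clf.predict_proba(incoming)[:, 1], incoming),
                reverse=True)
print(ranked)  # messages ranked by estimated informativeness
```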

710 citations

Journal ArticleDOI
TL;DR: In this paper, a system is described that uses natural language processing and data mining techniques to extract situation awareness information from Twitter messages generated during various disasters and crises, such as hurricanes and floods.
Abstract: The described system uses natural language processing and data mining techniques to extract situation awareness information from Twitter messages generated during various disasters and crises.

649 citations

References
Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations


"You are where you tweet: a content-..." refers methods in this paper

  • ...Using these features, we train a local word classifier using the Weka toolkit [20], which implements several standard classification algorithms like Naive Bayes, SVM, AdaBoost, etc....

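A rough Python analogue of the local-word classification step quoted above (the paper itself used Weka's implementations of Naive Bayes, SVM, AdaBoost, etc.). The two features here, a word's usage share in its top city and its geographic spread, are illustrative stand-ins for the paper's feature set.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row: [share of the word's usage in its top city, number of cities
# where the word appears]; label 1 = locally scoped word. Values invented.
X = np.array([[0.62, 3],    # "howdy"  - concentrated in few cities
              [0.55, 5],    # "redsox" - concentrated
              [0.02, 480],  # "lol"    - spread everywhere
              [0.03, 455]]) # "today"  - spread everywhere
y = np.array([1, 1, 0, 0])

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.4, 12], [0.01, 500]]))  # -> local, non-local
```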

Book
31 Jan 1986
TL;DR: Numerical Recipes: The Art of Scientific Computing as discussed by the authors is a complete text and reference book on scientific computing with over 100 new routines (now well over 300 in all), plus upgraded versions of many of the original routines, with many new topics presented at the same accessible level.
Abstract: From the Publisher: This is the revised and greatly expanded Second Edition of the hugely popular Numerical Recipes: The Art of Scientific Computing. The product of a unique collaboration among four leading scientists in academic research and industry, Numerical Recipes is a complete text and reference book on scientific computing. In a self-contained manner it proceeds from mathematical and theoretical considerations to actual practical computer routines. With over 100 new routines (now well over 300 in all), plus upgraded versions of many of the original routines, this book is more than ever the most practical, comprehensive handbook of scientific computing available today. The book retains the informal, easy-to-read style that made the first edition so popular, with many new topics presented at the same accessible level. In addition, some sections of more advanced material have been introduced, set off in small type from the main body of the text. Numerical Recipes is an ideal textbook for scientists and engineers and an indispensable reference for anyone who works in scientific computing. Highlights of the new material include a new chapter on integral equations and inverse methods; multigrid methods for solving partial differential equations; improved random number routines; wavelet transforms; the statistical bootstrap method; a new chapter on "less-numerical" algorithms including compression coding and arbitrary-precision arithmetic; band-diagonal linear systems; linear algebra on sparse matrices; Cholesky and QR decomposition; calculation of numerical derivatives; Padé approximants and rational Chebyshev approximation; new special functions; Monte Carlo integration in high-dimensional spaces; globally convergent methods for sets of nonlinear equations; an expanded chapter on fast Fourier methods; spectral analysis of unevenly sampled data; Savitzky-Golay smoothing filters; and two-dimensional Kolmogorov-Smirnov tests. All this is in addition to material on the basic topics covered in the first edition.

12,662 citations

Book
01 Jan 2008
TL;DR: In this paper, generalized estimating equations (GEEs) with computing using PROC GENMOD in SAS, and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC, are discussed.
Abstract: One chapter covers logistic regression, which concerns studying the effect of covariates on the risk of disease. The chapter includes generalized estimating equations (GEEs) with computing using PROC GENMOD in SAS, and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC. As a prelude to the following chapter on repeated-measures data, Chapter 5 presents time series analysis. The material on repeated-measures analysis uses linear additive models with GEEs and PROC MIXED in SAS for linear mixed-effects models. Chapter 7 is about survival data analysis. All computing throughout the book is done using SAS procedures.

9,995 citations

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This paper investigates the real-time interaction of events such as earthquakes in Twitter, proposes an algorithm to monitor tweets and detect a target event, and produces a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location.
Abstract: Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other comparable methods for estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected) merely by monitoring tweets. Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.
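A toy version of the particle-filtering idea for locating an event center: each tweet's (lat, lon) is treated as a noisy sensor reading, particles are reweighted by a Gaussian likelihood and resampled. The sensor model, noise scale, and jitter are assumptions of this sketch, not the paper's calibrated model.

```python
import numpy as np

rng = np.random.default_rng(42)

def particle_filter_center(observations, n_particles=1000, noise_km=30.0):
    """Toy particle filter for an event epicenter: each tweet supplies a
    noisy (lat, lon) reading of the true location."""
    sigma_deg = noise_km / 111.0  # ~111 km per degree of latitude
    # Start particles spread around the first observation.
    particles = observations[0] + rng.normal(scale=1.0, size=(n_particles, 2))
    for obs in observations:
        # Reweight particles by likelihood of the observed position.
        d2 = np.sum((particles - obs) ** 2, axis=1)
        weights = np.exp(-d2 / (2 * sigma_deg ** 2)) + 1e-12
        weights /= weights.sum()
        # Resample proportionally to weight, then jitter to keep diversity.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx] + rng.normal(scale=0.05,
                                                size=(n_particles, 2))
    return particles.mean(axis=0)  # estimated epicenter (lat, lon)

tweets_latlon = np.array([[35.7, 139.7], [35.6, 139.8], [35.8, 139.6]])
print(particle_filter_center(tweets_latlon))
```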

3,976 citations


"You are where you tweet: a content-..." refers background in this paper

  • ...Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications–Data mining; J.4 [Computer Applications]: Social and Behavioral Sciences. General Terms: Algorithms, Experimentation. Keywords: Twitter, location-based estimation, spatial data mining, text mining...


  • ...Together, the lack of user adoption of geo-based features per user or per tweet signals that the promise of Twitter as a location-based sensing system may have only limited reach and impact....
