Proceedings ArticleDOI

You are where you tweet: a content-based approach to geo-locating twitter users

26 Oct 2010, pp. 759-768
TL;DR: A probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, which can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on.
Abstract: We propose and evaluate a probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, even in the absence of any other geospatial cues. By augmenting the massive human-powered sensing capabilities of Twitter and related microblogging services with content-derived location information, this framework can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on. Three of the key features of the proposed approach are: (i) its reliance purely on tweet content, meaning no need for user IP information, private login information, or external knowledge bases; (ii) a classification component for automatically identifying words in tweets with a strong local geo-scope; and (iii) a lattice-based neighborhood smoothing model for refining a user's location estimate. The system estimates k possible locations for each user in descending order of confidence. On average, we find that the location estimates converge quickly (needing just 100s of tweets), placing 51% of Twitter users within 100 miles of their actual location.
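A minimal sketch of the kind of estimator the abstract describes: per-city word distributions score a user's tweet words in a naive-Bayes fashion, and a simplified neighborhood-smoothing pass blends probability mass across nearby cities. The word statistics, toy city set, backoff floor, and smoothing weight mu are all illustrative assumptions, not the paper's exact model.

```python
import math

# Toy per-city word distributions p(w | city). In the paper these are
# estimated from training tweets of users with known home cities, after a
# classifier filters for words with a strong local geo-scope; the numbers
# here are made up for illustration.
WORD_CITY_PROB = {
    "howdy":  {"houston": 8e-3, "austin": 6e-3, "boston": 1e-4},
    "redsox": {"boston": 9e-3, "houston": 2e-4, "austin": 2e-4},
}
FLOOR = 1e-6  # backoff probability for unseen (word, city) pairs

def city_posteriors(tweet_words, cities):
    """Naive-Bayes-style scoring with a uniform prior over cities."""
    logp = {c: sum(math.log(WORD_CITY_PROB.get(w, {}).get(c, FLOOR))
                   for w in tweet_words) for c in cities}
    z = max(logp.values())
    unnorm = {c: math.exp(v - z) for c, v in logp.items()}
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

def neighborhood_smooth(probs, neighbors, mu=0.3):
    """Simplified stand-in for lattice-based smoothing: blend each city's
    mass with the average mass of its grid neighbors."""
    return {c: (1 - mu) * p + mu * (sum(probs[n] for n in neighbors[c]) /
                                    max(len(neighbors[c]), 1))
            for c, p in probs.items()}

NEIGHBORS = {"houston": ["austin"], "austin": ["houston"], "boston": []}
post = city_posteriors(["howdy", "redsox"], ["houston", "austin", "boston"])
ranked = sorted(neighborhood_smooth(post, NEIGHBORS).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)  # k candidate cities in descending order of confidence
```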


Citations
Posted Content
TL;DR: Data collected using Twitter's sampled API service is compared with data collected using the full, albeit costly, Firehose stream that includes every single published tweet to help researchers and practitioners understand the implications of using the Streaming API.
Abstract: Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for openness in sharing its data. It provides a glance into its millions of users and billions of tweets through a "Streaming API" which provides a sample of all tweets matching some parameters preset by the API user. The API service has been used by many researchers, companies, and governmental institutions that want to extract knowledge in accordance with a diverse array of questions pertaining to social media. The essential drawback of the Twitter API is the lack of documentation concerning what and how much data users get. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. In this work we embark on answering this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream that includes every single published tweet. We compare both datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.
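As a rough illustration of the kind of comparison the abstract describes, one can measure how far the sampled stream's topic distribution drifts from the Firehose's, for example with Jensen-Shannon divergence over top hashtags. The tweet representation (dicts with a "hashtags" field), the choice of metric, and the support set are assumptions of this sketch, not the authors' exact methodology.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def hashtag_counts(tweets):
    """Count hashtag occurrences; tweets are assumed to be dicts with a
    'hashtags' list (a simplifying assumption for this sketch)."""
    return Counter(tag for t in tweets for tag in t["hashtags"])

def sample_vs_firehose_jsd(sample_tweets, firehose_tweets, k=100):
    """Jensen-Shannon divergence between the sampled stream's and the
    Firehose's distributions over the Firehose's top-k hashtags."""
    full = hashtag_counts(firehose_tweets)
    samp = hashtag_counts(sample_tweets)
    support = [tag for tag, _ in full.most_common(k)]
    p = np.array([full[t] for t in support], dtype=float)
    q = np.array([samp.get(t, 0) for t in support], dtype=float)
    p /= p.sum()
    q /= q.sum() if q.sum() > 0 else 1.0
    return jensenshannon(p, q) ** 2  # scipy returns the JS *distance*

# Usage (with hypothetical datasets):
# jsd = sample_vs_firehose_jsd(streaming_api_tweets, firehose_tweets)
```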

848 citations


Cites background from "You are where you tweet: a content-..."

  • ...Geolocation is an important part of a tweet, and the study of the location of content and users is currently an active area of research (Cheng, Caverlee, and Lee 2010; Wakamiya, Lee, and Sumiya 2011)....


Journal ArticleDOI
TL;DR: A comprehensive introduction to a large body of research, more than 200 key references, is provided, with the aim of supporting the further development of recommender systems exploiting information beyond the U-I matrix.
Abstract: Over the past two decades, a large amount of research effort has been devoted to developing algorithms that generate recommendations. The resulting research progress has established the importance of the user-item (U-I) matrix, which encodes the individual preferences of users for items in a collection, for recommender systems. The U-I matrix provides the basis for collaborative filtering (CF) techniques, the dominant framework for recommender systems. Currently, new recommendation scenarios are emerging that offer promising new information that goes beyond the U-I matrix. This information can be divided into two categories related to its source: rich side information concerning users and items, and interaction information associated with the interplay of users and items. In this survey, we summarize and analyze recommendation scenarios involving information sources and the CF algorithms that have been recently developed to address them. We provide a comprehensive introduction to a large body of research, more than 200 key references, with the aim of supporting the further development of recommender systems exploiting information beyond the U-I matrix. On the basis of this material, we identify and discuss what we see as the central challenges lying ahead for recommender system technology, both in terms of extensions of existing techniques as well as of the integration of techniques and technologies drawn from other research areas.
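For readers unfamiliar with the U-I matrix the survey builds on, here is a minimal collaborative-filtering sketch: a toy rating matrix factorized into user and item factors by gradient descent on observed entries. The matrix values, dimensions, and hyperparameters are illustrative; the side-information extensions the survey covers are omitted.

```python
import numpy as np

# Toy U-I rating matrix (0 = unobserved); values are illustrative.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
k, lr, reg, epochs = 2, 0.01, 0.1, 2000
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))  # item factors

for _ in range(epochs):  # gradient descent on observed entries only
    err = mask * (R - U @ V.T)
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

print(np.round(U @ V.T, 1))  # predicted ratings, incl. unobserved cells
```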

777 citations


Cites background from "You are where you tweet: a content-..."

  • ...Similarly, the geotags of a user’s tweets may be used to trace the location of the user [Cheng et al. 2010]....


Proceedings Article
05 Jul 2011
TL;DR: It is found that LSS users follow the “Levy Flight” mobility pattern and adopt periodic behaviors; that while geographic and economic constraints affect mobility patterns, so does individual social status; and that content- and sentiment-based analysis of posts associated with checkins can provide a rich source of context for better understanding how users engage with these services.
Abstract: Location sharing services (LSS) like Foursquare, Gowalla, and Facebook Places support hundreds of millions of user-driven footprints (i.e., "checkins"). Those global-scale footprints provide a unique opportunity to study the social and temporal characteristics of how people use these services and to model patterns of human mobility, which are significant factors for the design of future mobile+location-based services, traffic forecasting, urban planning, as well as epidemiological models of disease spread. In this paper, we investigate 22 million checkins across 220,000 users and report a quantitative assessment of human mobility patterns by analyzing the spatial, temporal, social, and textual aspects associated with these footprints. We find that: (i) LSS users follow the “Levy Flight” mobility pattern and adopt periodic behaviors; (ii) While geographic and economic constraints affect mobility patterns, so does individual social status; and (iii) Content and sentiment-based analysis of posts associated with checkins can provide a rich source of context for better understanding how users engage with these services.
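A sketch of one measurement behind finding (i): compute consecutive-checkin displacements and estimate the power-law exponent of their distribution from a log-log histogram fit. The input format, binning, and distance cutoff are assumptions of this sketch rather than the paper's exact procedure.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2 +
         np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

def displacement_exponent(checkins):
    """Estimate the power-law tail exponent of consecutive-checkin
    displacements via a log-log histogram fit. `checkins` is assumed to
    be a time-ordered sequence of (lat, lon) pairs."""
    d = np.array([haversine_km(*a, *b)
                  for a, b in zip(checkins, checkins[1:])])
    d = d[d > 1.0]  # ignore sub-kilometer jitter
    hist, edges = np.histogram(d, bins=np.logspace(0, 4, 30), density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers
    keep = hist > 0
    slope, _ = np.polyfit(np.log(centers[keep]), np.log(hist[keep]), 1)
    return -slope  # Levy-flight-like mobility: exponent roughly in (1, 3)
```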

742 citations


Cites background from "You are where you tweet: a content-..."

  • ...(Cheng, Caverlee, and Lee 2010) modeled the spatial distribution of words in Twitter’s user-generated content to predict the user’s location....


Journal ArticleDOI
TL;DR: This article surveys the state of the art regarding computational methods to process social media messages, highlighting both their contributions and shortcomings, and methodically examines a series of key subproblems ranging from the detection of events to the creation of actionable and useful summaries.
Abstract: Social media platforms provide active communication channels during mass convergence and emergency events such as disasters caused by natural hazards. As a result, first responders, decision makers, and the public can use this information to gain insight into the situation as it unfolds. In particular, many social media messages communicated during emergencies convey timely, actionable information. Processing social media messages to obtain such information, however, involves solving multiple challenges including: parsing brief and informal messages, handling information overload, and prioritizing different types of information found in messages. These challenges can be mapped to classical information processing operations such as filtering, classifying, ranking, aggregating, extracting, and summarizing. We survey the state of the art regarding computational methods to process social media messages and highlight both their contributions and shortcomings. In addition, we examine their particularities, and methodically examine a series of key subproblems ranging from the detection of events to the creation of actionable and useful summaries. Research thus far has, to a large extent, produced methods to extract situational awareness information from social media. In this survey, we cover these various approaches, and highlight their benefits and shortcomings. We conclude with research challenges that go beyond situational awareness, and begin to look at supporting decision making and coordinating emergency-response actions.
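To make the filtering, classifying, and ranking operations concrete, here is a toy pipeline that scores incoming messages by estimated informativeness. The training examples, feature extraction, and model choice are illustrative assumptions, not a method taken from the survey.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples standing in for labeled crisis tweets; label 1 marks a
# message carrying actionable, situational information.
texts = ["bridge collapsed on 5th ave, avoid the area",
         "need water and blankets at the shelter",
         "thoughts and prayers for everyone",
         "omg cant believe this is happening"]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

incoming = ["road blocked near the bridge, send help"]
ranked = sorted(zip(clf.predict_proba(incoming)[:, 1], incoming),
                reverse=True)
print(ranked)  # messages ranked by estimated informativeness
```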

710 citations

Journal ArticleDOI
TL;DR: In this paper, a system is described that uses natural language processing and data mining techniques to extract situation awareness information from Twitter messages generated during various disasters and crises, such as hurricanes and floods.
Abstract: The described system uses natural language processing and data mining techniques to extract situation awareness information from Twitter messages generated during various disasters and crises.

649 citations

References
Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations


"You are where you tweet: a content-..." refers methods in this paper

  • ...Using these features, we train a local word classifier using the Weka toolkit [20], which implements several standard classification algorithms like Naive Bayes, SVM, AdaBoost, etc....

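A rough Python analogue of the local-word classification step quoted above (the paper itself used Weka's implementations of Naive Bayes, SVM, AdaBoost, etc.). The two features here, a word's usage share in its top city and its geographic spread, are illustrative stand-ins for the paper's feature set.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row: [share of the word's usage in its top city, number of cities
# where the word appears]; label 1 = locally scoped word. Values invented.
X = np.array([[0.62, 3],    # "howdy"  - concentrated in few cities
              [0.55, 5],    # "redsox" - concentrated
              [0.02, 480],  # "lol"    - spread everywhere
              [0.03, 455]]) # "today"  - spread everywhere
y = np.array([1, 1, 0, 0])

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.4, 12], [0.01, 500]]))  # -> local, non-local
```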

Book
31 Jan 1986
TL;DR: Numerical Recipes: The Art of Scientific Computing as discussed by the authors is a complete text and reference book on scientific computing with over 100 new routines (now well over 300 in all), plus upgraded versions of many of the original routines, with many new topics presented at the same accessible level.
Abstract: From the Publisher: This is the revised and greatly expanded Second Edition of the hugely popular Numerical Recipes: The Art of Scientific Computing. The product of a unique collaboration among four leading scientists in academic research and industry, Numerical Recipes is a complete text and reference book on scientific computing. In a self-contained manner it proceeds from mathematical and theoretical considerations to actual practical computer routines. With over 100 new routines (now well over 300 in all), plus upgraded versions of many of the original routines, this book is more than ever the most practical, comprehensive handbook of scientific computing available today. The book retains the informal, easy-to-read style that made the first edition so popular, with many new topics presented at the same accessible level. In addition, some sections of more advanced material have been introduced, set off in small type from the main body of the text. Numerical Recipes is an ideal textbook for scientists and engineers and an indispensable reference for anyone who works in scientific computing. Highlights of the new material include a new chapter on integral equations and inverse methods; multigrid methods for solving partial differential equations; improved random number routines; wavelet transforms; the statistical bootstrap method; a new chapter on "less-numerical" algorithms including compression coding and arbitrary-precision arithmetic; band-diagonal linear systems; linear algebra on sparse matrices; Cholesky and QR decomposition; calculation of numerical derivatives; Padé approximants and rational Chebyshev approximation; new special functions; Monte Carlo integration in high-dimensional spaces; globally convergent methods for sets of nonlinear equations; an expanded chapter on fast Fourier methods; spectral analysis of unevenly sampled data; Savitzky-Golay smoothing filters; and two-dimensional Kolmogorov-Smirnov tests. All this is in addition to material on the basic topics covered in the first edition.

12,662 citations

Book
01 Jan 2008
TL;DR: In this paper, generalized estimating equations (GEEs) with computing using PROC GENMOD in SAS, and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC, are discussed.
Abstract: One chapter covers logistic regression, which concerns studying the effect of covariates on the risk of disease. The chapter includes generalized estimating equations (GEEs) with computing using PROC GENMOD in SAS, and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC. As a prelude to the following chapter on repeated-measures data, Chapter 5 presents time series analysis. The material on repeated-measures analysis uses linear additive models with GEEs and PROC MIXED in SAS for linear mixed-effects models. Chapter 7 is about survival data analysis. All computing throughout the book is done using SAS procedures.

9,995 citations

Proceedings ArticleDOI
26 Apr 2010
TL;DR: This paper investigates the real-time interaction of events such as earthquakes in Twitter, proposes an algorithm to monitor tweets and detect a target event, and produces a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location.
Abstract: Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other comparable methods for estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected) merely by monitoring tweets. Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.
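A toy version of the particle-filtering idea for locating an event center: each tweet's (lat, lon) is treated as a noisy sensor reading, particles are reweighted by a Gaussian likelihood and resampled. The sensor model, noise scale, and jitter are assumptions of this sketch, not the paper's calibrated model.

```python
import numpy as np

rng = np.random.default_rng(42)

def particle_filter_center(observations, n_particles=1000, noise_km=30.0):
    """Toy particle filter for an event epicenter: each tweet supplies a
    noisy (lat, lon) reading of the true location."""
    sigma_deg = noise_km / 111.0  # ~111 km per degree of latitude
    # Start particles spread around the first observation.
    particles = observations[0] + rng.normal(scale=1.0, size=(n_particles, 2))
    for obs in observations:
        # Reweight particles by likelihood of the observed position.
        d2 = np.sum((particles - obs) ** 2, axis=1)
        weights = np.exp(-d2 / (2 * sigma_deg ** 2)) + 1e-12
        weights /= weights.sum()
        # Resample proportionally to weight, then jitter to keep diversity.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx] + rng.normal(scale=0.05,
                                                size=(n_particles, 2))
    return particles.mean(axis=0)  # estimated epicenter (lat, lon)

tweets_latlon = np.array([[35.7, 139.7], [35.6, 139.8], [35.8, 139.6]])
print(particle_filter_center(tweets_latlon))
```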

3,976 citations


"You are where you tweet: a content-..." refers background in this paper

  • ...Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications–Data mining; J.4 [Computer Applications]: Social and Behavioral Sciences. General Terms: Algorithms, Experimentation. Keywords: Twitter, location-based estimation, spatial data mining, text mining...


  • ...Together, the lack of user adoption of geo-based features per user or per tweet signals that the promise of Twitter as a location-based sensing system may have only limited reach and impact....
