Home
/
Authors
/
Vanessa Murdock

Author

Vanessa Murdock

Other affiliations: University of Massachusetts Amherst, University of Waterloo, Colorado State University ...read more

Bio: Vanessa Murdock is an academic researcher from Microsoft. The author has contributed to research in topics: Web search query & Ranking (information retrieval). The author has an hindex of 31, co-authored 97 publications receiving 3205 citations. Previous affiliations of Vanessa Murdock include University of Massachusetts Amherst & University of Waterloo.

Papers published on a yearly basis

2023
2022
2019
2018
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2005
2004
2002
2001

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Know your neighbors: web spam detection using the web topology

[...]

Carlos Castillo¹, Debora Donato¹, Aristides Gionis¹, Vanessa Murdock¹, Fabrizio Silvestri² - Show less +1 more•Institutions (2)

Yahoo!¹, Istituto di Scienza e Tecnologie dell'Informazione²

23 Jul 2007

TL;DR: A spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages, which finds that linked hosts tend to belong to the same class.

...read moreread less

Abstract: Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.

...read moreread less

362 citations

Proceedings Article•DOI•

Placing flickr photos on a map

[...]

Pavel Serdyukov¹, Vanessa Murdock², Roelof van Zwol²•Institutions (2)

University of Twente¹, Yahoo!²

19 Jul 2009

TL;DR: Generic methods for placing photos uploaded to Flickr on the World map use the textual annotations provided by the users to predict the single most probable location where the image was taken to achieve at least twice the precision of the state-of-the-art reported in the literature.

...read moreread less

Abstract: In this paper we investigate generic methods for placing photos uploaded to Flickr on the World map. As primary input for our methods we use the textual annotations provided by the users to predict the single most probable location where the image was taken. Central to our approach is a language model based entirely on the annotations provided by users. We define extensions to improve over the language model using tag-based smoothing and cell-based smoothing, and leveraging spatial ambiguity. Further we demonstrate how to incorporate GeoNames\footnote{http://www.geonames.org visited May 2009}, a large external database of locations. For varying levels of granularity, we are able to place images on a map with at least twice the precision of the state-of-the-art reported in the literature.

...read moreread less

316 citations

Proceedings Article•DOI•

"I'm eating a sandwich in Glasgow": modeling locations with tweets

[...]

Sheila Kinsella¹, Vanessa Murdock², Neil O'Hare³•Institutions (3)

National University of Ireland, Galway¹, Yahoo!², Dublin City University³

28 Oct 2011

TL;DR: This paper creates language models of locations using coordinates extracted from geotagged Twitter data that can meet the performance of the industry standard tool for predicting both the tweet and the user at the country, state and city levels, and far exceed its performance at the hyper-local level.

...read moreread less

Abstract: Social media such as Twitter generate large quantities of data about what a person is thinking and doing in a particular location. We leverage this data to build models of locations to improve our understanding of a user's geographic context. Understanding the user's geographic context can in turn enable a variety of services that allow us to present information, recommend businesses and services, and place advertisements that are relevant at a hyper-local level.In this paper we create language models of locations using coordinates extracted from geotagged Twitter data. We model locations at varying levels of granularity, from the zip code to the country level. We measure the accuracy of these models by the degree to which we can predict the location of an individual tweet, and further by the accuracy with which we can predict the location of a user. We find that we can meet the performance of the industry standard tool for predicting both the tweet and the user at the country, state and city levels, and far exceed its performance at the hyper-local level, achieving a three- to ten-fold increase in accuracy at the zip code level.

...read moreread less

271 citations

Proceedings Article•DOI•

The impact of caching on search engines

[...]

Ricardo Baeza-Yates¹, Aristides Gionis¹, Flavio Junqueira¹, Vanessa Murdock¹, Vassilis Plachouras¹, Fabrizio Silvestri² - Show less +2 more•Institutions (2)

Yahoo!¹, Istituto di Scienza e Tecnologie dell'Informazione²

23 Jul 2007

TL;DR: Using a query log spanning a whole year, a new algorithm is proposed for static caching of posting lists, which outperforms previous methods and can achieve higher hit rates than caching query answers.

...read moreread less

Abstract: In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs.caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.

...read moreread less

217 citations

Proceedings Article•DOI•

Automatic tagging and geotagging in video collections and communities

[...]

Martha Larson¹, Mohammad Soleymani², Pavel Serdyukov³, Stevan Rudinac¹, Christian Wartena⁴, Vanessa Murdock⁵, Gerald Friedland⁶, Roeland Ordelman⁷, Gareth J. F. Jones⁸ - Show less +5 more•Institutions (8)

Delft University of Technology¹, University of Geneva², Yandex³, Novay⁴, Yahoo!⁵, International Computer Science Institute⁶, Netherlands Institute for Sound and Vision⁷, Dublin City University⁸

18 Apr 2011

TL;DR: This work overviews three tasks offered in the MediaEval 2010 benchmarking initiative, for each, describing its use scenario, definition and the data set released.

...read moreread less

Abstract: Automatically generated tags and geotags hold great promise to improve access to video collections and online communities. We overview three tasks offered in the MediaEval 2010 benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features.

...read moreread less

116 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

Collapse

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

Earthquake shakes Twitter users: real-time event detection by social sensors

[...]

Takeshi Sakaki¹, Makoto Okazaki¹, Yutaka Matsuo¹•Institutions (1)

University of Tokyo¹

26 Apr 2010

TL;DR: This paper investigates the real-time interaction of events such as earthquakes in Twitter and proposes an algorithm to monitor tweets and to detect a target event and produces a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location.

...read moreread less

Abstract: Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other comparable methods for estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected) merely by monitoring tweets. Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.

...read moreread less

3,976 citations

Book•

Learning to Rank for Information Retrieval

[...]

Tie-Yan Liu¹•Institutions (1)

Microsoft¹

27 Jun 2009

TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches, the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures are analyzed, and the performance of these approaches on the LETOR benchmark datasets is evaluated.

...read moreread less

Abstract: This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.

...read moreread less

2,515 citations

Proceedings Article•DOI•

You are where you tweet: a content-based approach to geo-locating twitter users

[...]

Zhiyuan Cheng¹, James Caverlee¹, Kyumin Lee¹•Institutions (1)

Texas A&M University¹

26 Oct 2010

TL;DR: A probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, which can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on.

...read moreread less

Abstract: We propose and evaluate a probabilistic framework for estimating a Twitter user's city-level location based purely on the content of the user's tweets, even in the absence of any other geospatial cues By augmenting the massive human-powered sensing capabilities of Twitter and related microblogging services with content-derived location information, this framework can overcome the sparsity of geo-enabled features in these services and enable new location-based personalized information services, the targeting of regional advertisements, and so on Three of the key features of the proposed approach are: (i) its reliance purely on tweet content, meaning no need for user IP information, private login information, or external knowledge bases; (ii) a classification component for automatically identifying words in tweets with a strong local geo-scope; and (iii) a lattice-based neighborhood smoothing model for refining a user's location estimate The system estimates k possible locations for each user in descending order of confidence On average we find that the location estimates converge quickly (needing just 100s of tweets), placing 51% of Twitter users within 100 miles of their actual location

...read moreread less

1,213 citations

Journal Article•DOI•

Graph based anomaly detection and description: a survey

[...]

Leman Akoglu¹, Hanghang Tong², Danai Koutra³•Institutions (3)

Stony Brook University¹, City University of New York², Carnegie Mellon University³

01 May 2015-Data Mining and Knowledge Discovery

TL;DR: This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs, and gives a general framework for the algorithms categorized under various settings.

...read moreread less

Abstract: Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised versus (semi-)supervised approaches, for static versus dynamic graphs, for attributed versus plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the `why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.

...read moreread less

998 citations

Proceedings Article•DOI•

Novelty and diversity in information retrieval evaluation

[...]

Charles L. A. Clarke¹, Maheedhar Kolla¹, Gordon V. Cormack¹, Olga Vechtomova¹, Azin Ashkan¹, Stefan Büttcher¹, Ian MacKinnon¹ - Show less +3 more•Institutions (1)

University of Waterloo¹

20 Jul 2008

TL;DR: This paper develops a framework for evaluation that systematically rewards novelty and diversity into a specific evaluation measure, based on cumulative gain, and demonstrates the feasibility of this approach using a test collection based on the TREC question answering track.

...read moreread less

Abstract: Evaluation measures act as objective functions to be optimized by information retrieval systems. Such objective functions must accurately reflect user requirements, particularly when tuning IR systems and learning ranking functions. Ambiguity in queries and redundancy in retrieved documents are poorly reflected by current evaluation measures. In this paper, we present a framework for evaluation that systematically rewards novelty and diversity. We develop this framework into a specific evaluation measure, based on cumulative gain. We demonstrate the feasibility of our approach using a test collection based on the TREC question answering track.

...read moreread less

988 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse