Author
Hannes Marais
Bio: Hannes Marais is an academic researcher whose work focuses on Web query classification and Web search queries. He has an h-index of 2 and has co-authored 2 publications receiving 1,597 citations.
Papers
01 Sep 1999
TL;DR: It is shown that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query, suggesting that traditional information retrieval techniques may not work well for answering web search requests.
Abstract: In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.
1,255 citations
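The per-query and per-session statistics this paper reports (short queries, most sessions stopping at the first results page) can be sketched as below. The actual AltaVista log format is not public, so the record layout here is a hypothetical stand-in:

```python
from statistics import mean

# Hypothetical log records as (session_id, query_string, result_page_viewed).
# These rows only stand in for the kinds of statistics the paper computes.
log = [
    ("s1", "britney spears", 1),
    ("s1", "britney spears lyrics", 2),
    ("s2", "java", 1),
    ("s3", "used car prices california", 1),
]

def mean_query_length(records):
    """Average number of whitespace-separated terms per query."""
    return mean(len(query.split()) for _, query, _ in records)

def first_page_only_fraction(records):
    """Fraction of sessions that never viewed a results page beyond the first."""
    deepest = {}
    for session, _, page in records:
        deepest[session] = max(deepest.get(session, 0), page)
    return sum(1 for p in deepest.values() if p == 1) / len(deepest)

print(mean_query_length(log))         # short queries: mean terms per query
print(first_page_only_fraction(log))  # share of sessions stopping at page 1
```

On the real log the same aggregation would run over hundreds of millions of sessions; only the grouping-by-session logic matters here.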
01 Jan 1998
TL;DR: An analysis of a 280 GB AltaVista Search Engine query log, consisting of approximately 1 billion search-request entries over a period of six weeks, is presented; the log represents approximately 285 million user sessions, each an attempt to fill a single information need.
Abstract: In this paper we present an analysis of a 280 GB AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents approximately 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. Furthermore we present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques might not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.
366 citations
Cited by
23 Jul 2002
TL;DR: The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking.
Abstract: This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.
4,453 citations
TL;DR: The Internet is a critically important research site for sociologists testing theories of technology diffusion and media effects, particularly because it is a medium uniquely capable of integrating modes of communication and forms of content.
Abstract: The Internet is a critically important research site for sociologists testing theories of technology diffusion and media effects, particularly because it is a medium uniquely capable of integrating modes of communication and forms of content. Current research tends to focus on the Internet's implications in five domains: 1) inequality (the “digital divide”); 2) community and social capital; 3) political participation; 4) organizations and other economic institutions; and 5) cultural participation and cultural diversity. A recurrent theme across domains is that the Internet tends to complement rather than displace existing media and patterns of behavior. Thus in each domain, utopian claims and dystopic warnings based on extrapolations from technical possibilities have given way to more nuanced and circumscribed understandings of how Internet use adapts to existing patterns, permits certain innovations, and reinforces particular kinds of change. Moreover, in each domain the ultimate social implications of t...
1,754 citations
17 May 1999
TL;DR: A new hypertext resource discovery system called a Focused Crawler is described that is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Abstract: The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware. © 1999 Published by Elsevier Science B.V. All rights reserved.
1,700 citations
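The core of such goal-directed crawling is a best-first frontier ordered by a relevance score. The sketch below assumes a generic `relevance` callable and `neighbors` link extractor standing in for the paper's classifier and distiller; it is a minimal illustration of pruning irrelevant regions, not the paper's system:

```python
import heapq

def focused_crawl(seeds, relevance, neighbors, budget=10, threshold=0.5):
    """Best-first crawl: always expand the highest-scoring known page next,
    and skip pages the classifier scores below `threshold`."""
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)  # min-heap on negated score = max-heap on score
    visited, collected = set(), []
    while frontier and len(collected) < budget:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        if -neg_score < threshold:
            continue  # prune: do not collect or expand irrelevant pages
        collected.append(url)
        for link in neighbors(url):
            if link not in visited:
                heapq.heappush(frontier, (-relevance(link), link))
    return collected

# Toy link graph with hypothetical classifier scores.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
score = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.7}
print(focused_crawl(["a"], score.get, lambda u: graph[u]))
# → ['a', 'b', 'd']  (page "c" is pruned as off-topic)
```

The savings the abstract describes come from exactly this pruning: low-scoring branches are never fetched or expanded.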
03 Jul 2006
TL;DR: Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.
Abstract: Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions and more. The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research. The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text. Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided. Features: many illustrative examples and entertaining asides; MATLAB code; an accessible and informal style; a complete and self-contained section for mathematics review.
1,548 citations
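Since the book centers on the mathematics behind PageRank, a minimal power-iteration sketch may be useful (written in Python for consistency here, whereas the book itself uses MATLAB). The toy graph and damping/iteration choices are illustrative only:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a small link graph. `links[p]` lists the pages
    that p links to; a page with no out-links (a dangling node) spreads
    its rank uniformly over all pages so total rank is conserved."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}  # teleportation term
        for p in pages:
            out = links[p]
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    nxt[q] += share
            else:
                for q in pages:  # dangling page: uniform redistribution
                    nxt[q] += damping * rank[p] / n
        rank = nxt
    return rank

# Symmetric two-page cycle: by symmetry both ranks converge to 0.5.
print(pagerank({"a": ["b"], "b": ["a"]}))
```

Real implementations use sparse matrix-vector products and a convergence test rather than a fixed iteration count, but the update rule is the same.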
TL;DR: It is found that most people use few search terms, submit few modified queries, view few Web pages, and rarely use advanced search features, and that the language of Web queries is distinctive.
Abstract: In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides insight into public practices and choices in Web searching.
1,153 citations