Author
Hannes Marais
Bio: Hannes Marais is an academic researcher whose work focuses on Web query classification and Web search queries. He has an h-index of 2 and has co-authored 2 publications receiving 1,597 citations.
Papers
01 Sep 1999
TL;DR: It is shown that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query, suggesting that traditional information retrieval techniques may not work well for answering web search requests.
Abstract: In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.
1,255 citations
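The per-query and per-session statistics this paper reports (short queries, most sessions stopping at the first results page) can be sketched as below. The actual AltaVista log format is not public, so the record layout here is a hypothetical stand-in:

```python
from statistics import mean

# Hypothetical log records as (session_id, query_string, result_page_viewed).
# These rows only stand in for the kinds of statistics the paper computes.
log = [
    ("s1", "britney spears", 1),
    ("s1", "britney spears lyrics", 2),
    ("s2", "java", 1),
    ("s3", "used car prices california", 1),
]

def mean_query_length(records):
    """Average number of whitespace-separated terms per query."""
    return mean(len(query.split()) for _, query, _ in records)

def first_page_only_fraction(records):
    """Fraction of sessions that never viewed a results page beyond the first."""
    deepest = {}
    for session, _, page in records:
        deepest[session] = max(deepest.get(session, 0), page)
    return sum(1 for p in deepest.values() if p == 1) / len(deepest)

print(mean_query_length(log))         # short queries: mean terms per query
print(first_page_only_fraction(log))  # share of sessions stopping at page 1
```

On the real log the same aggregation would run over hundreds of millions of sessions; only the grouping-by-session logic matters here.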
01 Jan 1998
TL;DR: An analysis of a 280 GB AltaVista Search Engine query log, consisting of approximately 1 billion search-request entries over a period of six weeks, is presented; the log represents approximately 285 million user sessions, each an attempt to fill a single information need.
Abstract: In this paper we present an analysis of a 280 GB AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents approximately 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. Furthermore we present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques might not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.
366 citations
Cited by
23 Jul 2002
TL;DR: The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking.
Abstract: This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.
4,453 citations
TL;DR: The Internet is a critically important research site for sociologists testing theories of technology diffusion and media effects, particularly because it is a medium uniquely capable of integrating modes of communication and forms of content.
Abstract: The Internet is a critically important research site for sociologists testing theories of technology diffusion and media effects, particularly because it is a medium uniquely capable of integrating modes of communication and forms of content. Current research tends to focus on the Internet's implications in five domains: 1) inequality (the “digital divide”); 2) community and social capital; 3) political participation; 4) organizations and other economic institutions; and 5) cultural participation and cultural diversity. A recurrent theme across domains is that the Internet tends to complement rather than displace existing media and patterns of behavior. Thus in each domain, utopian claims and dystopic warnings based on extrapolations from technical possibilities have given way to more nuanced and circumscribed understandings of how Internet use adapts to existing patterns, permits certain innovations, and reinforces particular kinds of change. Moreover, in each domain the ultimate social implications of t...
1,754 citations
17 May 1999
TL;DR: A new hypertext resource discovery system called a Focused Crawler is described that is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Abstract: The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware. © 1999 Published by Elsevier Science B.V. All rights reserved.
1,700 citations
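The core of such goal-directed crawling is a best-first frontier ordered by a relevance score. The sketch below assumes a generic `relevance` callable and `neighbors` link extractor standing in for the paper's classifier and distiller; it is a minimal illustration of pruning irrelevant regions, not the paper's system:

```python
import heapq

def focused_crawl(seeds, relevance, neighbors, budget=10, threshold=0.5):
    """Best-first crawl: always expand the highest-scoring known page next,
    and skip pages the classifier scores below `threshold`."""
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)  # min-heap on negated score = max-heap on score
    visited, collected = set(), []
    while frontier and len(collected) < budget:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        if -neg_score < threshold:
            continue  # prune: do not collect or expand irrelevant pages
        collected.append(url)
        for link in neighbors(url):
            if link not in visited:
                heapq.heappush(frontier, (-relevance(link), link))
    return collected

# Toy link graph with hypothetical classifier scores.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
score = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.7}
print(focused_crawl(["a"], score.get, lambda u: graph[u]))
# → ['a', 'b', 'd']  (page "c" is pruned as off-topic)
```

The savings the abstract describes come from exactly this pruning: low-scoring branches are never fetched or expanded.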
03 Jul 2006
TL;DR: Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.
Abstract: Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions and more. The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research. The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text. Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided. Features: many illustrative examples and entertaining asides; MATLAB code; an accessible and informal style; a complete and self-contained section for mathematics review.
1,548 citations
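Since the book centers on the mathematics behind PageRank, a minimal power-iteration sketch may be useful (written in Python for consistency here, whereas the book itself uses MATLAB). The toy graph and damping/iteration choices are illustrative only:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a small link graph. `links[p]` lists the pages
    that p links to; a page with no out-links (a dangling node) spreads
    its rank uniformly over all pages so total rank is conserved."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) / n for p in pages}  # teleportation term
        for p in pages:
            out = links[p]
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    nxt[q] += share
            else:
                for q in pages:  # dangling page: uniform redistribution
                    nxt[q] += damping * rank[p] / n
        rank = nxt
    return rank

# Symmetric two-page cycle: by symmetry both ranks converge to 0.5.
print(pagerank({"a": ["b"], "b": ["a"]}))
```

Real implementations use sparse matrix-vector products and a convergence test rather than a fixed iteration count, but the update rule is the same.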
TL;DR: It is found that most people use few search terms, submit few modified queries, view few Web pages, and rarely use advanced search features, and that the language of Web queries is distinctive.
Abstract: In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides insight into public practices and choices in Web searching.
1,153 citations