scispace - formally typeset
Search or ask a question
Author

Jan Pedersen

Other affiliations: AmeriCorps VISTA
Bio: Jan Pedersen is an academic researcher from Yahoo!. The author has contributed to research in topics: Web search query & Web page. The author has an hindex of 18, co-authored 26 publications receiving 2685 citations. Previous affiliations of Jan Pedersen include AmeriCorps VISTA.

Papers
More filters
Book ChapterDOI
31 Aug 2004
TL;DR: This paper proposes techniques to semi-automatically separate reputable, good pages from spam, and shows that they can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
Abstract: Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.

1,259 citations

Journal ArticleDOI
Daniel C. Fain1, Jan Pedersen1
TL;DR: GoTo as discussed by the authors is a GoTo sponsorisee sur le Web, which permet aux annonceurs de faire apparaitre leurs contenus dans les resultats affiches par un moteur de recherche, a demarre en 1998 avec GoTo, acquis par Yahoo! en 2002.
Abstract: La recherche sponsorisee sur le Web, qui permet aux annonceurs de faire apparaitre leurs contenus dans les resultats affiches par un moteur de recherche, a demarre en 1998 avec GoTo, acquis par Yahoo! en 2002. Le prix de l'apparition sur un site donne peut etre fonction du nombre d'apparitions (cout pour mille), des clicks sur un lien (cout par click) ou des actes engendres (cout par action). La recherche payee souleve un interet grandissant au sein de la communaute academique.

231 citations

Journal ArticleDOI
01 Jan 2006
TL;DR: An approach to interactive information retrieval (IR) contextually within a multitasking framework is proposed, and there are a broad variety of topics in multitasking search sessions.
Abstract: A user's single session with a Web search engine or information retrieval (IR) system may consist of seeking information on single or multiple topics, and switch between tasks or multitasking information behavior. Most Web search sessions consist of two queries of approximately two words. However, some Web search sessions consist of three or more queries. We present findings from two studies. First, a study of two-query search sessions on the Alta Vista Web search engine, and second, a study of three or more query search sessions on the Alta Vista Web search engine. We examine the degree of multitasking search and information task switching during these two sets of Alta Vista Web search sessions. A sample of two-query and three or more query sessions were filtered from Alta Vista transaction logs from 2002 and qualitatively analyzed. Sessions ranged in duration from less than a minute to a few hours. Findings include: (1) 81% of two-query sessions included multiple topics, (2) 91.3% of three or more query sessions included multiple topics, (3) there are a broad variety of topics in multitasking search sessions, and (4) three or more query sessions sometimes contained frequent topic changes. Multitasking is found to be a growing element in Web searching. This paper proposes an approach to interactive information retrieval (IR) contextually within a multitasking framework. The implications of our findings for Web design and further research are discussed.

182 citations

Proceedings ArticleDOI
01 Sep 2006
TL;DR: The concept of spam mass, a measure of the impact of link spamming on a page's ranking, is introduced, and how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from links spamming are discussed.
Abstract: Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page's ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavyweight link spamming.

163 citations

01 Jan 2007
TL;DR: An analysis of 10 months worth of Yahoo! Answers data is performed that provides insights into user behavior and impact as well as into various aspects of the service and its possible evolution.
Abstract: Yahoo! Answers represents a new type of community portal that allows users to post questions and/or answer questions asked by other members of the community, already featuring a very large number of questions and several million users. Other recently launched services, like Microsoft’s Live QnA and Amazon’s Askville, follow the same basic interaction model. The popularity and the particular characteristics of this model call for a closer study that can help a deeper understanding of the entities involved, their interactions, and the implications of the model. Such understanding is a crucial step in social and algorithmic research that could yield improvements to various components of the service, for instance, personalizing the interaction with the system based on user interest. In this paper, we perform an analysis of 10 months worth of Yahoo! Answers data that provides insights into user behavior and impact as well as into various aspects of the service and its possible evolution.

116 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Book
Tie-Yan Liu1
27 Jun 2009
TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches, the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures are analyzed, and the performance of these approaches on the LETOR benchmark datasets is evaluated.
Abstract: This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.

2,515 citations

Book
01 Oct 2011
TL;DR: This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.
Abstract: The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.

1,795 citations

Book
03 Jul 2006
TL;DR: Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.
Abstract: Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions and more. The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research. The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text. Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided. Many illustrative examples and entertaining asides MATLAB code Accessible and informal style Complete and self-contained section for mathematics review

1,548 citations

Book
01 Jul 2012
TL;DR: Active learning as discussed by the authors is a general approach that allows a machine learning algorithm to choose the data from which it learns by posing "queries", usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator) that already understands the nature of the problem.
Abstract: The key idea behind active learning is that a machine learning algorithm can perform better with less training if it is allowed to choose the data from which it learns. An active learner may pose "queries," usually in the form of unlabeled data instances to be labeled by an "oracle" (e.g., a human annotator) that already understands the nature of the problem. This sort of approach is well-motivated in many modern machine learning and data mining applications, where unlabeled data may be abundant or easy to come by, but training labels are difficult, time-consuming, or expensive to obtain. This book is a general introduction to active learning. It outlines several scenarios in which queries might be formulated, and details many query selection algorithms which have been organized into four broad categories, or "query selection frameworks." We also touch on some of the theoretical foundations of active learning, and conclude with an overview of the strengths and weaknesses of these approaches in practice, including a summary of ongoing work to address these open challenges and opportunities.

1,374 citations