Author

Jan Pedersen

Bio: Jan Pedersen is an academic researcher at Yahoo! who has contributed to research on Web search queries and Web pages. The author has an h-index of 18 and has co-authored 26 publications that have received 2,685 citations. Previous affiliations of Jan Pedersen include AmeriCorps VISTA.

Papers

Book Chapter•DOI•

Combating web spam with trustrank

Zoltan Gyongyi¹, Hector Garcia-Molina¹, Jan Pedersen²•Institutions (2)

Stanford University¹, Yahoo!²

31 Aug 2004

TL;DR: This paper proposes techniques to semi-automatically separate reputable, good pages from spam, and shows that they can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.

Abstract: Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.

1,259 citations
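
The abstract above describes starting from a small, manually vetted seed set and using the link structure to find other pages that are likely to be good. A minimal sketch of that idea follows, assuming trust is spread along outgoing links with a seed-biased, PageRank-style iteration. The function name `propagate_trust`, the damping value, the iteration count, and the toy graph are illustrative assumptions, not details taken from the abstract.

```python
def propagate_trust(links, seeds, damping=0.85, iterations=50):
    """Spread trust from vetted seed pages along outgoing links.

    links: dict mapping a page to the list of pages it links to.
    seeds: set of pages a human expert judged reputable.
    All parameter names and defaults here are hypothetical.
    """
    nodes = set(links) | set(seeds)
    for targets in links.values():
        nodes.update(targets)

    # Only seed pages receive "teleport" trust; everything else starts at 0.
    seed_share = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(seed_share)

    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * seed_share[n] for n in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if not targets:
                continue  # pages without out-links pass no trust on
            share = damping * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust


# Toy graph (assumed): a vetted page links into a short chain,
# while two spam pages only link to each other.
toy_links = {
    "good": ["a"],
    "a": ["b"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
toy_trust = propagate_trust(toy_links, seeds={"good"})
```

In this sketch, pages that no seed can reach keep a trust score of zero, and trust fades with distance from the seed set, which is one plausible reading of "likely to be good".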

Journal Article•DOI•

Sponsored search: A brief history

Daniel C. Fain¹, Jan Pedersen¹•Institutions (1)

Yahoo!¹

01 Dec 2006-Bulletin of The American Society for Information Science and Technology

TL;DR: Sponsored search on the Web, which lets advertisers have their content appear in the results displayed by a search engine, began in 1998 with GoTo, which was acquired by Yahoo! in 2002.

Abstract: Sponsored search on the Web, which lets advertisers have their content appear in the results displayed by a search engine, began in 1998 with GoTo, which was acquired by Yahoo! in 2002. The price of appearing on a given site can depend on the number of impressions (cost per thousand), on clicks on a link (cost per click), or on the resulting actions (cost per action). Paid search is attracting growing interest within the academic community.

231 citations
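
The abstract above distinguishes three pricing bases: cost per thousand impressions, cost per click, and cost per action. The arithmetic can be sketched as follows. The function name `campaign_cost`, its parameters, and the example rates are hypothetical and do not come from the paper.

```python
def campaign_cost(model, rate, impressions=0, clicks=0, actions=0):
    """Return what an advertiser owes under one of three pricing models.

    model: "cpm" (rate per 1000 impressions), "cpc" (rate per click),
           or "cpa" (rate per resulting action).
    All names and units here are illustrative assumptions.
    """
    if model == "cpm":
        return rate * impressions / 1000
    if model == "cpc":
        return rate * clicks
    if model == "cpa":
        return rate * actions
    raise ValueError(f"unknown pricing model: {model!r}")


# The same assumed campaign priced three ways.
cpm_total = campaign_cost("cpm", 2.0, impressions=10_000)
cpc_total = campaign_cost("cpc", 0.5, clicks=200)
cpa_total = campaign_cost("cpa", 10.0, actions=5)
```

The three models differ only in which event is billed, so the advertiser's risk shifts from paying for exposure (impressions) toward paying for outcomes (actions).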

Journal Article•DOI•

Multitasking during web search sessions

Amanda Spink¹, Minsoo Park¹, Bernard J. Jansen², Jan Pedersen³•Institutions (3)

University UCINF¹, Pennsylvania State University², Yahoo!³

01 Jan 2006

TL;DR: An approach to interactive information retrieval (IR) contextually within a multitasking framework is proposed, and there are a broad variety of topics in multitasking search sessions.

Abstract: A user's single session with a Web search engine or information retrieval (IR) system may consist of seeking information on single or multiple topics, and switch between tasks or multitasking information behavior. Most Web search sessions consist of two queries of approximately two words. However, some Web search sessions consist of three or more queries. We present findings from two studies. First, a study of two-query search sessions on the Alta Vista Web search engine, and second, a study of three or more query search sessions on the Alta Vista Web search engine. We examine the degree of multitasking search and information task switching during these two sets of Alta Vista Web search sessions. A sample of two-query and three or more query sessions were filtered from Alta Vista transaction logs from 2002 and qualitatively analyzed. Sessions ranged in duration from less than a minute to a few hours. Findings include: (1) 81% of two-query sessions included multiple topics, (2) 91.3% of three or more query sessions included multiple topics, (3) there are a broad variety of topics in multitasking search sessions, and (4) three or more query sessions sometimes contained frequent topic changes. Multitasking is found to be a growing element in Web searching. This paper proposes an approach to interactive information retrieval (IR) contextually within a multitasking framework. The implications of our findings for Web design and further research are discussed.

182 citations

Proceedings Article•DOI•

Link spam detection based on mass estimation

Zoltan Gyongyi¹, Pavel Berkhin², Hector Garcia-Molina¹, Jan Pedersen²•Institutions (2)

Stanford University¹, Yahoo!²

01 Sep 2006

TL;DR: The concept of spam mass, a measure of the impact of link spamming on a page's ranking, is introduced, along with a discussion of how to estimate spam mass and how the estimates can help identify pages that benefit significantly from link spamming.

Abstract: Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page's ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavyweight link spamming.

163 citations
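
The abstract above describes spam mass as a measure of how much of a page's ranking is due to link spamming, without giving a formula. One way to make that concrete, assumed here purely for illustration, is to compare a page's overall link-based score with the part of that score attributable to known-good pages. The function name `spam_mass` and both inputs are hypothetical.

```python
def spam_mass(total_score, good_contribution):
    """Estimate how much of a page's score is NOT explained by good pages.

    total_score: the page's overall link-based ranking score.
    good_contribution: the part of that score attributed to known-good pages.
    Returns (absolute, relative) estimates; this decomposition is an
    assumption for illustration, not the paper's stated definition.
    """
    absolute = total_score - good_contribution
    relative = absolute / total_score if total_score > 0 else 0.0
    return absolute, relative


# Assumed example: most of this page's score comes from unvetted sources.
abs_mass, rel_mass = spam_mass(total_score=0.010, good_contribution=0.002)
```

Under this reading, a relative value close to 1 would flag a page whose ranking depends almost entirely on links from outside the known-good set.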

Questioning Yahoo! Answers

Zoltan Gyongyi¹, Georgia Koutrika¹, Jan Pedersen, Hector Garcia-Molina¹•Institutions (1)

Stanford University¹

01 Jan 2007

TL;DR: An analysis of 10 months worth of Yahoo! Answers data is performed that provides insights into user behavior and impact as well as into various aspects of the service and its possible evolution.

Abstract: Yahoo! Answers represents a new type of community portal that allows users to post questions and/or answer questions asked by other members of the community, already featuring a very large number of questions and several million users. Other recently launched services, like Microsoft’s Live QnA and Amazon’s Askville, follow the same basic interaction model. The popularity and the particular characteristics of this model call for a closer study that can help a deeper understanding of the entities involved, their interactions, and the implications of the model. Such understanding is a crucial step in social and algorithmic research that could yield improvements to various components of the service, for instance, personalizing the interaction with the system based on user interest. In this paper, we perform an analysis of 10 months worth of Yahoo! Answers data that provides insights into user behavior and impact as well as into various aspects of the service and its possible evolution.

116 citations

Cited by

Journal Article•DOI•

Machine learning

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Dec 1996-ACM Computing Surveys

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.

Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Book•

Learning to Rank for Information Retrieval

Tie-Yan Liu¹•Institutions (1)

Microsoft¹

27 Jun 2009

TL;DR: Three major approaches to learning to rank are introduced, i.e., the pointwise, pairwise, and listwise approaches, the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures are analyzed, and the performance of these approaches on the LETOR benchmark datasets is evaluated.

Abstract: This tutorial is concerned with a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely-used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches to solve real ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances on statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and show several future research directions.

2,515 citations

Book•

Mining of Massive Datasets

Anand Rajaraman¹, Jeffrey D. Ullman²•Institutions (2)

Walmart Labs¹, Stanford University²

01 Oct 2011

TL;DR: This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets, and explains the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing.

Abstract: The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.

1,795 citations

Book•

Google's PageRank and Beyond: The Science of Search Engine Rankings

Amy N. Langville, Carl D. Meyer

03 Jul 2006

TL;DR: Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided.

Abstract: Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, Google's PageRank and Beyond supplies the answers to these and other questions and more. The book serves two very different audiences: the curious science reader and the technical computational reader. The chapters build in mathematical sophistication, so that the first five are accessible to the general academic reader. While other chapters are much more mathematical in nature, each one contains something for both audiences. For example, the authors include entertaining asides such as how search engines make money and how the Great Firewall of China influences research. The book includes an extensive background chapter designed to help readers learn more about the mathematics of search engines, and it contains several MATLAB codes and links to sample web data sets. The philosophy throughout is to encourage readers to experiment with the ideas and algorithms in the text. Any business seriously interested in improving its rankings in the major search engines can benefit from the clear examples, sample code, and list of resources provided. Features: many illustrative examples and entertaining asides; MATLAB code; an accessible and informal style; a complete and self-contained section for mathematics review.

1,548 citations

Book•

Active Learning

Burr Settles

01 Jul 2012

TL;DR: Active learning is a general approach in which a machine learning algorithm chooses the data from which it learns by posing "queries", usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator) that already understands the nature of the problem.

Abstract: The key idea behind active learning is that a machine learning algorithm can perform better with less training if it is allowed to choose the data from which it learns. An active learner may pose "queries," usually in the form of unlabeled data instances to be labeled by an "oracle" (e.g., a human annotator) that already understands the nature of the problem. This sort of approach is well-motivated in many modern machine learning and data mining applications, where unlabeled data may be abundant or easy to come by, but training labels are difficult, time-consuming, or expensive to obtain. This book is a general introduction to active learning. It outlines several scenarios in which queries might be formulated, and details many query selection algorithms which have been organized into four broad categories, or "query selection frameworks." We also touch on some of the theoretical foundations of active learning, and conclude with an overview of the strengths and weaknesses of these approaches in practice, including a summary of ongoing work to address these open challenges and opportunities.

1,374 citations
