
Showing papers by "Soumen Chakrabarti published in 2009"


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This work gives formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities, and investigates practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters.
Abstract: To take the first step beyond keyword-based search toward entity-based search, suitable token spans ("spots") on documents must be identified as references to real-world entities from an entity catalog. Several systems have been proposed to link spots on Web pages to entities in Wikipedia. They are largely based on local compatibility between the text around the spot and textual metadata associated with the entity. Two recent systems exploit inter-label dependencies, but in limited ways. We propose a general collective disambiguation approach. Our premise is that coherent documents refer to entities from one or a few related topics or domains. We give formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. Optimizing the overall entity assignment is NP-hard. We investigate practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters. In experiments involving over a hundred manually annotated Web pages and tens of thousands of spots, our approaches significantly outperform recently proposed algorithms.
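Of the three strategies named above, the local hill-climbing one is the easiest to illustrate. The sketch below is a minimal, hypothetical rendering of that idea only, not the authors' system: the function names, the callable inputs, and the simple additive objective are all assumptions. It starts each spot at its locally best entity, then greedily re-assigns one spot at a time whenever the combined local-compatibility-plus-coherence objective improves.

```python
from itertools import combinations

def collective_disambiguate(spots, candidates, local_score, coherence,
                            max_iters=100):
    """Greedy hill-climbing over joint spot-to-entity assignments.

    spots       : list of spot identifiers
    candidates  : dict mapping each spot to its candidate entities
    local_score : callable (spot, entity) -> float, text compatibility
    coherence   : callable (entity, entity) -> float, global relatedness
    """
    # Initialize each spot with its locally best entity.
    assign = {s: max(candidates[s], key=lambda e: local_score(s, e))
              for s in spots}

    def objective(a):
        # Trade off per-spot compatibility against pairwise coherence.
        local = sum(local_score(s, a[s]) for s in spots)
        joint = sum(coherence(a[s], a[t])
                    for s, t in combinations(spots, 2))
        return local + joint

    best = objective(assign)
    for _ in range(max_iters):
        improved = False
        for s in spots:
            for e in candidates[s]:
                if e == assign[s]:
                    continue
                trial = dict(assign)
                trial[s] = e
                value = objective(trial)
                if value > best:
                    assign, best, improved = trial, value, True
        if not improved:  # local optimum: no single re-assignment helps
            break
    return assign
```

Because the objective is optimized one spot at a time, this only reaches a local optimum, which is exactly why the abstract also considers ILP rounding and pre-clustering.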

476 citations



Proceedings ArticleDOI
19 Jul 2009
TL;DR: This work introduces Quantity Consensus Queries (QCQs), where each answer is a tight quantity interval distilled from evidence of relevance in thousands of snippets, and proposes two new algorithms that learn to aggregate information from multiple snippets.
Abstract: Web search is increasingly exploiting named entities like persons, places, businesses, addresses and dates. Entity ranking is also of current interest at INEX and TREC. Numerical quantities are an important class of entities, especially in queries about prices and features related to products, services and travel. We introduce Quantity Consensus Queries (QCQs), where each answer is a tight quantity interval distilled from evidence of relevance in thousands of snippets. Entity search and factoid question answering have benefited from aggregating evidence from multiple promising snippets, but these do not readily apply to quantities. Here we propose two new algorithms that learn to aggregate information from multiple snippets. We show that typical signals used in entity ranking, like rarity of query words and their lexical proximity to candidate quantities, are very noisy. Our algorithms learn to score and rank quantity intervals directly, combining snippet quantity and snippet text information. We report on experiments using hundreds of QCQs with ground truth taken from TREC QA, Wikipedia Infoboxes, and other sources, leading to tens of thousands of candidate snippets and quantities. Our algorithms yield about 20% better MAP and NDCG compared to the best-known collective rankers, and are 35% better than scoring snippets independently of each other.
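One simple way to realize "a tight quantity interval distilled from many snippets" can be sketched as follows. This is a toy unsupervised baseline, not the paper's learned ranker: it assumes each snippet has already been reduced to a (quantity, relevance) pair by some upstream scorer, and the width cap and vote-counting rule are invented for illustration.

```python
def rank_quantity_intervals(snippets, rel_width=0.2, top_k=5):
    """Rank candidate quantity intervals by aggregated snippet evidence.

    snippets  : list of (quantity, relevance) pairs, one per snippet
    rel_width : maximum interval width relative to its midpoint
    Returns up to top_k (lo, hi, score) triples, best first.
    """
    quantities = sorted(q for q, _ in snippets)
    scored = []
    for i, lo in enumerate(quantities):
        for hi in quantities[i:]:
            mid = (lo + hi) / 2.0
            if abs(hi - lo) > rel_width * abs(mid):
                continue  # not a "tight" interval; discard it
            # Sum relevance over every snippet whose quantity falls
            # inside, so many weakly relevant snippets can outvote a
            # single strongly scored outlier.
            support = sum(r for q, r in snippets if lo <= q <= hi)
            scored.append((lo, hi, support))
    scored.sort(key=lambda t: -t[2])
    return scored[:top_k]
```

The tension this makes visible, between keeping intervals tight and accumulating support across snippets, is precisely what the paper's two algorithms learn to balance from training data rather than fixing by hand.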

32 citations


Book ChapterDOI
01 Jan 2009

13 citations


01 Jan 2009
TL;DR: This work demonstrates CSAW, a system for Curating and Searching the Annotated Web, and describes the beginnings of a next-generation Web search API that significantly extends the capabilities of APIs provided by popular search engines today.
Abstract: We demonstrate CSAW, a system for Curating and Searching the Annotated Web. CSAW annotates named entities and quantities in Web-scale text corpora, and, where confident, connects these annotations with entries in an entity and type catalog such as Wikipedia. The semistructured catalog, together with the unstructured corpus, forms a composite database that CSAW can then search using powerful reachability, proximity and aggregation primitives. Specifically, we can look for snippets with mentions of specific entities, entities of a specified type, quantities with specified types or units, find unions and intersections of snippet sets, and then aggregate evidence from snippet sets into ranked responses. Responses are not page URLs as in standard Web search, but ranked tables where the cells can be entity references, quantities, or token snippets. We will show a subset of CSAW’s capabilities, and describe the beginnings of a next-generation Web search API that significantly extends the capabilities of APIs provided by popular search engines today.
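The composite search primitives above lend themselves to a small sketch. The following is a hypothetical, toy rendering, not CSAW's actual API (every class, method, and type name here is invented): it shows how snippet-set primitives could compose via plain set union and intersection and then feed an aggregation step that produces a ranked table rather than page URLs.

```python
from collections import defaultdict

class AnnotatedCorpus:
    """Toy in-memory index over snippets annotated with entities and types."""

    def __init__(self):
        self.by_entity = defaultdict(set)   # entity name -> snippet ids
        self.by_type = defaultdict(set)     # type name   -> snippet ids
        self.snippets = {}                  # snippet id  -> text

    def add(self, sid, text, entities, types):
        self.snippets[sid] = text
        for e in entities:
            self.by_entity[e].add(sid)
        for t in types:
            self.by_type[t].add(sid)

    def mentions(self, entity):
        """Snippet ids that mention a specific catalog entity."""
        return set(self.by_entity[entity])

    def of_type(self, type_name):
        """Snippet ids that mention some entity of the given type."""
        return set(self.by_type[type_name])

    def aggregate(self, sids, extract):
        """Aggregate evidence: rank the values extracted from a snippet
        set by how many snippets support each value."""
        votes = defaultdict(int)
        for sid in sids:
            votes[extract(self.snippets[sid])] += 1
        return sorted(votes.items(), key=lambda kv: -kv[1])

# Because the primitives return plain sets, callers can compose them:
#   hits  = corpus.mentions("Mumbai") & corpus.of_type("City")
#   table = corpus.aggregate(hits, extract=some_quantity_extractor)
```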

6 citations


Proceedings ArticleDOI
06 Dec 2009
TL;DR: A new, efficient Monte Carlo sampling method is given to compute the objective and gradient of this approximation, which can then be used in a quasi-Newton optimizer like LBFGS.
Abstract: Learning to rank is an important area at the interface of machine learning, information retrieval and Web search. The central challenge in optimizing various measures of ranking loss is that the objectives tend to be non-convex and discontinuous. To make such functions amenable to gradient-based optimization procedures one needs to design clever bounds. In recent years, boosting, neural networks, support vector machines, and many other techniques have been applied. However, there is little work on directly modeling a conditional probability Pr(y|x_q) where y is a permutation of the documents to be ranked and x_q represents their feature vectors with respect to a query q. A major reason is that the space of y is huge: n! if n documents must be ranked. We first propose an intuitive and appealing expected loss minimization objective, and give an efficient shortcut to evaluate it despite the huge space of ys. Unfortunately, the optimization is non-convex, so we propose a convex approximation. We give a new, efficient Monte Carlo sampling method to compute the objective and gradient of this approximation, which can then be used in a quasi-Newton optimizer like LBFGS. Extensive experiments with the widely-used LETOR dataset show large ranking accuracy improvements beyond recent and competitive algorithms.
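A rough illustration of the expected-loss idea follows. This is a sketch under assumed choices, not the paper's algorithm: it assumes a Plackett-Luce distribution over permutations parameterized by linear scores, and 1 - NDCG@k as the loss; the paper's shortcut evaluation, convex approximation, and specific gradient estimator are not reproduced here.

```python
import numpy as np

def sample_ranking(scores, rng):
    """Draw a permutation from a Plackett-Luce model whose item weights
    are exp(scores): perturbing the scores with Gumbel noise and sorting
    in descending order is an exact sampler for that model."""
    gumbel = rng.gumbel(size=scores.shape)
    return np.argsort(-(scores + gumbel))

def neg_ndcg(perm, rel, k=10):
    """1 - NDCG@k as a ranking loss; rel holds graded relevance labels."""
    gains = (2.0 ** rel[perm] - 1.0)[:k]
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float(gains @ discounts)
    ideal = np.sort(rel)[::-1][:k]
    idcg = float((2.0 ** ideal - 1.0) @ discounts)
    return 1.0 - (dcg / idcg if idcg > 0 else 0.0)

def mc_expected_loss(w, X, rel, n_samples=200, seed=0):
    """Monte Carlo estimate of E_{y ~ Pr(y|x_q)}[loss(y)] under a linear
    scoring model: scores = X @ w parameterize the permutation model."""
    rng = np.random.default_rng(seed)
    scores = X @ w
    losses = [neg_ndcg(sample_ranking(scores, rng), rel)
              for _ in range(n_samples)]
    return float(np.mean(losses))
```

Such an estimate could, in principle, be handed to a quasi-Newton routine (e.g. scipy.optimize.minimize with method="L-BFGS-B") alongside a gradient estimate such as a likelihood-ratio estimator, which echoes, but does not reproduce, the sampling-plus-LBFGS pipeline the abstract describes.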

6 citations