
Showing papers by "Soumen Chakrabarti published in 2009"


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This work gives formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities, and investigates practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters.
Abstract: To take the first step beyond keyword-based search toward entity-based search, suitable token spans ("spots") on documents must be identified as references to real-world entities from an entity catalog. Several systems have been proposed to link spots on Web pages to entities in Wikipedia. They are largely based on local compatibility between the text around the spot and textual metadata associated with the entity. Two recent systems exploit inter-label dependencies, but in limited ways. We propose a general collective disambiguation approach. Our premise is that coherent documents refer to entities from one or a few related topics or domains. We give formulations for the trade-off between local spot-to-entity compatibility and measures of global coherence between entities. Optimizing the overall entity assignment is NP-hard. We investigate practical solutions based on local hill-climbing, rounding integer linear programs, and pre-clustering entities followed by local optimization within clusters. In experiments involving over a hundred manually annotated Web pages and tens of thousands of spots, our approaches significantly outperform recently proposed algorithms.
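Of the three strategies named above, the local hill-climbing one is the easiest to illustrate. The sketch below is a minimal, hypothetical rendering of that idea only, not the authors' system: the function names, the callable inputs, and the simple additive objective are all assumptions. It starts each spot at its locally best entity, then greedily re-assigns one spot at a time whenever the combined local-compatibility-plus-coherence objective improves.

```python
from itertools import combinations

def collective_disambiguate(spots, candidates, local_score, coherence,
                            max_iters=100):
    """Greedy hill-climbing over joint spot-to-entity assignments.

    spots       : list of spot identifiers
    candidates  : dict mapping each spot to its candidate entities
    local_score : callable (spot, entity) -> float, text compatibility
    coherence   : callable (entity, entity) -> float, global relatedness
    """
    # Initialize each spot with its locally best entity.
    assign = {s: max(candidates[s], key=lambda e: local_score(s, e))
              for s in spots}

    def objective(a):
        # Trade off per-spot compatibility against pairwise coherence.
        local = sum(local_score(s, a[s]) for s in spots)
        joint = sum(coherence(a[s], a[t])
                    for s, t in combinations(spots, 2))
        return local + joint

    best = objective(assign)
    for _ in range(max_iters):
        improved = False
        for s in spots:
            for e in candidates[s]:
                if e == assign[s]:
                    continue
                trial = dict(assign)
                trial[s] = e
                value = objective(trial)
                if value > best:
                    assign, best, improved = trial, value, True
        if not improved:  # local optimum: no single re-assignment helps
            break
    return assign
```

Because the objective is optimized one spot at a time, this only reaches a local optimum, which is exactly why the abstract also considers ILP rounding and pre-clustering.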

476 citations



Proceedings ArticleDOI
19 Jul 2009
TL;DR: This work introduces Quantity Consensus Queries (QCQs), where each answer is a tight quantity interval distilled from evidence of relevance in thousands of snippets, and proposes two new algorithms that learn to aggregate information from multiple snippets.
Abstract: Web search is increasingly exploiting named entities like persons, places, businesses, addresses and dates. Entity ranking is also of current interest at INEX and TREC. Numerical quantities are an important class of entities, especially in queries about prices and features related to products, services and travel. We introduce Quantity Consensus Queries (QCQs), where each answer is a tight quantity interval distilled from evidence of relevance in thousands of snippets. Entity search and factoid question answering have benefited from aggregating evidence from multiple promising snippets, but these do not readily apply to quantities. Here we propose two new algorithms that learn to aggregate information from multiple snippets. We show that typical signals used in entity ranking, like rarity of query words and their lexical proximity to candidate quantities, are very noisy. Our algorithms learn to score and rank quantity intervals directly, combining snippet quantity and snippet text information. We report on experiments using hundreds of QCQs with ground truth taken from TREC QA, Wikipedia Infoboxes, and other sources, leading to tens of thousands of candidate snippets and quantities. Our algorithms yield about 20% better MAP and NDCG compared to the best-known collective rankers, and are 35% better than scoring snippets independently of each other.
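One simple way to realize "a tight quantity interval distilled from many snippets" can be sketched as follows. This is a toy unsupervised baseline, not the paper's learned ranker: it assumes each snippet has already been reduced to a (quantity, relevance) pair by some upstream scorer, and the width cap and vote-counting rule are invented for illustration.

```python
def rank_quantity_intervals(snippets, rel_width=0.2, top_k=5):
    """Rank candidate quantity intervals by aggregated snippet evidence.

    snippets  : list of (quantity, relevance) pairs, one per snippet
    rel_width : maximum interval width relative to its midpoint
    Returns up to top_k (lo, hi, score) triples, best first.
    """
    quantities = sorted(q for q, _ in snippets)
    scored = []
    for i, lo in enumerate(quantities):
        for hi in quantities[i:]:
            mid = (lo + hi) / 2.0
            if abs(hi - lo) > rel_width * abs(mid):
                continue  # not a "tight" interval; discard it
            # Sum relevance over every snippet whose quantity falls
            # inside, so many weakly relevant snippets can outvote a
            # single strongly scored outlier.
            support = sum(r for q, r in snippets if lo <= q <= hi)
            scored.append((lo, hi, support))
    scored.sort(key=lambda t: -t[2])
    return scored[:top_k]
```

The tension this makes visible, between keeping intervals tight and accumulating support across snippets, is precisely what the paper's two algorithms learn to balance from training data rather than fixing by hand.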

32 citations


Book ChapterDOI
01 Jan 2009

13 citations


01 Jan 2009
TL;DR: This work demonstrates CSAW, a system for Curating and Searching the Annotated Web, and describes the beginnings of a next-generation Web search API that significantly extends the capabilities of APIs provided by popular search engines today.
Abstract: We demonstrate CSAW, a system for Curating and Searching the Annotated Web. CSAW annotates named entities and quantities in Web-scale text corpora, and, where confident, connects these annotations with entries in an entity and type catalog such as Wikipedia. The semistructured catalog, together with the unstructured corpus, forms a composite database that CSAW can then search using powerful reachability, proximity and aggregation primitives. Specifically, we can look for snippets with mentions of specific entities, entities of a specified type, quantities with specified types or units, find unions and intersections of snippet sets, and then aggregate evidence from snippet sets into ranked responses. Responses are not page URLs as in standard Web search, but ranked tables where the cells can be entity references, quantities, or token snippets. We will show a subset of CSAW’s capabilities, and describe the beginnings of a next-generation Web search API that significantly extends the capabilities of APIs provided by popular search engines today.
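The composite search primitives above lend themselves to a small sketch. The following is a hypothetical, toy rendering, not CSAW's actual API (every class, method, and type name here is invented): it shows how snippet-set primitives could compose via plain set union and intersection and then feed an aggregation step that produces a ranked table rather than page URLs.

```python
from collections import defaultdict

class AnnotatedCorpus:
    """Toy in-memory index over snippets annotated with entities and types."""

    def __init__(self):
        self.by_entity = defaultdict(set)   # entity name -> snippet ids
        self.by_type = defaultdict(set)     # type name   -> snippet ids
        self.snippets = {}                  # snippet id  -> text

    def add(self, sid, text, entities, types):
        self.snippets[sid] = text
        for e in entities:
            self.by_entity[e].add(sid)
        for t in types:
            self.by_type[t].add(sid)

    def mentions(self, entity):
        """Snippet ids that mention a specific catalog entity."""
        return set(self.by_entity[entity])

    def of_type(self, type_name):
        """Snippet ids that mention some entity of the given type."""
        return set(self.by_type[type_name])

    def aggregate(self, sids, extract):
        """Aggregate evidence: rank the values extracted from a snippet
        set by how many snippets support each value."""
        votes = defaultdict(int)
        for sid in sids:
            votes[extract(self.snippets[sid])] += 1
        return sorted(votes.items(), key=lambda kv: -kv[1])

# Because the primitives return plain sets, callers can compose them:
#   hits  = corpus.mentions("Mumbai") & corpus.of_type("City")
#   table = corpus.aggregate(hits, extract=some_quantity_extractor)
```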

6 citations


Proceedings ArticleDOI
06 Dec 2009
TL;DR: A new, efficient Monte Carlo sampling method is given to compute the objective and gradient of this approximation, which can then be used in a quasi-Newton optimizer like LBFGS.
Abstract: Learning to rank is an important area at the interface of machine learning, information retrieval and Web search. The central challenge in optimizing various measures of ranking loss is that the objectives tend to be non-convex and discontinuous. To make such functions amenable to gradient-based optimization procedures one needs to design clever bounds. In recent years, boosting, neural networks, support vector machines, and many other techniques have been applied. However, there is little work on directly modeling a conditional probability Pr(y|x_q) where y is a permutation of the documents to be ranked and x_q represents their feature vectors with respect to a query q. A major reason is that the space of y is huge: n! if n documents must be ranked. We first propose an intuitive and appealing expected loss minimization objective, and give an efficient shortcut to evaluate it despite the huge space of ys. Unfortunately, the optimization is non-convex, so we propose a convex approximation. We give a new, efficient Monte Carlo sampling method to compute the objective and gradient of this approximation, which can then be used in a quasi-Newton optimizer like LBFGS. Extensive experiments with the widely-used LETOR dataset show large ranking accuracy improvements beyond recent and competitive algorithms.
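A rough illustration of the expected-loss idea follows. This is a sketch under assumed choices, not the paper's algorithm: it assumes a Plackett-Luce distribution over permutations parameterized by linear scores, and 1 - NDCG@k as the loss; the paper's shortcut evaluation, convex approximation, and specific gradient estimator are not reproduced here.

```python
import numpy as np

def sample_ranking(scores, rng):
    """Draw a permutation from a Plackett-Luce model whose item weights
    are exp(scores): perturbing the scores with Gumbel noise and sorting
    in descending order is an exact sampler for that model."""
    gumbel = rng.gumbel(size=scores.shape)
    return np.argsort(-(scores + gumbel))

def neg_ndcg(perm, rel, k=10):
    """1 - NDCG@k as a ranking loss; rel holds graded relevance labels."""
    gains = (2.0 ** rel[perm] - 1.0)[:k]
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float(gains @ discounts)
    ideal = np.sort(rel)[::-1][:k]
    idcg = float((2.0 ** ideal - 1.0) @ discounts)
    return 1.0 - (dcg / idcg if idcg > 0 else 0.0)

def mc_expected_loss(w, X, rel, n_samples=200, seed=0):
    """Monte Carlo estimate of E_{y ~ Pr(y|x_q)}[loss(y)] under a linear
    scoring model: scores = X @ w parameterize the permutation model."""
    rng = np.random.default_rng(seed)
    scores = X @ w
    losses = [neg_ndcg(sample_ranking(scores, rng), rel)
              for _ in range(n_samples)]
    return float(np.mean(losses))
```

Such an estimate could, in principle, be handed to a quasi-Newton routine (e.g. scipy.optimize.minimize with method="L-BFGS-B") alongside a gradient estimate such as a likelihood-ratio estimator, which echoes, but does not reproduce, the sampling-plus-LBFGS pipeline the abstract describes.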

6 citations