Showing papers by "Soumen Chakrabarti" published in 2010


Journal ArticleDOI
01 Sep 2010
TL;DR: This paper proposes new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns seek to express, and a new graphical model for making all these labeling decisions for each table simultaneously.
Abstract: Tables are a universal idiom to present relational data. Billions of tables on Web pages express entity references, attributes and relationships. This representation of relational world knowledge is usually considerably better than completely unstructured, free-format text. At the same time, unlike manually-created knowledge bases, relational information mined from "organic" Web tables need not be constrained by availability of precious editorial time. Unfortunately, in the absence of any formal, uniform schema imposed on Web tables, Web search cannot take advantage of these high-quality sources of relational information. In this paper we propose new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns seek to express. We propose a new graphical model for making all these labeling decisions for each table simultaneously, rather than make separate local decisions for entities, types and relations. Experiments using the YAGO catalog, DB-Pedia, tables from Wikipedia, and over 25 million HTML tables from a 500 million page Web crawl uniformly show the superiority of our approach. We also evaluate the impact of better annotations on a prototype relational Web search tool. We demonstrate clear benefits of our annotations beyond indexing tables in a purely textual manner.

425 citations
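
The difference between joint and purely local labeling decisions described in the abstract above can be illustrated with a toy sketch. This is not the authors' graphical model: the single-column setup, the candidate scores, the `type_bonus` weight, and the function name `annotate_column` are all assumptions made for illustration, and relations between column pairs are left out entirely.

```python
import math

def annotate_column(cell_candidates, entity_types, column_types, type_bonus=1.0):
    """Pick one column type and one entity per cell by maximizing a joint score.

    cell_candidates: one dict per cell mapping candidate entity -> text-match
                     score (hypothetical; each cell needs at least one candidate).
    entity_types:    dict mapping entity -> set of types it belongs to.
    column_types:    candidate types for the column.
    type_bonus:      assumed reward when a chosen entity has the column type.
    """
    best_score, best_type, best_entities = -math.inf, None, None
    for col_type in column_types:
        total, chosen = 0.0, []
        # Once the column type is fixed, the cells decouple in this toy
        # objective, so the best entity can be picked cell by cell.
        for candidates in cell_candidates:
            entity, score = max(
                (
                    (e, s + (type_bonus if col_type in entity_types.get(e, ()) else 0.0))
                    for e, s in candidates.items()
                ),
                key=lambda pair: pair[1],
            )
            total += score
            chosen.append(entity)
        if total > best_score:
            best_score, best_type, best_entities = total, col_type, chosen
    return best_type, best_entities

# Made-up example: a two-cell column containing "Paris" and "Berlin".
cell_candidates = [
    {"Paris_Hilton": 0.6, "Paris_(city)": 0.5},
    {"Berlin_(city)": 0.9, "Irving_Berlin": 0.2},
]
entity_types = {
    "Paris_Hilton": {"person"},
    "Irving_Berlin": {"person"},
    "Paris_(city)": {"city"},
    "Berlin_(city)": {"city"},
}

print(annotate_column(cell_candidates, entity_types, ["person", "city"]))
```

With these made-up scores, a purely local decision would label the first cell `Paris_Hilton`, because it has the higher text-match score. The joint objective prefers the column type `city` and therefore picks `Paris_(city)`.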


Proceedings Article
26 Apr 2010
TL;DR: This preface to the WWW 2010 proceedings describes a two-tier review process intended to ensure in-depth, reliable and fair evaluations, and introduces a new category of Application and Experience papers, reviewed for merit in design, implementation, benchmarking or extensive experience, as distinct from the core technical idea expected of regular research submissions.
Abstract: Welcome to the World Wide Web Conference held during April 26-30, 2010, at the Raleigh Convention Center in Raleigh, North Carolina, USA. The WWW Conference is the largest and premier annual forum where researchers and developers from around the world assemble to share, discuss and debate the latest developments on Web technologies and standards and the Web's impact on society and culture. We are pleased to present the proceedings of the conference as its published record. Based on input from several office holders in recent WWW conferences, we implemented a number of modifications in the review process this time. In earlier years, WWW used a partitioned track system, and each paper was sent to exactly one track. This year, we implemented a system of overlapping (broad) areas and (fine) topics. Each broad area was represented by at least two, but often more, area chairs (ACs), who helped recruit the rest of the program committee (PC) members, but PC members were not partitioned by area. Each paper could potentially be assigned to any PC member. We downloaded a number of recent papers by each PC member to create a profile, and used its similarity with each submitted paper as one signal into the paper assignment process, while paying close attention to bids for papers by PC members. The assignments were then fine-tuned by the ACs. Each paper was first reviewed by three PC members. Then the ACs initiated discussions, solicited additional reviews if needed, and wrote at least one meta-review per paper summarizing and justifying the final decision. For most papers, the ACs had a confident decision before the PC meeting held 14-15th January. At this meeting, particularly complicated decisions were made and reviewed. Overall, we believe the two-tier review process ensures in-depth, reliable and fair evaluations. 
Other new features include a new demo track, where anyone, not just industrial exhibitors, can show a working system, and a new category of Application and Experience (A+E) papers that were reviewed for merit in design, implementation, benchmarking or extensive experience, as distinct from a core technical idea as in regular research submissions. Some A+E papers were nominated for demos. Other tracks for posters, tutorials, workshops, and developers were run as usual, separate from the research track. 754 research papers were submitted. Of these, 91 were accepted as regular research papers and 14 were accepted as A+E papers. A total of 24 tutorials were proposed and 11 accepted. A total of 19 workshops were proposed and 11 were accepted. A total of 90 posters and 27 demos will be exhibited.

67 citations
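
The abstract above mentions that a profile built from each PC member's recent papers was compared with each submission, and that this similarity was one signal in paper assignment. The abstract does not say which similarity measure was used. The sketch below assumes plain term-frequency vectors and cosine similarity, and every name in it (`term_counts`, `cosine`, `rank_reviewers`) is hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text and count alphabetic tokens (a crude bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def rank_reviewers(profiles, submission):
    """Rank PC members by similarity between their profile and a submission.

    profiles:   dict mapping member name -> list of texts from recent papers.
    submission: text of the submitted paper.
    """
    sub_vec = term_counts(submission)
    scores = {
        name: cosine(term_counts(" ".join(papers)), sub_vec)
        for name, papers in profiles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up profiles and submission, for illustration only.
profiles = {
    "alice": ["web search ranking", "keyword search on graphs"],
    "bob": ["protein folding simulation"],
}
ranked = rank_reviewers(profiles, "ranking for web search")
print(ranked)
```

In the process the abstract describes, a score like this was only one input: bids by PC members were weighed alongside it, and the ACs fine-tuned the assignments.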



Journal Article
TL;DR: Recent work on adding structure to keyword search is surveyed along three axes: adding structure to unstructured data, to answers, and to queries, the last allowing more power than simple keyword queries while avoiding the complexity of elaborate query languages that demand extensive schema knowledge.
Abstract: Keyword search has traditionally focussed on retrieving documents in ranked order, given simple keyword queries. Similarly, work on keyword queries on structured data has focussed on retrieving closely connected pieces of data that together contain given query keywords. In recent years, there has been a good deal of work that attempts to go beyond the above paradigms, to improve search experience on unstructured textual data as well as on structured or semi-structured data. In this paper, we survey recent work on adding structure to keyword search, which can be categorized on three axes: (a) adding structure to unstructured data, (b) adding structure to answers, and (c) adding structure to queries, allowing more power than simple keyword queries while avoiding the complexity of elaborate query languages that demand extensive schema knowledge.

21 citations



Patent
26 Apr 2010
TL;DR: A method of querying a collection of electronic documents, comprising defining a query for retrieving a numerical answer, said query comprising one or more search terms and a tolerance for said numerical answer; defining a set of document portions from said collection, each document portion in said set being extracted from an electronic document and comprising at least one term relevant to one of the search terms, and a numerical value associated with the at least one term; ordering the associated numerical values contained in the set; defining a plurality of results groups, each results group comprising an interval of ordered numerical values
Abstract: Disclosed is a method of querying a collection of electronic documents, comprising defining a query for retrieving a numerical answer, said query comprising one or more search terms and a tolerance for said numerical answer; defining a set of document portions from said collection, each document portion in said set being extracted from an electronic document and comprising at least one term relevant to at least one of the one or more search terms and a numerical value associated with the at least one term; ordering the associated numerical values contained in said set; defining a plurality of results groups, each results group comprising an interval of ordered numerical values, each interval having a range not exceeding the tolerance; ranking the results groups; and returning at least the interval of the highest ranked results group as a response to said query. A computer program product for executing this method on a computer processor and a server are also disclosed.

10 citations
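
The steps listed in the patent abstract above (order the numerical values, form groups whose interval does not exceed the tolerance, rank the groups, return the top interval) can be sketched as follows. The abstract does not say how groups are ranked or how relevance is scored, so ranking by summed snippet score is an assumption, and `Snippet`, `candidate_groups` and `best_interval` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    text: str           # document portion containing a relevant term
    value: float        # numerical value associated with that term
    score: float = 1.0  # hypothetical relevance score (assumption)

def candidate_groups(snippets, tolerance):
    """Order snippets by value and slide a window whose range is <= tolerance."""
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    ordered = sorted(snippets, key=lambda s: s.value)
    groups, start = [], 0
    for end in range(len(ordered)):
        # Shrink the window from the left until its range fits the tolerance.
        while ordered[end].value - ordered[start].value > tolerance:
            start += 1
        groups.append(ordered[start:end + 1])
    return groups

def best_interval(snippets, tolerance):
    """Return (low, high) of the top-ranked group, or None if there is no data.

    Groups are ranked by the sum of snippet scores (an assumed ranking rule).
    """
    groups = candidate_groups(snippets, tolerance)
    if not groups:
        return None
    best = max(groups, key=lambda g: sum(s.score for s in g))
    return best[0].value, best[-1].value

# Made-up snippets for a query such as "height of Everest in metres".
snippets = [
    Snippet("... 8848 m above sea level ...", 8848),
    Snippet("... measured at 8850 m ...", 8850),
    Snippet("... 8844 m rock height ...", 8844),
    Snippet("... 29029 ft ...", 29029),
    Snippet("... camps above 8000 m ...", 8000),
]
print(best_interval(snippets, tolerance=10))
```

Here the three values near 8848 fall within a range of 6, which is inside the tolerance of 10, so their group outranks the single-value groups and its interval is returned.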