
Showing papers by "Srikanta Bedathur published in 2009"


Proceedings ArticleDOI
28 Jun 2009
TL;DR: In this paper, the authors investigate the value of incorporating the history information available on the interactions (or links) of the current social network state and show that time-stamps of past interactions significantly improve the prediction accuracy of new and recurrent links over rather sophisticated methods proposed recently.
Abstract: Prediction of links - both new as well as recurring - in a social network representing interactions between individuals is an important problem. In recent years, there has been significant interest in methods that use only the graph structure to make predictions. However, most of them consider a single snapshot of the network as the input, neglecting an important aspect of these social networks, namely their evolution over time. In this work, we investigate the value of incorporating the history information available on the interactions (or links) of the current social network state. Our results unequivocally show that time-stamps of past interactions significantly improve the prediction accuracy of new and recurrent links over rather sophisticated methods proposed recently. Furthermore, we introduce a novel testing method which reflects the application of link prediction better than previous approaches.

227 citations
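To illustrate the idea in the abstract above, a time-aware variant of the classic common-neighbors predictor can weight each shared interaction by the recency of its latest time-stamp. This is a minimal sketch, not the authors' actual method; the exponential decay rate and the `interactions` representation are assumptions.

```python
from math import exp

def temporal_common_neighbors(interactions, u, v, now, decay=0.1):
    """Score a candidate link (u, v): sum over common neighbors w,
    weighting the latest u-w and v-w interactions by their recency.
    `interactions` maps node pairs (a, b) to lists of time-stamps."""
    def last_contact(a, b):
        ts = interactions.get((a, b)) or interactions.get((b, a))
        return max(ts) if ts else None

    def neighbors(x):
        return ({b for (a, b) in interactions if a == x}
                | {a for (a, b) in interactions if b == x})

    score = 0.0
    for w in neighbors(u) & neighbors(v):
        for t in (last_contact(u, w), last_contact(v, w)):
            score += exp(-decay * (now - t))  # fresher interactions count more
    return score
```

With this scoring, two nodes whose shared neighbors were contacted recently outrank a pair with equally many but older shared interactions, which is exactly the signal a single static snapshot throws away.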


Proceedings Article
01 Jan 2009
TL;DR: A measure of across-time semantic similarity that assesses the degree of relatedness between two terms when used at different times is proposed and used as a crucial building block for a novel query reformulation technique based on a hidden Markov model (HMM).
Abstract: Web archives play an important role in preserving our cultural heritage for future generations. When searching them, a serious problem arises from the fact that terminology evolves constantly. Since today’s users formulate queries using current terminology, old but relevant documents are often not retrieved. The query saint petersburg museum, for instance, does not retrieve documents from the 1970s about museums in Leningrad (the former name of Saint Petersburg). We address this problem by determining query reformulations that paraphrase the user’s information need using terminology prevalent in the past. A measure of across-time semantic similarity that assesses the degree of relatedness between two terms when used at different times is proposed. Using this measure as a crucial building block, we propose a novel query reformulation technique based on a hidden Markov model (HMM). Experiments on twenty years worth of New York Times articles demonstrate the usefulness and efficiency of our approach.

48 citations
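The abstract does not spell out the similarity measure, but a simple distributional variant of the idea can be sketched: represent a term by its co-occurrence context within one time slice of the corpus and compare contexts across slices with cosine similarity. The window size and whitespace tokenization below are illustrative assumptions, not the paper's definitions.

```python
from collections import Counter
from math import sqrt

def context_vector(docs, term, window=2):
    """Co-occurrence counts of words appearing near `term` in one time slice."""
    ctx = Counter()
    for doc in docs:
        words = doc.lower().split()
        for i, w in enumerate(words):
            if w == term:
                lo, hi = max(0, i - window), i + window + 1
                ctx.update(x for x in words[lo:hi] if x != term)
    return ctx

def across_time_similarity(docs_then, term_then, docs_now, term_now):
    """Cosine similarity between the contexts of two terms drawn
    from different time slices of the corpus."""
    a = context_vector(docs_then, term_then)
    b = context_vector(docs_now, term_now)
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Under such a measure, leningrad in 1970s documents and saint petersburg in current documents end up with similar contexts (museum, neva, ...), which is what lets a reformulation model bridge the terminology gap.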


Proceedings Article
01 Jan 2009
TL;DR: This work proposes a novel approach, based on language models, to make temporal expressions first-class citizens of the retrieval model, and presents experiments that show actual improvements in retrieval effectiveness.
Abstract: Temporal expressions, such as between 1992 and 2000, are frequent across many kinds of documents. Text retrieval, though, treats them as common terms, thus ignoring their inherent semantics. For queries with a strong temporal component, such as U.S. president 1997, this leads to a decrease in retrieval effectiveness, since relevant documents (e.g., a biography of Bill Clinton containing the aforementioned temporal expression) cannot be reliably matched to the query. We propose a novel approach, based on language models, to make temporal expressions first-class citizens of the retrieval model. In addition, we present experiments that show actual improvements in retrieval effectiveness.

37 citations
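One way to make a temporal expression like between 1992 and 2000 a first-class citizen is to treat it as a generative model over time points and score how likely it is to generate the query's time interval. The uniform-interval model and the smoothing constant below are simplifying assumptions for illustration, not the paper's exact language model.

```python
def interval_match_prob(query_iv, doc_iv):
    """P(query interval | doc temporal expression) under a uniform model:
    the fraction of the document interval overlapping the query interval.
    Intervals are inclusive (begin_year, end_year) pairs."""
    qb, qe = query_iv
    db, de = doc_iv
    overlap = max(0, min(qe, de) - max(qb, db) + 1)
    return overlap / (de - db + 1)

def temporal_score(query_iv, doc_intervals, smoothing=1e-6):
    """Average the per-expression probabilities, with smoothing so a
    document with no matching expression still gets a tiny score."""
    if not doc_intervals:
        return smoothing
    probs = [interval_match_prob(query_iv, iv) for iv in doc_intervals]
    return max(sum(probs) / len(probs), smoothing)
```

For the query U.S. president 1997, a biography containing between 1992 and 2000 overlaps the query year and scores well above a document whose temporal expressions lie entirely outside it, whereas term-based matching would see no connection at all.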


01 Jan 2009
TL;DR: The results unequivocally show that time-stamps of past interactions significantly improve the prediction accuracy of new and recurrent links over rather sophisticated methods proposed recently.
Abstract: Prediction of links - both new as well as recurring - in a social network representing interactions between individuals is an important problem. In recent years, there has been significant interest in methods that use only the graph structure to make predictions. However, most of them consider a single snapshot of the network as the input, neglecting an important aspect of these social networks, namely their evolution over time. In this work, we investigate the value of incorporating the history information available on the interactions (or links) of the current social network state. Our results unequivocally show that time-stamps of past interactions significantly improve the prediction accuracy of new and recurrent links over rather sophisticated methods proposed recently. Furthermore, we introduce a novel testing method which reflects the application of link prediction better than previous approaches.

29 citations


Proceedings ArticleDOI
15 Jun 2009
TL;DR: EverLast, a scalable distributed framework for next generation Web archival and temporal text analytics over the archive, is proposed, built on a loosely-coupled distributed architecture that can be deployed over large-scale peer-to-peer networks.
Abstract: The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. Unfortunately, much of the data on the Web is highly ephemeral in nature, with an estimated 50-80% of content changing within a short time. Continuing the pioneering efforts of many national (digital) libraries, organizations such as the International Internet Preservation Consortium (IIPC), the Internet Archive (IA) and the European Archive (EA) have been tirelessly working towards preserving the ever-changing Web. However, while these web archiving efforts have paid significant attention to the long-term preservation of Web data, they have paid little attention to developing a global-scale infrastructure for collecting, archiving, and performing historical analyses on the collected data. Based on insights from our recent work on building text analytics for Web archives, we propose EverLast, a scalable distributed framework for next-generation Web archival and temporal text analytics over the archive. Our system is built on a loosely coupled distributed architecture that can be deployed over large-scale peer-to-peer networks. In this way, we allow the integration of the many archival efforts undertaken mainly at a national level by national digital libraries. Key features of EverLast include support for time-based text search & analysis and the use of human-assisted archive gathering. In this paper, we outline the overall architecture of EverLast and present some promising preliminary results.

21 citations


01 Jan 2009
TL;DR: This work develops preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases on ad-hoc subsets of the corpus, and investigates alternative definitions of phrase interestingness, based on the probability of phrase occurrences.
Abstract: Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset but relatively infrequent in the overall corpus. The ad-hoc subset may be derived by means of a keyword query against the corpus, or by focusing on a particular time period. We investigate alternative definitions of phrase interestingness, based on the probability of phrase occurrences. We develop preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases on ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.

4 citations
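One of the interestingness definitions the abstract mentions, frequent in the ad-hoc subset but relatively infrequent in the overall corpus, can be sketched as a ratio of relative n-gram frequencies. The bigram setting, whitespace tokenization, and add-one smoothing below are illustrative assumptions, not the paper's indexing or search machinery.

```python
from collections import Counter

def interesting_phrases(subset_docs, corpus_docs, n=2, k=3):
    """Rank n-gram phrases of the ad-hoc subset by the ratio of their
    relative frequency in the subset to their (smoothed) relative
    frequency in the whole corpus; return the top-k phrases."""
    def ngram_counts(docs):
        c = Counter()
        for doc in docs:
            words = doc.lower().split()
            c.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        return c

    sub, full = ngram_counts(subset_docs), ngram_counts(corpus_docs)
    sub_total, full_total = sum(sub.values()), sum(full.values())
    scored = {p: (c / sub_total) / ((full[p] + 1) / (full_total + 1))
              for p, c in sub.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

A phrase that dominates a keyword-derived subset (say, a campaign slogan) but is rare corpus-wide gets a high ratio, while corpus-wide stock phrases are pushed down; the paper's contribution is doing this ranking efficiently via precomputed phrase indexes rather than by scanning the subset.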


01 Jan 2009
TL;DR: The goal is to build a scalable peer-to-peer framework for web archival that further supports time-travel search over the archive; an initial design covering crawling, persistent storage and indexing is provided, and partitioning strategies for historical analysis of the data are analyzed.
Abstract: The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. Our goal is to build a scalable peer-to-peer framework for web archival and to further support time-travel search over it. We provide an initial design covering crawling, persistent storage and indexing, and also analyze partitioning strategies for historical analysis of data. Peer-to-peer (p2p) systems are a natural fit here, but they suffer from churn and communication overhead and hence require controlled replication for availability and load balancing. The core contribution is an index organization that temporally partitions the time-travel index lists to support efficient time-travel search. We also analyze the partitioning strategies in terms of improving replication to increase availability while keeping the overall blow-up of the index in check. We present various heuristic approaches with detailed experimental analysis exploring the behavior of the partitioning algorithms in a distributed setting.

1 citation
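The temporal-partitioning trade-off described above can be illustrated with a toy model: each posting carries a validity interval, a posting overlapping a partition boundary is replicated into every partition it spans, and the blow-up is the replicated index size relative to the original. The posting representation below is a simplifying assumption for illustration, not the paper's index layout.

```python
def partition_postings(postings, boundaries):
    """Split time-travel postings (doc, begin, end) at the given time
    boundaries; a posting whose validity interval crosses a boundary is
    replicated into every partition it overlaps."""
    edges = [float("-inf")] + sorted(boundaries) + [float("inf")]
    return [[(d, b, e) for (d, b, e) in postings if b < hi and e > lo]
            for lo, hi in zip(edges, edges[1:])]

def blowup(postings, boundaries):
    """Replicated index size relative to the unpartitioned index."""
    parts = partition_postings(postings, boundaries)
    return sum(len(p) for p in parts) / len(postings)
```

Boundaries aligned with the gaps between validity intervals give a blow-up of 1.0 (no replication) while still letting a time-point query read a single small partition; boundaries cutting through long-lived postings trade extra replication for the same narrow reads, which is the tension the heuristics in the paper explore.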