Showing papers in "ACM Transactions on Information Systems in 2013"

PDF

Open Access

Journal Article•DOI•

Sparse hashing for fast multimedia search

[...]

Xiaofeng Zhu¹, Zi Huang¹, Hong Cheng², Jiangtao Cui³, Heng Tao Shen¹ - Show less +1 more•Institutions (3)

University of Queensland¹, The Chinese University of Hong Kong², Xidian University³

17 May 2013-ACM Transactions on Information Systems

TL;DR: Experimental results show that the proposed sparse hashing significantly outperforms state-of-the-art techniques and is able to perform fast approximate similarity search by efficient bit XOR operations in the memory of a modern PC with short binary code representations.

...read moreread less

Abstract: Hash-based methods achieve fast similarity search by representing high-dimensional data with compact binary codes. However, both generating binary codes and encoding unseen data effectively and efficiently remain very challenging tasks. In this article, we focus on these tasks to implement approximate similarity search by proposing a novel hash based method named sparse hashing (SH for short). To generate interpretable (or semantically meaningful) binary codes, the proposed SH first converts original data into low-dimensional data through a novel nonnegative sparse coding method. SH then converts the low-dimensional data into Hamming space (i.e., binary encoding low-dimensional data) by a new binarization rule. After this, training data are represented by generated binary codes. To efficiently and effectively encode unseen data, SH learns hash functions by taking a-priori knowledge into account, such as implicit group effect of the features in training data, and the correlations between original space and the learned Hamming space. SH is able to perform fast approximate similarity search by efficient bit XOR operations in the memory of a modern PC with short binary code representations. Experimental results show that the proposed SH significantly outperforms state-of-the-art techniques.

...read moreread less

121 citations

Journal Article•DOI•

Enriching Documents with Examples: A Corpus Mining Approach

[...]

Jinhan Kim¹, Sanghoon Lee¹, Seung-won Hwang¹, Sunghun Kim²•Institutions (2)

Pohang University of Science and Technology¹, Hong Kong University of Science and Technology²

01 Jan 2013-ACM Transactions on Information Systems

TL;DR: This work proposes a novel code example recommendation system that combines the strength of browsing documents and searching for code examples and returns API documents embedded with high-quality code example summaries mined from the Web.

...read moreread less

Abstract: Software developers increasingly rely on information from the Web, such as documents or code examples on application programming interfaces (APIs), to facilitate their development processes. However, API documents often do not include enough information for developers to fully understand how to use the APIs, and searching for good code examples requires considerable effort.To address this problem, we propose a novel code example recommendation system that combines the strength of browsing documents and searching for code examples and returns API documents embedded with high-quality code example summaries mined from the Web. Our evaluation results show that our approach provides code examples with high precision and boosts programmer productivity.

...read moreread less

88 citations

Journal Article•DOI•

Behavioral dynamics on the web: Learning, modeling, and prediction

[...]

Kira Radinsky¹, Krysta M. Svore², Susan T. Dumais², Milad Shokouhi², Jaime Teevan², Alex Bocharov², Eric Horvitz² - Show less +3 more•Institutions (2)

Technion – Israel Institute of Technology¹, Microsoft²

05 Aug 2013-ACM Transactions on Information Systems

TL;DR: A temporal modeling framework adapted from physics and signal processing is developed and harnessed to predict temporal patterns in search behavior using smoothing, trends, periodicities, and surprises and presents two applications where new methods introduced for the temporal modeling of user behavior significantly improve upon the state of the art.

...read moreread less

Abstract: The queries people issue to a search engine and the results clicked following a query change over time. For example, after the earthquake in Japan in March 2011, the query japan spiked in popularity and people issuing the query were more likely to click government-related results than they would prior to the earthquake. We explore the modeling and prediction of such temporal patterns in Web search behavior. We develop a temporal modeling framework adapted from physics and signal processing and harness it to predict temporal patterns in search behavior using smoothing, trends, periodicities, and surprises. Using current and past behavioral data, we develop a learning procedure that can be used to construct models of users' Web search activities. We also develop a novel methodology that learns to select the best prediction model from a family of predictive models for a given query or a class of queries. Experimental results indicate that the predictive models significantly outperform baseline models that weight historical evidence the same for all queries. We present two applications where new methods introduced for the temporal modeling of user behavior significantly improve upon the state of the art. Finally, we discuss opportunities for using models of temporal dynamics to enhance other areas of Web search and information retrieval.

...read moreread less

73 citations

Journal Article•DOI•

Practical linear-time O(1)-workspace suffix sorting for constant alphabets

[...]

Ge Nong¹•Institutions (1)

Sun Yat-sen University¹

05 Aug 2013-ACM Transactions on Information Systems

TL;DR: In this experiment, SACA-K outperforms SA-IS that was previously the most time- and space-efficient linear-time SA construction algorithm (SACA), and is around 33% faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA.

...read moreread less

Abstract: This article presents an O(n)-time algorithm called SACA-K for sorting the suffixes of an input string T[0, n-1] over an alphabet A[0, K-1]. The problem of sorting the suffixes of T is also known as constructing the suffix array (SA) for T. The theoretical memory usage of SACA-K is nlogK p nlogn p Klogn bits. Moreover, we also have a practical implementation for SACA-K that uses n bytes p (n p 256) words and is suitable for strings over any alphabet up to full ASCII, where a word is log n bits. In our experiment, SACA-K outperforms SA-IS that was previously the most time- and space-efficient linear-time SA construction algorithm (SACA). SACA-K is around 33p faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA. Given K=O(1), SACA-K runs in linear time and O(1) workspace. To the best of our knowledge, such a result is the first reported in the literature with a practical source code publicly available.

...read moreread less

70 citations

Journal Article•DOI•

Discovering tasks from search engine query logs

[...]

Claudio Lucchese¹, Salvatore Orlando², Raffaele Perego¹, Fabrizio Silvestri¹, Gabriele Tolomei² - Show less +1 more•Institutions (2)

Istituto di Scienza e Tecnologie dell'Informazione¹, Ca' Foscari University of Venice²

05 Aug 2013-ACM Transactions on Information Systems

TL;DR: This work proposes a two-step methodology for discovering tasks that users try to perform through search engines, and presents a set of query similarity functions based on unsupervised and supervised learning approaches that exploit these functions in order to detect user tasks.

...read moreread less

Abstract: Although Web search engines still answer user queries with lists of ten blue links to webpages, people are increasingly issuing queries to accomplish their daily tasks (e.g., finding a recipe, booking a flight, reading online news, etc.). In this work, we propose a two-step methodology for discovering tasks that users try to perform through search engines. First, we identify user tasks from individual user sessions stored in search engine query logs. In our vision, a user task is a set of possibly noncontiguous queries (within a user search session), which refer to the same need. Second, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover user tasks, we propose query similarity functions based on unsupervised and supervised learning approaches. We present a set of query clustering methods that exploit these functions in order to detect user tasks. All the proposed solutions were evaluated on a manually-built ground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks, we propose four methods that cluster previously discovered user tasks, which in turn are represented by the bag-of-words extracted from their composing queries. These solutions were also evaluated on another manually-built ground truth.

...read moreread less

68 citations

Journal Article•DOI•

About learning models with multiple query-dependent features

[...]

Craig Macdonald¹, Rodrygo L. T. Santos¹, Iadh Ounis¹, Ben He²•Institutions (2)

University of Glasgow¹, Chinese Academy of Sciences²

05 Aug 2013-ACM Transactions on Information Systems

TL;DR: This work investigates three research questions in order to empirically ascertain best practices for learning-to-rank deployments and shows and explains why multiple weighting models should be deployed as features, and finds that training the weighting model's parameters degrades learned model's effectiveness.

...read moreread less

Abstract: Several questions remain unanswered by the existing literature concerning the deployment of query-dependent features within learning to rank. In this work, we investigate three research questions in order to empirically ascertain best practices for learning-to-rank deployments. (i) Previous work in data fusion that pre-dates learning to rank showed that while different retrieval systems could be effectively combined, the combination of multiple models within the same system was not as effective. In contrast, the existing learning-to-rank datasets (e.g., LETOR), often deploy multiple weighting models as query-dependent features within a single system, raising the question as to whether such a combination is needed. (ii) Next, we investigate whether the training of weighting model parameters, traditionally required for effective retrieval, is necessary within a learning-to-rank context. (iii) Finally, we note that existing learning-to-rank datasets use weighting model features calculated on different fields (e.g., title, content, or anchor text), even though such weighting models have been criticized in the literature. Experiments addressing these three questions are conducted on Web search datasets, using various weighting models as query-dependent and typical query-independent features, which are combined using three learning-to-rank techniques. In particular, we show and explain why multiple weighting models should be deployed as features. Moreover, we unexpectedly find that training the weighting model's parameters degrades learned model's effectiveness. Finally, we show that computing a weighting model separately for each field is less effective than more theoretically-sound field-based weighting models.

...read moreread less

64 citations

Journal Article•DOI•

Fidelity, Soundness, and Efficiency of Interleaved Comparison Methods

[...]

Katja Hofmann¹, Shimon Whiteson¹, Maarten de Rijke¹•Institutions (1)

University of Amsterdam¹

01 Nov 2013-ACM Transactions on Information Systems

TL;DR: A probabilistic interleave method is proposed, which is sound and has fidelity, and it is shown empirically that, by marginalizing out variables that are known, it is more efficient than existing interleaved comparison methods.

...read moreread less

Abstract: Ranker evaluation is central to the research into search engines, be it to compare rankers or to provide feedback for learning to rank. Traditional evaluation approaches do not scale well because they require explicit relevance judgments of document-query pairs, which are expensive to obtain. A promising alternative is the use of interleaved comparison methods, which compare rankers using click data obtained when interleaving their rankings.In this article, we propose a framework for analyzing interleaved comparison methods. An interleaved comparison method has fidelity if the expected outcome of ranker comparisons properly corresponds to the true relevance of the ranked documents. It is sound if its estimates of that expected outcome are unbiased and consistent. It is efficient if those estimates are accurate with only little data.We analyze existing interleaved comparison methods and find that, while sound, none meet our criteria for fidelity. We propose a probabilistic interleave method, which is sound and has fidelity. We show empirically that, by marginalizing out variables that are known, it is more efficient than existing interleaved comparison methods. Using importance sampling we derive a sound extension that is able to reuse historical data collected in previous comparisons of other ranker pairs.

...read moreread less

60 citations

Journal Article•DOI•

Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling

[...]

Quan Wang¹, Jun Xu², Hang Li², Nick Craswell²•Institutions (2)

Peking University¹, Microsoft²

01 Jan 2013-ACM Transactions on Information Systems

TL;DR: This article introduces Regularized Latent Semantic Indexing (RLSI)---including a batch version and an online version, referred to as batch and online RLSI, respectively---to scale up topic modeling and proposes adopting ℓ1 norm on topics andℓ2 norm on document representations to create a model with compact and readable topics and which is useful for retrieval.

...read moreread less

Abstract: Topic modeling provides a powerful way to analyze the content of a collection of documents. It has become a popular tool in many research areas, such as text mining, information retrieval, natural language processing, and other related fields. In real-world applications, however, the usefulness of topic modeling is limited due to scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps, such as vastly reducing input vocabulary. In this article we introduce Regularized Latent Semantic Indexing (RLSI)---including a batch version and an online version, referred to as batch RLSI and online RLSI, respectively---to scale up topic modeling. Batch RLSI and online RLSI are as effective as existing topic modeling techniques and can scale to larger datasets without reducing input vocabulary. Moreover, online RLSI can be applied to stream data and can capture the dynamic evolution of topics. Both versions of RLSI formalize topic modeling as a problem of minimizing a quadratic loss function regularized by e1 and/or e2 norm. This formulation allows the learning process to be decomposed into multiple suboptimization problems which can be optimized in parallel, for example, via MapReduce. We particularly propose adopting e1 norm on topics and e2 norm on document representations to create a model with compact and readable topics and which is useful for retrieval. In learning, batch RLSI processes all the documents in the collection as a whole, while online RLSI processes the documents in the collection one by one. We also prove the convergence of the learning of online RLSI. Relevance ranking experiments on three TREC datasets show that batch RLSI and online RLSI perform better than LSI, PLSI, LDA, and NMF, and the improvements are sometimes statistically significant. Experiments on a Web dataset containing about 1.6 million documents and 7 million terms, demonstrate a similar boost in performance.

...read moreread less

48 citations

Journal Article•DOI•

Transfer joint embedding for cross-domain named entity recognition

[...]

Sinno Jialin Pan¹, Zhiqiang Toh¹, Jian Su¹•Institutions (1)

Institute for Infocomm Research Singapore¹

17 May 2013-ACM Transactions on Information Systems

TL;DR: A new transfer learning method, named Transfer Joint Embedding (TJE), for cross-domain multiclass classification, which can fully exploit the relationships between classes (labels), and reduce domain difference in data distributions for transfer learning.

...read moreread less

Abstract: Named Entity Recognition (NER) is a fundamental task in information extraction from unstructured text. Most previous machine-learning-based NER systems are domain-specific, which implies that they may only perform well on some specific domains (e.g., Newswire) but tend to adapt poorly to other related but different domains (e.g., Weblog). Recently, transfer learning techniques have been proposed to NER. However, most transfer learning approaches to NER are developed for binary classification, while NER is a multiclass classification problem in nature. Therefore, one has to first reduce the NER task to multiple binary classification tasks and solve them independently. In this article, we propose a new transfer learning method, named Transfer Joint Embedding (TJE), for cross-domain multiclass classification, which can fully exploit the relationships between classes (labels), and reduce domain difference in data distributions for transfer learning. More specifically, we aim to embed both labels (outputs) and high-dimensional features (inputs) from different domains (e.g., a source domain and a target domain) into a unified low-dimensional latent space, where 1) each label is represented by a prototype and the intrinsic relationships between labels can be measured by Euclidean distance; 2) the distance in data distributions between the source and target domains can be reduced; 3) the source domain labeled data are closer to their corresponding label-prototypes than others. After the latent space is learned, classification on the target domain data can be done with the simple nearest neighbor rule in the latent space. Furthermore, in order to scale up TJE, we propose an efficient algorithm based on stochastic gradient descent (SGD). Finally, we apply the proposed TJE method for NER across different domains on the ACE 2005 dataset, which is a benchmark in Natural Language Processing (NLP). Experimental results demonstrate the effectiveness of TJE and show that TJE can outperform state-of-the-art transfer learning approaches to NER.

...read moreread less

34 citations

Journal Article•DOI•

Fast candidate generation for real-time tweet search with bloom filter chains

[...]

Nima Asadi¹, Jimmy Lin¹•Institutions (1)

University of Maryland, College Park¹

05 Aug 2013-ACM Transactions on Information Systems

TL;DR: This work explores candidate generation in a two-stage retrieval architecture where an initial list of results is processed by a second-stage rescorer to produce the final output in the context of real-time search on tweets.

...read moreread less

Abstract: The rise of social media and other forms of user-generated content have created the demand for real-time search: against a high-velocity stream of incoming documents, users desire a list of relevant results at the time the query is issued. In the context of real-time search on tweets, this work explores candidate generation in a two-stage retrieval architecture where an initial list of results is processed by a second-stage rescorer to produce the final output. We introduce Bloom filter chains, a novel extension of Bloom filters that can dynamically expand to efficiently represent an arbitrarily long and growing list of monotonically-increasing integers with a constant false positive rate. Using a collection of Bloom filter chains, a novel approximate candidate generation algorithm called BWand is able to perform both conjunctive and disjunctive retrieval. Experiments show that our algorithm is many times faster than competitive baselines and that this increased performance does not require sacrificing end-to-end effectiveness. Our results empirically characterize the trade-off space defined by output quality, query evaluation speed, and memory footprint for this particular search architecture.

...read moreread less

30 citations

Journal Article•DOI•

X-Class: Associative Classification of XML Documents by Structure

[...]

Gianni Costa¹, Riccardo Ortale¹, Ettore Ritacco¹•Institutions (1)

Indian Council of Agricultural Research¹

01 Jan 2013-ACM Transactions on Information Systems

TL;DR: Empirical evidence reveals that the classifiers induced by X-Class are compact, scalable, and at least as effective as the established competitors.

...read moreread less

Abstract: The supervised classification of XML documents by structure involves learning predictive models in which certain structural regularities discriminate the individual document classes. Hitherto, research has focused on the adoption of prespecified substructures. This is detrimental for classification effectiveness, since the a priori chosen substructures may not accord with the structural properties of the XML documents. Therein, an unexplored question is how to choose the type of structural regularity that best adapts to the structures of the available XML documents.We tackle this problem through X-Class, an approach that handles all types of tree-like substructures and allows for choosing the most discriminatory one. Algorithms are designed to learn compact rule-based classifiers in which the chosen substructures discriminate the classes of XML documents.X-Class is studied across various domains and types of substructures. Its classification performance is compared against several rule-based and SVM-based competitors. Empirical evidence reveals that the classifiers induced by X-Class are compact, scalable, and at least as effective as the established competitors. In particular, certain substructures allow the induction of very compact classifiers that generally outperform the rule-based competitors in terms of effectiveness over all chosen corpora of XML data. Furthermore, such classifiers are substantially as effective as the SVM-based competitor, with the additional advantage of a high-degree of interpretability.

...read moreread less

Journal Article•DOI•

Modeling reformulation using query distributions

[...]

Xiaobing Xue¹, W. Bruce Croft¹•Institutions (1)

University of Massachusetts Amherst¹

17 May 2013-ACM Transactions on Information Systems

TL;DR: A novel framework is proposed that models reformulation as a distribution of actual queries, where each query is a variation of the original query, and thus captures important query-level dependencies between words and phrases.

...read moreread less

Abstract: Query reformulation modifies the original query with the aim of better matching the vocabulary of the relevant documents, and consequently improving ranking effectiveness. Previous models typically generate words and phrases related to the original query, but do not consider how these words and phrases would fit together in actual queries. In this article, a novel framework is proposed that models reformulation as a distribution of actual queries, where each query is a variation of the original query. This approach considers an actual query as the basic unit and thus captures important query-level dependencies between words and phrases. An implementation of this framework that only uses publicly available resources is proposed, which makes fair comparisons with other methods using TREC collections possible. Specifically, this implementation consists of a query generation step that analyzes the passages containing query words to generate reformulated queries and a probability estimation step that learns a distribution for reformulated queries by optimizing the retrieval performance. Experiments on TREC collections show that the proposed model can significantly outperform previous reformulation models.

...read moreread less

Journal Article•DOI•

Mining pure high-order word associations via information geometry for information retrieval

[...]

Yuexian Hou¹, Xiaozhao Zhao¹, Dawei Song¹, Wenjie Li²•Institutions (2)

Tianjin University¹, Hong Kong Polytechnic University²

05 Aug 2013-ACM Transactions on Information Systems

TL;DR: This article derives and proves the sufficient criteria that entail UPD and CPD, within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure.

...read moreread less

Abstract: The classical bag-of-word models for information retrieval (IR) fail to capture contextual associations between words. In this article, we propose to investigate pure high-order dependence among a number of words forming an unseparable semantic entity, that is, the high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these pure high-order dependence patterns would lead to a better representation of documents and novel retrieval models. Specifically, two formal definitions of pure dependence—unconditional pure dependence (UPD) and conditional pure dependence (CPD)—are defined. The exact decision on UPD and CPD, however, is NP-hard in general. We hence derive and prove the sufficient criteria that entail UPD and CPD, within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods for extracting word patterns with pure high-order dependence. Our methods are applied to and extensively evaluated on three typical IR tasks: text classification and text retrieval without and with query expansion.

...read moreread less

Journal Article•DOI•

Improving Text Classification Accuracy by Training Label Cleaning

[...]

Andrea Esuli, Fabrizio Sebastiani

01 Nov 2013-ACM Transactions on Information Systems

TL;DR: Three different techniques for performing TLC are presented and, on three widely used TC benchmarks, they are evaluated by their capability of spotting training documents that, for experimental reasons only, the authors have purposefully mislabelled.

...read moreread less

Abstract: In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain. Semisupervised learning and active learning are two strategies whose aim is maximizing the effectiveness of the resulting classifiers for a given amount of training effort. Both strategies have been actively investigated for TC in recent years. Much less research has been devoted to a third such strategy, training label cleaning (TLC), which consists in devising ranking functions that sort the original training examples in terms of how likely it is that the human annotator has mislabelled them. This provides a convenient means for the human annotator to revise the training set so as to improve its quality. Working in the context of boosting-based learning methods for multilabel classification we present three different techniques for performing TLC and, on three widely used TC benchmarks, evaluate them by their capability of spotting training documents that, for experimental reasons only, we have purposefully mislabelled. We also evaluate the degradation in classification effectiveness that these mislabelled texts bring about, and to what extent training label cleaning can prevent this degradation.

...read moreread less

Journal Article•DOI•

The Effect of Social and Physical Detachment on Information Need

[...]

Elad Yom-Tov¹, Fernando Diaz¹•Institutions (1)

Yahoo!¹

01 Jan 2013-ACM Transactions on Information Systems

TL;DR: In this paper, the authors investigated how the information need of users, as expressed by their queries, is affected by their physical detachment, as estimated by their location in relation to an event, and by their social detachment, quantified by the number of their acquaintances who may be affected by the event.

...read moreread less

Abstract: The information need of users and the documents which answer this need are frequently contingent on the different characteristics of users. This is especially evident during natural disasters, such as earthquakes and violent weather incidents, which create a strong transient information need. In this article, we investigate how the information need of users, as expressed by their queries, is affected by their physical detachment, as estimated by their physical location in relation to that of the event, and by their social detachment, as quantified by the number of their acquaintances who may be affected by the event. Drawing on large-scale data from ten major events, we show that social and physical detachment levels of users are a major influence on their search engine queries. We demonstrate how knowing social and physical detachment levels can assist in improving retrieval for two applications: identifying search queries related to events and ranking results in response to event-related queries. We find that the average precision in identifying relevant search queries improves by approximately 18p, and that the average precision of ranking that uses detachment information improves by 10p. Using both types of detachment achieved a larger gain in performance than each of them separately.

...read moreread less

Journal Article•DOI•

Social Link Prediction in Online Social Tagging Systems

[...]

Charalampos Chelmis¹, Viktor K. Prasanna¹•Institutions (1)

University of Southern California¹

01 Nov 2013-ACM Transactions on Information Systems

TL;DR: This article proposes latent topic models as a principled way of reducing the dimensionality of such data and capturing the dynamics of collaborative annotation process and proposes three generative processes to model latent user tastes with respect to resources they annotate with metadata.

...read moreread less

Abstract: Social networks have become a popular medium for people to communicate and distribute ideas, content, news, and advertisements. Social content annotation has naturally emerged as a method of categorization and filtering of online information. The unrestricted vocabulary users choose from to annotate content has often lead to an explosion of the size of space in which search is performed. In this article, we propose latent topic models as a principled way of reducing the dimensionality of such data and capturing the dynamics of collaborative annotation process. We propose three generative processes to model latent user tastes with respect to resources they annotate with metadata. We show that latent user interests combined with social clues from the immediate neighborhood of users can significantly improve social link prediction in the online music social media site Last.fm. Most link prediction methods suffer from the high class imbalance problem, resulting in low precision and/or recall. In contrast, our proposed classification schemes for social link recommendation achieve high precision and recall with respect to not only the dominant class (nonexistence of a link), but also with respect to sparse positive instances, which are the most vital in social tie prediction.

...read moreread less

Journal Article•DOI•

Effective and Robust Query-Based Stemming

[...]

Jiaul H. Paik¹, Swapan K. Parui¹, Dipasree Pal¹, Stephen Robertson²•Institutions (2)

Indian Statistical Institute¹, Microsoft²

01 Nov 2013-ACM Transactions on Information Systems

TL;DR: This article presents corpus-based fully automatic stemming algorithms which consistently and significantly outperform four state of the art strong stemmers of completely varying principles and confirms that the robustness of the proposed query-based stemming algorithms are remarkably better than the existing strong baselines.

...read moreread less

Abstract: Stemming is a widely used technique in information retrieval systems to address the vocabulary mismatch problem arising out of morphological phenomena. The major shortcoming of the commonly used stemmers is that they accept the morphological variants of the query words without considering their thematic coherence with the given query, which leads to poor performance. Moreover, for many queries, such approaches also produce retrieval performance that is poorer than no stemming, thereby degrading the robustness. The main goal of this article is to present corpus-based fully automatic stemming algorithms which address these issues. A set of experiments on six TREC collections and three other non-English collections containing news and web documents shows that the proposed query-based stemming algorithms consistently and significantly outperform four state of the art strong stemmers of completely varying principles. Our experiments also confirm that the robustness of the proposed query-based stemming algorithms are remarkably better than the existing strong baselines.

...read moreread less

Journal Article•DOI•

Approximate Recall Confidence Intervals

[...]

William Webber¹•Institutions (1)

University of Maryland, College Park¹

01 Jan 2013-ACM Transactions on Information Systems

TL;DR: In this paper, the authors examined several methods for estimating two-tailed recall confidence intervals and found that the normal approximation in current use provides poor coverage in many circumstances, even when adjusted to correct its inappropriate symmetry.

...read moreread less

Abstract: Recall, the proportion of relevant documents retrieved, is an important measure of effectiveness in information retrieval, particularly in the legal, patent, and medical domains. Where document sets are too large for exhaustive relevance assessment, recall can be estimated by assessing a random sample of documents, but an indication of the reliability of this estimate is also required. In this article, we examine several methods for estimating two-tailed recall confidence intervals. We find that the normal approximation in current use provides poor coverage in many circumstances, even when adjusted to correct its inappropriate symmetry. Analytic and Bayesian methods based on the ratio of binomials are generally more accurate but are inaccurate on small populations. The method we recommend derives beta-binomial posteriors on retrieved and unretrieved yield, with fixed hyperparameters, and a Monte Carlo estimate of the posterior distribution of recall. We demonstrate that this method gives mean coverage at or near the nominal level, across several scenarios, while being balanced and stable. We offer advice on sampling design, including the allocation of assessments to the retrieved and unretrieved segments, and compare the proposed beta-binomial with the officially reported normal intervals for recent TREC Legal Track iterations.

...read moreread less

Journal Article•DOI•

Efficient fuzzy search in large text collections

[...]

Hannah Bast¹, Marjan Celikik¹•Institutions (1)

University of Freiburg¹

17 May 2013-ACM Transactions on Information Systems

TL;DR: This work considers the problem of fuzzy full-text search in large text collections and considers two similarity measures, one where the query terms match similar terms in the collection (e.g., algorithm matches algoritm or vice versa) and one where they match terms with a similar prefix in theCollection.

...read moreread less

Abstract: We consider the problem of fuzzy full-text search in large text collections, that is, full-text search which is robust against errors both on the side of the query as well as on the side of the documents. Standard inverted-index techniques work extremely well for ordinary full-text search but fail to achieve interactive query times (below 100 milliseconds) for fuzzy full-text search even on moderately-sized text collections (above 10 GBs of text). We present new preprocessing techniques that achieve interactive query times on large text collections (100 GB of text, served by a single machine). We consider two similarity measures, one where the query terms match similar terms in the collection (e.g., algorithm matches algoritm or vice versa) and one where the query terms match terms with a similar prefix in the collection (e.g., alori matches algorithm). The latter is important when we want to display results instantly after each keystroke (search as you type). All algorithms have been fully integrated into the CompleteSearch engine.

...read moreread less

Journal Article•DOI•

Studying the clustering paradox and scalability of search in highly distributed environments

[...]

Weimao Ke¹, Javed Mostafa²•Institutions (2)

Drexel University¹, University of North Carolina at Chapel Hill²

17 May 2013-ACM Transactions on Information Systems

TL;DR: This article addresses one important aspect of scalability challenges facing information retrieval models and investigates a decentralized, organic view of information systems pertaining to search in large-scale networks.

...read moreread less

Abstract: With the ubiquitous production, distribution and consumption of information, today's digital environments such as the Web are increasingly large and decentralized. It is hardly possible to obtain central control over information collections and systems in these environments. Searching for information in these information spaces has brought about problems beyond traditional boundaries of information retrieval (IR) research. This article addresses one important aspect of scalability challenges facing information retrieval models and investigates a decentralized, organic view of information systems pertaining to search in large-scale networks. Drawing on observations from earlier studies, we conduct a series of experiments on decentralized searches in large-scale networked information spaces. Results show that how distributed systems interconnect is crucial to retrieval performance and scalability of searching. Particularly, in various experimental settings and retrieval tasks, we find a consistent phenomenon, namely, the Clustering Paradox, in which the level of network clustering (semantic overlay) imposes a scalability limit. Scalable searches are well supported by a specific, balanced level of network clustering emerging from local system interconnectivity. Departure from that level, either stronger or weaker clustering, leads to search performance degradation, which is dramatic in large-scale networks.

...read moreread less

Journal Article•DOI•

Efficient Video Stream Monitoring for Near-Duplicate Detection and Localization in a Large-Scale Repository

[...]

Chih-Yi Chiu¹, Tsung-Han Tsai¹, Guei-Wun Han¹, Cheng-Yu Hsieh¹, Sheng-Yang Li¹ - Show less +1 more•Institutions (1)

National Chiayi University¹

01 Nov 2013-ACM Transactions on Information Systems

TL;DR: This article presents a simple but effective algorithm called incremental similarity update, which takes a lightweight computation to filter out unnecessary time-consuming computation for the actual similarity between two videos, making the search process more efficient.

...read moreread less

Abstract: In this article, we study the efficiency problem of video stream near-duplicate monitoring in a large-scale repository. Existing stream monitoring methods are mainly designed for a short video to scan over a query stream; they have difficulty being scalable for a large number of long videos. We present a simple but effective algorithm called incremental similarity update to address the problem. That is, a similarity upper bound between two videos can be calculated incrementally by leveraging the prior knowledge of the previous calculation. The similarity upper bound takes a lightweight computation to filter out unnecessary time-consuming computation for the actual similarity between two videos, making the search process more efficient. We integrate the algorithm with inverted indexing to obtain a candidate list from the repository for the given query stream. Meanwhile, the algorithm is applied to scan each candidate for locating exact near-duplicate subsequences. We implement several state-of-the-art methods for comparison in terms of accuracy, execution time, and memory consumption. Experimental results demonstrate the proposed algorithm yields comparable accuracy, compact memory size, and more efficient execution time.

...read moreread less

Journal Article•DOI•

The Impacts of Structural Difference and Temporality of Tweets on Retrieval Effectiveness

[...]

Lifeng Jia¹, Clement Yu¹, Weiyi Meng²•Institutions (2)

University of Illinois at Chicago¹, Binghamton University²

01 Nov 2013-ACM Transactions on Information Systems

TL;DR: Experimental results show that the way of ranking the two types of tweets separately and then merging them together yields better retrieval effectiveness than ranking them simultaneously, and the method compares favorably with state-of-the-art methods in retrieval effectiveness.

...read moreread less

Abstract: To explore the information seeking behaviors in microblogosphere, the microblog track at TREC 2011 introduced a real-time ad-hoc retrieval task that aims at ranking relevant tweets in reverse-chronological order. We study this problem via a two-phase approach: 1) retrieving tweets in an ad-hoc way; 2) utilizing the temporal information of tweets to enhance the retrieval effectiveness of tweets. Tweets can be categorized into two types. One type consists of short messages not containing any URL of a Web page. The other type has at least one URL of a Web page in addition to a short message. These two types of tweets have different structures. In the first phase, to address the structural difference of tweets, we propose a method to rank tweets using the divide-and-conquer strategy. Specifically, we first rank the two types of tweets separately. This produces two rankings, one for each type. Then we merge these two rankings of tweets into one ranking. In the second phase, we first categorize queries into several types by exploring the temporal distributions of their top-retrieved tweets from the first phase; then we calculate the time-related relevance scores of tweets according to the classified types of queries; finally we combine the time scores with the IR scores from the first phase to produce a ranking of tweets. Experimental results achieved by using the TREC 2011 and TREC 2012 queries over the TREC Tweets2011 collection show that: (i) our way of ranking the two types of tweets separately and then merging them together yields better retrieval effectiveness than ranking them simultaneously; (ii) our way of incorporating temporal information into the retrieval process yields further improvements, and (iii) our method compares favorably with state-of-the-art methods in retrieval effectiveness.

...read moreread less