
Showing papers presented at the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016


Proceedings ArticleDOI
07 Jul 2016
TL;DR: A new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique is designed for efficiently optimizing a Matrix Factorization (MF) model with variably-weighted missing data; this efficiency is then exploited to seamlessly devise an incremental update strategy that instantly refreshes an MF model given new feedback.
Abstract: This paper contributes improvements to both the effectiveness and efficiency of Matrix Factorization (MF) methods for implicit feedback. We highlight two critical issues in existing work. First, due to the large space of unobserved feedback, most existing works resort to assigning a uniform weight to the missing data to reduce computational complexity. However, such a uniform assumption is invalid in real-world settings. Second, most methods are designed in an offline setting and fail to keep up with the dynamic nature of online data. We address both issues in learning MF models from implicit feedback. We first propose to weight the missing data based on item popularity, which is more effective and flexible than the uniform-weight assumption. However, such non-uniform weighting poses an efficiency challenge in learning the model. To address this, we design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique for efficiently optimizing an MF model with variably-weighted missing data. We then exploit this efficiency to seamlessly devise an incremental update strategy that instantly refreshes an MF model given new feedback. Through comprehensive experiments on two public datasets in both offline and online protocols, we show that our implemented, open-source (https://github.com/hexiangnan/sigir16-eals) eALS consistently outperforms state-of-the-art implicit MF methods.

916 citations
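
To make the popularity-aware weighting concrete, here is a minimal NumPy sketch of one ALS pass under non-uniform missing-data weights. The constants `c0` and `alpha` and all names are illustrative, and the naive per-user solve below omits the element-wise caching that makes the paper's eALS efficient.

```python
import numpy as np

def popularity_weights(R, c0=512, alpha=0.5):
    """Confidence weights for missing entries: c_i proportional to f_i^alpha."""
    f = (R > 0).sum(axis=0).astype(float)          # item popularity
    w = f ** alpha
    return c0 * w / w.sum()

def als_step_users(R, P, Q, c, w_obs=1.0, reg=0.01):
    """Closed-form update of each user vector under non-uniform weights.

    Naive O(M*N*k^2) version; eALS reaches the same solution faster with caches.
    """
    n_users, k = P.shape
    for u in range(n_users):
        obs = R[u] > 0
        W = np.where(obs, w_obs, c)                # per-item weight for this user
        A = (Q * W[:, None]).T @ Q + reg * np.eye(k)
        b = (Q * W[:, None]).T @ R[u]
        P[u] = np.linalg.solve(A, b)
    return P

rng = np.random.default_rng(0)
R = (rng.random((50, 30)) < 0.1).astype(float)     # toy implicit feedback matrix
P, Q = rng.normal(size=(50, 8)), rng.normal(size=(30, 8))
P = als_step_users(R, P, Q, popularity_weights(R))
```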


Proceedings ArticleDOI
Feng Yu, Qiang Liu, Shu Wu, Liang Wang, Tieniu Tan
07 Jul 2016
TL;DR: This work proposes a novel model, Dynamic REcurrent bAsket Model (DREAM), based on Recurrent Neural Network (RNN), which not only learns a dynamic representation of a user but also captures global sequential features among baskets.
Abstract: Next basket recommendation has attracted increasing attention. Most conventional models explore either sequential transaction features or the general interests of users. Further, some works treat users' general interests and sequential behaviors as two entirely separate matters and then combine them in some way for next basket recommendation. Moreover, the state-of-the-art models are based on the assumption of Markov Chains (MC), which only captures local sequential features between two adjacent baskets. In this work, we propose a novel model, the Dynamic REcurrent bAsket Model (DREAM), based on a Recurrent Neural Network (RNN). DREAM not only learns a dynamic representation of a user but also captures global sequential features among baskets. The dynamic representation of a specific user reveals the user's dynamic interests at different times, while the global sequential features reflect interactions among all baskets of the user over time. Experimental results on two public datasets indicate that DREAM is more effective than the state-of-the-art models for next basket recommendation.

420 citations
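
A hedged PyTorch sketch of the DREAM idea: pool the item embeddings in each basket, run an RNN over the basket sequence, and score all items against the resulting dynamic user state. Layer sizes and names are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DreamSketch(nn.Module):
    """Sketch of a DREAM-style model: pool items in each basket,
    run an RNN over baskets, score all items against the hidden state."""
    def __init__(self, n_items, dim=32):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)
        self.rnn = nn.RNN(dim, dim, batch_first=True)

    def forward(self, baskets):                    # list of LongTensors of item ids
        pooled = torch.stack([self.item_emb(b).max(dim=0).values for b in baskets])
        h, _ = self.rnn(pooled.unsqueeze(0))       # (1, n_baskets, dim)
        user_state = h[0, -1]                      # dynamic user representation
        return self.item_emb.weight @ user_state   # score for every item

model = DreamSketch(n_items=100)
scores = model([torch.tensor([3, 17, 42]), torch.tensor([5, 17])])
```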


Proceedings ArticleDOI
Rui Yan, Yiping Song, Hua Wu
07 Jul 2016
TL;DR: This paper proposes a retrieval-based conversation system with the deep learning-to-respond schema through a deep neural network framework driven by web data and demonstrates significant performance improvement against a series of standard and state-of-art baselines for conversational purposes.
Abstract: Establishing an automatic conversation system between humans and computers is regarded as one of the hardest problems in computer science, involving interdisciplinary techniques from information retrieval, natural language processing, artificial intelligence, etc. The challenge lies in how to respond so as to maintain a relevant and continuous conversation with humans. Along with the prosperity of Web 2.0, we are now able to collect extremely massive conversational data that is publicly available. This presents a great opportunity to launch automatic conversation systems. Owing to the diversity of Web resources, a retrieval-based conversation system will be able to find at least some responses from the massive repository for any user input. Given a human-issued message, i.e., a query, our system provides a reply after adequate training and learning of how to respond. In this paper, we propose a retrieval-based conversation system with a deep learning-to-respond schema built on a deep neural network framework driven by web data. The proposed model is general and unified for different conversation scenarios in the open domain. We incorporate the impact of multiple data inputs, and formulate various features and factors with optimization into the deep learning framework. In the experiments, we investigate the effectiveness of the proposed deep neural network structures under different combinations of the available evidence. We demonstrate significant performance improvements over a series of standard and state-of-the-art baselines in terms of p@1, MAP, nDCG, and MRR for conversational purposes.

385 citations
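
The retrieval-based skeleton can be illustrated independently of the paper's deep matching network: recall candidate replies from a repository, then rerank. The sketch below uses TF-IDF similarity as a stand-in for both stages; all data and names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in repository of (post, reply) pairs.
posts = ["how do I learn piano", "best pizza in town", "tips for jet lag"]
replies = ["start with scales and simple songs",
           "try the place on 5th street",
           "get sunlight and avoid naps on arrival"]

vec = TfidfVectorizer().fit(posts)

def respond(query, k=2):
    sims = cosine_similarity(vec.transform([query]), vec.transform(posts))[0]
    candidates = sims.argsort()[::-1][:k]          # recall step
    # A real system would rerank candidates with a learned deep matching
    # model; here the TF-IDF score itself stands in for that reranker.
    return replies[candidates[0]]

print(respond("learning the piano"))
```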


Proceedings ArticleDOI
07 Jul 2016
TL;DR: A simple, fast, and effective topic model for short texts, named GPU-DMM, based on the Dirichlet Multinomial Mixture model, which achieves comparable or better topic representations than state-of-the-art models, measured by topic coherence.
Abstract: For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is so short, short texts are much sparser in terms of word co-occurrences, and this data sparsity becomes a bottleneck for conventional topic models to achieve good results on short texts. On the other hand, when a human interprets a piece of short text, the understanding is based not solely on its content words but also on background knowledge (e.g., semantically related words). Recent advances in word embedding offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) model, GPU-DMM promotes semantically related words under the same topic during the sampling process by using the generalized Polya urn (GPU) model. In this sense, background knowledge about word semantic relatedness learned from millions of external documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or better topic representations than state-of-the-art models, as measured by topic coherence. The learned topic representation also leads to the best accuracy in a text classification task, which serves as an indirect evaluation.

293 citations
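
The generalized Polya urn step is the heart of the model: assigning a word to a topic also adds fractional count mass to semantically related words under that topic. A minimal sketch, with the related-word map and the promotion weight `mu` as assumed inputs (the paper derives relatedness from word embeddings):

```python
from collections import defaultdict

# Word-topic counts and a precomputed map of semantically related words
# (in the paper these come from word embeddings; here they are given).
n_wk = defaultdict(float)
related = {"film": ["movie", "cinema"], "movie": ["film"]}

def gpu_assign(word, topic, mu=0.3):
    """Generalized Polya urn update: the sampled word adds a full count,
    and each related word adds a fractional count under the same topic."""
    n_wk[(word, topic)] += 1.0
    for w in related.get(word, []):
        n_wk[(w, topic)] += mu

gpu_assign("film", topic=2)
```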


Proceedings ArticleDOI
07 Jul 2016
TL;DR: It is empirically demonstrated that learning-to-rank that accounts for query-dependent selection bias yields significant improvements in search effectiveness, through online experiments with one of the world's largest personal search engines.
Abstract: Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can be easily collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance of individual query-document pairs in the context of web search. These click models typically require a large quantity of clicks for each individual pair, which makes them difficult to apply in systems where click data is highly sparse due to personalized corpora and information needs, e.g., personal search. In this paper, we study how to leverage sparse click data in personal search: we introduce a novel selection bias problem and address it within the learning-to-rank framework. We propose several bias estimation methods, including a novel query-dependent one that captures queries with similar results and can successfully deal with sparse data. We empirically demonstrate that learning-to-rank that accounts for query-dependent selection bias yields significant improvements in search effectiveness, through online experiments with one of the world's largest personal search engines.

272 citations
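
One standard way to account for selection bias, sketched below, is inverse propensity weighting: clicks observed at positions that are less likely to be examined are up-weighted in the loss. The pointwise logistic loss and the propensity values are illustrative stand-ins for the paper's query-dependent estimators.

```python
import numpy as np

def unbiased_click_loss(scores, clicks, propensities):
    """Each clicked result is weighted by 1 / P(result was examined), so
    clicks at low-propensity positions count more (a logistic pointwise
    stand-in for the paper's learning-to-rank objective)."""
    w = clicks / propensities
    return -np.sum(w * np.log(1.0 / (1.0 + np.exp(-scores))))

scores = np.array([2.1, 0.3, -0.5])           # model scores at ranks 1..3
clicks = np.array([0.0, 1.0, 0.0])            # observed clicks
prop = np.array([1.0, 0.6, 0.4])              # estimated examination propensities
print(unbiased_click_loss(scores, clicks, prop))
```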


Proceedings ArticleDOI
07 Jul 2016
TL;DR: This paper proposes a principled CF hashing framework called Discrete Collaborative Filtering (DCF), which directly tackles the challenging discrete optimization that should be treated adequately in hashing, and devises a computationally efficient algorithm for DCF with a rigorous convergence proof.
Abstract: We address the efficiency problem of Collaborative Filtering (CF) by hashing users and items as latent vectors in the form of binary codes, so that user-item affinity can be efficiently calculated in a Hamming space. However, existing hashing methods for CF employ binary code learning procedures that mostly suffer from the challenging discrete constraints. Hence, those methods generally adopt a two-stage learning scheme: relaxed optimization that discards the discrete constraints, followed by binary quantization. We argue that such a scheme results in a large quantization loss, which especially compromises the performance of large-scale CF that resorts to longer binary codes. In this paper, we propose a principled CF hashing framework called Discrete Collaborative Filtering (DCF), which directly tackles the challenging discrete optimization that should be treated adequately in hashing. The formulation of DCF has two advantages: 1) a Hamming-similarity-induced loss that preserves the intrinsic user-item similarity, and 2) balanced and uncorrelated code constraints that yield compact yet informative binary codes. We devise a computationally efficient algorithm for DCF with a rigorous convergence proof. Through extensive experiments on several real-world benchmarks, we show that DCF consistently outperforms state-of-the-art CF hashing techniques; e.g., even with only 8 bits, DCF is significantly better than other methods using 128 bits.

252 citations
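
The payoff of binary codes is retrieval cost: user-item affinity reduces to Hamming distance, computable with XOR and popcount. A small NumPy sketch of top-k retrieval over packed codes (code length and data are illustrative):

```python
import numpy as np

def hamming_topk(query_code, item_codes, k=5):
    """Rank items by Hamming distance between packed binary codes.
    Codes are uint8 arrays, 8 bits per byte (e.g. 8 bytes = 64-bit codes)."""
    xor = np.bitwise_xor(item_codes, query_code)            # (n_items, n_bytes)
    dist = np.unpackbits(xor, axis=1).sum(axis=1)           # popcount per item
    return np.argsort(dist)[:k]

rng = np.random.default_rng(0)
items = rng.integers(0, 256, size=(1000, 8), dtype=np.uint8)
print(hamming_topk(items[42], items))                        # item 42 ranks first
```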


Proceedings ArticleDOI
07 Jul 2016
TL;DR: In this paper, an adaptive clustering technique for content recommendation based on exploration-exploitation strategies in contextual multi-armed bandit settings is proposed, which takes into account the collaborative effects that arise due to the interaction of the users with the items, by dynamically grouping users based on the items under consideration and, at the same time, grouping items based on similarity of the clusterings induced over the users.
Abstract: Classical collaborative filtering and content-based filtering methods try to learn a static recommendation model given training data. These approaches are far from ideal in highly dynamic recommendation domains such as news recommendation and computational advertisement, where the set of items and users is very fluid. In this work, we investigate an adaptive clustering technique for content recommendation based on exploration-exploitation strategies in contextual multi-armed bandit settings. Our algorithm takes into account the collaborative effects that arise from the interaction of users with items by dynamically grouping users based on the items under consideration and, at the same time, grouping items based on the similarity of the clusterings induced over the users. The resulting algorithm thus takes advantage of preference patterns in the data in a way akin to collaborative filtering methods. We provide an empirical analysis on medium-size real-world datasets, showing scalability and increased prediction performance (as measured by click-through rate) over state-of-the-art methods for clustering bandits. We also provide a regret analysis within a standard linear stochastic noise setting.

228 citations
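
The exploration-exploitation core that the clustering is layered on is a LinUCB-style contextual bandit; a minimal sketch follows, with the dynamic user/item clustering bookkeeping omitted. `alpha` controls the width of the confidence bonus.

```python
import numpy as np

class LinUCB:
    """Plain LinUCB for one (user cluster's) model; the paper's algorithm
    additionally merges and splits such models by clustering users and items."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)          # ridge regression Gram matrix
        self.b = np.zeros(dim)
        self.alpha = alpha

    def select(self, arms):           # arms: (n_arms, dim) context vectors
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        bonus = np.sqrt(np.einsum('ad,de,ae->a', arms, A_inv, arms))
        return int(np.argmax(arms @ theta + self.alpha * bonus))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

bandit = LinUCB(dim=4)
x = np.random.default_rng(0).random((10, 4))   # 10 candidate articles
a = bandit.select(x)
bandit.update(x[a], reward=1.0)
```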


Proceedings ArticleDOI
07 Jul 2016
TL;DR: A novel deep neural network based architecture that models the combination of long-term static and short-term temporal user preferences to improve the recommendation performance and a novel pre-train method to reduce the number of free parameters significantly is proposed.
Abstract: Modeling temporal behavior in recommendation systems is an important and challenging problem. Its challenges stem from the fact that temporal modeling increases the cost of parameter estimation and inference while requiring a large amount of data to reliably learn the model with the additional time dimensions. It is therefore often difficult to model temporal behavior in large-scale real-world recommendation systems. In this work, we propose a novel deep neural network based architecture that models the combination of long-term static and short-term temporal user preferences to improve recommendation performance. To train the model efficiently for large-scale applications, we propose a novel pre-training method that significantly reduces the number of free parameters. The resulting model is applied to a real-world dataset from a commercial news recommendation system. We compare against a set of established baselines, and the experimental results show that our method significantly outperforms the state of the art.

168 citations
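
A hedged PyTorch sketch of the combination the abstract describes: a static embedding for long-term preference concatenated with an RNN state over recently consumed items. Layer names and sizes are illustrative, and the paper's pre-training scheme is omitted.

```python
import torch
import torch.nn as nn

class LongShortSketch(nn.Module):
    """Combine a static long-term user embedding with a short-term state
    from a GRU over recently consumed items, then score candidate items."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)      # long-term preference
        self.item_emb = nn.Embedding(n_items, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, user, recent_items):              # recent_items: (1, t)
        _, h = self.gru(self.item_emb(recent_items))    # short-term state
        u = torch.cat([self.user_emb(user), h[-1]], dim=-1)
        return self.mix(u) @ self.item_emb.weight.T     # scores over all items

model = LongShortSketch(n_users=10, n_items=100)
scores = model(torch.tensor([3]), torch.tensor([[5, 9, 2]]))
```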


Proceedings ArticleDOI
07 Jul 2016
TL;DR: This paper proposes a novel model called LRPPM-CF, which is able to improve the performance in the tasks of capturing users' interested features and item recommendation by about 17%-24% and 7%-13%, respectively, as compared with several state-of-the-art methods.
Abstract: Incorporating phrase-level sentiment analysis of users' textual reviews into recommendation has become a popular method due to its explainable latent features and high prediction accuracy. However, the inherent limitations of existing models make it difficult to (1) effectively distinguish the features that are most interesting to users, (2) maintain recommendation performance when the set of items is scaled up to multiple categories, and (3) model users' implicit feedback on product features. In this paper, motivated by these shortcomings, we first introduce a tensor factorization algorithm to Learn to Rank user Preferences based on Phrase-level sentiment analysis across Multiple categories (LRPPM for short), and then, by combining this technique with Collaborative Filtering (CF), we propose a novel model called LRPPM-CF to boost recommendation performance. Thorough experiments on two real-world datasets demonstrate that our proposed model improves performance in the tasks of capturing users' interested features and item recommendation by about 17%-24% and 7%-13%, respectively, compared with several state-of-the-art methods.

147 citations
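
A generic sketch of learning to rank product features with a CP-style tensor score, under a BPR-like pairwise objective; this illustrates the family of models the abstract describes, not the paper's exact formulation.

```python
import numpy as np

def triple_score(P, Q, T, u, i, f):
    """Generic CP-style tensor scoring of a (user, item, feature) triple;
    a stand-in for LRPPM's tensor factorization, not its exact form."""
    return np.sum(P[u] * Q[i] * T[f])

def bpr_feature_update(P, Q, T, u, i, f_pos, f_neg, lr=0.05):
    """Pairwise step: push a feature the user commented on above a
    feature they did not, for the same (user, item) pair."""
    x = triple_score(P, Q, T, u, i, f_pos) - triple_score(P, Q, T, u, i, f_neg)
    g = 1.0 / (1.0 + np.exp(x))                  # gradient of -log sigmoid(x)
    common = P[u] * Q[i]
    T[f_pos] += lr * g * common
    T[f_neg] -= lr * g * common

rng = np.random.default_rng(0)
P, Q, T = (rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(4, 8)))
bpr_feature_update(P, Q, T, u=0, i=1, f_pos=2, f_neg=3)
```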


Proceedings ArticleDOI
07 Jul 2016
TL;DR: GeoBurst is a method that enables effective and real-time local event detection from geo-tagged tweet streams and significantly outperforms state-of-the-art methods in precision, and is orders of magnitude faster.
Abstract: The real-time discovery of local events (e.g., protests, crimes, disasters) is of great importance to various applications, such as crime monitoring, disaster alerting, and activity recommendation. While this task was nearly impossible years ago due to the lack of timely and reliable data sources, the recent explosive growth in geo-tagged tweet data brings new opportunities. That said, how to extract quality local events from geo-tagged tweet streams in real time remains largely unsolved. We propose GeoBurst, a method that enables effective and real-time local event detection from geo-tagged tweet streams. With a novel authority measure that captures the geo-topic correlations among tweets, GeoBurst first identifies several pivots in the query window. Such pivots serve as representative tweets for potential local events and naturally attract similar tweets to form candidate events. To select truly interesting local events from the candidate list, GeoBurst further summarizes continuous tweet streams and compares the candidates against historical activities to obtain spatiotemporally bursty ones. Finally, GeoBurst also features an updating module that finds new pivots at little time cost when the query window shifts. As such, GeoBurst is capable of monitoring continuous streams in real time. We used crowdsourcing to evaluate GeoBurst on two real-life datasets that contain millions of geo-tagged tweets. The results demonstrate that GeoBurst significantly outperforms state-of-the-art methods in precision and is orders of magnitude faster.

138 citations
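
The spatiotemporal burstiness test can be illustrated with a simple z-score of current activity against a cell's history; this is only a stand-in for GeoBurst's authority-based candidate ranking.

```python
import numpy as np

def burstiness(current_count, history):
    """z-score of the current window's activity for one (region, topic) cell
    against its historical mean -- a simple stand-in for GeoBurst's
    spatiotemporal ranking of candidate events."""
    mu, sigma = np.mean(history), np.std(history) + 1e-9
    return (current_count - mu) / sigma

history = [4, 6, 5, 7, 5, 6]         # tweets per past window near a location
print(burstiness(42, history))       # large z-score => candidate local event
```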


Proceedings ArticleDOI
07 Jul 2016
TL;DR: This paper approaches seamless multimodal hashing by proposing a novel Composite Correlation Quantization (CCQ) model, which jointly finds correlation-maximal mappings that transform different modalities into an isomorphic latent space, and learns composite quantizers that convert the isomorphic latent features into compact binary codes.
Abstract: Efficient similarity retrieval from large-scale multimodal databases is pervasive in modern search engines and social networks. To support queries across content modalities, the system should enable cross-modal correlation and computation-efficient indexing. While hashing methods have shown great potential in achieving this goal, current attempts generally fail to learn isomorphic hash codes in a seamless scheme; that is, they embed multiple modalities into a continuous isomorphic space and separately threshold the embeddings into binary codes, which incurs a substantial loss of retrieval accuracy. In this paper, we approach seamless multimodal hashing by proposing a novel Composite Correlation Quantization (CCQ) model. Specifically, CCQ jointly finds correlation-maximal mappings that transform different modalities into an isomorphic latent space, and learns composite quantizers that convert the isomorphic latent features into compact binary codes. An optimization framework is devised to preserve both intra-modal similarity and inter-modal correlation by minimizing both reconstruction and quantization errors, and it can be trained from both paired and partially paired data in linear time. A comprehensive set of experiments clearly shows the superior effectiveness and efficiency of CCQ against state-of-the-art hashing methods for both unimodal and cross-modal retrieval.
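
To make the criticism concrete, here is the two-stage scheme the paper argues against: map modalities into a shared continuous space (CCA here), then threshold each embedding separately into bits. CCQ's contribution is to replace the second step with jointly learned composite quantizers. The data and dimensions below are synthetic.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_img = rng.normal(size=(200, 50))            # image features
X_txt = X_img @ rng.normal(size=(50, 30)) + 0.1 * rng.normal(size=(200, 30))

# Stage 1: map both modalities into a shared continuous latent space.
cca = CCA(n_components=16).fit(X_img, X_txt)
Z_img, Z_txt = cca.transform(X_img, X_txt)

# Stage 2: threshold each embedding separately into binary codes --
# exactly the quantization step CCQ replaces with jointly learned quantizers.
B_img, B_txt = np.sign(Z_img), np.sign(Z_txt)
```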

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This paper examines the logs of a commercial search engine's mobile interface, and compares the spoken queries to the typed-in queries, placing special emphasis on the semantic and syntactic characteristics of the two types of queries.
Abstract: The growing popularity of mobile search and the advancement in voice recognition technologies have opened the door for web search users to speak their queries rather than type them. While this kind of voice search is still in its infancy, it is gradually becoming more widespread. In this paper, we examine the logs of a commercial search engine's mobile interface and compare the spoken queries to the typed-in queries. We place special emphasis on the semantic and syntactic characteristics of the two types of queries. Our analysis suggests that voice queries focus more on audio-visual content and question answering, and less on social networking and adult domains. We also conduct an empirical evaluation showing that the language of voice queries is closer to natural language than that of typed queries. Our analysis reveals further differences between voice and text search, which have implications for the design of future voice-enabled search tools.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This paper proposes an automatic method to predict user satisfaction with intelligent assistants that exploits all the interaction signals, including voice commands and physical touch gestures on the device, and finds that interaction signals that capture the user reading patterns have a high impact.
Abstract: There is rapid growth in the use of voice-controlled intelligent personal assistants on mobile devices, such as Microsoft's Cortana, Google Now, and Apple's Siri. They significantly change the way users interact with search systems, not only because of the use of voice control and touch gestures, but also due to the dialogue-style nature of the interactions and their ability to preserve context across different queries. Predicting the success and failure of such search dialogues is a new problem, and an important one for evaluating and further improving intelligent assistants. While clicks in web search have been extensively used to infer user satisfaction, their significance in search dialogues is lower due to the partial replacement of clicks with voice control, direct and voice answers, and touch gestures. In this paper, we propose an automatic method to predict user satisfaction with intelligent assistants that exploits all the interaction signals, including voice commands and physical touch gestures on the device. First, we conduct an extensive user study to measure user satisfaction with intelligent assistants, and simultaneously record all user interactions. Second, we show that the dialogue style of interaction makes it necessary to evaluate the user experience at the overall task level as opposed to the query level. Third, we train a model to predict user satisfaction, and find that interaction signals capturing user reading patterns have a high impact: when including all available interaction signals, we are able to improve the prediction accuracy of user satisfaction from 71% to 81% over a baseline that utilizes only click and query features.
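
A hedged sketch of the prediction step: train a standard classifier on per-task interaction features and inspect feature importances. The feature set and synthetic data are invented for illustration; the paper's features and labels come from its user study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical per-task interaction features: [n_queries, n_clicks,
# n_voice_commands, n_touch_scrolls, est_reading_time_s].
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = (X[:, 4] + 0.5 * X[:, 1] + 0.2 * rng.random(500) > 0.9).astype(int)

clf = GradientBoostingClassifier().fit(X, y)
print(clf.feature_importances_)   # reading-time-style signals dominate here
```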

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This work proposes a new collective, graph-based disambiguation algorithm utilizing semantic entity and document embeddings for robust entity disambiguation, and indicates that the algorithm achieves better results than non-publicly available state-of-the-art algorithms.
Abstract: Entity disambiguation is the task of mapping ambiguous terms in natural-language text to their corresponding entities in a knowledge base. It finds application in the extraction of structured data in RDF (Resource Description Framework) from textual documents, but equally in facilitating artificial intelligence applications such as semantic search, reasoning, and question answering. We propose a new collective, graph-based disambiguation algorithm utilizing semantic entity and document embeddings for robust entity disambiguation. Robust here refers to the property of achieving better-than-state-of-the-art results over a wide range of very different datasets. Our approach is also able to abstain if no appropriate entity can be found for a specific surface form. Our evaluation shows that our approach achieves significantly (>5%) better results than all other publicly available disambiguation algorithms on 7 of 9 datasets without dataset-specific tuning. Moreover, we discuss the influence of the quality of the knowledge base on disambiguation accuracy and indicate that our algorithm achieves better results than non-publicly available state-of-the-art algorithms.
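
A minimal sketch of embedding-based candidate scoring with the abstain behavior the abstract mentions; the paper scores candidates collectively over a mention graph rather than one mention at a time, and the threshold `tau` is an assumption.

```python
import numpy as np

def disambiguate(mention_ctx_vec, candidates, tau=0.4):
    """Pick the candidate entity whose embedding is most similar to the
    document/context embedding; abstain (return None) below threshold tau."""
    best, best_sim = None, tau
    for entity, vec in candidates.items():
        sim = vec @ mention_ctx_vec / (np.linalg.norm(vec) * np.linalg.norm(mention_ctx_vec))
        if sim > best_sim:
            best, best_sim = entity, sim
    return best

cands = {"Paris": np.array([0.9, 0.1]), "Paris_Hilton": np.array([0.1, 0.9])}
print(disambiguate(np.array([0.8, 0.2]), cands))
```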

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This paper develops a collaborative contextual bandit algorithm, in which the adjacency graph among users is leveraged to share context and payoffs among neighboring users while online updating, and rigorously proves an improved upper regret bound.
Abstract: Contextual bandit algorithms provide principled online learning solutions for finding optimal trade-offs between exploration and exploitation with companion side-information. They have been extensively used in many important practical scenarios, such as display advertising and content recommendation. A common practice is to estimate the unknown bandit parameters pertaining to each user independently. This unfortunately ignores dependency among users and thus leads to suboptimal solutions, especially for applications that have strong social components. In this paper, we develop a collaborative contextual bandit algorithm in which the adjacency graph among users is leveraged to share context and payoffs among neighboring users during online updating. We rigorously prove an improved upper regret bound for the proposed collaborative bandit algorithm compared to conventional independent bandit algorithms. Extensive experiments on both synthetic and three large-scale real-world datasets verify the improvement of our proposed algorithm over several state-of-the-art contextual bandit algorithms.
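
A toy sketch of the information sharing: a user's parameter estimate borrows from graph neighbors. This simple averaging stands in for the paper's joint regression over the adjacency graph.

```python
import numpy as np

def collaborative_theta(user, thetas, adj):
    """Share information across the user graph: estimate a user's bandit
    parameters as the average of their own and their neighbors' estimates
    (a simplification of the paper's graph-coupled estimator)."""
    neighbors = [user] + list(adj[user])
    return np.mean([thetas[v] for v in neighbors], axis=0)

thetas = {0: np.array([0.2, 0.8]), 1: np.array([0.3, 0.7]), 2: np.array([0.9, 0.1])}
adj = {0: [1], 1: [0, 2], 2: [1]}
print(collaborative_theta(1, thetas, adj))
```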

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This work addresses the ad hoc document retrieval task by devising novel types of entity-based language models that utilize information about single terms in the query and documents as well as term sequences marked as entities by some entity-linking tool.
Abstract: We address the ad hoc document retrieval task by devising novel types of entity-based language models. The models utilize information about single terms in the query and documents as well as term sequences marked as entities by some entity-linking tool. The key principle of the language models is accounting, simultaneously, for the uncertainty inherent in the entity-markup process and the balance between using entity-based and term-based information. Empirical evaluation demonstrates the merits of using the language models for retrieval. For example, the performance transcends that of a state-of-the-art term proximity method. We also show that the language models can be effectively used for cluster-based document retrieval and query expansion.
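
A minimal sketch of the balance the abstract describes: interpolate a term-based and an entity-based query likelihood, with `lam` controlling the balance. Add-one smoothing here stands in for the paper's smoothed language models.

```python
import math

def entity_lm_score(query_terms, query_entities, doc, lam=0.7):
    """log P(q|d) interpolating a term language model with an entity one.
    `doc` holds token and entity-mention lists; add-one smoothing is a
    simple stand-in for the paper's smoothed language models."""
    def lm(units, bag, vocab):
        return sum(math.log((bag.count(u) + 1) / (len(bag) + vocab)) for u in units)
    term_lp = lm(query_terms, doc["tokens"], vocab=10_000)
    ent_lp = lm(query_entities, doc["entities"], vocab=1_000)
    return lam * term_lp + (1 - lam) * ent_lp

doc = {"tokens": ["obama", "visits", "paris"], "entities": ["Barack_Obama", "Paris"]}
print(entity_lm_score(["paris"], ["Paris"], doc))
```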

Proceedings ArticleDOI
07 Jul 2016
TL;DR: It is shown how to augment any TAR method to achieve guaranteed reliability, for a quantifiable level of additional review effort, and argued that optimizing reliability according to the traditional goal-post method is inconsistent with certain subjective aspects of quality, and that optimizing a Taguchi quality loss function may be more apt.
Abstract: The objective of technology-assisted review ("TAR") is to find as much relevant information as possible with reasonable effort. Quality is a measure of the extent to which a TAR method achieves this objective, while reliability is a measure of how consistently it achieves an acceptable result. We are concerned with how to define, measure, and achieve high quality and high reliability in TAR. When quality is defined using the traditional goal-post method of specifying a minimum acceptable recall threshold, the quality and reliability of a TAR method are both, by definition, equal to the probability of achieving the threshold. Assuming this definition of quality and reliability, we show how to augment any TAR method to achieve guaranteed reliability, for a quantifiable level of additional review effort. We demonstrate this result by augmenting the TAR method supplied as the baseline model implementation for the TREC 2015 Total Recall Track, measuring reliability and effort for 555 topics from eight test collections. While our empirical results corroborate our claim of guaranteed reliability, we observe that the augmentation strategy may entail disproportionate effort, especially when the number of relevant documents is low. To address this limitation, we propose stopping criteria for the model implementation that may be applied with no additional review effort, while achieving empirical reliability that compares favorably to the provably reliable method. We further argue that optimizing reliability according to the traditional goal-post method is inconsistent with certain subjective aspects of quality, and that optimizing a Taguchi quality loss function may be more apt.
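
The contrast between the two quality notions is easy to state in code: the goal-post method scores recall as pass/fail at a threshold, while a Taguchi-style loss penalizes any shortfall quadratically. The exact constants below are illustrative, not the paper's.

```python
def goalpost_quality(recall, threshold=0.75):
    """Traditional goal-post: all-or-nothing at the recall threshold."""
    return 1.0 if recall >= threshold else 0.0

def taguchi_loss(recall, k=1.0):
    """Taguchi-style quality loss: cost grows quadratically with the
    shortfall from perfect recall (the constant k is illustrative)."""
    return k * (1.0 - recall) ** 2

for r in (0.74, 0.76, 0.95):
    print(r, goalpost_quality(r), taguchi_loss(r))
```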

Proceedings ArticleDOI
07 Jul 2016
TL;DR: It is shown that external assessors are capable of annotating usefulness when provided with more search context information and it is suggested that a usefulness-based evaluation method can be defined to better reflect the quality of search systems perceived by the users.
Abstract: Relevance is a fundamental concept in information retrieval (IR) studies. It is, however, often observed that relevance as annotated by secondary assessors does not necessarily correspond to the usefulness and satisfaction perceived by users. In this study, we confirm this difference through a laboratory study in which we collect relevance annotations from external assessors, and usefulness and satisfaction information from users, for a set of search tasks. We also find that a measure based on usefulness rather than annotated relevance correlates better with user satisfaction. We further show that external assessors are capable of annotating usefulness when provided with more search context information, and that it is possible to automatically generate usefulness labels when some training data is available. Our findings explain why traditional system-centric evaluation metrics are not well aligned with user satisfaction, and suggest that a usefulness-based evaluation method can be defined to better reflect the quality of search systems as perceived by users.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: The first test collection for personal lifelog data is introduced, which has been employed for the NTCIR12-Lifelog task; the requirements for the test collection are motivated, the process of creating the test collection is described, and suggestions are given for possible applications.
Abstract: Test collections have a long history of supporting repeatable and comparable evaluation in Information Retrieval (IR). However, thus far, no shared test collection exists for IR systems that are designed to index and retrieve multimodal lifelog data. In this paper we introduce the first test collection for personal lifelog data, which has been employed for the NTCIR12-Lifelog task. We motivate the requirements for the test collection, describe the process of creating it, and give an overview of its contents. Finally, we offer suggestions for possible applications of the test collection.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: The UQV100 test collection is described, designed to incorporate variability from users and unlock new opportunities for novel investigations and analysis, including for problems such as task-intent retrieval performance and consistency, query clustering, query difficulty prediction, and relevance feedback, among others.
Abstract: We describe the UQV100 test collection, designed to incorporate variability from users. Information need "backstories" were written for 100 topics (or sub-topics) from the TREC 2013 and 2014 Web Tracks. Crowd workers were asked to read the backstories and provide the queries they would use, plus effort estimates of how many useful documents they would have to read to satisfy the need. A total of 10,835 queries were collected from 263 workers. After normalization and spell-correction, 5,764 unique variations remained; these were then used to construct a document pool via Indri-BM25 over the ClueWeb12-B corpus. Qualified crowd workers made relevance judgments relative to the backstories, using a relevance scale similar to the original TREC approach; first to a pool depth of ten per query, then deeper on a set of targeted documents. The backstories, query variations, normalized and spell-corrected queries, effort estimates, run outputs, and relevance judgments are made available collectively as the UQV100 test collection. We also make available the judging guidelines and the gold hits we used for crowd-worker qualification and spam detection. We believe this test collection will unlock new opportunities for novel investigations and analysis, including for problems such as task-intent retrieval performance and consistency (independent of query variation), query clustering, query difficulty prediction, and relevance feedback, among others.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This work revisits different phases in the KBQA process and demonstrates that text resources improve question interpretation, candidate generation and ranking, and introduces a new system, Text2KB, that enriches question answering over a knowledge base by using external text data.
Abstract: One of the major challenges for automated question answering over Knowledge Bases (KBQA) is translating a natural language question to the Knowledge Base (KB) entities and predicates. Previous systems have used a limited amount of training data to learn a lexicon that is later used for question answering. This approach does not make use of other potentially relevant text data, outside the KB, which could supplement the available information. We introduce a new system, Text2KB, that enriches question answering over a knowledge base by using external text data. Specifically, we revisit different phases in the KBQA process and demonstrate that text resources improve question interpretation, candidate generation and ranking. Building on a state-of-the-art traditional KBQA system, Text2KB utilizes web search results, community question answering and general text document collection data, to detect question topic entities, map question phrases to KB predicates, and to enrich the features of the candidates derived from the KB. Text2KB significantly improves performance over the baseline KBQA method, as measured on a popular WebQuestions dataset. The results and insights developed in this work can guide future efforts on combining textual and structured KB data for question answering.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This work presents a new oversampling method specifically designed for classifying data for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data.
Abstract: The accuracy of many classification algorithms is known to suffer when the data are imbalanced (i.e., when the distribution of the examples across the classes is severely skewed). Many applications of binary text classification are of this type, with the positive examples of the class of interest far outnumbered by the negative examples. Oversampling (i.e., generating synthetic training examples of the minority class) is an often used strategy to counter this problem. We present a new oversampling method specifically designed for classifying data (such as text) for which the distributional hypothesis holds, according to which the meaning of a feature is somehow determined by its distribution in large corpora of data. Our Distributional Random Oversampling method generates new random minority-class synthetic documents by exploiting the distributional properties of the terms in the collection. We discuss results we have obtained on the Reuters-21578, OHSUMED-S, and RCV1-v2 datasets.
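
A hedged sketch of distributional oversampling: each term of a minority-class document is resampled from that term's distributional profile, yielding a synthetic document. The profile here is given by hand; the paper derives it from co-occurrence statistics in large corpora.

```python
import numpy as np

def dro_synthetic_doc(doc, profile, rng):
    """Generate a synthetic minority-class document by resampling each
    term from its distributional profile (a map from a term to related
    terms with probabilities)."""
    out = []
    for term in doc:
        words, probs = profile.get(term, ([term], [1.0]))
        out.append(rng.choice(words, p=probs))
    return out

profile = {"loan": (["loan", "credit", "mortgage"], [0.6, 0.25, 0.15])}
rng = np.random.default_rng(0)
print(dro_synthetic_doc(["loan", "application"], profile, rng))
```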

Journal ArticleDOI
27 Jun 2016
TL;DR: This paper discusses, summarizes, and adapts the main findings of the Dagstuhl seminar to the context of IR evaluation -- both system-oriented and user-oriented -- in order to raise awareness in the community and stimulate the field towards an increased reproducibility of experiments.
Abstract: The Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science", held on 24-29 January 2016, focused on the core issues and approaches to reproducibility of experiments from a multidisciplinary point of view, sharing experience from several fields of computer science. In this paper, we discuss, summarize, and adapt the main findings of the seminar to the context of IR evaluation -- both system-oriented and user-oriented -- in order to raise awareness in our community and stimulate the field towards an increased reproducibility of our experiments.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This work learns namespace definitions by clustering the MLP results and mapping those clusters to subject classification schemata, and discovers that identifier namespaces improve the performance of automated identifier-definition extraction, and elevate it to a level that cannot be achieved within the document context alone.
Abstract: Mathematical formulae are essential in science, but face challenges of ambiguity, due to the use of a small number of identifiers to represent an immense number of concepts. Corresponding to word sense disambiguation in Natural Language Processing, we disambiguate mathematical identifiers. By regarding formulae and natural text as one monolithic information source, we are able to extract the semantics of identifiers in a process we term Mathematical Language Processing (MLP). As scientific communities tend to establish standard (identifier) notations, we use the document domain to infer the actual meaning of an identifier. Therefore, we adapt the software development concept of namespaces to mathematical notation. Thus, we learn namespace definitions by clustering the MLP results and mapping those clusters to subject classification schemata. In addition, this gives fundamental insights into the usage of mathematical notations in science, technology, engineering and mathematics. Our gold standard based evaluation shows that MLP extracts relevant identifier-definitions. Moreover, we discover that identifier namespaces improve the performance of automated identifier-definition extraction, and elevate it to a level that cannot be achieved within the document context alone.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: This tutorial summarizes and unifies the emerging body of methods on counterfactual evaluation and learning that provide a well-founded way to evaluate and optimize online metrics by exploiting logs of past user interactions.
Abstract: Online metrics measured through A/B tests have become the gold standard for many evaluation questions. But can we get the same results as A/B tests without actually fielding a new system? And can we train systems to optimize online metrics without subjecting users to an online learning algorithm? This tutorial summarizes and unifies the emerging body of methods on counterfactual evaluation and learning. These counterfactual techniques provide a well-founded way to evaluate and optimize online metrics by exploiting logs of past user interactions. In particular, the tutorial unifies the causal inference, information retrieval, and machine learning view of this problem, providing the basis for future research in this emerging area of great potential impact. Supplementary material and resources are available online at http://www.cs.cornell.edu/~adith/CfactSIGIR2016.
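
The central estimator in this line of work is inverse propensity scoring (IPS): reweight logged rewards by how much more (or less) likely the new policy is to take the logged action. A minimal sketch with invented log data:

```python
import numpy as np

def ips_estimate(rewards, logged_probs, new_policy_probs):
    """Inverse propensity scoring: estimate the online metric a new policy
    would achieve, using only logs from the deployed (logging) policy."""
    return np.mean(rewards * new_policy_probs / logged_probs)

# Logged interactions: reward observed, probability the logging policy had
# of taking that action, and the new policy's probability for it.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
p_log = np.array([0.5, 0.3, 0.2, 0.5])
p_new = np.array([0.9, 0.1, 0.4, 0.2])
print(ips_estimate(rewards, p_log, p_new))
```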

Journal ArticleDOI
29 Jan 2016
TL;DR: The SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) took place on Thursday, August 13, 2015 in Santiago, Chile; one of its goals was to provide a venue through which the authors of open source search engines could compare the performance of indexing and searching on the same collections and on the same machines.
Abstract: The SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR) took place on Thursday, August 13, 2015 in Santiago, Chile. The goal of the workshop was twofold. The first was to provide a venue for the publication and presentation of negative results. The second was to provide a venue through which the authors of open source search engines could compare the performance of indexing and searching on the same collections and on the same machines, encouraging the sharing of ideas and discoveries in a like-for-like environment. In total, three papers were presented and seven systems participated.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: Tweet2Vec, as discussed by the authors, uses a character-level CNN-LSTM encoder-decoder to generate vector representations of English-language tweets; the method can also be used to learn tweet embeddings for other languages.
Abstract: We present Tweet2Vec, a novel method for generating general-purpose vector representations of tweets. The model learns tweet embeddings using a character-level CNN-LSTM encoder-decoder. We trained our model on 3 million randomly selected English-language tweets. The model was evaluated using two methods, tweet semantic similarity and tweet sentiment categorization, and outperforms the previous state-of-the-art in both tasks. The evaluations demonstrate the power of the tweet embeddings generated by our model for various tweet categorization tasks. The vector representations generated by our model are generic and hence can be applied to a variety of tasks. Though the model presented in this paper is trained on English-language tweets, the method can be used to learn tweet embeddings for other languages.
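
A hedged PyTorch sketch of the encoder half of such a model: character embeddings, a convolution, and an LSTM whose final state serves as the tweet embedding. Sizes are illustrative, and the decoder used for training is omitted.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Sketch of a character-level CNN-LSTM tweet encoder (the decoder of
    an encoder-decoder trained to reconstruct tweets is omitted)."""
    def __init__(self, n_chars=128, emb=16, conv=64, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, conv, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv, hidden, batch_first=True)

    def forward(self, char_ids):                     # (batch, seq_len)
        x = self.emb(char_ids).transpose(1, 2)       # (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2) # (batch, seq_len, conv)
        _, (h, _) = self.lstm(x)
        return h[-1]                                 # tweet embedding (batch, hidden)

enc = CharEncoder()
vec = enc(torch.randint(0, 128, (2, 40)))            # two 40-character tweets
```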

Journal ArticleDOI
27 Jun 2016
TL;DR: A new online evaluation paradigm called multileaving, which extends interleaving, is introduced; it means that fewer users need to be exposed to possibly inferior search engines, as systems adapt more quickly to changes in user preferences.
Abstract: More than half the world's population uses web search engines, resulting in over half a billion queries every single day. For many people, web search engines such as Baidu, Bing, Google, and Yandex are among the first resources they go to when a question arises. Moreover, for many, search engines have become the most trusted route to information, more so even than traditional media such as newspapers, news websites, or news channels on television. What web search engines present people with greatly influences what they believe to be true, and consequently it influences their thoughts, opinions, decisions, and the actions they take. With this in mind, two things are important from an information retrieval research perspective: first, it is important to understand how well search engines (rankers) perform, and second, this knowledge should be used to improve them. This thesis is about these two topics: evaluation of search engines and learning search engines. In the first part of this thesis we investigate how user interactions with search engines can be used to evaluate search engines. In particular, we introduce a new online evaluation paradigm called multileaving that extends upon interleaving. With multileaving, many rankers can be compared at once by combining document lists from these rankers into a single result list and attributing user interactions with this list to the rankers. We then investigate the relation between A/B testing and interleaved comparison methods. Both studies lead to much higher sensitivity of the evaluation methods, meaning that fewer user interactions are required to arrive at reliable conclusions. This has the important implication that fewer users need to be exposed to the results of possibly inferior search engines. In the second part of this thesis we turn to online learning to rank, building on the evaluation methods introduced and extended in the first part. We learn the parameters of base rankers from user interactions, and then use the multileaving methods as feedback in our learning method, leading to much faster convergence than existing methods. Again, the important implication is that fewer users need to be exposed to possibly inferior search engines, as these adapt more quickly to changes in user preferences. The last part of this thesis is of a different nature than the earlier two parts: as opposed to the earlier chapters, we no longer study algorithms. Progress in information retrieval research has always been driven by a combination of algorithms, shared resources, and evaluation; in the last part we focus on the latter two. We introduce a new shared resource and a new evaluation paradigm. First, we propose Lerot, an online evaluation framework that allows us to simulate users interacting with a search engine. Our implementation has been released as open source software and is currently being used by researchers around the world. Second, we introduce OpenSearch, a new evaluation paradigm involving real users of real search engines, and describe an implementation of this paradigm that has already been widely adopted by the research community through challenges at CLEF and TREC.
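
A team-draft-style multileave is easy to sketch: rankers take turns (in random order) contributing their next unused document, and clicks are credited to the contributing ranker. A minimal version, not the thesis's exact algorithm:

```python
import random

def team_draft_multileave(rankings, length):
    """Combine several rankers' lists into one result list; clicks on a
    document are credited to the ranker that contributed it."""
    combined, credit, used = [], {}, set()
    while len(combined) < length:
        added = False
        for r in random.sample(range(len(rankings)), len(rankings)):
            doc = next((d for d in rankings[r] if d not in used), None)
            if doc is not None:
                combined.append(doc)
                used.add(doc)
                credit[doc] = r
                added = True
            if len(combined) == length:
                break
        if not added:                 # all rankers exhausted
            break
    return combined, credit

random.seed(0)
lists = [["a", "b", "c"], ["b", "d", "a"], ["c", "a", "e"]]
print(team_draft_multileave(lists, 4))
```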

Proceedings ArticleDOI
07 Jul 2016
TL;DR: Why it is believed that the time is right for an intermediate-level tutorial on online learning to rank, the objectives of the proposed tutorial, its relevance, as well as more practical details, such as format, schedule and support materials.
Abstract: During the past 10--15 years offline learning to rank has had a tremendous influence on information retrieval, both scientifically and in practice. Recently, as the limitations of offline learning to rank for information retrieval have become apparent, there is increased attention for online learning to rank methods for information retrieval in the community. Such methods learn from user interactions rather than from a set of labeled data that is fully available for training up front. Below we describe why we believe that the time is right for an intermediate-level tutorial on online learning to rank, the objectives of the proposed tutorial, its relevance, as well as more practical details, such as format, schedule and support materials.

Proceedings ArticleDOI
Tetsuya Sakai
07 Jul 2016
TL;DR: A systematic review of 840 SIGIR full papers and 215 TOIS papers published between 2006 and 2015 reports on how IR researchers (fail to) report on significance test results, what types of tests they use, and how the reporting practices may have changed over the last decade.
Abstract: We conducted a systematic review of 840 SIGIR full papers and 215 TOIS papers published between 2006 and 2015. The original objective of the study was to identify IR effectiveness experiments that are seriously underpowered (i.e., the sample size is far too small so that the probability of missing a real difference is extremely high) or overpowered (i.e., the sample size is so large that a difference will be considered statistically significant even if the actual effect size is extremely small). However, it quickly became clear to us that many IR effectiveness papers either lack significance testing or fail to report p-values and/or test statistics, which prevents us from conducting power analysis. Hence we first report on how IR researchers (fail to) report on significance test results, what types of tests they use, and how the reporting practices may have changed over the last decade. From those papers that reported enough information for us to conduct power analysis, we identify extremely overpowered and underpowered experiments, as well as appropriate sample sizes for future experiments. The raw results of our systematic survey of 1,055 papers and our R scripts for power analysis are available online. Our hope is that this study will help improve the reporting practices and experimental designs of future IR effectiveness studies.
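
As a pointer to the kind of computation involved, prospective power analysis takes a single statsmodels call; the effect size, alpha, and power targets below are conventional example values, not the paper's recommendations.

```python
import math
from statsmodels.stats.power import TTestPower

# Topics needed for a paired t-test to detect a "medium" standardized
# effect (d = 0.5) at alpha = 0.05 with 80% power.
n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(math.ceil(n))   # about 34 topics
```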