
Showing papers in "ACM Transactions on Information Systems in 2015"


Journal ArticleDOI
TL;DR: To speed up the process of producing the top-k recommendations from large-scale social media data, an efficient query-processing technique is developed to support the proposed temporal context-aware recommender system (TCARS), and an item-weighting scheme is proposed to enable the underlying models to favor items that better represent topics related to user interests and topics related to temporal context.
Abstract: Social media provides valuable resources to analyze user behaviors and capture user preferences. This article focuses on analyzing user behaviors in social media systems and designing a latent class statistical mixture model, named temporal context-aware mixture model (TCAM), to account for the intentions and preferences behind user behaviors. Based on the observation that the behaviors of a user in social media systems are generally influenced by intrinsic interest as well as the temporal context (e.g., the public's attention at that time), TCAM simultaneously models the topics related to users' intrinsic interests and the topics related to temporal context and then combines the influences from the two factors to model user behaviors in a unified way. Considering that users' interests are not always stable and may change over time, we extend TCAM to a dynamic temporal context-aware mixture model (DTCAM) to capture users' changing interests. To alleviate the problem of data sparsity, we exploit the social and temporal correlation information by integrating a social-temporal regularization framework into the DTCAM model. To further improve the performance of our proposed models (TCAM and DTCAM), an item-weighting scheme is proposed to enable them to favor items that better represent topics related to user interests and topics related to temporal context, respectively. Based on our proposed models, we design a temporal context-aware recommender system (TCARS). To speed up the process of producing the top-k recommendations from large-scale social media data, we develop an efficient query-processing technique to support TCARS. Extensive experiments have been conducted to evaluate the performance of our models on four real-world datasets crawled from different social media sites. The experimental results demonstrate the superiority of our models, compared with the state-of-the-art competitor methods, by modeling user behaviors more precisely and making more effective and efficient recommendations.

150 citations
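
A minimal Python sketch of the two-factor scoring step at the heart of TCAM, assuming the topic distributions have already been estimated; the array shapes, the function name, and the fixed mixing weight lam are illustrative assumptions, not the paper's notation.

import numpy as np

def tcam_score(p_item_interest, p_item_temporal,
               user_topics, time_topics, lam=0.5):
    """Score items by mixing intrinsic-interest topics with temporal-context
    topics, in the spirit of TCAM's latent class mixture.

    p_item_interest, p_item_temporal: (n_topics, n_items) arrays of P(item | topic)
    user_topics: (n_topics,) P(topic | user), the intrinsic-interest factor
    time_topics: (n_topics,) P(topic | time), the public-attention factor
    """
    interest = user_topics @ p_item_interest    # P(item | user interests)
    temporal = time_topics @ p_item_temporal    # P(item | temporal context)
    return lam * interest + (1.0 - lam) * temporal

# Top-k recommendation is then a ranking by the mixed score:
# top_k = np.argsort(-scores)[:k]

Ranking all items by this mixed score yields the top-k list; the paper's query-processing technique concerns computing that ranking efficiently at scale.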


Journal ArticleDOI
TL;DR: The results show that similar patterns of user activity are observed at both the cognitive and page use levels, and activity patterns are able to distinguish between task types in similar ways and between tasks of different levels of difficulty.
Abstract: Personalization of support for information seeking depends crucially on the information retrieval system's knowledge of the task that led the person to engage in information seeking. Users work during information search sessions to satisfy their task goals, and their activity is not random. To what degree are there patterns in the user activity during information search sessions? Do activity patterns reflect the user's situation as the user moves through the search task under the influence of his or her task goal? Do these patterns reflect aspects of different types of information-seeking tasks? Could such activity patterns identify contexts within which information seeking takes place? To investigate these questions, we model sequences of user behaviors in two independent user studies of information search sessions (N = 32 users, 128 sessions, and N = 40 users, 160 sessions). Two representations of user activity patterns are used. One is based on the sequences of page use; the other is based on a cognitive representation of information acquisition derived from eye movement patterns in service of the reading process. One of the user studies considered journalism work tasks; the other concerned background research in genomics using search tasks taken from the TREC Genomics Track. The search tasks differed in basic dimensions of complexity, specificity, and the type of information product (intellectual or factual) needed to achieve the overall task goal. The results show that similar patterns of user activity are observed at both the cognitive and page use levels. The activity patterns at both representation layers are able to distinguish between task types in similar ways and, to some degree, between tasks of different levels of difficulty. We explore relationships between the results and task difficulty and discuss the use of activity patterns to explore events within a search session. User activity patterns can be at least partially observed in server-side search logs. A focus on patterns of user activity sequences may contribute to the development of information systems that better personalize the user's search experience.

98 citations


Journal ArticleDOI
Quan Yuan, Gao Cong, Kaiqi Zhao, Zongyang Ma, Aixin Sun
TL;DR: Experimental results on two real-world datasets show that the proposed model is effective in discovering users’ spatial-temporal topics and significantly outperforms state-of-the-art baselines for various tasks including location prediction for tweets and requirement-aware location recommendation.
Abstract: Micro-blogging services and location-based social networks, such as Twitter, Weibo, and Foursquare, enable users to post short messages with timestamps and geographical annotations. The rich spatial-temporal-semantic information of individuals embedded in these geo-annotated short messages provides an exciting opportunity to develop many context-aware applications in ubiquitous computing environments. Example applications include contextual recommendation and contextual search. To obtain accurate recommendations and the most relevant search results, it is important to capture users’ contextual information (e.g., time and location) and to understand users’ topical interests and intentions. While time and location can be readily captured by smartphones, understanding users’ interests and intentions calls for effective methods in modeling user mobility behavior. Here, user mobility refers to who visits which place at what time for what activity. That is, user mobility behavior modeling must consider user (Who), spatial (Where), temporal (When), and activity (What) aspects. Unfortunately, no previous studies on user mobility behavior modeling have considered all of the four aspects jointly, which have complex interdependencies. In our preliminary study, we propose the first solution named W4 (short for Who, Where, When, and What) to discover user mobility behavior from the four aspects. In this article, we further enhance W4 and propose a nonparametric Bayesian model named EW4 (short for Enhanced W4). EW4 requires no parameter tuning and achieves better results than W4 in our experiments. Given some of the four aspects of a user (e.g., time), our model is able to infer information of the other aspects (e.g., location and topical words). Thus, our model has a variety of context-aware applications, particularly in contextual search and recommendation. Experimental results on two real-world datasets show that the proposed model is effective in discovering users’ spatial-temporal topics. The model also significantly outperforms state-of-the-art baselines for various tasks including location prediction for tweets and requirement-aware location recommendation.

84 citations


Journal ArticleDOI
TL;DR: How context and mobility influence people's motivations to meet new people is outlined and innovative design concepts for mediating mobile encounters through context-aware social matching systems are presented.
Abstract: Mobile social matching systems have the potential to transform the way we make new social ties, but only if we are able to overcome the many challenges that exist as to how systems can utilize contextual data to recommend interesting and relevant people to users and facilitate valuable encounters between strangers. This article outlines how context and mobility influence people's motivations to meet new people and presents innovative design concepts for mediating mobile encounters through context-aware social matching systems. Findings from two studies are presented. The first, a survey study (n = 117), explored the concept of contextual rarity of shared user attributes as a measure to improve desirability in mobile social matches. The second, an interview study (n = 58), explored people's motivations to meet others in various contexts. From these studies we derived a set of novel context-aware social matching concepts, including contextual sociability and familiarity as an indicator of opportune social context; contextual engagement as an indicator of opportune personal context; and contextual rarity, oddity, and activity partnering as an indicator of opportune relational context. The findings of these studies establish the importance of different contextual factors and frame the design space of context-aware social matching systems.

70 citations


Journal ArticleDOI
TL;DR: The goal in the present article is to structure TBII on the basis of the five generic activities, to consider the evaluation of each activity using the program theory framework, and to combine these activity-based program theories into an overall evaluation framework for TBII.
Abstract: Evaluation is central in research and development of information retrieval (IR). In addition to designing and implementing new retrieval mechanisms, one must also show through rigorous evaluation that they are effective. A major focus in IR is IR mechanisms’ capability of ranking relevant documents optimally for the users, given a query. Searching for information in practice involves searchers, however, and is highly interactive. When human searchers have been incorporated in evaluation studies, the results have often suggested that better ranking does not necessarily lead to better search task, or work task, performance. Therefore, it is not clear which system or interface features should be developed to improve the effectiveness of human task performance. In the present article, we focus on the evaluation of task-based information interaction (TBII). We give special emphasis to learning tasks to discuss TBII in more concrete terms. Information interaction is here understood as behavioral and cognitive activities related to task planning, searching information items, selecting between them, working with them, and synthesizing and reporting. These five generic activities contribute to task performance and outcome and can be supported by information systems. In an attempt toward task-based evaluation, we introduce program theory as the evaluation framework. Such evaluation can investigate whether a program consisting of TBII activities and tools works and how it works and, further, provides a causal description of program (in)effectiveness. Our goal in the present article is to structure TBII on the basis of the five generic activities and consider the evaluation of each activity using the program theory framework. Finally, we combine these activity-based program theories in an overall evaluation framework for TBII. Such an evaluation is complex due to the large number of factors affecting information interaction. Instead of presenting tested program theories, we illustrate how the evaluation of TBII should be accomplished using the program theory framework in the evaluation of systems and behaviors, and their interactions, comprehensively in context.

50 citations


Journal ArticleDOI
TL;DR: This article investigates and extends selective search, an approach that partitions the dataset based on document similarity to obtain topic-based shards, and searches only a few shards that are estimated to contain relevant documents for the query.
Abstract: The traditional search solution for large collections divides the collection into subsets (shards), and processes the query against all shards in parallel (exhaustive search). The search cost and the computational requirements of this approach are often prohibitively high for organizations with few computational resources. This article investigates and extends an alternative: selective search, an approach that partitions the dataset based on document similarity to obtain topic-based shards, and searches only a few shards that are estimated to contain relevant documents for the query. We propose shard creation techniques that are scalable, efficient, self-reliant, and create topic-based shards with low variance in size, and high density of relevant documents. The experimental results demonstrate that the effectiveness of selective search is on par with that of exhaustive search, and the corresponding search costs are substantially lower with the former. Also, the majority of the queries perform as well or better with selective search. An oracle experiment that uses optimal shard ranking for a query indicates that selective search can outperform the effectiveness of exhaustive search. Comparison with a query optimization technique shows higher improvements in efficiency with selective search. The overall best efficiency is achieved when the two techniques are combined in an optimized selective search approach.

48 citations
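
A minimal sketch of the selective search pipeline, using plain k-means over tf-idf vectors as a stand-in for the paper's scalable shard-creation techniques, and query-centroid similarity as a stand-in for its shard-ranking estimate; the shard count and function names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def build_shards(docs, n_shards=8):
    """Topic-based partitioning of the collection into shards. Plain k-means
    illustrates the idea; the article's techniques are more scalable and
    keep shard sizes balanced."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    km = KMeans(n_clusters=n_shards, n_init=10).fit(X)
    shards = {c: np.flatnonzero(km.labels_ == c) for c in range(n_shards)}
    return vec, km, shards

def select_shards(query, vec, km, top_shards=2):
    """Rank shards by query-centroid similarity and search only the best
    few, instead of searching all shards exhaustively."""
    q = vec.transform([query]).toarray()
    sims = (q @ km.cluster_centers_.T).ravel()
    return np.argsort(-sims)[:top_shards]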


Journal ArticleDOI
TL;DR: A novel multilingual summarizer that exploits an itemset-based model to summarize collections of documents ranging over the same topic, which makes minimal use of language-dependent analyses and is easily applicable to document collections written in different languages.
Abstract: Multidocument summarization addresses the selection of a compact subset of highly informative sentences, i.e., the summary, from a collection of textual documents. To perform sentence selection, two parallel strategies have been proposed: (a) apply general-purpose techniques relying on data mining or information retrieval techniques, and/or (b) perform advanced linguistic analysis relying on semantics-based models (e.g., ontologies) to capture the actual sentence meaning. Since there is an increasing need for processing documents written in different languages, the attention of the research community has recently focused on summarizers based on strategy (a). This article presents a novel multilingual summarizer, namely MWI-Sum (Multilingual Weighted Itemset-based Summarizer), that exploits an itemset-based model to summarize collections of documents ranging over the same topic. Unlike previous approaches, it extracts frequent weighted itemsets tailored to the analyzed collection and uses them to drive the sentence selection process. Weighted itemsets represent correlations among multiple highly relevant terms that are neglected by previous approaches. The proposed approach makes minimal use of language-dependent analyses. Thus, it is easily applicable to document collections written in different languages. Experiments performed on benchmark and real-life collections, English-written and not, demonstrate that the proposed approach performs better than state-of-the-art multilingual document summarizers.

43 citations
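
The itemset-driven selection can be illustrated with frequent term pairs (2-itemsets) standing in for the weighted itemsets MWI-Sum mines; the support threshold, the frequency-based weights, and the greedy coverage loop below are illustrative assumptions rather than the paper's algorithm.

from collections import Counter
from itertools import combinations

def summarize(sentences, tokenized, max_sents=3, min_support=2):
    """Greedy sentence selection driven by frequent weighted term pairs,
    a simplified stand-in for MWI-Sum's weighted itemsets."""
    pair_counts = Counter()
    for terms in tokenized:
        pair_counts.update(combinations(sorted(set(terms)), 2))
    weights = {p: c for p, c in pair_counts.items() if c >= min_support}

    chosen, covered = [], set()
    for _ in range(max_sents):
        def gain(i):  # weight of still-uncovered pairs this sentence covers
            terms = set(tokenized[i])
            return sum(w for p, w in weights.items()
                       if p not in covered and set(p) <= terms)
        best = max((i for i in range(len(sentences)) if i not in chosen),
                   key=gain, default=None)
        if best is None:
            break
        chosen.append(best)
        covered.update(p for p in weights if set(p) <= set(tokenized[best]))
    return [sentences[i] for i in chosen]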


Journal ArticleDOI
TL;DR: Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, novel SIMD-accelerated integer compression algorithms, are proposed in this article for data-oriented tasks, especially in the era of big data.
Abstract: Compression algorithms are important for data-oriented tasks, especially in the era of “Big Data.” Modern processors equipped with powerful SIMD instruction sets provide us with an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset, and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art nonvectorized algorithms with respect to decoding speeds.

41 citations
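
The grouping idea behind the Group-* codecs can be sketched in scalar form: each small group of integers shares one bit width, recorded in a per-group descriptor. The plain-Python sketch below illustrates only that layout; the published algorithms add SIMD-friendly slot arrangements and vectorized decoding, which pure Python cannot express.

def encode_groups(nums, group_size=4):
    """Group-oriented bit packing (scalar sketch): one descriptor byte
    (the bit width) per group, followed by the packed payload."""
    out = bytearray()
    for i in range(0, len(nums), group_size):
        group = nums[i:i + group_size]
        width = max(max(group).bit_length(), 1)  # bits per slot in this group
        out.append(width)
        packed = 0
        for j, n in enumerate(group):
            packed |= n << (j * width)
        out += packed.to_bytes((len(group) * width + 7) // 8, "little")
    return bytes(out)

def decode_groups(data, count, group_size=4):
    nums, pos = [], 0
    while len(nums) < count:
        width = data[pos]; pos += 1
        k = min(group_size, count - len(nums))
        nbytes = (k * width + 7) // 8
        packed = int.from_bytes(data[pos:pos + nbytes], "little"); pos += nbytes
        mask = (1 << width) - 1
        nums += [(packed >> (j * width)) & mask for j in range(k)]
    return nums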


Journal ArticleDOI
Ryen W. White, Eric Horvitz
TL;DR: A methodology for measuring participants' beliefs and confidence about the efficacy of treatment before, during, and after search episodes is presented, and predictive models are built to estimate postsearch beliefs using sets of features about behavior and content.
Abstract: We investigate how beliefs about the efficacy of medical interventions are influenced by searchers' exposure to information on retrieved Web pages. We present a methodology for measuring participants' beliefs and confidence about the efficacy of treatment before, during, and after search episodes. We consider interventions studied in the Cochrane collection of meta-analyses. We extract related queries from search engine logs and consider the Cochrane assessments as ground truth. We analyze the dynamics of belief over time and show the influence of prior beliefs and confidence at the end of sessions. We present evidence for confirmation bias and for anchoring-and-adjustment during search and retrieval. Then, we build predictive models to estimate postsearch beliefs using sets of features about behavior and content. The findings provide insights about the influence of Web content on the beliefs of people and have implications for the design of search systems.

39 citations


Journal ArticleDOI
TL;DR: This article analyzes users’ behavioral patterns and compares them to the patterns in desktop-to-desktop web search, and examines several approaches of using Mobile Touch Interactions (MTIs) to infer relevant content so that such content can be used for supporting subsequent search queries on desktop computers.
Abstract: Mobile devices enable people to look for information at the moment when their information needs are triggered. While experiencing complex information needs that require multiple search sessions, users may utilize desktop computers to fulfill information needs started on mobile devices. Under the context of mobile-to-desktop web search, this article analyzes users’ behavioral patterns and compares them to the patterns in desktop-to-desktop web search. Then, we examine several approaches of using Mobile Touch Interactions (MTIs) to infer relevant content so that such content can be used for supporting subsequent search queries on desktop computers. The experimental data used in this article was collected through a user study involving 24 participants and six properly designed cross-device web search tasks. Our experimental results show that (1) users’ mobile-to-desktop search behaviors do significantly differ from desktop-to-desktop search behaviors in terms of information exploration, sense-making and repeated behaviors. (2) MTIs can be employed to predict the relevance of click-through documents, but applying document-level relevant content based on the predicted relevance does not improve search performance. (3) MTIs can also be used to identify the relevant text chunks at a fine-grained subdocument level. Such relevant information can achieve better search performance than the document-level relevant content. In addition, such subdocument relevant information can be combined with document-level relevance to further improve the search performance. However, the effectiveness of these methods relies on the sufficiency of click-through documents. (4) MTIs can also be obtained from the Search Engine Results Pages (SERPs). The subdocument feedbacks inferred from this set of MTIs even outperform the MTI-based subdocument feedback from the click-through documents.

37 citations


Journal ArticleDOI
TL;DR: This article proposes a graphical model to score queries that exploits a latent topic space, which is automatically derived from the query log, to detect semantic dependency of terms in a query and dependency among topics.
Abstract: An important way to improve users’ satisfaction in Web search is to assist them by issuing more effective queries. One such approach is query reformulation, which generates new queries according to the current query issued by users. A common procedure for conducting reformulation is to generate some candidate queries first, then a scoring method is employed to assess these candidates. Currently, most of the existing methods are context based. They rely heavily on the context relation of terms in the history queries and cannot detect and maintain the semantic consistency of queries. In this article, we propose a graphical model to score queries. The proposed model exploits a latent topic space, which is automatically derived from the query log, to detect semantic dependency of terms in a query and dependency among topics. Meanwhile, the graphical model also captures the term context in the history query by skip-bigram and n-gram language models. In addition, our model can be easily extended to consider users’ history search interests when we conduct query reformulation for different users. In the task of candidate query generation, we investigate a social tagging data resource—Delicious bookmark—to generate addition and substitution patterns that are employed as supplements to the patterns generated from query log data.

Journal ArticleDOI
TL;DR: This article studies the social emotion detection problem, the objective of which is to identify the emotions evoked in readers by online documents such as news articles; a novel Latent Discriminative Model (LDM) is proposed for this task, and modeling the dependencies among emotions gives rise to a new Emotional Dependency-based LDM (eLDM).
Abstract: Sentiment analysis of such opinionated online texts as reviews and comments has received increasingly close attention, yet most of the work is intended to deal with the detection of authors’ emotions. In contrast, this article presents our study of the social emotion detection problem, the objective of which is to identify the emotions evoked in readers by online documents such as news articles. A novel Latent Discriminative Model (LDM) is proposed for this task. LDM works by introducing intermediate hidden variables to model the latent structure of input text corpora. To achieve this, it defines a joint distribution over emotions and latent variables, conditioned on the observed text documents. Moreover, we assume that social emotions are not independent but correlated with one another, and that their dependencies can provide additional guidance to LDM in the training process. The inclusion of this emotional dependency into LDM gives rise to a new Emotional Dependency-based LDM (eLDM). We evaluate the proposed models through a series of empirical evaluations on two real-world corpora of news articles. Experimental results verify the effectiveness of LDM and eLDM in social emotion detection.

Journal ArticleDOI
TL;DR: DSA-IS is a new disk-friendly method that sequentially retrieves the preceding character of a sorted suffix to induce the order of the preceding suffix, exactly emulating the induced sorting algorithm SA-IS in external memory.
Abstract: We present in this article an external memory algorithm, called disk SA-IS (DSA-IS), to exactly emulate the induced sorting algorithm SA-IS previously proposed for sorting suffixes in RAM. DSA-IS is a new disk-friendly method for sequentially retrieving the preceding character of a sorted suffix to induce the order of the preceding suffix. For a size-n string of a constant or integer alphabet, given the RAM capacity Ω((nW)^0.5), where W is the size of each I/O buffer that is large enough to amortize the overhead of each access to disk, both the CPU time and peak disk use of DSA-IS are O(n). Our experimental study shows that on average, DSA-IS achieves the best time and space results of all of the existing external memory algorithms based on the induced sorting principle.

Journal ArticleDOI
TL;DR: A novel query change retrieval model (QCM) is proposed, which uses syntactic editing changes between consecutive queries, as well as the relationship between query changes and previously retrieved documents, to enhance session search.
Abstract: Modern information retrieval (IR) systems exhibit user dynamics through interactivity. These dynamic aspects of IR, including changes found in data, users, and systems, are increasingly being utilized in search engines. Session search is one such IR task—document retrieval within a session. During a session, a user constantly modifies queries to find documents that fulfill an information need. Existing IR techniques for assisting the user in this task are limited in their ability to optimize over changes, learn with a minimal computational footprint, and be responsive. This article proposes a novel query change retrieval model (QCM), which uses syntactic editing changes between consecutive queries, as well as the relationship between query changes and previously retrieved documents, to enhance session search. We propose modeling session search as a Markov decision process (MDP). We consider two agents in this MDP: the user agent and the search engine agent. The user agent’s actions are query changes that we observe, and the search engine agent’s actions are term weight adjustments as proposed in this work. We also investigate multiple query aggregation schemes and their effectiveness on session search. Experiments show that our approach is highly effective and outperforms top session search systems in TREC 2011 and TREC 2012.
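
The query-change signal that QCM builds on can be made concrete with a small sketch: diff consecutive queries into retained ("theme"), added, and removed terms, and adjust term weights accordingly. The multipliers below are illustrative; the actual QCM adjustments also condition on whether the terms occurred in previously retrieved documents.

def query_change_weights(prev_query, curr_query,
                         boost_added=1.5, keep_theme=1.0, demote_removed=0.5):
    """Assign term weights from syntactic query changes, in the spirit of QCM."""
    prev, curr = set(prev_query.split()), set(curr_query.split())
    weights = {}
    for t in curr & prev:   # theme terms carried across queries
        weights[t] = keep_theme
    for t in curr - prev:   # added terms: the user is refining the need
        weights[t] = boost_added
    for t in prev - curr:   # removed terms: demote softly rather than drop
        weights[t] = demote_removed
    return weights

# Example session step: "hardwood floor" -> "hardwood floor cleaning"
# yields {'hardwood': 1.0, 'floor': 1.0, 'cleaning': 1.5}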

Journal ArticleDOI
TL;DR: This work is interested in repositories of image/text multimedia objects and studies multimodal information fusion techniques in the context of content-based multimedia information retrieval, focusing on graph-based methods, which have proven to provide state-of-the-art performance.
Abstract: Multimedia collections are more than ever growing in size and diversity. Effective multimedia retrieval systems are thus critical to access these datasets from the end-user perspective and in a scalable way. We are interested in repositories of image/text multimedia objects and we study multimodal information fusion techniques in the context of content-based multimedia information retrieval. We focus on graph-based methods, which have proven to provide state-of-the-art performances. We particularly examine two such methods: cross-media similarities and random-walk-based scores. From a theoretical viewpoint, we propose a unifying graph-based framework, which encompasses the two aforementioned approaches. Our proposal allows us to highlight the core features one should consider when using a graph-based technique for the combination of visual and textual information. We compare cross-media and random-walk-based results using three different real-world datasets. From a practical standpoint, our extended empirical analyses allow us to provide insights and guidelines about the use of graph-based methods for multimodal information fusion in content-based multimedia information retrieval.
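
As one concrete member of the random-walk family examined here, the sketch below runs a random walk with restart over a graph whose edge weights linearly mix visual and textual similarities; the mixing weight beta and the damping factor alpha are illustrative assumptions, not the surveyed methods' exact formulations.

import numpy as np

def random_walk_scores(W_visual, W_textual, restart, alpha=0.85,
                       beta=0.5, iters=50):
    """Random-walk-with-restart scores over a fused multimodal graph.
    W_visual, W_textual: (n, n) nonnegative similarity matrices
    restart: (n,) restart distribution (e.g., initial query scores)"""
    W = beta * W_visual + (1.0 - beta) * W_textual
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)  # row-stochastic
    s = restart.copy()
    for _ in range(iters):
        s = alpha * (P.T @ s) + (1.0 - alpha) * restart
    return s  # higher score = more relevant under the fused graph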

Journal ArticleDOI
TL;DR: This article models document generation as a random process with reinforcement (a multivariate Pólya process) and develops a Dirichlet compound multinomial language model that captures word burstiness directly; the model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes.
Abstract: The multinomial language model has been one of the most effective models of retrieval for more than a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term dependency—that is, the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Pólya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly. We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is more robust than that in the multinomial language model. Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model adheres to the constraint. Finally, we show that the new language model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes.
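
For reference, the Dirichlet compound multinomial (multivariate Pólya) likelihood the abstract builds on has the standard form below, in notation chosen here: n_w is the count of term w in document d, n = \sum_w n_w, and A = \sum_w \alpha_w.

P(d \mid \boldsymbol{\alpha})
  = \frac{n!}{\prod_w n_w!}
    \cdot \frac{\Gamma(A)}{\Gamma(A + n)}
    \cdot \prod_w \frac{\Gamma(\alpha_w + n_w)}{\Gamma(\alpha_w)}

Because each additional occurrence of a term w multiplies the likelihood by (\alpha_w + n_w)/(A + n) at the current counts, repeat occurrences of a term become progressively more likely, which is exactly the burstiness the plain multinomial cannot capture.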

Journal ArticleDOI
TL;DR: This article presents an analysis of different interleaving methods as applied to aggregated search engine result pages and proposes two vertical-aware methods: one derived from the widely used Team-Draft Interleaving method by adjusting it in such a way that it respects vertical document groupings, and another based on the recently introduced Optimized Interleaving framework.
Abstract: A result page of a modern search engine often goes beyond a simple list of “10 blue links.” Many specific user needs (e.g., News, Image, Video) are addressed by so-called aggregated or vertical search solutions: specially presented documents, often retrieved from specific sources, that stand out from the regular organic Web search results. When it comes to evaluating ranking systems, such complex result layouts raise their own challenges. This is especially true for so-called interleaving methods that have arisen as an important type of online evaluation: by mixing results from two different result pages, interleaving can easily break the desired Web layout in which vertical documents are grouped together, and hence hurt the user experience. We conduct an analysis of different interleaving methods as applied to aggregated search engine result pages. Apart from conventional interleaving methods, we propose two vertical-aware methods: one derived from the widely used Team-Draft Interleaving method by adjusting it in such a way that it respects vertical document groupings, and another based on the recently introduced Optimized Interleaving framework. We show that our proposed methods are better at preserving the user experience than existing interleaving methods while still performing well as a tool for comparing ranking systems. For evaluating our proposed vertical-aware interleaving methods, we use real-world click data as well as simulated clicks and simulated ranking systems.
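
For readers unfamiliar with the baseline being adapted, here is a minimal Python sketch of plain Team-Draft Interleaving; the vertical-aware variant proposed in the article additionally keeps vertical document blocks contiguous, a constraint omitted here.

import random

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Plain Team-Draft Interleaving: the team with fewer picks (coin flip
    on ties) adds its best not-yet-shown document. Clicks are later credited
    to the team that contributed the clicked document."""
    shown, team_of = [], {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(shown) < length:
        team = min(picks, key=lambda t: (picks[t], random.random()))
        doc = next((d for d in rankings[team] if d not in team_of), None)
        if doc is None:  # this team is exhausted; let the other finish
            team = "B" if team == "A" else "A"
            doc = next((d for d in rankings[team] if d not in team_of), None)
            if doc is None:
                break
        team_of[doc] = team
        shown.append(doc)
        picks[team] += 1
    return shown, team_of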

Journal ArticleDOI
TL;DR: This article introduces a novel neural network architecture called KNET that leverages both words’ contextual information and morphological knowledge to learn word embeddings and demonstrates that the proposed KNET framework can greatly enhance the effectiveness of word embeddings.
Abstract: Neural network techniques are widely applied to obtain high-quality distributed representations of words (i.e., word embeddings) to address text mining, information retrieval, and natural language processing tasks. Most recent efforts have proposed several efficient methods to learn word embeddings from context such that they can encode both semantic and syntactic relationships between words. However, it is quite challenging to handle unseen or rare words with insufficient context. Inspired by the study on the word recognition process in cognitive psychology, in this article, we propose to take advantage of seemingly less obvious but essentially important morphological knowledge to address these challenges. In particular, we introduce a novel neural network architecture called KNET that leverages both words’ contextual information and morphological knowledge to learn word embeddings. Meanwhile, this new learning architecture is also able to benefit from noisy knowledge and balance between contextual information and morphological knowledge. Experiments on an analogical reasoning task and a word similarity task both demonstrate that the proposed KNET framework can greatly enhance the effectiveness of word embeddings.

Journal ArticleDOI
TL;DR: This work designs algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that they can efficiently provide high-quality answers to user queries using only the selected subset.
Abstract: We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that we can efficiently provide high-quality answers to user queries using only the selected subset. This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and prove that even though they use only a small fraction of the entire collection, they can provide answers to most user queries, achieving a performance close to the optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of Web search. In the first case, we favor the retrieval of documents that are relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analysis provide strong evidence of the potential value of query covering in diverse application scenarios.
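
A minimal sketch of the covering idea, cast as greedy weighted max-coverage: repeatedly select the document that satisfies the largest remaining query probability mass. The data layout and the simplification that a single relevant document satisfies a query are illustrative assumptions; the paper's algorithms and stochastic guarantees are more refined.

def greedy_query_cover(query_dist, answers, budget):
    """Select up to `budget` documents so that queries drawn from query_dist
    can be answered from the selected subset alone.

    query_dist: {query: probability}
    answers:    {query: set of doc ids that answer it}
    """
    doc_queries = {}
    for q, docs in answers.items():
        for d in docs:
            doc_queries.setdefault(d, set()).add(q)

    selected, satisfied = set(), set()
    for _ in range(budget):
        best = max(doc_queries,
                   key=lambda d: sum(query_dist[q]
                                     for q in doc_queries[d] - satisfied),
                   default=None)
        if best is None:
            break
        selected.add(best)
        satisfied |= doc_queries.pop(best)
    return selected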

Journal ArticleDOI
TL;DR: Information systems that leverage contextual knowledge about their users and their search situations – such as histories, demographics, surroundings, constraints or devices – can provide tailored search experiences and higher-quality task outcomes.
Abstract: Information systems that leverage contextual knowledge about their users and their search situations – such as histories, demographics, surroundings, constraints or devices – can provide tailored search experiences and higher-quality task outcomes. Within information retrieval, there is a growing focus on how knowledge of user interests, intentions, and context can improve aspects of search and recommendation such as ranking and query suggestion, especially for exploratory and/or complex tasks that can span multiple queries or search sessions. The interactions that occur during these complex tasks provide context that can be leveraged by search systems to support users’ broader information-seeking activities. Next-generation recommender systems face analogous challenges, including integrating signals from user exploration to update recommendations in real time. Within the space of search, much of the work on modeling context and search personalization has focused on constructing topical profiles of the user’s short- and long-term search history [Gauch et al. 2004; Chirita et al. 2005; Speretta and Gauch 2005; Ma et al. 2007; Bennett et al. 2010; White et al. 2010; Xiang et al. 2010; Sontag et al. 2012] or more generally, models of their query and result-click sequences [Cao et al. 2008; Cao et al. 2009; Mihalkova and Mooney 2009]. Related research has also considered a more content-driven representation such as language-model based approaches [Tan et al. 2006] or weighted term vectors derived from long-term desktop search activities [Teevan et al. 2005; Matthijs and Radlinski 2011]. However, a variety of recent investigations to contextualize search include a broader set of factors based on: a user’s location [Bennett et al. 2011], a user’s task-based search activity [Jones and Klinkner 2008; Kanoulas et al. 2011b; 2011a; Kanoulas et al. 2012; Sontag et al. 2012; Melucci 2012; Raman et al. 2014], the long-term vs. short-term interests of the user [Sugiyama et al. 2004; Li et al. 2007; Bennett et al. 2012], the ability of users to consume information at differing levels of complexity [Collins-Thompson et al. 2011], and patterns of re-finding the same search result over time [Teevan et al. 2011; Shokouhi et al. 2013]. The growth in the types of context explored and the information available to search systems derives from the timely convergence of several factors. The rapid growth in the use of different devices – most notably smartphones and tablets, but also including stationary devices such as game consoles, smart televisions, and augmented conference rooms – provides opportunities to obtain both raw and derived contextual signals that could power next-generation search and recommendation systems. The use of such signals in search and recommendation tasks has been recently explored in such venues as the Context-awareness in Retrieval and Recommendation workshops at IUI 2011-2012 [Luca et al. 2011; Luca et al. 2012], WSDM 2013 [Bohmer et al. 2013], and ECIR 2014 [Said et al. 2014]. Furthermore, a variety of recent work and venues have noted that much information retrieval research on web search has focused on optimizing and evaluating single queries, even though a significant fraction of queries are associated with more complex tasks [Jones and Klinkner 2008; Kanoulas et al. 2011b; 2011a; Belkin et al. 2012a;

Journal ArticleDOI
TL;DR: This work explores how a domain model derived from the search behaviour of user cohorts is best utilised for profile-biased summarisation of documents in a navigation scenario in which such summaries can be displayed as hover text as a user moves the mouse over a link.
Abstract: Information systems that utilise contextual information have the potential of helping a user identify relevant information more quickly and more accurately than systems that work the same for all users and contexts. Contextual information comes in a variety of types, often derived from records of past interactions between a user and the information system. It can be individual or group based. We are focusing on the latter, harnessing the search behaviour of cohorts of users, turning it into a domain model that can then be used to assist other users of the same cohort. More specifically, we aim to explore how such a domain model is best utilised for profile-biased summarisation of documents in a navigation scenario in which such summaries can be displayed as hover text as a user moves the mouse over a link. The main motivation is to help a user find relevant documents more quickly. Given the fact that the Web in general has been studied extensively already, we focus our attention on Web sites and similar document collections. Such collections can be notoriously difficult to search or explore. The process of acquiring the domain model is not a research interest here; we simply adopt a biologically inspired method that resembles the idea of ant colony optimisation. This has been shown to work well in a variety of application areas. The model can be built in a continuous learning cycle that exploits search patterns as recorded in typical query log files. Our research explores different summarisation techniques, some of which use the domain model and some that do not. We perform task-based evaluations of these different techniques—thus of the impact of the domain model and profile-biased summarisation—in the context of Web site navigation.

Journal ArticleDOI
TL;DR: This article proposes two-stage normalization by performing verbosity and scope normalization separately, and by employing different penalization functions, which leads to marginal but statistically significant improvements over standard retrieval models.
Abstract: The standard approach for term frequency normalization is based only on the document length. However, it does not distinguish the verbosity from the scope, these being the two main factors determining the document length. Because the verbosity and scope have largely different effects on the increase in term frequency, the standard approach can easily suffer from insufficient or excessive penalization depending on the specific type of long document. To overcome these problems, this article proposes two-stage normalization by performing verbosity and scope normalization separately, and by employing different penalization functions. In verbosity normalization, each document is prenormalized by dividing the term frequency by the verbosity of the document. In scope normalization, an existing retrieval model is applied in a straightforward manner to the prenormalized document, finally leading us to formulate our proposed verbosity normalized (VN) retrieval model. Experimental results carried out on standard TREC collections demonstrate that the VN model leads to marginal but statistically significant improvements over standard retrieval models.
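
A minimal sketch of the two-stage idea, assuming verbosity is estimated as the mean frequency per distinct term and using a BM25-style function as the "existing retrieval model" applied in the second stage; both choices and the parameter values are illustrative assumptions rather than the article's exact formulation.

import math

def verbosity(doc_tf):
    """Verbosity estimate: mean frequency per distinct term (an assumption)."""
    return sum(doc_tf.values()) / max(len(doc_tf), 1)

def vn_score(query_terms, doc_tf, df, n_docs, avg_scope, k1=1.2, b=0.75):
    """Two-stage normalization: stage 1 divides tf by verbosity; stage 2
    applies BM25-style scope normalization to the prenormalized counts."""
    v = verbosity(doc_tf) or 1.0
    scope = len(doc_tf)                       # number of distinct terms
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0) / v             # stage 1: verbosity normalization
        if tf == 0:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1)
        denom = tf + k1 * (1 - b + b * scope / avg_scope)
        score += idf * tf * (k1 + 1) / denom  # stage 2: scope normalization
    return score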

Journal ArticleDOI
Su Yan, Xiaojun Wan
TL;DR: A novel extractive topic-focused multidocument summarization framework that proposes a new kind of more meaningful and informative units named frequent Deep Dependency Sub-Structure (DDSS) and a topic-sensitive Multi-Task Learning (MTL) model for frequent DDSS ranking.
Abstract: Most extractive style topic-focused multidocument summarization systems generate a summary by ranking textual units in multiple documents and extracting a proper subset of sentences biased to the given topic. Usually, the textual units are simply represented as sentences or n-grams, which do not carry deep syntactic and semantic information. This article presents a novel extractive topic-focused multidocument summarization framework. The framework proposes a new kind of more meaningful and informative units named frequent Deep Dependency Sub-Structure (DDSS) and a topic-sensitive Multi-Task Learning (MTL) model for frequent DDSS ranking. Given a document set, first, we parse all the sentences into deep dependency structures with a Head-driven Phrase Structure Grammar (HPSG) parser and mine the frequent DDSSs after semantic normalization. Then we employ a topic-sensitive MTL model to learn the importance of these frequent DDSSs. Finally, we exploit an Integer Linear Programming (ILP) formulation and use the frequent DDSSs as the essentials for summary extraction. Experimental results on two DUC datasets demonstrate that our proposed approach can achieve state-of-the-art performance. Both the DDSS information and the topic-sensitive MTL model are validated to be very helpful for topic-focused multidocument summarization.

Journal ArticleDOI
Hui Yang
TL;DR: A novel minimum-evolution hierarchy construction framework is proposed that directly learns semantic distances from training data and from users, producing globally optimized hierarchical structures by incorporating user-generated task specifications into the general learning framework.
Abstract: Hierarchies serve as browsing tools to access information in document collections. This article explores techniques to derive browsing hierarchies that can be used as an information map for task-based search. It proposes a novel minimum-evolution hierarchy construction framework that directly learns semantic distances from training data and from users to construct hierarchies. The aim is to produce globally optimized hierarchical structures by incorporating user-generated task specifications into the general learning framework. Both an automatic version of the framework and an interactive version are presented. A comparison with state-of-the-art systems and a user study jointly demonstrate that the proposed framework is highly effective.

Journal ArticleDOI
Aditya Pal
TL;DR: A Cutoff-Aggregation (CA) algorithm is introduced that aggregates entity similarity within a community to compute that community's relevance, along with two computationally efficient k-nearest-neighbor (knn) algorithms that are natural instantiations of the CA algorithm; several ranking algorithms are evaluated over the aggregate similarity scores computed by the two knn algorithms.
Abstract: An online community consists of a group of users who share a common interest, background, or experience, and their collective goal is to contribute toward the welfare of the community members. Several websites allow their users to create and manage niche communities, such as Yahoo! Groups, Facebook Groups, Google+ Circles, and WebMD Forums. These community services also exist within enterprises, such as IBM Connections. Question answering within these communities enables their members to exchange knowledge and information with other community members. However, the onus of finding the right community for question asking lies with an individual user. The overwhelming number of communities necessitates a good question routing strategy so that new questions get routed to an appropriately focused community and thus get resolved in a reasonable time frame. In this article, we consider the novel problem of routing a question to the right community and propose a framework for selecting and ranking the relevant communities for a question. We propose several novel features for modeling the three main entities of the system: questions, users, and communities. We propose features such as language attributes, inclination to respond, user familiarity, and difficulty of a question; based on these features, we propose similarity metrics between the routed question and the system entities. We introduce a Cutoff-Aggregation (CA) algorithm that aggregates the entity similarity within a community to compute that community's relevance. We introduce two k-nearest-neighbor (knn) algorithms that are natural instantiations of the CA algorithm and are computationally efficient, and we evaluate several ranking algorithms over the aggregate similarity scores computed by the two knn algorithms. We propose clustering techniques to speed up our recommendation framework and show how pipelining can improve the model performance. We demonstrate the effectiveness of our framework on two large real-world datasets.
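
A minimal sketch of Cutoff-Aggregation as described: keep only entity similarities above a cutoff, aggregate the k best per community, and rank communities by the aggregate. The cutoff value, k, and the mean aggregator are illustrative choices; the article's knn instantiations and ranking algorithms are richer.

def cutoff_aggregate(entity_sims, cutoff=0.3, k=10):
    """A community's relevance: mean of its k highest entity similarities
    above the cutoff (cutoff, k, and the mean are illustrative)."""
    kept = sorted((s for s in entity_sims if s >= cutoff), reverse=True)[:k]
    return sum(kept) / len(kept) if kept else 0.0

def route_question(sims_by_community, top_n=3):
    """Rank communities for a routed question by aggregated similarity.
    sims_by_community: {community: [similarity of each of its entities
    (questions, users) to the routed question]}"""
    scores = {c: cutoff_aggregate(sims) for c, sims in sims_by_community.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]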

Journal ArticleDOI
TL;DR: It is found that with the whole index stored in main memory, PRF retrieval using a spstring or wvbc forward index excels in time efficiency over an inverted index, reaching the same levels of performance measures in less time.
Abstract: The inverted index is the dominant indexing method in information retrieval systems. It enables fast return of the list of all documents containing a given query term. However, for retrieval schemes involving query expansion, as in pseudo-relevance feedback (PRF), the retrieval time based on an inverted index increases linearly with the number of expansion terms. In this regard, we have examined the use of a forward index, which consists of the mapping of each document to its constituent terms. We propose a novel forward index-based reranking scheme to shorten the PRF retrieval time. In our method, a first retrieval of the original query is performed using an inverted index, and then a forward index is employed for the PRF part. We have studied several new forward indexes, including one based on a novel spstring data structure and one based on the weighted variable bit-block compression (wvbc) signature. With modern hardware such as solid-state drives (SSDs) and sufficiently large main memory, forward index methods are particularly promising. We find that with the whole index stored in main memory, PRF retrieval using a spstring or wvbc forward index excels in time efficiency over an inverted index, reaching the same levels of performance measures in less time.
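
A minimal sketch of the proposed flow: the original query runs once against the inverted index, and everything involving expansion terms touches only the forward-index entries of the candidate documents, so retrieval time no longer grows with one posting-list scan per expansion term. The first_pass_score callback, the feedback-term selection, and the bonus weight are illustrative assumptions.

from collections import Counter

def prf_rerank(query_terms, inverted_index, forward_index,
               first_pass_score, fb_docs=10, fb_terms=20):
    """PRF reranking with a forward index. forward_index maps each document
    to its {term: tf} mapping; first_pass_score is any ranked-retrieval
    function over the inverted index returning [(doc, score), ...]."""
    initial = first_pass_score(query_terms, inverted_index)
    top = [d for d, _ in initial[:fb_docs]]

    pool = Counter()                 # expansion terms from feedback docs
    for d in top:
        pool.update(forward_index[d])
    expansion = [t for t, _ in pool.most_common(fb_terms)
                 if t not in query_terms]

    reranked = []                    # forward-index lookups only, no postings
    for d, s in initial:
        bonus = sum(forward_index[d].get(t, 0) for t in expansion)
        reranked.append((d, s + 0.1 * bonus))
    return sorted(reranked, key=lambda x: -x[1])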

Journal ArticleDOI
TL;DR: A transformation-aware soft cascading (TASC) approach for multimodal video copy detection that can achieve excellent copy detection accuracy and localization precision with a very high processing efficiency.
Abstract: How to precisely and efficiently detect near-duplicate copies with complicated audiovisual transformations from a large-scale video database is a challenging task. To cope with this challenge, this article proposes a transformation-aware soft cascading (TASC) approach for multimodal video copy detection. Basically, our approach divides query videos into several categories and then for each category designs a transformation-aware chain to organize several detectors in a cascade structure. In each chain, efficient but simple detectors are placed in the forepart, whereas effective but complex detectors are located in the rear. To judge whether two videos are near-duplicates, a Detection-on-Copy-Units mechanism is introduced in the TASC, which makes the decision of copy detection depending on the similarity between their most similar fractions, called copy units (CUs), rather than the video-level similarity. Following this, we propose a CU search algorithm to find a pair of CUs from two videos and a CU-based localization algorithm to find the precise locations of their copy segments that are centered on the asserted CUs. Moreover, to address the problem that the copies and noncopies are possibly linearly inseparable in the feature space, the TASC also introduces a flexible strategy, called soft decision boundary, to replace the single-threshold strategy for each detector. Its basic idea is to automatically learn two thresholds for each detector to examine the easy-to-judge copies and noncopies, respectively, and meanwhile to train a nonlinear classifier to further check those hard-to-judge ones. Extensive experiments on three benchmark datasets showed that the TASC can achieve excellent copy detection accuracy and localization precision with very high processing efficiency.
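
The soft decision boundary and cascade structure can be sketched compactly: two learned thresholds per detector dispose of the easy pairs, hard pairs fall through to the next, more expensive detector, and a nonlinear classifier is the last resort. The threshold representation and the hard_case_clf placeholder are illustrative assumptions.

def soft_boundary(similarity, t_low, t_high):
    """Two learned thresholds split pairs into easy accepts, easy rejects,
    and hard cases needing further checks."""
    if similarity >= t_high:
        return "copy"
    if similarity <= t_low:
        return "noncopy"
    return "hard"

def tasc_cascade(video_pair, detectors, hard_case_clf):
    """Run detectors cheap-to-expensive; easy cases exit early, hard cases
    fall through, and the final hard cases go to a nonlinear classifier
    (a placeholder callable here)."""
    for detect, t_low, t_high in detectors:
        sim = detect(video_pair)     # similarity of the best copy units (CUs)
        verdict = soft_boundary(sim, t_low, t_high)
        if verdict != "hard":
            return verdict
    return "copy" if hard_case_clf(video_pair) else "noncopy"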

Journal ArticleDOI
TL;DR: This article presents a new IR model based on concepts taken from both IR and digital signal processing (like Fourier analysis of signals and filtering) that allows the whole IR process to be seen as a physical phenomenon.
Abstract: Information retrieval (IR) systems are designed, in general, to satisfy the information need of a user who expresses it by means of a query, by providing him with a subset of documents selected from a collection and ordered by decreasing relevance to the query. Such systems are based on IR models, which define how to represent the documents and the query, as well as how to determine the relevance of a document for a query. In this article, we present a new IR model based on concepts taken from both IR and digital signal processing (like Fourier analysis of signals and filtering). This allows the whole IR process to be seen as a physical phenomenon, where the query corresponds to a signal, the documents correspond to filters, and the determination of the relevant documents to the query is done by filtering that signal. Tests showed that the quality of the results provided by this IR model is comparable with the state-of-the-art.