
Showing papers in "Information Processing and Management in 2012"


Journal ArticleDOI
TL;DR: This paper reports on the first attempts to combine crowdsourcing and TREC: the aim is to validate the use of crowdsourcing for relevance assessment, using the Amazon Mechanical Turk crowdsourcing platform to run experiments on TREC data, evaluate the outcomes, and discuss the results.
Abstract: Crowdsourcing has recently gained a lot of attention as a tool for conducting different kinds of relevance evaluations. At a very high level, crowdsourcing describes the outsourcing of tasks to a large group of people instead of assigning such tasks to an in-house employee. This approach makes it possible to conduct information retrieval experiments extremely fast, with good results, at a low cost. This paper reports on the first attempts to combine crowdsourcing and TREC: our aim is to validate the use of crowdsourcing for relevance assessment. To this aim, we use the Amazon Mechanical Turk crowdsourcing platform to run experiments on TREC data, evaluate the outcomes, and discuss the results. We place emphasis on experiment design, execution, and quality control to gather useful results, with particular attention to the issue of agreement among assessors. Our position, supported by the experimental results, is that crowdsourcing is a cheap, quick, and reliable alternative for relevance assessment.
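
A quick way to quantify the assessor agreement the authors focus on is Cohen's kappa between two workers' relevance labels. The sketch below is illustrative only; the labels are hypothetical and the paper's own agreement analysis may use different statistics.

```python
# Cohen's kappa between two assessors' binary relevance labels
# (hypothetical data; shown only to make the "agreement" notion concrete).
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n ** 2     # chance agreement
    return (p_o - p_e) / (1 - p_e)

worker1 = [1, 1, 0, 1, 0, 0, 1, 0]
worker2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohen_kappa(worker1, worker2))  # -> 0.5
```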

159 citations


Journal ArticleDOI
TL;DR: Results suggest that scholars' research performance is significantly correlated with their ego-network measures, and that scholars with efficient collaboration networks who maintain a strong co-authorship relationship with one primary co-author within a group of linked co-authors perform better than researchers with many relationships to the same group of linked co-authors.
Abstract: In this study, we propose and validate a social network-based theoretical model for exploring how scholars' collaboration (co-authorship) network properties are associated with their citation-based research performance (i.e., g-index). Using structural holes theory, we focus on how a scholar's egocentric network properties of density, efficiency, and constraint within the network are associated with their scholarly performance. For our analysis, we use publication data of high impact factor journals in the field of ''Information Science & Library Science'' between 2000 and 2009, extracted from Scopus. The resulting database contained 4837 publications reflecting the contributions of 8069 authors. Results from our data analysis suggest that scholars' research performance is significantly correlated with their ego-network measures. In particular, scholars with more co-authors and those who exhibit higher levels of betweenness centrality (i.e., the extent to which a co-author is between another pair of co-authors) perform better in terms of research (i.e., higher g-index). Furthermore, scholars with efficient collaboration networks who maintain a strong co-authorship relationship with one primary co-author within a group of linked co-authors (i.e., co-authors that have joint publications) perform better than those researchers with many relationships to the same group of linked co-authors.
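
The ego-network measures named above (betweenness centrality, efficiency, Burt's constraint) are directly computable with standard tools. Below is a minimal sketch with networkx on a hypothetical co-authorship graph, not the paper's Scopus data.

```python
# Ego-network measures on a toy co-authorship graph. Node "A" bridges two
# clusters, so it should show high betweenness and high efficiency.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),   # a tightly linked group
    ("A", "D"), ("D", "E"),               # A bridges to a second cluster
])

betweenness = nx.betweenness_centrality(G)   # brokerage position
effective_size = nx.effective_size(G)        # non-redundant contacts
constraint = nx.constraint(G)                # Burt's constraint
efficiency = {n: effective_size[n] / G.degree(n) for n in G}

for n in sorted(G):
    print(n, round(betweenness[n], 2), round(efficiency[n], 2),
          round(constraint[n], 2))
```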

156 citations


Journal ArticleDOI
TL;DR: The hypothesis of this paper is that the results obtained by applying traditional similarity measures can be improved by taking contextual information, drawn from the entire body of users, and using it to calculate the singularity which exists, for each item, in the votes cast by each pair of users being compared.
Abstract: Recommender systems play an important role in reducing the negative impact of information overload on those websites where users have the possibility of voting for their preferences on items. The most common technique for dealing with the recommendation mechanism is collaborative filtering, in which it is essential to discover the users most similar to the one for whom recommendations are intended. The hypothesis of this paper is that the results obtained by applying traditional similarity measures can be improved by taking contextual information, drawn from the entire body of users, and using it to calculate the singularity which exists, for each item, in the votes cast by each pair of users being compared. As such, the greater the singularity of the votes cast by two given users, the greater its impact on their similarity. The results, tested on the Movielens, Netflix and FilmAffinity databases, corroborate the excellent behaviour of the proposed singularity measure.
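
The abstract does not give the singularity formula, but the idea can be made concrete: a vote is singular when few users cast it, and agreement between two users on singular votes should weigh more. A hedged sketch under that assumption:

```python
# Singularity-weighted similarity sketch (binary ratings, hypothetical data).
# This illustrates the idea, not the paper's exact formula.
import numpy as np

ratings = np.array([      # rows: users, cols: items; 1 = like, 0 = dislike
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

p_like = ratings.mean(axis=0)   # share of users liking each item
# rarity of each cast vote: a "like" is singular when few users like the item
singularity = np.where(ratings == 1, 1 - p_like, p_like)

def similarity(u, v):
    agree = ratings[u] == ratings[v]
    weights = singularity[u] * singularity[v]  # rare-vote agreement counts more
    return float((agree * weights).sum() / weights.sum())

print(similarity(0, 1))
```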

143 citations


Journal ArticleDOI
TL;DR: The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA) when Naive Bayes classifier is used.
Abstract: Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both within a category (intra-category) and across categories (inter-category). We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and the DIA association factor (DIA) when the Naive Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
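
The abstract leaves the CMFS formula unstated; one plausible instantiation of "intra- plus inter-category significance" scores a term by how frequent it is inside a category, P(t|c), times how concentrated it is in that category, P(c|t). A sketch under that assumption:

```python
# Toy term-significance scoring in the spirit of CMFS (assumed form:
# P(t|c) * P(c|t) with Laplace smoothing); counts are hypothetical.
import numpy as np

tf = np.array([          # tf[c, t] = frequency of term t in category c
    [30,  2,  5],
    [ 1, 25,  4],
])

p_t_given_c = (tf + 1) / (tf.sum(axis=1, keepdims=True) + tf.shape[1])
p_c_given_t = (tf + 1) / (tf.sum(axis=0, keepdims=True) + tf.shape[0])
score = p_t_given_c * p_c_given_t      # per (category, term) significance

term_score = score.max(axis=0)         # a term's best category score
top_k = np.argsort(term_score)[::-1][:2]
print(top_k, term_score[top_k])        # the two most discriminative terms
```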

114 citations


Journal ArticleDOI
TL;DR: This work presents an ontology-based retrieval approach that supports data organization and visualization, provides a friendly navigation model, and exploits the fuzzy extension of Formal Concept Analysis to elicit conceptualizations from datasets and generate a hierarchy-based representation of extracted knowledge.
Abstract: In recent years, knowledge structuring has assumed important roles in several real-world applications such as decision support, cooperative problem solving, e-commerce, the Semantic Web and even planning systems. Ontologies play an important role in supporting automated processes to access information and are at the core of new strategies for the development of knowledge-based systems. Yet, developing an ontology is a time-consuming task which often requires accurate domain expertise to tackle structural and logical difficulties in the definition of concepts as well as conceivable relationships. This work presents an ontology-based retrieval approach that supports data organization and visualization and provides a friendly navigation model. It exploits the fuzzy extension of the Formal Concept Analysis theory to elicit conceptualizations from datasets and generate a hierarchy-based representation of extracted knowledge. An intuitive graphical interface provides a multi-faceted view of the built ontology. Through transparent query-based retrieval, final users navigate across concepts, relations and population.

107 citations


Journal ArticleDOI
TL;DR: The conclusions are that the h-index is preferable to the impact factor for a variety of reasons, especially the selective coverage of the impact factor and the fact that it disadvantages journals that publish many papers.
Abstract: This paper considers the use of the h-index as a measure of a journal's research quality and contribution. We study a sample of 455 journals in business and management, all of which are included in the ISI Web of Science (WoS) and the Association of Business Schools' peer-review journal ranking list. The h-index is compared with both the traditional impact factor and with peer-review judgements. We also consider two sources of citation data - the WoS itself and Google Scholar. The conclusions are that the h-index is preferable to the impact factor for a variety of reasons, especially the selective coverage of the impact factor and the fact that it disadvantages journals that publish many papers. Google Scholar is also preferred to WoS as a data source. However, the paper notes that no single metric is sufficient to properly evaluate research achievements.
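
For reference, the h-index the paper adopts is simple to compute: the largest h such that the journal has h papers with at least h citations each.

```python
# h-index from a list of per-paper citation counts (hypothetical numbers).
def h_index(citations):
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4
```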

80 citations


Journal ArticleDOI
TL;DR: Interestingly, gratification factors for mobile content contribution were found to have significant effects on mobile content retrieval intention and vice versa, while the self-gratification factor for content contribution had a significant negative effect on content retrieval intention.
Abstract: Using the uses and gratifications (UnG) theory, this paper explores the gratification factors for which people contribute and retrieve mobile content. Through the deployment of MobiTOP, a mobile content sharing application, it was found that perceived gratification factors for mobile content contribution were different from those for mobile content retrieval. In particular, factors which had significant positive effects on content contribution stemmed from leisure/entertainment and easy access. Factors fuelling content retrieval included the efficient provision of information resources/services and the need for high quality information, both of which tend to be information-centric. Interestingly, gratification factors for mobile content contribution were also found to have significant effects on mobile content retrieval intention and vice versa. Specifically, the access gratification factor had a significant positive effect on content retrieval intention while the self-gratification factor for content contribution had a significant negative effect on content retrieval intention.

74 citations


Journal ArticleDOI
TL;DR: This work provides a detailed analysis of four MapReduce indexing strategies of varying complexity, and concludes that MapReduce is a suitable framework for the deployment of large-scale indexing.
Abstract: In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework, and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies, and for the most efficient strategy, we examine how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for I/O-intensive tasks like indexing, and the suitability of the per-posting-list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
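
To make the indexing strategies concrete, here is a single-machine sketch of MapReduce-style inverted indexing, loosely in the spirit of the per-posting-list strategy: the map phase emits (term, posting) pairs and the reduce phase assembles one posting list per term. This is not the paper's Hadoop implementation, which shards and sorts across machines.

```python
# Minimal MapReduce-flavoured inverted indexing sketch (toy corpus).
from collections import Counter, defaultdict

docs = {1: "to be or not to be", 2: "to do is to be"}

def map_phase(doc_id, text):
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)            # emit one posting per term

def reduce_phase(pairs):
    index = defaultdict(list)
    for term, posting in pairs:
        index[term].append(posting)         # assemble posting lists
    return index

pairs = [p for d, t in docs.items() for p in map_phase(d, t)]
index = reduce_phase(sorted(pairs))         # "shuffle": sort by term
print(index["to"])                          # -> [(1, 2), (2, 2)]
```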

72 citations


Journal ArticleDOI
TL;DR: This survey investigates the key techniques and impacts in the use of PIR and AH technology in order to identify such affordances and limitations, and illustrates an example of a potential synergy in a hybridised approach, where adaptation can be tailored in different aspects of PIR and AH systems.
Abstract: The degree to which a user's search and presentation experience is adapted to individual user properties and contexts of use is becoming a key driver for next-generation web information retrieval systems. Over the past decades, two parallel threads of personalisation research have emerged, one originating in the document space in the area of Personalised Information Retrieval (PIR) and the other arising from the hypertext space in the field of Adaptive Hypermedia (AH). PIR typically aims to bias search results towards more personally relevant information by modifying traditional document ranking algorithms. Such techniques tend to represent users with simplified personas (often based on historic interests), enabling the efficient calculation of personalised ranked lists. The field of Adaptive Hypermedia, on the other hand, has addressed the challenge of biasing content retrieval and presentation by adapting towards multiple characteristics. These characteristics, more typically called personalisation ''dimensions'', include user goals or prior knowledge, enabling adaptive and personalised result compositions and navigations. The question arises as to whether it is possible to compare PIR and AH in a way that exposes their respective strengths and limitations, but also identifies potential complementary affordances. This survey investigates the key techniques and impacts in the use of PIR and AH technology in order to identify such affordances and limitations. In particular, the techniques are analysed by examining key activities in the retrieval process, namely (i) query adaptation, (ii) adaptive retrieval and (iii) adaptive result composition and presentation. In each of these areas, the survey identifies individual strengths and limitations. Following this comparison of techniques, the paper also illustrates an example of a potential synergy in a hybridised approach, where adaptation can be tailored in different aspects of PIR and AH systems. Moreover, the concerns resulting from interdependencies and the respective tradeoffs of techniques are discussed, along with potential future directions and remaining challenges.

71 citations


Journal ArticleDOI
TL;DR: An empirical analysis of evolving knowledge networks of successful patent collaboration at the national level in the 1980s, 1990s, and 2000s indicates that wide and deep participation in international knowledge interactions may contribute greatly to economic competitiveness.
Abstract: In this paper, we provide an empirical analysis of evolving knowledge networks of successful patent collaboration at the national level in the 1980s, 1990s, and 2000s. All countries are classified into main knowledge creators (the Organisation for Economic Co-operation and Development (OECD) group) and main knowledge users (the non-OECD group) in order to distinguish specific characteristics of knowledge interactions within and between groups. The analyses are carried out from four aspects, i.e., the overall distribution of knowledge interactions among countries, the countries' ability to inhibit and facilitate the knowledge flows among others with the help of flow betweenness measures, the countries' bridgeness between the two groups with the recently developed Q-measures, and the most important bilateral knowledge interactions. Results show that although most of the international knowledge interactions still take place within the OECD group, the non-OECD countries have improved their performance significantly. They participate much more in international patenting and collaboration and play much more important roles in facilitating knowledge interactions among others. Among them, China and Taiwan are the two most dazzling new stars in terms of their performance in international knowledge interactions. Considered together with their rapidly improving world competitiveness, these findings indicate that wide and deep participation in international knowledge interactions may contribute greatly to economic competitiveness.

64 citations


Journal ArticleDOI
TL;DR: A new model for aggregating multiple criteria evaluations for relevance assessment by considering the existence of a prioritization relationship over the criteria is proposed, where relevance is modeled as a multidimensional property of documents.
Abstract: A new model for aggregating multiple criteria evaluations for relevance assessment is proposed. An Information Retrieval context is considered, where relevance is modeled as a multidimensional property of documents. The usefulness and effectiveness of such a model are demonstrated by means of a case study on personalized Information Retrieval with multi-criteria relevance. The following criteria are considered to estimate document relevance: aboutness, coverage, appropriateness, and reliability. The originality of this approach lies in aggregating the considered criteria in a prioritized way, by considering the existence of a prioritization relationship over the criteria. Such a prioritization is modeled by making the weights associated with a criterion dependent upon the satisfaction of the higher-priority criteria. This way, it is possible to take into account the fact that the weight of a less important criterion should be proportional to the satisfaction degree of the more important criterion. Experimental evaluations are also reported.
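
The prioritization scheme described above can be illustrated with a common form of prioritized scoring in which each criterion's weight is the product of the satisfaction degrees of all higher-priority criteria. This is a hedged sketch, with an assumed priority order and hypothetical satisfaction degrees, not necessarily the paper's exact operator.

```python
# Prioritized aggregation sketch: a low-priority criterion only matters
# to the extent the criteria above it are satisfied.
def prioritized_score(satisfactions):
    weights, w = [], 1.0
    for s in satisfactions:          # criteria ordered by decreasing priority
        weights.append(w)
        w *= s                       # next weight scaled by this satisfaction
    total = sum(w_i * s_i for w_i, s_i in zip(weights, satisfactions))
    return total / sum(weights)

# assumed order: aboutness, coverage, appropriateness, reliability
print(prioritized_score([0.9, 0.6, 0.8, 0.7]))
```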

Journal ArticleDOI
TL;DR: A new set of indices, based on time, volume, and frequency, provides a more precise set of prediction indices for emerging topic detection and gives a promising indication of emerging topics in conferences and journals.
Abstract: Emerging topic detection is a vital research area for researchers and scholars interested in searching for and tracking new research trends and topics. Current text mining and data mining methods used for this purpose focus only on the frequency with which subjects are mentioned, and ignore the novelty of the subject, which is also critical but beyond the scope of a frequency study. This work tackles this inadequacy by proposing a new set of indices for emerging topic detection: the novelty index (NI) and the published volume index (PVI). This new set of indices is based on time, volume, and frequency, and provides a more precise set of prediction indices. They are then utilized to determine the detection point (DP) of new emerging topics. Following the detection point, the intersection of these indices decides the worth of a new topic. The algorithms presented in this paper can be used to decide the novelty and life span of an emerging topic in a specific field. The entire comprehensive collection of the ACM Digital Library is examined in the experiments. The application of the NI and PVI gives a promising indication of emerging topics in conferences and journals.

Journal ArticleDOI
TL;DR: This paper presents the anonymization of query logs using microaggregation, guaranteeing the k-anonymity of the users in the query log, while preserving its utility, and provides the evaluation of the proposal in real query logs, showing the privacy and utility achieved.
Abstract: The anonymization of query logs is an important process that needs to be performed prior to the publication of such sensitive data. It ensures the anonymity of the users in the logs, a problem that has already surfaced in released logs from well-known companies. This paper presents the anonymization of query logs using microaggregation. Our proposal ensures the k-anonymity of the users in the query log while preserving its utility. We evaluate our proposal on real query logs, showing the privacy and utility achieved, and provide estimations for the use of such data in data mining processes based on clustering.
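
The core of microaggregation is easy to sketch: partition user records into groups of at least k similar members and release each group's centroid instead of the individual records. The sketch below assumes users are already reduced to numeric feature vectors and uses a crude one-dimensional ordering rather than the paper's actual partitioning method.

```python
# k-anonymity via microaggregation (toy version; real methods such as MDAV
# use multivariate clustering, and query logs need a numeric encoding first).
import numpy as np

def microaggregate(X, k):
    order = np.argsort(X[:, 0])            # crude ordering heuristic
    X_anon = X.astype(float)
    for start in range(0, len(X), k):
        group = order[start:start + k]
        if len(group) < k:                 # fold remainder into last group
            group = order[start - k:]
        X_anon[group] = X[group].mean(axis=0)   # replace with centroid
    return X_anon

X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 7.0], [5.1, 6.8], [0.9, 2.1]])
print(microaggregate(X, 2))
```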

Journal ArticleDOI
TL;DR: This work sketches the information retrieval intellectual landscape through visualizations of citation behaviors, addressing information retrieval's co-authorship network, highly productive authors, highly cited journals and papers, author-assigned keywords, active institutions, and the import of ideas from other disciplines.
Abstract: Information retrieval is a long-established subfield of library and information science. Since its inception in the early to mid-1950s, it has grown as a result, in part, of well-regarded retrieval system evaluation exercises/campaigns, the proliferation of Web search engines, and the expansion of digital libraries. Although researchers have examined the intellectual structure and nature of the general field of library and information science, the same cannot be said about the subfield of information retrieval. We address that in this work by sketching the information retrieval intellectual landscape through visualizations of citation behaviors. Citation data for 10 years (2000-2009) were retrieved from the Web of Science and analyzed using existing visualization techniques. Our results address information retrieval's co-authorship network, highly productive authors, highly cited journals and papers, author-assigned keywords, active institutions, and the import of ideas from other disciplines.

Journal ArticleDOI
TL;DR: This work presents the Permutation Prefix Index, an index data structure that supports efficient approximate similarity search and shows how the effectiveness can easily reach optimal levels just by adopting two ''boosting'' strategies: multiple index search and multiple query search, which both have nice parallelization properties.
Abstract: We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search (this work is a revised and extended version of Esuli (2009b), presented at the 2009 LSDS-IR Workshop, held in Boston). The PP-Index belongs to the family of permutation-based indexes, which represent any indexed object with ''its view of the surrounding world'', i.e., a list of the elements of a set of reference objects sorted by their distance order with respect to the indexed object. In its basic formulation, the PP-Index is strongly biased toward efficiency. We show how the effectiveness can easily reach optimal levels just by adopting two ''boosting'' strategies: multiple index search and multiple query search, both of which have nice parallelization properties. We study both the efficiency and the effectiveness properties of the PP-Index, experimenting with collections of sizes up to one hundred million objects, represented in a very high-dimensional similarity space.
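
The permutation-prefix idea is compact enough to sketch: describe each object by the identifiers of a fixed set of reference objects sorted by distance, and key the index on a short prefix of that permutation, so objects sharing a prefix become candidate neighbours. A hedged toy version follows (the actual PP-Index is considerably more engineered).

```python
# Permutation-prefix bucketing sketch on random 2-D points.
import numpy as np

rng = np.random.default_rng(0)
refs = rng.random((8, 2))                 # reference objects (pivots)

def perm_prefix(x, l=3):
    dists = np.linalg.norm(refs - x, axis=1)
    return tuple(np.argsort(dists)[:l])   # ids of the l closest pivots

index = {}
for i, obj in enumerate(rng.random((100, 2))):
    index.setdefault(perm_prefix(obj), []).append(i)

query = rng.random(2)
print(index.get(perm_prefix(query), []))  # candidate set for the query
```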

Journal ArticleDOI
TL;DR: An associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) to specific authors is introduced.
Abstract: Authorship disambiguation is an urgent issue that affects the quality of digital library services and for which supervised solutions have been proposed, delivering state-of-the-art effectiveness. However, particular challenges such as the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors from which many different disambiguation functions may be derived), and the skewed author popularity distribution (few authors are very prolific, while most appear in only a few citations) may prevent the full potential of such techniques. In this article, we introduce an associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) to specific authors. As our main contribution we propose three associative author name disambiguators: (1) EAND (Eager Associative Name Disambiguation), our basic method that explores association rules for name disambiguation; (2) LAND (Lazy Associative Name Disambiguation), which extracts rules on a demand-driven basis at disambiguation time, reducing the hypothesis space by focusing on examples that are most suitable for the task; and (3) SLAND (Self-Training LAND), which extends LAND with self-training capabilities, thus drastically reducing the amount of examples required for building effective disambiguation functions, besides being able to detect novel/unseen authors in the test set. Experiments demonstrate that all our disambiguators are effective and that, in particular, SLAND is able to outperform state-of-the-art supervised disambiguators, providing gains that range from 12% to more than 400%, being extremely effective and practical.
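
The associative idea can be illustrated in miniature: extract rules that associate citation features with authors, then let the rules that match a new citation vote. This toy sketch glosses over the support/confidence mining, the lazy demand-driven variant, and the self-training that distinguish EAND, LAND, and SLAND.

```python
# Feature -> author voting sketch for name disambiguation (hypothetical data).
from collections import defaultdict

train = [
    ({"coauthor:j.smith", "venue:IPM"},   "A. Gupta (1)"),
    ({"coauthor:j.smith", "venue:SIGIR"}, "A. Gupta (1)"),
    ({"coauthor:l.chen",  "venue:KDD"},   "A. Gupta (2)"),
]

rules = defaultdict(lambda: defaultdict(int))   # feature -> author -> count
for feats, author in train:
    for f in feats:
        rules[f][author] += 1

def disambiguate(feats):
    votes = defaultdict(float)
    for f in feats:
        total = sum(rules[f].values())
        for author, count in rules[f].items():
            votes[author] += count / total      # rule confidence as the vote
    return max(votes, key=votes.get) if votes else None

print(disambiguate({"coauthor:j.smith", "venue:IPM"}))
```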

Journal ArticleDOI
TL;DR: An overview of automatic methods for building domain knowledge structures (domain models) from text collections inspired by the ubiquitous propagation of domain model structures that are emerging in several research disciplines is given.
Abstract: This paper presents an overview of automatic methods for building domain knowledge structures (domain models) from text collections. Applications of domain models have a long history within knowledge engineering and artificial intelligence. In the last couple of decades they have surfaced noticeably as a useful tool within natural language processing, information retrieval and semantic web technology. Inspired by the ubiquitous propagation of domain model structures that are emerging in several research disciplines, we give an overview of the current research landscape and some techniques and approaches. We will also discuss trade-offs between different approaches and point to some recent trends.

Journal ArticleDOI
TL;DR: This study investigated whether and how different factors in relation to task, user-perceived knowledge, search process, and system affect users' search tactic selection, finding that seven factors were significantly associated with tactic selection.
Abstract: This study investigated whether and how different factors in relation to task, user-perceived knowledge, search process, and system affect users' search tactic selection. Thirty-one participants, representing the general public with their own tasks, were recruited for this study. Multiple methods were employed to collect data, including pre-questionnaires, verbal protocols, log analysis, diaries, and post-questionnaires. Statistical analysis revealed that seven factors were significantly associated with tactic selection: work task types, search task types, familiarity with topic, search skills, search session length, search phases, and system types. Moreover, the study also discovered, qualitatively, in what ways these factors influence the selection of search tactics. Based on the findings, the authors discuss practical implications for system design to support users' application of multiple search tactics for each factor.

Journal ArticleDOI
TL;DR: A greedy heuristic to prioritize items for caching is proposed, based on gains computed from items' past access frequencies, estimated computational costs, and storage overheads; it performs better than dividing the entire cache space among particular item types at fixed proportions.
Abstract: Caching is a crucial performance component of large-scale web search engines, as it greatly helps reducing average query response times and query processing workloads on backend search clusters. In this paper, we describe a multi-level static cache architecture that stores five different item types: query results, precomputed scores, posting lists, precomputed intersections of posting lists, and documents. Moreover, we propose a greedy heuristic to prioritize items for caching, based on gains computed by using items' past access frequencies, estimated computational costs, and storage overheads. This heuristic takes into account the inter-dependency between individual items when making its caching decisions, i.e., after a particular item is cached, gains of all items that are affected by this decision are updated. Our simulations under realistic assumptions reveal that the proposed heuristic performs better than dividing the entire cache space among particular item types at fixed proportions.
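
The gain-based prioritization lends itself to a short sketch: rank candidate items by (frequency times saved cost) per unit of storage and fill the cache greedily. The numbers below are hypothetical, and the inter-dependency updates the paper performs after each caching decision are omitted.

```python
# Greedy static cache filling by gain per byte (toy numbers).
items = [  # (name, access_freq, saved_cost, size), hypothetical values
    ("result:q1",   900, 1.0,  2),
    ("postings:t5", 400, 5.0, 40),
    ("doc:d9",      150, 2.0, 10),
]

def greedy_cache(items, capacity):
    ranked = sorted(items, key=lambda it: it[1] * it[2] / it[3], reverse=True)
    cached, used = [], 0
    for name, freq, cost, size in ranked:
        if used + size <= capacity:
            cached.append(name)
            used += size
    return cached

print(greedy_cache(items, capacity=45))   # -> ['result:q1', 'postings:t5']
```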

Journal ArticleDOI
TL;DR: This paper proposes a weighted consensus summarization method to combine the results from single summarization systems, and this method outperforms other combination methods.
Abstract: Multi-document summarization is a fundamental tool for document understanding and has received much attention recently. Given a collection of documents, a variety of summarization methods based on different strategies have been proposed to extract the most important sentences from the original documents. However, very few studies have been reported on aggregating different summarization methods to possibly generate better summary results. In this paper, we propose a weighted consensus summarization method to combine the results from single summarization systems. We evaluate and compare our proposed weighted consensus method with various baseline combination methods. Experimental results on DUC2002 and DUC2004 data sets demonstrate the performance improvement by aggregating multiple summarization systems, and our proposed weighted consensus summarization method outperforms other combination methods.
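
One simple way to picture consensus combination: each summarizer ranks the candidate sentences, and a weighted aggregate of those ranks scores every sentence. In the sketch below the system weights are fixed by hand; the paper instead learns them by optimizing consensus among the rankings.

```python
# Weighted rank aggregation sketch for multi-system summarization.
import numpy as np

# rank of 5 candidate sentences under 3 systems (1 = best), hypothetical
ranks = np.array([
    [1, 2, 3, 4, 5],
    [2, 1, 5, 3, 4],
    [1, 3, 2, 5, 4],
])
weights = np.array([0.5, 0.2, 0.3])      # assumed system weights

consensus = (weights[:, None] / ranks).sum(axis=0)  # weighted reciprocal rank
summary = np.argsort(consensus)[::-1][:2]           # take the top-2 sentences
print(summary, consensus.round(2))
```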

Journal ArticleDOI
TL;DR: Results of the experiments show that with the support of the proposed recommendation mechanism, requesters in forums can easily find similar discussion threads to avoid spamming the same discussion and this mechanism provides a relatively efficient and active way to find the appropriate experts.
Abstract: Nowadays, online forums have become a useful tool for knowledge management in Web-based technology. This study proposes a social recommender system which generates discussion thread and expert recommendations based on semantic similarity, profession and reliability, social intimacy and popularity, and social network-based Markov Chain (SNMC) models for knowledge sharing in online forum communities. The advantage of the proposed mechanism is its relatively comprehensive consideration of the aspects of knowledge sharing. Accordingly, results of our experiments show that with the support of the proposed recommendation mechanism, requesters in forums can easily find similar discussion threads to avoid spamming the same discussion. In addition, if the requesters cannot find qualified discussion threads, this mechanism provides a relatively efficient and active way to find the appropriate experts.

Journal ArticleDOI
TL;DR: A model of a mobile PIM agent (PIMA) that aims to improve PIM on mobile devices through a natural language interface and application integration is proposed, and a user study is conducted to evaluate PIMA empirically with prototype systems.
Abstract: Managing personal information such as to-dos and contacts has become one of our daily routines, consuming more time than needed. Existing PIM tools require extensive involvement of human users. This becomes a problem when using mobile devices due to their physical constraints. To address the limitations of traditional PIM tools, we propose a model of a mobile PIM agent (PIMA) that aims to improve PIM on mobile devices through a natural language interface and application integration. We conducted a user study to evaluate PIMA empirically with prototype systems. The results show that the mobile PIMA improved perceived usefulness, ease of use, and efficiency of PIM on mobile devices, which in turn accounted for positive attitude and intention to use the system. The findings of this study provide suggestions for designing and developing PIM applications on mobile devices.

Journal ArticleDOI
TL;DR: Findings confirm and extend those of previous studies, providing strong statistical evidence of an association between the information search process and users' choices of relevance criteria, and identifying specific changes in user preferences for specific criteria over the course of the information search process.
Abstract: Relevance judgments occur within an information search process, where time, context and situation can impact the judgments. The determination of relevance is dependent on a number of factors and variables, which include the criteria used to determine relevance. The relevance judgment process and the criteria used to make those judgments are manifestations of the cognitive changes which occur during the information search process. Understanding why these relevance criteria choices are made, and how they vary over the information search process, can provide important information about the dynamic relevance judgment process. This information can be used to guide the development of more adaptive information retrieval systems which respond to the cognitive changes of users during the information search process. The research data analyzed here were collected in two separate studies which examined a subject's relevance judgments over an information search process. Statistical analysis was used to examine these results and determine whether there were relationships between criteria selections, relevance judgments, and the subject's progression through the information search process. Findings confirm and extend those of previous studies, providing strong statistical evidence of an association between the information search process and users' choices of relevance criteria, and identifying specific changes in user preferences for specific criteria over the course of the information search process.

Journal ArticleDOI
TL;DR: This paper presents a novel categorization method, the three-phase categorization (TPC) algorithm, which classifies patents down to the subgroup level with reasonable accuracy; experimental results indicate that the TPC algorithm can achieve 36.07% accuracy at the subgroup level.
Abstract: An automatic patent categorization system would be invaluable to individual inventors and patent attorneys, saving them time and effort by quickly identifying conflicts with existing patents. In recent years, it has become more and more common to classify all patent documents using the International Patent Classification (IPC), a complex hierarchical classification system comprising eight sections, 128 classes, 648 subclasses, about 7200 main groups, and approximately 72,000 subgroups. So far, however, no patent categorization method has been developed that can classify patents down to the subgroup level (the bottom level of the IPC). Therefore, this paper presents a novel categorization method, the three-phase categorization (TPC) algorithm, which classifies patents down to the subgroup level with reasonable accuracy. The experimental results for the TPC algorithm, using the WIPO-alpha collection, indicate that our classification method can achieve 36.07% accuracy at the subgroup level. This is approximately a 25,764-fold improvement over a random guess.

Journal ArticleDOI
TL;DR: A generative probabilistic model is proposed, describing categories by distributions, handling the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search.
Abstract: The automated classification of texts into predefined categories has witnessed a booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to category unbalancedness and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning, from a set of preclassified documents, the characteristics of the categories. We propose a generative probabilistic model, describing categories by distributions, which handles the feature selection problem by introducing a binary exclusion/inclusion latent vector that is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
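
The Metropolis-updated inclusion vector is the most concrete piece of the method, and it sketches well: flip one feature's bit, score the new subset, and accept the flip with the usual Metropolis probability. The scoring function below is a stand-in (correlation fit minus a complexity penalty on toy data), not the paper's generative model.

```python
# Metropolis search over a binary feature-inclusion vector (toy objective).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(float)    # only features 0 and 3 matter

def score(mask):
    if mask.sum() == 0:
        return -np.inf
    sel = np.where(mask)[0]
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in sel]
    return sum(corr) - 0.1 * mask.sum()      # fit minus complexity penalty

mask = rng.integers(0, 2, size=10).astype(bool)
for _ in range(2000):
    prop = mask.copy()
    j = rng.integers(10)
    prop[j] = ~prop[j]                       # propose flipping one bit
    if np.log(rng.random()) < score(prop) - score(mask):  # Metropolis accept
        mask = prop
print(np.where(mask)[0])                     # should settle near {0, 3}
```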

Journal ArticleDOI
TL;DR: Two different methods are proposed for estimating the energy consumption and GHG emissions of an IR system or service, along with the four key enablers of Green IR: Standardize, Share, Reuse and Green behavior.
Abstract: Nowadays we use information retrieval systems and services as part of many day-to-day activities, ranging from web and database search to searching various digital libraries, audio and video collections/services, and so on. However, IR systems and services make extensive use of ICT (information and communication technologies), and increasing use of ICT can significantly increase greenhouse gas (GHG) emissions. Sustainable development, and more importantly environmental sustainability, has become a major area of concern for various national and international bodies, and as a result various initiatives and measures are being proposed for reducing the environmental impact of industries, businesses, governments and institutions. Research also shows that appropriate use of ICT can reduce the overall GHG emissions of a business, product or service. Green IT and cloud computing can play a key role in reducing the environmental impact of ICT. This paper proposes the concept of Green IR systems and services that can play a key role in reducing the overall environmental impact of various ICT-based services in education and research, business, government, etc., that are increasingly reliant on access to and use of digital information. However, to date there has not been any systematic research towards building Green IR systems and services. This paper points out the major challenges in building Green IR systems and services, and proposes two different methods for estimating the energy consumption, and the corresponding GHG emissions, of an IR system or service. It also proposes the four key enablers of Green IR, viz. Standardize, Share, Reuse and Green behavior, and outlines the further research required to achieve them.

Journal ArticleDOI
TL;DR: It is concluded that both quality and accessibility influence the selection of human information sources, although quality exerts a slightly stronger influence.
Abstract: This study focuses on how the accessibility and quality of co-workers in organizations affect their use as information sources. Prior research has produced inconsistent findings concerning these factors' respective influence on source selection. In this article, we argue that one potential reason for this lies in the lack of coherent definitions of accessibility and quality. To bridge this gap, we unpack these concepts into their underlying dimensions, based on insights derived from social capital theory, more specifically Nahapiet and Ghoshal's (1998) contribution to uncovering the multidimensionality of social capital. We empirically test the dimensionality of accessibility and quality, as well as the relative influence of these concepts on human information source selection, in a scenario experiment within an organization. Findings support the proposed dimensionality, and lead to the conclusion that both quality and accessibility influence the selection of human information sources, although quality exerts a slightly stronger influence.

Journal ArticleDOI
TL;DR: This research investigates how people's perceptions of information retrieval (IR) systems, their perceptions of search tasks, and their perception of self-efficacy influence the amount of invested mental effort (AIME) they put into using two different IR systems: a Web search engine and a library system.
Abstract: This research investigates how people's perceptions of information retrieval (IR) systems, their perceptions of search tasks, and their perceptions of self-efficacy influence the amount of invested mental effort (AIME) they put into using two different IR systems: a Web search engine and a library system. It also explores the impact of mental effort on an end user's search experience. To assess AIME in online searching, two experiments were conducted using these methods: Experiment 1 relied on self-reports and Experiment 2 employed the dual-task technique. In both experiments, data were collected through search transaction logs, a pre-search background questionnaire, a post-search questionnaire and an interview. Important findings are these: (1) subjects invested greater mental effort searching a library system than searching the Web; (2) subjects put little effort into Web searching because of their high sense of self-efficacy in their searching ability and their perception of the easiness of the Web; (3) subjects did not recognize that putting mental effort into searching was something needed to improve the search results; and (4) data collected from multiple sources proved to be effective for assessing mental effort in online searching.

Journal ArticleDOI
TL;DR: This work proposes a distributed index structure for similarity data management called the Metric Index (M-Index), which can answer queries in a precise and approximate manner; its usability is demonstrated by a full-featured, publicly available Web application.
Abstract: Metric space is a universal and versatile model of similarity that can be applied in various areas of non-text information retrieval. However, a general, efficient and scalable solution for metric data management is still a persistent research challenge. In this work, we try to make an important step towards a management system able to scale to data collections of billions of objects. We propose a distributed index structure for similarity data management called the Metric Index (M-Index), which can answer queries in a precise and approximate manner. This technique can take advantage of any distributed hash table that supports interval queries, utilizing it as an underlying index. We have performed numerous experiments to test various settings of the M-Index structure, and we have demonstrated its usability by developing a full-featured, publicly available Web application.

Journal ArticleDOI
TL;DR: An efficient and effective solution to the problem of choosing the queries to suggest to web search engine users, in order to help them rapidly satisfy their information needs, is proposed; it remarkably outperforms two other state-of-the-art solutions.
Abstract: This paper proposes an efficient and effective solution to the problem of choosing the queries to suggest to web search engine users in order to help them rapidly satisfy their information needs. By exploiting a weak function for assessing the similarity between the current query and the knowledge base built from historical users' sessions, we reduce the suggestion generation phase to the processing of a full-text query over an inverted index. The resulting query recommendation technique is very efficient and scalable, and is less affected by the data-sparsity problem than most state-of-the-art proposals. Thus, it is particularly effective in generating suggestions for rare queries occurring in the long tail of the query popularity distribution. The quality of the generated suggestions is assessed by evaluating their effectiveness in forecasting users' behavior recorded in historical query logs, and on the basis of the results of a reproducible user study conducted on publicly available, human-assessed data. The experimental evaluation shows that our proposal remarkably outperforms two other state-of-the-art solutions, and that it can generate useful suggestions even for rare and never-seen queries.
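
The key move, treating suggestion generation as full-text retrieval, can be sketched briefly: index historical sessions as tiny "documents" of query terms, match the current query against that index with a weak similarity, and offer the closing queries of the best-matching sessions. Session data and scoring below are hypothetical simplifications.

```python
# Session-index query suggestion sketch (toy data, term-overlap scoring).
from collections import defaultdict

sessions = [
    ["cheap flights", "cheap flights europe", "budget airlines europe"],
    ["flights paris", "paris weekend deals"],
]

index = defaultdict(set)                    # term -> ids of sessions using it
for sid, queries in enumerate(sessions):
    for term in " ".join(queries).split():
        index[term].add(sid)

def suggest(query, k=2):
    scores = defaultdict(int)
    for term in query.split():              # weak term-overlap similarity
        for sid in index[term]:
            scores[sid] += 1
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sessions[sid][-1] for sid in best]  # each session's final query

print(suggest("cheap flights paris"))
```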