
Showing papers in "Information Processing and Management in 2012"


Journal ArticleDOI
TL;DR: This paper reports on the first attempts to combine crowdsourcing and TREC: the aim is to validate the use of crowdsourcing for relevance assessment, using the Amazon Mechanical Turk crowdsourcing platform to run experiments on TREC data, evaluate the outcomes, and discuss the results.
Abstract: Crowdsourcing has recently gained a lot of attention as a tool for conducting different kinds of relevance evaluations. At a very high level, crowdsourcing describes the outsourcing of tasks to a large group of people instead of assigning such tasks to an in-house employee. This approach makes it possible to conduct information retrieval experiments extremely fast, with good results, at a low cost. This paper reports on the first attempts to combine crowdsourcing and TREC: our aim is to validate the use of crowdsourcing for relevance assessment. To this aim, we use the Amazon Mechanical Turk crowdsourcing platform to run experiments on TREC data, evaluate the outcomes, and discuss the results. We place emphasis on experiment design, execution, and quality control to gather useful results, with particular attention to the issue of agreement among assessors. Our position, supported by the experimental results, is that crowdsourcing is a cheap, quick, and reliable alternative for relevance assessment.
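
A quick way to quantify the assessor agreement the authors focus on is Cohen's kappa between two workers' relevance labels. The sketch below is illustrative only; the labels are hypothetical and the paper's own agreement analysis may use different statistics.

```python
# Cohen's kappa between two assessors' binary relevance labels
# (hypothetical data; shown only to make the "agreement" notion concrete).
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n ** 2     # chance agreement
    return (p_o - p_e) / (1 - p_e)

worker1 = [1, 1, 0, 1, 0, 0, 1, 0]
worker2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohen_kappa(worker1, worker2))  # -> 0.5
```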

159 citations


Journal ArticleDOI
TL;DR: Results suggest that scholars' research performance is significantly correlated with their ego-network measures, and that scholars with efficient collaboration networks who maintain a strong co-authorship relationship with one primary co-author within a group of linked co-authors perform better than researchers with many relationships to the same group of linked co-authors.
Abstract: In this study, we propose and validate a social network-based theoretical model for exploring how scholars' collaboration (co-authorship) network properties are associated with their citation-based research performance (i.e., g-index). Using structural holes theory, we focus on how a scholar's egocentric network properties of density, efficiency, and constraint within the network are associated with their scholarly performance. For our analysis, we use publication data of high impact factor journals in the field of ''Information Science & Library Science'' between 2000 and 2009, extracted from Scopus. The resulting database contained 4837 publications reflecting the contributions of 8069 authors. Results from our data analysis suggest that scholars' research performance is significantly correlated with their ego-network measures. In particular, scholars with more co-authors and those who exhibit higher levels of betweenness centrality (i.e., the extent to which a co-author is between another pair of co-authors) perform better in terms of research (i.e., higher g-index). Furthermore, scholars with efficient collaboration networks who maintain a strong co-authorship relationship with one primary co-author within a group of linked co-authors (i.e., co-authors that have joint publications) perform better than those researchers with many relationships to the same group of linked co-authors.
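
The ego-network measures named above (betweenness centrality, efficiency, Burt's constraint) are directly computable with standard tools. Below is a minimal sketch with networkx on a hypothetical co-authorship graph, not the paper's Scopus data.

```python
# Ego-network measures on a toy co-authorship graph. Node "A" bridges two
# clusters, so it should show high betweenness and high efficiency.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),   # a tightly linked group
    ("A", "D"), ("D", "E"),               # A bridges to a second cluster
])

betweenness = nx.betweenness_centrality(G)   # brokerage position
effective_size = nx.effective_size(G)        # non-redundant contacts
constraint = nx.constraint(G)                # Burt's constraint
efficiency = {n: effective_size[n] / G.degree(n) for n in G}

for n in sorted(G):
    print(n, round(betweenness[n], 2), round(efficiency[n], 2),
          round(constraint[n], 2))
```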

156 citations


Journal ArticleDOI
TL;DR: The hypothesis of this paper is that the results obtained by applying traditional similarity measures can be improved by taking contextual information, drawn from the entire body of users, and using it to calculate the singularity which exists, for each item, in the votes cast by each pair of users being compared.
Abstract: Recommender systems play an important role in reducing the negative impact of information overload on those websites where users have the possibility of voting for their preferences on items. The most common technique for dealing with the recommendation mechanism is collaborative filtering, in which it is essential to discover the users most similar to the one for whom recommendations are intended. The hypothesis of this paper is that the results obtained by applying traditional similarity measures can be improved by taking contextual information, drawn from the entire body of users, and using it to calculate the singularity which exists, for each item, in the votes cast by each pair of users being compared. As such, the greater the singularity of the votes cast by two given users, the greater its impact on their similarity. The results, tested on the Movielens, Netflix and FilmAffinity databases, corroborate the excellent behaviour of the proposed singularity measure.
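
The abstract does not give the singularity formula, but the idea can be made concrete: a vote is singular when few users cast it, and agreement between two users on singular votes should weigh more. A hedged sketch under that assumption:

```python
# Singularity-weighted similarity sketch (binary ratings, hypothetical data).
# This illustrates the idea, not the paper's exact formula.
import numpy as np

ratings = np.array([      # rows: users, cols: items; 1 = like, 0 = dislike
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

p_like = ratings.mean(axis=0)   # share of users liking each item
# rarity of each cast vote: a "like" is singular when few users like the item
singularity = np.where(ratings == 1, 1 - p_like, p_like)

def similarity(u, v):
    agree = ratings[u] == ratings[v]
    weights = singularity[u] * singularity[v]  # rare-vote agreement counts more
    return float((agree * weights).sum() / weights.sum())

print(similarity(0, 1))
```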

143 citations


Journal ArticleDOI
TL;DR: The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA) when Naive Bayes classifier is used.
Abstract: Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both within a category (intra-category) and across categories (inter-category). We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naive Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and the DIA association factor (DIA) when the Naive Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
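
The abstract leaves the CMFS formula unstated; one plausible instantiation of "intra- plus inter-category significance" scores a term by how frequent it is inside a category, P(t|c), times how concentrated it is in that category, P(c|t). A sketch under that assumption:

```python
# Toy term-significance scoring in the spirit of CMFS (assumed form:
# P(t|c) * P(c|t) with Laplace smoothing); counts are hypothetical.
import numpy as np

tf = np.array([          # tf[c, t] = frequency of term t in category c
    [30,  2,  5],
    [ 1, 25,  4],
])

p_t_given_c = (tf + 1) / (tf.sum(axis=1, keepdims=True) + tf.shape[1])
p_c_given_t = (tf + 1) / (tf.sum(axis=0, keepdims=True) + tf.shape[0])
score = p_t_given_c * p_c_given_t      # per (category, term) significance

term_score = score.max(axis=0)         # a term's best category score
top_k = np.argsort(term_score)[::-1][:2]
print(top_k, term_score[top_k])        # the two most discriminative terms
```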

114 citations


Journal ArticleDOI
TL;DR: This work presents an ontology-based retrieval approach that supports data organization and visualization, provides a friendly navigation model, and exploits the fuzzy extension of Formal Concept Analysis to elicit conceptualizations from datasets and generate a hierarchy-based representation of extracted knowledge.
Abstract: In recent years, knowledge structuring has assumed important roles in several real-world applications such as decision support, cooperative problem solving, e-commerce, the Semantic Web and even planning systems. Ontologies play an important role in supporting automated processes to access information and are at the core of new strategies for the development of knowledge-based systems. Yet, developing an ontology is a time-consuming task which often requires accurate domain expertise to tackle structural and logical difficulties in the definition of concepts as well as conceivable relationships. This work presents an ontology-based retrieval approach that supports data organization and visualization and provides a friendly navigation model. It exploits the fuzzy extension of the Formal Concept Analysis theory to elicit conceptualizations from datasets and generate a hierarchy-based representation of extracted knowledge. An intuitive graphical interface provides a multi-faceted view of the built ontology. Through transparent query-based retrieval, final users navigate across concepts, relations and population.

107 citations


Journal ArticleDOI
TL;DR: The conclusions are that the h-index is preferable to the impact factor for a variety of reasons, especially the selective coverage of the impact factor and the fact that it disadvantages journals that publish many papers.
Abstract: This paper considers the use of the h-index as a measure of a journal's research quality and contribution. We study a sample of 455 journals in business and management, all of which are included in the ISI Web of Science (WoS) and the Association of Business Schools' peer-review journal ranking list. The h-index is compared with both the traditional impact factor and with peer-review judgements. We also consider two sources of citation data - the WoS itself and Google Scholar. The conclusions are that the h-index is preferable to the impact factor for a variety of reasons, especially the selective coverage of the impact factor and the fact that it disadvantages journals that publish many papers. Google Scholar is also preferred to WoS as a data source. However, the paper notes that no single metric is sufficient to properly evaluate research achievements.
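
For reference, the h-index the paper adopts is simple to compute: the largest h such that the journal has h papers with at least h citations each.

```python
# h-index from a list of per-paper citation counts (hypothetical numbers).
def h_index(citations):
    h = 0
    for rank, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4
```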

80 citations


Journal ArticleDOI
TL;DR: Interestingly, gratification factors for mobile content contribution were found to have significant effects on mobile content retrieval intention and vice versa, while the self-gratification factor for content contribution had a significant negative effect on content retrieval intention.
Abstract: Using the uses and gratifications (UnG) theory, this paper explores the gratification factors for which people contribute and retrieve mobile content. Through the deployment of MobiTOP, a mobile content sharing application, it was found that perceived gratification factors for mobile content contribution were different from those for mobile content retrieval. In particular, factors which had significant positive effects on content contribution stemmed from leisure/entertainment and easy access. Factors fuelling content retrieval included the efficient provision of information resources/services and the need for high quality information, both of which tend to be information-centric. Interestingly, gratification factors for mobile content contribution were also found to have significant effects on mobile content retrieval intention and vice versa. Specifically, the access gratification factor had a significant positive effect on content retrieval intention while the self-gratification factor for content contribution had a significant negative effect on content retrieval intention.

74 citations


Journal ArticleDOI
TL;DR: This work provides a detailed analysis of four MapReduce indexing strategies of varying complexity, and concludes that MapReduce is a suitable framework for the deployment of large-scale indexing.
Abstract: In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework, and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies, and for the most efficient strategy, we examine how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for I/O-intensive tasks like indexing, and the suitability of the per-posting-list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
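
To make the indexing strategies concrete, here is a single-machine sketch of MapReduce-style inverted indexing, loosely in the spirit of the per-posting-list strategy: the map phase emits (term, posting) pairs and the reduce phase assembles one posting list per term. This is not the paper's Hadoop implementation, which shards and sorts across machines.

```python
# Minimal MapReduce-flavoured inverted indexing sketch (toy corpus).
from collections import Counter, defaultdict

docs = {1: "to be or not to be", 2: "to do is to be"}

def map_phase(doc_id, text):
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)            # emit one posting per term

def reduce_phase(pairs):
    index = defaultdict(list)
    for term, posting in pairs:
        index[term].append(posting)         # assemble posting lists
    return index

pairs = [p for d, t in docs.items() for p in map_phase(d, t)]
index = reduce_phase(sorted(pairs))         # "shuffle": sort by term
print(index["to"])                          # -> [(1, 2), (2, 2)]
```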

72 citations


Journal ArticleDOI
TL;DR: This survey investigates the key techniques and impacts in the use of PIR and AH technology in order to identify such affordances and limitations, and illustrates an example of a potential synergy in a hybridised approach, where adaptation can be tailored in different aspects of PIR and AH systems.
Abstract: The degree to which a user's search and presentation experience is adapted to individual user properties and contexts of use is becoming a key driver for next-generation web information retrieval systems. Over the past decades, two parallel threads of personalisation research have emerged, one originating in the document space in the area of Personalised Information Retrieval (PIR) and the other arising from the hypertext space in the field of Adaptive Hypermedia (AH). PIR typically aims to bias search results towards more personally relevant information by modifying traditional document ranking algorithms. Such techniques tend to represent users with simplified personas (often based on historic interests), enabling the efficient calculation of personalised ranked lists. The field of Adaptive Hypermedia, on the other hand, has addressed the challenge of biasing content retrieval and presentation by adapting towards multiple characteristics. These characteristics, more typically called personalisation ''dimensions'', include user goals or prior knowledge, enabling adaptive and personalised result compositions and navigations. The question arises as to whether it is possible to compare PIR and AH in a way that exposes their respective strengths and limitations, but also identifies potential complementary affordances. This survey investigates the key techniques and impacts in the use of PIR and AH technology in order to identify such affordances and limitations. In particular, the techniques are analysed by examining key activities in the retrieval process, namely (i) query adaptation, (ii) adaptive retrieval and (iii) adaptive result composition and presentation. In each of these areas, the survey identifies individual strengths and limitations. Following this comparison of techniques, the paper also illustrates an example of a potential synergy in a hybridised approach, where adaptation can be tailored in different aspects of PIR and AH systems. Moreover, the concerns resulting from interdependencies and the respective tradeoffs of techniques are discussed, along with potential future directions and remaining challenges.

71 citations


Journal ArticleDOI
TL;DR: An empirical analysis of evolving knowledge networks of successful patent collaboration at the national level in the 1980s, 1990s, and 2000s indicates that wide and deep participation in international knowledge interactions may contribute greatly to economic competitiveness.
Abstract: In this paper, we provide an empirical analysis of evolving knowledge networks of successful patent collaboration at the national level in the 1980s, 1990s, and 2000s. All countries are classified into main knowledge creators (the Organisation for Economic Co-operation and Development (OECD) group) and main knowledge users (the non-OECD group) in order to distinguish specific characteristics of knowledge interactions within and between groups. The analyses are carried out from four aspects, i.e., the overall distribution of knowledge interactions among countries, the countries' ability to inhibit and facilitate the knowledge flows among others with the help of flow betweenness measures, the countries' bridgeness between the two groups with the recently developed Q-measures, and the most important bilateral knowledge interactions. Results show that although most of the international knowledge interactions still take place within the OECD group, the non-OECD countries have improved their performance significantly. They participate much more in international patenting and collaboration and play much more important roles in facilitating knowledge interactions among others. Among them, China and Taiwan are the two most dazzling new stars in terms of their performance in international knowledge interactions. Considered together with their rapidly improving world competitiveness, these findings indicate that wide and deep participation in international knowledge interactions may contribute greatly to economic competitiveness.

64 citations


Journal ArticleDOI
TL;DR: A new model for aggregating multiple criteria evaluations for relevance assessment by considering the existence of a prioritization relationship over the criteria is proposed, where relevance is modeled as a multidimensional property of documents.
Abstract: A new model for aggregating multiple criteria evaluations for relevance assessment is proposed. An Information Retrieval context is considered, where relevance is modeled as a multidimensional property of documents. The usefulness and effectiveness of such a model are demonstrated by means of a case study on personalized Information Retrieval with multi-criteria relevance. The following criteria are considered to estimate document relevance: aboutness, coverage, appropriateness, and reliability. The originality of this approach lies in aggregating the considered criteria in a prioritized way, by considering the existence of a prioritization relationship over the criteria. Such a prioritization is modeled by making the weights associated with a criterion dependent upon the satisfaction of the higher-priority criteria. This way, it is possible to take into account the fact that the weight of a less important criterion should be proportional to the satisfaction degree of the more important criterion. Experimental evaluations are also reported.
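
The prioritization scheme described above can be illustrated with a common form of prioritized scoring in which each criterion's weight is the product of the satisfaction degrees of all higher-priority criteria. This is a hedged sketch, with an assumed priority order and hypothetical satisfaction degrees, not necessarily the paper's exact operator.

```python
# Prioritized aggregation sketch: a low-priority criterion only matters
# to the extent the criteria above it are satisfied.
def prioritized_score(satisfactions):
    weights, w = [], 1.0
    for s in satisfactions:          # criteria ordered by decreasing priority
        weights.append(w)
        w *= s                       # next weight scaled by this satisfaction
    total = sum(w_i * s_i for w_i, s_i in zip(weights, satisfactions))
    return total / sum(weights)

# assumed order: aboutness, coverage, appropriateness, reliability
print(prioritized_score([0.9, 0.6, 0.8, 0.7]))
```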

Journal ArticleDOI
TL;DR: A new set of indices, based on time, volume, and frequency, provides a more precise set of prediction indices for emerging topic detection and gives a promising indication of emerging topics in conferences and journals.
Abstract: Emerging topic detection is a vital research area for researchers and scholars interested in searching for and tracking new research trends and topics. Current text mining and data mining methods used for this purpose focus only on the frequency with which subjects are mentioned, and ignore the novelty of the subject, which is also critical but beyond the scope of a frequency study. This work tackles this inadequacy by proposing a new set of indices for emerging topic detection: the novelty index (NI) and the published volume index (PVI). This new set of indices is based on time, volume, and frequency, and provides a more precise set of prediction indices. They are then utilized to determine the detection point (DP) of new emerging topics. Following the detection point, the intersection of these indices decides the worth of a new topic. The algorithms presented in this paper can be used to decide the novelty and life span of an emerging topic in a specific field. The entire comprehensive collection of the ACM Digital Library is examined in the experiments. The application of the NI and PVI gives a promising indication of emerging topics in conferences and journals.

Journal ArticleDOI
TL;DR: This paper presents the anonymization of query logs using microaggregation, guaranteeing the k-anonymity of the users in the query log, while preserving its utility, and provides the evaluation of the proposal in real query logs, showing the privacy and utility achieved.
Abstract: The anonymization of query logs is an important process that needs to be performed prior to the publication of such sensitive data. It ensures the anonymity of the users in the logs, a problem that has already surfaced in released logs from well-known companies. This paper presents the anonymization of query logs using microaggregation. Our proposal ensures the k-anonymity of the users in the query log while preserving its utility. We evaluate our proposal on real query logs, showing the privacy and utility achieved, and provide estimations for the use of such data in data mining processes based on clustering.
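
The core of microaggregation is easy to sketch: partition user records into groups of at least k similar members and release each group's centroid instead of the individual records. The sketch below assumes users are already reduced to numeric feature vectors and uses a crude one-dimensional ordering rather than the paper's actual partitioning method.

```python
# k-anonymity via microaggregation (toy version; real methods such as MDAV
# use multivariate clustering, and query logs need a numeric encoding first).
import numpy as np

def microaggregate(X, k):
    order = np.argsort(X[:, 0])            # crude ordering heuristic
    X_anon = X.astype(float)
    for start in range(0, len(X), k):
        group = order[start:start + k]
        if len(group) < k:                 # fold remainder into last group
            group = order[start - k:]
        X_anon[group] = X[group].mean(axis=0)   # replace with centroid
    return X_anon

X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 7.0], [5.1, 6.8], [0.9, 2.1]])
print(microaggregate(X, 2))
```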

Journal ArticleDOI
TL;DR: This work sketches the information retrieval intellectual landscape through visualizations of citation behaviors, addressing information retrieval's co-authorship network, highly productive authors, highly cited journals and papers, author-assigned keywords, active institutions, and the import of ideas from other disciplines.
Abstract: Information retrieval is a long-established subfield of library and information science. Since its inception in the early to mid-1950s, it has grown as a result, in part, of well-regarded retrieval system evaluation exercises/campaigns, the proliferation of Web search engines, and the expansion of digital libraries. Although researchers have examined the intellectual structure and nature of the general field of library and information science, the same cannot be said about the subfield of information retrieval. We address that in this work by sketching the information retrieval intellectual landscape through visualizations of citation behaviors. Citation data for 10 years (2000-2009) were retrieved from the Web of Science and analyzed using existing visualization techniques. Our results address information retrieval's co-authorship network, highly productive authors, highly cited journals and papers, author-assigned keywords, active institutions, and the import of ideas from other disciplines.

Journal ArticleDOI
TL;DR: This work presents the Permutation Prefix Index, an index data structure that supports efficient approximate similarity search and shows how the effectiveness can easily reach optimal levels just by adopting two ''boosting'' strategies: multiple index search and multiple query search, which both have nice parallelization properties.
Abstract: We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search (this work is a revised and extended version of Esuli (2009b), presented at the 2009 LSDS-IR Workshop, held in Boston). The PP-Index belongs to the family of permutation-based indexes, which represent any indexed object with ''its view of the surrounding world'', i.e., a list of the elements of a set of reference objects sorted by their distance order with respect to the indexed object. In its basic formulation, the PP-Index is strongly biased toward efficiency. We show how the effectiveness can easily reach optimal levels just by adopting two ''boosting'' strategies: multiple index search and multiple query search, both of which have nice parallelization properties. We study both the efficiency and the effectiveness properties of the PP-Index, experimenting with collections of sizes up to one hundred million objects, represented in a very high-dimensional similarity space.
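
The permutation-prefix idea is compact enough to sketch: describe each object by the identifiers of a fixed set of reference objects sorted by distance, and key the index on a short prefix of that permutation, so objects sharing a prefix become candidate neighbours. A hedged toy version follows (the actual PP-Index is considerably more engineered).

```python
# Permutation-prefix bucketing sketch on random 2-D points.
import numpy as np

rng = np.random.default_rng(0)
refs = rng.random((8, 2))                 # reference objects (pivots)

def perm_prefix(x, l=3):
    dists = np.linalg.norm(refs - x, axis=1)
    return tuple(np.argsort(dists)[:l])   # ids of the l closest pivots

index = {}
for i, obj in enumerate(rng.random((100, 2))):
    index.setdefault(perm_prefix(obj), []).append(i)

query = rng.random(2)
print(index.get(perm_prefix(query), []))  # candidate set for the query
```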

Journal ArticleDOI
TL;DR: An associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) to specific authors is introduced.
Abstract: Authorship disambiguation is an urgent issue that affects the quality of digital library services and for which supervised solutions have been proposed, delivering state-of-the-art effectiveness. However, particular challenges such as the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors from which many different disambiguation functions may be derived), and the skewed author popularity distribution (few authors are very prolific, while most appear in only a few citations) may prevent the full potential of such techniques. In this article, we introduce an associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) to specific authors. As our main contribution we propose three associative author name disambiguators: (1) EAND (Eager Associative Name Disambiguation), our basic method that explores association rules for name disambiguation; (2) LAND (Lazy Associative Name Disambiguation), which extracts rules on a demand-driven basis at disambiguation time, reducing the hypothesis space by focusing on examples that are most suitable for the task; and (3) SLAND (Self-Training LAND), which extends LAND with self-training capabilities, thus drastically reducing the amount of examples required for building effective disambiguation functions, besides being able to detect novel/unseen authors in the test set. Experiments demonstrate that all our disambiguators are effective and that, in particular, SLAND is able to outperform state-of-the-art supervised disambiguators, providing gains that range from 12% to more than 400%, being extremely effective and practical.
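
The associative idea can be illustrated in miniature: extract rules that associate citation features with authors, then let the rules that match a new citation vote. This toy sketch glosses over the support/confidence mining, the lazy demand-driven variant, and the self-training that distinguish EAND, LAND, and SLAND.

```python
# Feature -> author voting sketch for name disambiguation (hypothetical data).
from collections import defaultdict

train = [
    ({"coauthor:j.smith", "venue:IPM"},   "A. Gupta (1)"),
    ({"coauthor:j.smith", "venue:SIGIR"}, "A. Gupta (1)"),
    ({"coauthor:l.chen",  "venue:KDD"},   "A. Gupta (2)"),
]

rules = defaultdict(lambda: defaultdict(int))   # feature -> author -> count
for feats, author in train:
    for f in feats:
        rules[f][author] += 1

def disambiguate(feats):
    votes = defaultdict(float)
    for f in feats:
        total = sum(rules[f].values())
        for author, count in rules[f].items():
            votes[author] += count / total      # rule confidence as the vote
    return max(votes, key=votes.get) if votes else None

print(disambiguate({"coauthor:j.smith", "venue:IPM"}))
```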

Journal ArticleDOI
TL;DR: An overview of automatic methods for building domain knowledge structures (domain models) from text collections inspired by the ubiquitous propagation of domain model structures that are emerging in several research disciplines is given.
Abstract: This paper presents an overview of automatic methods for building domain knowledge structures (domain models) from text collections. Applications of domain models have a long history within knowledge engineering and artificial intelligence. In the last couple of decades they have surfaced noticeably as a useful tool within natural language processing, information retrieval and semantic web technology. Inspired by the ubiquitous propagation of domain model structures that are emerging in several research disciplines, we give an overview of the current research landscape and some techniques and approaches. We will also discuss trade-offs between different approaches and point to some recent trends.

Journal ArticleDOI
TL;DR: This study investigated whether and how different factors in relation to task, user-perceived knowledge, search process, and system affect users' search tactic selection, finding that seven factors were significantly associated with tactic selection.
Abstract: This study investigated whether and how different factors in relation to task, user-perceived knowledge, search process, and system affect users' search tactic selection. Thirty-one participants, representing the general public with their own tasks, were recruited for this study. Multiple methods were employed to collect data, including pre-questionnaires, verbal protocols, log analysis, diaries, and post-questionnaires. Statistical analysis revealed that seven factors were significantly associated with tactic selection: work task types, search task types, familiarity with topic, search skills, search session length, search phases, and system types. Moreover, the study also discovered, qualitatively, in what ways these factors influence the selection of search tactics. Based on the findings, the authors discuss practical implications for system design to support users' application of multiple search tactics for each factor.

Journal ArticleDOI
TL;DR: A greedy heuristic to prioritize items for caching is proposed, based on gains computed from items' past access frequencies, estimated computational costs, and storage overheads; it performs better than dividing the entire cache space among particular item types at fixed proportions.
Abstract: Caching is a crucial performance component of large-scale web search engines, as it greatly helps reducing average query response times and query processing workloads on backend search clusters. In this paper, we describe a multi-level static cache architecture that stores five different item types: query results, precomputed scores, posting lists, precomputed intersections of posting lists, and documents. Moreover, we propose a greedy heuristic to prioritize items for caching, based on gains computed by using items' past access frequencies, estimated computational costs, and storage overheads. This heuristic takes into account the inter-dependency between individual items when making its caching decisions, i.e., after a particular item is cached, gains of all items that are affected by this decision are updated. Our simulations under realistic assumptions reveal that the proposed heuristic performs better than dividing the entire cache space among particular item types at fixed proportions.
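
The gain-based prioritization lends itself to a short sketch: rank candidate items by (frequency times saved cost) per unit of storage and fill the cache greedily. The numbers below are hypothetical, and the inter-dependency updates the paper performs after each caching decision are omitted.

```python
# Greedy static cache filling by gain per byte (toy numbers).
items = [  # (name, access_freq, saved_cost, size), hypothetical values
    ("result:q1",   900, 1.0,  2),
    ("postings:t5", 400, 5.0, 40),
    ("doc:d9",      150, 2.0, 10),
]

def greedy_cache(items, capacity):
    ranked = sorted(items, key=lambda it: it[1] * it[2] / it[3], reverse=True)
    cached, used = [], 0
    for name, freq, cost, size in ranked:
        if used + size <= capacity:
            cached.append(name)
            used += size
    return cached

print(greedy_cache(items, capacity=45))   # -> ['result:q1', 'postings:t5']
```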

Journal ArticleDOI
TL;DR: This paper proposes a weighted consensus summarization method to combine the results from single summarization systems, and this method outperforms other combination methods.
Abstract: Multi-document summarization is a fundamental tool for document understanding and has received much attention recently. Given a collection of documents, a variety of summarization methods based on different strategies have been proposed to extract the most important sentences from the original documents. However, very few studies have been reported on aggregating different summarization methods to possibly generate better summary results. In this paper, we propose a weighted consensus summarization method to combine the results from single summarization systems. We evaluate and compare our proposed weighted consensus method with various baseline combination methods. Experimental results on DUC2002 and DUC2004 data sets demonstrate the performance improvement by aggregating multiple summarization systems, and our proposed weighted consensus summarization method outperforms other combination methods.
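
One simple way to picture consensus combination: each summarizer ranks the candidate sentences, and a weighted aggregate of those ranks scores every sentence. In the sketch below the system weights are fixed by hand; the paper instead learns them by optimizing consensus among the rankings.

```python
# Weighted rank aggregation sketch for multi-system summarization.
import numpy as np

# rank of 5 candidate sentences under 3 systems (1 = best), hypothetical
ranks = np.array([
    [1, 2, 3, 4, 5],
    [2, 1, 5, 3, 4],
    [1, 3, 2, 5, 4],
])
weights = np.array([0.5, 0.2, 0.3])      # assumed system weights

consensus = (weights[:, None] / ranks).sum(axis=0)  # weighted reciprocal rank
summary = np.argsort(consensus)[::-1][:2]           # take the top-2 sentences
print(summary, consensus.round(2))
```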

Journal ArticleDOI
TL;DR: Results of the experiments show that with the support of the proposed recommendation mechanism, requesters in forums can easily find similar discussion threads to avoid spamming the same discussion and this mechanism provides a relatively efficient and active way to find the appropriate experts.
Abstract: Nowadays, online forums have become a useful tool for knowledge management in Web-based technology. This study proposes a social recommender system which generates discussion thread and expert recommendations based on semantic similarity, profession and reliability, social intimacy and popularity, and social network-based Markov Chain (SNMC) models for knowledge sharing in online forum communities. The advantage of the proposed mechanism is its relatively comprehensive consideration of the aspects of knowledge sharing. Accordingly, results of our experiments show that with the support of the proposed recommendation mechanism, requesters in forums can easily find similar discussion threads to avoid spamming the same discussion. In addition, if the requesters cannot find qualified discussion threads, this mechanism provides a relatively efficient and active way to find the appropriate experts.

Journal ArticleDOI
TL;DR: A model of a mobile PIM agent (PIMA) that aims to improve PIM on mobile devices through a natural language interface and application integration is proposed, and a user study is conducted to evaluate PIMA empirically with prototype systems.
Abstract: Managing personal information such as to-dos and contacts has become one of our daily routines, consuming more time than needed. Existing PIM tools require extensive involvement of human users. This becomes a problem when using mobile devices due to their physical constraints. To address the limitations of traditional PIM tools, we propose a model of a mobile PIM agent (PIMA) that aims to improve PIM on mobile devices through a natural language interface and application integration. We conducted a user study to evaluate PIMA empirically with prototype systems. The results show that the mobile PIMA improved perceived usefulness, ease of use, and efficiency of PIM on mobile devices, which in turn accounted for positive attitude and intention to use the system. The findings of this study provide suggestions for designing and developing PIM applications on mobile devices.

Journal ArticleDOI
TL;DR: Findings confirm and extend those of previous studies, providing strong statistical evidence of an association between the information search process and users' choices of relevance criteria, and identifying specific changes in user preferences for specific criteria over the course of the information search process.
Abstract: Relevance judgments occur within an information search process, where time, context and situation can impact the judgments. The determination of relevance is dependent on a number of factors and variables, which include the criteria used to determine relevance. The relevance judgment process and the criteria used to make those judgments are manifestations of the cognitive changes which occur during the information search process. Understanding why these relevance criteria choices are made, and how they vary over the information search process, can provide important information about the dynamic relevance judgment process. This information can be used to guide the development of more adaptive information retrieval systems which respond to the cognitive changes of users during the information search process. The research data analyzed here were collected in two separate studies which examined a subject's relevance judgments over an information search process. Statistical analysis was used to examine these results and determine whether there were relationships between criteria selections, relevance judgments, and the subject's progression through the information search process. Findings confirm and extend those of previous studies, providing strong statistical evidence of an association between the information search process and users' choices of relevance criteria, and identifying specific changes in user preferences for specific criteria over the course of the information search process.

Journal ArticleDOI
TL;DR: This paper presents a novel categorization method, the three-phase categorization (TPC) algorithm, which classifies patents down to the subgroup level with reasonable accuracy; experimental results indicate that the TPC algorithm can achieve 36.07% accuracy at the subgroup level.
Abstract: An automatic patent categorization system would be invaluable to individual inventors and patent attorneys, saving them time and effort by quickly identifying conflicts with existing patents. In recent years, it has become more and more common to classify all patent documents using the International Patent Classification (IPC), a complex hierarchical classification system comprising eight sections, 128 classes, 648 subclasses, about 7200 main groups, and approximately 72,000 subgroups. So far, however, no patent categorization method has been developed that can classify patents down to the subgroup level (the bottom level of the IPC). Therefore, this paper presents a novel categorization method, the three-phase categorization (TPC) algorithm, which classifies patents down to the subgroup level with reasonable accuracy. The experimental results for the TPC algorithm, using the WIPO-alpha collection, indicate that our classification method can achieve 36.07% accuracy at the subgroup level. This is approximately a 25,764-fold improvement over a random guess.

Journal ArticleDOI
TL;DR: A generative probabilistic model is proposed, describing categories by distributions, handling the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search.
Abstract: The automated classification of texts into predefined categories has witnessed a booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to category unbalancedness and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning, from a set of preclassified documents, the characteristics of the categories. We propose a generative probabilistic model, describing categories by distributions, which handles the feature selection problem by introducing a binary exclusion/inclusion latent vector that is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
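
The Metropolis-updated inclusion vector is the most concrete piece of the method, and it sketches well: flip one feature's bit, score the new subset, and accept the flip with the usual Metropolis probability. The scoring function below is a stand-in (correlation fit minus a complexity penalty on toy data), not the paper's generative model.

```python
# Metropolis search over a binary feature-inclusion vector (toy objective).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(float)    # only features 0 and 3 matter

def score(mask):
    if mask.sum() == 0:
        return -np.inf
    sel = np.where(mask)[0]
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in sel]
    return sum(corr) - 0.1 * mask.sum()      # fit minus complexity penalty

mask = rng.integers(0, 2, size=10).astype(bool)
for _ in range(2000):
    prop = mask.copy()
    j = rng.integers(10)
    prop[j] = ~prop[j]                       # propose flipping one bit
    if np.log(rng.random()) < score(prop) - score(mask):  # Metropolis accept
        mask = prop
print(np.where(mask)[0])                     # should settle near {0, 3}
```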

Journal ArticleDOI
TL;DR: Two different methods are proposed for estimating the energy consumption and GHG emissions of an IR system or service, along with the four key enablers of Green IR: Standardize, Share, Reuse and Green behavior.
Abstract: Nowadays we use information retrieval systems and services as part of many day-to-day activities, ranging from web and database search to searching various digital libraries, audio and video collections/services, and so on. However, IR systems and services make extensive use of ICT (information and communication technologies), and increasing use of ICT can significantly increase greenhouse gas (GHG) emissions. Sustainable development, and more importantly environmental sustainability, has become a major area of concern for various national and international bodies, and as a result various initiatives and measures are being proposed for reducing the environmental impact of industries, businesses, governments and institutions. Research also shows that appropriate use of ICT can reduce the overall GHG emissions of a business, product or service. Green IT and cloud computing can play a key role in reducing the environmental impact of ICT. This paper proposes the concept of Green IR systems and services that can play a key role in reducing the overall environmental impact of various ICT-based services in education and research, business, government, etc., that are increasingly reliant on access to and use of digital information. However, to date there has not been any systematic research towards building Green IR systems and services. This paper points out the major challenges in building Green IR systems and services, and proposes two different methods for estimating the energy consumption, and the corresponding GHG emissions, of an IR system or service. It also proposes the four key enablers of Green IR, viz. Standardize, Share, Reuse and Green behavior, and outlines the further research required to achieve them.

Journal ArticleDOI
TL;DR: It is concluded that both quality and accessibility influence the selection of human information sources, although quality exerts a slightly stronger influence.
Abstract: This study focuses on how the accessibility and quality of co-workers in organizations affect their use as information sources. Prior research has produced inconsistent findings concerning these factors' respective influence on source selection. In this article, we argue that one potential reason for this lies in the lack of coherent definitions of accessibility and quality. To bridge this gap, we unpack these concepts into their underlying dimensions, based on insights derived from social capital theory, more specifically Nahapiet and Ghoshal's (1998) contribution to uncovering the multidimensionality of social capital. We empirically test the dimensionality of accessibility and quality, as well as the relative influence of these concepts on human information source selection, in a scenario experiment within an organization. Findings support the proposed dimensionality, and lead to the conclusion that both quality and accessibility influence the selection of human information sources, although quality exerts a slightly stronger influence.

Journal ArticleDOI
TL;DR: This research investigates how people's perceptions of information retrieval (IR) systems, their perceptions of search tasks, and their perception of self-efficacy influence the amount of invested mental effort (AIME) they put into using two different IR systems: a Web search engine and a library system.
Abstract: This research investigates how people's perceptions of information retrieval (IR) systems, their perceptions of search tasks, and their perceptions of self-efficacy influence the amount of invested mental effort (AIME) they put into using two different IR systems: a Web search engine and a library system. It also explores the impact of mental effort on an end user's search experience. To assess AIME in online searching, two experiments were conducted using these methods: Experiment 1 relied on self-reports and Experiment 2 employed the dual-task technique. In both experiments, data were collected through search transaction logs, a pre-search background questionnaire, a post-search questionnaire and an interview. Important findings are these: (1) subjects invested greater mental effort searching a library system than searching the Web; (2) subjects put little effort into Web searching because of their high sense of self-efficacy in their searching ability and their perception of the easiness of the Web; (3) subjects did not recognize that putting mental effort into searching was something needed to improve the search results; and (4) data collected from multiple sources proved to be effective for assessing mental effort in online searching.

Journal ArticleDOI
TL;DR: This work proposes a distributed index structure for similarity data management called the Metric Index (M-Index), which can answer queries in a precise and approximate manner; its usability is demonstrated by a full-featured, publicly available Web application.
Abstract: Metric space is a universal and versatile model of similarity that can be applied in various areas of non-text information retrieval. However, a general, efficient and scalable solution for metric data management is still a persistent research challenge. In this work, we try to make an important step towards a management system able to scale to data collections of billions of objects. We propose a distributed index structure for similarity data management called the Metric Index (M-Index), which can answer queries in a precise and approximate manner. This technique can take advantage of any distributed hash table that supports interval queries, utilizing it as an underlying index. We have performed numerous experiments to test various settings of the M-Index structure, and we have demonstrated its usability by developing a full-featured, publicly available Web application.

Journal ArticleDOI
TL;DR: An efficient and effective solution to the problem of choosing the queries to suggest to web search engine users, in order to help them rapidly satisfy their information needs, is proposed; it remarkably outperforms two other state-of-the-art solutions.
Abstract: This paper proposes an efficient and effective solution to the problem of choosing the queries to suggest to web search engine users in order to help them rapidly satisfy their information needs. By exploiting a weak function for assessing the similarity between the current query and the knowledge base built from historical users' sessions, we reduce the suggestion generation phase to the processing of a full-text query over an inverted index. The resulting query recommendation technique is very efficient and scalable, and is less affected by the data-sparsity problem than most state-of-the-art proposals. Thus, it is particularly effective in generating suggestions for rare queries occurring in the long tail of the query popularity distribution. The quality of the generated suggestions is assessed by evaluating their effectiveness in forecasting users' behavior recorded in historical query logs, and on the basis of the results of a reproducible user study conducted on publicly available, human-assessed data. The experimental evaluation shows that our proposal remarkably outperforms two other state-of-the-art solutions, and that it can generate useful suggestions even for rare and never-seen queries.
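
The key move, treating suggestion generation as full-text retrieval, can be sketched briefly: index historical sessions as tiny "documents" of query terms, match the current query against that index with a weak similarity, and offer the closing queries of the best-matching sessions. Session data and scoring below are hypothetical simplifications.

```python
# Session-index query suggestion sketch (toy data, term-overlap scoring).
from collections import defaultdict

sessions = [
    ["cheap flights", "cheap flights europe", "budget airlines europe"],
    ["flights paris", "paris weekend deals"],
]

index = defaultdict(set)                    # term -> ids of sessions using it
for sid, queries in enumerate(sessions):
    for term in " ".join(queries).split():
        index[term].add(sid)

def suggest(query, k=2):
    scores = defaultdict(int)
    for term in query.split():              # weak term-overlap similarity
        for sid in index[term]:
            scores[sid] += 1
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sessions[sid][-1] for sid in best]  # each session's final query

print(suggest("cheap flights paris"))
```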