
Showing papers in "Information Processing and Management in 2003"


Journal ArticleDOI
TL;DR: The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency-inverse document frequency measures that are commonly used in today's information retrieval systems.
Abstract: This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency-inverse document frequency measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
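
A worked example may make the definition concrete. The sketch below (Python) computes a PWI-style value as the product of a term's occurrence probability and its amount of information, using maximum-likelihood estimates; the paper's exact estimation procedure may differ.

import math

def pwi(tf, doc_len, df, n_docs):
    # Occurrence probability of the term in the document (ML estimate).
    p = tf / doc_len
    # Amount of information carried by the term (idf-like quantity).
    info = -math.log2(df / n_docs)
    # PWI: probability-weighted amount of information, mirroring tf-idf.
    return p * info

# Example: a term occurring 3 times in a 100-word document and appearing
# in 10 of 1000 documents yields a PWI of about 0.199.
print(pwi(tf=3, doc_len=100, df=10, n_docs=1000))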

1,011 citations


Journal ArticleDOI
TL;DR: This essay concludes with a discussion of the value that can be added to information seeking research and theory as a result of a deeper appreciation of context, particularly in terms of the authors' current multi-contextual environment and individuals taking an active role in contextualizing.
Abstract: While surprisingly little has been written about context at a meaningful level, context is central to most theoretical approaches to information seeking. In this essay I explore in more detail three senses of context. First, I look at context as equivalent to the situation in which a process is immersed. Second, I discuss contingency approaches that detail active ingredients of the situation that have specific, predictable effects. Third, I examine major frameworks for meaning systems. Then, I discuss how a deeper appreciation of context can enhance our understanding of the process of information seeking by examining two vastly different contexts in which it occurs: organizational and cancer-related, an exemplar of everyday life information seeking. This essay concludes with a discussion of the value that can be added to information seeking research and theory as a result of a deeper appreciation of context, particularly in terms of our current multi-contextual environment and individuals taking an active role in contextualizing.

196 citations


Journal ArticleDOI
TL;DR: In this article, the authors analyzed how students' growing understanding of the topic and search experience were related to their choice of search tactics and terms while preparing a research proposal for a small empirical study.
Abstract: The study analyses how students' growing understanding of the topic and their search experience were related to their choice of search tactics and terms while preparing a research proposal for a small empirical study. In addition, the findings of the study are used to test Vakkari's (2001) theory of task-based IR. The research subjects were 22 students of psychology attending a seminar for preparing the proposal. They searched the PsychINFO database for their task at the beginning and end of the seminar. Data were collected in several ways: a pre- and post-search interview was conducted in both sessions, and the students were asked to think aloud in the sessions; this was recorded, as were the transaction logs. The results show that search experience was slightly related to the change of facets. Although the students' vocabulary of the topic grew, generating an increased use of specific terms between the sessions, their use of search tactics and operators remained fairly constant. There was no correlation between the terms and tactics used and the total number of useful references found. By comparing these results with the findings of relevant earlier studies, the conclusion was drawn that domain knowledge has an impact on searching provided that users have a sufficient command of the system used. This implies that the tested theory of task-based IR is valid on condition that the searchers are experienced. It is suggested that the theory should be enriched by including search experience in its scope.

195 citations


Journal ArticleDOI
TL;DR: It is confirmed that WT10g contains exploitable link information using a site (homepage) finding experiment and the results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text.
Abstract: Past research into text retrieval methods for the Web has been restricted by the lack of a test collection capable of supporting experiments which are both realistic and reproducible. The 1.69 million document WT10g collection is proposed as a multi-purpose testbed for experiments with these attributes, in distributed IR, hyperlink algorithms and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimised. These properties include: a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes. We confirm that WT10g contains exploitable link information using a site (homepage) finding experiment. Our results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text. WT10g was used in TREC-9 and TREC-2000, and both topic relevance and homepage finding queries and judgments are available.
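
For reference, the site-finding experiment scores documents with Okapi BM25. Below is a minimal sketch of the textbook BM25 formula with the usual defaults (k1 = 1.2, b = 0.75); it can be applied to a page's full text or, as in the paper's runs, to anchor text propagated to the target page.

import math

def bm25(query_terms, doc_tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    # doc_tf: term -> frequency in the document representation.
    # df: term -> number of documents containing the term.
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        score += idf * norm
    return score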

187 citations


Journal ArticleDOI
TL;DR: This paper proposes the use of Web pages linked with the home page in a different manner from the sole use of home pages in previous research, and derives a scheme for Web site classification based on the k-nearest neighbor (k-NN) approach.
Abstract: Automatic categorization is a viable method to deal with the scaling problem on the World Wide Web. For Web site classification, this paper proposes the use of Web pages linked with the home page, in a different manner from the sole use of home pages in previous research. To implement our proposed method, we derive a scheme for Web site classification based on the k-nearest neighbor (k-NN) approach. It consists of three phases: Web page selection (connectivity analysis), Web page classification, and Web site classification. Given a Web site, the Web page selection phase chooses several representative Web pages using connectivity analysis. The k-NN classifier next classifies each of the selected Web pages. Finally, the classified Web pages are extended to a classification of the entire Web site. To improve performance, we supplement the k-NN approach with a feature selection method and a term weighting scheme using markup tags, and also reform its document-document similarity measure. In our experiments on a Korean commercial Web directory, the proposed system, using both a home page and its linked pages, improved the micro-averaging breakeven point by 30.02%, compared with an ordinary classification which uses the home page only.
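
The last two phases can be illustrated with a short sketch (Python), assuming pages are represented as sparse term-weight dictionaries; the paper's feature selection, tag-based term weighting and reformed similarity measure are omitted here.

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dicts.
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def knn_label(page, training, k=5):
    # Phase 2: label one selected page by its k nearest training pages.
    nearest = sorted(training, key=lambda ex: cosine(page, ex[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def classify_site(selected_pages, training, k=5):
    # Phase 3: extend page-level decisions to the site by majority vote
    # (the aggregation rule is a design choice in this sketch).
    votes = Counter(knn_label(p, training, k) for p in selected_pages)
    return votes.most_common(1)[0][0]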

146 citations


Journal ArticleDOI
TL;DR: A summarisation system that generates query-biased summaries of retrieved web pages is evaluated, and the results indicate that query-biased summaries help users gauge document relevance more effectively than the ranked titles/abstracts typical of web search result lists.
Abstract: The aim of the work described in this paper is to evaluate the effects of query-biased summaries in web searching. For this purpose, a summarisation system has been developed, and a summary tailored to the user's query is generated automatically for each document retrieved. The system aims to provide a better means of assessing document relevance than the titles or abstracts typical of many web search result lists. By visiting each result page at retrieval time, the system also gives the user an idea of the current page content and thus deals with the dynamic nature of the web. To examine the effectiveness of this approach, a task-oriented, comparative evaluation between four different web retrieval systems was performed: two that use query-biased summarisation, and two that use the standard ranked titles/abstracts approach. The results from the evaluation indicate that query-biased summarisation techniques appear to be more useful and effective in helping users gauge document relevance than the traditional ranked titles/abstracts approach. The same methodology was used to compare the effectiveness of two of the web's major search engines: Alta Vista and Google.
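
As an illustration of the general technique, the sketch below (Python) selects the sentences sharing the most terms with the query. This is a minimal rendering of query-biased summarisation, not the evaluated system, which combines several sentence-scoring heuristics.

import re

def query_biased_summary(text, query, n_sentences=3):
    q_terms = set(query.lower().split())
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Score each sentence by its overlap with the query terms.
    ranked = sorted(sentences,
                    key=lambda s: len(q_terms & set(s.lower().split())),
                    reverse=True)
    top = set(ranked[:n_sentences])
    # Emit the chosen sentences in their original order.
    return ' '.join(s for s in sentences if s in top)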

119 citations


Journal ArticleDOI
TL;DR: A fuzzy evaluation method of Standard Generalized Markup Language documents based on computing with words is presented; using fuzzy linguistic modeling, the user-system interaction is facilitated and the system's assistance is improved.

Abstract: Recommender systems evaluate and filter the great amount of information available on the Web to assist people in their search processes. A fuzzy evaluation method of Standard Generalized Markup Language documents based on computing with words is presented. Given a document type definition (DTD), we consider that its elements are not equally informative. This is indicated in the DTD by attaching linguistic importance attributes to the more meaningful elements of the chosen DTD. Then, the evaluation method generates linguistic recommendations from linguistic evaluation judgements provided by different recommenders on meaningful elements of the DTD. To do so, the evaluation method uses two quantifier-guided linguistic aggregation operators, the linguistic weighted averaging operator and the linguistic ordered weighted averaging operator, which allow us to obtain recommendations taking into account the fuzzy majority of the recommenders' judgements. Using fuzzy linguistic modeling, the user-system interaction is facilitated and the system's assistance is improved. The method can be easily extended on the Web to evaluate HyperText Markup Language and eXtensible Markup Language documents.
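
To give the flavour of quantifier-guided aggregation, here is a simplified sketch (Python): linguistic labels are mapped to scale indices, OWA weights are derived from the fuzzy quantifier "most", and the aggregate index is rounded back to a label. The paper's LOWA operator works on labels directly through a convex combination, so the rounding here is a simplification.

def most(r, a=0.3, b=0.8):
    # Fuzzy linguistic quantifier "most" with its usual parameters.
    if r <= a:
        return 0.0
    if r >= b:
        return 1.0
    return (r - a) / (b - a)

def lowa(indices, scale_size):
    # Order the judgements, derive OWA weights from the quantifier,
    # aggregate, and round back to the nearest label index.
    ordered = sorted(indices, reverse=True)
    n = len(ordered)
    weights = [most((i + 1) / n) - most(i / n) for i in range(n)]
    value = sum(w * x for w, x in zip(weights, ordered))
    return min(scale_size - 1, round(value))

# Seven-label scale; three recommenders judge one DTD element.
labels = ["None", "VeryLow", "Low", "Medium", "High", "VeryHigh", "Perfect"]
print(labels[lowa([5, 4, 3], len(labels))])  # -> "High"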

117 citations


Journal ArticleDOI
TL;DR: A fuzzy multiset model for information clustering is proposed with application to information retrieval on the World Wide Web, observing that a search engine retrieves multiple occurrences of the same subjects with possibly different degrees of relevance.
Abstract: A fuzzy multiset model for information clustering is proposed with application to information retrieval on the World Wide Web. Noting that a search engine retrieves multiple occurrences of the same subjects with possibly different degrees of relevance, we observe that fuzzy multisets provide an appropriate model of information retrieval on the WWW. Information clustering, which here means both term clustering and document clustering, is considered. Three methods are proposed: hard c-means, fuzzy c-means, and an agglomerative method using cluster centers. Two distances between fuzzy multisets and algorithms for calculating cluster centers are defined. Theoretical properties concerning the clustering algorithms are studied. Illustrative examples are given to show how the algorithms work.
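
The iteration structure of one of the proposed methods, fuzzy c-means, can be sketched as follows (Python). This textbook version works on ordinary Euclidean vectors; the paper defines the distances and cluster centers over fuzzy multisets.

import random

def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    dim = len(points[0])
    centers = random.sample(points, c)
    for _ in range(iters):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
        u = []
        for x in points:
            d = [sum((a - b) ** 2 for a, b in zip(x, v)) ** 0.5 or 1e-12
                 for v in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(c))
                      for i in range(c)])
        # Center update: mean of the points weighted by u^m.
        centers = [[sum(u[k][i] ** m * points[k][j] for k in range(len(points))) /
                    sum(u[k][i] ** m for k in range(len(points)))
                    for j in range(dim)] for i in range(c)]
    return centers, u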

116 citations


Journal ArticleDOI
TL;DR: This work proposes a document-identifier reassignment algorithm that generates a reassignment order for all documents according to the similarity to reassign closer identifiers to the documents having closer relationships.
Abstract: The inverted file is the most popular indexing mechanism for document search in an information retrieval system. Compressing an inverted file can greatly improve document search rate. Traditionally, the d-gap technique is used in the inverted file compression by replacing document identifiers with usually much smaller gap values. However, fluctuating gap values cannot be efficiently compressed by some well-known prefix-free codes. To smoothen and reduce the gap values, we propose a document-identifier reassignment algorithm. This reassignment is based on a similarity factor between documents. We generate a reassignment order for all documents according to the similarity to reassign closer identifiers to the documents having closer relationships. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and the compression rate of d-gapped inverted file with prefix-free codes can be improved by 15%.
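
Both ingredients can be sketched briefly (Python): computing d-gaps from a sorted postings list, and a greedy similarity-driven reassignment that gives related documents close identifiers. The greedy ordering is illustrative; the paper's reassignment heuristic may differ in detail.

def d_gaps(postings):
    # Replace sorted document identifiers with gaps (d-gap technique).
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def greedy_reassign(similarity, n_docs):
    # Start from document 0 and repeatedly append the most similar
    # unassigned document, so related documents get close identifiers.
    order, remaining = [0], set(range(1, n_docs))
    while remaining:
        nxt = max(remaining, key=lambda d: similarity(order[-1], d))
        order.append(nxt)
        remaining.remove(nxt)
    return {old: new for new, old in enumerate(order)}  # old id -> new id

print(d_gaps([3, 7, 11, 23]))  # -> [3, 4, 4, 12]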

82 citations


Journal ArticleDOI
TL;DR: The design and implementation of a prototype visualization system to enhance author searching, called AuthorLink, is described, based on author co-citation analysis and visualization mapping algorithms such as Kohonen's feature maps and Pathfinder networks.
Abstract: Author searching is traditionally based on the matching of name strings. Special characteristics of authors as personal names and subject indicators are not considered. This makes it difficult to identify a set of related authors or to group authors by subjects in retrieval systems. In this paper, we describe the design and implementation of a prototype visualization system to enhance author searching. The system, called AuthorLink, is based on author co-citation analysis and visualization mapping algorithms such as Kohonen's feature maps and Pathfinder networks. AuthorLink produces interactive author maps in real time from a database of 1.26 million records supplied by the Institute for Scientific Information. The maps show subject groupings and more fine-grained intellectual connections among authors. Through the interactive interface the user can take advantage of such information to refine queries and retrieve documents through point-and-click manipulation of the authors' names.

77 citations


Journal ArticleDOI
Hang Li, Kenji Yamanishi
TL;DR: Experimental results indicate that the proposed method significantly outperforms methods that combine existing techniques, and a statistical learning approach to the issue is proposed in this paper.
Abstract: Addressed here is the issue of 'topic analysis' which is used to determine a text's topic structure, a representation indicating what topics are included in a text and how those topics change within the text. Topic analysis consists of two main tasks: topic identification and text segmentation. While topic analysis would be extremely useful in a variety of text processing applications, no previous study has so far sufficiently addressed it. A statistical learning approach to the issue is proposed in this paper. More specifically, topics here are represented by means of word clusters, and a finite mixture model, referred to as a stochastic topic model (STM), is employed to represent a word distribution within a text. In topic analysis, a given text is segmented by detecting significant differences between STMs, and topics are identified by means of estimation of STMs. Experimental results indicate that the proposed method significantly outperforms methods that combine existing techniques.
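
A simplified rendering of the segmentation idea (Python): score each candidate boundary by the divergence between the word distributions on either side and place boundaries at the peaks. Plain unigram distributions stand in here for the paper's stochastic topic models.

import math
from collections import Counter

def unigram(words):
    c = Counter(words)
    n = sum(c.values())
    return {w: f / n for w, f in c.items()}

def js_divergence(p, q):
    # Jensen-Shannon divergence between two word distributions.
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in set(p) | set(q)}
    kl = lambda a: sum(v * math.log2(v / m[w]) for w, v in a.items() if v > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def boundary_scores(sentences, window=3):
    # High divergence between adjacent windows suggests a topic shift.
    scores = []
    for i in range(window, len(sentences) - window):
        left = [w for s in sentences[i - window:i] for w in s.split()]
        right = [w for s in sentences[i:i + window] for w in s.split()]
        scores.append((i, js_divergence(unigram(left), unigram(right))))
    return scores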

Journal ArticleDOI
TL;DR: Results suggest that the proposed LSP method, allowing comparison of different image regions using different similarity criteria, is more suited for modeling the human perception of image similarity than the existing methods.
Abstract: Local similarity pattern (LSP) is proposed as a new method for computing image similarity. Similarity of a pair of images is expressed in terms of similarities of the corresponding image regions, obtained by the uniform partitioning of the image area. Different from the existing methods, each region-wise similarity is computed using a different combination of image features (color, shape, and texture). In addition, a method for optimizing the LSP-based similarity computation, based on a genetic algorithm, is proposed and incorporated in the relevance feedback mechanism, allowing the user to automatically specify LSP-based queries. LSP is evaluated on five test databases totalling around 2500 images of various sorts. Compared with both the conventional and the relevance feedback methods for computing image similarity, LSP brings on average an over 11% increase in retrieval precision. Results suggest that the proposed LSP method, allowing comparison of different image regions using different similarity criteria, is better suited for modeling the human perception of image similarity than the existing methods.

Journal ArticleDOI
TL;DR: This investigation examines end-user judgment and evaluation behavior during information retrieval (IR) system interactions and extends previous research surrounding relevance as a key construct for representing the value end-users ascribe to items retrieved from IR systems to suggest a predictive multi-stage model of relevance thresholds.
Abstract: This investigation examines end-user judgment and evaluation behavior during information retrieval (IR) system interactions and extends previous research surrounding relevance as a key construct for representing the value end-users ascribe to items retrieved from IR systems. A self-reporting instrument collected evaluative responses from 32 end-users related to 1432 retrieved items in relation to five characteristics of each item: (1) whether it was on topic, (2) whether it was meaningful to the user, (3) whether it was useful in relation to the problem at hand, (4) whether the IR system returned the information in the right form or format, and (5) whether the information retrieved allowed the user to take further action on it. The manner in which these characteristics of the retrieved items were considered, differentiated and aggregated was examined in relation to the region of relevance attributed to those items by the users. The nominal nature of the data collected led to non-parametric statistical analyses, which indicated that end-user evaluation of retrieved items to resolve an information problem at hand is most likely a multi-stage process. While end-users may differ in their judgments and evaluations of retrieved items, they appear to make those decisions by using somewhat consistent heuristic approaches that point to a predictive multi-stage model of relevance thresholds that exist on a continuum from topic to meaning (pertinence) to functionality (use).

Journal ArticleDOI
TL;DR: A new text manipulation system FA-Sim that is useful for retrieving information in large heterogeneous texts and for recognizing content similarity in text excerpts is outlined.
Abstract: Conventional approaches to text analysis and information retrieval, which measure document similarity by considering all of the information in texts, are relatively inefficient for processing large text collections in heterogeneous subject areas. This paper outlines a new text manipulation system, FA-Sim, that is useful for retrieving information in large heterogeneous texts and for recognizing content similarity in text excerpts. FA-Sim is based on flexible text matching procedures carried out in various contexts and various field ranks. FA-Sim measures text similarity using specific field association (FA) terms instead of comparing all text information. Similarity between texts is computed faster, and recognized more accurately, by FA-Sim than by the two other analysis methods compared: Recall and Precision improved significantly, by 39% and 37% respectively, over these two traditional methods.

Journal ArticleDOI
TL;DR: Trends in multimedia Web searching by Excite users from 1997 to 2001 are examined, which sees multimediaWeb searching undergoing major changes as Web content and searching evolves.
Abstract: Multimedia is proliferating on Web sites, as the Web continues to enhance the integration of multimedia and textual information. In this paper we examine trends in multimedia Web searching by Excite users from 1997 to 2001. Results from an analysis of 1,025,910 Excite queries from 2001 are compared to similar Excite datasets from 1997 to 1999. Findings include: (1) queries per multimedia session have decreased since 1997 as a proportion of general queries due to the introduction of multimedia buttons near the query box, (2) multimedia queries identified are longer than non-multimedia queries, and (3) audio queries are more prevalent than image or video queries in identified multimedia queries. Overall, we see multimedia Web searching undergoing major changes as Web content and searching evolves.

Journal ArticleDOI
TL;DR: The qualitative and quantitative analysis of the data shows that users consider both ease-of-use and user control essential for effective retrieval, emphasizing the balance between system role and user involvement in achieving various IR sub-tasks.

Abstract: The emergence of Web-based IR systems calls for the need to support ease-of-use as well as user control. This study investigates users' perceptions of ease-of-use versus user control, and the desired functionalities and interface structure of online IR systems in supporting both. Forty subjects who had an opportunity to learn and use five online databases participated in the study. Multiple methods were employed to collect data. The qualitative and quantitative analysis of the data shows that users consider both ease-of-use and user control essential for effective retrieval. The results are discussed within the context of a model of optimal support for ease-of-use and user control, particularly emphasizing the balance between system role and user involvement in achieving various IR sub-tasks.

Journal ArticleDOI
TL;DR: This work constructed a realistic current news test collection using the results obtained from 15 current news Web sites in response to 107 topical queries and found that a low-cost merging scheme worked almost as well as merging based on downloading and rescoring the actual news articles.
Abstract: Metasearching of online current news services is a potentially useful Web application of distributed information retrieval techniques. We constructed a realistic current news test collection using the results obtained from 15 current news Web sites (including ABC News, BBC and AllAfrica) in response to 107 topical queries. Results were judged for relevance by independent assessors. Online news services varied considerably both in the usefulness of the results sets they returned and also in the amount of information they provided which could be exploited by a metasearcher. Using the current news test collection we compared a range of different merging methods. We found that a low-cost merging scheme based on a combination of available evidence (title, summary, rank and server usefulness) worked almost as well as merging based on downloading and rescoring the actual news articles.
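
A merging scheme of this low-cost kind can be sketched in a few lines (Python): each hit is scored from evidence the services already return (rank position plus a server-usefulness prior). The weights are illustrative; the paper's best scheme also exploits title and summary evidence.

def merge(result_lists, server_usefulness, w_rank=1.0, w_server=1.0):
    # result_lists: server -> ordered list of document ids.
    scored = []
    for server, docs in result_lists.items():
        for rank, doc in enumerate(docs, start=1):
            score = w_rank / rank + w_server * server_usefulness.get(server, 0.0)
            scored.append((score, doc, server))
    return sorted(scored, reverse=True)

merged = merge({"bbc": ["b1", "b2"], "abc": ["a1"]},
               {"bbc": 0.9, "abc": 0.6})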

Journal ArticleDOI
TL;DR: The paper presents a genetic approach that combines the results from multiple query evaluations and investigates ways to improve the effectiveness of the genetic exploration by combining appropriate techniques and heuristics known in genetic theory or in the IR field.
Abstract: Recent studies suggest that significant improvement in information retrieval performance can be achieved by combining multiple representations of an information need. The paper presents a genetic approach that combines the results from multiple query evaluations. The genetic algorithm aims to optimise the overall relevance estimate by exploring different directions of the document space. We investigate ways to improve the effectiveness of the genetic exploration by combining appropriate techniques and heuristics known in genetic theory or in the IR field. Indeed, the approach uses a niching technique to solve the relevance multimodality problem, a relevance feedback technique to perform genetic transformations on query formulations, and evolution heuristics in order to improve the convergence conditions of the genetic process. The effectiveness of the global approach is demonstrated by comparing the retrieval results obtained by both genetic multiple query evaluation and classical single query evaluation performed on a subset of TREC-4 using the Mercure IRS. Moreover, experimental results show the positive effect of the various techniques integrated into our genetic algorithm model.
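
A bare-bones version of the genetic loop (Python), with individuals as query term-weight vectors, one-point crossover and Gaussian mutation. The niching technique, feedback-driven genetic transformations and evolution heuristics described above sit on top of a skeleton like this.

import random

def evolve_queries(seed, fitness, generations=20, pop_size=8, sigma=0.1):
    # fitness(weights) would estimate relevance, e.g. from user feedback
    # on the documents retrieved with that query formulation.
    pop = [[w + random.gauss(0, sigma) for w in seed] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(seed))         # one-point crossover
            child = [w + random.gauss(0, sigma)          # Gaussian mutation
                     for w in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)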

Journal ArticleDOI
TL;DR: How collaboration and the volume of information production have changed over the past century is examined, and how older documents are used in today's network environment, where new information is easily accessible, is explored.

Abstract: Scholarly communication is undergoing transformation under the confluence of many forces. The purpose of this article is to explore trends in transforming scholarly publishing and their implications. It examines how collaboration and the volume of information production have changed over the past century. It also explores how older documents are used in today's network environment, where new information is easily accessible. Understanding these trends would help us design more effective electronic scholarly publishing systems and digital libraries, and serve the needs of scholars more responsively.

Journal ArticleDOI
TL;DR: A model (ABAMDM, acquisition budget allocation model via data mining) is introduced that addresses the use of descriptive knowledge discovered in the historical circulation data explicitly to support allocating library acquisition budget.
Abstract: Many approaches to decision support for academic library acquisition budget allocation have been proposed to reflect diverse management requirements. Different from these methods, which focus mainly on either statistical analysis or goal programming, this paper introduces a model (ABAMDM, acquisition budget allocation model via data mining) that explicitly uses descriptive knowledge discovered in historical circulation data to support allocating the library acquisition budget. The major concern in this study is that the budget allocation should reflect the requirement that the more a department makes use of its acquired materials in the present academic year, the more budget it receives for the coming year. The primary output of the ABAMDM used to derive weights for acquisition budget allocation contains two parts: descriptive knowledge via utilization concentration, and suitability via utilization connection for the departments concerned. An application to the library of Kun Shan University of Technology is described to demonstrate the introduced ABAMDM in practice.

Journal ArticleDOI
TL;DR: This work analyzes and evaluates the retrieval effectiveness of various indexing and search strategies based on test-collections written in four different languages: English, French, German, and Italian, and describes and evaluates various approaches that might be implemented in order to effectively access document collections written in another language.
Abstract: Search engines play an essential role in the usability of Internet-based information systems and without them the Web would be much less accessible, and at the very least would develop at a much slower rate. Given that non-English users now tend to make up the majority in this environment, our main objective is to analyze and evaluate the retrieval effectiveness of various indexing and search strategies based on test-collections written in four different languages: English, French, German, and Italian. Our second objective is to describe and evaluate various approaches that might be implemented in order to effectively access document collections written in another language. As a third objective, we will explore the underlying problems involved in searching document collections written in the four different languages, and we will suggest and evaluate different database merging strategies capable of providing the user with a single unique result list.

Journal ArticleDOI
TL;DR: Relevance feedback genetic techniques that follow the vector space model are compared with each other and with one of the best traditional methods of relevance feedback--the Ide dec-hi method.
Abstract: The present work is the continuation of an earlier study which reviewed the literature on relevance feedback genetic techniques that follow the vector space model (the model that is most commonly used in this type of application), and implemented them so that they could be compared with each other as well as with one of the best traditional methods of relevance feedback--the Ide dec-hi method. We here carry out the comparisons on more test collections (Cranfield, CISI, Medline, and NPL), using the residual collection method for their evaluation as is recommended in this type of technique. We also add some fitness functions of our own design.
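
The traditional baseline has a compact standard formulation: Ide dec-hi adds all judged-relevant document vectors to the query and subtracts the single highest-ranked non-relevant one. A sketch with vectors as term-weight dictionaries (Python):

def ide_dec_hi(query, relevant_docs, top_nonrelevant):
    # Q' = Q + sum of relevant document vectors
    #        - highest-ranked non-relevant document vector.
    new_q = dict(query)
    for doc in relevant_docs:
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + w
    for t, w in top_nonrelevant.items():
        new_q[t] = new_q.get(t, 0.0) - w
    return new_q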

Journal ArticleDOI
TL;DR: The goal of this paper is to describe an exploration system for large image databases in order to help the user understand the database as a whole, discover hidden relationships, and formulate insights with minimum effort, using color information.
Abstract: The goal of this paper is to describe an exploration system for large image databases in order to help the user understand the database as a whole, discover hidden relationships, and formulate insights with minimum effort. While the proposed system works with any type of low-level feature representation of images, we describe our system using color information. The system is built in three stages. The first is the feature extraction stage, in which images are represented in a way that allows efficient storage and retrieval results closer to human perception. The second stage consists of building a hierarchy of clusters in which the cluster prototype, as the electronic identification (eID) of that cluster, is designed to summarize the cluster in a manner suited for quick human comprehension of its components. Formally, an eID is the image most similar to the other images of the corresponding cluster; that is, the image in the cluster that maximizes the sum of the squares of the similarity values to the other images of that cluster. Besides summarizing the image database to a certain level of detail, an eID image provides access either to another set of eID images on a lower level of the hierarchy or to a group of images perceptually similar to itself. As a third stage, the multi-dimensional scaling technique is used to provide a tool for visualization of the database at different levels of detail. Moreover, it gives the capability for semi-automatic annotation, in the sense that the image database is organized so that perceptually similar images are grouped together to form perceptual contexts. As a result, instead of trying to give all possible meanings to an image, the user will interpret and annotate an image in the context in which that image appears, thus dramatically reducing the time taken to annotate a large collection of images.
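
The eID election rule quoted above is directly computable. A small sketch (Python), assuming a pairwise similarity function is available:

def elect_eid(cluster, similarity):
    # The eID is the image maximizing the sum of squared similarities
    # to the other images of its cluster.
    def score(img):
        return sum(similarity(img, other) ** 2
                   for other in cluster if other is not img)
    return max(cluster, key=score)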

Journal ArticleDOI
TL;DR: Measuring the similarity of ordered sets of documents is treated as a special case of comparing fuzzy sets of documents, where the higher the rank of a document, the lower its weight in the fuzzy set.

Abstract: Ordered sets of documents are encountered more and more in information distribution systems, such as information retrieval systems. Classical similarity measures for ordinary sets of documents hence need to be extended to these ordered sets. This is done in this paper using fuzzy set techniques. First a general similarity measure is developed which contains the classical strong similarity measures such as Jaccard, Dice and Cosine, and which contains the classical weak similarity measures such as Recall and Precision. Then these measures are extended to comparing fuzzy sets of documents. Measuring the similarity for ordered sets of documents is a special case of this, where the higher the rank of a document, the lower its weight is in the fuzzy set. Concrete forms of these similarity measures are presented. All these measures are new, and the ones for the weak similarity measures are the first of this kind (other strong similarity measures have been given in a previous paper by Egghe and Michel). Some of these measures are then tested in the IR system Profil-Doc. The engine SPIRIT© extracts ranked document sets in three different contexts, each for 600 requests. The practical usability of the OS-measures is then discussed based on these experiments.
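
A concrete instance of these ordered-set measures (Python): represent each ranked list as a fuzzy set whose membership degrees decrease with rank, then apply the fuzzy generalization of Jaccard (sum of minima over sum of maxima). The weighting (n - r + 1)/n is one illustrative choice, not the paper's specific form.

def rank_memberships(ranked_docs):
    # The further down the list, the lower the membership degree.
    n = len(ranked_docs)
    return {d: (n - r) / n for r, d in enumerate(ranked_docs)}

def fuzzy_jaccard(a, b):
    # Jaccard for fuzzy sets: sum of minima over sum of maxima.
    docs = set(a) | set(b)
    num = sum(min(a.get(d, 0), b.get(d, 0)) for d in docs)
    den = sum(max(a.get(d, 0), b.get(d, 0)) for d in docs)
    return num / den if den else 0.0

print(fuzzy_jaccard(rank_memberships(["d1", "d2", "d3"]),
                    rank_memberships(["d2", "d1", "d4"])))  # -> 0.5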

Journal ArticleDOI
TL;DR: The objective of this paper is to investigate how (and under which conditions) other aggregate functions could be applied to fuzzy sets in a flexible query.
Abstract: This paper is situated in the area of flexible queries addressed to regular relational databases. Some works have been carried out in the past years in order to design languages allowing the expression of queries involving preferences through the use of fuzzy predicates. In the relational setting, extensions of the relational algebra as well as of SQL-like languages have been proposed and a wide range of fuzzy queries has been made available. However, the use of conditions involving an aggregate function applying to a fuzzy set is not yet possible except for the cardinality (count) in the context of the so-called fuzzy quantified statements. The objective of this paper is to investigate how (and under which conditions) other aggregate functions (such as the maximum...) could be applied to fuzzy sets in a flexible query.

Journal ArticleDOI
TL;DR: This paper combines computer vision and data mining techniques to model high-level concepts for image retrieval, on the basis of basic perceptual features of the human visual system, by means of a set of fuzzy association rules.
Abstract: In this paper we combine computer vision and data mining techniques to model high-level concepts for image retrieval, on the basis of basic perceptual features of the human visual system. High-level concepts related to these features are learned and represented by means of a set of fuzzy association rules. The concepts so acquired can be used for image retrieval, with the advantage that an image need not be provided as a query. Instead, a query is formulated by using the labels that identify the learned concepts as search terms, and the retrieval process calculates the relevance of an image to the query by an inference mechanism. An additional feature of our methodology is that it can capture the user's subjectivity. For that purpose, fuzzy set theory is employed to measure the user's assessments about the fulfillment of a concept by an image.

Journal ArticleDOI
TL;DR: It was found that the diversity approach performs as well as, and in some cases better than, the supervised method when tested on the data; an information-centric approach to evaluation is also proposed.

Abstract: The paper introduces a novel approach to unsupervised text summarization, which in principle should work for any domain or genre. The novelty lies in exploiting the diversity of concepts in text for summarization, which has not received much attention in the summarization literature. We propose, in addition, what we call the information-centric approach to evaluation, where the quality of summaries is judged not in terms of how well they match human-created summaries but in terms of how well they represent their source documents in IR tasks such as document retrieval and text categorization. To find the effectiveness of our approach under the proposed evaluation scheme, we set out to examine how a system with the diversity functionality performs against one without, using the test data known as BMIR-J2. The results demonstrate a clear superiority of the diversity-based approach over a non-diversity-based approach. The paper also addresses the question of how closely the diversity approach models human judgments on summarization. We have created a relatively large volume of data annotated for relevance to summarization by human subjects. We have trained a decision tree-based summarizer using the data, and examined how the diversity method compares with the supervised method in performance when tested on the data. It was found that the diversity approach performs as well as, and in some cases better than, the supervised method.

Journal ArticleDOI
TL;DR: A more systematic, multidimensional approach to creating evolving classification/indexing schemes, based on where the user is and what she is trying to do at that moment during the search session is proposed.
Abstract: In this article and two other articles which conceptualize a future stage of the research program (Leide, Cole, Large, & Beheshti, submitted for publication; Cole, Leide, Large, Beheshti, & Brooks, in preparation), we map out a domain novice user's encounter with an IR system from beginning to end so that appropriate classification-based visualization schemes can be inserted into the encounter process. This article describes the visualization of a navigation classification scheme only. The navigation classification scheme uses the metaphor of a ship and ship's navigator traveling through charted (but unknown to the user) waters, guided by a series of lighthouses. The lighthouses contain mediation interfaces linking the user to the information store through agents created for each. The user's agent is the cognitive model the user has of the information space, which the system encourages to evolve via interaction with the system's agent. The system's agent is an evolving classification scheme created by professional indexers to represent the structure of the information store. We propose a more systematic, multidimensional approach to creating evolving classification/indexing schemes, based on where the user is and what she is trying to do at that moment during the search session.

Journal ArticleDOI
TL;DR: It is confirmed that searchers paused less frequently and for shorter periods as they progressed through searches; with more experience and practice, searchers moved more smoothly online and the hesitation rate decreased over time.

Abstract: This research used an information processing approach to analyze the pausal behavior of end-users. It is based on viewing the search as a series of actions and pauses (rests). The end-users were 41 students and 3 faculty. After instruction, subjects searched throughout the semester, performing 79 searches. This study identified reasons for pausing, the location of pauses, the hesitation rate, and changes in pausal behavior over time. It confirms that searchers paused less frequently and for shorter periods as they progressed through searches; with more experience and practice, searchers moved more smoothly online, and the hesitation rate decreased over time. Over a series of searches, or cycles within long searches, searchers gradually began to chunk more information between pauses. However, the duration of pauses did not vary significantly over time.

Journal ArticleDOI
TL;DR: Differences in use between age groups are found, with patients aged 15 and above being more likely to be responsible for the initial strong surge in use, and differences in use over time are identified.

Abstract: Touch screen kiosks have become a popular way of delivering health information, and much of the associated user activity is automatically and routinely recorded in electronic log files. These are a significant source for understanding consumer information seeking behaviour. However, there has been no evaluation in the literature of the long-term use of kiosks. This paper seeks to fill that gap in our knowledge and examines the use of a sample of 20 kiosks over three and a half years. It identifies four patterns of use over time (a declining pattern, a stable pattern, an increasing pattern, and a no-trend pattern) and discusses the implications of these patterns. For 75% of the patterns reported there is an initial strong take-up in use, often more than a doubling in use, in the first four to five months after a kiosk is installed; this is followed by a just as rapid decline in use. The research found differences in use between age groups, with patients aged 15 and above being more likely to be responsible for the initial strong surge in use. The research reported here forms part of a Department of Health (DoH) funded study which is evaluating the use and impact of more than fifty health kiosks located in all kinds of locations throughout the United Kingdom.