
Showing papers in "Information Processing and Management in 2004"


Journal ArticleDOI
TL;DR: A multi-document summarizer, MEAD, is presented, which generates summaries using cluster centroids produced by a topic detection and tracking system and an evaluation scheme based on sentence utility and subsumption is applied.
Abstract: We present a multi-document summarizer, MEAD, which generates summaries using cluster centroids produced by a topic detection and tracking system. We describe two new techniques, a centroid-based summarizer, and an evaluation scheme based on sentence utility and subsumption. We have applied this evaluation to both single and multiple document summaries. Finally, we describe two user studies that test our models of multi-document summarization.

1,121 citations
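
The centroid idea above can be sketched in a few lines. This is an illustrative toy (the function name and the smoothed idf are assumptions of this sketch), not MEAD's actual implementation:

```python
import math
from collections import Counter

def centroid_summary(documents, k=2):
    """Rank sentences by a tf-idf centroid of the cluster; return top-k.
    Each document is a list of sentences; each sentence a list of tokens."""
    n_docs = len(documents)
    df = Counter()  # document frequency of each term
    for doc in documents:
        for term in {w for sent in doc for w in sent}:
            df[term] += 1
    all_sents = [sent for doc in documents for sent in doc]
    tf = Counter(w for sent in all_sents for w in sent)
    # Centroid weight: cluster-wide term frequency times a smoothed idf.
    centroid = {t: tf[t] * math.log((n_docs + 1) / df[t]) for t in tf}
    # Score each sentence by the summed centroid weights of its terms.
    ranked = sorted(all_sents,
                    key=lambda s: sum(centroid.get(w, 0.0) for w in s),
                    reverse=True)
    return ranked[:k]
```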


Journal ArticleDOI
TL;DR: This paper combines the language modeling and inference network approaches into a single framework that allows structured queries to be evaluated using language modeling estimates and reaffirms that high quality structured queries outperform unstructured queries.
Abstract: The inference network retrieval model, as implemented in the InQuery search engine, allows for richly structured queries. However, it incorporates a form of ad hoc tf.idf estimates for word probabilities. Language modeling offers more formal estimation techniques. In this paper we combine the language modeling and inference network approaches into a single framework. The resulting model allows structured queries to be evaluated using language modeling estimates. We explore the issues involved, such as combining beliefs and smoothing of proximity nodes. Experimental results are presented comparing the query likelihood model, the InQuery system, and our new model. The results reaffirm that high quality structured queries outperform unstructured queries and show that our system consistently achieves higher average precision than InQuery.

310 citations
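
The language-modeling estimates referred to here can be illustrated with a plain query-likelihood scorer using Jelinek-Mercer smoothing; this is a generic sketch of that family of estimates, not the combined inference-network model of the paper:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Log query-likelihood of `doc`, with Jelinek-Mercer smoothing
    against the whole collection (all arguments are token lists)."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    dl, cl = len(doc), len(collection)
    score = 0.0
    for term in query:
        p = (1 - lam) * (doc_tf[term] / dl) + lam * (coll_tf[term] / cl)
        if p == 0.0:
            return float("-inf")  # term absent from the entire collection
        score += math.log(p)
    return score
```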


Journal ArticleDOI
TL;DR: When looking for human information resources, the engineers most frequently selected sources with which they were familiar, while saving time was the most frequently mentioned reason for selecting documentary sources.
Abstract: Numerous studies of engineers' information seeking behavior have found that accessibility was the factor that influenced most their selection of information sources. The concept of accessibility, however, is ambiguous and was given various interpretations by both researchers and engineers. Detailed interviews with 32 engineers, in which they described incidents of personal information seeking in depth, uncovered some of the specific factors that are part of the concept. Engineers selected sources because they had the right format, the right level of detail, a lot of information in one place, as well as for other reasons. When looking for human information resources, the engineers most frequently selected sources with which they were familiar, while saving time was the most frequently mentioned reason for selecting documentary sources. Future research should continue to examine the concept of accessibility through detailed empirical investigations.

276 citations


Journal ArticleDOI
TL;DR: It is concluded that the coverage bias does exist but this is due not to deliberate choices of the search engines but occurs as a natural result of cumulative advantage effects of US sites on the Web.
Abstract: Commercial search engines are now playing an increasingly important role in Web information dissemination and access. Of particular interest to business and national governments is whether the big engines have coverage biased towards the US or other countries. In our study we tested for national biases in three major search engines and found significant differences in their coverage of commercial Web sites. The US sites were much better covered than the others in the study: sites from China, Taiwan and Singapore. We then examined the possible technical causes of the differences and found that the language of a site does not affect its coverage by search engines. However, the visibility of a site, measured by the number of links to it, affects its chance of being covered by search engines. We conclude that the coverage bias does exist, but that it is due not to deliberate choices by the search engines; rather, it occurs as a natural result of cumulative advantage effects of US sites on the Web. Nevertheless, the bias remains a cause for international concern.

267 citations


Journal ArticleDOI
TL;DR: A set of measurements is proposed for evaluating Web search engine performance that are adapted from the concepts of recall and precision and newly developed to evaluate search engine stability, an issue unique to Web information retrieval systems.
Abstract: A set of measurements is proposed for evaluating Web search engine performance. Some measurements are adapted from the concepts of recall and precision, which are commonly used in evaluating traditional information retrieval systems. Others are newly developed to evaluate search engine stability, an issue unique to Web information retrieval systems. An experiment was conducted to test these new measurements by applying them to a performance comparison of three commercial search engines: Google, AltaVista, and Teoma. Twenty-four subjects ranked four sets of Web pages and their rankings were used as benchmarks against which to compare search engine performance. Results show that the proposed measurements are able to distinguish search engine performance very well.

195 citations
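
Precision-style measures of the kind adapted here are easy to state; the stability measure below is only an illustrative stand-in, since the abstract does not reproduce the paper's own definition:

```python
def precision_at_k(results, relevant, k):
    """Fraction of the top-k results judged relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

def overlap_stability(run_a, run_b):
    """Jaccard overlap of two result lists retrieved at different times:
    an assumed, illustrative stability measure (the paper's may differ)."""
    a, b = set(run_a), set(run_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```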


Journal ArticleDOI
TL;DR: A conceptual framework is developed that explains IE as stopping of information seeking activities for a foreground problem due to noticing, examining, and capturing of information related to some background problem.
Abstract: Experimental research of opportunistic acquisition of information (OAI) is difficult to design due to the overall opacity of OAI to both information users and researchers. Information encountering (IE) is a specific type of OAI in which, during a search for information on one topic, information users accidentally come across information related to some other topic of interest. Building on our prior descriptive investigation of IE, we developed a conceptual framework that explains IE as the stopping of information seeking activities for a foreground problem due to noticing, examining, and capturing of information related to some background problem. With the objective of evoking IE in users' information behavior and recording users' actions during an IE episode, we created a controlled laboratory situation intended to trigger participants' experience of IE during an information retrieval task. We report on the methodological challenges experienced in this effort and share lessons learned for future experimental studies of opportunistic acquisition of information.

167 citations


Journal ArticleDOI
TL;DR: The importance of sentences is measured using text summarization techniques to represent a document as a vector of features with different weights according to the importance of each sentence.
Abstract: Automatic text categorization is the problem of assigning text documents to pre-defined categories. In order to classify text documents, we must extract useful features. In previous research, a text document is commonly represented by the term frequency and the inverse document frequency of each feature. Since there is a difference between important and unimportant sentences in a document, features from more important sentences should be weighted more heavily than other features. In this paper, we measure the importance of sentences using text summarization techniques. Then we represent a document as a vector of features with different weights according to the importance of each sentence. To verify our new method, we conduct experiments using newsgroup data sets in two languages: one written in English and the other in Korean. Four kinds of classifiers are used in our experiments: Naive Bayes, Rocchio, k-NN, and SVM. We observe that our new method yields a significant improvement for all these classifiers on both data sets.

139 citations
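
The core idea, scaling each term occurrence by the importance of the sentence it came from, can be sketched as follows (the paper derives the importance scores from summarization techniques; here they are simply given as inputs):

```python
from collections import Counter

def weighted_term_vector(sentences, importance):
    """Document vector in which each term count is scaled by the
    importance weight of the sentence containing it."""
    vec = Counter()
    for sent, w in zip(sentences, importance):
        for term in sent:
            vec[term] += w
    return vec
```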


Journal ArticleDOI
TL;DR: Two other types of source credibility, verifiable credibility and cost-effort credibility, play a significant role in shaping students' perceptions of credibility.
Abstract: The purpose of this study is to investigate factors influencing students' perception of the credibility of scholarly information on the web. In addition to the four types of source credibility proposed by previous studies (presumed credibility, reputed credibility, surface credibility, and experienced credibility), this study shows that two other types of source credibility (verifiable credibility and cost-effort credibility) play a significant role in shaping students' perceptions of credibility. Circumstances that affect students' willingness to accept scholarly information on the web are identified. Implications for web system design are also discussed.

138 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel ranking function discovery framework based on Genetic Programming and shows through various experiments how this new framework helps automate the ranking function design/discovery process.
Abstract: Ranking functions play a substantial role in the performance of information retrieval (IR) systems and search engines. Although there are many ranking functions available in the IR literature, various empirical evaluation studies show that ranking functions do not perform consistently well across different contexts (queries, collections, users). Moreover, it is often difficult and very expensive for human beings to design optimal ranking functions that work well in all these contexts. In this paper, we propose a novel ranking function discovery framework based on Genetic Programming and show through various experiments how this new framework helps automate the ranking function design/discovery process.

129 citations
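
A genetic-programming discovery loop of the kind described can be sketched with random expression trees over toy per-document features. Everything below (the operator set, the features, the pairwise-ordering fitness) is an assumption of this sketch, not the authors' actual setup:

```python
import random

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
TERMINALS = ["tf", "idf", "dl"]  # toy ranking features

def random_tree(depth=2):
    """Random expression tree: a terminal or (op, left, right)."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, feats):
    if isinstance(tree, str):
        return feats[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, feats), evaluate(right, feats))

def fitness(tree, judged):
    """Fraction of (relevant, non-relevant) feature-dict pairs that the
    candidate ranking function orders correctly."""
    ok = sum(evaluate(tree, rel) > evaluate(tree, non) for rel, non in judged)
    return ok / len(judged)

def discover(judged, generations=10, pop=20):
    """Evolve by truncation selection plus random reinitialisation."""
    population = [random_tree() for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, judged), reverse=True)
        population = population[: pop // 2] + [random_tree() for _ in range(pop - pop // 2)]
    return max(population, key=lambda t: fitness(t, judged))
```

A full GP system would add crossover and subtree mutation; the skeleton above only shows the generate-evaluate-select cycle.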


Journal ArticleDOI
TL;DR: Undergraduates at higher stages of epistemological development exhibited the ability to handle conflicting information sources and to recognize authoritative information sources.
Abstract: During the fall 2001 semester 15 first-year undergraduates were interviewed about their information-seeking behavior. Undergraduates completed a short-answer questionnaire, the Measure of Epistemological Reflection, measuring their epistemological beliefs, and searched the Web and an online public access catalog using tasks from the Reflective Judgment Interview that assessed their reflective judgment level. Undergraduates talked aloud while searching these digital environments, describing the decisions they were making about the information they encountered, while transaction analysis software (Lotus ScreenCam) recorded both their search moves and their decision-making through verbal protocol analysis. Analyses included examining the relationship between undergraduates' epistemological beliefs and reflective judgment and how they searched for information in these digital environments. Results indicated that there was a relationship between epistemological beliefs, reflective judgment, and information-seeking behavior. Undergraduates at higher stages of epistemological development exhibited the ability to handle conflicting information sources and to recognize authoritative information sources.

126 citations


Journal ArticleDOI
TL;DR: This work proposes a generative model able to handle both structure and content which is based on Bayesian networks and shows how to transform it into a discriminant classifier using the method of Fisher kernel.
Abstract: Recently, a new community has started to emerge around the development of new information retrieval methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of information content (text, image, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, able to handle both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content information (here text and images). The model was tested on three databases: the classical WebKB corpus composed of HTML pages, the new INEX corpus, which has become a reference in the field of ad-hoc retrieval for XML documents, and a multimedia corpus of Web pages.

Journal ArticleDOI
TL;DR: The findings reveal that the segmentation approach has an effect on IR effectiveness and that better IR results are obtained by using the same method for query and document processing, as this increases the probability of a query-document match.
Abstract: A set of IR experiments was carried out to study the impact of Chinese word segmentation and its effect on information retrieval (IR) at the Division of Information Studies, Nanyang Technological University, Singapore. A total of four automatic character-based segmentation approaches and a manual word segmentation approach were first carried out to obtain the word segments for indexing and to evaluate the segmentation accuracy of the automatic approaches. The IR experiments studied both the influence of different document segmentation approaches on IR effectiveness and the methods used for query segmentation. Traditional recall and precision measures were used to gauge IR effectiveness. A number of queries were selected and subjected to further detailed analysis to explore the influence of word segmentation on IR. The findings reveal that the segmentation approach has an effect on IR effectiveness. Better IR results are obtained by using the same method for query and document processing, as this increases the probability of a query-document match. The recognition of a higher number of 2-character words generally contributes to the improvement of IR effectiveness. However, manual segmentation does not always work better than character-based segmentation, as a result of the existence of longer words with more than two characters. No evidence is found that ambiguous words resulting from the segmentation process significantly affect IR.
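
One widely used character-based approach in such experiments is overlapping character bigrams, which matches the abstract's emphasis on 2-character words; a minimal sketch (not necessarily one of the four approaches the authors tested):

```python
def bigram_segment(text):
    """Index a string as overlapping character bigrams; a string shorter
    than two characters is returned as a single segment."""
    if len(text) < 2:
        return [text]
    return [text[i:i + 2] for i in range(len(text) - 1)]
```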

Journal ArticleDOI
TL;DR: This study characterizes one individual's information behaviour across different daily life situations, to seek behavioural patterns that might be associated with various aspects of each information seeking situation.
Abstract: This study addresses the lack of attention in the literature paid to detailed analysis of individuals' information behaviour in daily life contexts. In particular, the study characterizes one individual's information behaviour across different daily life situations, to seek behavioural patterns that might be associated with various aspects of each information seeking situation. Data were collected through participant diaries and subsequent oral interviews. This study reports on source selection and the influence of various aspects of the situations described. These aspects were identified from analysis of the interview transcripts, and include time constraints and pressures, motivation for the information need, context of the information need, type of initiating event, location of information seeking activities, intended application of the information found, and source type.

Journal ArticleDOI
TL;DR: This work proposes a framework to guide design decisions to enhance computer-mediated situation awareness during scientific research collaboration and suggests that situation awareness is comprised of contextual, task and process, and socio-emotional information.
Abstract: When collaborating, individuals rely on situation awareness (the gathering, incorporation and utilization of environmental information) to help them combine their unique knowledge and skills and achieve their goals. When collaborating across distances, situation awareness is mediated by technology. There are few guidelines to help system analysts design systems or applications that support the creation and maintenance of situation awareness for teams or groups. We propose a framework to guide design decisions to enhance computer-mediated situation awareness during scientific research collaboration. The foundation for this framework is previous research in situation awareness and virtual reality, combined with our analysis of interviews with and observations of collaborating scientists. The framework suggests that situation awareness is comprised of contextual, task and process, and socio-emotional information. Research in virtual reality systems suggests control, sensory, distraction and realism attributes of technology contribute to a sense of presence [Presence 7 (1998) 225]. We suggest that consideration of these attributes with respect to contextual, task and process, and socio-emotional information provides insights to guide design decisions. We used the framework when designing a scientific collaboratory system. Results from a controlled experimental evaluation of the collaboratory system help illustrate the framework's utility.

Journal ArticleDOI
TL;DR: The principles of LIS are presented, the constraints they impose on the expression of logics, and hints for their effective implementation are presented.
Abstract: Logical information systems (LIS) use logic in a uniform way to describe their contents, to query it, to navigate through it, to analyze it, and to maintain it. They can be given an abstract specification that does not depend on the choice of a particular logic, and concrete instances can be obtained by instantiating this specification with a particular logic. In fact, a logic plays in a LIS the role of a schema in databases. We present the principles of LIS, the constraints they impose on the expression of logics, and hints for their effective implementation.

Journal ArticleDOI
TL;DR: A comparative time-based Web study of US-based Excite and Norwegian-based Fast Web search logs explores variations in user searching related to changes in time of the day, suggesting fluctuations in Web user behavior over the day.
Abstract: Understanding Web searching behavior is important in developing more successful and cost-efficient Web search engines. We provide results from a comparative time-based Web study of US-based Excite and Norwegian-based Fast Web search logs, exploring variations in user searching related to changes in time of the day. Findings suggest: (1) Web user behavior fluctuates over the day, (2) in the mornings, users investigate query results for much longer, and both query submissions and the number of users are much higher, and (3) some query characteristics, including terms per query and query reformulation, remain steady throughout the day. Implications and further research are discussed.

Journal ArticleDOI
TL;DR: This study introduces an automatic Web search engine evaluation method as an efficient and effective assessment tool for such systems and shows that the observed consistencies are statistically significant, indicating that the new method can be successfully used in the evaluation of Web search engines.
Abstract: Measuring the information retrieval effectiveness of World Wide Web search engines is costly because of the human relevance judgments involved. However, it is important for both businesses and individuals to know the most effective Web search engines, since such search engines help their users find a higher number of relevant Web pages with less effort. Furthermore, this information can be used for several practical purposes. In this study we introduce an automatic Web search engine evaluation method as an efficient and effective assessment tool for such systems. Experiments based on eight Web search engines, 25 queries, and binary user relevance judgments show that our method provides results consistent with human-based evaluations. It is shown that the observed consistencies are statistically significant. This indicates that the new method can be successfully used in the evaluation of Web search engines.
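
Automatic evaluation methods of this general kind often build pseudo-relevance judgments by pooling results across engines; the sketch below uses a simple vote threshold, which may well differ from the paper's actual procedure:

```python
from collections import Counter

def pseudo_relevant(runs, min_votes=2):
    """Pages returned by at least `min_votes` engines are treated as
    pseudo-relevant (an assumed heuristic, for illustration)."""
    votes = Counter(p for run in runs for p in set(run))
    return {p for p, v in votes.items() if v >= min_votes}

def automatic_precision(run, runs, k=10, min_votes=2):
    """Precision of one engine's top-k against the pooled pseudo-judgments."""
    judged = pseudo_relevant(runs, min_votes)
    top = run[:k]
    return sum(1 for p in top if p in judged) / len(top)
```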

Journal ArticleDOI
TL;DR: In this study, a search tool, TOPIC, is compared with three other widely used tools that retrieve information from the Web: AltaVista, Google, and Lycos and insight is gained into which search tools are outputting reputable sites for Web users.
Abstract: In this study, we compare a search tool, TOPIC, with three other widely used tools that retrieve information from the Web: AltaVista, Google, and Lycos. These tools use different techniques for outputting and ranking Web sites: external link structure (TOPIC and Google) and semantic content analysis (AltaVista and Lycos). TOPIC purports to output, and highly rank within its hit list, reputable Web sites for searched topics. In this study, 80 participants reviewed the output (i.e., highly ranked sites) from each tool and assessed the quality of retrieved sites. The 4800 individual assessments of 240 sites representing 12 topics indicated that Google tends to identify and highly rank significantly more reputable Web sites than TOPIC, which, in turn, outputs more than AltaVista and Lycos, but this was not consistent from topic to topic. Metrics derived from reputation research were used in the assessment, and a factor analysis was employed to identify a key factor, which we call 'repute'. The results of this research include insight into the factors that Web users consider in formulating perceptions of Web site reputation, and insight into which search tools are outputting reputable sites for Web users. Our findings, we believe, have implications for Web users and suggest the need for future research to assess the relationship between Web page characteristics and their perceived reputation.

Journal ArticleDOI
TL;DR: It is found that sexual and non-sexual-related queries exhibited differences in session duration, query outcomes, and search term choices, which have implications for sexual information seeking and Web systems.
Abstract: Sexuality on the Internet takes many forms and channels, including chat rooms discussions, accessing Websites or searching Web search engines for sexual materials. The study of Web sexual queries provides insight into sexual-related information-seeking behavior, of value to Web users and providers alike. We qualitatively analyzed 58,027 queries from a log of 1,025,910 Excite Web user queries from 1999. We found that sexual and non-sexual-related queries exhibited differences in session duration, query outcomes, and search term choices. Implications for sexual information seeking and Web systems are discussed.

Journal ArticleDOI
TL;DR: This introductory paper presents a short bibliographical review of the works that have applied Bayesian networks to IR and briefly describes the papers in this special issue, which give a good indication of some of the new trends in the application of Bayesian networks to IR.
Abstract: Bayesian networks, which nowadays constitute the dominant approach for managing probability within the field of Artificial Intelligence, have been applied to Information Retrieval (IR) in different ways during the last 15 years, to solve a wide range of problems where uncertainty is an important feature. In this introductory paper, we first present a short bibliographical review of the works that have applied Bayesian networks to IR. The objective is not to cover every approach thoroughly, but rather to provide a brief guide for those researchers who wish to start studying this area. Second, we briefly describe the papers in this special issue, which give a good indication of some of the new trends in the application of Bayesian networks to IR.

Journal ArticleDOI
TL;DR: The results show that using a combination of all three gives the highest probability of identifying similar sites, but surprisingly this was only a marginal improvement over using links alone.
Abstract: A common task in both Webmetrics and Web information retrieval is to identify a set of Web pages or sites that are similar in content. In this paper we assess the extent to which links, colinks and couplings can be used to identify similar Web sites. As an experiment, a random sample of 500 pairs of domains from the UK academic Web were taken and human assessments of site similarity, based upon content type, were compared against ratings for the three concepts. The results show that using a combination of all three gives the highest probability of identifying similar sites, but surprisingly this was only a marginal improvement over using links alone. Another unexpected result was that high values for either colink counts or couplings were associated with only a small increased likelihood of similarity. The principal advantage of using couplings and colinks was found to be greater coverage in terms of a much larger number of pairs of sites being connected by these measures, instead of increased probability of similarity. In information retrieval terminology, this is improved recall rather than improved precision.
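
Colinks and couplings are straightforward to compute from a link graph; a minimal sketch, with `links` mapping each page to its set of outlink targets (a representation assumed here for illustration):

```python
def colinks(links, a, b):
    """Pages that link to both a and b."""
    return {p for p, targets in links.items() if a in targets and b in targets}

def couplings(links, a, b):
    """Outlink targets shared by a and b (bibliographic coupling)."""
    return links.get(a, set()) & links.get(b, set())
```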

Journal ArticleDOI
TL;DR: This study presents a snapshot of the state-of-the-art in digital reference as of late 2001-early 2002, and validates the general process model of asynchronous digital reference.
Abstract: This paper describes a study conducted to determine the paths digital reference services take through a general process model of asynchronous digital reference. A survey based on the general process model was conducted; each decision point in this model provided the basis for at least one question. Common, uncommon, and wished-for practices are identified, as well as correlations between characteristics of services and the practices employed by those services. Identification of such trends has implications for the development of software tools for digital reference. This study presents a snapshot of the state-of-the-art in digital reference as of late 2001-early 2002, and validates the general process model of asynchronous digital reference.

Journal ArticleDOI
TL;DR: It is shown that the SST method presents a new approach in information seeking and retrieval by focusing on the search process as a phenomenon and by explicating how different information seeking factors directly affect thesearch process.
Abstract: The article presents the search situation transition (SST) method for analysing Web information search (WIS) processes. The idea of the method is to analyse searching behaviour, the process, in detail and to connect the searcher's actions (captured in a log) with his/her intentions and goals, which log analysis never captures. On the other hand, ex post facto surveys, while popular in WIS research, cannot capture the actual search processes. The method is presented through three facets: its domain, its procedure, and its justification. The method's domain is presented in the form of a conceptual framework which maps five central categories that influence WIS processes: the searcher, the social/organisational environment, the work task, the search task, and the process itself. The method's procedure includes various techniques for data collection and analysis. The article presents examples from real WIS processes and shows how the method can be used to identify the interplay of the categories during the processes. It is shown that the method presents a new approach in information seeking and retrieval by focusing on the search process as a phenomenon and by explicating how different information seeking factors directly affect the search process.

Journal ArticleDOI
TL;DR: The evaluation of the prototype WiIRE indicated significant cost efficiencies in the conduct of IIR studies, and additionally yielded some novel findings about the human perspective: about half of the participants would have preferred some personal contact with the researcher, and participants spent a significantly decreasing amount of time on tasks over the course of a session.
Abstract: We introduce WiIRE, a prototype system for conducting interactive information retrieval (IIR) experiments via the Internet. We conceived WiIRE to increase validity while streamlining procedures and adding efficiencies to the conduct of IIR experiments. The system incorporates password-controlled access, online questionnaires, study instructions and tutorials, conditional interface assignment, and conditional query assignment, as well as provision for data collection. As an initial evaluation, we used WiIRE in-house to conduct a Web-based IIR experiment using an external search engine with customized search interfaces and the TREC 11 Interactive Track search queries. Our evaluation of the prototype indicated significant cost efficiencies in the conduct of IIR studies, and additionally yielded some novel findings about the human perspective: about half of the participants would have preferred some personal contact with the researcher, and participants spent a significantly decreasing amount of time on tasks over the course of a session.

Journal ArticleDOI
TL;DR: The co-training algorithm is used: a partially supervised learning algorithm in which two separate views of the training data are employed and a small number of labeled examples is augmented by a large number of unlabeled examples.
Abstract: Most document classification systems consider only the distribution of content words in documents, ignoring the syntactic information underlying them, though it is also an important factor. In this paper, we present an approach for classifying large-scale unstructured documents by incorporating both the lexical and the syntactic information of documents. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm in which two separate views of the training data are employed and a small number of labeled examples is augmented by a large number of unlabeled examples. Since the lexical and the syntactic information can serve as separate views of unstructured documents, the co-training algorithm enhances the performance of document classification using both of them together with a large number of unlabeled documents. Experimental results on the Reuters-21578 corpus and the TREC-7 filtering documents show the effectiveness of unlabeled documents and of using both the lexical and the syntactic information.
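
The co-training loop itself is generic and can be sketched independently of the lexical and syntactic views used in the paper; the trainer interface below (trainers take (view, label) pairs and return a predictor yielding (label, confidence)) is an assumption of this sketch:

```python
def co_train(labeled, unlabeled, train_a, train_b, rounds=5, per_round=2):
    """Each example is ((view_a, view_b), label). In each round, the two
    view-specific classifiers label the pool, and the most confidently
    labeled items are moved into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        pred_a = train_a([(x[0], y) for x, y in labeled])
        pred_b = train_b([(x[1], y) for x, y in labeled])
        scored = []
        for x in pool:
            la, ca = pred_a(x[0])
            lb, cb = pred_b(x[1])
            scored.append((max(ca, cb), la if ca >= cb else lb, x))
        scored.sort(key=lambda t: t[0], reverse=True)
        for _, y, x in scored[:per_round]:
            labeled.append((x, y))
            pool.remove(x)
    return labeled
```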

Journal ArticleDOI
TL;DR: A data structure is developed to support structured document searching and document ranking is examined and adapted specifically for structured searching.
Abstract: Structured document interchange formats such as XML and SGML are ubiquitous; however, information retrieval systems supporting structured searching are not. Structured searching can result in increased precision. A search for the author "Smith" in an unstructured corpus of documents specializing in iron-working could have a lower precision than a structured search for "Smith as author" in the same corpus. Analysis of XML retrieval languages identifies additional functionality that must be supported, including searching at, and broken across, multiple nodes in the document tree. A data structure is developed to support structured document searching. Application of this structure to information retrieval is then demonstrated. Document ranking is examined and adapted specifically for structured searching.

Journal ArticleDOI
TL;DR: This paper presents an efficient two-stage image retrieval system with high efficacy based on two novel texture features, the composite sub-band gradient (CSG) vector and the energy distribution pattern (EDP)-string.
Abstract: Efficacy and efficiency are two important issues in designing content-based image retrieval systems. In this paper, we present an efficient two-stage image retrieval system with high efficacy based on two novel texture features, the composite sub-band gradient (CSG) vector and the energy distribution pattern (EDP)-string. Both features are generated from the sub-images of a wavelet decomposition of the original image. At the first stage, a fuzzy matching process based on EDP-strings is performed and serves as a signature filter to quickly remove a large number of non-promising database images from further consideration. At the second stage, the images passing through the filter are compared with the query image based on their CSG vectors for detailed feature inspection. By exercising a database of 2400 images obtained from the Brodatz album, we demonstrate that both high efficacy and high efficiency can be achieved simultaneously by our proposed system.
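
The two-stage architecture, a cheap signature filter followed by a detailed vector comparison, can be sketched generically; the CSG and EDP features themselves are defined in the paper and are not reproduced here:

```python
def two_stage_retrieve(query_sig, query_vec, database, coarse_match, distance, keep=10):
    """Stage one: a cheap signature test prunes the database.
    Stage two: the survivors are ranked by a detailed vector distance.
    `database` items are dicts with 'id', 'sig' and 'vec' keys (an
    assumed representation, for illustration)."""
    survivors = [item for item in database if coarse_match(query_sig, item["sig"])]
    survivors.sort(key=lambda item: distance(query_vec, item["vec"]))
    return [item["id"] for item in survivors[:keep]]
```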

Journal ArticleDOI
TL;DR: Two methods to assess the reliability of link counts are developed and applied to judge which of seven advanced document models are most appropriate in each case, and it is demonstrated that the choice of model can affect the final results.
Abstract: Whilst hyperlinks within Web sites may be primarily created for navigation purposes, those between sites are a rich source of information about the content and use of the Web. As a result there is a need to derive descriptive statistics about them, both to help understand the underlying communication processes and so that policy makers can gain insights into the use of online information by those located within their constituency. It is known, however, that using the individual Web link source page as the basic unit of counting is problematic because of the number and size of link anomalies. The challenge addressed in this paper is that of developing methods to assess techniques for counting links from groups of large university Web sites (site outlinks). Two methods to assess the reliability of link counts are developed and applied to judge which of seven advanced document models are most appropriate in each case. The most generally applicable method used is an internal consistency test based upon a highly simplified model of Web linking behaviour. The data used comes from crawls of UK, Australian and New Zealand universities. The standard domain advanced Web document model emerges as the logical choice for comparison purposes within this set. Some descriptive statistics concerning Top Level Domain link targets are given and it is demonstrated that the choice of model can affect the final results.
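The effect of choosing a larger counting unit can be illustrated with a small sketch. The URLs below are invented, and the two models shown (page-level counting versus collapsing all pages on one source domain into a single unit) are only a simplified stand-in for the seven advanced document models evaluated in the paper.

```python
from urllib.parse import urlparse

# (source page URL, target site) pairs, as a crawler might record them.
# The three /news/ pages stand in for a near-duplicate-page link anomaly.
links = [
    ("http://www.uni-a.ac.uk/news/1.html", "uni-b.ac.uk"),
    ("http://www.uni-a.ac.uk/news/2.html", "uni-b.ac.uk"),
    ("http://www.uni-a.ac.uk/news/3.html", "uni-b.ac.uk"),
    ("http://cs.uni-a.ac.uk/index.html", "uni-b.ac.uk"),
]

def count_links(links, model):
    """Count each (counting unit, target) pair once under the given model."""
    units = set()
    for source, target in links:
        if model == "page":
            unit = source                      # every source page counts
        elif model == "domain":
            unit = urlparse(source).netloc     # one unit per source domain
        units.add((unit, target))
    return len(units)

print(count_links(links, "page"))    # 4: the anomaly inflates the count
print(count_links(links, "domain"))  # 2: www.uni-a and cs.uni-a collapse it
```

Aggregating sources into coarser units damps the influence of replicated pages, which is why the choice of document model changes the final statistics.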

Journal ArticleDOI
TL;DR: An Information Retrieval System (IRS) which is able to work with structured document collections is presented, based on the influence diagrams formalism: a generalization of Bayesian Networks that provides a visual representation of a decision problem.
Abstract: In this paper we present an Information Retrieval System (IRS) which is able to work with structured document collections. The model is based on the influence diagram formalism: a generalization of Bayesian networks that provides a visual representation of a decision problem. Influence diagrams offer an intuitive way to identify and display the essential elements of the domain (the structured document components and their usefulness) and how these are related to each other. They also carry associated quantitative knowledge that measures the strength of these interactions. By means of this approach, we present structured retrieval as a decision-making problem. Two different models have been designed: SID (Simple Influence Diagram) and CID (Context-based Influence Diagram). The main difference between the two is that the latter also takes into account influences provided by the context in which each structural component is located.
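A minimal sketch of retrieval as a decision problem, in the spirit of the influence-diagram formulation: each structural component is retrieved only when the expected utility of retrieving it exceeds that of not retrieving it. The probabilities and utilities below are invented for illustration and are not the paper's SID/CID estimates.

```python
def expected_utility(p_rel, u_ret_rel=1.0, u_ret_irr=-0.5):
    """Expected utility of retrieving a unit with relevance probability p_rel."""
    return p_rel * u_ret_rel + (1 - p_rel) * u_ret_irr

def decide(components):
    # Not retrieving is assigned utility 0 regardless of relevance, so a
    # component is retrieved only when its expected utility is positive.
    return [name for name, p_rel in components if expected_utility(p_rel) > 0]

# Hypothetical structural components with relevance probabilities.
units = [("title", 0.8), ("section-1", 0.4), ("section-2", 0.2)]
print(decide(units))  # only units whose retrieval has positive expected utility
```

Under these example utilities the break-even relevance probability is 1/3, so "section-2" is left unretrieved even though it has nonzero probability of relevance.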

Journal ArticleDOI
TL;DR: A user task-oriented evaluation has shown that the proposed interface provides the user with an intuitive and powerful set of tools for structured document searching, retrieved list navigation, and search refinement.
Abstract: Past research has shown that graphical user interfaces (GUIs) can significantly improve the effectiveness of the information access task. Our work is based on the consideration that structured document retrieval requires different graphical user interfaces from standard information retrieval. In structured document retrieval, a GUI has to enable a user to query, to browse retrieved documents, and to provide query refinement and relevance feedback based not only on full documents but also on specific document parts in relation to the document structure. In this paper, we present a new GUI for structured document retrieval specifically designed for hierarchically structured documents. A user task-oriented evaluation has shown that the proposed interface provides the user with an intuitive and powerful set of tools for structured document searching, retrieved-list navigation, and search refinement.