Showing papers on "Document retrieval published in 2000"

PDF

Open Access

Book Chapter•DOI•

Identifying and Filtering Near-Duplicate Documents

[...]

21 Jun 2000

TL;DR: The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.

...read moreread less

Abstract: The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size "sketch" for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for effcient large scale web indexing it is not necessary to determine the actual resemblance value: it suffces to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffces to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a "sample" of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest. The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.

...read moreread less

465 citations

Proceedings Article•DOI•

Building a question answering test collection

[...]

Ellen M. Voorhees¹, Dawn M. Tice¹•Institutions (1)

National Institute of Standards and Technology¹

01 Jul 2000

TL;DR: The TREC-8 Question Answering (QA) Track was the first large-scale evaluation of domain-independent question answering systems and was used to investigate whether the evaluation methodology used for document retrieval is appropriate for a different natural language processing task.

...read moreread less

Abstract: The TREC-8 Question Answering (QA) Track was the first large-scale evaluation of domain-independent question answering systems. In addition to fostering research on the QA task, the track was used to investigate whether the evaluation methodology used for document retrieval is appropriate for a different natural language processing task. As with document relevance judging, assessors had legitimate differences of opinions as to whether a response actually answers a question, but comparative evaluation of QA systems was stable despite these differences. Creating a reusable QA test collection is fundamentally more difficult than creating a document retrieval test collection since the QA task has no equivalent to document identifiers.

...read moreread less

463 citations

Proceedings Article•

The TREC spoken document retrieval track: a success story

[...]

John S. Garofolo¹, Cedric G. P. Auzanne¹, Ellen M. Voorhees¹•Institutions (1)

National Institute of Standards and Technology¹

12 Apr 2000

TL;DR: The SDR Track can be declared a success in that it has provided objective, demonstrable proof that this technology can be successfully applied to realistic audio collections using a combination of existing technologies and that it can be objectively evaluated.

...read moreread less

Abstract: This paper describes work within the NIST Text REtrieval Conference (TREC) over the last three years in designing and implementing evaluations of Spoken Document Retrieval (SDR) technology within a broadcast news domain. SDR involves the search and retrieval of excerpts from spoken audio recordings using a combination of automatic speech recognition and information retrieval technologies. The TREC SDR Track has provided an infrastructure for the development and evaluation of SDR technology and a common forum for the exchange of knowledge between the speech recognition and information retrieval research communities. The SDR Track can be declared a success in that it has provided objective, demonstrable proof that this technology can be successfully applied to realistic audio collections using a combination of existing technologies and that it can be objectively evaluated. The design and implementation of each of the SDR evaluations are presented and the results are summarized. Plans for the 2000 TREC SDR Track are presented and thoughts about how the track might evolve are discussed.

...read moreread less

437 citations

Proceedings Article•DOI•

Bridging the lexical chasm: statistical approaches to answer-finding

[...]

Adam L. Berger¹, Rich Caruana¹, David Cohn¹, Dayne Freitag¹, Vibhu Mittal¹ - Show less +1 more•Institutions (1)

Carnegie Mellon University¹

01 Jul 2000

TL;DR: It is shown that the task of “answer-finding” differs from both document retrieval and tradition question-answering, presenting challenges different from those found in these problems.

...read moreread less

Abstract: This paper investigates whether a machine can automatically learn the task of finding, within a large collection of candidate responses, the answers to questions. The learning process consists of inspecting a collection of answered questions and characterizing the relation between question and answer with a statistical model. For the purpose of learning this relation, we propose two sources of data: Usenet FAQ documents and customer service call-center dialogues from a large retail company. We will show that the task of “answer-finding” differs from both document retrieval and tradition question-answering, presenting challenges different from those found in these problems. The central aim of this work is to discover, through theoretical and empirical investigation, those statistical techniques best suited to the answer-finding problem.

...read moreread less

345 citations

Book Chapter•DOI•

Using Noun Phrase Heads to Extract Document Keyphrases

[...]

Ken Barker¹, Nadia Cornacchia¹•Institutions (1)

University of Ottawa¹

14 May 2000-Lecture Notes in Computer Science

TL;DR: The simple noun phrase-based system performs roughly as well as a state-of-the-art, corpus-trained keyphrase extractor; ratings for individual keyphrases do not necessarily correlate with ratings for sets of keyphRases for a document.

...read moreread less

Abstract: Automatically extracting keyphrases from documents is a task with many applications in information retrieval and natural language processing. Document retrieval can be biased towards documents containing relevant keyphrases; documents can be classified or categorized based on their keyphrases; automatic text summarization may extract sentences with high keyphrase scores. This paper describes a simple system for choosing noun phrases from a document as keyphrases. A noun phrase is chosen based on its length, its frequency and the frequency of its head noun. Noun phrases are extracted from a text using a base noun phrase skimmer and an off-the-shelf online dictionary. Experiments involving human judges reveal several interesting results: the simple noun phrase-based system performs roughly as well as a state-of-the-art, corpus-trained keyphrase extractor; ratings for individual keyphrases do not necessarily correlate with ratings for sets of keyphrases for a document; agreement among unbiased judges on the keyphrase rating task is poor.

...read moreread less

270 citations

Patent•

Process and system for retrieval of documents using context-relevant semantic profiles

[...]

Herbert L. Roitblat

29 Aug 2000

TL;DR: In this paper, a neural network is used to extract semantic profiles from text corpus and a new set of documents, such as world wide web pages obtained from the Internet, are then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents.

...read moreread less

Abstract: A process and system for database storage and retrieval are described along with methods for obtaining semantic profiles from a training text corpus, i.e., text of known relevance, a method for using the training to guide context-relevant document retrieval, and a method for limiting the range of documents that need to be searched after a query. A neural network is used to extract semantic profiles from text corpus. A new set of documents, such as world wide web pages obtained from the Internet, is then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents. These semantic profiles are then organized into clusters in order to minimize the time required to answer a query. When a user queries the database, i.e., the set of documents, his or her query is similarly transformed into a semantic profile and compared with the semantic profiles of each cluster of documents. The query profile is then compared with each of the documents in that cluster. Documents with the closest weighted match to the query are returned as search results.

...read moreread less

270 citations

Patent•

System and method for context-based document retrieval

[...]

Shermann Loyall Min, Constantin Lorenzo Tanno, Zachary Mainen, William Russell Softky

28 Jul 2000

TL;DR: In this article, a system and method for text-based document retrieval is proposed, which is based on utilizing information contained in the document collection about the statistics of word relationships (context) to facilitate the specification of search queries and document comparison.

...read moreread less

Abstract: A system and method for document retrieval is disclosed The invention addresses a major problem in text-based document retrieval: rapidly finding a small subset of documents in a large document collection (eg Web pages on the Internet) that are relevant to a limited set of query terms supplied by the user The invention is based on utilizing information contained in the document collection about the statistics of word relationships (“context”) to facilitate the specification of search queries and document comparison The method consists of first compiling word relationships into a context database that captures the statistics of word proximity and occurrence throughout the document collection At retrieval time, a search matrix is computed from a set of user-supplied keywords and the context database For each document in the collection, a similar matrix is computed using the contents of the document and the context database Document relevance is determined by comparing the similarity of the search and document matrices The disclosed system therefore retrieves documents with contextual similarity rather than word frequency similarity, simplifying search specification while allowing greater search precision

...read moreread less

221 citations

Journal Article•DOI•

An analysis of research in information systems (1981-1997)

[...]

Enrique Claver, Reyes Gonzalez, Juan Llopis

01 Jun 2000-Information & Management

TL;DR: This article analyzes the articles published in the Management Information Systems Quarterly and Information & Management journals between 1981 and 1997, paying attention to the usual topics as well as strategies, both in research and by the authors.

...read moreread less

210 citations

Journal Article•DOI•

Subword-based approaches for spoken document retrieval

[...]

Kenney Ng¹, Victor W. Zue¹•Institutions (1)

Massachusetts Institute of Technology¹

01 Oct 2000-Speech Communication

TL;DR: It is found that with the appropriate subword units, it is possible to achieve performance comparable to that of text-based word units if the underlying phonetic units are recognized correctly.

...read moreread less

200 citations

Proceedings Article•DOI•

A practical query-by-humming system for a large music database

[...]

Naoko Kosugi, Yuichi Nishihara, Tetsuo Sakata, Masashi Yamamuro, Kazuhiko Kushima - Show less +1 more

30 Oct 2000

TL;DR: New technologies for improving retrieval accuracy, such as partial feature vectors and or'ed retrieval among multiple search keys, are proposed, and it is found that the retrieval accuracy increases by more than 20% compared with the previous system.

...read moreread less

Abstract: A music retrieval system that accepts hummed tunes as queries is described in this paper This system uses similarity retrieval because a hummed tune may contain errors The retrieval result is a list of song names ranked according to the closeness of the match Our ultimate goal is that the correct song should be first on the list This means that eventually our system's similarity retrieval should allow for only one correct answerThe most significant improvement our system has over general query-by-humming systems is that all processing of musical information is done based on beats instead of notes This type of query processing is robust against queries generated from erroneous input In addition, acoustic information is transcribed and converted into relative intervals and is used for making feature vectors This increases the resolution of the retrieval system compared with other general systems, which use only pitch direction informationThe database currently holds over 10,000 songs, and the retrieval time is at most one second This level of performance is mainly achieved through the use of indices for retrieval In this paper, we also report on the results of music analyses of the songs in the database Based on these results, new technologies for improving retrieval accuracy, such as partial feature vectors and or'ed retrieval among multiple search keys, are proposed The effectiveness of these technologies is evaluated quantitatively, and it is found that the retrieval accuracy increases by more than 20% compared with the previous system [9] Practical user interfaces for the system are also described

...read moreread less

182 citations

Journal Article•DOI•

Compression: a key for next-generation text retrieval systems

[...]

Nivio Ziviani, E. Silva de Moura¹, Gonzalo Navarro², Ricardo Baeza-Yates²•Institutions (2)

Federal University of Amazonas¹, University of Chile²

01 Nov 2000-IEEE Computer

TL;DR: The authors' technique combines several data compression features to provide economical storage, faster indexing, and accelerated searches.

...read moreread less

Abstract: The continually growing Web challenges information retrieval systems to deliver data quickly The authors' technique combines several data compression features to provide economical storage, faster indexing, and accelerated searches

...read moreread less

Patent•

Indexed, extensible, interactive document retrieval system

[...]

Michael Zimmermann, Joerg Thielke, Christoph Holz

08 Nov 2000

TL;DR: In this article, a document retrieval system contains a database that relates document word-pair patterns to topics, and assigns topics to each document in response to a word submitted by a requestor, the system retrieves documents containing that word, analyzes the documents to determine their wordpair patterns, matches the document patterns to database patterns that are related to topics.

...read moreread less

Abstract: An Internet or intranet based document retrieval system contains a database that relates document word-pair patterns to topics. In response to a word submitted by a requestor, the system retrieves documents containing that word, analyzes the documents to determine their word-pair patterns, matches the document patterns to database patterns that are related to topics, and thereby assigns topics to each document. If the retrieved documents are assigned to more than one topic, a list of the document topics is presented to the requestor, and the requestor designates the relevant topics. The requestor is then granted access only to documents assigned to relevant topics. A knowledge base linking search terms to documents and documents to topics is established and maintained to speed future searches.

...read moreread less

DOI•

The Eighth Text REtrieval Conference (TREC-8)

[...]

Ellen M. Voorhees, Donna Harman

01 Nov 2000

Proceedings Article•DOI•

Effective information retrieval using genetic algorithms based matching functions adaptation

[...]

Praveen Pathak¹, Michael D. Gordon², Weiguo Fan²•Institutions (2)

Purdue University¹, University of Michigan²

04 Jan 2000

TL;DR: This work looks at the possibility of applying GAs to adapt various matching functions in order to lead to a better retrieval performance than that obtained by using a single matching function.

...read moreread less

Abstract: Knowledge intensive organizations have vast array of information contained in large document repositories. With the advent of E-commerce and corporate intranets/extranets, these repositories are expected to grow at a fast pace. This explosive growth has led to huge, fragmented, and unstructured document collections. Although it has become easier to collect and store information in document collections, it has become increasingly difficult to retrieve relevant information from these large document collections. This paper addresses the issue of improving retrieval performance (in terms of precision and recall) for retrieval from document collections. There are three important paradigms of research in the area of information retrieval (1R): Probabilistic IR, Knowledge-based IR, and, Artificial Intelligence based techniques like neural networks and symbolic learning. Very few researcher have tried to use evolutionary algorithms like genetic algorithms (GAs). Previous attempts at using GAs have concentrated on modifying document representations or modifying query representations. This work looks at the possibility of applying GAs to adapt various matching functions. It is hoped that such an adaptation of the matching functions in lead to a better retrieval performance than that obtained by using a single matching function. An overall matching function is treated as an weighted combination of scores produced by individual matching functions. This overall score is asked to rank and retrieve documents. Weights associated with individual functions are searched using Genetic Algorithms. The idea is tested on a real document collection called the Cranfield collection. The results look very encouraging.

...read moreread less

Report•DOI•

Concept Indexing: A Fast Dimensionality Reduction Algorithm With Applications to Document Retrieval and Categorization

[...]

George Karypis, Eui-Hong Han¹•Institutions (1)

University of Minnesota¹

06 Mar 2000

TL;DR: Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time.

...read moreread less

Abstract: : In recent years, we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can efficiently categorize and retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limits its applicability. In this paper we present a fast dimensionality reduction algorithm, called concept indexing (CI) that is equally effective for unsupervised and supervised dimensionality reduction. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time. Moreover, when CI is used to compute the dimensionality reduction in a supervised setting, it greatly improves the performance of traditional classification algorithms such as C4.5 and kNN.

...read moreread less

Journal Article•DOI•

Applying genetic algorithms to query optimization in document retrieval

[...]

Jorng-Tzong Horng¹, Ching-Chang Yeh¹•Institutions (1)

National Central University¹

01 Sep 2000-Information Processing and Management

TL;DR: A novel approach to automatically retrieve keywords and then uses genetic algorithms to adapt the keyword weights and this approach is faster and uses less memory than the PAT-tree based approach.

...read moreread less

Abstract: This paper proposes a novel approach to automatically retrieve keywords and then uses genetic algorithms to adapt the keyword weights. One of the contributions of the paper is to combine the Bigram (Chen, A., He, J., Xu, L., Gey, F. C., & Meggs, J. 1997. Chinese text retrieval without using a dictionary , ACM SIGIR’97, Philadelphia, PA, USA, pp. 42–49; Yang, Y.-Y., Chang, J.-S., & Chen, K.-J. 1993), Document automatic classification and ranking , Master thesis, Department of Computer Science, National Tsing Hua University) model and PAT-tree structure (Chien, L.-F., Huang, T.-I., & Chien, M.-C. 1997 Pat-tree-based keyword extraction for Chinese information retrieval , ACM SIGIR’97, Philadelphia, PA, US, pp. 50–59) to retrieve keywords. The approach extracts bigrams from documents and uses the bigrams to construct a PAT-tree to retrieve keywords. The proposed approach can retrieve any type of keywords such as technical keywords and a person’s name. Effectiveness of the proposed approach is demonstrated by comparing how effective are the keywords found by both this approach and the PAT-tree based approach. This comparison reveals that our keyword retrieval approach is as accurate as the PAT-tree based approach, yet our approach is faster and uses less memory. The study then applies genetic algorithms to tune the weight of retrieved keywords. Moreover, several documents obtained from web sites are tested and experimental results are compared with those of other approaches, indicating that the proposed approach is highly promising for applications.

...read moreread less

Journal Article•DOI•

Information Retrieval from Documents: A Survey

[...]

Mandar Mitra¹, Bidyut B. Chaudhuri¹•Institutions (1)

Indian Statistical Institute¹

01 May 2000-Information Retrieval

TL;DR: This paper provides an update on Doermann's comprehensive survey of research results in the broad area of document-based information retrieval, and focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document.

...read moreread less

Abstract: Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods. Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct, text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we also briefly touch on the related problem of content-based image retrieval.

...read moreread less

Journal Article•DOI•

The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text

[...]

Paul B. Kantor¹, Ellen M. Voorhees²•Institutions (2)

Rutgers University¹, National Institute of Standards and Technology²

01 May 2000-Information Retrieval

TL;DR: The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance, and retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.

...read moreread less

Abstract: A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600 document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.

...read moreread less

Journal Article•DOI•

Ontology-driven document enrichment

[...]

Enrico Motta¹, Simon Buckingham Shum¹, John Domingue¹•Institutions (1)

Open University¹

01 Jun 2000-International Journal of Human-computer Studies \/ International Journal of Man-machine Studies

TL;DR: An approach to document enrichment is presented, which consists of developing and integrating formal knowledge models with archives of documents, to provide intelligent knowledge retrieval and (possibly) additional knowledge-intensive services, beyond what is currently available using “standard” information retrieval and search facilities.

...read moreread less

Abstract: In this paper, we present an approach to document enrichment, which consists of developing and integrating formal knowledge models with archives of documents, to provide intelligent knowledge retrieval and (possibly) additional knowledge-intensive services, beyond what is currently available using “standard” information retrieval and search facilities. Our approach is ontology-driven, in the sense that the construction of the knowledge model is carried out in a top-down fashion, by populating a given ontology, rather than in a bottom-up fashion, by annotating a particular document. In this paper, we give an overview of the approach and we examine the various types of issues (e.g. modelling, organizational and user interface issues) which need to be tackled to effectively deploy our approach in the workplace. In addition, we also discuss a number of technologies we have developed to support ontology-driven document enrichment and we illustrate our ideas in the domains of electronic news publishing, scholarly discourse and medical guidelines.

...read moreread less

Proceedings Article•DOI•

Phonetic confusion matrix based spoken document retrieval

[...]

Savitha Srinivasan¹, Dragutin Petkovic¹•Institutions (1)

IBM¹

01 Jul 2000

TL;DR: This work proposes a novel method for phonetic retrieval in the CueVideo system based on the probabilistic formulation of term weighting using phone confusion data in a Bayesian framework and evaluates this method of spoken document retrieval against word-based retrieval for the search levels identified in a realistic video-based distributed learning setting.

...read moreread less

Abstract: Combined word-based index and phonetic indexes have been used to improve the performance of spoken document retrieval systems primarily by addressing the out-of-vocabulary retrieval problem. However, a known problem with phonetic recognition is its limited accuracy in comparison with word level recognition. We propose a novel method for phonetic retrieval in the CueVideo system based on the probabilistic formulation of term weighting using phone confusion data in a Bayesian framework. We evaluate this method of spoken document retrieval against word-based retrieval for the search levels identified in a realistic video-based distributed learning setting. Using our test data, we achieved an average recall of 0.88 with an average precision of 0.69 for retrieval of out-of-vocabulary words on phonetic transcripts with 35% word error rate. For in-vocabulary words, we achieved a 17% improvement in recall over word-based retrieval with a 17% loss in precision for word error rites ranging from 35 to 65%.

...read moreread less

Patent•

System and methods for document retrieval using natural language-based queries

[...]

Pascale Fung, Chi Shun Cheung, Yi Guan, Yongsheng Yang

11 Jul 2000

TL;DR: In this article, a system and associated methods identify documents relevant to an inputted natural language user query, which includes selecting a set of keywords from the user query; determining at least one word, not necessarily found in the user queries, that is semantically similar to a keyword of the set of words; using the keywords and the at least word to determining a subset of word sets from a database of pre-stored word sets, wherein the pre-set word sets are each pre-associated with at least 1 document; determining a plurality of word set, from the subset

...read moreread less

Abstract: A system and associated methods identify documents relevant to an inputted natural-language user query. One associate method includes: selecting a set of keywords from the user query; determining at least one word, not necessarily found in the user query, that is semantically similar to a keyword of the set of keywords; using the set of keywords and the at least one word to determining a subset of word sets from a database of pre-stored word sets, wherein the pre-stored word sets are each pre-associated with at least one document; determining a plurality of word sets, from the subset of word sets, that is most semantically similar to the user query; and identifying documents that have been pre-associated with the plurality of word sets as being relevant to the natural-language user query.

...read moreread less

Book Chapter•DOI•

Information Retrieval with Conceptual Graph Matching

[...]

Manuel Montes-y-Gómez¹, Aurelio López-López, Alexander Gelbukh¹•Institutions (1)

Instituto Politécnico Nacional¹

04 Sep 2000

TL;DR: The use of conceptual graphs for the representaion of text contents in information retrievel and a method for measuring the similarity between two texts represented as conceptual graphs are presented.

...read moreread less

Abstract: The use of conceptual graphs for the representaion of text contents in information retrievel is discussed. A method for measuring the similarity between two texts represented as conceptual graphs is presented. The method is based on well-known strategies of text comparison, such as Dice coefficient, with new elements introduced due to the bipartite nature of the conceptual graphs. Examples of the representation and comparison of the phrases are given. The structure of an information retrieval system using two-level document representation, traditional keywords and conceptual graphs, is presented.

...read moreread less

Journal Article•

Information retrieval with conceptual graph matching

[...]

Manuel Montes-y-Gómez¹, Aurelio López-López, Alexander Gelbukh¹•Institutions (1)

Instituto Politécnico Nacional¹

01 Jan 2000-Lecture Notes in Computer Science

TL;DR: In this paper, a method for measuring the similarity between two texts represented as conceptual graphs is presented, based on well-known strategies of text comparison, such as Dice coefficient, with new elements introduced due to the bipartite nature of the conceptual graphs.

...read moreread less

Abstract: The use of conceptual graphs for the representation of text contents in information retrieval is discussed. A method for measuring the similarity between two texts represented as conceptual graphs is presented. The method is based on well-known strategies of text comparison, such as Dice coefficient, with new elements introduced due to the bipartite nature of the conceptual graphs. Examples of the representation and comparison of the phrases are given. The structure of an information retrieval system using two-level document representation, traditional keywords and conceptual graphs, is presented.

...read moreread less

Proceedings Article•DOI•

Effects of out of vocabulary words in spoken document retrieval (poster session)

[...]

Philip C. Woodland¹, S.E. Johnson¹, Pierre Jourlin¹, K. Sparck Jones¹•Institutions (1)

University of Cambridge¹

01 Jul 2000

TL;DR: In this article, the effects of out-of-vocabulary (OOV) items in spoken document retrieval (SDR) were investigated and the use of a parallel corpus for query and document expansion was found to be especially beneficial.

...read moreread less

Abstract: The effects of out-of-vocabulary (OOV) items in spoken document retrieval (SDR) are investigated. Several sets of transcriptions were created for the TREC-8 SDR task using a speech recognition system varying the vocabulary sizes and OOV rates, and the relative retrieval performance measured. The effects of OOV terms on a simple baseline IR system and on more sophisticated retrieval systems are described. The use of a parallel corpus for query and document expansion is found to be especially beneficial, and with this data set, good retrieval performance can be achieved even for fairly high OOV rates.

...read moreread less

Journal Article•DOI•

Text extraction from colored book and journal covers

[...]

Karin Sobottka, Heino Kronenberg, T. Perroud, Horst Bunke

01 Jun 2000-International Journal on Document Analysis and Recognition

TL;DR: An approach to automatic text extraction from colored book and journal covers is proposed and two methods have been developed for extracting text hypotheses to robustly distinguish between text and non-text elements.

...read moreread less

Abstract: . The automatic retrieval of indexing information from colored paper documents is a challenging problem. In order to build up bibliographic databases, editing by humans is usually necessary to provide information about title, authors and keywords. For automating the indexing process, the identification of text elements is essential. In this article an approach to automatic text extraction from colored book and journal covers is proposed. Two methods have been developed for extracting text hypotheses. The results of both methods are combined to robustly distinguish between text and non-text elements.

...read moreread less

Journal Article•DOI•

Natural language information retrieval: progress report

[...]

Jose Perez-Carballo¹, Tomek Strzalkowski•Institutions (1)

Rutgers University¹

01 Jan 2000

TL;DR: The ‘stream architecture’ is described, a method designed to combine evidence obtained from several different document representations that involved the use of phrases and proper names computed using Natural Language Processing techniques.

...read moreread less

Abstract: Natural language processing (NLP) techniques may hold a tremendous potential for overcoming the inadequacies of purely quantitative methods of text information retrieval, but the empirical evidence to support such predictions has thus far been inadequate and appropriate scale evaluations have been slow to emerge. In this paper, we report on the progress of the Natural Language Information Retrieval project, a joint effort of several sites led by GE Research and its evaluation the 6th Text Retrieval Conference (TREC-6). In this paper we describe the ‘stream architecture’, a method we designed to combine evidence obtained from several different document representations. Some of the document representations used in the experiments described here involved the use of phrases and proper names computed using Natural Language Processing techniques.

...read moreread less

Journal Article•DOI•

Indexing and retrieval of broadcast news

[...]

Steve Renals¹, Dave Abberley¹, David Kirby², Tony Robinson³•Institutions (3)

University of Sheffield¹, BBC Research & Development², University of Cambridge³

01 Sep 2000-Speech Communication

TL;DR: A spoken document retrieval system for British and North American Broadcast News is described, based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval (IR) system.

...read moreread less

Patent•

System and methods for determining semantic similarity of sentences

[...]

Yi Guan, Pascale Fung

11 Jul 2000

TL;DR: In this paper, a method for computing the similarity between a first and a second set of words comprises identifying a word of the second sentence as being most similar to a word in the first sentence, and computing a score of similarity between the first and second sentence based at least in part on that word.

...read moreread less

Abstract: A system and associated methods determine the semantic similarity of different sentences to one another. A particularly appropriate application of the present invention is to automatic processing of Chinese-language text, for example, for document retrieval. A method for computing the similarity between a first and a second set of words comprises identifying a word of the second set of words as being most similar to a word of the first set of words, wherein the word of the second set of words need not be identical to the word of the first set of words; and computing a score of the similarity between the first and second set of words based at least in part on the word of the second set of words.

...read moreread less

Proceedings Article•DOI•

Fast latent semantic indexing of spoken documents by using self-organizing maps

[...]

Mikko Kurimo

05 Jun 2000

TL;DR: A new latent semantic indexing (LSI) method for spoken audio documents that smoothing by the closest document clusters is important here, because the documents are often short and have a high word error rate (WER).

...read moreread less

Abstract: This paper describes a new latent semantic indexing (LSI) method for spoken audio documents. The framework is indexing broadcast news from radio and TV as a combination of large vocabulary continuous speech recognition (LVCSR), natural language processing (NLP) and information retrieval (IR). For indexing, the documents are presented as vectors of word counts, whose dimensionality is rapidly reduced by random mapping (RM). The obtained vectors are projected into the latent semantic subspace determined by SVD, where the vectors are then smoothed by a self-organizing map (SOM). The smoothing by the closest document clusters is important here, because the documents are often short and have a high word error rate (WER). As the clusters in the semantic subspace reflect the news topics, the SOMs provide an easy way to visualize the index and query results and to explore the database. Test results are reported for TREC's spoken document retrieval databases (www.idiap.ch/kurimo/thisl.html).

...read moreread less

Journal Article•DOI•

Interactive Information Retrieval: Context and Basic Notions

[...]

David Robins

01 Jan 2000-Informing Science The International Journal of an Emerging Transdiscipline

TL;DR: Interactive information retrieval is a line of research that studies users in the process of directly consulting an IR system is called interactive information retrieval (IIR), and a brief background on traditional IR studies is given.

...read moreread less

Abstract: Introduction Information retrieval (IR) is a discipline concerned with the processes by which queries presented to information systems are matched against a "store" of texts (the term text may be substituted with still images, sounds, video clips, paintings, or any other artifact of intellectual activity). The end result of the matching process is a listing of texts that are a subset of the total store. Any number of means may accomplish the matching process, but essentially, when specified attributes in a query are found to correspond with specified attributes of a text, the text is included in the listing. Since the middle of the 20th century, most efforts to improve information retrieval have focused on methods of matching text representations with query representations. Recently, however, researchers have undertaken the task of understanding the human, or user, role in IR. The basic assumption behind these efforts is that we cannot design effective IR systems without some knowledge of how users interact with them. Therefore, this line of research that studies users in the process of directly consulting an IR system is called interactive information retrieval (IIR). In order to understand the context in which IIR has developed, I will give a brief background on traditional IR studies. I will follow this with a description of current models of IIR, and then conclude with a discussion of new directions in IIR. Background: The System Approach The system approach to IR has grown out of concerns with the "library problem" (e.g., Maron & Kuhns, 1960, p. 217), the problem of searching and retrieving relevant documents from IR systems. The hardware and software problems associated with document retrieval and document representation still persist. The development of digitally based IR systems requires computer programs that match requests with stores of documents, and then produce output. In sophisticated systems of this sort, both input terms, and output text, may be ranked according to preset criteria. The challenge to researchers in this area is to develop algorithms that optimize such rankings. There are, however, difficulties with the system orientation to IR. The first problem with the system view is in how IR systems are evaluated. In the system approach to information retrieval, system effectiveness is calculated by two measures: recall and precision. For any given search on a given database, recall is the ratio of the number of relevant documents retrieved to relevant documents in the database. Precision is the ratio of the number of relevant documents retrieved to the number of documents retrieved. These measurements rest on the assumptions that: (a) all documents in the system are known; (b) all documents in the system can be judged in advance for their usefulness (relevance) for any given problem; and (c) users' relevance judgments are a single event based solely on a text's content. Assumption (a) is valid only in the case of small test collections. Assumptions (b) and (c) are based on static notions relevance. A user's judgment of the usefulness of a document may vary with respect to his or her information seeking stage (Kuhlthau, 1991), criteria other than the topic of the document such as availability of the text (Barry, 1994), or his or her ability to express the information need to an intermediary or to an IR system (Belkin, 1980; Taylor, 1968). The second difficulty with the system approach is that language is treated as if it were precise. Although natural language processing systems have made tremendous strides in the past decade (Turtle, 1994), language will remain a problem for system designers. The reason for this is that language can be best understood by how it is used, rather than by what is said (Blair, 1990). In other words, it may be possible to understand more about what a user says to an intermediary if his or her motives or goals are understood. …

...read moreread less

Collapse