
Showing papers in "Information Retrieval in 2000"


Journal ArticleDOI
TL;DR: New research in reinforcement learning, information extraction and text classification is described that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies.
Abstract: Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are widely applicable to portal creation in other domains.

1,081 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of automatically extracting keyphrases from text is treated as a supervised learning task, in which the learning algorithm must learn to classify phrases as positive or negative examples of keyphrases.
Abstract: Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by GenEx suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.

869 citations


Journal ArticleDOI
TL;DR: The paper concludes by discussing the potential approaches in developing both the concept-based and content-based indexing methods as well as the user interfaces in photo retrieval systems.
Abstract: Previous research in conceptual indexing methods of images has furnished us with refined theoretical frameworks characterising various aspects of images that could and should be indexed using textual descriptors. The development of digital image processing technologies has bred a brigade of content-based indexing and retrieval methods available for applications. What users need, and in what kinds of environments different indexing and retrieval methods are relevant, has remained an area of less intensive research. This article presents the results of a field study concentrating on journalists as users of a digital newspaper photo archive. The expressed photo needs, applied selection criteria and observed searching behaviours in journalists' daily work were contrasted with the indexing practices applied by the archivists. The results showed that the journalists achieved satisfactory results when trivial query terms were available, e.g. when photos of named persons were needed. Browsing was the main searching strategy applied by the journalists, but the system did not support browsing well. The access problems users faced with particular photo needs are discussed in detail. The paper concludes by discussing potential approaches to developing both concept-based and content-based indexing methods as well as user interfaces in photo retrieval systems.

213 citations


Journal ArticleDOI
TL;DR: A new method for compressing inverted indexes is introduced that yields excellent compression, fast decoding, and exploits clustering—the tendency for words to appear relatively frequently in some parts of the collection and infrequently in others.
Abstract: Information retrieval systems contain large volumes of text, and currently have typical sizes into the gigabyte range. Inverted indexes are one important method for providing search facilities into these collections, but unless compressed require a great deal of space. In this paper we introduce a new method for compressing inverted indexes that yields excellent compression, fast decoding, and exploits clustering—the tendency for words to appear relatively frequently in some parts of the collection and infrequently in others. We also describe two other quite separate applications for the same compression method: representing the MTF list positions generated by the Burrows-Wheeler Block Sorting transformations and transmitting the codebook for semi-static block-based minimum-redundancy coding.
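
For orientation, the sketch below shows generic gap-based posting-list compression with variable-byte codes; it is an assumed illustration of why clustering helps index compression, not the specific coding method introduced in the paper. When a word's occurrences cluster, consecutive document numbers are close together, so the gaps between them are small and encode in few bytes.

```python
def vbyte_encode(n: int) -> bytes:
    """Variable-byte code one non-negative integer: 7 payload bits per
    byte, with the high bit set on the final byte of the number."""
    chunks = []
    while True:
        chunks.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    chunks.reverse()        # most-significant chunk first
    chunks[-1] |= 0x80      # mark the terminating byte
    return bytes(chunks)

def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then variable-byte code the gaps.
    Clustered occurrences produce many small gaps and hence short codes."""
    prev, out = 0, bytearray()
    for d in doc_ids:
        out += vbyte_encode(d - prev)
        prev = d
    return bytes(out)
```

For example, the clustered list [1000, 1001, 1002, 1003, 90000] compresses to one multi-byte code for the first gap, three single-byte codes for the small gaps, and one multi-byte code for the final jump.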

193 citations


Journal ArticleDOI
TL;DR: This research addresses the problem of finding pictures by optimally combining text and image similarity in an MMIR system and presents a general model for multimodal information retrieval that addresses the following issues: users' information need, and determining the most appropriate weighted combination of indexing techniques in order to best satisfy information need.
Abstract: Finding useful information from large multimodal document collections such as the WWW without encountering numerous false positives poses a challenge to multimedia information retrieval systems (MMIR). This research addresses the problem of finding pictures. The fact that images do not appear in isolation, but rather with accompanying, collateral text is exploited. Taken independently, existing techniques for picture retrieval using (i) text-based and (ii) image-based methods have several limitations. This research presents a general model for multimodal information retrieval that addresses the following issues: (i) users' information need, (ii) expressing information need through composite, multimodal queries, and (iii) determining the most appropriate weighted combination of indexing techniques in order to best satisfy information need. A machine learning approach is proposed for the latter. The focus is on improving precision and recall in an MMIR system by optimally combining text and image similarity. Experiments are presented which demonstrate the utility of individual indexing systems in improving overall average precision.

165 citations


Journal ArticleDOI
TL;DR: The majority of attempts to improve retrieval effectiveness were unsuccessful, but much was learnt from the research, most notably a clearer notion of the circumstances under which disambiguation may prove useful to retrieval.
Abstract: Although always present in text, word sense ambiguity only recently became regarded as a problem for information retrieval that was potentially solvable. The growth of interest in word senses resulted from new directions taken in disambiguation research. This paper first outlines this research and surveys the resulting efforts in information retrieval. Although the majority of attempts to improve retrieval effectiveness were unsuccessful, much was learnt from the research, most notably a clearer notion of the circumstances under which disambiguation may prove useful to retrieval.

150 citations


Journal ArticleDOI
TL;DR: This work presents a compressed inverted file that indexes compressed text and uses block addressing, and compares the index against three separate techniques for varying block sizes, showing that the index is superior to each isolated approach.
Abstract: Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains a significant reduction of the original index size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% when allowing one error in the matches.
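
To make the block-addressing idea concrete, here is a minimal sketch, assuming a plain uncompressed in-memory text; it illustrates only the coarse index plus sequential block scan, not the paper's combined scheme with compressed text and a compressed index.

```python
import re
from collections import defaultdict

BLOCK = 1024  # block size in characters (illustrative value)

def build_block_index(text: str):
    """Block addressing: map each word to the set of fixed-size text
    blocks it occurs in, instead of storing exact positions."""
    index = defaultdict(set)
    for m in re.finditer(r"\w+", text.lower()):
        index[m.group()].add(m.start() // BLOCK)
    return index

def search(text: str, index, word: str):
    """Find candidate blocks via the index, then scan only those blocks
    sequentially for exact occurrences (words straddling a block
    boundary are ignored in this sketch)."""
    word = word.lower()
    hits = []
    for b in sorted(index.get(word, ())):
        lo, hi = b * BLOCK, (b + 1) * BLOCK
        for m in re.finditer(rf"\b{re.escape(word)}\b", text.lower()[lo:hi]):
            hits.append(lo + m.start())
    return hits
```

Larger blocks shrink the index, since each word maps to fewer distinct block numbers, but force more sequential scanning per query; that is exactly the space/time trade-off studied in the paper.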

119 citations


Journal ArticleDOI
TL;DR: This paper provides an update on Doermann's comprehensive survey of research results in the broad area of document-based information retrieval, and focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document.
Abstract: Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods. Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we also briefly touch on the related problem of content-based image retrieval.

112 citations


Journal ArticleDOI
TL;DR: The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance, and retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
Abstract: A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600 document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.

104 citations


Journal ArticleDOI
TL;DR: For locating text in Web images, a procedure based on clustering in color space followed by a connected-components analysis seems promising; for character recognition, techniques using polynomial surface fitting and “fuzzy” n-tuple classifiers are described.
Abstract: The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and “fuzzy” n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.

80 citations


Journal ArticleDOI
TL;DR: An initial evaluation of the relative effectiveness of different uncertainty discount functions using a novel direct manipulation interface to a multimedia retrieval system embodying the Ostensive Model.
Abstract: The Ostensive Model proposes a manner of structuring the uncertainty associated with individual relevance judgements as sources of evidence in relevance feedback. It proposes temporal profiles of uncertainty, motivating the application of a particular class of discount function with respect to the age of the evidence. This paper presents an initial evaluation of the relative effectiveness of different uncertainty discount functions. A novel direct manipulation interface to a multimedia retrieval system embodying the Ostensive Model is outlined briefly. The paper describes the construction and characteristics of a new image test collection utilising multiple binary relevance assessments. The use of such multiple assessments and multiple interpretations of them are discussed. The evaluation environment is detailed in terms of the interface, test collection, and tasks set to users. Multiple interpretations of the results, and the statistical significance of comparisons are presented. The results obtained in the evaluation are consistent with the proposals of the Ostensive Model—reinforcing a particular evidence profile. The results give clear pointers to further, more specific, evaluations.

Journal ArticleDOI
TL;DR: A sense disambiguation technique based on a term-similarity measure selects the right translation sense of a query term, and a query expansion technique based on the same term-similarity measure improves the effectiveness of the translated queries.
Abstract: With the increasing availability of machine-readable bilingual dictionaries, dictionary-based automatic query translation has become a viable approach to Cross-Language Information Retrieval (CLIR). In this approach, resolving term ambiguity is a crucial step. We propose a sense disambiguation technique based on a term-similarity measure for selecting the right translation sense of a query term. In addition, we apply a query expansion technique which is also based on the term similarity measure to improve the effectiveness of the translation queries. The results of our Indonesian to English and English to Indonesian CLIR experiments demonstrate the effectiveness of the sense disambiguation technique. As for the query expansion technique, it is shown to be effective as long as the term ambiguity in the queries has been resolved. In the effort to solve the term ambiguity problem, we discovered that differences in the pattern of word-formation between the two languages render query translations from one language to the other difficult.

Journal ArticleDOI
TL;DR: In this paper, the authors describe extensive experiments for semantic text routing based on classified library titles and newswire titles and compare different sets of experiments, and demonstrate that techniques from information retrieval integrated into recurrent plausibility networks performed well even under noise and for different corpora.
Abstract: The research project AgNeT develops Agents for Neural Text routing in the internet. Unrestricted, potentially faulty text messages arrive at a certain delivery point (e.g. an email address or world wide web address). These text messages are scanned and then distributed to one of several expert agents according to a certain task criterion. Possible specific scenarios within this framework include learning the routing of publication titles or news titles. In this paper we describe extensive experiments for semantic text routing based on classified library titles and newswire titles. This task is challenging since incoming messages may contain constructions which have not been anticipated. Therefore, the contributions of this research are in learning and generalizing neural architectures for the robust interpretation of potentially noisy unrestricted messages. Neural networks were developed and examined for this topic since they support robustness and learning in noisy unrestricted real-world texts. We describe and compare different sets of experiments. The first set of experiments tests a recurrent neural network for the task of library title classification. Then we describe a larger, more difficult newswire classification task from information retrieval. The comparison of the examined models demonstrates that techniques from information retrieval integrated into recurrent plausibility networks performed well even under noise and for different corpora.

Journal ArticleDOI
TL;DR: The usefulness of the features derived from interval encoding is demonstrated in a hidden Markov model based page layout classification system that is trainable and extendible.
Abstract: This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval encoding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.

Journal ArticleDOI
TL;DR: A preliminary investigation is presented into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space.
Abstract: In classic Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem, known as “term mismatch”, has been recognised for a long time by the Information Retrieval community and a number of possible solutions have been proposed. Here I present a preliminary investigation into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space. The use of term similarity makes it possible to enhance classic retrieval models by taking non-matching terms into account. The theoretical advantages and drawbacks of these models are presented and compared with other models tackling the same problem. A preliminary experimental investigation into the performance gain achieved by exploiting term similarity with the proposed models is presented and discussed.

Journal ArticleDOI
TL;DR: A prototype image retrieval system with browse and search capabilities was developed to investigate patterns of searching a collection of digital visual images, as well as factors, such as image size, resolution, and download speed, which affect browsing.
Abstract: A prototype image retrieval system with browse and search capabilities was developed to investigate patterns of searching a collection of digital visual images, as well as factors, such as image size, resolution, and download speed, which affect browsing. The subject populations were art history specialists and non-specialists. Through focus group interviews, a controlled test, post-test interviews and an online survey, data was gathered to compare preferences and actual patterns of use in browsing and searching. While specialists preferred direct search to browsing, and generalists used browsing as their preferred mode, both user groups found each mode to play a role depending on information need, and found value in a system combining both browse and direct search. There were no significant differences in performance among the search modes of browse, search, and combined browse/search models when the quasi-controlled study tested the different modes.

Journal ArticleDOI
TL;DR: This paper describes a new evaluation methodology, using a task-oriented test collection, which combines the advantages of traditional non-interactive testing with a more user-centred emphasis.
Abstract: Past research has identified many different types of relevance in information retrieval (IR). So far, however, most evaluation of IR systems has been through batch experiments conducted with test collections containing only expert, topical relevance judgements. Recently, there has been some movement away from this traditional approach towards interactive, more user-centred methods of evaluation. However, these are expensive for evaluators in terms both of time and of resources. This paper describes a new evaluation methodology, using a task-oriented test collection, which combines the advantages of traditional non-interactive testing with a more user-centred emphasis. The main features of a task-oriented test collection are the adoption of the task, rather than the query, as the primary unit of evaluation and the naturalistic character of the relevance judgements.

Journal ArticleDOI
TL;DR: A new data structure is investigated, which allows fast decoding of texts encoded by canonical Huffman codes, with storage requirements much lower than for conventional Huffman trees, and decoding is faster, because a part of the bit-comparisons necessary for the decoding may be saved.
Abstract: A new data structure is investigated, which allows fast decoding of texts encoded by canonical Huffman codes. The storage requirements are much lower than for conventional Huffman trees, O(log^2 n) for trees of depth O(log n), and decoding is faster, because a part of the bit-comparisons necessary for the decoding may be saved. Empirical results on large real-life distributions show a reduction of up to 50% and more in the number of bit operations. The basic idea is then generalized, yielding further savings.
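
For background, the following is a minimal sketch of decoding with a canonical Huffman code using only per-length first-code and first-symbol tables rather than an explicit tree; it shows the baseline technique the paper improves on, not the paper's new data structure or its additional savings.

```python
def build_tables(code_lengths):
    """Build canonical decoding tables from {symbol: code length}.
    Codes of length l occupy the contiguous integer range
    [first_code[l], first_code[l] + count[l])."""
    max_len = max(code_lengths.values())
    count = [0] * (max_len + 1)
    for l in code_lengths.values():
        count[l] += 1
    symbols = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
    first_code, first_sym = [0] * (max_len + 1), [0] * (max_len + 1)
    code = idx = 0
    for l in range(1, max_len + 1):
        first_code[l], first_sym[l] = code, idx
        code = (code + count[l]) << 1
        idx += count[l]
    return first_code, first_sym, count, symbols

def decode(bits, tables):
    """Decode an iterable of 0/1 bits symbol by symbol."""
    first_code, first_sym, count, symbols = tables
    out, it = [], iter(bits)
    try:
        while True:
            code, l = next(it), 1
            # Extend the code word until its value falls inside the
            # range occupied by codes of the current length.
            while not first_code[l] <= code < first_code[l] + count[l]:
                code = (code << 1) | next(it)
                l += 1
            out.append(symbols[first_sym[l] + (code - first_code[l])])
    except StopIteration:
        return out
```

With lengths {'a': 1, 'b': 2, 'c': 3, 'd': 3} the canonical codes are a=0, b=10, c=110, d=111, and decode([1, 1, 0, 0, 1, 0], build_tables({'a': 1, 'b': 2, 'c': 3, 'd': 3})) returns ['c', 'a', 'b'].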

Journal ArticleDOI
Ronald R. Yager
TL;DR: A framework for evaluating documents is described which allows a linguistic specification of the interrelationship between the desired attributes and also supports a hierarchical structure which allows for an increased expressiveness of queries.
Abstract: The focus of this work is on the development of a document retrieval language which attempts to enable users to better represent their requirements with respect to retrieved documents. We describe a framework for evaluating documents which allows, in the spirit of computing with words, a linguistic specification of the interrelationship between the desired attributes. This framework, which makes considerable use of the Ordered Weighted Averaging (OWA) operator, also supports a hierarchical structure which allows for an increased expressiveness of queries.
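
For reference, the standard definition of the OWA operator is given below; this is the usual textbook form of the aggregation, not the paper's full hierarchical, linguistically specified framework.

```latex
% Ordered Weighted Averaging (OWA) operator: b_j is the j-th largest
% of the arguments a_1,...,a_n, and the weights w_j sum to 1.
\[
  \mathrm{OWA}_W(a_1,\dots,a_n) \;=\; \sum_{j=1}^{n} w_j\, b_j ,
  \qquad w_j \in [0,1], \qquad \sum_{j=1}^{n} w_j = 1 .
\]
% W = (1,0,...,0) recovers the maximum (a pure "or"),
% W = (0,...,0,1) recovers the minimum (a pure "and"),
% and uniform weights recover the arithmetic mean.
```

The choice of the weight vector W is what lets a query specify, in linguistic terms, how strictly the desired document attributes must jointly be satisfied.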

Journal ArticleDOI
TL;DR: The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (InQuery); with weak structures, QE was not very effective.
Abstract: The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (InQuery). Query structure means the use of operators to express the relations between search keys. Six different structures were tested, representing strong structures (e.g., queries with facets or concepts identified) and weak structures (no concepts identified, a query is ‘a bag of search keys’). QE was based on concepts, which were first selected from a searching thesaurus, and then expanded by semantic relationships given in the thesaurus. The expansion levels were (a) no expansion, (b) a synonym expansion, (c) a narrower concept expansion, (d) an associative concept expansion, and (e) a cumulative expansion of all other expansions. With weak structures and Boolean structured queries, QE was not very effective. The best performance was achieved with a combination of a facet structure, where search keys within a facet were treated as instances of one search key (the SYN operator), and the largest expansion.

Journal ArticleDOI
TL;DR: A system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages, based on a process of document level alignments that allows for a multilingual comparable corpus.
Abstract: We present a system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages. The system is based on a process of document level alignments, where documents of different languages are paired according to their similarity. The resulting mapping allows us to produce a multilingual comparable corpus. Such a corpus has multiple interesting applications. It allows us to build a data structure for query translation in cross-language information retrieval (CLIR). Moreover, we also perform pseudo relevance feedback on the alignments to improve our retrieval results. And finally, multiple retrieval runs can be merged into one unified result list. The resulting system is inexpensive, adaptable to domain-specific collections and new languages and has performed very well at the TREC-7 conference CLIR system comparison.

Journal ArticleDOI
TL;DR: The theoretical results show clearly that information retrieval can cope even with many errors and disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.
Abstract: The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.

Journal ArticleDOI
TL;DR: The method presented first summarizes the image information content by partitioning the image into regions with the same texture; generating local indexing features from these regions using self-organizing maps yields the best effectiveness of all tested methods.
Abstract: Content-based image retrieval in astronomy needs methods that can deal with an image content made of noisy and diffuse structures. This motivates investigations on how information should be summarized and indexed for this specific kind of images. The method we present first summarizes the image information content by partitioning the image into regions with the same texture. We call this process texture summarization. Second, indexing features are generated by examining the distribution of parameters describing image regions. Indexing features can be associated with global or local image characteristics. Both kinds of indexing features are evaluated on the retrieval system of the Zurich archive of solar radio spectrograms. The evaluation shows that generating local indexing features using self-organizing maps yields the best effectiveness of all tested methods.

Journal ArticleDOI
TL;DR: The surprising result is that features that describe the co-occurrences of words in sentence-size or paragraph-size windows are significantly better descriptors than purely word-based indexing features.
Abstract: We have applied the well-known Robertson-Sparck Jones weighting to sets of indexing features that are different from word-based features. Our features describe the co-occurrences of words in a window range of predefined size. The experiments have been designed to analyse the value of features that are beyond word-based features but all used retrieval methods can be motivated strictly in the probabilistic framework. Among the several implications of our experiments for weighted retrieval is the surprising result that features that describe the co-occurrences of words in sentence-size or paragraph-size windows are significantly better descriptors than purely word-based indexing features.
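
For reference, the Robertson-Sparck Jones relevance weight referred to above has the standard form shown below; in the paper it is applied to window co-occurrence features rather than to single words (the notation here is the usual one and is not taken from the paper).

```latex
% Robertson-Sparck Jones weight for an indexing feature t:
%   N   = number of documents in the collection
%   n_t = number of documents containing t
%   R   = number of known relevant documents
%   r_t = number of relevant documents containing t
\[
  w_t \;=\; \log
  \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}
       {(R - r_t + 0.5)\,(n_t - r_t + 0.5)}
\]
```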

Journal ArticleDOI
TL;DR: The Logical Uncertainty Principle is re-examined from the point of view of classical logic and two interpretations are given: an objective one in terms of an axiomatic theory of information, and a subjective one based on Ramsey's theory of probability.
Abstract: The Logical Uncertainty Principle is re-examined from the point of view of classical logic. Two interpretations are given, an objective one in terms of an axiomatic theory of information, and a subjective one based on Ramsey's theory of probability.

Journal ArticleDOI
K. L. Kwok
TL;DR: The PIRCS document-focused retrieval is shown to have similarity with a simple language model approach to IR and the use of various term level and phrasal level evidence to improve retrieval accuracy is studied.
Abstract: Both English and Chinese ad-hoc information retrieval were investigated in this Tipster 3 project. Part of our objective is to study the use of various term level and phrasal level evidence to improve retrieval accuracy. For short queries, we studied five term level techniques that together can lead to good improvements over standard ad-hoc 2-stage retrieval for TREC5-8 experiments. For long queries, we studied the use of linguistic phrases to re-rank retrieval lists. Its effect is small but consistently positive. For Chinese IR, we investigated three simple representations for documents and queries: short-words, bigrams and characters. Both approximate short-word segmentation and bigrams, augmented with characters, give highly effective results. Accurate word segmentation appears not to be crucial for the overall result of a query set. Character indexing by itself is not competitive. Additional improvements may be obtained using collection enrichment and combination of retrieval lists. Our PIRCS document-focused retrieval is also shown to have similarity with a simple language model approach to IR.
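
As a simple illustration of the bigram-plus-character representation mentioned above (an assumed sketch of the general idea, not the PIRCS implementation), a Chinese string can be indexed as overlapping two-character units augmented with single characters:

```python
def bigram_plus_char_terms(text: str):
    """Represent a Chinese string as overlapping character bigrams
    augmented with single characters (no word segmentation needed)."""
    chars = [c for c in text if not c.isspace()]
    bigrams = ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
    return bigrams + chars
```

A four-character query thus contributes three bigram terms and four character terms; term weighting, stop characters and punctuation handling are omitted here.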

Journal ArticleDOI
TL;DR: The kd-tree searching algorithm is used within a recent LSI implementation to reduce the time and computational complexity of query matching.
Abstract: Efficient information searching and retrieval methods are needed to navigate the ever increasing volumes of digital information. Traditional lexical information retrieval methods can be inefficient and often return inaccurate results. To overcome problems such as polysemy and synonymy, concept-based retrieval methods have been developed. One such method is Latent Semantic Indexing (LSI), a vector-space model, which uses the singular value decomposition (SVD) of a term-by-document matrix to represent terms and documents in k-dimensional space. As with other vector-space models, LSI is an attempt to exploit the underlying semantic structure of word usage in documents. During the query matching phase of LSI, a user's query is first projected into the term-document space, and then compared to all terms and documents represented in the vector space. Using some similarity measure, the nearest (most relevant) terms and documents are identified and returned to the user. The current LSI query matching method requires that the similarity measure be computed between the query and every term and document in the vector space. In this paper, the kd-tree searching algorithm is used within a recent LSI implementation to reduce the time and computational complexity of query matching. The kd-tree data structure stores the term and document vectors in such a way that only those terms and documents that are most likely to qualify as nearest neighbors to the query will be examined and retrieved.
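
A minimal sketch of the general combination follows, assuming a small dense term-by-document matrix and SciPy's cKDTree; it illustrates the idea of restricting query matching to likely nearest neighbours, not the paper's specific LSI implementation, and it uses Euclidean distance whereas LSI rankings are usually computed with cosine similarity.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_lsi_index(term_doc: np.ndarray, k: int):
    """Compute a rank-k LSI space from a term-by-document matrix and
    index the document vectors in a kd-tree."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    docs_k = Vt[:k].T              # one k-dimensional row per document
    return Uk, sk, cKDTree(docs_k)

def match_query(query_vec: np.ndarray, Uk, sk, tree, n: int = 5):
    """Fold a query (term-space vector) into the LSI space via
    q_k = q^T U_k S_k^{-1}, then return the n nearest documents."""
    q_k = (query_vec @ Uk) / sk
    dist, idx = tree.query(q_k, k=n)
    return idx, dist
```

Normalising the document rows and the folded-in query to unit length before building the tree would make the Euclidean nearest neighbours track cosine similarity more closely.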

Journal ArticleDOI
TL;DR: Concepts borrowed from high-resolution spectral analysis, but adapted uniquely to this problem, have been found to be useful in this context and can be applied to a variety of pattern recognition problems.
Abstract: Images and signals may be represented by forms invariant to time shifts, spatial shifts, frequency shifts, and scale changes. Advances in time-frequency analysis and scale transform techniques have made this possible. However, factors such as noise contamination and “style” differences complicate this. An example is found in text, where letters and words may vary in size and position. Examples of complicating variations include the font used, corruption during facsimile (fax) transmission, and printer characteristics. The solution advanced in this paper is to cast the desired invariants into separate subspaces for each extraneous factor or group of factors. The first goal is to have minimal overlap between these subspaces and the second goal is to be able to identify each subspace accurately. Concepts borrowed from high-resolution spectral analysis, but adapted uniquely to this problem, have been found to be useful in this context. Once the pertinent subspace is identified, the recognition of a particular invariant form within this subspace is relatively simple using well-known singular value decomposition (SVD) techniques. The basic elements of the approach can be applied to a variety of pattern recognition problems. The specific application covered in this paper is word spotting in bitmapped fax documents.

Journal ArticleDOI
TL;DR: These four techniques have been evaluated using the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method and show that the retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.
Abstract: This paper presents four novel techniques for open-vocabulary spoken document retrieval: a method to detect slots that possibly contain a query feature; a method to estimate occurrence probabilities; a technique that we call collection-wide probability re-estimation; and a weighting scheme which takes advantage of the fact that long query features are detected more reliably. These four techniques have been evaluated using the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method. Results show that the retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.

Journal ArticleDOI
TL;DR: Some of the problems in interacting with best-match retrieval systems are examined, including the complexity and breadth of interaction and attempts to categorise users' information-seeking behaviour.
Abstract: In this paper we look at some of the problems in interacting with best-match retrieval systems. In particular, we examine the areas of interaction, investigations of the complexity and breadth of interaction, and attempts to categorise users' information-seeking behaviour. We suggest that one of the difficulties of traditional IR systems in supporting information seeking is the way the information content of documents is represented. We discuss an alternative representation, based on how information is used within documents.