
Showing papers in "Information Retrieval in 2000"


Journal ArticleDOI
TL;DR: New research in reinforcement learning, information extraction and text classification is described that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies.
Abstract: Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are widely applicable to portal creation in other domains.

1,081 citations


Journal ArticleDOI
TL;DR: In this paper, the problem of automatically extracting keyphrases from text is treated as a supervised learning task, in which the learning algorithm must learn to classify phrases as positive or negative examples of keyphrases.
Abstract: Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx), incorporating specialized procedural domain knowledge, can generate better keyphrases than a general-purpose algorithm (C4.5). Subjective human evaluation of the keyphrases generated by GenEx suggests that about 80% of the keyphrases are acceptable to human readers. This level of performance should be satisfactory for a wide variety of applications.

869 citations


Journal ArticleDOI
TL;DR: The paper concludes by discussing the potential approaches in developing both the concept-based and content-based indexing methods as well as the user interfaces in photo retrieval systems.
Abstract: Previous research in conceptual indexing methods of images has furnished us with refined theoretical frameworks characterising various aspects of images that could and should be indexed using textual descriptors. The development of digital image processing technologies has bred a brigade of content-based indexing and retrieval methods available for applications. What users need, and in what kinds of environments different indexing and retrieval methods are relevant, has remained an area of less intensive research. This article presents the results of a field study concentrating on journalists as users of a digital newspaper photo archive. The expressed photo needs, applied selection criteria and observed searching behaviours in journalists' daily work were contrasted with the indexing practices applied by the archivists. The results showed that the journalists achieved satisfactory results when trivial query terms were available, e.g. when photos of named persons were needed. Browsing was the main searching strategy applied by the journalists, but the system did not support browsing well. The access problems users faced with particular photo needs are discussed in detail. The paper concludes by discussing potential approaches to developing both concept-based and content-based indexing methods as well as user interfaces in photo retrieval systems.

213 citations


Journal ArticleDOI
TL;DR: A new method for compressing inverted indexes is introduced that yields excellent compression, fast decoding, and exploits clustering—the tendency for words to appear relatively frequently in some parts of the collection and infrequently in others.
Abstract: Information retrieval systems contain large volumes of text, and currently have typical sizes into the gigabyte range. Inverted indexes are one important method for providing search facilities into these collections, but unless compressed require a great deal of space. In this paper we introduce a new method for compressing inverted indexes that yields excellent compression, fast decoding, and exploits clustering—the tendency for words to appear relatively frequently in some parts of the collection and infrequently in others. We also describe two other quite separate applications for the same compression method: representing the MTF list positions generated by the Burrows-Wheeler Block Sorting transformations and transmitting the codebook for semi-static block-based minimum-redundancy coding.
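
For orientation, the sketch below shows generic gap-based posting-list compression with variable-byte codes; it is an assumed illustration of why clustering helps index compression, not the specific coding method introduced in the paper. When a word's occurrences cluster, consecutive document numbers are close together, so the gaps between them are small and encode in few bytes.

```python
def vbyte_encode(n: int) -> bytes:
    """Variable-byte code one non-negative integer: 7 payload bits per
    byte, with the high bit set on the final byte of the number."""
    chunks = []
    while True:
        chunks.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    chunks.reverse()        # most-significant chunk first
    chunks[-1] |= 0x80      # mark the terminating byte
    return bytes(chunks)

def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then variable-byte code the gaps.
    Clustered occurrences produce many small gaps and hence short codes."""
    prev, out = 0, bytearray()
    for d in doc_ids:
        out += vbyte_encode(d - prev)
        prev = d
    return bytes(out)
```

For example, the clustered list [1000, 1001, 1002, 1003, 90000] compresses to one multi-byte code for the first gap, three single-byte codes for the small gaps, and one multi-byte code for the final jump.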

193 citations


Journal ArticleDOI
TL;DR: This research addresses the problem of finding pictures by optimally combining text and image similarity in an MMIR system and presents a general model for multimodal information retrieval that addresses the following issues: users' information need, and determining the most appropriate weighted combination of indexing techniques in order to best satisfy information need.
Abstract: Finding useful information from large multimodal document collections such as the WWW without encountering numerous false positives poses a challenge to multimedia information retrieval systems (MMIR). This research addresses the problem of finding pictures. The fact that images do not appear in isolation, but rather with accompanying, collateral text is exploited. Taken independently, existing techniques for picture retrieval using (i) text-based and (ii) image-based methods have several limitations. This research presents a general model for multimodal information retrieval that addresses the following issues: (i) users' information need, (ii) expressing information need through composite, multimodal queries, and (iii) determining the most appropriate weighted combination of indexing techniques in order to best satisfy information need. A machine learning approach is proposed for the latter. The focus is on improving precision and recall in an MMIR system by optimally combining text and image similarity. Experiments are presented which demonstrate the utility of individual indexing systems in improving overall average precision.

165 citations


Journal ArticleDOI
TL;DR: The majority of attempts to improve retrieval effectiveness were unsuccessful, but much was learnt from the research, most notably a clearer notion of the circumstances under which disambiguation may prove useful to retrieval.
Abstract: Although always present in text, word sense ambiguity only recently became regarded as a problem for information retrieval that was potentially solvable. The growth of interest in word senses resulted from new directions taken in disambiguation research. This paper first outlines this research and surveys the resulting efforts in information retrieval. Although the majority of attempts to improve retrieval effectiveness were unsuccessful, much was learnt from the research, most notably a clearer notion of the circumstances under which disambiguation may prove useful to retrieval.

150 citations


Journal ArticleDOI
TL;DR: This work presents a compressed inverted file that indexes compressed text and uses block addressing, and compares the index against three separate techniques for varying block sizes, showing that the index is superior to each isolated approach.
Abstract: Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains a significant reduction of the original index size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% when allowing one error in the matches.
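
To make the block-addressing idea concrete, here is a minimal sketch, assuming a plain uncompressed in-memory text; it illustrates only the coarse index plus sequential block scan, not the paper's combined scheme with compressed text and a compressed index.

```python
import re
from collections import defaultdict

BLOCK = 1024  # block size in characters (illustrative value)

def build_block_index(text: str):
    """Block addressing: map each word to the set of fixed-size text
    blocks it occurs in, instead of storing exact positions."""
    index = defaultdict(set)
    for m in re.finditer(r"\w+", text.lower()):
        index[m.group()].add(m.start() // BLOCK)
    return index

def search(text: str, index, word: str):
    """Find candidate blocks via the index, then scan only those blocks
    sequentially for exact occurrences (words straddling a block
    boundary are ignored in this sketch)."""
    word = word.lower()
    hits = []
    for b in sorted(index.get(word, ())):
        lo, hi = b * BLOCK, (b + 1) * BLOCK
        for m in re.finditer(rf"\b{re.escape(word)}\b", text.lower()[lo:hi]):
            hits.append(lo + m.start())
    return hits
```

Larger blocks shrink the index, since each word maps to fewer distinct block numbers, but force more sequential scanning per query; that is exactly the space/time trade-off studied in the paper.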

119 citations


Journal ArticleDOI
TL;DR: This paper provides an update on Doermann's comprehensive survey of research results in the broad area of document-based information retrieval, and focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document.
Abstract: Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods. Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we also briefly touch on the related problem of content-based image retrieval.

112 citations


Journal ArticleDOI
TL;DR: The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance, and retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
Abstract: A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600 document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.

104 citations


Journal ArticleDOI
TL;DR: For locating text in Web images, a procedure based on clustering in color space followed by a connected-components analysis seems promising; for character recognition, techniques using polynomial surface fitting and “fuzzy” n-tuple classifiers are described.
Abstract: The explosive growth of the World Wide Web has resulted in a distributed database consisting of hundreds of millions of documents. While existing search engines index a page based on the text that is readily extracted from its HTML encoding, an increasing amount of the information on the Web is embedded in images. This situation presents a new and exciting challenge for the fields of document analysis and information retrieval, as WWW image text is typically rendered in color and at very low spatial resolutions. In this paper, we survey the results of several years of our work in the area. For the problem of locating text in Web images, we describe a procedure based on clustering in color space followed by a connected-components analysis that seems promising. For character recognition, we discuss techniques using polynomial surface fitting and “fuzzy” n-tuple classifiers. Also presented are the results of several experiments that demonstrate where our methods perform well and where more work needs to be done. We conclude with a discussion of topics for further research.

80 citations


Journal ArticleDOI
TL;DR: An initial evaluation of the relative effectiveness of different uncertainty discount functions using a novel direct manipulation interface to a multimedia retrieval system embodying the Ostensive Model.
Abstract: The Ostensive Model proposes a manner of structuring the uncertainty associated with individual relevance judgements as sources of evidence in relevance feedback. It proposes temporal profiles of uncertainty, motivating the application of a particular class of discount function with respect to the age of the evidence. This paper presents an initial evaluation of the relative effectiveness of different uncertainty discount functions. A novel direct manipulation interface to a multimedia retrieval system embodying the Ostensive Model is outlined briefly. The paper describes the construction and characteristics of a new image test collection utilising multiple binary relevance assessments. The use of such multiple assessments and multiple interpretations of them are discussed. The evaluation environment is detailed in terms of the interface, test collection, and tasks set to users. Multiple interpretations of the results, and the statistical significance of comparisons are presented. The results obtained in the evaluation are consistent with the proposals of the Ostensive Model—reinforcing a particular evidence profile. The results give clear pointers to further, more specific, evaluations.

Journal ArticleDOI
TL;DR: A sense disambiguation technique based on a term-similarity measure selects the right translation sense of a query term, and a query expansion technique based on the same term-similarity measure improves the effectiveness of the translated queries.
Abstract: With the increasing availability of machine-readable bilingual dictionaries, dictionary-based automatic query translation has become a viable approach to Cross-Language Information Retrieval (CLIR). In this approach, resolving term ambiguity is a crucial step. We propose a sense disambiguation technique based on a term-similarity measure for selecting the right translation sense of a query term. In addition, we apply a query expansion technique which is also based on the term similarity measure to improve the effectiveness of the translation queries. The results of our Indonesian to English and English to Indonesian CLIR experiments demonstrate the effectiveness of the sense disambiguation technique. As for the query expansion technique, it is shown to be effective as long as the term ambiguity in the queries has been resolved. In the effort to solve the term ambiguity problem, we discovered that differences in the pattern of word-formation between the two languages render query translations from one language to the other difficult.

Journal ArticleDOI
TL;DR: In this paper, the authors describe extensive experiments for semantic text routing based on classified library titles and newswire titles and compare different sets of experiments, and demonstrate that techniques from information retrieval integrated into recurrent plausibility networks performed well even under noise and for different corpora.
Abstract: The research project AgNeT develops Agents for Neural Text routing in the internet. Unrestricted, potentially faulty text messages arrive at a certain delivery point (e.g. an email address or world wide web address). These text messages are scanned and then distributed to one of several expert agents according to a certain task criterion. Possible specific scenarios within this framework include learning the routing of publication titles or news titles. In this paper we describe extensive experiments for semantic text routing based on classified library titles and newswire titles. This task is challenging since incoming messages may contain constructions which have not been anticipated. Therefore, the contributions of this research are in learning and generalizing neural architectures for the robust interpretation of potentially noisy unrestricted messages. Neural networks were developed and examined for this topic since they support robustness and learning in noisy unrestricted real-world texts. We describe and compare different sets of experiments. The first set of experiments tests a recurrent neural network for the task of library title classification. Then we describe a larger, more difficult newswire classification task from information retrieval. The comparison of the examined models demonstrates that techniques from information retrieval integrated into recurrent plausibility networks performed well even under noise and for different corpora.

Journal ArticleDOI
TL;DR: The usefulness of the features derived from interval encoding is demonstrated in a hidden Markov model based page layout classification system that is trainable and extendible.
Abstract: This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval encoding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.

Journal ArticleDOI
TL;DR: A preliminary investigation is presented into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space.
Abstract: In classic Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem, known as “term mismatch”, has been recognised for a long time by the Information Retrieval community and a number of possible solutions have been proposed. Here I present a preliminary investigation into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space. The use of term similarity makes it possible to enhance classic retrieval models by taking non-matching terms into account. The theoretical advantages and drawbacks of these models are presented and compared with other models tackling the same problem. A preliminary experimental investigation into the performance gain achieved by exploiting term similarity with the proposed models is presented and discussed.

Journal ArticleDOI
TL;DR: A prototype image retrieval system with browse and search capabilities was developed to investigate patterns of searching a collection of digital visual images, as well as factors, such as image size, resolution, and download speed, which affect browsing.
Abstract: A prototype image retrieval system with browse and search capabilities was developed to investigate patterns of searching a collection of digital visual images, as well as factors, such as image size, resolution, and download speed, which affect browsing. The subject populations were art history specialists and non-specialists. Through focus group interviews, a controlled test, post-test interviews and an online survey, data was gathered to compare preferences and actual patterns of use in browsing and searching. While specialists preferred direct search to browsing, and generalists used browsing as their preferred mode, both user groups found each mode to play a role depending on information need, and found value in a system combining both browse and direct search. There were no significant differences in performance among the search modes of browse, search, and combined browse/search models when the quasi-controlled study tested the different modes.

Journal ArticleDOI
TL;DR: This paper describes a new evaluation methodology, using a task-oriented test collection, which combines the advantages of traditional non-interactive testing with a more user-centred emphasis.
Abstract: Past research has identified many different types of relevance in information retrieval (IR). So far, however, most evaluation of IR systems has been through batch experiments conducted with test collections containing only expert, topical relevance judgements. Recently, there has been some movement away from this traditional approach towards interactive, more user-centred methods of evaluation. However, these are expensive for evaluators in terms both of time and of resources. This paper describes a new evaluation methodology, using a task-oriented test collection, which combines the advantages of traditional non-interactive testing with a more user-centred emphasis. The main features of a task-oriented test collection are the adoption of the task, rather than the query, as the primary unit of evaluation and the naturalistic character of the relevance judgements.

Journal ArticleDOI
TL;DR: A new data structure is investigated, which allows fast decoding of texts encoded by canonical Huffman codes, with storage requirements much lower than for conventional Huffman trees, and decoding is faster, because a part of the bit-comparisons necessary for the decoding may be saved.
Abstract: A new data structure is investigated, which allows fast decoding of texts encoded by canonical Huffman codes. The storage requirements are much lower than for conventional Huffman trees, O(log^2 n) for trees of depth O(log n), and decoding is faster, because a part of the bit-comparisons necessary for the decoding may be saved. Empirical results on large real-life distributions show a reduction of up to 50% and more in the number of bit operations. The basic idea is then generalized, yielding further savings.
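
For background, the following is a minimal sketch of decoding with a canonical Huffman code using only per-length first-code and first-symbol tables rather than an explicit tree; it shows the baseline technique the paper improves on, not the paper's new data structure or its additional savings.

```python
def build_tables(code_lengths):
    """Build canonical decoding tables from {symbol: code length}.
    Codes of length l occupy the contiguous integer range
    [first_code[l], first_code[l] + count[l])."""
    max_len = max(code_lengths.values())
    count = [0] * (max_len + 1)
    for l in code_lengths.values():
        count[l] += 1
    symbols = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
    first_code, first_sym = [0] * (max_len + 1), [0] * (max_len + 1)
    code = idx = 0
    for l in range(1, max_len + 1):
        first_code[l], first_sym[l] = code, idx
        code = (code + count[l]) << 1
        idx += count[l]
    return first_code, first_sym, count, symbols

def decode(bits, tables):
    """Decode an iterable of 0/1 bits symbol by symbol."""
    first_code, first_sym, count, symbols = tables
    out, it = [], iter(bits)
    try:
        while True:
            code, l = next(it), 1
            # Extend the code word until its value falls inside the
            # range occupied by codes of the current length.
            while not first_code[l] <= code < first_code[l] + count[l]:
                code = (code << 1) | next(it)
                l += 1
            out.append(symbols[first_sym[l] + (code - first_code[l])])
    except StopIteration:
        return out
```

With lengths {'a': 1, 'b': 2, 'c': 3, 'd': 3} the canonical codes are a=0, b=10, c=110, d=111, and decode([1, 1, 0, 0, 1, 0], build_tables({'a': 1, 'b': 2, 'c': 3, 'd': 3})) returns ['c', 'a', 'b'].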

Journal ArticleDOI
Ronald R. Yager
TL;DR: A framework for evaluating documents is described which allows a linguistic specification of the interrelationship between the desired attributes and also supports a hierarchical structure which allows for an increased expressiveness of queries.
Abstract: The focus of this work is on the development of a document retrieval language which attempts to enable users to better represent their requirements with respect to retrieved documents. We describe a framework for evaluating documents which allows, in the spirit of computing with words, a linguistic specification of the interrelationship between the desired attributes. This framework, which makes considerable use of the Ordered Weighted Averaging (OWA) operator, also supports a hierarchical structure which allows for an increased expressiveness of queries.
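
For reference, the standard definition of the OWA operator is given below; this is the usual textbook form of the aggregation, not the paper's full hierarchical, linguistically specified framework.

```latex
% Ordered Weighted Averaging (OWA) operator: b_j is the j-th largest
% of the arguments a_1,...,a_n, and the weights w_j sum to 1.
\[
  \mathrm{OWA}_W(a_1,\dots,a_n) \;=\; \sum_{j=1}^{n} w_j\, b_j ,
  \qquad w_j \in [0,1], \qquad \sum_{j=1}^{n} w_j = 1 .
\]
% W = (1,0,...,0) recovers the maximum (a pure "or"),
% W = (0,...,0,1) recovers the minimum (a pure "and"),
% and uniform weights recover the arithmetic mean.
```

The choice of the weight vector W is what lets a query specify, in linguistic terms, how strictly the desired document attributes must jointly be satisfied.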

Journal ArticleDOI
TL;DR: The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (InQuery); with weak structures, QE was not very effective.
Abstract: The effects of query structures and query expansion (QE) on retrieval performance were tested with a best match retrieval system (InQuery). Query structure means the use of operators to express the relations between search keys. Six different structures were tested, representing strong structures (e.g., queries with facets or concepts identified) and weak structures (no concepts identified, a query is ‘a bag of search keys’). QE was based on concepts, which were first selected from a searching thesaurus, and then expanded by semantic relationships given in the thesaurus. The expansion levels were (a) no expansion, (b) a synonym expansion, (c) a narrower concept expansion, (d) an associative concept expansion, and (e) a cumulative expansion of all other expansions. With weak structures and Boolean structured queries, QE was not very effective. The best performance was achieved with a combination of a facet structure, where search keys within a facet were treated as instances of one search key (the SYN operator), and the largest expansion.

Journal ArticleDOI
TL;DR: A system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages, based on a process of document level alignments that allows for a multilingual comparable corpus.
Abstract: We present a system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages. The system is based on a process of document level alignments, where documents of different languages are paired according to their similarity. The resulting mapping allows us to produce a multilingual comparable corpus. Such a corpus has multiple interesting applications. It allows us to build a data structure for query translation in cross-language information retrieval (CLIR). Moreover, we also perform pseudo relevance feedback on the alignments to improve our retrieval results. And finally, multiple retrieval runs can be merged into one unified result list. The resulting system is inexpensive, adaptable to domain-specific collections and new languages and has performed very well at the TREC-7 conference CLIR system comparison.

Journal ArticleDOI
TL;DR: The theoretical results show clearly that information retrieval can cope even with many errors and disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.
Abstract: The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been subject to research for several years now. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results disclose that an expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed in an appropriate way and with care.

Journal ArticleDOI
TL;DR: The method presented first summarizes the image information content by partitioning the image into regions with the same texture; generating local indexing features from these regions using self-organizing maps yields the best effectiveness of all tested methods.
Abstract: Content-based image retrieval in astronomy needs methods that can deal with an image content made of noisy and diffuse structures. This motivates investigations on how information should be summarized and indexed for this specific kind of images. The method we present first summarizes the image information content by partitioning the image into regions with the same texture. We call this process texture summarization. Second, indexing features are generated by examining the distribution of parameters describing image regions. Indexing features can be associated with global or local image characteristics. Both kinds of indexing features are evaluated on the retrieval system of the Zurich archive of solar radio spectrograms. The evaluation shows that generating local indexing features using self-organizing maps yields the best effectiveness of all tested methods.

Journal ArticleDOI
TL;DR: The surprising result is that features that describe the co-occurrences of words in sentence-size or paragraph-size windows are significantly better descriptors than purely word-based indexing features.
Abstract: We have applied the well-known Robertson-Sparck Jones weighting to sets of indexing features that are different from word-based features. Our features describe the co-occurrences of words in a window range of predefined size. The experiments have been designed to analyse the value of features that are beyond word-based features but all used retrieval methods can be motivated strictly in the probabilistic framework. Among the several implications of our experiments for weighted retrieval is the surprising result that features that describe the co-occurrences of words in sentence-size or paragraph-size windows are significantly better descriptors than purely word-based indexing features.
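
For reference, the Robertson-Sparck Jones relevance weight referred to above has the standard form shown below; in the paper it is applied to window co-occurrence features rather than to single words (the notation here is the usual one and is not taken from the paper).

```latex
% Robertson-Sparck Jones weight for an indexing feature t:
%   N   = number of documents in the collection
%   n_t = number of documents containing t
%   R   = number of known relevant documents
%   r_t = number of relevant documents containing t
\[
  w_t \;=\; \log
  \frac{(r_t + 0.5)\,(N - n_t - R + r_t + 0.5)}
       {(R - r_t + 0.5)\,(n_t - r_t + 0.5)}
\]
```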

Journal ArticleDOI
TL;DR: The Logical Uncertainty Principle is re-examined from the point of view of classical logic and two interpretations are given: an objective one in terms of an axiomatic theory of information, and a subjective one based on Ramsey's theory of probability.
Abstract: The Logical Uncertainty Principle is re-examined from the point of view of classical logic. Two interpretations are given, an objective one in terms of an axiomatic theory of information, and a subjective one based on Ramsey's theory of probability.

Journal ArticleDOI
K. L. Kwok
TL;DR: The PIRCS document-focused retrieval is shown to have similarity with a simple language model approach to IR and the use of various term level and phrasal level evidence to improve retrieval accuracy is studied.
Abstract: Both English and Chinese ad-hoc information retrieval were investigated in this Tipster 3 project. Part of our objective is to study the use of various term level and phrasal level evidence to improve retrieval accuracy. For short queries, we studied five term level techniques that together can lead to good improvements over standard ad-hoc 2-stage retrieval for TREC5-8 experiments. For long queries, we studied the use of linguistic phrases to re-rank retrieval lists. Its effect is small but consistently positive. For Chinese IR, we investigated three simple representations for documents and queries: short-words, bigrams and characters. Both approximate short-word segmentation and bigrams, augmented with characters, give highly effective results. Accurate word segmentation appears not to be crucial for the overall result of a query set. Character indexing by itself is not competitive. Additional improvements may be obtained using collection enrichment and combination of retrieval lists. Our PIRCS document-focused retrieval is also shown to have similarity with a simple language model approach to IR.
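
As a simple illustration of the bigram-plus-character representation mentioned above (an assumed sketch of the general idea, not the PIRCS implementation), a Chinese string can be indexed as overlapping two-character units augmented with single characters:

```python
def bigram_plus_char_terms(text: str):
    """Represent a Chinese string as overlapping character bigrams
    augmented with single characters (no word segmentation needed)."""
    chars = [c for c in text if not c.isspace()]
    bigrams = ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
    return bigrams + chars
```

A four-character query thus contributes three bigram terms and four character terms; term weighting, stop characters and punctuation handling are omitted here.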

Journal ArticleDOI
TL;DR: The kd-tree searching algorithm is used within a recent LSI implementation to reduce the time and computational complexity of query matching.
Abstract: Efficient information searching and retrieval methods are needed to navigate the ever increasing volumes of digital information. Traditional lexical information retrieval methods can be inefficient and often return inaccurate results. To overcome problems such as polysemy and synonymy, concept-based retrieval methods have been developed. One such method is Latent Semantic Indexing (LSI), a vector-space model, which uses the singular value decomposition (SVD) of a term-by-document matrix to represent terms and documents in k-dimensional space. As with other vector-space models, LSI is an attempt to exploit the underlying semantic structure of word usage in documents. During the query matching phase of LSI, a user's query is first projected into the term-document space, and then compared to all terms and documents represented in the vector space. Using some similarity measure, the nearest (most relevant) terms and documents are identified and returned to the user. The current LSI query matching method requires that the similarity measure be computed between the query and every term and document in the vector space. In this paper, the kd-tree searching algorithm is used within a recent LSI implementation to reduce the time and computational complexity of query matching. The kd-tree data structure stores the term and document vectors in such a way that only those terms and documents that are most likely to qualify as nearest neighbors to the query will be examined and retrieved.
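
A minimal sketch of the general combination follows, assuming a small dense term-by-document matrix and SciPy's cKDTree; it illustrates the idea of restricting query matching to likely nearest neighbours, not the paper's specific LSI implementation, and it uses Euclidean distance whereas LSI rankings are usually computed with cosine similarity.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_lsi_index(term_doc: np.ndarray, k: int):
    """Compute a rank-k LSI space from a term-by-document matrix and
    index the document vectors in a kd-tree."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    docs_k = Vt[:k].T              # one k-dimensional row per document
    return Uk, sk, cKDTree(docs_k)

def match_query(query_vec: np.ndarray, Uk, sk, tree, n: int = 5):
    """Fold a query (term-space vector) into the LSI space via
    q_k = q^T U_k S_k^{-1}, then return the n nearest documents."""
    q_k = (query_vec @ Uk) / sk
    dist, idx = tree.query(q_k, k=n)
    return idx, dist
```

Normalising the document rows and the folded-in query to unit length before building the tree would make the Euclidean nearest neighbours track cosine similarity more closely.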

Journal ArticleDOI
TL;DR: Concepts borrowed from high-resolution spectral analysis, but adapted uniquely to this problem, have been found to be useful in this context and can be applied to a variety of pattern recognition problems.
Abstract: Images and signals may be represented by forms invariant to time shifts, spatial shifts, frequency shifts, and scale changes. Advances in time-frequency analysis and scale transform techniques have made this possible. However, factors such as noise contamination and “style” differences complicate this. An example is found in text, where letters and words may vary in size and position. Examples of complicating variations include the font used, corruption during facsimile (fax) transmission, and printer characteristics. The solution advanced in this paper is to cast the desired invariants into separate subspaces for each extraneous factor or group of factors. The first goal is to have minimal overlap between these subspaces and the second goal is to be able to identify each subspace accurately. Concepts borrowed from high-resolution spectral analysis, but adapted uniquely to this problem, have been found to be useful in this context. Once the pertinent subspace is identified, the recognition of a particular invariant form within this subspace is relatively simple using well-known singular value decomposition (SVD) techniques. The basic elements of the approach can be applied to a variety of pattern recognition problems. The specific application covered in this paper is word spotting in bitmapped fax documents.

Journal ArticleDOI
TL;DR: These four techniques have been evaluated using the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method and show that the retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.
Abstract: This paper presents four novel techniques for open-vocabulary spoken document retrieval: a method to detect slots that possibly contain a query feature; a method to estimate occurrence probabilities; a technique that we call collection-wide probability re-estimation; and a weighting scheme which takes advantage of the fact that long query features are detected more reliably. These four techniques have been evaluated using the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method. Results show that the retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.

Journal ArticleDOI
TL;DR: Some of the problems in interacting with best-match retrieval systems are examined, including the complexity and breadth of interaction and attempts to categorise users' information-seeking behaviour.
Abstract: In this paper we look at some of the problems in interacting with best-match retrieval systems. In particular, we examine the areas of interaction, investigations of the complexity and breadth of interaction, and attempts to categorise users' information-seeking behaviour. We suggest that one of the difficulties of traditional IR systems in supporting information seeking is the way the information content of documents is represented. We discuss an alternative representation, based on how information is used within documents.