Proceedings Article

The TREC spoken document retrieval track: a success story

TL;DR: The SDR Track can be declared a success in that it has provided objective, demonstrable proof that this technology can be successfully applied to realistic audio collections using a combination of existing technologies and that it can be objectively evaluated.
Abstract: This paper describes work within the NIST Text REtrieval Conference (TREC) over the last three years in designing and implementing evaluations of Spoken Document Retrieval (SDR) technology within a broadcast news domain. SDR involves the search and retrieval of excerpts from spoken audio recordings using a combination of automatic speech recognition and information retrieval technologies. The TREC SDR Track has provided an infrastructure for the development and evaluation of SDR technology and a common forum for the exchange of knowledge between the speech recognition and information retrieval research communities. The SDR Track can be declared a success in that it has provided objective, demonstrable proof that this technology can be successfully applied to realistic audio collections using a combination of existing technologies and that it can be objectively evaluated. The design and implementation of each of the SDR evaluations are presented and the results are summarized. Plans for the 2000 TREC SDR Track are presented and thoughts about how the track might evolve are discussed.
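
The track's core premise, as the abstract describes, is to chain automatic speech recognition with ordinary text retrieval. The following Python sketch shows that pipeline shape only; the asr_transcribe stub and the bare-bones TF-IDF scoring are illustrative assumptions, not the systems the track actually evaluated.

import math
from collections import Counter

def asr_transcribe(audio_path):
    """Placeholder for a real speech recognizer (hypothetical stub)."""
    raise NotImplementedError("plug in an ASR system here")

def build_index(transcripts):
    """transcripts: {doc_id: transcript text}. Returns TF vectors and DFs."""
    tfs = {d: Counter(t.lower().split()) for d, t in transcripts.items()}
    df = Counter(term for tf in tfs.values() for term in set(tf))
    return tfs, df, len(tfs)

def search(query, tfs, df, n_docs, k=10):
    """Rank documents by a simple TF-IDF score against the query terms."""
    terms = query.lower().split()
    scores = {}
    for doc_id, tf in tfs.items():
        score = sum(tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
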
Citations
Journal ArticleDOI
TL;DR: This survey reviews 100+ recent articles on content-based multimedia information retrieval and discusses their role in current research directions which include browsing and search paradigms, user studies, affective computing, learning, semantic queries, new features and media types, high performance indexing, and evaluation techniques.
Abstract: Extending beyond the boundaries of science, art, and culture, content-based multimedia information retrieval provides new paradigms and methods for searching through the myriad variety of media all over the world. This survey reviews 100+ recent articles on content-based multimedia information retrieval and discusses their role in current research directions, which include browsing and search paradigms, user studies, affective computing, learning, semantic queries, new features and media types, high-performance indexing, and evaluation techniques. Based on the current state of the art, we discuss the major challenges for the future.

1,652 citations

Book
03 Jun 2010
TL;DR: This tutorial and review shows that despite its age, this long-standing evaluation method is still a highly valued tool for retrieval research.
Abstract: Use of test collections and evaluation measures to assess the effectiveness of information retrieval systems has its origins in work dating back to the early 1950s. Across the nearly 60 years since that work started, use of test collections is a de facto standard of evaluation. This monograph surveys the research conducted and explains the methods and measures devised for evaluation of retrieval systems, including a detailed look at the use of statistical significance testing in retrieval experimentation. This monograph reviews more recent examinations of the validity of the test collection approach and evaluation measures as well as outlining trends in current research exploiting query logs and live labs. At its core, the modern-day test collection is little different from the structures that the pioneering researchers in the 1950s and 1960s conceived of. This tutorial and review shows that despite its age, this long-standing evaluation method is still a highly valued tool for retrieval research.
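
For concreteness, here is a minimal sketch of the kind of test-collection scoring the monograph surveys: average precision of one ranked run against a set of relevance judgments. Real campaigns use tooling such as trec_eval; this is only the underlying arithmetic.

def average_precision(ranked_ids, relevant):
    """ranked_ids: system ranking; relevant: set of judged-relevant ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant docs d2 and d5; found at ranks 1 and 3: (1/1 + 2/3) / 2 = 0.833...
print(average_precision(["d2", "d9", "d5"], {"d2", "d5"}))
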

383 citations


Cites background from "The TREC spoken document retrieval ..."

  • ...categorizing and/or retrieving streamed text, as addressed in the routing and filtering tracks [200];
  • medical scholarly articles in the TREC-based genomics collection where matching to variants of gene names became a part of the search task [119];
  • search across languages with English queries retrieving Spanish and Chinese documents, as covered in the cross-language search tracks [230]; and
  • retrieval of noisy channel data, output by OCR and speech recognizer systems, addressed in the Confusion [144] and Spoken Document Retrieval tracks [92]....


Proceedings Article
01 Jan 2004
TL;DR: This paper proposes an indexing procedure for spoken utterance retrieval that works on lattices rather than just single-best text, and demonstrates that this procedure can improve F scores by over five points compared to single-best retrieval on tasks with poor WER and low redundancy.
Abstract: Recent work on spoken document retrieval has suggested that it is adequate to take the single-best output of ASR and perform text retrieval on this output. This is reasonable enough for the task of retrieving broadcast news stories, where word error rates are relatively low and the stories are long enough to contain much redundancy. But it is patently not reasonable if one’s task is to retrieve a short snippet of speech in a domain where WERs can be as high as 50%; such would be the situation with teleconference speech, where one’s task is to find if and when a participant uttered a certain phrase. In this paper we propose an indexing procedure for spoken utterance retrieval that works on lattices rather than just single-best text. We demonstrate that this procedure can improve F scores by over five points compared to single-best retrieval on tasks with poor WER and low redundancy. The representation is flexible so that we can represent both word lattices and phone lattices, the latter being important for improving performance when searching for phrases containing OOV words.
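
A simplified sketch of the lattice-indexing idea follows: rather than indexing only the single-best transcript, every hypothesized word is indexed with its accumulated posterior, so low-confidence words remain findable. The data layout and threshold search below are assumptions for illustration, not the authors' implementation.

from collections import defaultdict

def index_lattices(lattices):
    """lattices: {utterance_id: [(word, posterior), ...]} hypotheses."""
    index = defaultdict(lambda: defaultdict(float))
    for utt_id, arcs in lattices.items():
        for word, posterior in arcs:
            index[word][utt_id] += posterior  # accumulate confidence mass
    return index

def find_utterances(index, query_words, threshold=0.1):
    """Utterances where every query word has at least `threshold` posterior."""
    result = None
    for w in query_words:
        utts = {u for u, p in index[w].items() if p >= threshold}
        result = utts if result is None else result & utts
    return result or set()
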

275 citations


Cites background from "The TREC spoken document retrieval ..."


  • ...Also between 1997 and 2000, the Text REtrieval Conference (TREC) had a spoken document retrieval (SDR) track with many participants (Garofolo et al., 2000). NIST TREC-9 SDR Web Site (2000) states that:...


01 Jan 2006
TL;DR: The paper describes the evaluation task posed to Spoken Term Detection systems, the evaluation methodologies, the Arabic, English and Mandarin evaluation corpora, and the results of the evaluation.
Abstract: This paper presents the pilot evaluation of Spoken Term Detection technologies, held during the latter part of 2006. Spoken Term Detection systems rapidly detect the presence of a term, which is a sequence of consecutively spoken words, in a large audio corpus of heterogeneous speech material. The paper describes the evaluation task posed to Spoken Term Detection systems, the evaluation methodologies, the Arabic, English, and Mandarin evaluation corpora, and the results of the evaluation. Ten participants submitted systems for the evaluation.
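
STD systems in this evaluation were scored with a term-weighted value (TWV). The sketch below follows the published formulation as best understood here; the cost/value ratio, term prior (hence beta ≈ 999.9), and the one-second-trial approximation are stated assumptions.

def term_weighted_value(per_term_counts, speech_seconds,
                        cost_over_value=0.1, p_term=1e-4):
    """per_term_counts: iterable of (n_true, n_correct, n_spurious) per term."""
    beta = cost_over_value * (1.0 / p_term - 1.0)  # ~999.9 under assumed constants
    penalties = []
    for n_true, n_correct, n_spurious in per_term_counts:
        if n_true == 0:
            continue  # terms absent from the reference are excluded
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_spurious / (speech_seconds - n_true)  # one-second-trial assumption
        penalties.append(p_miss + beta * p_fa)
    return 1.0 - sum(penalties) / len(penalties) if penalties else float("nan")
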

252 citations


Cites methods from "The TREC spoken document retrieval ..."

  • ...A benefit of the prescribed architecture is to enable uniform operation resource measurements across systems, e.g., indexing speed, index size, search speed, etc. Previous speech retrieval evaluations like TREC’s Spoken Document Retrieval [7] (SDR), and Topic Detection and Tracking [8] (TDT) have investigated technologies similar to STD....


Proceedings ArticleDOI
23 Jan 2004
TL;DR: The authors believe this is the first systematic approach to recognizing words in historical manuscripts backed by extensive experiments; its recognition accuracy of 65% exceeds the performance of other systems that operate on non-degraded input images (non-historical documents).
Abstract: Most offline handwriting recognition approaches proceed by segmenting words into smaller pieces (usually characters) which are recognized separately. The recognition result for a word is then the composition of the individually recognized parts. Inspired by results in cognitive psychology, researchers have begun to focus on holistic word recognition approaches. Here we present a holistic word recognition approach for single-author historical documents, motivated by the fact that for severely degraded documents a segmentation of words into characters produces very poor results. The quality of the original documents does not allow us to recognize them with high accuracy; our goal here is to produce transcriptions that will allow successful retrieval of images, which has been shown to be feasible even in such noisy environments. We believe that this is the first systematic approach to recognizing words in historical manuscripts with extensive experiments. Our experiments show recognition accuracy of 65%, which exceeds the performance of other systems that operate on non-degraded input images (non-historical documents).

237 citations


Cites background from "The TREC spoken document retrieval ..."

  • ...However, this approach requires one to determine the character boundaries [1], which can only be achieved by having already recognized the characters....


  • ...Our experiments show recognition accuracy of 65%, which exceeds performance of other systems which operate on non-degraded input images (non-historical documents)....


References
Journal ArticleDOI
01 Jan 2000
TL;DR: The Text REtrieval Conference is a workshop series designed to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures and a forum for organizations interested in comparing results.
Abstract: The Text REtrieval Conference is a workshop series designed to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures and a forum for organizations interested in comparing results. TREC contains two main retrieval tasks plus optional subtasks that allow participants to focus on particular common subproblems in retrieval. The emphasis on individual experiments evaluated in a common setting has proven to be very successful. In the six years since the beginning of TREC, the state of the art in retrieval effectiveness has approximately doubled, and technology transfer among research labs and between research systems and commercial products has accelerated. In addition, TREC has sponsored the first large-scale evaluations of Chinese language retrieval, retrieval of speech and retrieval across different languages.

358 citations

Proceedings Article
01 Jan 1998

207 citations

Proceedings ArticleDOI
01 Aug 1999
TL;DR: Methods of document expansion for speech retrieval are described: vectors of automatic transcriptions are truncated to the terms a recognizer can produce, re-weighted to remain comparable to the untruncated vectors, and then augmented with the unrecognized terms, after which retrieval effectiveness can be measured.
Abstract: Methods of document expansion for speech retrieval with a recognizer are described. A database of vectors of automatic transcriptions of documents is accessed, and the vectors are truncated by removing all terms that are not recognizable by the recognizer. Terms in the truncated vectors are then weighted to associate them with the untruncated vectors. Terms not recognized by the recognizer are then added back to the weighted, truncated vectors. Retrieval effectiveness may then be measured.
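
Read as a procedure, the abstract describes three steps: truncate, re-weight, and add back. A rough sketch of those steps follows; the renormalization used for step 2 is a stand-in assumption, since the actual weighting formula is not given here.

def expand_document(vector, recognizer_vocab):
    """vector: {term: weight}; recognizer_vocab: terms the ASR can output."""
    # 1. Truncate: drop terms the recognizer could never have produced.
    truncated = {t: w for t, w in vector.items() if t in recognizer_vocab}
    # 2. Re-weight so the truncated vector stays comparable to the original
    #    (simple mass renormalization -- an assumed stand-in weighting).
    total, kept = sum(vector.values()), sum(truncated.values())
    scale = total / kept if kept else 1.0
    expanded = {t: w * scale for t, w in truncated.items()}
    # 3. Add back out-of-vocabulary terms with their original weights.
    for t, w in vector.items():
        expanded.setdefault(t, w)
    return expanded
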

188 citations

Proceedings Article
01 Jan 2000
TL;DR: This year the Center for Intelligent Information Retrieval at the University of Massachusetts participated in three of the tracks: the cross-language, question answering, and query tracks, and showed how query expansion compensates for some of the problems that can occur in query formulation.
Abstract: This year the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts participated in three of the tracks: the cross-language, question answering, and query tracks. We used approaches that were similar to those used in past years. Although UMass used a wide range of tools, from Unix shell scripts to PC spreadsheets, three major tools and techniques were applied across almost all tracks: the Inquery search engine, query processing, and a query expansion technique known as LCA. All three tracks used Inquery as the search engine, sometimes for training, and always for generating the final ranked lists for the test. In the cross-language track, we experimented with some techniques for crossing character-encoding boundaries. Our efforts were moderately successful, but we do not believe that our approach worked well in comparison to other techniques. In the question answering track, we focused on bringing answer-containing documents to the top of the ranked list. This is an important sub-task for most methods of tackling Q&A, and we are pleased with our results. We are now looking at alternate ways of thinking about that task that leverage the differences between retrieval for Q&A and for IR. Finally, we continued to participate in the query track, providing large numbers of query variants and running our system on the huge number of resulting queries. Our analysis showed how query expansion compensates for some of the problems that can occur in query formulation.
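
LCA belongs to the family of pseudo-relevance-feedback techniques. The sketch below shows only the generic feedback loop, expanding a query with frequent terms from the top-ranked documents; LCA's actual co-occurrence analysis is more elaborate, so treat the names and the selection rule as simplifying assumptions.

from collections import Counter

def expand_query(query_terms, top_doc_texts, n_new=5):
    """Add the n_new most frequent novel terms from the top-ranked documents."""
    counts = Counter()
    for text in top_doc_texts:
        counts.update(text.lower().split())
    for t in query_terms:
        counts.pop(t, None)  # keep only novel expansion terms
    return list(query_terms) + [t for t, _ in counts.most_common(n_new)]
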

141 citations

Journal ArticleDOI
TL;DR: The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance, and retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
Abstract: A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600 document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
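
Known-item tasks like these are conventionally scored by the rank at which the single target document is retrieved, e.g., mean reciprocal rank. A minimal sketch follows, assuming MRR-style scoring; the abstract itself does not name the measure.

def mean_reciprocal_rank(tasks):
    """tasks: list of (ranked doc ids, known target id) pairs."""
    total = 0.0
    for ranked, target in tasks:
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)  # rank is 1-based
    return total / len(tasks) if tasks else 0.0
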

104 citations