Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6821 publications have appeared within this topic, receiving 214383 citations.


Papers
01 Jul 1999
TL;DR: This dissertation explores several text representation issues in the context of event tracking, where a classifier for an event is formulated from one or more sample stories, and introduces a theoretical framework for automatic threshold estimation for the tracking algorithm, viewing the threshold as a statistic of the incoming data stream.
Abstract: In this work, we discuss and evaluate solutions to text classification problems associated with the events that are reported in on-line sources of news. We present solutions to three related classification problems: new event detection, event clustering, and event tracking. The primary focus of this thesis is new event detection, where the goal is to identify news stories that have not previously been reported, in a stream of broadcast news comprising radio, television, and newswire. We present an algorithm for new event detection, and analyze the effects of incorporating domain properties into the classification algorithm. We explore a solution that models the temporal relationship between news stories, and investigate the use of proper noun phrase extraction to capture the "who", "what", "when", and "where" contained in news. Our results for new event detection suggest that previous approaches to document clustering provide a good basis for an approach to new event detection, and that further improvements to classification accuracy are obtained when the domain properties of broadcast news are modeled. New event detection is related to the problem of event clustering, where the goal is to group stories that discuss the same event. We investigate on-line clustering as an approach to new event detection, and re-evaluate existing cluster comparison strategies previously used for document retrieval. Our results suggest that these strategies produce different groupings of events, and that the on-line single-link strategy extended with a model for domain properties is faster and more effective than other approaches. In this dissertation, we explore several text representation issues in the context of event tracking, where a classifier for an event is formulated from one or more sample stories. The classifier is used to monitor the subsequent news stream for documents related to the event. We discuss different approaches to classifier formulation, and compare feature selection and weight-learning steps as extensions to a baseline process used for new event detection. In addition, we evaluate an unsupervised adaptive approach to event tracking that captures the property of event evolution in broadcast news. We introduce a theoretical framework for automatic threshold estimation for our tracking algorithm, viewing the threshold as a statistic of the incoming data stream. Bias defined in terms of threshold estimators can be identified with statistical and learning techniques using our representation for text. Two methods are presented that learn estimator bias and result in improved classification accuracy for tracking. The implementations of our approaches to on-line new event detection, clustering, and tracking have been evaluated in comparison to other systems, and we present cross-system comparisons for all three classification problems. In general, the results using our approaches compared favorably to other approaches for each problem.
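The threshold-based idea behind on-line new event detection can be illustrated with a minimal sketch. This is not the dissertation's algorithm: it assumes plain term-frequency vectors, cosine similarity, and a fixed, hand-picked threshold of 0.3, and the names `vectorize`, `cosine`, and `detect_new_events` are hypothetical. It also omits the temporal modeling, proper noun extraction, and automatic threshold estimation described in the abstract.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_new_events(stories, threshold=0.3):
    """Process stories in arrival order; flag a story as a new event
    when its best similarity to every earlier story is below threshold."""
    seen = []
    flags = []
    for story in stories:
        vec = vectorize(story)
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags


# Illustrative stream (made-up headlines, not data from the dissertation).
stories = [
    "earthquake strikes coastal city overnight",
    "rescue teams search coastal city after earthquake",
    "central bank raises interest rates",
]
flags = detect_new_events(stories)
print(flags)
```

Here the second story shares three terms with the first and so is treated as a follow-up, while the first and third are flagged as new. The fixed threshold is the simplest possible choice; the abstract instead treats the threshold as a statistic estimated from the incoming stream.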

43 citations

Journal Article
TL;DR: An intelligent information retrieval system involving the use of a knowledge representation language for representing data and a search process based on a combinatorial implementation of van Rijsbergen’s logical uncertainty principle that allows the representation of retrieval situations is presented.
Abstract: An intelligent information retrieval system is presented in this paper. In our approach, which complies with the logical view of information retrieval, queries, document contents and other knowledge are represented by expressions in a knowledge representation language based on the conceptual graphs introduced by Sowa. In order to take the intrinsic vagueness of information retrieval into account, i.e. to search documents imprecisely and incompletely represented in order to answer a vague query, different kinds of probabilistic logic are often used. The search process described in this paper uses graph transformations instead of probabilistic notions. This paper is focused on the content-based retrieval process, and the cognitive facet of information retrieval is not directly addressed. However, our approach, involving the use of a knowledge representation language for representing data and a search process based on a combinatorial implementation of van Rijsbergen’s logical uncertainty principle, also allows the representation of retrieval situations. Hence, we believe that it could be implemented at the core of an operational information retrieval system. Two applications, one dealing with academic libraries and the other concerning audiovisual documents, are briefly presented.

43 citations

Proceedings Article
12 Mar 2014
TL;DR: The proposed system presents effective preprocessing and dimensionality reduction techniques that support document clustering with the k-means algorithm, and the experimental results show that the proposed method enhances the performance of English text document clustering.
Abstract: Text mining generally denotes the process of extracting interesting (non-trivial) features and knowledge from unstructured text documents. Text mining is an interdisciplinary field which depends on information retrieval, data mining, machine learning, parameter statistics and computational linguistics. Standard text mining and information retrieval techniques for text documents usually rely on similar categories. An alternative method of retrieving information is clustering documents to preprocess text. The preprocessing steps have a huge effect on the success of knowledge extraction. This study implements TF-IDF and singular value decomposition (SVD) dimensionality reduction techniques. The proposed system presents effective preprocessing and dimensionality reduction techniques which help document clustering using the k-means algorithm. Finally, the experimental results show that the proposed method enhances the performance of English text document clustering. Simulation results on the BBC news and BBC sport datasets show the superiority of the proposed algorithm.
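A rough sketch of the TF-IDF and k-means stages of such a pipeline follows. It is an illustration under stated assumptions, not the paper's system: it uses only the Python standard library, a smoothed IDF formula chosen here for convenience, whitespace tokenization with no stop-word removal or stemming, and fixed initial centroids for reproducibility. The SVD reduction step is deliberately omitted to keep the example short, and the function names `tfidf_vectors` and `kmeans` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised dense TF-IDF rows) for docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
        row = [
            counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t in vocab
        ]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def kmeans(rows, init_indices, iterations=10):
    """Lloyd's k-means with squared Euclidean distance.
    Centroids start at the rows named by init_indices; returns labels."""
    centroids = [list(rows[i]) for i in init_indices]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every row.
        for i, row in enumerate(rows):
            dists = [
                sum((x - c) ** 2 for x, c in zip(row, cen)) for cen in centroids
            ]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [rows[i] for i in range(len(rows)) if labels[i] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (made-up sentences, not the BBC datasets from the paper).
docs = [
    "football match ends in late goal",
    "striker scores goal in football final",
    "parliament votes on new budget",
    "budget debate divides parliament",
]
vocab, rows = tfidf_vectors(docs)
labels = kmeans(rows, init_indices=(0, 2))
print(labels)
```

On this toy corpus the two sports sentences land in one cluster and the two politics sentences in the other. In a fuller pipeline, a truncated SVD of the TF-IDF matrix would sit between the two functions, so that k-means runs on a lower-dimensional representation.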

43 citations

Proceedings Article
29 Dec 2011
TL;DR: A novel mobile printed document retrieval system that utilizes both text and low bit-rate features that can reliably match retrieved documents to the query document and reduce the transmitted query size significantly is presented.
Abstract: We present a novel mobile printed document retrieval system that utilizes both text and low bit-rate features. On the client phone, text is detected using an algorithm based on edge-enhanced Maximally Stable Extremal Regions. The title text image patch is rectified using a gradient-based algorithm and recognized using Optical Character Recognition. Low bit-rate image features are extracted from the query image. Both text and compressed features are sent to a server. On the server, the title text is used for on-line search and the features are used for image-based comparison. The proposed system is capable of web-scale document retrieval using title text without the need to construct a document image database. Using features for image-based comparison, we can reliably match retrieved documents to the query document. Last, by using text and low bit-rate features, we can reduce the transmitted query size significantly.

43 citations

Proceedings Article
05 Jun 2000
TL;DR: Some interesting initial results and findings obtained in this research on spoken document retrieval of broadcast news speech collected in Taiwan are reported.
Abstract: Spoken document retrieval has been extensively studied over the years because of its high potential in various applications in the near future. Considering the monosyllabic structure of the Chinese language, a whole class of indexing features for retrieval of spoken documents in Mandarin Chinese using syllable-level statistical characteristics has been studied, and very encouraging experimental results on retrieval of broadcast news speech collected in Taiwan were obtained. This paper reports some interesting initial results and findings obtained in this research.

43 citations


Network Information
Related Topics (5)
Web page
50.3K papers, 975.1K citations
81% related
Metadata
43.9K papers, 642.7K citations
79% related
Recommender system
27.2K papers, 598K citations
79% related
Ontology (information science)
57K papers, 869.1K citations
78% related
Natural language
31.1K papers, 806.8K citations
77% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    9
2022    39
2021    107
2020    130
2019    144
2018    111