Topic

Document retrieval

About: Document retrieval is a research topic. Over its lifetime, 6,821 publications have been published within this topic, receiving 214,383 citations.


Papers
Proceedings Article
12 Apr 2000
TL;DR: The paper details the process used to identify and implement a time-adaptive language model and reports its effect on word error rate, out-of-vocabulary rate, and retrieval accuracy (Mean Average Precision).
Abstract: This paper describes experiments implemented at NIST in adapting language models over time to improve recognition of broadcast news recorded over many months. These experiments were designed specifically to improve the utility of automatically generated transcripts for retrieval applications. To evaluate the potential of the approach, a time-adaptive automatic speech recognition run was implemented to support the 1999 TREC Spoken Document Retrieval (SDR) Track - more than 500 hours of broadcast news sampled across 5 months. The accuracy of retrieval for several systems using the time-adaptive system transcripts was evaluated against transcripts produced by virtually the same recognition system with a fixed language model. This paper details the process we employed to identify and implement the time-adaptive language model and discusses the results of the experiment in terms of its effect on word error rate, out of vocabulary rate and retrieval accuracy (Mean Average Precision).

47 citations
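
The abstract above reports retrieval accuracy as Mean Average Precision (MAP). As a reminder of what that metric measures, here is a minimal sketch of the standard definition; the function names and the document ids are hypothetical and are not taken from the paper or from the TREC evaluation tooling.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: two queries with made-up document ids.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),              # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```

Because each query's score averages precision at the ranks where relevant documents appear, transcripts that push relevant broadcasts lower in the ranking reduce MAP even when the same documents are eventually retrieved.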

Proceedings Article
03 May 2021
TL;DR: GENRE retrieves entities by generating their unique names left to right, token by token, in an autoregressive fashion conditioned on the context.
Abstract: Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach leads to several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion and conditioned on the context. This enables us to mitigate the aforementioned technical issues since: (i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach, experimenting with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. 
Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

47 citations
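
The GENRE abstract above describes retrieving an entity by generating its name token by token. One way to picture how generation can be limited to valid entity names is a prefix tree over the tokenized names, sketched below. This is only an illustration under stated assumptions: the abstract does not say how decoding is restricted, and the trie, the greedy search, the `<eos>` marker, and `toy_score` (a stand-in for a trained encoder-decoder) are all hypothetical.

```python
from typing import Callable, List, Sequence

END = "<eos>"  # hypothetical end-of-name marker


def build_trie(names: Sequence[Sequence[str]]) -> dict:
    """Prefix tree over tokenized entity names; END marks a complete name."""
    root: dict = {}
    for tokens in names:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    trie: dict, score: Callable[[List[str], str], float]
) -> List[str]:
    """Pick the best-scoring allowed token at each step until END is chosen."""
    if not trie:
        raise ValueError("the trie must contain at least one entity name")
    prefix: List[str] = []
    node = trie
    while True:
        # Only tokens that continue some valid entity name are allowed.
        allowed = sorted(node)
        best = max(allowed, key=lambda tok: score(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


def toy_score(context: str) -> Callable[[List[str], str], float]:
    """Stand-in scorer (assumption): 1.0 if the token occurs in the context."""
    words = set(context.lower().split())
    return lambda prefix, token: 1.0 if token.lower() in words else 0.0


trie = build_trie([["Paris"], ["Paris", "Hilton"], ["Paris", "Texas"]])
print(constrained_greedy_decode(trie, toy_score("the film set in Texas")))
```

Adding a new entity in this sketch only requires inserting its tokenized name into the trie, which mirrors the abstract's remark that new entities can be added by specifying their unambiguous name.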

Book
01 Jan 2008
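TL;DR: This proceedings volume collects invited, regular, poster, and doctoral symposium papers on natural language processing and understanding, information retrieval, querying and question answering, document processing and text mining, software (requirements) engineering and specification, and conceptual modelling and ontologies.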
Abstract:
Invited Papers: Sentence and Text Comprehension: Evidence from Human Language Processing; Towards Semantic Search; From Databases to Natural Language: The Unusual Direction.
Natural Language Processing and Understanding: Division of Spanish Words into Morphemes with a Genetic Algorithm; Abbreviation Disambiguation: Experiments with Various Variants of the One Sense per Discourse Hypothesis; The Acquisition of Common Sense Knowledge by Being Told: An Application of NLP to Itself.
Natural Language Processing and Understanding: Interlingua for French and German Topological Prepositions; Ontological Profiles as Semantic Domain Representations; A Hybrid Approach to Ontology Relationship Learning; Automating the Generation of Semantic Annotation Tools Using a Clustering Technique.
Information Retrieval: Exploiting Multiple Features with MEMMs for Focused Web Crawling; Ranked-Listed or Categorized Results in IR: 2 Is Better Than 1; Exploiting Morphological Query Structure Using Genetic Optimisation; Generation of Query-Biased Concepts Using Content and Structure for Query Reformulation; Comparing Several Textual Information Retrieval Systems for the Geographical Information Retrieval Task.
Querying and Question Answering: Intensional Question Answering Using ILP: What Does an Answer Mean?; Augmenting Data Retrieval with Information Retrieval Techniques by Using Word Similarity; Combining Data Integration and IE Techniques to Support Partially Structured Data; Towards Building Robust Natural Language Interfaces to Databases; Towards a Bootstrapping NLIDB System.
Document Processing and Text Mining: Real-Time News Event Extraction for Global Crisis Monitoring; Topics Identification Based on Event Sequence Using Co-occurrence Words; Topic Development Based Refinement of Audio-Segmented Television News; Text Entailment for Logical Segmentation and Summarization; Comparing Non-parametric Ensemble Methods for Document Clustering; A Language Modelling Approach to Linking Criminal Styles with Offender Characteristics.
Software (Requirements) Engineering and Specification: Towards Designing Operationalizable Models of Man-Machine Interaction Based on Concepts from Human Dialog Systems; Using Linguistic Knowledge to Classify Non-functional Requirements in SRS documents; A Preliminary Approach to the Automatic Extraction of Business Rules from Unrestricted Text in the Banking Industry; Paraphrasing OCL Expressions with SBVR; A General Architecture for Connecting NLP Frameworks and Desktop Clients Using Web Services.
Conceptual Modelling and Ontologies Related Posters: Conceptual Model Generation from Requirements Model: A Natural Language Processing Approach; Bivalent Verbs and Their Pragmatic Particularities; Using Ontologies and Relatedness Metrics for Semantic Document Analysis on the Web.
Information Retrieval Related Posters: The TSRM Approach in the Document Retrieval Application; Enhanced Services for Targeted Information Retrieval by Event Extraction and Data Mining.
Querying and Question Answering Related Posters: Improving Question Answering Tasks by Textual Entailment Recognition; Supporting Named Entity Recognition and Syntactic Analysis with Full-Text Queries.
Document Processing and Text Mining Related Posters: Multilingual Feature-Driven Opinion Extraction and Summarization from Customer Reviews; Lexical and Semantic Methods in Inner Text Topic Segmentation: A Comparison between C99 and Transeg; An Application of NLP and Audiovisual Content Analysis for Integration of Multimodal Databases of Current Events; Detecting Protein-Protein Interaction Sentences Using a Mixture Model; Using Semantic Features to Improve Task Identification in Email Messages; Text Pre-processing for Document Clustering.
Software (Requirements) Engineering and Specification Related Posters: Trade Oriented Enterprise Content Management: A Semantic and Collaborative Prototype Dedicated to the "Quality, Hygiene, Safety and Environment" Domain.
Doctoral Symposium Papers: Mapping Natural Language into SQL in a NLIDB; Improving Data Integration through Disambiguation Techniques; An Ontology-Based Focused Crawler; Impact of Term-Indexing for Arabic Document Retrieval.

47 citations

Journal Article
TL;DR: The view is taken that IR is an inference task, and that natural language processing (NLP) techniques can produce text representations that enable more accurate inferences about document content.
Abstract: There is no task that computers regularly perform that is more affected by the nature of human language than the retrieval of texts in response to a human need. Despite this, the techniques actually in use for this task, as well as most of the techniques proposed by information retrieval (IR) researchers, make little use of knowledge about language. In this article we take the view that IR is an inference task, and that natural language processing (NLP) techniques can produce text representations that enable more accurate inferences about document content. By considering previous work on language-based and knowledge-based techniques from this perspective, some clear lessons are apparent, and we are applying these lessons in the ADRENAL (Augmented Document REtrieval using NAtural Language processing) project. Our initial experiments with hand-coded representations suggest that using NLP-produced representations can result in significant performance increases in IR systems, and also demonstrate the attention that must be given to representational issues in language-oriented IR. © 1989 Wiley Periodicals, Inc.

47 citations

Journal Article
TL;DR: A particular form of dissimilarity space is proposed that is adapted to the asymmetric classification problem and, in turn, to the query-by-example and relevance feedback paradigm, widely used in information retrieval.
Abstract: The paper proposes a novel representation space for multimodal information, enabling fast and efficient retrieval of video data. We suggest describing the documents not directly by selected multimodal features (audio, visual, or text) but rather by considering cross-document similarities relative to their multimodal characteristics. This idea leads us to propose a particular form of dissimilarity space that is adapted to the asymmetric classification problem and, in turn, to the query-by-example and relevance feedback paradigm, widely used in information retrieval. Based on the proposed dissimilarity space, we then define various strategies to fuse modalities through a kernel-based learning approach. The problem of automatic kernel setting to adapt the learning process to the queries is also discussed. The properties of our strategies are studied and validated on artificial data. In a second phase, a large annotated video corpus (i.e., TRECVID '05) indexed by visual, audio, and text features is considered to evaluate the overall performance of the dissimilarity space and fusion strategies. The obtained results confirm the validity of the proposed approach for the representation and retrieval of multimodal information in a real-time framework.

47 citations
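
The abstract above describes documents by their cross-document dissimilarities instead of by raw multimodal features. A minimal sketch of that general idea follows, assuming Euclidean distance and assuming the prototypes are the positive examples supplied by a query-by-example user; both choices, the function names, and the toy vectors are illustrative and are not the paper's exact construction. The kernel-based fusion step is not shown.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_dissimilarity_space(
    docs: Sequence[Vector],
    prototypes: Sequence[Vector],
    dist: Callable[[Vector, Vector], float] = euclidean,
) -> List[List[float]]:
    """Represent each document by its distances to the prototype documents."""
    return [[dist(doc, proto) for proto in prototypes] for doc in docs]


# Hypothetical 2-D features; the prototypes stand in for positive query examples.
docs = [[0.0, 0.0], [3.0, 4.0]]
prototypes = [[0.0, 0.0], [0.0, 4.0]]
print(to_dissimilarity_space(docs, prototypes))  # [[0.0, 4.0], [5.0, 3.0]]
```

In this representation the dimensionality equals the number of prototypes, so a classifier or ranker trained on it depends on the query examples and not on the size of the original feature vectors.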


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 81% related
Metadata: 43.9K papers, 642.7K citations, 79% related
Recommender system: 27.2K papers, 598K citations, 79% related
Ontology (information science): 57K papers, 869.1K citations, 78% related
Natural language: 31.1K papers, 806.8K citations, 77% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  9
2022  39
2021  107
2020  130
2019  144
2018  111