Showing papers on "Document retrieval" published in 2022


Book ChapterDOI
01 Feb 2022
TL;DR: This article proposed an extension to the BERT re-ranker to better exploit the context of queries and used an additional document-level representation learning objective besides the ranking objective when fine-tuning BERT.
Abstract: Query-by-document (QBD) retrieval is an Information Retrieval task in which a seed document acts as the query and the goal is to retrieve related documents – it is particularly common in professional search tasks. In this work we improve the retrieval effectiveness of the BERT re-ranker, proposing an extension to its fine-tuning step to better exploit the context of queries. To this end, we use an additional document-level representation learning objective besides the ranking objective when fine-tuning the BERT re-ranker. Our experiments on two QBD retrieval benchmarks show that the proposed multi-task optimization significantly improves the ranking effectiveness without changing the BERT re-ranker or using additional training samples. In future work, the generalizability of our approach to other retrieval tasks should be further investigated.

14 citations


Book ChapterDOI
05 Jan 2022
TL;DR: In this paper, a Paragraph Aggregation Retrieval Model (PARM) is proposed for dense document-to-document retrieval, together with vector-based aggregation with reciprocal rank fusion (VRRF) weighting, which combines the advantages of rank-based aggregation and topical aggregation based on the dense embeddings.
Abstract: Dense passage retrieval (DPR) models show great effectiveness gains in first stage retrieval for the web domain. However in the web domain we are in a setting with large amounts of training data and a query-to-passage or a query-to-document retrieval task. We investigate in this paper dense document-to-document retrieval with limited labelled target data for training, in particular legal case retrieval. In order to use DPR models for document-to-document retrieval, we propose a Paragraph Aggregation Retrieval Model (PARM) which liberates DPR models from their limited input length. PARM retrieves documents on the paragraph-level: for each query paragraph, relevant documents are retrieved based on their paragraphs. Then the relevant results per query paragraph are aggregated into one ranked list for the whole query document. For the aggregation we propose vector-based aggregation with reciprocal rank fusion (VRRF) weighting, which combines the advantages of rank-based aggregation and topical aggregation based on the dense embeddings. Experimental results show that VRRF outperforms rank-based aggregation strategies for dense document-to-document retrieval with PARM. We compare PARM to document-level retrieval and demonstrate higher retrieval effectiveness of PARM for lexical and dense first-stage retrieval on two different legal case retrieval collections. We investigate how to train the dense retrieval model for PARM on limited target data with labels on the paragraph or the document-level. In addition, we analyze the differences of the retrieved results of lexical and dense retrieval with PARM.
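The aggregation step described in this abstract can be illustrated with a minimal sketch. It assumes plain reciprocal rank fusion with the constant k = 60 (a common default that the abstract does not state) and is not the paper's VRRF weighting, which additionally uses the dense embeddings. The function name `rrf_aggregate` and the toy document ids are hypothetical.

```python
from collections import defaultdict


def rrf_aggregate(per_paragraph_rankings, k=60):
    """Fuse one ranked list of document ids per query paragraph into one ranking.

    Assumption: plain reciprocal rank fusion, score(d) = sum of 1 / (k + rank).
    """
    scores = defaultdict(float)
    for ranking in per_paragraph_rankings:
        seen = set()
        for rank, doc_id in enumerate(ranking, start=1):
            # A document can be hit through several of its paragraphs;
            # only its best rank in this list contributes.
            if doc_id in seen:
                continue
            seen.add(doc_id)
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy input: three query paragraphs, each with its own retrieved documents.
rankings = [["d1", "d2", "d3"], ["d2", "d1"], ["d2", "d4"]]
fused = rrf_aggregate(rankings)
print(fused)
```

In this toy input, "d2" is retrieved for every query paragraph and therefore ends up first in the fused list, which is the behaviour a rank-based aggregation is meant to reward.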

10 citations


Journal ArticleDOI
TL;DR: This paper proposes a general approach using deep neural networks with attention mechanisms for legal text retrieval, and develops two hierarchical architectures with sparse attention, Attentive CNN and Paraformer, to represent long sentences and articles.
Abstract: Legal text retrieval serves as a key component in a wide range of legal text processing tasks such as legal question answering, legal case entailment, and statute law retrieval. The performance of legal text retrieval depends, to a large extent, on the representation of text, both query and legal documents. Based on good representations, a legal text retrieval model can effectively match the query to its relevant documents. Because legal documents often contain long articles and only some parts are relevant to queries, it is quite a challenge for existing models to represent such documents. In this paper, we study the use of attentive neural network-based text representation for statute law document retrieval. We propose a general approach using deep neural networks with attention mechanisms. Based on it, we develop two hierarchical architectures with sparse attention to represent long sentences and articles, and we name them Attentive CNN and Paraformer. The methods are evaluated on datasets of different sizes and characteristics in English, Japanese, and Vietnamese. Experimental results show that: (i) Attentive neural methods substantially outperform non-neural methods in terms of retrieval performance across datasets and languages; (ii) Pretrained transformer-based models achieve better accuracy on small datasets at the cost of high computational complexity while lighter weight Attentive CNN achieves better accuracy on large datasets; and (iii) Our proposed Paraformer outperforms state-of-the-art methods on COLIEE dataset, achieving the highest recall and F2 scores in the top-N retrieval task.

4 citations


Journal ArticleDOI
TL;DR: This article proposed a learning-to-rank model named LM-EMD that uses the multilingual BERT language model and Earth Mover's Distance (EMD) to measure the document's relevancy to the input query and provide interpretable insights into why a document is relevant.
Abstract: Modern cross-lingual document retrieval models are capable of finding documents relevant to the query. However, they do not have the capabilities for explaining why the document is relevant. This paper proposes a novel learning-to-rank model named LM-EMD that uses the multilingual BERT language model and Earth Mover’s Distance (EMD) to measure the document’s relevancy to the input query and provide interpretable insights into why a document is relevant. The model uses the query and document token’s contextual embeddings generated with multilingual BERT to measure their distances in the embedding space, which are then used by EMD to calculate the document’s relevance score and identify which document tokens contribute the most to its relevancy. We evaluate the model on five language pairs of varying degrees of similarity and analyze its performance. We find that the model (1) performs similarly to the best-performing comparison model on high-resource languages, (2) is less effective on low-resource languages, and (3) provides insight into why a document is relevant to the query.

3 citations


Journal ArticleDOI
TL;DR: In this paper, a new encrypted document retrieval system is designed and a proxy server is integrated into the system to alleviate data owner's workload and improve the whole system's security level.
Abstract: With the development of cloud computing, more and more data owners are motivated to outsource their documents to the cloud and share them with the authorized data users securely and flexibly. To protect data privacy, the documents are generally encrypted before being outsourced to the cloud and hence their searchability decreases. Though many privacy-preserving document search schemes have been proposed, they cannot reach a proper balance among functionality, flexibility, security and efficiency. In this paper, a new encrypted document retrieval system is designed and a proxy server is integrated into the system to alleviate data owner's workload and improve the whole system's security level. In this process, we consider a more practical and stronger threat model in which the cloud server can collude with a small number of data users. To support multiple document search patterns, we construct two AVL trees for the filenames and authors, and a Hierarchical Retrieval Features tree (HRF tree) for the document vectors. A depth-first search algorithm is designed for the HRF tree and the Enhanced Asymmetric Scalar-Product-Preserving Encryption (Enhanced ASPE) algorithm is utilized to encrypt the HRF tree. All the three index trees are linked with each other to efficiently support the search requests with multiple parameters. Theoretical analysis and simulation results illustrate the security and efficiency of the proposed framework.

2 citations


Proceedings ArticleDOI
01 Jan 2022
TL;DR: This paper proposed a Document Augmentation for dense Retrieval (DAR) framework, which augments the representations of documents with their interpolation and perturbation and significantly outperforms relevant baselines on the dense retrieval of both labeled and unlabeled documents.
Abstract: Dense retrieval models, which aim at retrieving the most relevant document for an input query on a dense representation space, have gained considerable attention for their remarkable success. Yet, dense models require a vast amount of labeled training data for notable performance, whereas it is often challenging to acquire query-document pairs annotated by humans. To tackle this problem, we propose a simple but effective Document Augmentation for dense Retrieval (DAR) framework, which augments the representations of documents with their interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.

1 citation


Proceedings ArticleDOI
02 Dec 2022
TL;DR: This article presents a study on how methodologies like pooling and document similarity help to add more relevant documents to the relevance judgments set in order to increase the accuracy of the evaluation process.
Abstract: Information retrieval system evaluation is carried out through the Cranfield paradigm, in which test collections provide the foundation of the evaluation process. The test collections consist of a document corpus, topics, and a set of relevance judgements. The relevance judgements are the documents retrieved from the test collections based on the topics. The precision of the evaluation process is based on the number of relevant documents in the relevance judgement list, called qrels. This paper presents a study on how methodologies like pooling and document similarity help to add more relevant documents to the relevance judgments set in order to increase the accuracy of the evaluation process. The initial results have shown that the combination of pooling with document similarity performs better than base clustering or classification.
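As a rough illustration of the pooling step mentioned in this abstract, the sketch below assumes classic depth-k pooling: the candidates to be judged for a topic are the union of the top-k documents from each participating system. The abstract does not say which pooling variant the study uses, and the document-similarity step is not covered. The function name `depth_k_pool` and the run data are hypothetical.

```python
def depth_k_pool(runs, depth):
    """Build a judgment pool per topic from several system runs.

    runs: dict mapping system name -> dict mapping topic id -> ranked doc ids.
    Assumption: depth-k pooling, i.e. the union of each system's top `depth` docs.
    """
    pool = {}
    for ranking_by_topic in runs.values():
        for topic, ranking in ranking_by_topic.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool


# Toy runs from two hypothetical systems on a single topic.
runs = {
    "bm25": {"t1": ["d1", "d2", "d3"]},
    "dense": {"t1": ["d4", "d1", "d5"]},
}
pool = depth_k_pool(runs, depth=2)
print(sorted(pool["t1"]))
```

Only the pooled documents would be sent to assessors; documents outside the pool are typically treated as non-relevant, which is why adding likely-relevant documents to the pool matters for the reliability of the qrels.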

Posted ContentDOI
08 Oct 2022
TL;DR: In this paper, the authors propose an approach that retrieves the evidence documents efficiently and accurately, making sure that the relevant document for a given user query is not missed, by assigning each document (or passage in their case) a unique identifier and using the identifiers to create dense vectors which can be efficiently indexed.
Abstract: Modern day applications, especially information retrieval webapps that involve "search" as their use cases, are gradually moving towards "answering" modules. Conversational chatbots, which have been proved to be more engaging to users, use Question Answering as their core. Since precise answering is computationally expensive, several approaches have been developed to prefetch the most relevant documents/passages from the database that contain the answer. We propose a different approach that retrieves the evidence documents efficiently and accurately, making sure that the relevant document for a given user query is not missed. We do so by assigning each document (or passage in our case) a unique identifier and using them to create dense vectors which can be efficiently indexed. More precisely, we use the identifier to predict randomly sampled context window words of the relevant question corresponding to the passage along with the words of the passage itself. This naturally embeds the passage identifier into the vector space in such a way that the embedding is closer to the question without compromising the information content. This approach enables efficient creation of real-time query vectors in ~4 milliseconds.

Posted ContentDOI
12 Dec 2022
TL;DR: This paper proposes a general approach using deep neural networks with attention mechanisms, and develops two hierarchical architectures with sparse attention to represent long sentences and articles, named Attentive CNN and Paraformer.
Abstract: Legal text retrieval serves as a key component in a wide range of legal text processing tasks such as legal question answering, legal case entailment, and statute law retrieval. The performance of legal text retrieval depends, to a large extent, on the representation of text, both query and legal documents. Based on good representations, a legal text retrieval model can effectively match the query to its relevant documents. Because legal documents often contain long articles and only some parts are relevant to queries, it is quite a challenge for existing models to represent such documents. In this paper, we study the use of attentive neural network-based text representation for statute law document retrieval. We propose a general approach using deep neural networks with attention mechanisms. Based on it, we develop two hierarchical architectures with sparse attention to represent long sentences and articles, and we name them Attentive CNN and Paraformer. The methods are evaluated on datasets of different sizes and characteristics in English, Japanese, and Vietnamese. Experimental results show that: i) Attentive neural methods substantially outperform non-neural methods in terms of retrieval performance across datasets and languages; ii) Pretrained transformer-based models achieve better accuracy on small datasets at the cost of high computational complexity while lighter weight Attentive CNN achieves better accuracy on large datasets; and iii) Our proposed Paraformer outperforms state-of-the-art methods on COLIEE dataset, achieving the highest recall and F2 scores in the top-N retrieval task.

Proceedings ArticleDOI
21 Oct 2022
TL;DR: In this article, the authors investigated how to extract the required answer information from a large document library based on user questions, and constructed a semantic retrieval system based on document comprehension, which can quickly locate the answers to the user's questions and help the user to obtain the required information quickly and effectively.
Abstract: In the current information environment, unstructured text is an important part of all kinds of information, and extracting the required information from text data to meet users' information needs has become an urgent problem in the information service field. Based on semantic retrieval and an extractive document reading comprehension model, this paper investigates how to quickly and effectively extract the required answer information from a large document library based on user questions, and constructs a semantic retrieval system based on document comprehension. It is important to help users get the required information quickly and effectively and improve the efficiency of information services in the current massive information environment. The experiments show that the system can quickly and precisely locate the answers to the user's questions and help the user to obtain the required information quickly and effectively.


Journal ArticleDOI
TL;DR: This paper presents an overview of the Arabic information retrieval process, including various text processing techniques, ranking approaches, evaluation measures, and some important information retrieval models, along with some recent related studies and approaches in different Arabic information retrieval fields.
Abstract: Information retrieval is an important field that aims to provide documents relevant to a user's information need, expressed through a query. Arabic is a challenging language that has gained much attention recently in the information retrieval domain. To overcome the problems related to its complexity, many studies and techniques have been presented, most of which address the stemming problem. This paper presents an overview of the Arabic information retrieval process, including various text processing techniques, ranking approaches, evaluation measures, and some important information retrieval models. The paper finally presents some recent related studies and approaches in different Arabic information retrieval fields.

Posted ContentDOI
27 Nov 2022
TL;DR: This survey systematically reviews recent advances in PLM-based dense retrieval, organizing the related work by four major aspects (architecture, training, indexing and integration), and aims to provide a comprehensive, practical reference focused on the major progress for dense text retrieval.
Abstract: Text retrieval is a long-standing research topic on information seeking, where a system is required to return relevant information resources to user's queries in natural language. From classic retrieval methods to learning-based ranking functions, the underlying retrieval models have been continually evolved with the ever-lasting technical innovation. To design effective retrieval models, a key point lies in how to learn the text representation and model the relevance matching. The recent success of pretrained language models (PLMs) sheds light on developing more capable text retrieval approaches by leveraging the excellent modeling capacity of PLMs. With powerful PLMs, we can effectively learn the representations of queries and texts in the latent representation space, and further construct the semantic matching function between the dense vectors for relevance modeling. Such a retrieval approach is referred to as dense retrieval, since it employs dense vectors (a.k.a., embeddings) to represent the texts. Considering the rapid progress on dense retrieval, in this survey, we systematically review the recent advances on PLM-based dense retrieval. Different from previous surveys on dense retrieval, we take a new perspective to organize the related work by four major aspects, including architecture, training, indexing and integration, and summarize the mainstream techniques for each aspect. We thoroughly survey the literature, and include 300+ related reference papers on dense retrieval. To support our survey, we create a website for providing useful resources, and release a code repertory and toolkit for implementing dense retrieval models. This survey aims to provide a comprehensive, practical reference focused on the major progress for dense text retrieval.

Posted ContentDOI
02 Dec 2022
TL;DR: In this article, the authors have developed an OCR (Optical Character Recognition) Search Engine to make an Information Retrieval & Extraction (IRE) system that replicates the current state-of-the-art methods using the IRE and Natural Language Processing (NLP) techniques.
Abstract: Extracting the relevant information out of a large number of documents is a challenging and tedious task. The quality of results generated by the traditionally available full-text search engine and text-based image retrieval systems is not optimal. Information retrieval (IR) tasks become more challenging with the nontraditional language scripts, as in the case of Indic scripts. The authors have developed OCR (Optical Character Recognition) Search Engine to make an Information Retrieval & Extraction (IRE) system that replicates the current state-of-the-art methods using the IRE and Natural Language Processing (NLP) techniques. Here we have presented the study of the methods used for performing search and retrieval tasks. The details of this system, along with the statistics of the dataset (source: National Digital Library of India or NDLI), is also presented. Additionally, the ideas to further explore and add value to research in IRE are also discussed.

Posted ContentDOI
05 Jan 2022
TL;DR: In this paper, a Paragraph Aggregation Retrieval Model (PARM) is proposed for dense document-to-document retrieval, together with vector-based aggregation with reciprocal rank fusion (VRRF) weighting, which combines the advantages of rank-based aggregation and topical aggregation based on the dense embeddings.
Abstract: Dense passage retrieval (DPR) models show great effectiveness gains in first stage retrieval for the web domain. However in the web domain we are in a setting with large amounts of training data and a query-to-passage or a query-to-document retrieval task. We investigate in this paper dense document-to-document retrieval with limited labelled target data for training, in particular legal case retrieval. In order to use DPR models for document-to-document retrieval, we propose a Paragraph Aggregation Retrieval Model (PARM) which liberates DPR models from their limited input length. PARM retrieves documents on the paragraph-level: for each query paragraph, relevant documents are retrieved based on their paragraphs. Then the relevant results per query paragraph are aggregated into one ranked list for the whole query document. For the aggregation we propose vector-based aggregation with reciprocal rank fusion (VRRF) weighting, which combines the advantages of rank-based aggregation and topical aggregation based on the dense embeddings. Experimental results show that VRRF outperforms rank-based aggregation strategies for dense document-to-document retrieval with PARM. We compare PARM to document-level retrieval and demonstrate higher retrieval effectiveness of PARM for lexical and dense first-stage retrieval on two different legal case retrieval collections. We investigate how to train the dense retrieval model for PARM on limited target data with labels on the paragraph or the document-level. In addition, we analyze the differences of the retrieved results of lexical and dense retrieval with PARM.

Posted ContentDOI
17 Nov 2022
TL;DR: This paper proposed a distant-supervision method that does not require any annotation to train autoregressive retrievers that attain competitive R-Precision and Recall in a zero-shot setting.
Abstract: Document retrieval is a core component of many knowledge-intensive natural language processing task formulations such as fact verification and question answering. Sources of textual knowledge, such as Wikipedia articles, condition the generation of answers from the models. Recent advances in retrieval use sequence-to-sequence models to incrementally predict the title of the appropriate Wikipedia page given a query. However, this method requires supervision in the form of human annotation to label which Wikipedia pages contain appropriate context. This paper introduces a distant-supervision method that does not require any annotation to train autoregressive retrievers that attain competitive R-Precision and Recall in a zero-shot setting. Furthermore we show that with task-specific supervised fine-tuning, autoregressive retrieval performance for two Wikipedia-based fact verification tasks can approach or even exceed full supervision using less than $1/4$ of the annotated data indicating possible directions for data-efficient autoregressive retrieval.

Book ChapterDOI
TL;DR: Information retrieval is a set of methods to provide relevant documents based on the input query; its steps run from pre-processing through indexing to ranking.
Abstract: Information retrieval is a set of methods to provide relevant documents based on the input query. The steps involved in information retrieval run from pre-processing through indexing to ranking. Explanations using examples are given in the chapter. The applications of information retrieval include document summarization, question answering systems and many more.

Posted ContentDOI
11 Aug 2022
TL;DR: This article proposed to augment semantic document representations learned by bi-encoders with behavioral document representations by using the Pitman-Yor process to divide the total budget for behavioral representations.
Abstract: We consider text retrieval within dense representational space in real-world settings such as e-commerce search where (a) document popularity and (b) diversity of queries associated with a document have a skewed distribution. Most of the contemporary dense retrieval literature presents two shortcomings in these settings. (1) They learn an almost equal number of representations per document, agnostic to the fact that a few head documents are disproportionately more critical to achieving a good retrieval performance. (2) They learn purely semantic document representations inferred from intrinsic document characteristics which may not contain adequate information to determine the queries for which the document is relevant--especially when the document is short. We propose to overcome these limitations by augmenting semantic document representations learned by bi-encoders with behavioral document representations learned by our proposed approach MVG. To do so, MVG (1) determines how to divide the total budget for behavioral representations by drawing a connection to the Pitman-Yor process, and (2) simply clusters the queries related to a given document (based on user behavior) within the representational space learned by a base bi-encoder, and treats the cluster centers as its behavioral representations. Our central contribution is the finding that such a simple intuitive light-weight approach leads to substantial gains in key first-stage retrieval metrics by incurring only a marginal memory overhead. We establish this via extensive experiments over three large public datasets comparing several single-vector and multi-vector bi-encoders, a proprietary e-commerce search dataset compared to a production-quality bi-encoder, and an A/B test.
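The clustering step described in this abstract (cluster the query vectors associated with a document and keep the cluster centers as extra document representations) can be sketched as follows. This is an illustration under stated assumptions, not the MVG implementation: the abstract does not name the clustering algorithm, so plain k-means with a deterministic initialisation is assumed here, the number of clusters is fixed by hand rather than derived from the Pitman-Yor budget, and the function name `kmeans_centers` and the toy vectors are hypothetical.

```python
def kmeans_centers(vectors, k, iterations=10):
    """Return k cluster centers for a list of equal-length vectors.

    Assumption: plain k-means (Lloyd's algorithm), initialised with the
    first k vectors so that the result is deterministic.
    """
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to the center with the smallest squared distance.
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centers[i])),
            )
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Toy query embeddings (from a hypothetical base bi-encoder) for one document.
query_vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]]
centers = kmeans_centers(query_vectors, k=2)
print(centers)
```

The resulting centers would be indexed alongside the document's semantic vector, so a popular document with diverse queries can be reached through any of its behavioral representations.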

Proceedings ArticleDOI
17 Dec 2022
TL;DR: In this paper, the authors proposed a document retrieval system based on N similarity, Fast-text, and BERT to retrieve relevant documents for particular Supreme Court cases in India from a set of prior case documents.
Abstract: Document retrieval is the process of discovering a set of documents that are related, to some significant degree of similarity, to a query or another document, such as laws, cases, or judgments. The current work in the document retrieval domain focuses on creating a system that will compare the features or information contained in the supporting documents with the given query document and suggest the top-ranked documents that have a higher similarity score. This work is particularly useful in helping lawyers by suggesting similar cases or judgments and allowing them to explore a significantly smaller number of previously recorded cases/judgments. The main focus of this work is to retrieve relevant documents for particular Supreme Court cases in India from a set of prior case documents. To categorize documents and do the document retrieval task, we have used a published data set of 2000 prior cases. For each distinct document, the current system lists a set of documents with their ranking scores using N similarity, Fast-text, and BERT. Then, the system returns the relevant documents based on their ranking score. In order to compare our system to other similar ones already in use, we computed the accuracy and recall of our system. We have found that different methods perform better in certain areas: a few systems have higher precision, while others have higher recall. Our developed system provides comparably better accuracy, recall, and mean average precision than other current systems.

Journal ArticleDOI
TL;DR: In this paper, two applications of word embedding algorithms, Average Cosine Similarity Retrieval (ACS) and Word Mover's Distance Retrieval (WMDR), were evaluated against weighted keyword (KW) search using a test set from the Digital Ricœur corpus tagged by scholarly experts.
Abstract: During the course of research, scholars often search large textual databases for segments of text relevant to their conceptual analyses. This study proposes, develops and evaluates two applications of word embedding algorithms for automated Concept Detection in theoretical corpora: Average Cosine Similarity Retrieval (ACS) and Word Mover’s Distance Retrieval (WMDR). Both strategies are evaluated against weighted keyword (KW) search using a test set from the Digital Ricœur corpus tagged by scholarly experts. In our experiments, WMDR outperformed weighted keyword search on the Concept Detection task, which suggests it is a promising strategy for Concept Detection and information retrieval systems focused on theoretical corpora. Besides these initial positive results, WMDR has as its major characteristic the ability to use definitions as proxies for concepts; this provides search results that account for the semantic contexts of theoretical concepts.
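To make the Average Cosine Similarity Retrieval (ACS) idea concrete, here is a minimal sketch. It assumes that a segment's score is the mean cosine similarity over all pairs of query-word and segment-word vectors; the abstract does not define the exact averaging, so this is an illustrative reading rather than the study's method. The toy two-dimensional embeddings, the segment names, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_cosine_similarity(query_words, segment_words, embeddings):
    """Mean cosine similarity over all (query word, segment word) pairs.

    Assumption: words missing from the embedding table are ignored.
    """
    pairs = [
        (q, s)
        for q in query_words
        for s in segment_words
        if q in embeddings and s in embeddings
    ]
    if not pairs:
        return 0.0
    return sum(cosine(embeddings[q], embeddings[s]) for q, s in pairs) / len(pairs)


# Toy embeddings; real word vectors would come from a trained model.
embeddings = {
    "time": [1.0, 0.0],
    "narrative": [0.8, 0.6],
    "memory": [0.6, 0.8],
    "tax": [0.0, 1.0],
}
segments = {"a": ["narrative", "memory"], "b": ["tax"]}
scores = {
    name: average_cosine_similarity(["time"], words, embeddings)
    for name, words in segments.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Unlike weighted keyword search, a segment can score well here without containing the query word at all, as long as its words lie close to the query in the embedding space.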

Journal ArticleDOI
TL;DR: An improved random-walk method is introduced that ranks a document by considering the position of a term within the document and the information gain of that term within the whole document set.
Abstract: Document representation is one of the most fundamental issues in information retrieval applications. Graph-based ranking algorithms represent a document as a graph. Once a document is represented as a graph, the similarity of that document to a query can be calculated in various ways, and the calculation provides a ranking of documents. This paper introduces an improved random-walk method to rank a document by considering the position of a term within the document and the information gain of that term within the whole document set. The experiments on various collection sets show that our approach achieves better recall and precision than other proposed methods.

Book ChapterDOI
01 Jan 2022
TL;DR: In this article, a Topic-Grained Text Representation-based Model for Document Retrieval (TGTR) is proposed to reduce the storage requirements by using novel topic-grained representations.
Abstract: Document retrieval enables users to find their required documents accurately and quickly. To satisfy the requirement of retrieval efficiency, prevalent deep neural methods adopt a representation-based matching paradigm, which saves online matching time by pre-storing document representations offline. However, the above paradigm consumes vast local storage space, especially when storing the document as word-grained representations. To tackle this, we present TGTR, a Topic-Grained Text Representation-based Model for document retrieval. Following the representation-based matching paradigm, TGTR stores the document representations offline to ensure retrieval efficiency, whereas it significantly reduces the storage requirements by using novel topic-grained representations rather than traditional word-grained. Experimental results demonstrate that compared to word-grained baselines, TGTR is consistently competitive with them on TREC CAR and MS MARCO in terms of retrieval accuracy, but it requires less than 1/10 of the storage space required by them. Moreover, TGTR overwhelmingly surpasses global-grained baselines in terms of retrieval accuracy.

Journal ArticleDOI
TL;DR: This paper explores a novel multi-layer contextual passage architecture that leverages text summarization extraction to generate passage-level evidence for the pre-selected document passage, bringing new possibilities for the long-document relevance task.
Abstract: Nowadays, pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) are becoming a basic building block in Information Retrieval tasks. Nevertheless, there are several limitations when applying BERT to the query-document matching task: (1) relevance assessments apply at the document level, and the number of tokens in a document often exceeds the maximum input length of BERT; (2) applying BERT to long documents consumes a great deal of memory and run time, owing to the computational cost of the interactions between tokens. This paper explores a novel multi-layer contextual passage architecture that leverages text summarization extraction to generate passage-level evidence for the pre-selected document passage, bringing new possibilities for the long-document relevance task. Experiments were conducted on two standard ad-hoc retrieval collections with different characteristics: the Text Retrieval Conference (TREC) 2004 Robust Track (Robust04) and ClueWeb09. Experimental results show that our approach significantly outperforms strong baselines in precision, including models based on the same BERT as well as state-of-the-art neural ranking models.

Journal ArticleDOI
TL;DR: Experimental results show that, compared with traditional retrieval methods, the proposed method improves the retrieval accuracy significantly, with the highest retrieval accuracy reaching 99%, and the retrieval time is significantly reduced, indicating that the proposed method effectively improves retrieval accuracy and timeliness.
Abstract: To overcome the accuracy and time-consumption problems of traditional document information retrieval methods, this paper designs an intelligent retrieval method for library document information based on hidden topic mining. First, an LDA model is used to mine the hidden topics of library document information; then, based on the mining results, the similarity of document information is calculated in an inference network model. Finally, a Bayesian model is constructed in the sample space to retrieve the library literature information under the maximum retrieval space coverage. Experimental results show that, compared with traditional retrieval methods, the proposed method improves the retrieval accuracy significantly, with the highest retrieval accuracy reaching 99%, and the retrieval time is significantly reduced, indicating that the proposed method effectively improves retrieval accuracy and timeliness.

Proceedings ArticleDOI
01 Aug 2022
TL;DR: In this article, the authors used NLP (Natural Language Processing) technology to provide a new system design method for multi-machine retrieval and accurate query of office documents, which is realized by uploading documents, document text analysis and processing, search engine construction, similar document recommendation, result preview and verification, retrieval result output and so on.
Abstract: Aiming at the problems of insufficient file name retrieval information, large internal retrieval overhead, slow speed and low sharing rate of documents in the existing stand-alone document retrieval software, this paper uses NLP (natural language processing) technology to provide a new system design method for multi-machine retrieval and accurate query of office documents. This method is realized by uploading documents, document text analysis and processing, search engine construction, similar document recommendation, result preview and verification, retrieval result output and so on. This paper will elaborate from five main aspects: the overall system flow, database design, text analysis and processing module design, text retrieval module design and client design.


DissertationDOI
10 Jun 2022
TL;DR: In this paper, the authors proposed a more efficient solution for the problem of document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern, and they achieved linear space and optimal query time.
Abstract: Strings play an important role in many areas of computer science. Searching for a pattern in a string or string collection is one of the most classic problems. Different variations of this problem, such as document retrieval, ranked document retrieval and dictionary matching, have been well studied. The enormous growth of the internet, large genomic projects, sensor networks and digital libraries necessitates not just efficient algorithms and data structures for general string indexing, but also indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is the document retrieval query for included and excluded/forbidden patterns, where the objective is to retrieve all the relevant documents that contain the included patterns and do not contain the excluded patterns. We continue the previous work done on this problem and propose a more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario in which the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear space (in words) solutions are unlikely to yield a solution better than O(√(n/occ)) per document reporting time, where n is the total length of the documents and occ is the number of output documents. Continuing this path, we introduce a new variation, namely document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern. We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on the PageRank relevance metric. This problem finds motivation from search applications. It also holds theoretical interest, as we show that the hardness of the forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation. We also propose succinct indexes for both these problems. Position-restricted pattern matching considers the scenario where only part of the text is searched. We propose a succinct index for this problem with efficient query time. An important application for this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating (resp. generic) words is to report all minimal (resp. maximal) extensions of a query pattern which are contained in at most (resp. at least) a given number of documents. These problems are motivated from applications in computational biology, text

Posted ContentDOI
11 Jul 2022
TL;DR: In this paper, a Topic-Grained Text Representation-based Model (TGTR) is proposed to reduce the storage requirements by using novel topic-grained representations rather than traditional word-grained ones.
Abstract: Document retrieval enables users to find their required documents accurately and quickly. To satisfy the requirement of retrieval efficiency, prevalent deep neural methods adopt a representation-based matching paradigm, which saves online matching time by pre-storing document representations offline. However, this paradigm consumes vast local storage space, especially when documents are stored as word-grained representations. To tackle this, we present TGTR, a Topic-Grained Text Representation-based Model for document retrieval. Following the representation-based matching paradigm, TGTR stores the document representations offline to ensure retrieval efficiency, while significantly reducing the storage requirements by using novel topic-grained representations rather than traditional word-grained ones. Experimental results demonstrate that TGTR is consistently competitive with word-grained baselines on TREC CAR and MS MARCO in terms of retrieval accuracy, while requiring less than 1/10 of their storage space. Moreover, TGTR overwhelmingly surpasses global-grained baselines in terms of retrieval accuracy.

Proceedings ArticleDOI
Yupeng Li1
25 Feb 2022
TL;DR: The design content of the campus document information online retrieval system includes system objectives, system composition and environment, system principles and functions, database design and types, and user interface design as mentioned in this paper.
Abstract: As the user-facing interactive interface of the library automation system, the OPAC (Online Public Access Catalog) is the most important window through which the library communicates with readers, connecting readers with collection resources and with resource services. The article chooses Windows NT as the network environment to realize online retrieval of campus literature information. The design content of the campus document information online retrieval system includes system objectives, system composition and environment, system principles and functions, database design and types, and user interface design. The study found that the application of this system has greatly improved the efficiency of document retrieval in academic libraries.

Journal ArticleDOI
01 Jun 2022-Entropy
TL;DR: Results prove the effectiveness of the scientific document retrieval and ranking method that synthesizes the information of mathematical expressions with related texts, and the ontology attributes of scientific documents are extracted to further sort the retrieval results.
Abstract: Traditional mathematical search models retrieve scientific documents only by mathematical expressions and their contexts and do not consider the ontological attributes of scientific documents, which results in gaps between the queries and the retrieval results. To solve this problem, a retrieval and ranking model is constructed that synthesizes the information of mathematical expressions with related texts, and the ontology attributes of scientific documents are extracted to further sort the retrieval results. First, the hesitant fuzzy set of mathematical expressions is constructed by using the characteristics of the hesitant fuzzy set to address the multi-attribute problem of mathematical expression matching; then, the similarity of the mathematical expression context sentence is calculated by using the BiLSTM two-way coding feature, and the retrieval result is obtained by synthesizing the similarity between the mathematical expression and the sentence; finally, considering the ontological attributes of scientific documents, the retrieval results are ranked to obtain the final search results. The MAP_10 value of the mathematical expression retrieval results on the Ntcir-Mathir-Wikipedia-Corpus dataset is 0.815, and the average value of the NDCG@10 of the scientific document ranking results is 0.9; these results prove the effectiveness of the scientific document retrieval and ranking method.
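
For readers unfamiliar with the two metrics reported in this abstract, the following Python sketch shows one common way to compute NDCG@k and average precision at k for a single ranked list. The abstract does not say which variants the authors used, so the choices here are assumptions: linear gains with a log2 discount for NDCG, an ideal ranking obtained by sorting the same list of gains, and average precision normalized by the number of relevant items found in the top k.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k=10):
    """DCG normalized by the DCG of the same gains sorted in ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def average_precision_at_k(relevant_flags, k=10):
    """Mean of precision values at each relevant rank within the top k."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevant_flags[:k], start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

MAP at a cutoff is the mean of `average_precision_at_k` over all queries, and the averaged NDCG@10 in the abstract would likewise be the mean of `ndcg_at_k` over the ranked result lists.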