
Showing papers on "Document retrieval published in 2017"


Posted Content
TL;DR: In this paper, a multi-layer recurrent neural network model trained to detect answer spans in Wikipedia paragraphs is combined with a search component based on bigram hashing and TF-IDF matching.
Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
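The search component described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: unigrams and bigrams are hashed into a fixed number of bins and documents are ranked by TF-IDF cosine similarity. The bin count and the exact weighting formula are illustrative assumptions.

```python
import math
import re
from collections import Counter

N_BINS = 2 ** 20  # hash n-grams into a fixed-size space (bin count is an illustrative choice)

def features(text):
    """Map text to hashed unigram + bigram counts."""
    toks = re.findall(r"\w+", text.lower())
    grams = toks + [a + " " + b for a, b in zip(toks, toks[1:])]
    return Counter(hash(g) % N_BINS for g in grams)

def tfidf_rank(query, docs):
    """Rank documents by TF-IDF cosine similarity to the query."""
    doc_feats = [features(d) for d in docs]
    df = Counter()                      # document frequency per hashed n-gram
    for f in doc_feats:
        df.update(f.keys())
    n = len(docs)

    def weight(f):
        return {k: (1 + math.log(v)) * math.log((n + 1) / (df[k] + 0.5))
                for k, v in f.items()}

    q = weight(features(query))
    scores = []
    for i, f in enumerate(doc_feats):
        w = weight(f)
        dot = sum(q[k] * w[k] for k in q.keys() & w.keys())
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        scores.append((dot / norm, i))
    return [i for _, i in sorted(scores, reverse=True)]
```

For example, `tfidf_rank("capital of France", corpus)` returns document indices ordered from most to least similar.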

1,100 citations


Proceedings ArticleDOI
31 Mar 2017
TL;DR: This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs, indicating that both modules are highly competitive with respect to existing counterparts.
Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.

685 citations


Proceedings ArticleDOI
06 Nov 2017
TL;DR: Zhang et al. propose a novel name disambiguation method which leverages only relational data in the form of anonymized graphs, using a novel representation learning model to embed each document in a low-dimensional vector space.
Abstract: In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, for many scenarios such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because, instead of using attributes pertaining to a real-life person, it leverages only relational data in the form of anonymized graphs. Methodologically, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than existing name disambiguation methods working in a similar setting.
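The clustering step of the method above can be sketched as follows. This is a toy illustration with made-up 2-D "embeddings" standing in for the paper's learned representations; the single-linkage variant and the distance threshold are assumptions.

```python
import math
import random

random.seed(0)
# Toy stand-ins for learned document embeddings (hypothetical data):
# papers by two distinct people who share the same name reference.
docs = [(random.gauss(0.0, 0.1), random.gauss(0.0, 0.1)) for _ in range(5)] + \
       [(random.gauss(3.0, 0.1), random.gauss(3.0, 0.1)) for _ in range(5)]

def agglomerative(points, threshold):
    """Single-linkage hierarchical agglomerative clustering: repeatedly merge
    the two closest clusters until the closest pair is farther apart than
    the threshold."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        d, a, b = min(
            (math.dist(points[i], points[j]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
            for i in clusters[a]
            for j in clusters[b]
        )
        if d > threshold:
            break
        clusters[a] += clusters.pop(b)
    return clusters

clusters = agglomerative(docs, threshold=1.0)
```

Cutting the dendrogram at a distance threshold, rather than asking for a fixed number of clusters, fits the setting where the number of distinct people behind a name is unknown.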

97 citations


Journal ArticleDOI
Joel L. Fagan1
02 Aug 2017
TL;DR: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented.
Abstract: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented. Problems related to this non-syntactic phrase construction method are discussed, and some possible solutions are proposed that make use of information about the syntactic structure of document and query texts.

74 citations


Proceedings ArticleDOI
02 Feb 2017
TL;DR: This paper addresses the task of document retrieval based on the degree of document relatedness to the meanings of a query by presenting a semantic-enabled language model that adopts a probabilistic reasoning model for calculating the conditional probability of a query concept given values assigned to document concepts.
Abstract: This paper addresses the task of document retrieval based on the degree of document relatedness to the meanings of a query by presenting a semantic-enabled language model. Our model relies on the use of semantic linking systems for forming a graph representation of documents and queries, where nodes represent concepts extracted from documents and edges represent semantic relatedness between concepts. Based on this graph, our model adopts a probabilistic reasoning model for calculating the conditional probability of a query concept given values assigned to document concepts. We also present an integration framework for interpolating other retrieval systems with the presented model. Our empirical experiments on a number of TREC collections show that semantic retrieval has a synergistic impact on the results obtained through state-of-the-art keyword-based approaches, and that the consideration of semantic information obtained from entity linking on queries and documents can complement and enhance the performance of other retrieval models.
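The interpolation framework mentioned above amounts to a weighted combination of scores. A minimal sketch, with an illustrative weight and hypothetical per-document scores:

```python
def interpolate(keyword_score, semantic_score, lam=0.6):
    """Linearly interpolate a keyword-based retrieval score with the
    semantic (concept-graph) score; lam is a tuning parameter."""
    return lam * keyword_score + (1 - lam) * semantic_score

# Hypothetical per-document (keyword, semantic) scores, each in [0, 1]:
docs = {"d1": (0.9, 0.3), "d2": (0.5, 0.8), "d3": (0.1, 0.1)}
ranked = sorted(docs, key=lambda d: interpolate(*docs[d]), reverse=True)
```

With `lam=0.6` the keyword score dominates, but a document strong on semantic relatedness (d2) can still overtake weak keyword matches.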

69 citations


Journal ArticleDOI
TL;DR: A new biomedical passage retrieval method based on Stanford CoreNLP sentence/passage length, a probabilistic information retrieval (IR) model and UMLS concepts, which significantly outperforms the current state-of-the-art methods.

52 citations


Journal ArticleDOI
TL;DR: This paper presents a new method for QE based on fuzzy logic that treats the top-retrieved documents as relevance feedback documents for mining additional QE terms, and it increases the precision and recall rates of information retrieval systems for document retrieval.
Abstract: Efficient query expansion (QE) term selection methods are important for improving the accuracy and efficiency of a retrieval system, as they remove irrelevant and redundant terms from the corpus of top-retrieved feedback documents for a user query. Each individual QE term selection method has its weaknesses and strengths. To overcome the weaknesses and exploit the strengths of individual methods, we use multiple term selection methods together. In this paper, we present a new method for QE based on fuzzy logic that treats the top-retrieved documents as relevance feedback documents for mining additional QE terms. Different QE term selection methods calculate degrees of importance for all unique terms in the top-retrieved document collection; these methods give different relevance scores to each term. The proposed method combines the different weights of each term using fuzzy rules to infer the weights of the additional query terms. The weights of the additional query terms and the weights of the original query terms then form the new query vector, which is used to retrieve documents. All experiments are performed on the TREC and FIRE benchmark datasets. The proposed QE method increases the precision and recall rates of information retrieval systems for document retrieval, achieving significantly higher average recall, average precision and F-measure on both datasets.
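A fuzzy-rule combination of term weights could look roughly like the sketch below. The membership functions, the rule set, and the defuzzification constants are illustrative assumptions, not the paper's actual rules.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_term_weight(s1, s2):
    """Combine two term-selection scores in [0, 1] with Mamdani-style rules:
    IF both HIGH THEN weight HIGH; IF mixed THEN MEDIUM; IF both LOW THEN LOW.
    Defuzzify as a weighted average of rule strengths (the output centroids
    0.2 / 0.5 / 0.9 are illustrative constants)."""
    low1, high1 = tri(s1, -0.5, 0.0, 0.6), tri(s1, 0.4, 1.0, 1.5)
    low2, high2 = tri(s2, -0.5, 0.0, 0.6), tri(s2, 0.4, 1.0, 1.5)
    r_high = min(high1, high2)
    r_med = max(min(high1, low2), min(low1, high2))
    r_low = min(low1, low2)
    total = (r_high + r_med + r_low) or 1.0
    return (0.9 * r_high + 0.5 * r_med + 0.2 * r_low) / total
```

A term scored highly by both selection methods receives a weight near 0.9, mixed evidence lands near the middle, and a term both methods consider unimportant is pushed toward 0.2.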

46 citations


Journal ArticleDOI
01 Jan 2017
TL;DR: Comparisons of 3 different topic representations in a document retrieval task show that textual labels are easier for users to interpret than are term lists and image labels, demonstrating that labeling methods are an effective alternative topic representation.
Abstract: Topic models have been shown to be a useful way of representing the content of large document collections, for example via visualization interfaces such as topic browsers. These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is a term list, that is, the top-n words with the highest conditional probability within the topic. Other topic representations, such as textual and image labels, have also been proposed. However, there has been no comparison of these alternative representations. In this article, we compare three different topic representations in a document retrieval task. Participants were asked to retrieve relevant documents based on predefined queries within a fixed time limit, with topics presented in one of the following modalities: (a) lists of terms, (b) textual phrase labels, and (c) image labels. Results show that textual labels are easier for users to interpret than term lists and image labels. Moreover, the precision of retrieved documents for textual and image labels is comparable to the precision achieved by representing topics using term lists, demonstrating that labeling methods are an effective alternative topic representation.
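The standard term-list representation described above is straightforward to compute from a topic-word distribution. A small sketch, using a hypothetical distribution:

```python
def top_terms(topic_word_probs, n=5):
    """Represent a topic by its n highest-probability terms."""
    return [w for w, _ in sorted(topic_word_probs.items(),
                                 key=lambda kv: kv[1], reverse=True)[:n]]

# Hypothetical topic-word distribution from a trained topic model:
topic = {"match": 0.08, "team": 0.07, "season": 0.05,
         "league": 0.04, "goal": 0.03, "the": 0.01}
```

Here `top_terms(topic, 3)` yields `["match", "team", "season"]` — the term list a topic browser would show for this topic.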

38 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed GKFCM-based system outperforms existing methods on the document retrieval task.
Abstract: A clustering-based document retrieval system finds similar documents for a given user query. This study explores the scope of kernel fuzzy c-means (KFCM) combined with a genetic algorithm for document retrieval. Initially, a genetic-algorithm-based kernel fuzzy c-means algorithm (GKFCM) is proposed to cluster the documents in the library. For each cluster, an index is created containing the common significant keywords of that cluster's documents. Once the user enters keywords as input to the system, it processes them with the WORDNET ontology to obtain neighbourhood keywords and related synset keywords. Lastly, the documents inside the clusters with the maximum matching scores are returned first as the related documents for the query keywords. Experimental results show that the proposed GKFCM-based system outperforms existing methods.

37 citations


Proceedings ArticleDOI
07 Mar 2017
TL;DR: The utility of visual re-ranking, an interactive visualization technique for multi-aspect information retrieval, is demonstrated, and the findings can help in designing search user interfaces that support multi-aspect search.
Abstract: We present visual re-ranking, an interactive visualization technique for multi-aspect information retrieval. In multi-aspect search, the information need of the user consists of more than one aspect or query simultaneously. While visualization and interactive search user interface techniques for improving user interpretation of search results have been proposed, the current research lacks understanding on how useful these are for the user: whether they lead to quantifiable benefits in perceiving the result space and allow faster, and more precise retrieval. Our technique visualizes relevance and document density on a two-dimensional map with respect to the query phrases. Pointing to a location on the map specifies a weight distribution of the relevance to each of the query phrases, according to which search results are re-ranked. User experiments compared our technique to a uni-dimensional search interface with typed query and ranked result list, in perception and retrieval tasks. Visual re-ranking yielded improved accuracy in perception, higher precision in retrieval and overall faster task execution. Our findings demonstrate the utility of visual re-ranking, and can help designing search user interfaces that support multi-aspect search.
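The re-ranking mechanism can be sketched as follows: a pointer position on the map induces a weight per query phrase, and results are re-sorted by the weighted sum of per-aspect relevance scores. The inverse-distance weighting to per-phrase anchor points, and the toy data, are illustrative assumptions rather than the paper's exact formulation.

```python
import math

def aspect_weights(pointer, anchors):
    """Turn a 2-D map position into a normalized weight per query phrase,
    using inverse-distance weighting to each phrase's anchor point."""
    inv = [1.0 / (0.01 + math.dist(pointer, a)) for a in anchors]
    total = sum(inv)
    return [w / total for w in inv]

def rerank(results, weights):
    """results maps doc id -> per-aspect relevance scores; sort documents
    by the weighted sum of those scores."""
    return sorted(results,
                  key=lambda d: sum(w * s for w, s in zip(weights, results[d])),
                  reverse=True)

anchors = [(0.0, 0.0), (1.0, 0.0)]             # one anchor per query phrase
results = {"d1": (0.9, 0.1), "d2": (0.1, 0.9)}
```

Pointing at the first anchor ranks d1 first; pointing at the second anchor ranks d2 first, which is exactly the interactive behaviour the interface exposes.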

33 citations


Proceedings ArticleDOI
02 Feb 2017
TL;DR: This tutorial is the first to disseminate the progress in this emerging field of KGs to researchers and practitioners and is available online at http://github.com/laura-dietz/tutorial-utilizing-kg.
Abstract: The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The increasing depth and breadth of content in KGs makes them not only rich sources of structured knowledge by themselves but also valuable resources for search systems. A surge of recent developments in entity linking and retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications, making this an ideal time to pause, report current findings to the community, summarize successful approaches, and solicit new ideas. This tutorial is the first to disseminate the progress in this emerging field to researchers and practitioners. All tutorial resources are available online at http://github.com/laura-dietz/tutorial-utilizing-kg.

Journal ArticleDOI
TL;DR: A semantically enhanced document retrieval system that describes each retrieved document with an ontological multi-grained network of the extracted conceptualization, together with a SKOS-based ontology created ad hoc for a document corpus, enabling the exploration of the concepts at different granularity levels.

Proceedings Article
01 Jan 2017
TL;DR: It is suggested that searching specific sections may improve precision under certain conditions, often with a loss of recall; chart notes incorporate structure that may facilitate accurate retrieval.
Abstract: Objective: Secondary use of electronic health record (EHR) data is enabled by accurate and complete retrieval of the relevant patient cohort, which requires searching both structured and unstructured data. Clinical text poses difficulties to searching, although chart notes incorporate structure that may facilitate accurate retrieval. Methods: We developed rules identifying clinical document sections, which can be indexed in search engines that allow faceted searches, such as Lucene or Essie, an NLM search engine. We developed 22 clinical cohorts and two queries for each cohort, one utilizing section headings and the other searching the whole document. We manually evaluated a subset of retrieved documents to compare query performance. Results: Querying by section had lower recall than whole-document queries (0.83 vs 0.95), higher precision (0.73 vs 0.54), and higher F1 (0.78 vs 0.69). Conclusion: This evaluation suggests that searching specific sections may improve precision under certain conditions and often with loss of recall.
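The reported figures can be checked directly, since F1 is the harmonic mean of precision and recall:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported above:
section_f1 = round(f1(0.73, 0.83), 2)     # section queries  -> 0.78
whole_doc_f1 = round(f1(0.54, 0.95), 2)   # whole document   -> 0.69
```

The arithmetic confirms the paper's F1 values: the section queries trade 0.12 of recall for 0.19 of precision, which nets out to a higher F1.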

Journal ArticleDOI
TL;DR: This paper proposes MathIRs, comprising three important modules and a substitution-tree-based mechanism for indexing mathematical expressions, presents experimental results for similarity search, and argues that the proposed MathIRs will ease the task of scientific document retrieval.
Abstract: Effective retrieval of mathematical content from a vast corpus of scientific documents demands enhancements to conventional indexing and searching mechanisms. The indexing mechanism and the choice of semantic similarity measures guide the results of a Math Information Retrieval system (MathIRs) to perfection. Tokenization and formula unification are among the distinguishing features of the indexing mechanism used in MathIRs, which facilitate sub-formula and similarity search. Besides, the scientific documents and user queries in MathIRs contain both math and text content, and to match these contents we require three important modules: Text-Text Similarity (TS), Math-Math Similarity (MS) and Text-Math Similarity (TMS). In this paper we propose MathIRs comprising these important modules and a substitution-tree-based mechanism for indexing mathematical expressions. We also present experimental results for similarity search and argue that the proposed MathIRs will ease the task of scientific document retrieval.

Patent
15 Mar 2017
TL;DR: In this paper, a judgment document retrieval method based on semantic matching, and a server, are proposed, in which the matching judgment document can be found by directly describing legal problems or cases in natural language; this greatly lowers the barrier to using the document retrieval server and improves retrieval efficiency.
Abstract: The invention provides a judgment document retrieval method based on semantic matching, and a server. When a case is retrieved, the user does not need to input words that exactly match keywords in a judgment document; the matching judgment document can be found by directly describing the legal problem or case in natural language. This greatly lowers the barrier to using the document retrieval server and improves retrieval efficiency.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: The current study intends to develop an intelligent system that gives precise answers to user queries in natural language; building it requires tokenization, parsing, part-of-speech tagging, question classification, query construction, sentence understanding, document retrieval, keyword ranking, classification, answer extraction and validation.
Abstract: In the contemporary world, lifestyles and interactions have changed across all application domains due to advances in internet technology. In light of the information explosion, this work tries to build an intelligent question answering system in which a user may communicate with a machine in natural language and get a response to a question, using strategies such as Natural Language Processing (NLP), Artificial Intelligence, Information Retrieval and Human-Computer Interaction. Natural Language Processing is a technique whereby a computer behaves like a human, helping people talk to the computer in their own language rather than in computer commands. The skills needed to build an intelligent answering system include tokenization, parsing, part-of-speech tagging, question classification, query construction, sentence understanding, document retrieval, keyword ranking, classification, answer extraction and validation. The current study intends to develop an intelligent system that gives precise answers to user queries in natural language.

Posted Content
TL;DR: The proposed name disambiguation method is non-intrusive of privacy because instead of using attributes pertaining to a real-life person, the method leverages only relational data in the form of anonymized graphs.
Abstract: In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, for many scenarios such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because, instead of using attributes pertaining to a real-life person, it leverages only relational data in the form of anonymized graphs. Methodologically, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than existing name disambiguation methods working in a similar setting.

Journal ArticleDOI
TL;DR: Five criteria: used models, tagging purpose, tagging right, object type, and used dataset, are introduced for evaluating tag-based information retrieval methods as a new categorical framework engaging the graphical models as well as the two-way classical methods.
Abstract: This paper aims to provide a comprehensive survey of tag-based information retrieval that covers three areas: tag-based document retrieval, tag-based image retrieval, and tag-based music information retrieval. First of all, seven representative graphical models associated with tag contents are reviewed and evaluated in terms of effectiveness in achieving their goals. The models are explored in depth based on appropriate plate notations for the tag-based document retrieval. Second, well-established review criteria for two-way classical methods, tag refinement and tag recommendation, are utilized for tag-based image retrieval. In particular, tag refinement methods are analyzed by means of the experimental results measured on different datasets. Last, popular tagging methods in the area of music information retrieval are reviewed for the tag-based music information retrieval. We introduce five criteria: used models, tagging purpose, tagging right, object type, and used dataset, for evaluating tag-based information retrieval methods as a new categorical framework engaging the graphical models as well as the two-way classical methods.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper describes the participation of the USTB_PRIR team in the 2017 BioASQ 5B question answering challenge, including the document retrieval, snippet retrieval and concept retrieval tasks, and introduces different multimodal query processing strategies to enrich query terms and assign different weights to them.
Abstract: This paper describes the participation of USTB_PRIR team in the 2017 BioASQ 5B on question answering, including document retrieval, snippet retrieval, and concept retrieval task. We introduce different multimodal query processing strategies to enrich query terms and assign different weights to them. Specifically, sequential dependence model (SDM), pseudo-relevance feedback (PRF), fielded sequential dependence model (FSDM) and Divergence from Randomness model (DFRM) are respectively performed on different fields of PubMed articles, sentences extracted from relevant articles, the five terminologies or ontologies (MeSH, GO, Jochem, Uniprot and DO) to achieve better search performances. Preliminary results show that our systems outperform others in the document and snippet retrieval task in the first two batches.

Journal ArticleDOI
TL;DR: It is proved that, indeed, the PRP can be sub-optimal in adversarial retrieval settings by presenting a novel game theoretic analysis of the adversarial setting and it is shown that in some cases, introducing randomization into the document ranking function yields an overall user utility that transcends that of applying thePRP.
Abstract: The main goal of search engines is ad hoc retrieval: ranking documents in a corpus by their relevance to the information need expressed by a query. The Probability Ranking Principle (PRP) --- ranking the documents by their relevance probabilities --- is the theoretical foundation of most existing ad hoc document retrieval methods. A key observation that motivates our work is that the PRP does not account for potential post-ranking effects; specifically, changes to documents that result from a given ranking. Yet, in adversarial retrieval settings such as the Web, authors may consistently try to promote their documents in rankings by changing them. We prove that, indeed, the PRP can be sub-optimal in adversarial retrieval settings. We do so by presenting a novel game theoretic analysis of the adversarial setting. The analysis is performed for different types of documents (single-topic and multi-topic) and is based on different assumptions about the writing qualities of documents' authors. We show that in some cases, introducing randomization into the document ranking function yields an overall user utility that transcends that of applying the PRP.

Journal ArticleDOI
TL;DR: Two novel ideas are developed, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing, top-k document retrieval, and document counting, and show that a classical data structure supporting the latter query becomes highly compressible on repetitive data.
Abstract: Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
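For intuition, the three query types studied above can be answered by a naive, uncompressed index over whitespace tokens. This sketch only illustrates the query semantics; the paper's indexes answer the same queries for arbitrary substrings, in compressed space.

```python
from collections import Counter, defaultdict

class DocIndex:
    """Naive baseline for document listing, top-k retrieval by frequency,
    and document counting, built over whitespace tokens."""

    def __init__(self, docs):
        self.postings = defaultdict(Counter)   # term -> {doc_id: term frequency}
        for doc_id, text in enumerate(docs):
            for term in text.split():
                self.postings[term][doc_id] += 1

    def document_listing(self, term):
        """All documents where the term appears."""
        return sorted(self.postings[term])

    def top_k(self, term, k):
        """The k documents where the term appears most often."""
        return [doc for doc, _ in self.postings[term].most_common(k)]

    def document_count(self, term):
        """How many documents contain the term."""
        return len(self.postings[term])

idx = DocIndex(["a b a", "b c", "a a a"])
```

On this toy collection, `idx.document_listing("a")` returns `[0, 2]`, `idx.top_k("a", 1)` returns `[2]`, and `idx.document_count("b")` returns `2`.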

Journal ArticleDOI
TL;DR: Using more than 50 retrieval experiments from the literature as examples, RT is applied to explain the frequency distributions of documents on relevance scales with three or more points to reinforce the paper's more general argument that RT clarifies the concept of relevance in the dialogues of retrieval evaluation.
Abstract: This article extends relevance theory (RT) from linguistic pragmatics into information retrieval. Using more than 50 retrieval experiments from the literature as examples, it applies RT to explain the frequency distributions of documents on relevance scales with three or more points. The scale points, which judges in experiments must consider in addition to queries and documents, are communications from researchers. In RT, the relevance of a communication varies directly with its cognitive effects and inversely with the effort of processing it. Researchers define and/or label the scale points to measure the cognitive effects of documents on judges. However, they apparently assume that all scale points as presented are equally easy for judges to process. Yet the notion that points cost variable effort explains fairly well the frequency distributions of judgments across them. By hypothesis, points that cost more effort are chosen by judges less frequently. Effort varies with the vagueness or strictness of scale-point labels and definitions. It is shown that vague scales tend to produce U- or V-shaped distributions, while strict scales tend to produce right-skewed distributions. These results reinforce the paper's more general argument that RT clarifies the concept of relevance in the dialogues of retrieval evaluation.

Proceedings ArticleDOI
06 Nov 2017
TL;DR: This work presents a convolutional neural model aimed at improving clinical notes representation, making them suitable for document retrieval, and is designed to predict, for each clinical note term, its importance in relevant documents.
Abstract: The rapid increase of medical literature poses a significant challenge for physicians, who have repeatedly reported to struggle to keep up to date with developments in research. This gap is one of the main challenges in integrating recent advances in clinical research with day-to-day practice. Thus, the need for clinical decision support (CDS) search systems that can retrieve highly relevant medical literature given a clinical note describing a patient has emerged. However, clinical notes are inherently noisy, thus not being fit to be used as queries as-is. In this work, we present a convolutional neural model aimed at improving clinical notes representation, making them suitable for document retrieval. The system is designed to predict, for each clinical note term, its importance in relevant documents. The approach was evaluated on the 2016 TREC CDS dataset, where it achieved a 37% improvement in infNDCG over state-of-the-art query reduction methods and a 27% improvement over the best known method for the task.

Journal ArticleDOI
TL;DR: The results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
Highlights: First large-scale and fine-grained evaluation of query-based sampling techniques. Learned keyword queries perform substantially better than queries derived from tuples. Focusing on, and processing exhaustively, effective queries leads to high efficiency. Focusing on, and processing in rounds, less-effective queries favors quality. Filtering underperforming queries favors sampling efficiency but hurts quality.
Abstract: Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than is possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text collections, whose contents are not crawlable and are only available via querying. Important steps for efficient information extraction over deep-web text collections (e.g., selecting the collections on which to focus the extraction effort, based on their contents; or learning which documents within these collections, and in which order, to process, based on their words and phrases) require having a representative document sample from each collection. These document samples have to be collected by querying the deep-web text collections, an expensive process that renders impractical the existing sampling approaches developed for other data scenarios. In this paper, we systematically study the space of query-based document sampling techniques for information extraction over the deep web. Specifically, we consider (i) alternative query execution schedules, which vary on how they account for the query effectiveness, and (ii) alternative document retrieval and processing schedules, which vary on how they distribute the extraction effort over documents.
We report the results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web. Our results show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
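The query execution schedules studied above trade off query effectiveness against coverage. As a much-simplified illustration (not the paper's algorithm), the sketch below greedily prefers the query with the best observed yield of new documents per call; `issue_query` is a hypothetical search interface returning matching document ids.

```python
def sample_collection(issue_query, queries, budget, k=10, max_calls=100):
    """Greedy, effectiveness-driven query scheduler: repeatedly issue the
    query with the best observed yield (new documents per call), up to a
    document budget or a total query-call budget."""
    stats = {q: [0, 0] for q in queries}   # query -> [new_docs_seen, calls]
    sample, calls = set(), 0
    while len(sample) < budget and calls < max_calls:
        # untried queries first, then the best new-documents-per-call ratio
        q = max(queries, key=lambda q: float("inf") if stats[q][1] == 0
                else stats[q][0] / stats[q][1])
        new = set(issue_query(q, k)) - sample
        stats[q][0] += len(new)
        stats[q][1] += 1
        sample |= new
        calls += 1
    return sample
```

A round-based schedule, by contrast, would cycle through all queries once per round instead of always exploiting the current best one.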

Journal ArticleDOI
TL;DR: Across four web test collections, it is found that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger.
Abstract: This paper explores the performance of top-k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered: no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to "keep up" with the processor. Our results highlight the importance of "architecture affinity" when designing high-performance search engines.
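For reference, variable byte (VByte) encoding, one of the schemes compared above, packs each integer (typically a docid gap) into 7-bit payload chunks, with the high bit flagging the final byte of each number. A minimal sketch, following that common convention:

```python
def vbyte_encode(gaps):
    """Variable-byte encode non-negative integers: 7 payload bits per byte,
    high bit set on the terminating (low-order) byte of each number."""
    out = bytearray()
    for n in gaps:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80              # flag the last byte of this number
        out.extend(reversed(chunk))   # emit high-order bytes first
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                  # termination flag: number complete
            nums.append(n)
            n = 0
    return nums
```

Decoding is branch-heavy per byte, which is one reason word-aligned schemes such as Simple-8b and QMX often decode faster in practice.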

Journal ArticleDOI
TL;DR: To compute candidate groups of k relevant documents efficiently, this work proposes dynamic diverse retrieval algorithms, specialized for patent searching, that enable effective dynamic interactive retrieval.
Abstract: The high-recall retrieval problem, which aims at finding the full set of relevant documents in a huge result set by effective mining techniques, is particularly useful for patent information retrieval, legal document retrieval, medical document retrieval, market information retrieval, and literature review. The existing high-recall retrieval methods, however, remain far from satisfactory at retrieving all relevant documents, owing not only to strict high-recall and precision thresholds but also to the need to minimize the number of reviewed documents. To address this gap, we generalize the problem to a novel high-recall retrieval model, which can be represented as finding all needles in a giant haystack. To compute candidate groups consisting of k relevant documents efficiently, we propose dynamic diverse retrieval algorithms specialized for patent searching, with which effective dynamic interactive retrieval can be achieved. Across various types of datasets, the dynamic ranking method shows considerable improvements with respect to time and cost over the conventional static ranking approaches.
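The dynamic interactive retrieval the abstract alludes to can be pictured as a review-and-re-rank loop. The sketch below is a generic illustration using term overlap, not the authors' algorithm; `is_relevant` stands in for the human reviewer, and the stopping rule is a deliberately naive placeholder.

```python
from collections import Counter

def high_recall_review(docs, is_relevant, batch=2):
    """Iterative review loop: score unreviewed documents by term overlap
    with the relevant documents found so far, judge the top batch, re-rank,
    and stop after a batch yields no new relevant documents."""
    unreviewed, relevant_terms, found = set(docs), Counter(), []
    while unreviewed:
        ranked = sorted(unreviewed, key=lambda d: (
            -sum(relevant_terms[t] for t in docs[d]), d))
        new_hits = 0
        for d in ranked[:batch]:
            unreviewed.discard(d)
            if is_relevant(d):          # human judgment in a real system
                found.append(d)
                relevant_terms.update(docs[d])
                new_hits += 1
        if new_hits == 0 and found:
            break                       # naive stopping rule
    return found
```

The dynamic aspect is that the ranking changes after every reviewed batch, rather than being fixed up front as in static ranking.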

Journal ArticleDOI
TL;DR: An overview of text mining techniques that assist research by identifying biomedical entities and the relations between them in text, and an application that integrates PubMed document retrieval, concept and relation identification, and visualization are discussed, enabling a user to explore concepts and relations from within a set of retrieved citations.
Abstract: Informatics methodologies exploit computer-assisted techniques to help biomedical researchers manage large amounts of information. In this paper, we focus on the biomedical research literature (MEDLINE). We first provide an overview of some text mining techniques that offer assistance in research by identifying biomedical entities (e.g., genes, substances, and diseases) and relations between them in text. We then discuss Semantic MEDLINE, an application that integrates PubMed document retrieval, concept and relation identification, and visualization, thus enabling a user to explore concepts and relations from within a set of retrieved citations. Semantic MEDLINE provides a roadmap through content and helps users discern patterns in large numbers of retrieved citations. We illustrate its use with an informatics method we call "discovery browsing," which provides a principled way of navigating through selected aspects of some biomedical research area. The method supports an iterative process that accommodates learning and hypothesis formation, in which a user is provided with high-level connections before delving into details. As a use case, we examine current developments in basic research on mechanisms of Alzheimer's disease. Out of the nearly 90,000 citations returned by the PubMed query "Alzheimer's disease," discovery browsing led us to 73 citations on sortilin and that disorder. We provide a synopsis of the basic research reported in 15 of these. There is widespread consensus among researchers working with a range of animal models and human cells that increased sortilin expression and decreased receptor expression are associated with amyloid beta and/or amyloid precursor protein.

Journal ArticleDOI
TL;DR: This paper presents a method for determining a user profile in a document retrieval system, proposing an ontology-based profile that makes it possible to process semantic relations between users' queries.
Abstract: Information overload has become a very important problem in the information retrieval domain. Even if a user knows where to look for interesting information, he may have difficulty formulating his information needs precisely. A solution to this problem is personalization and recommendation systems: they observe user activities and analyze them to discover important preferences. Based on this information, the system can improve the effectiveness of the results. In this paper we present a method for determining a user profile in a document retrieval system. We propose an ontology-based profile; such a structure makes it possible to process semantic relations between users' queries. We focus on methods for adapting the profile, because only an up-to-date profile can help the user obtain results that correspond to his information needs. We present a set of postulates for adaptation methods. Performed experimental evaluations of the developed methods are promising.
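Keeping a profile up to date typically combines decaying old interests and reinforcing terms from recent queries. The sketch below is a much-simplified term-weight stand-in for the paper's ontology-based profile, with illustrative `boost` and `decay` parameters that are assumptions, not the authors' postulates.

```python
def adapt_profile(profile, query_terms, boost=1.0, decay=0.9):
    """Profile-adaptation sketch: decay all stored interest weights, then
    boost the terms of the latest query, so the profile tracks the user's
    current information needs rather than stale ones."""
    for t in profile:
        profile[t] *= decay            # older interests fade over time
    for t in query_terms:
        profile[t] = profile.get(t, 0.0) + boost
    return profile
```

After repeated sessions, terms the user stops querying decay toward zero while recurring terms dominate the profile.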

Proceedings ArticleDOI
01 Jan 2017
TL;DR: The evaluation results suggest that researchers are on the right track toward shape descriptors that can capture the main characteristics of 3D models; however, more tests still need to be made, since this is the first time non-rigid signatures have been compared for point-cloud shape retrieval.
Abstract: In this paper, we present the results of the SHREC'17 Track: Point-Cloud Shape Retrieval of Non-Rigid Toys. The aim of this track is to create a fair benchmark to evaluate the performance of methods on the non-rigid point-cloud shape retrieval problem. The database used in this task contains 100 3D point-cloud models classified into 10 different categories. All point clouds were generated by scanning each of the models in their final poses using a 3D scanner, i.e., all models were articulated before being scanned. The retrieval performance is evaluated using seven commonly-used statistics (PR-plot, NN, FT, ST, E-measure, DCG, mAP). In total, 8 groups with 31 submissions took part in this contest. The evaluation results suggest that researchers are on the right track toward shape descriptors that can capture the main characteristics of 3D models; however, more tests still need to be made, since this is the first time we compare non-rigid signatures for point-cloud shape retrieval.
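Two of the statistics listed above, DCG and mAP, are straightforward to compute from a ranked result list. Minimal reference implementations (standard textbook definitions, not the track's exact evaluation code):

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain over a ranked list of graded relevances."""
    rel = relevances[:k] if k else relevances
    return sum(r / math.log2(i + 2) for i, r in enumerate(rel))

def average_precision(ranked, relevant):
    """AP for one query: mean of precision@i taken at each relevant hit.
    mAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```

NN, FT, and ST can be derived similarly from the ranked list and the size of the query's category.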

Proceedings ArticleDOI
07 Aug 2017
TL;DR: This work proposes a method to further improve document models by utilizing external collections as part of the document expansion process, based on relevance modeling, which improves ad-hoc document retrieval effectiveness on a variety of corpus types.
Abstract: Document expansion has been shown to improve the effectiveness of information retrieval systems by augmenting documents' term probability estimates with those of similar documents, producing higher quality document representations. We propose a method to further improve document models by utilizing external collections as part of the document expansion process. Our approach is based on relevance modeling, a popular form of pseudo-relevance feedback; however, where relevance modeling is concerned with query expansion, we are concerned with document expansion. Our experiments demonstrate that the proposed model improves ad-hoc document retrieval effectiveness on a variety of corpus types, with a particular benefit on more heterogeneous collections of documents.
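A simplified sketch in the spirit of the document expansion described above: mix a document's term distribution with the similarity-weighted distributions of its nearest neighbors from an external collection. The mixing weight `alpha` and the `(terms, similarity)` neighbor format are illustrative assumptions, not the paper's exact estimator.

```python
from collections import Counter

def expand_document(doc_terms, neighbors, alpha=0.5):
    """Document-expansion sketch: interpolate a document's language model
    with a similarity-weighted mixture of neighbor language models."""
    def lm(terms):
        c = Counter(terms)
        n = sum(c.values())
        return {t: f / n for t, f in c.items()}

    p_doc = lm(doc_terms)
    # neighbors: list of (terms, similarity_weight) pairs from the
    # external collection
    total_w = sum(w for _, w in neighbors) or 1.0
    p_exp = Counter()
    for terms, w in neighbors:
        for t, p in lm(terms).items():
            p_exp[t] += (w / total_w) * p
    vocab = set(p_doc) | set(p_exp)
    return {t: alpha * p_doc.get(t, 0.0) + (1 - alpha) * p_exp.get(t, 0.0)
            for t in vocab}
```

The expanded model assigns nonzero probability to terms the document never mentions but its neighbors do, which is what improves term probability estimates for short or sparse documents.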