
Showing papers on "Document retrieval published in 2017"


Posted Content
TL;DR: In this paper, a multi-layer recurrent neural network model trained to detect answer spans in Wikipedia paragraphs is combined with a search component based on bigram hashing and TF-IDF matching.
Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
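The search component described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: unigrams and bigrams are hashed into a fixed number of bins and documents are ranked by TF-IDF cosine similarity. The bin count and the exact weighting formula are illustrative assumptions.

```python
import math
import re
from collections import Counter

N_BINS = 2 ** 20  # hash n-grams into a fixed-size space (bin count is an illustrative choice)

def features(text):
    """Map text to hashed unigram + bigram counts."""
    toks = re.findall(r"\w+", text.lower())
    grams = toks + [a + " " + b for a, b in zip(toks, toks[1:])]
    return Counter(hash(g) % N_BINS for g in grams)

def tfidf_rank(query, docs):
    """Rank documents by TF-IDF cosine similarity to the query."""
    doc_feats = [features(d) for d in docs]
    df = Counter()                      # document frequency per hashed n-gram
    for f in doc_feats:
        df.update(f.keys())
    n = len(docs)

    def weight(f):
        return {k: (1 + math.log(v)) * math.log((n + 1) / (df[k] + 0.5))
                for k, v in f.items()}

    q = weight(features(query))
    scores = []
    for i, f in enumerate(doc_feats):
        w = weight(f)
        dot = sum(q[k] * w[k] for k in q.keys() & w.keys())
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        scores.append((dot / norm, i))
    return [i for _, i in sorted(scores, reverse=True)]
```

For example, `tfidf_rank("capital of France", corpus)` returns document indices ordered from most to least similar.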

1,100 citations


Proceedings ArticleDOI
31 Mar 2017
TL;DR: This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs, indicating that both modules are highly competitive with respect to existing counterparts.
Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.

685 citations


Proceedings ArticleDOI
06 Nov 2017
TL;DR: Zhang et al. propose a novel name disambiguation method which leverages only relational data in the form of anonymized graphs, using a novel representation learning model to embed each document in a low-dimensional vector space.
Abstract: In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, for many scenarios such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because, instead of using attributes pertaining to a real-life person, it leverages only relational data in the form of anonymized graphs. Methodologically, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than existing name disambiguation methods working in a similar setting.
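The clustering step of the method above can be sketched as follows. This is a toy illustration with made-up 2-D "embeddings" standing in for the paper's learned representations; the single-linkage variant and the distance threshold are assumptions.

```python
import math
import random

random.seed(0)
# Toy stand-ins for learned document embeddings (hypothetical data):
# papers by two distinct people who share the same name reference.
docs = [(random.gauss(0.0, 0.1), random.gauss(0.0, 0.1)) for _ in range(5)] + \
       [(random.gauss(3.0, 0.1), random.gauss(3.0, 0.1)) for _ in range(5)]

def agglomerative(points, threshold):
    """Single-linkage hierarchical agglomerative clustering: repeatedly merge
    the two closest clusters until the closest pair is farther apart than
    the threshold."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        d, a, b = min(
            (math.dist(points[i], points[j]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
            for i in clusters[a]
            for j in clusters[b]
        )
        if d > threshold:
            break
        clusters[a] += clusters.pop(b)
    return clusters

clusters = agglomerative(docs, threshold=1.0)
```

Cutting the dendrogram at a distance threshold, rather than asking for a fixed number of clusters, fits the setting where the number of distinct people behind a name is unknown.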

97 citations


Journal ArticleDOI
Joel L. Fagan1
02 Aug 2017
TL;DR: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented.
Abstract: An automatic phrase indexing method based on the term discrimination model is described, and the results of retrieval experiments on five document collections are presented. Problems related to this non-syntactic phrase construction method are discussed, and some possible solutions are proposed that make use of information about the syntactic structure of document and query texts.

74 citations


Proceedings ArticleDOI
02 Feb 2017
TL;DR: This paper addresses the task of document retrieval based on the degree of document relatedness to the meanings of a query by presenting a semantic-enabled language model that adopts a probabilistic reasoning model for calculating the conditional probability of a query concept given values assigned to document concepts.
Abstract: This paper addresses the task of document retrieval based on the degree of document relatedness to the meanings of a query by presenting a semantic-enabled language model. Our model relies on the use of semantic linking systems for forming a graph representation of documents and queries, where nodes represent concepts extracted from documents and edges represent semantic relatedness between concepts. Based on this graph, our model adopts a probabilistic reasoning model for calculating the conditional probability of a query concept given values assigned to document concepts. We also present an integration framework for interpolating other retrieval systems with the presented model. Our empirical experiments on a number of TREC collections show that semantic retrieval has a synergistic impact on the results obtained through state-of-the-art keyword-based approaches, and that the consideration of semantic information obtained from entity linking on queries and documents can complement and enhance the performance of other retrieval models.
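The interpolation framework mentioned above amounts to a weighted combination of scores. A minimal sketch, with an illustrative weight and hypothetical per-document scores:

```python
def interpolate(keyword_score, semantic_score, lam=0.6):
    """Linearly interpolate a keyword-based retrieval score with the
    semantic (concept-graph) score; lam is a tuning parameter."""
    return lam * keyword_score + (1 - lam) * semantic_score

# Hypothetical per-document (keyword, semantic) scores, each in [0, 1]:
docs = {"d1": (0.9, 0.3), "d2": (0.5, 0.8), "d3": (0.1, 0.1)}
ranked = sorted(docs, key=lambda d: interpolate(*docs[d]), reverse=True)
```

With `lam=0.6` the keyword score dominates, but a document strong on semantic relatedness (d2) can still overtake weak keyword matches.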

69 citations


Journal ArticleDOI
TL;DR: A new biomedical passage retrieval method based on Stanford CoreNLP sentence/passage length, a probabilistic information retrieval (IR) model and UMLS concepts, which significantly outperforms the current state-of-the-art methods.

52 citations


Journal ArticleDOI
TL;DR: This paper presents a new method for QE based on fuzzy logic that treats the top-retrieved documents as relevance feedback documents for mining additional QE terms, and it increases the precision and recall rates of information retrieval systems for document retrieval.
Abstract: Efficient query expansion (QE) term selection methods are important for improving the accuracy and efficiency of a retrieval system, as they remove irrelevant and redundant terms from the corpus of top-retrieved feedback documents for a user query. Each individual QE term selection method has its weaknesses and strengths. To overcome the weaknesses and exploit the strengths of individual methods, we use multiple term selection methods together. In this paper, we present a new method for QE based on fuzzy logic that treats the top-retrieved documents as relevance feedback documents for mining additional QE terms. Different QE term selection methods calculate degrees of importance for all unique terms in the top-retrieved document collection; these methods give different relevance scores to each term. The proposed method combines the different weights of each term using fuzzy rules to infer the weights of the additional query terms. The weights of the additional query terms and the weights of the original query terms then form the new query vector, which is used to retrieve documents. All experiments are performed on the TREC and FIRE benchmark datasets. The proposed QE method increases the precision and recall rates of information retrieval systems for document retrieval, achieving significantly higher average recall, average precision and F-measure on both datasets.
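A fuzzy-rule combination of term weights could look roughly like the sketch below. The membership functions, the rule set, and the defuzzification constants are illustrative assumptions, not the paper's actual rules.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_term_weight(s1, s2):
    """Combine two term-selection scores in [0, 1] with Mamdani-style rules:
    IF both HIGH THEN weight HIGH; IF mixed THEN MEDIUM; IF both LOW THEN LOW.
    Defuzzify as a weighted average of rule strengths (the output centroids
    0.2 / 0.5 / 0.9 are illustrative constants)."""
    low1, high1 = tri(s1, -0.5, 0.0, 0.6), tri(s1, 0.4, 1.0, 1.5)
    low2, high2 = tri(s2, -0.5, 0.0, 0.6), tri(s2, 0.4, 1.0, 1.5)
    r_high = min(high1, high2)
    r_med = max(min(high1, low2), min(low1, high2))
    r_low = min(low1, low2)
    total = (r_high + r_med + r_low) or 1.0
    return (0.9 * r_high + 0.5 * r_med + 0.2 * r_low) / total
```

A term scored highly by both selection methods receives a weight near 0.9, mixed evidence lands near the middle, and a term both methods consider unimportant is pushed toward 0.2.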

46 citations


Journal ArticleDOI
01 Jan 2017
TL;DR: Comparisons of 3 different topic representations in a document retrieval task show that textual labels are easier for users to interpret than are term lists and image labels, demonstrating that labeling methods are an effective alternative topic representation.
Abstract: Topic models have been shown to be a useful way of representing the content of large document collections, for example via visualization interfaces such as topic browsers. These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is a term list, that is, the top-n words with the highest conditional probability within the topic. Other topic representations, such as textual and image labels, have also been proposed. However, there has been no comparison of these alternative representations. In this article, we compare three different topic representations in a document retrieval task. Participants were asked to retrieve relevant documents based on predefined queries within a fixed time limit, with topics presented in one of the following modalities: (a) lists of terms, (b) textual phrase labels, and (c) image labels. Results show that textual labels are easier for users to interpret than term lists and image labels. Moreover, the precision of retrieved documents for textual and image labels is comparable to the precision achieved by representing topics using term lists, demonstrating that labeling methods are an effective alternative topic representation.
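The standard term-list representation described above is straightforward to compute from a topic-word distribution. A small sketch, using a hypothetical distribution:

```python
def top_terms(topic_word_probs, n=5):
    """Represent a topic by its n highest-probability terms."""
    return [w for w, _ in sorted(topic_word_probs.items(),
                                 key=lambda kv: kv[1], reverse=True)[:n]]

# Hypothetical topic-word distribution from a trained topic model:
topic = {"match": 0.08, "team": 0.07, "season": 0.05,
         "league": 0.04, "goal": 0.03, "the": 0.01}
```

Here `top_terms(topic, 3)` yields `["match", "team", "season"]` — the term list a topic browser would show for this topic.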

38 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed GKFCM-based system outperforms existing methods on the document retrieval task.
Abstract: A clustering-based document retrieval system finds similar documents for a given user query. This study explores the scope of kernel fuzzy c-means (KFCM) combined with a genetic algorithm for document retrieval. Initially, a genetic-algorithm-based kernel fuzzy c-means algorithm (GKFCM) is proposed to cluster the documents in the library. For each cluster, an index is created containing the common significant keywords of that cluster's documents. Once the user enters keywords as input to the system, it processes them with the WORDNET ontology to obtain neighbourhood keywords and related synset keywords. Lastly, the documents inside the clusters with the maximum matching scores are returned first as the related documents for the query keywords. Experimental results show that the proposed GKFCM-based system outperforms existing methods.

37 citations


Proceedings ArticleDOI
07 Mar 2017
TL;DR: The utility of visual re-ranking, an interactive visualization technique for multi-aspect information retrieval, is demonstrated, and the findings can help in designing search user interfaces that support multi-aspect search.
Abstract: We present visual re-ranking, an interactive visualization technique for multi-aspect information retrieval. In multi-aspect search, the information need of the user consists of more than one aspect or query simultaneously. While visualization and interactive search user interface techniques for improving user interpretation of search results have been proposed, the current research lacks understanding on how useful these are for the user: whether they lead to quantifiable benefits in perceiving the result space and allow faster, and more precise retrieval. Our technique visualizes relevance and document density on a two-dimensional map with respect to the query phrases. Pointing to a location on the map specifies a weight distribution of the relevance to each of the query phrases, according to which search results are re-ranked. User experiments compared our technique to a uni-dimensional search interface with typed query and ranked result list, in perception and retrieval tasks. Visual re-ranking yielded improved accuracy in perception, higher precision in retrieval and overall faster task execution. Our findings demonstrate the utility of visual re-ranking, and can help designing search user interfaces that support multi-aspect search.
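The re-ranking mechanism can be sketched as follows: a pointer position on the map induces a weight per query phrase, and results are re-sorted by the weighted sum of per-aspect relevance scores. The inverse-distance weighting to per-phrase anchor points, and the toy data, are illustrative assumptions rather than the paper's exact formulation.

```python
import math

def aspect_weights(pointer, anchors):
    """Turn a 2-D map position into a normalized weight per query phrase,
    using inverse-distance weighting to each phrase's anchor point."""
    inv = [1.0 / (0.01 + math.dist(pointer, a)) for a in anchors]
    total = sum(inv)
    return [w / total for w in inv]

def rerank(results, weights):
    """results maps doc id -> per-aspect relevance scores; sort documents
    by the weighted sum of those scores."""
    return sorted(results,
                  key=lambda d: sum(w * s for w, s in zip(weights, results[d])),
                  reverse=True)

anchors = [(0.0, 0.0), (1.0, 0.0)]             # one anchor per query phrase
results = {"d1": (0.9, 0.1), "d2": (0.1, 0.9)}
```

Pointing at the first anchor ranks d1 first; pointing at the second anchor ranks d2 first, which is exactly the interactive behaviour the interface exposes.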

33 citations


Proceedings ArticleDOI
02 Feb 2017
TL;DR: This tutorial is the first to disseminate the progress in this emerging field of KGs to researchers and practitioners and is available online at http://github.com/laura-dietz/tutorial-utilizing-kg.
Abstract: The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The increasing depth and breadth of content in KGs makes them not only rich sources of structured knowledge by themselves but also valuable resources for search systems. A surge of recent developments in entity linking and retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications, making this an ideal time to pause, report current findings to the community, summarize successful approaches, and solicit new ideas. This tutorial is the first to disseminate the progress in this emerging field to researchers and practitioners. All tutorial resources are available online at http://github.com/laura-dietz/tutorial-utilizing-kg.

Journal ArticleDOI
TL;DR: A semantically enhanced document retrieval system that describes each retrieved document with an ontological multi-grained network of the extracted conceptualization, together with a SKOS-based ontology created ad hoc for a document corpus, enabling the exploration of the concepts at different granularity levels.

Proceedings Article
01 Jan 2017
TL;DR: It is suggested that searching specific sections may improve precision under certain conditions, often with a loss of recall; chart notes incorporate structure that may facilitate accurate retrieval.
Abstract: Objective: Secondary use of electronic health record (EHR) data is enabled by accurate and complete retrieval of the relevant patient cohort, which requires searching both structured and unstructured data. Clinical text poses difficulties to searching, although chart notes incorporate structure that may facilitate accurate retrieval. Methods: We developed rules identifying clinical document sections, which can be indexed in search engines that allow faceted searches, such as Lucene or Essie, an NLM search engine. We developed 22 clinical cohorts and two queries for each cohort, one utilizing section headings and the other searching the whole document. We manually evaluated a subset of retrieved documents to compare query performance. Results: Querying by section had lower recall than whole-document queries (0.83 vs 0.95), higher precision (0.73 vs 0.54), and higher F1 (0.78 vs 0.69). Conclusion: This evaluation suggests that searching specific sections may improve precision under certain conditions and often with loss of recall.
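The reported figures can be checked directly, since F1 is the harmonic mean of precision and recall:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported above:
section_f1 = round(f1(0.73, 0.83), 2)     # section queries  -> 0.78
whole_doc_f1 = round(f1(0.54, 0.95), 2)   # whole document   -> 0.69
```

The arithmetic confirms the paper's F1 values: the section queries trade 0.12 of recall for 0.19 of precision, which nets out to a higher F1.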

Journal ArticleDOI
TL;DR: This paper proposes MathIRs, comprising three important modules and a substitution-tree-based mechanism for indexing mathematical expressions, presents experimental results for similarity search, and argues that the proposed MathIRs will ease the task of scientific document retrieval.
Abstract: Effective retrieval of mathematical content from a vast corpus of scientific documents demands enhancements to conventional indexing and searching mechanisms. The indexing mechanism and the choice of semantic similarity measures guide the results of a Math Information Retrieval system (MathIRs) to perfection. Tokenization and formula unification are among the distinguishing features of the indexing mechanism used in MathIRs, which facilitate sub-formula and similarity search. Besides, the scientific documents and user queries in MathIRs contain both math and text content, and to match these contents we require three important modules: Text-Text Similarity (TS), Math-Math Similarity (MS) and Text-Math Similarity (TMS). In this paper we propose MathIRs comprising these important modules and a substitution-tree-based mechanism for indexing mathematical expressions. We also present experimental results for similarity search and argue that the proposed MathIRs will ease the task of scientific document retrieval.

Patent
15 Mar 2017
TL;DR: In this paper, a judgment document retrieval method based on semantic matching, and a server, are proposed, in which the matching judgment document can be found by directly describing legal problems or cases in natural language; this greatly lowers the barrier to using the document retrieval server and improves retrieval efficiency.
Abstract: The invention provides a judgment document retrieval method based on semantic matching, and a server. When a case is retrieved, the user does not need to input words that exactly match keywords in a judgment document; the matching judgment document can be found by directly describing the legal problem or case in natural language. This greatly lowers the barrier to using the document retrieval server and improves retrieval efficiency.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: The current study intends to develop an intelligent system that gives precise answers to user queries in natural language; building it requires tokenization, parsing, part-of-speech tagging, question classification, query construction, sentence understanding, document retrieval, keyword ranking, classification, answer extraction and validation.
Abstract: In the contemporary world, lifestyles and interactions have changed across all application domains due to advances in internet technology. In light of the information explosion, this work tries to build an intelligent question answering system in which a user may communicate with a machine in natural language and get a response to a question, using strategies such as Natural Language Processing (NLP), Artificial Intelligence, Information Retrieval and Human-Computer Interaction. Natural Language Processing is a technique whereby a computer behaves like a human, helping people talk to the computer in their own language rather than in computer commands. The skills needed to build an intelligent answering system include tokenization, parsing, part-of-speech tagging, question classification, query construction, sentence understanding, document retrieval, keyword ranking, classification, answer extraction and validation. The current study intends to develop an intelligent system that gives precise answers to user queries in natural language.

Posted Content
TL;DR: The proposed name disambiguation method is non-intrusive of privacy because instead of using attributes pertaining to a real-life person, the method leverages only relational data in the form of anonymized graphs.
Abstract: In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, for many scenarios such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because, instead of using attributes pertaining to a real-life person, it leverages only relational data in the form of anonymized graphs. Methodologically, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than existing name disambiguation methods working in a similar setting.

Journal ArticleDOI
TL;DR: Five criteria: used models, tagging purpose, tagging right, object type, and used dataset, are introduced for evaluating tag-based information retrieval methods as a new categorical framework engaging the graphical models as well as the two-way classical methods.
Abstract: This paper aims to provide a comprehensive survey of tag-based information retrieval that covers three areas: tag-based document retrieval, tag-based image retrieval, and tag-based music information retrieval. First of all, seven representative graphical models associated with tag contents are reviewed and evaluated in terms of effectiveness in achieving their goals. The models are explored in depth based on appropriate plate notations for the tag-based document retrieval. Second, well-established review criteria for two-way classical methods, tag refinement and tag recommendation, are utilized for tag-based image retrieval. In particular, tag refinement methods are analyzed by means of the experimental results measured on different datasets. Last, popular tagging methods in the area of music information retrieval are reviewed for the tag-based music information retrieval. We introduce five criteria: used models, tagging purpose, tagging right, object type, and used dataset, for evaluating tag-based information retrieval methods as a new categorical framework engaging the graphical models as well as the two-way classical methods.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper describes the participation of the USTB_PRIR team in the 2017 BioASQ 5B question answering challenge, including the document retrieval, snippet retrieval and concept retrieval tasks, and introduces different multimodal query processing strategies to enrich query terms and assign different weights to them.
Abstract: This paper describes the participation of USTB_PRIR team in the 2017 BioASQ 5B on question answering, including document retrieval, snippet retrieval, and concept retrieval task. We introduce different multimodal query processing strategies to enrich query terms and assign different weights to them. Specifically, sequential dependence model (SDM), pseudo-relevance feedback (PRF), fielded sequential dependence model (FSDM) and Divergence from Randomness model (DFRM) are respectively performed on different fields of PubMed articles, sentences extracted from relevant articles, the five terminologies or ontologies (MeSH, GO, Jochem, Uniprot and DO) to achieve better search performances. Preliminary results show that our systems outperform others in the document and snippet retrieval task in the first two batches.

Journal ArticleDOI
TL;DR: It is proved that, indeed, the PRP can be sub-optimal in adversarial retrieval settings by presenting a novel game theoretic analysis of the adversarial setting and it is shown that in some cases, introducing randomization into the document ranking function yields an overall user utility that transcends that of applying thePRP.
Abstract: The main goal of search engines is ad hoc retrieval: ranking documents in a corpus by their relevance to the information need expressed by a query. The Probability Ranking Principle (PRP) --- ranking the documents by their relevance probabilities --- is the theoretical foundation of most existing ad hoc document retrieval methods. A key observation that motivates our work is that the PRP does not account for potential post-ranking effects; specifically, changes to documents that result from a given ranking. Yet, in adversarial retrieval settings such as the Web, authors may consistently try to promote their documents in rankings by changing them. We prove that, indeed, the PRP can be sub-optimal in adversarial retrieval settings. We do so by presenting a novel game theoretic analysis of the adversarial setting. The analysis is performed for different types of documents (single-topic and multi-topic) and is based on different assumptions about the writing qualities of documents' authors. We show that in some cases, introducing randomization into the document ranking function yields an overall user utility that transcends that of applying the PRP.

Journal ArticleDOI
TL;DR: Two novel ideas are developed, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing, top-k document retrieval, and document counting, and show that a classical data structure supporting the latter query becomes highly compressible on repetitive data.
Abstract: Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
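For intuition, the three query types studied above can be answered by a naive, uncompressed index over whitespace tokens. This sketch only illustrates the query semantics; the paper's indexes answer the same queries for arbitrary substrings, in compressed space.

```python
from collections import Counter, defaultdict

class DocIndex:
    """Naive baseline for document listing, top-k retrieval by frequency,
    and document counting, built over whitespace tokens."""

    def __init__(self, docs):
        self.postings = defaultdict(Counter)   # term -> {doc_id: term frequency}
        for doc_id, text in enumerate(docs):
            for term in text.split():
                self.postings[term][doc_id] += 1

    def document_listing(self, term):
        """All documents where the term appears."""
        return sorted(self.postings[term])

    def top_k(self, term, k):
        """The k documents where the term appears most often."""
        return [doc for doc, _ in self.postings[term].most_common(k)]

    def document_count(self, term):
        """How many documents contain the term."""
        return len(self.postings[term])

idx = DocIndex(["a b a", "b c", "a a a"])
```

On this toy collection, `idx.document_listing("a")` returns `[0, 2]`, `idx.top_k("a", 1)` returns `[2]`, and `idx.document_count("b")` returns `2`.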

Journal ArticleDOI
TL;DR: Using more than 50 retrieval experiments from the literature as examples, RT is applied to explain the frequency distributions of documents on relevance scales with three or more points to reinforce the paper's more general argument that RT clarifies the concept of relevance in the dialogues of retrieval evaluation.
Abstract: This article extends relevance theory (RT) from linguistic pragmatics into information retrieval. Using more than 50 retrieval experiments from the literature as examples, it applies RT to explain the frequency distributions of documents on relevance scales with three or more points. The scale points, which judges in experiments must consider in addition to queries and documents, are communications from researchers. In RT, the relevance of a communication varies directly with its cognitive effects and inversely with the effort of processing it. Researchers define and/or label the scale points to measure the cognitive effects of documents on judges. However, they apparently assume that all scale points as presented are equally easy for judges to process. Yet the notion that points cost variable effort explains fairly well the frequency distributions of judgments across them. By hypothesis, points that cost more effort are chosen by judges less frequently. Effort varies with the vagueness or strictness of scale-point labels and definitions. It is shown that vague scales tend to produce U- or V-shaped distributions, while strict scales tend to produce right-skewed distributions. These results reinforce the paper's more general argument that RT clarifies the concept of relevance in the dialogues of retrieval evaluation.

Proceedings ArticleDOI
06 Nov 2017
TL;DR: This work presents a convolutional neural model aimed at improving clinical notes representation, making them suitable for document retrieval, and is designed to predict, for each clinical note term, its importance in relevant documents.
Abstract: The rapid increase of medical literature poses a significant challenge for physicians, who have repeatedly reported to struggle to keep up to date with developments in research. This gap is one of the main challenges in integrating recent advances in clinical research with day-to-day practice. Thus, the need for clinical decision support (CDS) search systems that can retrieve highly relevant medical literature given a clinical note describing a patient has emerged. However, clinical notes are inherently noisy, thus not being fit to be used as queries as-is. In this work, we present a convolutional neural model aimed at improving clinical notes representation, making them suitable for document retrieval. The system is designed to predict, for each clinical note term, its importance in relevant documents. The approach was evaluated on the 2016 TREC CDS dataset, where it achieved a 37% improvement in infNDCG over state-of-the-art query reduction methods and a 27% improvement over the best known method for the task.

Journal ArticleDOI
TL;DR: The results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
Highlights: First large-scale and fine-grained evaluation of query-based sampling techniques. Learned keyword queries perform substantially better than queries derived from tuples. Focusing on, and processing exhaustively, effective queries leads to high efficiency. Focusing on, and processing in rounds, less-effective queries favors quality. Filtering underperforming queries favors sampling efficiency but hurts quality.
Abstract: Information extraction systems discover structured information in natural language text. Having information in structured form enables much richer querying and data mining than is possible over the natural language text. However, information extraction is a computationally expensive task, and hence improving the efficiency of the extraction process over large text collections is of critical interest. In this paper, we focus on an especially valuable family of text collections, namely, the so-called deep-web text collections, whose contents are not crawlable and are only available via querying. Important steps for efficient information extraction over deep-web text collections (e.g., selecting the collections on which to focus the extraction effort, based on their contents; or learning which documents within these collections, and in which order, to process, based on their words and phrases) require having a representative document sample from each collection. These document samples have to be collected by querying the deep-web text collections, an expensive process that renders impractical the existing sampling approaches developed for other data scenarios. In this paper, we systematically study the space of query-based document sampling techniques for information extraction over the deep web. Specifically, we consider (i) alternative query execution schedules, which vary on how they account for the query effectiveness, and (ii) alternative document retrieval and processing schedules, which vary on how they distribute the extraction effort over documents.
We report the results of the first large-scale experimental evaluation of sampling techniques for information extraction over the deep web. Our results show the merits and limitations of the alternative query execution and document retrieval and processing strategies, and provide a roadmap for addressing this critically important building block for efficient, scalable information extraction.
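The query execution schedules studied above trade off query effectiveness against coverage. As a much-simplified illustration (not the paper's algorithm), the sketch below greedily prefers the query with the best observed yield of new documents per call; `issue_query` is a hypothetical search interface returning matching document ids.

```python
def sample_collection(issue_query, queries, budget, k=10, max_calls=100):
    """Greedy, effectiveness-driven query scheduler: repeatedly issue the
    query with the best observed yield (new documents per call), up to a
    document budget or a total query-call budget."""
    stats = {q: [0, 0] for q in queries}   # query -> [new_docs_seen, calls]
    sample, calls = set(), 0
    while len(sample) < budget and calls < max_calls:
        # untried queries first, then the best new-documents-per-call ratio
        q = max(queries, key=lambda q: float("inf") if stats[q][1] == 0
                else stats[q][0] / stats[q][1])
        new = set(issue_query(q, k)) - sample
        stats[q][0] += len(new)
        stats[q][1] += 1
        sample |= new
        calls += 1
    return sample
```

A round-based schedule, by contrast, would cycle through all queries once per round instead of always exploiting the current best one.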

Journal ArticleDOI
TL;DR: Across four web test collections, it is found that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger.
Abstract: This paper explores the performance of top-k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered: no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to "keep up" with the processor. Our results highlight the importance of "architecture affinity" when designing high-performance search engines.
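For reference, variable byte (VByte) encoding, one of the schemes compared above, packs each integer (typically a docid gap) into 7-bit payload chunks, with the high bit flagging the final byte of each number. A minimal sketch, following that common convention:

```python
def vbyte_encode(gaps):
    """Variable-byte encode non-negative integers: 7 payload bits per byte,
    high bit set on the terminating (low-order) byte of each number."""
    out = bytearray()
    for n in gaps:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80              # flag the last byte of this number
        out.extend(reversed(chunk))   # emit high-order bytes first
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                  # termination flag: number complete
            nums.append(n)
            n = 0
    return nums
```

Decoding is branch-heavy per byte, which is one reason word-aligned schemes such as Simple-8b and QMX often decode faster in practice.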

Journal ArticleDOI
TL;DR: To compute candidate groups of k relevant documents efficiently, this work proposes dynamic diverse retrieval algorithms, specialized for patent searching, that enable effective dynamic interactive retrieval.
Abstract: The high-recall retrieval problem, which aims at finding the full set of relevant documents in a huge result set by effective mining techniques, is particularly useful for patent information retrieval, legal document retrieval, medical document retrieval, market information retrieval, and literature review. The existing high-recall retrieval methods, however, remain far from satisfactory at retrieving all relevant documents, owing not only to strict high-recall and precision thresholds but also to the need to minimize the number of reviewed documents. To address this gap, we generalize the problem to a novel high-recall retrieval model, which can be represented as finding all needles in a giant haystack. To compute candidate groups consisting of k relevant documents efficiently, we propose dynamic diverse retrieval algorithms specialized for patent searching, with which effective dynamic interactive retrieval can be achieved. Across various types of datasets, the dynamic ranking method shows considerable improvements with respect to time and cost over the conventional static ranking approaches.
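The dynamic interactive retrieval the abstract alludes to can be pictured as a review-and-re-rank loop. The sketch below is a generic illustration using term overlap, not the authors' algorithm; `is_relevant` stands in for the human reviewer, and the stopping rule is a deliberately naive placeholder.

```python
from collections import Counter

def high_recall_review(docs, is_relevant, batch=2):
    """Iterative review loop: score unreviewed documents by term overlap
    with the relevant documents found so far, judge the top batch, re-rank,
    and stop after a batch yields no new relevant documents."""
    unreviewed, relevant_terms, found = set(docs), Counter(), []
    while unreviewed:
        ranked = sorted(unreviewed, key=lambda d: (
            -sum(relevant_terms[t] for t in docs[d]), d))
        new_hits = 0
        for d in ranked[:batch]:
            unreviewed.discard(d)
            if is_relevant(d):          # human judgment in a real system
                found.append(d)
                relevant_terms.update(docs[d])
                new_hits += 1
        if new_hits == 0 and found:
            break                       # naive stopping rule
    return found
```

The dynamic aspect is that the ranking changes after every reviewed batch, rather than being fixed up front as in static ranking.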

Journal ArticleDOI
TL;DR: An overview of text mining techniques that assist research by identifying biomedical entities and the relations between them in text, and an application that integrates PubMed document retrieval, concept and relation identification, and visualization are discussed, enabling a user to explore concepts and relations from within a set of retrieved citations.
Abstract: Informatics methodologies exploit computer-assisted techniques to help biomedical researchers manage large amounts of information. In this paper, we focus on the biomedical research literature (MEDLINE). We first provide an overview of some text mining techniques that offer assistance in research by identifying biomedical entities (e.g., genes, substances, and diseases) and relations between them in text. We then discuss Semantic MEDLINE, an application that integrates PubMed document retrieval, concept and relation identification, and visualization, thus enabling a user to explore concepts and relations from within a set of retrieved citations. Semantic MEDLINE provides a roadmap through content and helps users discern patterns in large numbers of retrieved citations. We illustrate its use with an informatics method we call "discovery browsing," which provides a principled way of navigating through selected aspects of some biomedical research area. The method supports an iterative process that accommodates learning and hypothesis formation, in which a user is provided with high-level connections before delving into details. As a use case, we examine current developments in basic research on mechanisms of Alzheimer's disease. Out of the nearly 90,000 citations returned by the PubMed query "Alzheimer's disease," discovery browsing led us to 73 citations on sortilin and that disorder. We provide a synopsis of the basic research reported in 15 of these. There is widespread consensus among researchers working with a range of animal models and human cells that increased sortilin expression and decreased receptor expression are associated with amyloid beta and/or amyloid precursor protein.

Journal ArticleDOI
TL;DR: This paper presents a method for determining a user profile in a document retrieval system, proposing an ontology-based profile that makes it possible to process semantic relations between users' queries.
Abstract: Information overload has become a very important problem in the information retrieval domain. Even if a user knows where to look for interesting information, he may have difficulty formulating his information needs precisely. A solution to this problem is personalization and recommendation systems: they observe user activities and analyze them to discover important preferences. Based on this information, the system can improve the effectiveness of the results. In this paper we present a method for determining a user profile in a document retrieval system. We propose an ontology-based profile; such a structure makes it possible to process semantic relations between users' queries. We focus on methods for adapting the profile, because only an up-to-date profile can help the user obtain results that correspond to his information needs. We present a set of postulates for adaptation methods. Performed experimental evaluations of the developed methods are promising.
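Keeping a profile up to date typically combines decaying old interests and reinforcing terms from recent queries. The sketch below is a much-simplified term-weight stand-in for the paper's ontology-based profile, with illustrative `boost` and `decay` parameters that are assumptions, not the authors' postulates.

```python
def adapt_profile(profile, query_terms, boost=1.0, decay=0.9):
    """Profile-adaptation sketch: decay all stored interest weights, then
    boost the terms of the latest query, so the profile tracks the user's
    current information needs rather than stale ones."""
    for t in profile:
        profile[t] *= decay            # older interests fade over time
    for t in query_terms:
        profile[t] = profile.get(t, 0.0) + boost
    return profile
```

After repeated sessions, terms the user stops querying decay toward zero while recurring terms dominate the profile.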

Proceedings ArticleDOI
01 Jan 2017
TL;DR: The evaluation results suggest that researchers are on the right track toward shape descriptors that can capture the main characteristics of 3D models; however, more tests still need to be made, since this is the first time non-rigid signatures have been compared for point-cloud shape retrieval.
Abstract: In this paper, we present the results of the SHREC'17 Track: Point-Cloud Shape Retrieval of Non-Rigid Toys. The aim of this track is to create a fair benchmark to evaluate the performance of methods on the non-rigid point-cloud shape retrieval problem. The database used in this task contains 100 3D point-cloud models classified into 10 different categories. All point clouds were generated by scanning each of the models in their final poses using a 3D scanner, i.e., all models were articulated before being scanned. The retrieval performance is evaluated using seven commonly-used statistics (PR-plot, NN, FT, ST, E-measure, DCG, mAP). In total, 8 groups with 31 submissions took part in this contest. The evaluation results suggest that researchers are on the right track toward shape descriptors that can capture the main characteristics of 3D models; however, more tests still need to be made, since this is the first time we compare non-rigid signatures for point-cloud shape retrieval.
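Two of the statistics listed above, DCG and mAP, are straightforward to compute from a ranked result list. Minimal reference implementations (standard textbook definitions, not the track's exact evaluation code):

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain over a ranked list of graded relevances."""
    rel = relevances[:k] if k else relevances
    return sum(r / math.log2(i + 2) for i, r in enumerate(rel))

def average_precision(ranked, relevant):
    """AP for one query: mean of precision@i taken at each relevant hit.
    mAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```

NN, FT, and ST can be derived similarly from the ranked list and the size of the query's category.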

Proceedings ArticleDOI
07 Aug 2017
TL;DR: This work proposes a method to further improve document models by utilizing external collections as part of the document expansion process, based on relevance modeling, which improves ad-hoc document retrieval effectiveness on a variety of corpus types.
Abstract: Document expansion has been shown to improve the effectiveness of information retrieval systems by augmenting documents' term probability estimates with those of similar documents, producing higher quality document representations. We propose a method to further improve document models by utilizing external collections as part of the document expansion process. Our approach is based on relevance modeling, a popular form of pseudo-relevance feedback; however, where relevance modeling is concerned with query expansion, we are concerned with document expansion. Our experiments demonstrate that the proposed model improves ad-hoc document retrieval effectiveness on a variety of corpus types, with a particular benefit on more heterogeneous collections of documents.
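A simplified sketch in the spirit of the document expansion described above: mix a document's term distribution with the similarity-weighted distributions of its nearest neighbors from an external collection. The mixing weight `alpha` and the `(terms, similarity)` neighbor format are illustrative assumptions, not the paper's exact estimator.

```python
from collections import Counter

def expand_document(doc_terms, neighbors, alpha=0.5):
    """Document-expansion sketch: interpolate a document's language model
    with a similarity-weighted mixture of neighbor language models."""
    def lm(terms):
        c = Counter(terms)
        n = sum(c.values())
        return {t: f / n for t, f in c.items()}

    p_doc = lm(doc_terms)
    # neighbors: list of (terms, similarity_weight) pairs from the
    # external collection
    total_w = sum(w for _, w in neighbors) or 1.0
    p_exp = Counter()
    for terms, w in neighbors:
        for t, p in lm(terms).items():
            p_exp[t] += (w / total_w) * p
    vocab = set(p_doc) | set(p_exp)
    return {t: alpha * p_doc.get(t, 0.0) + (1 - alpha) * p_exp.get(t, 0.0)
            for t in vocab}
```

The expanded model assigns nonzero probability to terms the document never mentions but its neighbors do, which is what improves term probability estimates for short or sparse documents.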