
Showing papers on "Human–computer information retrieval published in 2016"


01 Jan 2016
TL;DR: Information Retrieval: Implementing and Evaluating Search Engines is a textbook on building and evaluating search engines.

96 citations


Journal ArticleDOI
TL;DR: A review of some of the most important contributions in this domain to understand the principles of SIR, a taxonomy to categorize these contributions, and an analysis of some of these contributions and tools with respect to several criteria are proposed.

92 citations


Proceedings ArticleDOI
08 May 2016
TL;DR: This track aims to provide a benchmark to evaluate large-scale shape retrieval based on the ShapeNet dataset, using ShapeNet Core55, which provides more than 50 thousand models over 55 common categories in total for training and evaluating several algorithms.
Abstract: With the advent of commodity 3D capturing devices and better 3D modeling tools, 3D shape content is becoming increasingly prevalent. Therefore, the need for shape retrieval algorithms that can handle large-scale shape repositories is increasingly important. This track aims to provide a benchmark to evaluate large-scale shape retrieval based on the ShapeNet dataset. We use ShapeNet Core55, which provides more than 50 thousand models over 55 common categories in total for training and evaluating several algorithms. Five participating teams submitted a variety of retrieval methods, which were evaluated on several standard information retrieval performance metrics. We find the submitted methods work reasonably well on the track benchmark, but we also see significant room for improvement by future algorithms. We release all the data, results, and evaluation code for the benefit of the community.

64 citations


Journal ArticleDOI
TL;DR: This paper proposes and displays an ontology-based object-attribute-value (O-A-V) information extraction system as a web model that acts as a user dictionary to refine the search keywords in the query for subsequent attempts to improve the standard information retrieval systems.
Abstract: In the internet era, search engines play a vital role in information retrieval from web pages. Search engines arrange the retrieved results using various ranking algorithms. Additionally, retrieval is based on statistical searching techniques or content-based information extraction methods. It is still difficult for the user to understand the abstract details of every web page unless the user opens it separately to view the web content. This key point provided the motivation to propose an ontology-based object-attribute-value (O-A-V) information extraction system as a web model that acts as a user dictionary to refine the search keywords in the query for subsequent attempts. This first model is evaluated using various natural language processing (NLP) queries given as English sentences. Additionally, image search engines, such as Google Images, use content-based image information extraction and retrieval of web pages against the user query. To minimize the semantic gap between the image retrieval results and the expected user results, the domain ontology is built using image descriptions. The second proposed model initially examines natural language user queries using an NLP parser algorithm that identifies the subject-predicate-object (S-P-O) of the query. S-P-O extraction is an idea extended from the ontology-based O-A-V web model. Using this S-P-O extraction, and considering the complexity of writing SPARQL Protocol and RDF Query Language (SPARQL) queries from the user's point of view, a SPARQL auto-generation module is proposed that automatically generates the SPARQL query. The query is then deployed on the ontology, and images are retrieved based on the auto-generated SPARQL query. With the proposed methodology, this paper seeks answers to the following two questions. First, how can domain ontology and semantics be combined to improve information retrieval and the user experience? Second, does this new unified framework improve on standard information retrieval systems? To answer these questions, a document retrieval system and an image retrieval system were built to test the proposed framework. Web document retrieval was tested against three keyword/bag-of-words models and a semantic ontology model. Image retrieval was tested on the IAPR TC-12 benchmark dataset. The precision, recall and accuracy results were then compared against standard information retrieval systems using TREC_EVAL. The results indicated improvements over the standard systems. A controlled experiment was performed with test subjects querying the retrieval system in the absence and presence of the proposed framework. The queries were measured using two metrics: time and click-count. Comparisons were made between retrieval performed with and without the proposed framework. The results were encouraging.

56 citations
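The SPARQL auto-generation step described in the abstract can be sketched as a simple template fill over an extracted S-P-O triple. The prefix, property names, and function below are illustrative assumptions, not the paper's actual ontology schema:

```python
def spo_to_sparql(subject: str, predicate: str, obj: str) -> str:
    """Illustrative sketch: turn an extracted subject-predicate-object
    triple into a SPARQL SELECT query over a hypothetical
    image-description ontology (the prefix and properties are made up)."""
    return (
        "PREFIX ex: <http://example.org/ontology#>\n"
        "SELECT ?image WHERE {\n"
        f"  ?image ex:depicts ex:{subject} .\n"
        f"  ex:{subject} ex:{predicate} ex:{obj} .\n"
        "}"
    )

# e.g., for the parsed query "dog chases ball"
query = spo_to_sparql("dog", "chases", "ball")
```

In the paper's pipeline, a query like this would then be run against the domain ontology to retrieve matching images.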


Book ChapterDOI
TL;DR: In this paper, the authors examine the moderating effects of whether a retrieval attempt results in success or failure and conclude that retrieval practice is beneficial even when the retrieval attempt is unsuccessful.
Abstract: Attempting to recall information from memory (i.e., retrieval practice) has been shown to enhance learning across a wide variety of materials, learners, and experimental conditions. We examine the moderating effects of what is arguably the most fundamental distinction to be made about retrieval: whether a retrieval attempt results in success or failure. After reviewing research on this topic, we conclude that retrieval practice is beneficial even when the retrieval attempt is unsuccessful. This finding appears to hold true in a variety of laboratory and real-world contexts and applies to learners across the lifespan. Based on these findings, we outline a two-stage model in which learning from retrieval involves (1) a retrieval attempt and then (2) processing the answer. We then turn to a second issue: Does retrieval success even matter for learning? Recent findings suggest that retrieval failure followed by feedback leads to the same amount of learning as retrieval success. In light of these findings, we propose that separate mechanisms are not needed to explain the effect of retrieval success and retrieval failure on learning. We then review existing theories of retrieval and comment on their compatibility with extant data, and end with theoretical conclusions for researchers as well as practical advice for learners and teachers.

56 citations


Journal ArticleDOI
01 Oct 2016
TL;DR: It is shown that basic schemes are weak, but some of them can be made arbitrarily safe by composing them with large anonymity systems, and the security of each scheme is proved using a flexible differentially private definition for private queries that can capture notions of imperfect privacy.
Abstract: Private Information Retrieval (PIR), despite being well studied, is computationally costly and arduous to scale. We explore lower-cost relaxations of information-theoretic PIR, based on dummy queries, sparse vectors, and compositions with an anonymity system. We prove the security of each scheme using a flexible differentially private definition for private queries that can capture notions of imperfect privacy. We show that basic schemes are weak, but some of them can be made arbitrarily safe by composing them with large anonymity systems.

46 citations
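The dummy-query relaxation mentioned in the abstract can be sketched in a few lines: the client hides its real query among decoys so the server cannot tell which one is genuine. The function and decoy pool below are illustrative, not the paper's formal construction:

```python
import random

def dummy_query_batch(real_query: str, dummy_pool: list,
                      k: int, rng: random.Random) -> list:
    """Sketch of a dummy-query scheme: submit the real query alongside
    k-1 decoys drawn from a pool, in shuffled order. The privacy this
    buys is imperfect, which is what the paper's differentially private
    definition is designed to quantify."""
    batch = rng.sample(dummy_pool, k - 1) + [real_query]
    rng.shuffle(batch)  # the server sees k queries with no marked "real" one
    return batch

rng = random.Random(0)  # seeded for reproducibility
batch = dummy_query_batch("secret topic", ["a", "b", "c", "d"], 3, rng)
```

Intuitively, a larger k (or composition with an anonymity system, as the paper proposes) shrinks the server's advantage in guessing the real query.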


Proceedings ArticleDOI
Hang Li1, Zhengdong Lu1
07 Jul 2016
TL;DR: This tutorial aims at summarizing and introducing the results of recent research on deep learning for information retrieval, in order to stimulate and foster more significant research and development work on the topic in the future.
Abstract: Recent years have seen significant progress in information retrieval and natural language processing, with deep learning technologies successfully applied to almost all of their major tasks. The key to the success of deep learning is its capability of accurately learning distributed representations (vector representations or structured arrangements of them) of natural language expressions such as sentences, and effectively utilizing these representations in the tasks. This tutorial aims at summarizing and introducing the results of recent research on deep learning for information retrieval, in order to stimulate and foster more significant research and development work on the topic in the future. The tutorial mainly consists of three parts. In the first part, we introduce the fundamental techniques of deep learning for natural language processing and information retrieval, such as word embedding, recurrent neural networks, and convolutional neural networks. In the second part, we explain how deep learning, particularly representation learning techniques, can be utilized in fundamental NLP and IR problems, including matching, translation, classification, and structured prediction. In the third part, we describe in detail how deep learning can be used in specific application tasks: search, question answering (from documents, databases, or knowledge bases), and image retrieval.

43 citations
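The distributed-representation matching the tutorial covers can be illustrated with a toy averaged-embedding ranker. The vectors below are made up for the sketch; real systems learn them (e.g., with word2vec) over large corpora:

```python
import math

# Toy 2-d word embeddings; real embeddings are learned, not hand-set.
EMB = {"deep": [1.0, 0.0], "learning": [0.8, 0.2],
       "cooking": [0.0, 1.0], "recipes": [0.1, 0.9]}

def embed(text: str) -> list:
    """Average the vectors of known words: a common, simple baseline
    for turning a sentence into a single vector."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

docs = ["deep learning", "cooking recipes"]
q = embed("learning deep")
ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
```

Matching query and document vectors this way is the simplest instance of the "matching" problem the tutorial's second part discusses.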


Journal ArticleDOI
TL;DR: To enable the integration of multiple data sources while performing efficient retrieval of web data, an intelligent web search framework is proposed.

39 citations


Journal ArticleDOI
09 Dec 2016, PLOS ONE
TL;DR: This strategy has been shown to be feasible and can provide evidence to doctors’ clinical questions and has the potential to be incorporated into an interventional study to determine the impact of an online evidence retrieval system.
Abstract: Background: Physicians are often encouraged to locate answers for their clinical queries via an evidence-based literature search approach. The methods used are often not clearly specified. Inappropriate search strategies, time constraints and contradictory information complicate evidence retrieval. Aims: Our study aimed to develop a search strategy to answer clinical queries among physicians in a primary care setting. Methods: Six clinical questions on different medical conditions seen in primary care were formulated. A series of experimental searches to answer each question was conducted on 3 commonly advocated medical databases. We compared search results from a PICO (patients, intervention, comparison, outcome) framework for questions using different combinations of PICO elements. We also compared outcomes from searches using text words, Medical Subject Headings (MeSH), or a combination of both. All searches were documented using screenshots and saved search strategies. Results: Answers to all 6 questions using the PICO framework were found. A higher number of systematic reviews was obtained using a 2-element PICO search compared to a 4-element search. A more optimal choice is a combination of both text words and MeSH terms. Despite searching using the Systematic Review filter, many non-systematic or narrative reviews were found in PubMed. There was poor overlap between the outcomes of searches using different databases. The duration of search and screening for the 6 questions ranged from 1 to 4 hours. Conclusion: This strategy has been shown to be feasible and can provide evidence for doctors' clinical questions. It has the potential to be incorporated into an interventional study to determine the impact of an online evidence retrieval system.

37 citations
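The PICO element combinations the study compares can be sketched as a boolean query builder. The syntax is a generic PubMed-style illustration, not the authors' documented search strings:

```python
def pico_query(p: str = "", i: str = "", c: str = "", o: str = "") -> str:
    """Sketch: combine whichever PICO elements are supplied into an
    AND-joined boolean query. The study found that 2-element searches
    often retrieved more systematic reviews than 4-element ones."""
    parts = [f'("{term}")' for term in (p, i, c, o) if term]
    return " AND ".join(parts)

# A 2-element (P + I) search, the variant the study found more productive:
q2 = pico_query(p="type 2 diabetes", i="metformin")
```

In practice each element would also be OR-expanded with synonyms and MeSH terms, which the study found to be the more optimal choice.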


Journal ArticleDOI
TL;DR: A survey of the literature on indexing and retrieval of mathematical knowledge, with pointers to 77 papers and tentative taxonomies of both retrieval problems and recurring techniques is presented.
Abstract: We present a survey of the literature on indexing and retrieval of mathematical knowledge, with pointers to 77 papers and tentative taxonomies of both retrieval problems and recurring techniques.

33 citations


Proceedings ArticleDOI
10 May 2016
TL;DR: A novel tool called F-search is presented that emphasizes the core strengths of LIRE, the Java library for visual information retrieval: lightness, speed and accuracy.
Abstract: With an annual growth rate of 16.2% in photos taken per year, researchers predict an almost unbelievable 4.9 trillion stored images in 2017. Nearly 80% of the photos taken in 2017 will be taken with mobile phones. To cope with this immense amount of visual data in a fast and accurate way, visual information retrieval systems are needed for various domains and applications. LIRE, short for Lucene Image Retrieval, is a lightweight and easy-to-use Java library for visual information retrieval. It allows developers and researchers to integrate common content-based image retrieval approaches in their applications and research projects. LIRE supports global and local image features and can cope with millions of images using approximate search and by distributing indexes on the cloud. In this demo we present a novel tool called F-search that emphasizes the core strengths of LIRE: lightness, speed and accuracy.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: The "Search as Learning" (SAL) workshop is focused on an area within the information retrieval field that is only beginning to emerge: supporting users in their learning whilst interacting with information content.
Abstract: The "Search as Learning" (SAL) workshop is focused on an area within the information retrieval field that is only beginning to emerge: supporting users in their learning whilst interacting with information content.

Journal ArticleDOI
TL;DR: In this paper, some of the most important areas of information retrieval, i.e. Cross-Lingual Information Retrieval (CLIR), Multilingual Information Retrieval (MLIR), and machine translation approaches and techniques, are introduced.

Journal ArticleDOI
TL;DR: This paper presents a novel Content-Based Video Retrieval approach in order to cope with the semantic gap challenge by means of latent topics and reveals that the proposed ranking function is able to provide a competitive advantage within the content-based retrieval field.

Journal ArticleDOI
TL;DR: A novel discriminative semantic subspace analysis (DSSA) method is proposed, which can directly learn a semantic subspace from similar and dissimilar pairwise constraints without using any explicit class label information.
Abstract: Content-based image retrieval (CBIR) has attracted much attention during the past decades for its potential practical applications to image database management. A variety of relevance feedback (RF) schemes have been designed to bridge the gap between low-level visual features and high-level semantic concepts for an image retrieval task. In the process of RF, it would be impractical or too expensive to provide explicit class label information for each image. Instead, similar or dissimilar pairwise constraints between two images can be acquired more easily. However, most of the conventional RF approaches can only deal with training images with explicit class label information. In this paper, we propose a novel discriminative semantic subspace analysis (DSSA) method, which can directly learn a semantic subspace from similar and dissimilar pairwise constraints without using any explicit class label information. In particular, DSSA can effectively integrate the local geometry of labeled similar images, the discriminative information between labeled similar and dissimilar images, and the local geometry of labeled and unlabeled images together to learn a reliable subspace. Compared with the popular distance metric analysis approaches, our method can also learn a distance metric but perform more effectively when dealing with high-dimensional images. Extensive experiments on both the synthetic data sets and a real-world image database demonstrate the effectiveness of the proposed scheme in improving the performance of the CBIR.

Proceedings Article
01 Jan 2016
TL;DR: Experimental results show that document distance measures derived from unsupervised word embeddings contribute to significant ranking improvements when combined with traditional document retrieval approaches.
Abstract: This article summarizes the approach developed for the TREC 2016 Clinical Decision Support Track. In order to address the daunting challenge of retrieving biomedical articles for answering clinical questions, an information retrieval methodology was developed that combines pseudo-relevance feedback, semantic query expansion and document similarity measures based on unsupervised word embeddings. The individual relevance metrics were combined through a supervised learning-to-rank model based on gradient boosting to maximize the normalized discounted cumulative gain (nDCG). Experimental results show that document distance measures derived from unsupervised word embeddings contribute to significant ranking improvements when combined with traditional document retrieval approaches.
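The nDCG metric that the learning-to-rank model maximizes can be computed as follows. This sketch uses the linear-gain variant; the track's official evaluation may use the exponential-gain form:

```python
import math

def dcg(relevances: list) -> float:
    """Discounted cumulative gain over a ranked list of graded
    relevances; gains are discounted by log2 of the (1-based) rank."""
    return sum(rel / math.log2(rank + 2)  # rank is 0-based here
               for rank, rel in enumerate(relevances))

def ndcg(relevances: list) -> float:
    """DCG normalized by the ideal (descending-relevance) ordering,
    so a perfect ranking scores 1.0."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the top 4 retrieved documents, in ranked order:
score = ndcg([3, 2, 0, 1])
```

Swapping the misplaced documents at ranks 3 and 4 into descending order would raise the score to exactly 1.0.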

Journal ArticleDOI
TL;DR: A set of comprehensive empirical studies to explore the effects of multiple query evidences on large-scale social image search and a novel quantitative metric is proposed and applied to assess the influences of different visual queries based on their complexity levels.
Abstract: System performance assessment and comparison are fundamental for large-scale image search engine development. This article documents a set of comprehensive empirical studies exploring the effects of multiple query evidences on large-scale social image search. The search performance based on social tags, different kinds of visual features and their combinations is systematically studied and analyzed. To quantify visual query complexity, a novel quantitative metric is proposed and applied to assess the influence of different visual queries based on their complexity levels. We also study the effects of automatic text query expansion with social tags, using a pseudo relevance feedback method, on retrieval performance. Our analysis of the experimental results shows a few key research findings: (1) social tag-based retrieval methods can achieve much better results than content-based retrieval methods; (2) a combination of textual and visual features can significantly and consistently improve search performance; (3) the complexity of image queries has a strong correlation with the quality of retrieval results: more complex queries lead to poorer search effectiveness; and (4) query expansion based on social tags frequently causes search topic drift and consequently leads to performance degradation.

Proceedings ArticleDOI
08 Feb 2016
TL;DR: This work shows that it is possible to get judgements of effort from the assessors and shows that given documents of the same relevance grade, effort needed to find the portion of the document relevant to the query is a significant factor in determining user satisfaction as well as user preference between these documents.
Abstract: Document relevance has been the primary focus in the design, optimization and evaluation of retrieval systems. Traditional test collections are constructed by asking judges the relevance grade for a document with respect to an input query. Recent work of Yilmaz et al. found evidence that effort is another important factor in determining document utility, suggesting that more thought should be given to incorporating effort into information retrieval. However, that work did not ask judges to directly assess the level of effort required to consume a document or analyse how effort judgements relate to traditional relevance judgements. In this work, focusing on three aspects associated with effort, we show that it is possible to get judgements of effort from the assessors. We further show that, given documents of the same relevance grade, the effort needed to find the portion of the document relevant to the query is a significant factor in determining user satisfaction as well as user preference between these documents. Our results suggest that if the end goal is to build retrieval systems that optimize user satisfaction, effort should be included as an additional factor to relevance in building and evaluating retrieval systems. We further show that new retrieval features are needed if the goal is to build retrieval systems that jointly optimize relevance and effort, and we propose a set of such features. Finally, we focus on the evaluation of retrieval systems and show that incorporating effort into retrieval evaluation could lead to significant differences regarding the performance of retrieval systems.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: A new Web-based collection for focused retrieval is presented and the documents most highly ranked for each query by a highly effective learning-to-rank method were judged for relevance using crowdsourcing.
Abstract: Focused retrieval (a.k.a. passage retrieval) is important in its own right and as an intermediate step in question answering systems. We present a new Web-based collection for focused retrieval. The document corpus is Category A of the ClueWeb12 collection. Forty-nine queries from the educational domain were created. The 100 documents most highly ranked for each query by a highly effective learning-to-rank method were judged for relevance using crowdsourcing. All sentences in the relevant documents were judged for relevance.

Journal ArticleDOI
TL;DR: A new information-retrieval algorithm based on formal concept analysis is proposed that deals with disjunctive and conjunctive queries and exploits the theoretical basis provided by the FCA to design an efficient and flexible approach for information retrieval.
Abstract: With the exponential increase in the quantity of information circulating on the Internet, an evolution of information-retrieval systems becomes paramount. Indeed, current approaches to information systems design remain unable to meet the needs of users, either in performance (precision and recall) or in response time. In this paper, we propose a new information-retrieval algorithm based on formal concept analysis (FCA). The proposed algorithm deals with disjunctive and conjunctive queries. In fact, information retrieval is a direct application of FCA, which makes adapting this theory to the field an easy and intuitive task. In this context, we exploited the theoretical basis provided by FCA to design an efficient and flexible approach to information retrieval.
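A conjunctive or disjunctive query over a document-term incidence relation, the core operation underlying FCA-based retrieval, reduces to intersection and union over posting sets. A minimal sketch with a toy index (not the paper's algorithm):

```python
# Toy incidence relation: document -> set of index terms.
INDEX = {"d1": {"ir", "fca"}, "d2": {"ir", "web"}, "d3": {"fca"}}

def conjunctive(terms: set) -> set:
    """Documents containing ALL query terms; in FCA terms, the extent
    of the concept generated by the query's term set."""
    return {d for d, ts in INDEX.items() if terms <= ts}

def disjunctive(terms: set) -> set:
    """Documents containing ANY query term: the union of postings."""
    return {d for d, ts in INDEX.items() if terms & ts}

both = conjunctive({"ir", "fca"})
any_ = disjunctive({"ir", "fca"})
```

The paper's contribution is organizing such queries over the concept lattice rather than scanning documents; the set semantics are the same.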

Posted Content
TL;DR: Recall and precision are used to evaluate the efficacy of information retrieval systems; response time and the relevancy of the results are significant factors in user satisfaction.
Abstract: Information retrieval system evaluation revolves around the notion of relevant and non-relevant documents. Performance indicators such as precision and recall are used to determine how far the system satisfies the user's requirements. The effectiveness of information retrieval systems is essentially measured by comparing performance, functionality and systematic approach on a common set of queries and documents. Significance tests are used in functional, performance (precision and recall), collection and interface evaluation. We must focus on user satisfaction, which is the key parameter of performance evaluation. It identifies the collection of relevant documents within the retrieved set in a specific time interval. Recall and precision are used to evaluate the efficacy of information retrieval systems. Response time and the relevancy of the results are significant factors in user satisfaction. The search engines Yahoo and Google are compared based on precision and recall.
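The set-based precision and recall definitions the paper relies on are straightforward to compute:

```python
def precision_recall(retrieved: set, relevant: set):
    """Standard set-based definitions:
    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|"""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# 4 documents retrieved, 3 relevant overall, 2 of them retrieved:
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```

Comparing two engines on the same query set, as the paper does for Yahoo and Google, amounts to averaging these per-query values.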

Journal ArticleDOI
01 Jul 2016
TL;DR: A machine‐learning‐based method to dynamically evaluate and predict search performance several time‐steps ahead at each given time point of the search process during an exploratory search task is proposed.
Abstract: Most information retrieval (IR) systems consider relevance, usefulness, and quality of information objects (documents, queries) for evaluation, prediction, and recommendation, often ignoring the underlying search process of information seeking. This may leave out opportunities for making recommendations that analyze the search process and/or recommend an alternative search process instead of objects. To overcome this limitation, we investigated whether by analyzing a searcher's current processes we could forecast his likelihood of achieving a certain level of success with respect to search performance in the future. We propose a machine-learning-based method to dynamically evaluate and predict search performance several time-steps ahead at each given time point of the search process during an exploratory search task. Our prediction method uses a collection of features extracted from expression of information need and coverage of information. For testing, we used log data collected from 4 user studies that included 216 users (96 individuals and 60 pairs). Our results show 80-90% accuracy in prediction depending on the number of time-steps ahead. In effect, the work reported here provides a framework for evaluating search processes during exploratory search tasks and predicting search performance. Importantly, the proposed approach is based on user processes and is independent of any IR system.

Book
01 Jun 2016
TL;DR: In this article, the authors provide a comprehensive and up-to-date introduction to dynamic information retrieval modeling, the statistical modeling of IR systems that can adapt to change, learn with minimal computational footprint, and be responsive and adaptive.
Abstract: Dynamic aspects of Information Retrieval (IR), including changes found in data, users and systems, are increasingly being utilized in search engines and information filtering systems. Existing IR techniques are limited in their ability to optimize over changes, learn with minimal computational footprint and be responsive and adaptive. The objective of this tutorial is to provide a comprehensive and up-to-date introduction to Dynamic Information Retrieval Modeling, the statistical modeling of IR systems that can adapt to change. It will cover techniques ranging from classic relevance feedback to the latest applications of partially observable Markov decision processes (POMDPs) and a handful of useful algorithms and tools for solving IR problems incorporating dynamics.
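Classic relevance feedback, the tutorial's starting point, is often formulated as the Rocchio update. The sketch below uses standard textbook weights, not the book's notation:

```python
def rocchio(query: list, relevant: list, nonrelevant: list,
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> list:
    """Rocchio update: move the query vector toward the centroid of
    relevant document vectors and away from the non-relevant centroid.
    alpha/beta/gamma are conventional default weights."""
    def centroid(vecs):
        if not vecs:
            return [0.0] * len(query)
        return [sum(c) / len(vecs) for c in zip(*vecs)]
    rel_c, non_c = centroid(relevant), centroid(nonrelevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, non_c)]

# Two documents judged relevant, none judged non-relevant:
updated = rocchio([1.0, 0.0], [[0.0, 1.0], [0.0, 0.5]], [])
```

The dynamic-IR techniques the book covers (e.g., POMDPs) generalize this one-shot update into sequential decision making over a whole session.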

Journal ArticleDOI
TL;DR: A new technique to refine information retrieval searches to better represent the user's information need is presented; it enhances retrieval performance by using different query expansion techniques and applying linear combinations between them.
Abstract: Biomedical literature retrieval is becoming increasingly complex, and there is a fundamental need for advanced information retrieval systems. Information Retrieval (IR) programs scour unstructured materials such as text documents in large reserves of data that are usually stored on computers. IR is related to the representation, storage, and organization of information items, as well as to access. One of the main problems in IR is to determine which documents are relevant to the user's needs and which are not. Under the current regime, users cannot construct queries precisely enough to retrieve particular pieces of data from large reserves of data, and basic information retrieval systems produce low-quality search results. In this paper we present a new technique to refine information retrieval searches to better represent the user's information need, enhancing retrieval performance by using different query expansion techniques and applying linear combinations between them, where each combination linearly merges two expansion results at a time. Query expansion expands the search query, for example by finding synonyms and reweighting original terms, providing significantly more focused, particularized search results than basic search queries do. Retrieval performance is measured by variants of MAP (Mean Average Precision). According to our experimental results, the combination of the best query expansion results enhances the retrieved documents and outperforms our baseline by 21.06%, and even outperforms a previous study by 7.12%. We propose several query expansion techniques and their (linear) combinations to make user queries more cognizable to search engines and to produce higher-quality search results.
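The pairwise linear combination of two expansion results described above can be sketched as weighted score fusion. The weight alpha and the exact form of the combination are assumptions for illustration, not the paper's formula:

```python
def linear_combine(scores_a: dict, scores_b: dict, alpha: float) -> dict:
    """Sketch of linearly combining the per-document scores of two
    query-expansion runs: score = alpha * a + (1 - alpha) * b.
    Documents missing from a run contribute a score of 0."""
    docs = set(scores_a) | set(scores_b)
    return {d: alpha * scores_a.get(d, 0.0) + (1 - alpha) * scores_b.get(d, 0.0)
            for d in docs}

# Fuse a synonym-expansion run with a term-reweighting run, equal weights:
fused = linear_combine({"d1": 1.0, "d2": 0.4}, {"d2": 0.9}, alpha=0.5)
```

Ranking by the fused scores gives the combined run that would then be scored with MAP against the individual runs.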

Proceedings ArticleDOI
15 Jun 2016
TL;DR: LIvRE supports image-based queries, which are efficiently matched with the extracted frames of the indexed videos, and consists of three main system components (pre-processing, indexing and retrieval), as well as a scalable and responsive HTML5 user interface accessible from a web browser.
Abstract: The fast growth of video data requires robust, efficient, and scalable systems to allow for indexing and retrieval. These systems must be accessible from lightweight, portable and usable interfaces to help users in management and search of video content. This demo paper presents LIvRE, an extension of an existing open source tool for image retrieval to support video indexing. LIvRE consists of three main system components (pre-processing, indexing and retrieval), as well as a scalable and responsive HTML5 user interface accessible from a web browser. LIvRE supports image-based queries, which are efficiently matched with the extracted frames of the indexed videos.

Proceedings ArticleDOI
01 Sep 2016
TL;DR: A review of information retrieval strategies in web crawling is presented, classifying them into four categories, namely focused, distributed, incremental and hidden web crawlers, and comparing them on the basis of user-customized parameters.
Abstract: In today's scenario, the World Wide Web (WWW) is flooded with a huge amount of information. Due to the growing popularity of the internet, finding meaningful information among billions of information resources on the WWW is a challenging task. Information retrieval (IR) provides documents to end users that satisfy their information need. A search engine is used to extract valuable information from the internet. The web crawler is the principal part of a search engine; it is an automatic script or program which can browse the WWW in an automatic manner, a process known as web crawling. In this paper, a review of information retrieval strategies in web crawling is presented, classified into four categories, namely focused, distributed, incremental and hidden web crawlers. Finally, a comparative analysis of the various IR strategies is performed on the basis of user-customized parameters.
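A focused crawler, the first of the four categories, can be sketched as a breadth-first traversal that only expands pages an (assumed) topic classifier accepts. The link graph below is a toy stand-in for the Web; a real crawler fetches and parses pages over HTTP:

```python
from collections import deque

# Toy link graph and a toy "classifier" output (pages judged on-topic).
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
TOPIC = {"seed", "a", "c"}

def focused_crawl(seed: str, limit: int) -> list:
    """Breadth-first crawl that follows outlinks only from pages deemed
    relevant to the topic: the defining behavior of a focused crawler."""
    frontier, visited, order = deque([seed]), set(), []
    while frontier and len(order) < limit:
        page = frontier.popleft()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        if page in TOPIC:  # expand only on-topic pages
            frontier.extend(LINKS.get(page, []))
    return order

crawled = focused_crawl("seed", 10)
```

Page "b" is visited (it was linked from an on-topic page) but its outlinks would never be followed, which is how a focused crawler prunes the frontier.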

01 Jan 2016
TL;DR: Music information retrieval (MIR) is a multidisciplinary research endeavor that strives to develop innovative content-based searching schemes, novel interfaces, and evolving networked delivery mechanisms in an effort to make the world's vast store of music accessible to all.
Abstract: Music information retrieval (MIR) is “a multidisciplinary research endeavor that strives to develop innovative content‐based searching schemes, novel interfaces, and evolving networked delivery mechanisms in an effort to make the world's vast store of music accessible to all.” MIR was born from computational musicology in the 1960s and has since grown to have links with music cognition and audio engineering, a dedicated annual conference (ISMIR) and an annual evaluation campaign (MIREX). MIR combines machine learning with expert human knowledge to use digital music data – images of music scores, “symbolic” data such as MIDI files, audio, and metadata about musical items – for information retrieval, classification and estimation, or sequence labeling. This chapter gives a brief history of MIR, introduces classical MIR tasks from optical music recognition to music recommendation systems, and outlines some of the key questions and directions for future developments in MIR.

Proceedings ArticleDOI
07 Jul 2016
TL;DR: The Lattes Expertise Retrieval (LExR) test collection for research on academic expertise retrieval has been designed to provide a large-scale benchmark for two complementary expertise retrieval tasks, namely, expert profiling and expert finding.
Abstract: Expertise retrieval has been the subject of intense research over the past decade, particularly with the public availability of benchmark test collections for expertise retrieval in enterprises. Another domain which has seen comparatively less research on expertise retrieval is academic search. In this paper, we describe the Lattes Expertise Retrieval (LExR) test collection for research on academic expertise retrieval. LExR has been designed to provide a large-scale benchmark for two complementary expertise retrieval tasks, namely, expert profiling and expert finding. Unlike currently available test collections, which fully support only one of these tasks, LExR provides graded relevance judgments performed by expert judges separately for each task. In addition, LExR is both cross-organization and cross-area, encompassing candidate experts from all areas of knowledge working in research institutions all over Brazil. As a result, it constitutes a valuable resource for fostering new research directions on expertise retrieval in an academic setting.
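A common baseline for the expert finding task that collections like LExR support is a document-centric model: score each candidate by aggregating the query relevance of the documents associated with them. The toy sketch below illustrates that general approach; the relevance scores and author names are invented for illustration and are not part of LExR:

```python
def expert_scores(doc_relevance, authorship):
    """Document-centric expert finding: a candidate's score is the
    sum of the relevance scores of the documents they authored."""
    scores = {}
    for doc, authors in authorship.items():
        rel = doc_relevance.get(doc, 0.0)
        for author in authors:
            scores[author] = scores.get(author, 0.0) + rel
    return scores

# Hypothetical per-document relevance to a query, and author lists.
doc_relevance = {"d1": 0.9, "d2": 0.4, "d3": 0.7}
authorship = {"d1": ["ana"], "d2": ["ana", "bruno"], "d3": ["bruno"]}
ranking = sorted(expert_scores(doc_relevance, authorship).items(),
                 key=lambda kv: kv[1], reverse=True)
```

The complementary expert profiling task inverts the direction: given a candidate, rank the topics their documents are most relevant to.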

Proceedings ArticleDOI
11 Apr 2016
TL;DR: VizioMetrix is a platform that extracts visual information from the scientific literature and makes it available for use in new information retrieval applications and for studies that look at patterns of visual information across millions of papers.
Abstract: We present VizioMetrix, a platform that extracts visual information from the scientific literature and makes it available for use in new information retrieval applications and for studies that look at patterns of visual information across millions of papers. New ideas are conveyed visually in the scientific literature through figures --- diagrams, photos, visualizations, tables --- but these visual elements remain ensconced in the surrounding paper and are difficult to use directly for information discovery tasks or longitudinal analytics. Very few applications in information retrieval, academic search, or bibliometrics make direct use of the figures, and none attempt to recognize and exploit the type of figure, which can be used to augment interactions with a large corpus of scholarly literature. The VizioMetrix platform processes a corpus of documents, classifies the figures, organizes the results into a cloud-hosted database, and drives three distinct applications to support bibliometric analysis and information retrieval. The first application supports information retrieval tasks by allowing rapid browsing of classified figures. The second application supports longitudinal analysis of visual patterns in the literature and facilitates data mining of these figures. The third application supports crowdsourced tagging of figures to improve classification, augment search, and facilitate new kinds of analyses. Our initial corpus is the entirety of PubMed Central (PMC) and will be released to the public alongside this paper; we welcome other researchers to make use of these resources.

Journal ArticleDOI
TL;DR: Experimental results indicate that CR can significantly improve the retrieval performance with minimum effort and can provide a notably convenient user experience.