Proceedings ArticleDOI

Figure Metadata Extraction from Digital Documents

TL;DR: This work describes the very first step in indexing, classification, and data extraction from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task.
Abstract: Academic papers contain multiple figures (information graphics) representing important findings and experimental results. Automatic data extraction from such figures and classification of information graphics is not straightforward, though it is a well-studied problem in document analysis [4275059]. Also, very few digital library search engines index figures and/or associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification, and data extraction from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features for figure caption extraction from PDF documents are considered for both rule-based and machine learning based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.
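As a minimal sketch of the rule-based side described above (illustrative only, not the authors' implementation), a caption detector can flag text lines that begin with a lexical cue such as "Figure 3." or "Fig. 2:", one of the lexical features the abstract mentions:

```python
import re

# Hypothetical rule-based caption detector: lines starting with a
# figure-caption cue ("Figure N." / "Fig. N:") are treated as captions.
CAPTION_RE = re.compile(r"^(Fig(?:ure)?\.?)\s+(\d+)\s*[.:]", re.IGNORECASE)

def find_caption_lines(lines):
    """Return (line_index, figure_number, text) for likely caption lines."""
    hits = []
    for i, line in enumerate(lines):
        m = CAPTION_RE.match(line.strip())
        if m:
            hits.append((i, int(m.group(2)), line.strip()))
    return hits

page = [
    "Results are shown below.",
    "Figure 2: Precision of the extractor on 150K documents.",
    "Fig. 3. Recall as a function of corpus size.",
]
print(find_caption_lines(page))
```

A real system would combine such lexical cues with the layout and font features the paper names (e.g., caption lines set in a smaller font below a figure region).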


Citations
Journal ArticleDOI
TL;DR: The background and state of the art of scholarly data management and relevant technologies are examined, and data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data are reviewed.
Abstract: With the rapid growth of digital publishing, harvesting, managing, and analyzing scholarly information have become increasingly challenging. The term Big Scholarly Data is coined for the rapidly growing scholarly data, which contains information including millions of authors, papers, citations, figures, tables, as well as scholarly networks and digital libraries. Nowadays, various scholarly data can be easily accessed and powerful data analysis technologies are being developed, which enable us to look into science itself with a new perspective. In this paper, we examine the background and state of the art of big scholarly data. We first introduce the background of scholarly data management and relevant technologies. Second, we review data analysis methods, such as statistical analysis, social network analysis, and content analysis for dealing with big scholarly data. Finally, we look into representative research issues in this area, including scientific impact evaluation, academic recommendation, and expert finding. For each issue, the background, main challenges, and latest research are covered. These discussions aim to provide a comprehensive review of this emerging area. This survey paper concludes with a discussion of open issues and promising future directions.

234 citations


Cites background from "Figure Metadata Extraction from Dig..."

  • ...In this section, we introduce related scholarly data collection techniques, digital libraries, search engines, and academic social networks....


Journal ArticleDOI
TL;DR: This research paper investigates the current trends and identifies the existing challenges in development of a big scholarly data platform, with specific focus on directions for future research and maps them to the different phases of the big data lifecycle.
Highlights:
  • Survey of big scholarly data with respect to the different phases of the big data lifecycle.
  • Identifies the different big data tools and technologies that can be used for the development of scholarly applications.
  • Investigates research challenges and limitations specific to big scholarly data and its applications.
  • Provides research directions and paves the way towards the development of a generic and comprehensive big scholarly data platform.

Abstract: Recently, there has been a shifting focus of organizations and governments towards digitization of academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of this generated data satisfy the big data definition; as a result, this scholarly reserve is popularly referred to as big scholarly data. In order to facilitate data analytics for big scholarly data, architectures and services for the same need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications like collaborator discovery, expert finding and research recommendation systems, in addition to several others. This research paper investigates the current trends and identifies the existing challenges in the development of a big scholarly data platform, with specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.

104 citations

Proceedings Article
01 Apr 2015
TL;DR: This work introduces a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them and demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures.
Abstract: Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. While many "off-the-shelf" tools exist that can extract embedded images from these documents, e.g. PDFBox, Poppler, etc., these tools are unable to extract tables, captions, and figures composed of vector graphics. Our proposed approach analyzes the structure of individual pages of a document by detecting chunks of body text, and locates the areas wherein figures or tables could reside by reasoning about the empty regions within that text. This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article's text. Our algorithm also demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures. Our contribution also includes methods for leveraging particular consistency and formatting assumptions to identify titles, body text and captions within each article. We introduce a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them. Our algorithm achieves 96% precision at 92% recall when tested against this dataset, surpassing previous state of the art. We release our dataset, code, and evaluation scripts on our project website for enabling future research.
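The caption-to-figure matching idea above can be sketched with a toy proximity rule (names and geometry are illustrative assumptions, not the authors' code): pair each caption with the nearest figure region directly above it on the page.

```python
# Hypothetical proximity-based caption-to-figure matching.
# Boxes are (x0, y0, x1, y1) with y increasing downward; each caption is
# paired with the figure whose bottom edge sits closest above the caption.
def match_captions(figures, captions):
    matches = {}
    for ci, (cx0, cy0, cx1, cy1) in enumerate(captions):
        best, best_gap = None, float("inf")
        for fi, (fx0, fy0, fx1, fy1) in enumerate(figures):
            gap = cy0 - fy1          # vertical gap: figure bottom to caption top
            if 0 <= gap < best_gap:  # the figure must sit above the caption
                best, best_gap = fi, gap
        matches[ci] = best
    return matches

figures = [(50, 100, 300, 250), (50, 400, 300, 550)]
captions = [(50, 260, 300, 280), (50, 560, 300, 580)]
print(match_captions(figures, captions))  # → {0: 0, 1: 1}
```

The paper's actual component also handles captions adjacent to multiple figures; a greedy nearest-gap rule like this is only the simplest starting point.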

68 citations

Journal ArticleDOI
TL;DR: In this article, the authors use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into five figure types and study the resulting patterns of visual information as they relate to scholarly impact.
Abstract: Scientific results are communicated visually in the literature through diagrams, visualizations, and photographs. These information-dense objects have been largely ignored in bibliometrics and scientometrics studies when compared to citations and text. In this paper, we use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into five figure types and study the resulting patterns of visual information as they relate to scholarly impact. We find that the distribution of figures and figure types in the literature has remained relatively constant over time, but can vary widely across field and topic. Remarkably, we find a significant correlation between scientific impact and the use of visual information, where higher impact papers tend to include more diagrams, and to a lesser extent more plots. To explore these results and other ways of extracting this visual information, we have built a visual browser to illustrate the concept and explore design alternatives for supporting viziometric analysis and organizing visual information. We use these results to articulate a new research agenda, viziometrics, to study the organization and presentation of visual information in the scientific literature.

52 citations

Journal ArticleDOI
TL;DR: A comprehensive survey of approaches across all components of the automated chart mining pipeline such as automated extraction of charts from documents, processing of multi-panel charts, and datasets for training and evaluation are presented.
Abstract: Charts are useful communication tools for the presentation of data in a visually appealing format that facilitates comprehension. There have been many studies dedicated to chart mining, which refers to the process of automatic detection, extraction and analysis of charts to reproduce the tabular data that was originally used to create them. By allowing access to data which might not be available in other formats, chart mining facilitates the creation of many downstream applications. This paper presents a comprehensive survey of approaches across all components of the automated chart mining pipeline, such as (i) automated extraction of charts from documents; (ii) processing of multi-panel charts; (iii) automatic image classifiers to collect chart images at scale; (iv) automated extraction of data from each chart image, for popular chart types as well as selected specialized classes; (v) applications of chart mining; and (vi) datasets for training and evaluation, and the methods that were used to build them. Finally, we summarize the main trends found in the literature and provide pointers to areas for further research in chart mining.

48 citations

References
Proceedings ArticleDOI
16 Oct 2011
TL;DR: ReVision is a system that automatically redesigns visualizations to improve graphical perception, and applies perceptually-based design principles to populate an interactive gallery of redesigned charts.
Abstract: Poorly designed charts are prevalent in reports, magazines, books and on the Web. Most of these charts are only available as bitmap images; without access to the underlying data it is prohibitively difficult for viewers to create more effective visual representations. In response we present ReVision, a system that automatically redesigns visualizations to improve graphical perception. Given a bitmap image of a chart as input, ReVision applies computer vision and machine learning techniques to identify the chart type (e.g., pie chart, bar chart, scatterplot, etc.). It then extracts the graphical marks and infers the underlying data. Using a corpus of images drawn from the web, ReVision achieves image classification accuracy of 96% across ten chart categories. It also accurately extracts marks from 79% of bar charts and 62% of pie charts, and from these charts it successfully extracts data from 71% of bar charts and 64% of pie charts. ReVision then applies perceptually-based design principles to populate an interactive gallery of redesigned charts. With this interface, users can view alternative chart designs and retarget content to different visual styles.

258 citations


"Figure Metadata Extraction from Dig..." refers background in this paper

  • ...Classification of figures in academic documents has been explored extensively [6], [8]....


Proceedings ArticleDOI
25 Jun 2007
TL;DR: An approach for classifying images of charts based on the shape and spatial relationships of their primitives and two novel features to represent the structural information based on region segmentation and curve saliency are introduced.
Abstract: We present an approach for classifying images of charts based on the shape and spatial relationships of their primitives. Five categories are considered: bar-charts, curve-plots, pie-charts, scatter-plots and surface-plots. We introduce two novel features to represent the structural information based on (a) region segmentation and (b) curve saliency. The local shape is characterized using the Histograms of Oriented Gradients (HOG) and the Scale Invariant Feature Transform (SIFT) descriptors. Each image is represented by sets of feature vectors of each modality. The similarity between two images is measured by the overlap in the distribution of the features -measured using the Pyramid Match algorithm. A test image is classified based on its similarity with training images from the categories. The approach is tested with a database of images collected from the Internet.
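A much-simplified cousin of the HOG descriptor named above can illustrate the idea of orientation-binned gradient features (a toy sketch under stated assumptions, not the authors' implementation):

```python
import math

# Toy oriented-gradient histogram over a small grayscale image given as
# a list of rows of intensities. Central-difference gradients are binned
# by unsigned orientation, weighted by gradient magnitude.
def orientation_histogram(img, bins=4):
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % math.pi   # unsigned orientation
            hist[min(int(ang / math.pi * bins), bins - 1)] += mag
    return hist

# A vertical edge: gradients point horizontally, so bin 0 dominates.
img = [[0, 0, 9, 9]] * 4
print(orientation_histogram(img))  # → [36.0, 0.0, 0.0, 0.0]
```

Real HOG additionally normalizes over local cells and blocks; histograms like this one are what let bar charts (dominated by axis-aligned edges) be separated from curve plots and scatter plots.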

67 citations


"Figure Metadata Extraction from Dig..." refers background in this paper

  • ...Classification of figures in academic documents has been explored extensively [6], [8]....


Journal ArticleDOI
TL;DR: This paper describes a system that extracts data from documents fully automatically, completely eliminating the need for human intervention, and has the potential to be a vital component in high volume digital libraries.
Abstract: Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure’s legend and their associated labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our data extraction system has the potential to be a vital component in high volume digital libraries.

54 citations


"Figure Metadata Extraction from Dig..." refers methods in this paper

  • ...Figures were analyzed extensively, with attempts to vectorize raster images [2] or extract data from 2D plots and solid line curves [5]....


Journal ArticleDOI
TL;DR: A set of methods to extract useful information (synopsis) related to document-elements automatically based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element are presented.
Abstract: Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of document-elements, some document metadata such as the title of the papers and their authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, generally, the author of a document assumes that the caption will be read in the context of the text in the document. When the caption is presented out of context as in a document-element-search-engine result, it may not contain enough information to help the end-user understand what the content of the document-element is. Consequently, end-users examining document-element search results would want a short “synopsis” of this information presented along with the document-element. Having access to the synopsis allows the end-user to quickly understand the content of the document-element without having to download and read the entire document as examining the synopsis takes a shorter time than finding information about a document element by downloading, opening and reading the file. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to extract this useful information (synopsis) related to document-elements automatically. We use Naive Bayes and support vector machine classifiers to identify relevant sentences from the document text based on the similarity and the proximity of the sentences with the caption and the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of different features used. 
We also investigate the problem of choosing the optimum synopsis-size that strikes a balance between the information content and the size of the generated synopses. A user study is also performed to measure how the synopses generated by our proposed method compare with other state-of-the-art approaches.
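The caption-similarity feature described above can be sketched with a toy word-overlap score (an illustrative stand-in, not the paper's Naive Bayes/SVM pipeline): rank document sentences by Jaccard overlap with the caption and keep the top-k as a synopsis.

```python
# Hypothetical synopsis extraction by caption similarity.
def tokens(text):
    return {w.lower().strip(".,:;()") for w in text.split()}

def synopsis(sentences, caption, k=2):
    """Return the k sentences with the highest word overlap with the caption."""
    cap = tokens(caption)
    scored = sorted(
        sentences,
        key=lambda s: len(tokens(s) & cap) / (len(tokens(s) | cap) or 1),
        reverse=True,
    )
    return scored[:k]

caption = "Figure 4: Throughput of the indexing pipeline."
sents = [
    "We evaluate the indexing pipeline on 150K documents.",
    "Related work is discussed in Section 2.",
    "Throughput of the pipeline improves with batching.",
]
print(synopsis(sents, caption, k=2))
```

The paper's classifiers additionally use proximity to the sentences that explicitly refer to the document-element, which a pure similarity score ignores.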

50 citations


"Figure Metadata Extraction from Dig..." refers background in this paper

  • ...Therefore, our precision should be equal or better than theirs....


Proceedings ArticleDOI
26 Jul 2009
TL;DR: Two algorithms to recover the sequence of extracted sparse lines, which improve the table content collection are proposed and the experimental results show the comparison of the performance of both algorithms, and the effectiveness of text sequence recovering for the table boundary detection.
Abstract: As the rapid growth of PDF documents, recognizing the document structure and components are useful for document storage, classification and retrieval. Table, a ubiquitous document component, becomes an important information source. Accurately detecting the table boundary plays a crucial role for many applications, e.g., the increasing demand on the table data search. Rather than converting PDFs to image or HTML and then processing with other techniques (e.g., OCR), extracting and analyzing texts from PDFs directly is easy and accurate. However, text extraction tools face a common problem: text sequence error. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improve the table content collection. The experimental results show the comparison of the performance of both algorithms, and demonstrate the effectiveness of text sequence recovering for the table boundary detection.
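The sequence-recovery problem above can be sketched as a re-sort of out-of-order text fragments (an illustrative assumption-laden toy, not the paper's two algorithms): group fragments whose baselines fall within a small vertical tolerance into lines, top-to-bottom then left-to-right.

```python
# Hypothetical line-sequence recovery for text fragments extracted
# from a PDF. fragments: list of (x, y, text) with y increasing downward.
def recover_sequence(fragments, y_tol=2.0):
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))
    lines, current, last_y = [], [], None
    for x, y, text in ordered:
        if last_y is not None and abs(y - last_y) > y_tol:
            # Baseline jump: flush the current line, left-to-right.
            lines.append(" ".join(t for _, t in sorted(current)))
            current = []
        current.append((x, text))
        last_y = y
    if current:
        lines.append(" ".join(t for _, t in sorted(current)))
    return lines

frags = [(120, 50.5, "Value"), (20, 50.0, "Metric"),
         (20, 70.0, "Recall"), (120, 70.2, "0.92")]
print(recover_sequence(frags))  # → ['Metric Value', 'Recall 0.92']
```

Once fragments are back in reading order, sparse-line patterns like the two-column rows above are what table-boundary detection operates on.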

38 citations


"Figure Metadata Extraction from Dig..." refers background or methods in this paper

  • ...Search engines on specific document elements such as tables [3] or acknowledged entities have been reported earlier....


  • ...We gratefully acknowledge partial support from the National Science Foundation and suggestions by Suppawong Tuarob, Shibamouli Lahiri, Lior Rokach, Madian Khabsa and Hung-Hsuan Chen....
