Open Access Journal Article

CERMINE: automatic extraction of structured metadata from scientific literature

TLDR
The overall workflow architecture of CERMINE is outlined and the implementation of its individual steps is described; an evaluation of the extraction workflow on a large dataset showed good performance for most metadata types.
Abstract
CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. The system is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, makes it easy to improve or replace independent parts of the algorithm, and facilitates future expansion of the architecture. Most steps are implemented with supervised and unsupervised machine learning techniques, which simplifies adapting the system to new document layouts and styles. An evaluation of the extraction workflow on a large dataset showed good performance for most metadata types, with an average F-score of 77.5%. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementation of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology, and report its results.


Citations
Posted Content

Computing Graph Neural Networks: A Survey from Algorithms to Accelerators

TL;DR: A review of the field of GNNs is presented from the perspective of computing, and an in-depth analysis of current software and hardware acceleration schemes is provided, from which a hardware-software, graph-aware, and communication-centric vision for GNN accelerators is distilled.
Journal Article

Information extraction from scientific articles: a survey

TL;DR: In this article, the authors present the overall progress in automatic information extraction from scientific articles and classify the information extracted from scientific documents into two broad categories, i.e., metadata and key insights.
Journal Article

Citation recommendation: approaches and datasets

TL;DR: Citation recommendation, the task of recommending citations for a given text, has emerged as an important research topic, driven by the need to cite the most appropriate publications when writing scientific texts.
Journal Article

The NIH Open Citation Collection: A public access, broad coverage resource.

TL;DR: The NIH Open Citation Collection (NIH-OCC), a public access database for biomedical research that is made freely available to the community, is described and data from unrestricted data sources such as MedLine, PubMed Central, and CrossRef are included.
Journal Article

Personalized Content Extraction and Text Classification Using Effective Web Scraping Techniques

TL;DR: The proposed model performs well, and the final dataset is trained with standard machine learning techniques.