CERMINE: automatic extraction of structured metadata from scientific literature

doi:10.1007/S10032-015-0249-8

Open Access, Journal Article

Dominika Tkaczyk, +4 more

- 01 Dec 2015 -

International Journal on Document Analys...

- Vol. 18, Iss: 4, pp 317-335

TLDR

The overall workflow architecture of CERMINE is outlined, details of the individual step implementations are provided, and an evaluation of the extraction workflow on a large dataset showed good performance for most metadata types.

Abstract:

CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. The system is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, makes it easy to improve or replace independent parts of the algorithm, and facilitates future expansion of the architecture. Most steps are implemented with supervised and unsupervised machine learning techniques, which simplifies adapting the system to new document layouts and styles. An evaluation of the extraction workflow on a large dataset showed good performance for most metadata types, with an average F score of 77.5%. CERMINE is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details of the individual step implementations. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology, and report its results.
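
To make the reported metric concrete: an F score combines precision and recall into a single number. The following is a minimal sketch, assuming the balanced F1 variant (harmonic mean of precision and recall) and an unweighted average across metadata types; the abstract does not state the exact evaluation protocol, and the function names and counts below are hypothetical illustrations, not figures from the paper.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F score (F1): harmonic mean of precision and recall.

    Assumption: the balanced variant is used; the abstract does not say which.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


def average_f_score(per_type_scores: dict) -> float:
    """Unweighted mean of per-metadata-type F scores (an assumed averaging scheme)."""
    return sum(per_type_scores.values()) / len(per_type_scores)


# Hypothetical (tp, fp, fn) counts for two metadata types; not data from the paper.
hypothetical_counts = {
    "title": (90, 10, 10),
    "authors": (80, 20, 20),
}
per_type = {name: f_score(*counts) for name, counts in hypothetical_counts.items()}
print(per_type)
print(average_f_score(per_type))
```

Averaging per-type scores without weighting treats every metadata type equally, regardless of how often it occurs in the dataset.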

Citations

Posted Content

Computing Graph Neural Networks: A Survey from Algorithms to Accelerators

Sergi Abadal, +4 more

- 30 Sep 2020 -

arXiv: Learning

TL;DR: A review of the field of GNNs is presented from the perspective of computing, and an in-depth analysis of current software and hardware acceleration schemes is provided, from which a hardware-software, graph-aware, and communication-centric vision for GNN accelerators is distilled.

Journal Article

Information extraction from scientific articles: a survey

Zara Nasar, +2 more

- 01 Dec 2018 -

Scientometrics

TL;DR: In this article, the authors present the overall progress in automatic information extraction from scientific articles and classify the information extracted from scientific documents into two broad categories: metadata and key insights.

Journal Article

Citation recommendation: approaches and datasets

Michael Färber, +1 more

- 01 Dec 2020 -

International Journal on Digital Librari...

TL;DR: Citation recommendation is the task of recommending citations for a given text. Driven by the need to cite the most appropriate publications when writing scientific texts, it has emerged as an important research topic.

Journal Article

The NIH Open Citation Collection: A public access, broad coverage resource.

B. Ian Hutchins, +9 more

- 10 Oct 2019 -

PLOS Biology

TL;DR: The NIH Open Citation Collection (NIH-OCC), a public-access database for biomedical research that is freely available to the community, is described; it includes data from unrestricted sources such as MedLine, PubMed Central, and CrossRef.

Journal Article

Personalized Content Extraction and Text Classification Using Effective Web Scraping Techniques

Tharun Karthikeyan, +4 more

- 01 Jul 2019 -

International Journal of Web Portals

TL;DR: The proposed model performs well, and standard machine learning techniques are trained on the final dataset.

References

Journal Article

LIBSVM: A library for support vector machines

Chih-Chung Chang, +1 more

- 06 May 2011 -

ACM Transactions on Intelligent Systems ...

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

Journal Article

Identification of common molecular subsequences.

Temple F. Smith, +1 more

- 25 Mar 1981 -

Journal of Molecular Biology

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

Journal Article

Automating the Construction of Internet Portals with Machine Learning

Andrew McCallum, +3 more

- 21 Jul 2000 -

Information Retrieval

TL;DR: New research in reinforcement learning, information extraction, and text classification is described that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies.

Proceedings Article

CiteSeer: an automatic citation indexing system

C. Lee Giles, +2 more

TL;DR: CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.

Journal Article

The document spectrum for page layout analysis

Lawrence O'Gorman

- 01 Nov 1993 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components; it yields an accurate measure of skew and of within-line and between-line spacings, and locates text lines and text blocks.

Related Papers (5)

GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications

ParsCit: an Open-source CRF Reference String Parsing Package

PDFX: fully-automated PDF-to-XML conversion of scientific literature

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Long short-term memory