CERMINE: automatic extraction of structured metadata from scientific literature
TLDR
The overall workflow architecture of CERMINE is outlined, details about the implementation of individual steps are provided, and an evaluation of the extraction workflow on a large dataset showed good performance for most metadata types.
Abstract
CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. The system is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, makes it easy to improve or replace independent parts of the algorithm, and facilitates future expansion of the architecture. Most steps are implemented with supervised and unsupervised machine learning techniques, which simplifies adapting the system to new document layouts and styles. An evaluation of the extraction workflow on a large dataset showed good performance for most metadata types, with an average F score of 77.5%. CERMINE is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementation of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology, and report its results.
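The abstract reports an average F score across metadata types without stating how it is averaged. As a rough illustration only, the sketch below computes a per-type F score (the harmonic mean of precision and recall) and then takes an unweighted mean over types. The `FieldCounts` class, the function names, the choice of an unweighted mean, and all counts are assumptions made for this example; they are not taken from CERMINE or its evaluation data.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class FieldCounts:
    """Hypothetical match counts for one metadata type (illustrative only)."""
    true_positives: int   # extracted values that match the ground truth
    false_positives: int  # extracted values that do not match
    false_negatives: int  # ground-truth values that were not extracted


def f_score(counts: FieldCounts) -> float:
    """Return F1, the harmonic mean of precision and recall (0.0 if undefined)."""
    predicted = counts.true_positives + counts.false_positives
    expected = counts.true_positives + counts.false_negatives
    if predicted == 0 or expected == 0 or counts.true_positives == 0:
        return 0.0
    precision = counts.true_positives / predicted
    recall = counts.true_positives / expected
    return 2 * precision * recall / (precision + recall)


def average_f_score(per_field: Mapping[str, FieldCounts]) -> float:
    """Unweighted mean of per-type F scores (an assumed averaging scheme)."""
    if not per_field:
        return 0.0
    return sum(f_score(c) for c in per_field.values()) / len(per_field)


# Made-up counts, purely to show the arithmetic.
example = {
    "title": FieldCounts(90, 10, 10),
    "authors": FieldCounts(80, 20, 20),
    "references": FieldCounts(60, 20, 40),
}

if __name__ == "__main__":
    for name, counts in example.items():
        print(f"{name}: F = {f_score(counts):.3f}")
    print(f"average F = {average_f_score(example):.3f}")
```

An unweighted mean treats every metadata type equally regardless of how often it occurs; a weighted scheme would give a different number, so the figure in the abstract should be read against the paper's own evaluation methodology.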
Citations
Posted Content
Computing Graph Neural Networks: A Survey from Algorithms to Accelerators
TL;DR: A review of the field of GNNs is presented from the perspective of computing, and an in-depth analysis of current software and hardware acceleration schemes is provided, from which a hardware-software, graph-aware, and communication-centric vision for GNN accelerators is distilled.
Journal Article
Information extraction from scientific articles: a survey
TL;DR: This article presents the overall progress in automatic information extraction from scientific articles and classifies the insights extracted from scientific documents into two broad categories: metadata and key insights.
Journal Article
Citation recommendation: approaches and datasets
Michael Färber, Adam Jatowt
TL;DR: Citation recommendation is the task of recommending citations for a given text. Driven by the need to cite the most appropriate publications when writing scientific texts, it has emerged as an important research topic.
Journal Article
The NIH Open Citation Collection: A public access, broad coverage resource.
B. Ian Hutchins, Kirk Baker, Matthew T. Davis, Mario A. Diwersy, Ehsanul Haque, Robert M. Harriman, Travis Hoppe, Stephen A. Leicht, Payam Meyer, George M. Santangelo
TL;DR: The NIH Open Citation Collection (NIH-OCC), a public access database for biomedical research that is made freely available to the community, is described and data from unrestricted data sources such as MedLine, PubMed Central, and CrossRef are included.
Journal Article
Personalized Content Extraction and Text Classification Using Effective Web Scraping Techniques
TL;DR: The proposed model performs well, and the final dataset is trained with standard machine learning techniques.
References
Journal Article
LIBSVM: A library for support vector machines
Chih-Chung Chang, Chih-Jen Lin
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Journal Article
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
Journal Article
Automating the Construction of Internet Portals with Machine Learning
TL;DR: New research in reinforcement learning, information extraction, and text classification is described that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies.
Proceedings Article
CiteSeer: an automatic citation indexing system
TL;DR: CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.
Journal Article
The document spectrum for page layout analysis
TL;DR: The document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. It yields an accurate measure of skew and of within-line and between-line spacings, and it locates text lines and text blocks.