
Showing papers on "XML" published in 2020


Book ChapterDOI
23 Aug 2020
TL;DR: This paper proposes an attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code; the model has a structure decoder which reconstructs the table structure and a cell decoder which recognizes cell content.
Abstract: Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g. Portable Document Format (PDF) and images, are difficult to parse into structured machine-readable format, due to complexity and diversity in their structure and style. To facilitate image-based table recognition with deep learning, we develop and release the largest publicly available table recognition dataset PubTabNet (https://github.com/ibm-aur-nlp/PubTabNet.), containing 568k table images with corresponding structured HTML representation. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in PubMed Central™ Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition, which more appropriately captures multi-hop cell misalignment and OCR errors than the pre-established metric. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state-of-the-art by 9.7% absolute TEDS score.
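As a rough illustration of the proposed TEDS metric, the sketch below scores two table structures by tree edit distance over their HTML parse trees. It assumes well-formed markup and the third-party zss package (Zhang-Shasha algorithm), and it ignores cell text, so it is a structure-only approximation of the metric defined in the paper.

```python
# Structure-only sketch of a Tree-Edit-Distance-based Similarity (TEDS) style
# score between two HTML tables. Assumes well-formed markup and the third-party
# `zss` package; the paper's full metric also compares cell content.
import xml.etree.ElementTree as ET
from zss import Node, simple_distance

def to_zss(elem):
    node = Node(elem.tag)
    for child in elem:
        node.addkid(to_zss(child))
    return node

def tree_size(elem):
    return 1 + sum(tree_size(c) for c in elem)

def teds(html_a, html_b):
    a, b = ET.fromstring(html_a), ET.fromstring(html_b)
    dist = simple_distance(to_zss(a), to_zss(b))
    return 1.0 - dist / max(tree_size(a), tree_size(b))

print(teds("<table><tr><td/><td/></tr></table>",
           "<table><tr><td/></tr></table>"))   # 0.75
```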

39 citations


Journal ArticleDOI
TL;DR: This paper presents GTFS-Madrid-Bench, a benchmark to evaluate OBDI engines that can be used for the provision of access mechanisms to virtual knowledge graphs, and introduces several scenarios that aim at measuring the query capabilities, performance and scalability of all these engines, considering their heterogeneity.

28 citations


Journal ArticleDOI
TL;DR: This paper describes Pubmed Parser, a software library to mine PubMed and MEDLINE efficiently that is built on top of Python and can therefore be integrated into a myriad of tools for machine learning such as scikit-learn and deep learning such as tensorflow and pytorch.
Abstract: The number of biomedical publications is increasing exponentially every year. If we had the ability to access, manipulate, and link this information, we could extract knowledge that is perhaps hidden within the figures, text, and citations. In particular, the repositories made available by the PubMed and MEDLINE databases enable these kinds of applications at an unprecedented level. Examples of applications that can be built from this dataset range from predicting novel drug-drug interactions and classifying biomedical text data to searching specific oncological profiles, disambiguating author names, and automatically learning a biomedical ontology. Here, we describe Pubmed Parser (pubmed_parser), a software library to mine PubMed and MEDLINE efficiently. Pubmed Parser is built on top of Python and can therefore be integrated into a myriad of tools for machine learning such as scikit-learn and deep learning such as tensorflow and pytorch.
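A minimal usage sketch of the described tool, assuming the pubmed_parser package is installed and that sample_medline.xml.gz is a local MEDLINE baseline file; the field names follow the project documentation but should be checked against the installed version.

```python
# Minimal Pubmed Parser usage sketch. Assumes `pip install pubmed_parser` and a
# local MEDLINE file named sample_medline.xml.gz; field names per the docs.
import pubmed_parser as pp

records = pp.parse_medline_xml("sample_medline.xml.gz")   # list of dicts
for rec in records[:3]:
    print(rec["pmid"], rec["title"][:80])
```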

17 citations



Proceedings ArticleDOI
20 Nov 2020
TL;DR: This paper discusses a new method for training extraction models directly from the textual value of information and shows that it performs competitively with a standard word classifier without requiring costly word level supervision.
Abstract: The predominant approaches for extracting key information from documents resort to classifiers predicting the information type of each word. However, the word level ground truth used for learning is expensive to obtain since it is not naturally produced by the extraction task. In this paper, we discuss a new method for training extraction models directly from the textual value of information. The extracted information of a document is represented as a sequence of tokens in the XML language. We learn to output this representation with a pointer-generator network that alternately copies the document words carrying information and generates the XML tags delimiting the types of information. The ability of our end-to-end method to retrieve structured information is assessed on a large set of business documents. We show that it performs competitively with a standard word classifier without requiring costly word level supervision.
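To make the training target concrete, the sketch below builds the kind of XML-tagged token sequence the paper describes from a document's extracted field values; the field names are hypothetical and serve only as illustration.

```python
# Sketch of an XML-tagged target sequence of the kind described in the paper:
# extracted field values are serialized as one token stream in which XML tags
# delimit the information types. Field names here are hypothetical.
def to_target_sequence(fields):
    tokens = []
    for name, value in fields.items():
        tokens.append(f"<{name}>")
        tokens.extend(value.split())
        tokens.append(f"</{name}>")
    return tokens

fields = {"invoice_number": "INV-2020-017", "total_amount": "1 234 . 50 EUR"}
print(" ".join(to_target_sequence(fields)))
# <invoice_number> INV-2020-017 </invoice_number> <total_amount> 1 234 . 50 EUR </total_amount>
```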

15 citations


Book
09 Feb 2020
TL;DR: An overall overview of the model and system provided by Author-X for securing XML data, especially conceived for the push dissemination mode, is given; the model allows the specification of both access control and signature policies to satisfy the confidentiality, integrity and authenticity requirements of both the receiving subjects and the information owners.
Abstract: The increasing ability to interconnect computers through internetworking, and the use of the Web as the main means to exchange information, have sped up the development of a new class of information-centered applications focused on the dissemination of XML data. Since it is often the case that data managed by an information system are highly strategic and sensitive, a comprehensive framework for ensuring the main security properties in disseminating XML data is needed. In our poster, we provide an overall overview of both the model and the system provided by Author-X for securing XML data, especially conceived for the push dissemination mode. Our model allows the specification of both access control and signature policies in order to satisfy the confidentiality, integrity and authenticity requirements of both the receiving subjects and the information owners. The Author-X system provides ad hoc techniques for enforcing the specified security policies.

14 citations


Journal ArticleDOI
TL;DR: This paper provides an overview of the state of the art in XML data manipulation in conventional and temporal XML databases, studies the support for such functionality in mainstream commercial DBMSs, and offers remarks on possible future research directions related to this issue.

14 citations


Journal ArticleDOI
TL;DR: Since blockchain-specific security guidance is currently lacking, mapping existing frameworks, such as OWASP, to the blockchain can help in the identification of potential vulnerabilities in blockchain systems.

13 citations


Journal ArticleDOI
01 Jan 2020
TL;DR: This article proposes how content users might benefit from semantic concepts through the delivery of sets of logically connected topics, described as microDocs, which might also play a role in the provisioning of content by web services integrated into different types of content processing and content delivery applications.
Abstract: We address and develop a new concept for the dynamic delivery of topic-based content created within the domain of technical communication. Corresponding content management environments introduced within the last decades have so far focused on semantically structured and mostly XML-based information models and, more recently, on semantic metadata using taxonomies, together leading to concepts of so-called intelligent content. The latest developments attempt to extend these concepts with additional explicit semantic approaches modelled and implemented, for example, using ontologies and related technologies. In this article, we propose how content users might benefit from these semantic concepts through the delivery of sets of logically connected topics, which can be described as microdocuments ("microDocs"). This generic approach of topic assemblies might also play a role in the provisioning of content by web services integrated into different types of content processing and content delivery applications.

12 citations


Journal ArticleDOI
TL;DR: Digital forensics techniques were used to investigate a known case of contract cheating in which the contract author had notified the university and the student subsequently confirmed that they had contracted the work out.
Abstract: Contract cheating is a major problem in Higher Education because it is very difficult to detect using traditional plagiarism detection tools. Digital forensics techniques are already used in law to determine ownership of documents, and also in criminal cases, where it is not uncommon to hide information and images within an ordinary looking document using steganography techniques. These digital forensic techniques were used to investigate a known case of contract cheating where the contract author has notified the university and the student subsequently confirmed that they had contracted the work out. Microsoft Word documents use a format known as Office Open XML Format, and as such, it is possible to review the editing process of a document. A student submission known to have been contracted out was analysed using the revision identifiers within the document, and a tool was developed to review these identifiers. Using visualisation techniques it is possible to see a pattern of editing that is inconsistent with the pattern seen in an authentic document.
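A minimal sketch of the kind of analysis described: Office Open XML keeps the document body in word/document.xml, and its paragraphs and runs carry rsid attributes that record editing sessions. The snippet below (assuming a local submission.docx) tallies those identifiers with the standard library; it is an illustration of the idea, not the authors' tool.

```python
# Extract revision identifiers (rsid attributes) from a Word document.
# Office Open XML stores the main body in word/document.xml; paragraphs and
# runs carry w:rsidR-style attributes recording editing sessions.
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def rsid_histogram(docx_path):
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    counts = Counter()
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            if attr.startswith("{%s}rsid" % W):
                counts[value] += 1
    return counts

# A submission edited in very few sessions shows a suspiciously flat histogram.
for rsid, n in rsid_histogram("submission.docx").most_common(10):
    print(rsid, n)
```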

11 citations


Journal ArticleDOI
TL;DR: The current CTWLADE can map the data required and provided by the hydraulic software tool Storm Water Management Model (SWMM) and is ready to be integrated into a Web 3D Service to provide data for 3D dynamic visualization in interactive scenes.
Abstract: Urban flooding, as one of the most serious natural disasters, has caused considerable personal injury and property damage throughout the world. To better cope with the problem of waterlogging, experts have developed many waterlogging models that can accurately simulate the process of pipe network drainage and water accumulation. The study of urban waterlogging involves many data types. These data come from the departments of hydrology, meteorology, planning, surveying, and mapping, among others. The incoordination of space–time scales and format standards has brought huge obstacles to the study of urban waterlogging and is not conducive to interpretation, transmission, and visualization in today's network environment. In this paper, the entities and attributes related to waterlogging are defined. Based on the five modules of urban drainage network, sub-basin, dynamic water body, time series, and meteorological data, the corresponding UML (Unified Modeling Language) model is designed and constructed. On this basis, the urban waterlogging application domain extension model, City Waterlogging Application Domain Extension (CTWLADE), is established. According to the characteristics of different types of data, two different methods based on FME Objects and citygml4j are proposed to realize the corresponding data integration, and a KML (Keyhole Markup Language)/glTF data organization form with a corresponding sharing method is proposed to solve the problem that CTWLADE model data cannot be visualized directly on the web and cannot be interacted with in three dimensions. To evaluate the CTWLADE, a prototype system was implemented, which can convert waterlogging-related multi-source data into Extensible Markup Language (XML) files conforming to the CTWLADE. The current CTWLADE can map the data required and provided by the hydraulic software tool Storm Water Management Model (SWMM) and is ready to be integrated into a Web 3D Service to provide data for 3D dynamic visualization in interactive scenes.

Proceedings Article
Harry Bunt1
01 May 2020
TL;DR: The focus in this paper is on the structuring of the semantic information needed to characterise quantification in natural language and the representation of these structures in QuantML.
Abstract: This paper discusses the current state of developing an ISO standard annotation scheme for quantification phenomena in natural language, as part of the ISO Semantic Annotation Framework (ISO 24617). A proposed approach that combines ideas from the theory of generalised quantifiers and from neo-Davidsonian event semantics was adopted by the ISO organisation in 2019 as a starting point for developing such an annotation scheme. This scheme consists of (1) a conceptual ‘metamodel’ that visualises the types of entities, functions and relations that go into annotations of quantification; (2) an abstract syntax which defines ‘annotation structures’ as triples and other set-theoretic constructs; (3) an XML-based representation of annotation structures (‘concrete syntax’); and (4) a compositional semantics of annotation structures. The latter three components together define the interpreted markup language QuantML. The focus in this paper is on the structuring of the semantic information needed to characterise quantification in natural language and the representation of these structures in QuantML.

Journal ArticleDOI
TL;DR: This study reports a novel platform to store and query massive XML-based biological data collections, and a formal approach to transform the XML query model into the MapReduce query model is proposed.
Abstract: Publishing biological data in XML formats is attractive for organizations who would like to provide their bioinformatics resources in an extensible and machine-readable format. In the era of big data, massive XML-based biological data management has emerged as a challenging issue. With the continuous growth of XML-based biological data sets, it is usually frustrating to use traditional declarative query languages to provide efficient query capabilities in terms of processing speed and scale. In this study, we report a novel platform to store and query massive XML-based biological data collections. A prototype tool for constructing HBase tables from XML-based biological data collections is first developed, and then a formal approach to transform the XML query model into the MapReduce query model is proposed. Finally, an evaluation of the query performance of the proposed approach on existing XML-based biological databases is presented, showing the performance advantages of the proposed solution. The source code of the massive XML-based biological data management platform is freely available at https://github.com/lyotvincent/X2H.
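As an illustration of the loading step, the sketch below flattens an XML record into HBase-style (row key, column qualifier, value) cells using only the standard library; the element names are made up, and the actual HBase write (e.g. through a Thrift client) is left out.

```python
# Flatten an XML record into HBase-style (row key, column qualifier, value)
# cells. Element names are illustrative; the HBase write itself is omitted.
import xml.etree.ElementTree as ET

def xml_to_cells(xml_text, row_key_tag):
    root = ET.fromstring(xml_text)
    row_key = root.findtext(row_key_tag)
    cells = []
    for elem in root.iter():
        if elem is root or not (elem.text and elem.text.strip()):
            continue
        cells.append((row_key, "d:" + elem.tag, elem.text.strip()))
    return cells

record = "<entry><id>P12345</id><name>Example protein</name><organism>E. coli</organism></entry>"
for cell in xml_to_cells(record, "id"):
    print(cell)
```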

Journal ArticleDOI
TL;DR: This paper provides an insight into the relationship between the two standards and a methodology for the conversion from one to the other, and the process of developing software to perform such conversion.
Abstract: The trend of increased usage of both BIM and 3D GIS and the similarity between the two has led to an increase in the overlap between them. A key application of such overlap is providing geospatial context data for BIM models through importing 3D GIS-data to BIM software to help in different design-related issues. However, this is currently difficult because of the lack of support in BIM software for the formats and data models of 3D Geo-information. This paper deals with this issue by developing and implementing a methodology to convert the common open 3D city model data model into the most common open BIM data format, namely CityGML (Groger et al., 2012) to IFC (buildingsmart, 2019b). For the aim of this study, the two standards are divided into 5 comparable subparts: Semantics, Geometry, Geographical coordinates, Topology, and Encoding. The characteristics of each of these subparts are studied and a conversion method is proposed for each of them from the former standard to the latter. This is done by performing a semantic and geometrical mapping between the two standards, converting the georeferencing from global to local, converting the encoding that the two standards use from XML to STEP, and deciding which topological relations are to be retained. A prototype implementation has been created using Python to combine the above tasks. The work presented in this paper can provide a foundation for future work in converting CityGML to IFC. It provides an insight into the relationship between the two standards and a methodology for the conversion from one to the other, and the process of developing software to perform such conversion. This is done in a way that can be extended for future specific needs.
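One step of the described methodology is converting the georeferencing from global to local. A minimal sketch of that idea: globally referenced CityGML coordinates are shifted to a chosen local engineering origin, since IFC models are typically placed near a local origin. The coordinates below are illustrative.

```python
# Shift globally referenced (projected CRS) coordinates to a local engineering
# origin, as done when moving CityGML geometry into an IFC local placement.
def to_local(points, origin):
    ox, oy, oz = origin
    return [(x - ox, y - oy, z - oz) for x, y, z in points]

global_footprint = [(440400.2, 4474520.7, 667.0), (440412.8, 4474528.1, 667.0)]
site_origin = (440400.0, 4474520.0, 667.0)
print(to_local(global_footprint, site_origin))   # small local coordinates
```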

Journal ArticleDOI
TL;DR: The authors propose a classification of XML-based vulnerabilities, based on an exhaustive literature survey, that will help web developers build secure parsers to thwart such attacks.
Abstract: XML-based attacks are executed in web applications through crafted XML documents that force the XML parser to process un-validated documents. This leads to disclosure of sensitive information, maliciou...
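A hedged example of the defensive parsing such classifications motivate: the third-party defusedxml package (an assumption here, not something the paper prescribes) rejects constructs behind common XML attacks, such as external entity expansion (XXE), instead of silently processing them.

```python
# Parse untrusted XML defensively. The defusedxml package (assumed installed:
# `pip install defusedxml`) refuses entity declarations by default, blocking
# common XML-based attacks such as XXE and entity-expansion bombs.
import defusedxml.ElementTree as ET
from defusedxml import EntitiesForbidden

payload = """<?xml version="1.0"?>
<!DOCTYPE data [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>
<data>&xxe;</data>"""

try:
    ET.fromstring(payload)
except EntitiesForbidden:
    print("Rejected: document declares entities (possible XXE attempt)")
```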


Proceedings ArticleDOI
17 Sep 2020
TL;DR: This paper focuses on IEEE 1599 applicability to music education and geographically-distributed performance in this period of self-isolation and home schooling due to Coronavirus disease.
Abstract: IEEE 1599 is an international standard aiming to describe music and music-related information through a multilayer approach. The idea is to organize multiple and heterogeneous materials related to a single musical piece within a hierarchical and highly-interconnected structure expressed via the XML syntax, so as to support multimodal and synchronized experience of music content. Another relevant feature is the possibility to enjoy IEEE 1599 materials via network, thanks to ad hoc Web applications already publicly available. This paper focuses on IEEE 1599 applicability to music education and geographically-distributed performance in this period of self-isolation and home schooling due to Coronavirus disease. Moreover, the lesson learned from the experimentation during the emergency phase is inspiring some improvements for the format, which is currently under revision by the IEEE Working Group for XML Musical Application.

Proceedings ArticleDOI
23 Aug 2020
TL;DR: In this article, the authors present approaches for information extraction from Web data that can be differentiated along two key dimensions: 1) the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and 2) the thrust to develop scalable approaches with zero to limited human supervision.
Abstract: How do we surface the large amount of information present in HTML documents on the Web, from news articles to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications including knowledge base construction, question answering, recommendation, and more. In this tutorial, we present approaches for information extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and 2) the thrust to develop scalable approaches with zero to limited human supervision.

Journal ArticleDOI
TL;DR: The finding is that the performance of XML data in a health care environment is better with BaseX than with eXist-DB, and that analytic SQL outperforms XQuery for analytical functions.

Posted Content
TL;DR: This paper builds and evaluates translation models for seven target languages from English, with several different copy mechanisms and an XML-constrained beam search, and provides a detailed human analysis of gaps between the model output and human translations for real-world applications, including suitability for post-editing.
Abstract: This paper presents a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text. Unlike widely-used datasets for translation of plain text, we collect XML-structured parallel text segments from the online documentation for an enterprise software platform. These Web pages have been professionally translated from English into 16 languages and maintained by domain experts, and around 100,000 text segments are available for each language pair. We build and evaluate translation models for seven target languages from English, with several different copy mechanisms and an XML-constrained beam search. We also experiment with a non-English pair to show that our dataset has the potential to explicitly enable 17 × 16 translation settings. Our experiments show that learning to translate with the XML tags improves translation accuracy, and the beam search accurately generates XML structures. We also discuss trade-offs of using the copy mechanisms by focusing on translation of numerical words and named entities. We further provide a detailed human analysis of gaps between the model output and human translations for real-world applications, including suitability for post-editing.
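As an illustration of the constraint behind an XML-constrained beam search (not the paper's implementation), the sketch below computes which tag tokens keep a partial hypothesis well formed: any opening tag may be emitted, but a closing tag must match the most recently opened tag.

```python
# Illustrative tag-balance constraint for XML-constrained decoding: given the
# tags generated so far, allow any opening tag, but only the closing tag that
# matches the most recently opened one.
def allowed_tag_tokens(generated, tag_vocab):
    stack = []
    for tok in generated:
        if tok.startswith("</"):
            stack.pop()
        elif tok.startswith("<"):
            stack.append(tok.strip("<>"))
    allowed = {t for t in tag_vocab if not t.startswith("</")}   # opening tags
    if stack:
        allowed.add(f"</{stack[-1]}>")                           # matching close only
    return allowed

vocab = {"<ph>", "</ph>", "<uicontrol>", "</uicontrol>"}
print(allowed_tag_tokens(["<ph>", "hello"], vocab))   # may close </ph>, not </uicontrol>
```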

Posted ContentDOI
20 Mar 2020-bioRxiv
TL;DR: An HDF5 file format ‘mzMLb’ that is optimised for both read/write speed and storage of the raw mass spectrometry data is proposed, demonstrating a flexible format that with or without compression is faster than all existing approaches in virtually all cases.
Abstract: With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise XML representation for data interchange, mzML, receiving substantial uptake; nevertheless, storage and file access efficiency has not been the main focus. We propose an HDF5 file format 'mzMLb' that is optimised for both read/write speed and storage of the raw mass spectrometry data. We provide extensive validation of write speed, random read speed and storage size, demonstrating a flexible format that with or without compression is faster than all existing approaches in virtually all cases, while with compression, is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb's design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilised by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.
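A sketch of the general idea, not the exact mzMLb layout: numeric spectra go into chunked, compressed HDF5 datasets while the mzML XML metadata is preserved verbatim alongside them. It assumes the h5py and numpy packages.

```python
# Illustrative HDF5 layout in the spirit of mzMLb (not its actual specification):
# spectra as chunked, compressed numeric datasets, mzML XML metadata kept as-is.
import numpy as np
import h5py

mz = np.random.rand(100000)          # stand-in m/z array
intensity = np.random.rand(100000)   # stand-in intensity array
mzml_header = "<mzML xmlns='http://psi.hupo.org/ms/mzml'>...</mzML>"

with h5py.File("example.h5", "w") as f:
    f.create_dataset("spectrum_mz", data=mz, compression="gzip", chunks=True)
    f.create_dataset("spectrum_intensity", data=intensity, compression="gzip", chunks=True)
    f.create_dataset("mzML_metadata",
                     data=np.frombuffer(mzml_header.encode(), dtype=np.uint8))
```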

Proceedings ArticleDOI
TL;DR: In this paper, the Common Variability Language (CVL) is used as a composition-based approach, integrated with annotations, to manage fine-grained variability of a Software Product Line for web applications.
Abstract: Web applications development involves managing a high diversity of files and resources like code, pages or style sheets, implemented in different languages. To deal with the automatic generation of custom-made configurations of web applications, industry usually adopts annotation-based approaches even though the majority of studies encourage the use of composition-based approaches to implement Software Product Lines. Recent work tries to combine both approaches to get the complementary benefits. However, technological companies are reticent to adopt new development paradigms such as feature-oriented programming or aspect-oriented programming. Moreover, it is extremely difficult, or even impossible, to apply these programming models to web applications, mainly because of their multilingual nature, since their development involves multiple types of source code (Java, Groovy, JavaScript), templates (HTML, Markdown, XML), style sheet files (CSS and its variants, such as SCSS), and other files (JSON, YML, shell scripts). We propose to use the Common Variability Language as a composition-based approach and integrate annotations to manage fine-grained variability of a Software Product Line for web applications. In this paper, we (i) show that existing composition and annotation-based approaches, including some well-known combinations, are not appropriate to model and implement the variability of web applications; and (ii) present a combined approach that effectively integrates annotations into a composition-based approach for web applications. We implement our approach and show its applicability with an industrial real-world system.

Journal ArticleDOI
TL;DR: A formal definition of this GPML standard is presented and a case study where GPML is used to implement a model predictive controller for the control of a building heating plant is presented.
Abstract: We propose a genetic programming markup language (GPML), an XML-based standard for the interchange of genetic programming trees, and outline the benefits such a format would bring in allowing the deployment of trained genetic programming (GP) models in applications as well as the subsidiary benefit of allowing GP researchers to directly share trained trees. We present a formal definition of this standard and describe details of an implementation. In addition, we present a case study where GPML is used to implement a model predictive controller for the control of a building heating plant.
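Since the GPML schema itself is defined in the paper, the element names below are hypothetical; the sketch only illustrates the kind of round trip such an interchange format enables: serialize a GP tree as XML, reload it elsewhere, and evaluate it.

```python
# Hypothetical GP-tree XML in the spirit of GPML (NOT the standard's actual
# schema), plus a tiny evaluator, to show the deployment round trip such an
# interchange format enables. The encoded expression is (x * 2.0) + 3.5.
import xml.etree.ElementTree as ET
import operator

TREE = """
<node op="add">
  <node op="mul">
    <node var="x"/>
    <node const="2.0"/>
  </node>
  <node const="3.5"/>
</node>
"""

OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}

def evaluate(node, variables):
    if "const" in node.attrib:
        return float(node.get("const"))
    if "var" in node.attrib:
        return variables[node.get("var")]
    left, right = list(node)
    return OPS[node.get("op")](evaluate(left, variables), evaluate(right, variables))

print(evaluate(ET.fromstring(TREE), {"x": 4.0}))   # 11.5
```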

Journal ArticleDOI
TL;DR: This research presents a novel dynamic XML labelling scheme, called the Pentagonal Scheme, in which data are represented as ordered XML nodes with relationships between them; the scheme efficiently supports random skewed updates, with fast calculations and uncomplicated implementations for handling updates.
Abstract: In XML databases, the indexing process is based on a labelling or numbering scheme and is generally used to label an XML document so that XML queries can be performed using the path node information. Moreover, a labelling scheme helps to capture the structural relationships during query processing without the need to access the physical document. Two of the main problems for XML labelling schemes are duplicated labels and cost efficiency in terms of labelling time and size. This research presents a novel dynamic XML labelling scheme, called the Pentagonal Scheme, in which data are represented as ordered XML nodes with relationships between them. Updating these nodes in large-scale XML documents has been widely investigated and represents a challenging research problem, as it can mean relabelling a whole tree. Our algorithms provide an efficient dynamic XML labelling scheme that supports data updates without duplicating labels or relabelling old nodes. Our work evaluates the labelling process in terms of size and time, and evaluates the labelling scheme's ability to handle several insertions in XML documents. The findings indicate that the Pentagonal Scheme shows better initial labelling time performance than the compared schemes, particularly when using large XML datasets. Moreover, it efficiently supports random skewed updates, with fast calculations and uncomplicated implementations for handling updates. In addition, query response times and relationship computations in the Pentagonal Scheme can be performed efficiently without any extra cost. For these reasons, our labelling scheme achieves the goal of this research.
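The details of the Pentagonal Scheme are specific to the paper, so the sketch below uses generic Dewey-style prefix labels purely to illustrate how a labelling scheme answers structural relationships from labels alone, without re-reading the document.

```python
# Generic illustration (Dewey-style prefix labels, NOT the Pentagonal Scheme
# itself) of how an XML labelling scheme answers structural relationships from
# labels alone, without accessing the physical document.
def is_ancestor(a, b):
    return len(a) < len(b) and b[:len(a)] == a

def is_sibling(a, b):
    return len(a) == len(b) and a[:-1] == b[:-1] and a != b

book     = (1,)
chapter1 = (1, 1)
section2 = (1, 1, 2)
chapter2 = (1, 2)

print(is_ancestor(book, section2))      # True
print(is_sibling(chapter1, chapter2))   # True
```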

Journal ArticleDOI
TL;DR: An approach for generating a Digital Twin efficiently is presented, in which the obstacles are overcome by using fast scans of the shop floor and subsequent object recognition; the focus here is on creating a simulation model.

Journal ArticleDOI
01 Feb 2020
TL;DR: An effective quality analysis of XML web data using a clustering and classification approach, implemented in Java, is used to assess XML data quality and to rank web pages effectively.
Abstract: Our proposed method performs an effective quality analysis of XML web data using a clustering and classification approach. As XML is becoming a standard for representing data, it is attractive to support keyword search in XML databases. A keyword search looks for words anywhere in a record and has emerged as the preferred paradigm for finding information on the web. The most important requirement for keyword search is to rank the results of a query so that the most relevant results appear first. Here, we first collect XML documents, after which feature extraction takes place. Since the extracted features contain both relevant and irrelevant features, it is essential to filter out the irrelevant ones; a probability-based feature selection method is used to select the relevant features. A weighted fuzzy c-means clustering algorithm is then used to cluster the relevant features on the basis of keywords. To assess XML data quality, an optimal neural network (ONN) classifier is utilized, in which the whale optimization algorithm selects the optimal weights. In this way, the web pages are effectively ranked. The efficiency of the proposed method is assessed using clustering and classification accuracy, RMSE, and search time. The proposed method is implemented in Java.

Journal ArticleDOI
TL;DR: Template4EHR is a tool for the dynamic creation of data schemas for electronic health-record storage and for user creation and customization of graphical user interfaces.
Abstract: Template4EHR is a tool for the dynamic creation of data schemas for electronic health-record storage and user creation and customization of graphical user interfaces. In experimental tests with IT and health professionals, Template4EHR obtained an 81.22% satisfaction rate.

Proceedings ArticleDOI
10 Jun 2020
TL;DR: This paper introduces the concept of a new configuration interface for IEC 61499 devices using OPC UA information modeling concepts, as an alternative to the existing XML and Binary XML implementations.
Abstract: In the modern era of industrial automation, the term Industry 4.0 denotes the fourth industrial revolution, a phenomenon in which technologies from various layers of an enterprise are interconnected and form a meshed network of self-regulated, adaptive, re-configurable and self-optimizing devices. These devices range from Programmable Logic Controllers, embedded PCs, and edge nodes to smart sensors and actuators, working as proxies or mediators for real objects in the software domain and integrating into Intelligent Enterprise Applications. Heterogeneous configuration interfaces of these devices hinder a smooth integration and configuration process. A unified way of interacting with the devices for configuration is well defined in the IEC 61499 standard, which specifies the commands, interaction behavior, and interface description for control devices and engineering tools. There are implementations of the configuration interface in XML and Binary XML, which are widely used for their flexible, extensible, and human-readable nature, whereas OPC UA can offer an open configuration interface for IEC 61499 devices and software tools with built-in interoperability solutions. This paper introduces the concept of a new configuration interface for IEC 61499 devices using OPC UA information modeling concepts.
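A hedged sketch of the idea using the third-party python-opcua package (an assumption, and the node names are illustrative rather than the paper's information model): IEC 61499-style configuration data is exposed as nodes in an OPC UA address space that engineering tools can browse and write.

```python
# Sketch (third-party python-opcua package assumed installed) of exposing
# IEC 61499-style device configuration through an OPC UA information model.
# Node names are illustrative, not the paper's actual model.
from opcua import Server

server = Server()
server.set_endpoint("opc.tcp://0.0.0.0:4840/iec61499-config/")
idx = server.register_namespace("http://example.org/iec61499")

device = server.get_objects_node().add_object(idx, "ControlDevice")
state = device.add_variable(idx, "DeviceState", "IDLE")
app = device.add_variable(idx, "ApplicationName", "FunctionBlockApp1")
state.set_writable()   # engineering tools may reconfigure via writes
app.set_writable()

server.start()
try:
    print("Configuration interface available at opc.tcp://0.0.0.0:4840/")
finally:
    server.stop()
```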

Journal ArticleDOI
TL;DR: A new format for MT and related electromagnetic transfer functions, the electromagnetic transfer function Extensible Markup Language (EMTF XML), is developed; it is a novel, self-describing, searchable, and extensible way to store the data.
Abstract: Initial processing of magnetotelluric (MT) data collected at a site results in a small data file that defines the MT transfer functions (MT TFs) or variants at a discrete set of frequencies...

Journal ArticleDOI
TL;DR: Results of an experimental study indicate that XChange provides higher effectiveness and efficiency for understanding changes between versions of XML documents compared with the (syntactic) state-of-the-art approaches.