Showing papers on "XML published in 2019"

PDF

Open Access

Posted Content•

PubLayNet: largest dataset ever for document layout analysis.

[...]

Xu Zhong¹, Jianbin Tang¹, Antonio Jimeno Yepes¹•Institutions (1)

16 Aug 2019-arXiv: Computation and Language

TL;DR: The authors developed the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central, where typical document layout elements are annotated.

...read moreread less

Abstract: Recognizing the layout of unstructured digital documents is an important step when parsing the documents into structured machine-readable format for downstream applications. Deep neural networks that are developed for computer vision have been proven to be an effective method to analyze layout of document images. However, document layout datasets that are currently publicly available are several magnitudes smaller than established computing vision datasets. Models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated. The experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base mode for transfer learning on a different document domain. We release the dataset (this https URL) to support development and evaluation of more advanced models for document layout analysis.

...read moreread less

177 citations

Proceedings Article•DOI•

PubLayNet: Largest Dataset Ever for Document Layout Analysis

[...]

Xu Zhong¹, Jianbin Tang¹, Antonio Jimeno Yepes¹•Institutions (1)

IBM¹

16 Aug 2019

TL;DR: The PubLayNet dataset for document layout analysis is developed by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central and demonstrated that deep neural networks trained on Pub LayNet accurately recognize the layout of scientific articles.

...read moreread less

Abstract: Recognizing the layout of unstructured digital documents is an important step when parsing the documents into structured machine-readable format for downstream applications. Deep neural networks that are developed for computer vision have been proven to be an effective method to analyze layout of document images. However, document layout datasets that are currently publicly available are several magnitudes smaller than established computing vision datasets. Models have to be trained by transfer learning from a base model that is pre-trained on a traditional computer vision dataset. In this paper, we develop the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated. The experiments demonstrate that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles. The pre-trained models are also a more effective base mode for transfer learning on a different document domain. We release the dataset (https://github.com/ibm-aur-nlp/PubLayNet) to support development and evaluation of more advanced models for document layout analysis.

...read moreread less

160 citations

Posted Content•

Image-based table recognition: data, model, and evaluation

[...]

Xu Zhong¹, Elaheh ShafieiBavani¹, Antonio Jimeno Yepes¹•Institutions (1)

IBM¹

25 Nov 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The largest publicly available table recognition dataset PubTabNet is developed, containing 568k table images with corresponding structured HTML representation, and a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition is proposed, which more appropriately captures multi-hop cell misalignment and OCR errors than the pre-established metric.

...read moreread less

Abstract: Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g., Portable Document Format (PDF) and images, are difficult to parse into structured machine-readable format, due to complexity and diversity in their structure and style. To facilitate image-based table recognition with deep learning, we develop the largest publicly available table recognition dataset PubTabNet (this https URL), containing 568k table images with corresponding structured HTML representation. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in PubMed Central Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition, which more appropriately captures multi-hop cell misalignment and OCR errors than the pre-established metric. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state-of-the-art by 9.7% absolute TEDS score.

...read moreread less

88 citations

Journal Article•DOI•

CityJSON: a compact and easy-to-use encoding of the CityGML data model.

[...]

Hugo Ledoux¹, Ken Arroyo Ohori¹, Kavisha Kumar¹, Balázs Dukai¹, Anna Labetski¹, S. Vitalis¹ - Show less +2 more•Institutions (1)

Delft University of Technology¹

25 Feb 2019-arXiv: Databases

TL;DR: CityJSON as mentioned in this paper is a JSON-based exchange format for the CityGML data model (version 2.0.0) which is designed with programmers in mind, so that software and APIs supporting it can be quickly built.

...read moreread less

Abstract: The international standard CityGML is both a data model and an exchange format to store digital 3D models of cities. While the data model is used by several cities, companies, and governments, in this paper we argue that its XML-based exchange format has several drawbacks. These drawbacks mean that it is difficult for developers to implement parsers for CityGML, and that practitioners have, as a consequence, to convert their data to other formats if they want to exchange them with others. We present CityJSON, a new JSON-based exchange format for the CityGML data model (version 2.0.0). CityJSON was designed with programmers in mind, so that software and APIs supporting it can be quickly built. It was also designed to be compact (a compression factor of around six with real-world datasets), and to be friendly for web and mobile development. We argue that it is considerably easier to use than the CityGML format, both for reading and for creating datasets. We discuss in this paper the main features of CityJSON, briefly present the different software packages to parse/view/edit/create files (including one to automatically convert between the JSON and GML encodings), analyse how real-world datasets compare to those of CityGML, and we also introduce \emph{Extensions}, which allow us to extend the core data model in a documented manner.

...read moreread less

81 citations

Journal Article•DOI•

PMC text mining subset in BioC: about three million full-text articles and growing.

[...]

Donald C. Comeau¹, Chih-Hsuan Wei¹, Rezarta Islamaj Dogan¹, Zhiyong Lu¹•Institutions (1)

National Institutes of Health¹

15 Sep 2019-Bioinformatics

TL;DR: To facilitate automated processing of nearly 3 million full-text articles (in PMC Open Access and Author Manuscript subsets) and to improve interoperability, BioC, a community-driven simple data structure in either XML or JSON format is converted.

...read moreread less

Abstract: Motivation Interest in text mining full-text biomedical research articles is growing. To facilitate automated processing of nearly 3 million full-text articles (in PubMed Central® Open Access and Author Manuscript subsets) and to improve interoperability, we convert these articles to BioC, a community-driven simple data structure in either XML or JavaScript Object Notation format for conveniently sharing text and annotations. Results The resultant articles can be downloaded via both File Transfer Protocol for bulk access and a Web API for updates or a more focused collection. Since the availability of the Web API in 2017, our BioC collection has been widely used by the research community. Availability and implementation https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/.

...read moreread less

48 citations

Journal Article•DOI•

CityJSON: a compact and easy-to-use encoding of the CityGML data model

[...]

Hugo Ledoux¹, Ken Arroyo Ohori¹, Kavisha Kumar¹, Balázs Dukai¹, Anna Labetski¹, S. Vitalis¹ - Show less +2 more•Institutions (1)

Delft University of Technology¹

01 Dec 2019-Open Geospatial Data, Software and Standards

...read moreread less

Abstract: The international standard CityGML is both a data model and an exchange format to store digital 3D models of cities. While the data model is used by several cities, companies, and governments, in this paper we argue that its XML-based exchange format has several drawbacks. These drawbacks mean that it is difficult for developers to implement parsers for CityGML, and that practitioners have, as a consequence, to convert their data to other formats if they want to exchange them with others. We present CityJSON, a new JSON-based exchange format for the CityGML data model (version 2.0.0). CityJSON was designed with programmers in mind, so that software and APIs supporting it can be quickly built. It was also designed to be compact (a compression factor of around six with real-world datasets), and to be friendly for web and mobile development. We argue that it is considerably easier to use than the CityGML format, both for reading and for creating datasets. We discuss in this paper the main features of CityJSON, briefly present the different software packages to parse/view/edit/create files (including one to automatically convert between the JSON and GML encodings), analyse how real-world datasets compare to those of CityGML, and we also introduce Extensions, which allow us to extend the core data model in a documented manner.

...read moreread less

39 citations

Journal Article•DOI•

mzTab-M: A Data Standard for Sharing Quantitative Results in Mass Spectrometry Metabolomics.

[...]

Nils Hoffmann¹, Joel Rein², Timo Sachsenberg³, Jürgen Hartler⁴, Kenneth Haug⁵, Gerhard Mayer⁶, Oliver Alka³, Saravanan Dayalan⁷, Jake T M Pearce⁸, Philippe Rocca-Serra⁹, Da Qi¹⁰, Martin Eisenacher⁶, Yasset Perez-Riverol⁵, Juan Antonio Vizcaíno⁵, Reza M. Salek¹¹, Steffen Neumann¹², Andrew R. Jones¹⁰ - Show less +13 more•Institutions (12)

Leibniz Institute for Neurobiology¹, Wellcome Trust Sanger Institute², University of Tübingen³, Graz University of Technology⁴, European Bioinformatics Institute⁵, Ruhr University Bochum⁶, University of Melbourne⁷, Imperial College London⁸, University of Oxford⁹, University of Liverpool¹⁰, International Agency for Research on Cancer¹¹, Leibniz Association¹²

28 Jan 2019-Analytical Chemistry

TL;DR: The mzTab-M format as mentioned in this paper is a tab-separated text format that allows for ambiguity in the identification of molecules to be communicated clearly to readers of the files (both people and software).

...read moreread less

Abstract: Mass spectrometry (MS) is one of the primary techniques used for large-scale analysis of small molecules in metabolomics studies. To date, there has been little data format standardization in this field, as different software packages export results in different formats represented in XML or plain text, making data sharing, database deposition, and reanalysis highly challenging. Working within the consortia of the Metabolomics Standards Initiative, Proteomics Standards Initiative, and the Metabolomics Society, we have created mzTab-M to act as a common output format from analytical approaches using MS on small molecules. The format has been developed over several years, with input from a wide range of stakeholders. mzTab-M is a simple tab-separated text format, but importantly, the structure is highly standardized through the design of a detailed specification document, tightly coupled to validation software, and a mandatory controlled vocabulary of terms to populate it. The format is able to represent final quantification values from analyses, as well as the evidence trail in terms of features measured directly from MS (e.g., LC-MS, GC-MS, DIMS, etc.) and different types of approaches used to identify molecules. mzTab-M allows for ambiguity in the identification of molecules to be communicated clearly to readers of the files (both people and software). There are several implementations of the format available, and we anticipate widespread adoption in the field.

...read moreread less

35 citations

Journal Article•DOI•

Automatic Generation of Tests to Exploit XML Injection Vulnerabilities in Web Applications

[...]

Sadeeq Jan¹, Annibale Panichella¹, Andrea Arcuri¹, Lionel C. Briand¹•Institutions (1)

University of Luxembourg¹

01 Apr 2019-IEEE Transactions on Software Engineering

TL;DR: This work improves upon previous results by providing more efficient techniques to generate XML injection attacks, and investigates four different algorithms and two different fitness functions.

...read moreread less

Abstract: Modern enterprise systems can be composed of many web services (e.g., SOAP and RESTful). Users of such systems might not have direct access to those services, and rather interact with them through a single-entry point which provides a GUI (e.g., a web page or a mobile app). Although the interactions with such entry point might be secure, a hacker could trick such systems to send malicious inputs to those internal web services. A typical example is XML injection targeting SOAP communications. Previous work has shown that it is possible to automatically generate such kind of attacks using search-based techniques. In this paper, we improve upon previous results by providing more efficient techniques to generate such attacks. In particular, we investigate four different algorithms and two different fitness functions. A large empirical study, involving also two industrial systems, shows that our technique is effective at automatically generating XML injection attacks.

...read moreread less

34 citations

Proceedings Article•DOI•

Coming: a tool for mining change pattern instances from git commits

[...]

Matias Martinez, Martin Monperrus¹•Institutions (1)

Royal Institute of Technology¹

25 May 2019

TL;DR: In this paper, the authors present a tool that takes as input a GitHub repository and mines instances of code change patterns present on each commit, and analyzes those changes to determine if they correspond to an instance of a change pattern.

...read moreread less

Abstract: Software repositories such as Git have become a relevant source of information for software engineer researchers. For instance, the detection of commits that fulfill a given criterion (e.g., bugfixing commits) is one of the most frequent tasks done to understand the software evolution. However, to our knowledge, there is no open-source tool that, given a Git repository, returns all the instances of a given code change pattern. In this paper we present Coming, a tool that takes as input a Git repository and mines instances of code change patterns present on each commit. For that, Coming computes fine-grained code changes between two consecutive revisions, analyzes those changes to determine if they correspond to an instance of a change pattern (specified by the user using XML), and finally, after analyzing all the commits, it presents a) the frequency of code changes and b) the instances found in each commit. We evaluate Coming on a set of 28 pairs of revisions from Defects4J, finding instances of change patterns that involve If conditions on 26 of them.

...read moreread less

27 citations

Journal Article•DOI•

One Size Does Not Fit All: Querying Web Polystores

[...]

Yasar Khan¹, Antoine Zimmermann², Alokkumar Jha¹, Vijay Gadepally³, Mathieu d'Aquin¹, Ratnesh Sahay¹ - Show less +2 more•Institutions (3)

National University of Ireland, Galway¹, Centre national de la recherche scientifique², Massachusetts Institute of Technology³

17 Jan 2019-IEEE Access

TL;DR: This paper proposes a web-based query federation mechanism—called PolyWeb—that unifies query answering over multiple native data models (CSV, RDB, and RDF) and demonstrates PolyWeb on a cancer genomics use case.

...read moreread less

Abstract: Data retrieval systems are facing a paradigm shift due to the proliferation of specialized data storage engines (SQL, NoSQL, Column Stores, MapReduce, Data Stream, and Graph) supported by varied data models (CSV, JSON, RDB, RDF, and XML). One immediate consequence of this paradigm shift results into data bottleneck over the web; which means, web applications are unable to retrieve data with the intensity at which data are being generated from different facilities. Especially in the genomics and healthcare verticals, data are growing from petascale to exascale, and biomedical stakeholders are expecting seamless retrieval of these data over the web. In this paper, we argue that the bottleneck over the web can be reduced by minimizing the costly data conversion process and delegating query performance and processing loads to the specialized data storage engines over their native data models. We propose a web-based query federation mechanism-called PolyWeb-that unifies query answering over multiple native data models (CSV, RDB, and RDF). We emphasize two main challenges of query federation over native data models: 1) devise a method to select prospective data sources-with different underlying data models-that can satisfy a given query and 2) query optimization, join, and execution over different data models. We demonstrate PolyWeb on a cancer genomics use case, where it is often the case that a description of biological and chemical entities (e.g., gene, disease, drug, and pathways) spans across multiple data models and respective storage engines. In order to assess the benefits and limitations of evaluating queries over native data models, we evaluate PolyWeb with the state-of-the-art query federation engines in terms of result completeness, source selection, and overall query execution time.

...read moreread less

25 citations

Proceedings Article•DOI•

PyGGI 2.0: language independent genetic improvement framework

[...]

Gabin An¹, Aymeric Blot², Justyna Petke², Shin Yoo¹•Institutions (2)

KAIST¹, University College London²

12 Aug 2019

TL;DR: PyGGI as mentioned in this paper is a research tool for genetic improvement (GI) that is designed to be versatile and easy to use, allowing users to easily define GI operators and algorithms that can be reused with multiple target languages.

...read moreread less

Abstract: PyGGI is a research tool for Genetic Improvement (GI), that is designed to be versatile and easy to use. We present version 2.0 of PyGGI, the main feature of which is an XML-based intermediate program representation. It allows users to easily define GI operators and algorithms that can be reused with multiple target languages. Using the new version of PyGGI, we present two case studies. First, we conduct an Automated Program Repair (APR) experiment with the QuixBugs benchmark, one that contains defective programs in both Python and Java. Second, we replicate an existing work on runtime improvement through program specialisation for the MiniSAT satisfiability solver. PyGGI 2.0 was able to generate a patch for a bug not previously fixed by any APR tool. It was also able to achieve 14% runtime improvement in the case of MiniSAT. The presented results show the applicability and the expressiveness of the new version of PyGGI. A video of the tool demo is at: https://youtu.be/PxRUdlRDS40.

...read moreread less

Journal Article•DOI•

Review of the Complexity of Managing Big Data of the Internet of Things

[...]

David Gil¹, Magnus Johnsson², Magnus Johnsson³, Higinio Mora¹, Julian Szymański⁴ - Show less +1 more•Institutions (4)

University of Alicante¹, Malmö University², National Research Nuclear University MEPhI³, Gdańsk University of Technology⁴

03 Feb 2019-Complexity

TL;DR: In this review, the state of the art of all the aforementioned aspects of Big Data in the context of the Internet of Things is exposed and the most novel technologies in machine learning, deep learning, and data mining on Big Data are discussed.

...read moreread less

Abstract: There is a growing awareness that the complexity of managing Big Data is one of the main challenges in the developing field of the Internet of Things (IoT). Complexity arises from several aspects of the Big Data life cycle, such as gathering data, storing them onto cloud servers, cleaning and integrating the data, a process involving the last advances in ontologies, such as Extensible Markup Language (XML) and Resource Description Framework (RDF), and the application of machine learning methods to carry out classifications, predictions, and visualizations. In this review, the state of the art of all the aforementioned aspects of Big Data in the context of the Internet of Things is exposed. The most novel technologies in machine learning, deep learning, and data mining on Big Data are discussed as well. Finally, we also point the reader to the state-of-the-art literature for further in-depth studies, and we present the major trends for the future.

...read moreread less

Journal Article•DOI•

A library of building occupant behaviour models represented in a standardised schema

[...]

Zsofia Deme Belafi¹, Zsofia Deme Belafi², Tianzhen Hong², András Reith³•Institutions (3)

Budapest University of Technology and Economics¹, Lawrence Berkeley National Laboratory², Urban Design Group³

05 Mar 2019-Energy Efficiency

TL;DR: A newly developed library of OB models represented in the standardised obXML schema format is presented, which provides ready-to-use examples for BPS users to employ more accurate occupant representation in their energy models.

...read moreread less

Abstract: Over the past four decades, a substantial body of literature has explored the impacts of occupant behaviour (OB) on building technologies, operation, and energy consumption. A large number of data-driven behavioural models have been developed based on field data. These models lack standardisation and consistency, leading to difficulties in applications and comparison. To address this problem, an ontology was developed using the drivers-needs-actions-systems (DNAS) framework. Recent work has been carried out to implement the theoretical DNAS framework into an eXtensible Markup Language (XML) schema, titled ‘occupant behaviour XML’ (obXML) which is a practical implementation of OB models that can be integrated into building performance simulation (BPS) programs. This paper presents a newly developed library of OB models represented in the standardised obXML schema format. This library provides ready-to-use examples for BPS users to employ more accurate occupant representation in their energy models. The library, which contains an initial effort of 52 OB models, was made publicly available for the BPS community. As part of the library development process, limitations of the obXML schema were identified and addressed, and future improvements were proposed. Authors hope that by compiling this library building, energy modellers from all over the world can enhance their BPS models by integrating more accurate and robust OB patterns.

...read moreread less

Proceedings Article•DOI•

OPC UA NodeSet Ontologies as a Pillar of Representing Semantic Digital Twins of Manufacturing Resources

[...]

Alexander Perzylo¹, Stefan Profanter¹, Markus Rickert¹, Alois Knoll¹•Institutions (1)

Technische Universität München¹

01 Sep 2019

TL;DR: A semantic representation of a resource’s properties is proposed, in which OWL ontologies are used to encode the information models that can be found in OPC UA NodeSet specifications, and an OWL-based description of the resource‘s geometry and – if applicable – its kinematic model is combined.

...read moreread less

Abstract: The effectiveness of cognitive manufacturing systems in agile production environments heavily depend on the automatic assessment of various levels of interoperability between manufacturing resources. For taking informed decisions, a semantically rich representation of all resources in a workcell or production line is required. OPC UA provides means for communication and information exchange in such distributed settings.This paper proposes a semantic representation of a resource’s properties, in which we use OWL ontologies to encode the information models that can be found in OPC UA NodeSet specifications. We further combine these models with an OWL-based description of the resource’s geometry and – if applicable – its kinematic model. This leads to a comprehensive semantic representation of hardware and software features of a manufacturing resource, which we call semantic digital twin. Among other things, it reduces costs through virtual prototyping and enables the automatic deployment of manufacturing tasks in production lines. As a result, small-batch assemblies become financially viable.In order to minimize the effort of creating OWL-based UA NodeSet descriptions, we provide a software tool for the automatic transformation of XML-based NodeSet specifications that adhere to the OPC Foundation’s NodeSet2 XML schema.

...read moreread less

Journal Article•DOI•

A standard data model for life cycle analysis of industrial products: A support for eco-design initiatives

[...]

Marco Mandolini¹, Marco Marconi², Marta Rossi¹, Claudio Favi³, Michele Germani¹ - Show less +1 more•Institutions (3)

Marche Polytechnic University¹, Tuscia University², University of Parma³

01 Aug 2019-Computers in Industry

TL;DR: A life cycle standard data model (LCSDM) that manages and shares life cycle information along the product development process and the use of a unique standard for data sharing among the several eco-design software tools is defined.

...read moreread less

Journal Article•DOI•

xml2jupyter: Mapping parameters between XML and Jupyter widgets.

[...]

Randy Heiland¹, Daniel Mishler¹, Tyler Zhang¹, Eric Bower¹, Paul Macklin¹ - Show less +1 more•Institutions (1)

Indiana University¹

01 Jul 2019

TL;DR: The XML2jupyter project as mentioned in this paper is a Python package that provides a mapping between configuration files, formatted in the Extensible Markup Language (XML), and Jupyters widgets.

...read moreread less

Abstract: Jupyter Notebooks (Kluyver et al., 2016, Perkel (2018)) provide executable documents (in a variety of programming languages) that can be run in a web browser. When a notebook contains graphical widgets, it becomes an easy-to-use graphical user interface (GUI). Many scientific simulation packages use text-based configuration files to provide parameter values and run at the command line without a graphical interface. Manually editing these files to explore how different values affect a simulation can be burdensome for technical users, and impossible to use for those with other scientific backgrounds. xml2jupyter is a Python package that addresses these scientific bottlenecks. It provides a mapping between configuration files, formatted in the Extensible Markup Language (XML), and Jupyter widgets. Widgets are automatically generated from the XML file and these can, optionally, be incorporated into a larger GUI for a simulation package, and optionally hosted on cloud resources. Users modify parameter values via the widgets, and the values are written to the XML configuration file which is input to the simulation's command-line interface. xml2jupyter has been tested using PhysiCell (Ghaffarizadeh, Heiland, Friedman, Mumenthaler, & Macklin, 2018), an open source, agent-based simulator for biology, and it is being used by students for classroom and research projects. In addition, we use xml2jupyter to help create Jupyter GUIs for PhysiCell-related applications running on nanoHUB (Madhavan et al., 2013).

...read moreread less

Journal Article•DOI•

Search-based multi-vulnerability testing of XML injections in web applications.

[...]

Sadeeq Jan¹, Sadeeq Jan², Annibale Panichella³, Annibale Panichella², Andrea Arcuri, Lionel C. Briand² - Show less +2 more•Institutions (3)

University of Engineering and Technology, Peshawar¹, University of Luxembourg², Delft University of Technology³

01 Jan 2019-Empirical Software Engineering

TL;DR: A novel co-evolutionary algorithm (COMIX) that is tailored to the problem and uncover multiple vulnerabilities at the same time is proposed and experiments show that COMIX outperforms a single-target search approach for XMLi and other multi- target search algorithms originally defined for white-box unit testing.

...read moreread less

Abstract: Modern web applications often interact with internal web services, which are not directly accessible to users. However, malicious user inputs can be used to exploit security vulnerabilities in web services through the application front-ends. Therefore, testing techniques have been proposed to reveal security flaws in the interactions with back-end web services, e.g., XML Injections (XMLi). Given a potentially malicious message between a web application and web services, search-based techniques have been used to find input data to mislead the web application into sending such a message, possibly compromising the target web service. However, state-of-the-art techniques focus on (search for) one single malicious message at a time. Since, in practice, there can be many different kinds of malicious messages, with only a few of them which can possibly be generated by a given front-end, searching for one single message at a time is ineffective and may not scale. To overcome these limitations, we propose a novel co-evolutionary algorithm (COMIX) that is tailored to our problem and uncover multiple vulnerabilities at the same time. Our experiments show that COMIX outperforms a single-target search approach for XMLi and other multi-target search algorithms originally defined for white-box unit testing.

...read moreread less

Journal Article•DOI•

Scalable and flexible management of medical image big data

[...]

Dejun Teng¹, Jun Kong², Fusheng Wang³•Institutions (3)

Ohio State University¹, Emory University², Stony Brook University³

01 Jun 2019-Distributed and Parallel Databases

TL;DR: This paper has delivered two open source systems DCMRL/XMLStore and DCMDocStore, a parallel, hybrid relational and XML data management approach, and a NoSQL document store approach that support scalable data management and comprehensive queries.

...read moreread less

Abstract: Digital imaging plays a critical role for image guided diagnosis and clinical trials, and the amount of image data is fast growing. There are two major requirements for image data management: scalability for massive scales and support of comprehensive queries. Traditional Picture Archiving and Communication Systems (PACS for short) are based on relational data management systems and suffer from limited scalability and query support. Therefore, new systems that support fast, scalable and comprehensive queries on image data are highly demanded. In this paper, we introduce two alternative approaches: DCMRL/XMLStore (RL/XML for short)-a parallel, hybrid relational and XML data management approach, and DCMDocStore (DOC for short)-a NoSQL document store approach. DCMRL/XMLStore manages DICOM images as binary large objects and metadata as relational tables and XML documents based on IBM DB2, which is parallelized through data partitioning. DCMDocStore manages DICOM metadata as JSON objects, and DICOM images as encoded attachments in MongoDB running on multiple nodes. We have delivered two open source systems DCMRL/XMLStore and DCMDocStore. Both systems support scalable data management and comprehensive queries. We also evaluated them with nearly one million DICOM images from National Biomedical Imaging Archive. The results show that, DCMDocStore demonstrates high data loading speed, high scalability and fault tolerance. DCMRL/XMLStore provides efficient queries, but comes with slower data loading. Traditional PACS systems have inherent limitations on flexible queries and scalability for massive amount of images.

...read moreread less

Journal Article•DOI•

A privacy protection approach for XML-based archives management in a cloud environment

[...]

Zongda Wu, Jian Xie, Xinze Lian, Jun Pan

09 Dec 2019-The Electronic Library

TL;DR: This paper presents a valuable study attempting to protect privacy for the management of XML archives in a cloud environment, so it has a positive significance to promote the application of cloud computing in a digital archive system.

...read moreread less

Abstract: The security of archival privacy data in the cloud has become the main obstacle to the application of cloud computing in archives management. To this end, aiming at XML archives, this paper aims to present a privacy protection approach that can ensure the security of privacy data in the untrusted cloud, without compromising the system availability.,The basic idea of the approach is as follows. First, the privacy data before being submitted to the cloud should be strictly encrypted on a trusted client to ensure the security. Then, to query the encrypted data efficiently, the approach constructs some key feature data for the encrypted data, so that each XML query defined on the privacy data can be executed correctly in the cloud.,Finally, both theoretical analysis and experimental evaluation demonstrate the overall performance of the approach in terms of security, efficiency and accuracy.,This paper presents a valuable study attempting to protect privacy for the management of XML archives in a cloud environment, so it has a positive significance to promote the application of cloud computing in a digital archive system.

...read moreread less

Journal Article•DOI•

Extremo: An Eclipse plugin for modelling and meta-modelling assistance

[...]

Ángel Mora Segura¹, Juan de Lara¹•Institutions (1)

Autonomous University of Madrid¹

01 Jul 2019-Science of Computer Programming

TL;DR: Extremo, an Eclipse plugin aimed at gathering the information stored in heterogeneous sources in a common data model, to facilitate the reuse of information chunks in the model being built, is presented.

...read moreread less

Journal Article•DOI•

Customized code generation based on user specifications for simulation and optimization

[...]

Gregor Tolksdorf¹, Erik Esche¹, Günter Wozny¹, Jens-Uwe Repke¹•Institutions (1)

Technical University of Berlin¹

02 Feb 2019-Computers & Chemical Engineering

TL;DR: This contribution presents a first implementation of model-driven development of customized code for simulation and optimization based on equation-oriented models in process science, based on free standards (MathML, XML) based on User-defined Language Specificators (UDLS).

...read moreread less

RocketRML - A NodeJS implementation of a use-case specific RML mapper

[...]

Umutcan Şimşek, Elias Kärle, Dieter Fensel

11 Mar 2019

TL;DR: This paper describes a new implementation of an RML mapper written with the JavaScript based NodeJS framework and shows, that the implementation has great potential to perform heavy mapping tasks in reasonable time, but comes with some limitations regarding JOINs, Named Graphs and inputs other than XML and JSON.

...read moreread less

Abstract: The creation of Linked Data from raw data sources is, in theory, no rocket science (pun intended). Depending on the nature of the input and the mapping technology in use, it can become a quite tedious task. For our work on mapping real-life touristic data to the this http URL vocabulary we used RML but soon encountered, that the existing Java mapper implementations reached their limits and were not sufficient for our use cases. In this paper we describe a new implementation of an RML mapper. Written with the JavaScript based NodeJS framework it performs quite well for our uses cases where we work with large XML and JSON files. The performance testing and the execution of the RML test cases have shown, that the implementation has great potential to perform heavy mapping tasks in reasonable time, but comes with some limitations regarding JOINs, Named Graphs and inputs other than XML and JSON - which is fine at the moment, due to the nature of the given use cases.

...read moreread less

Journal Article•DOI•

A study of the concept of parametric modeling of construction objects

[...]

Kateryna Kyivska, Svitlana Tsiutsiura, Mikola Tsiutsiura, Olena Kryvoruchko, Andrii V. Yerukaiev, Vladyslav V. Hots - Show less +2 more

30 Apr 2019

TL;DR: Research was carried out, the result of which was the generalization and structuring of the parameters of the elements of information models of building objects that allows to solve the problem of heterogeneity of information about the model coming from different sources.

...read moreread less

Abstract: Integration of information models between different software systems is an urgent task. There are a large number of CAD systems that cover more than 90% of the design tasks, but the mechanism for transferring information between these programs has not yet been worked out. Despite the use of common data integration formats (such as IFC, XML, DXF, DWG, PDF), CAD model elements are represented using various indicators and characteristics. This leads to a partial loss of information on objects. Integration of the model in an incomplete form and filling in the missing parameters manually takes a lot of time and is not effective. To solve this problem, research was carried out, the result of which was the generalization and structuring of the parameters of the elements of information models of building objects. This allows you to solve the problem of heterogeneity of information about the model coming from different sources.

...read moreread less

Journal Article•DOI•

Validating object-oriented software at design phase by achieving MC/DC

[...]

Swadhin Kumar Barisal¹, Suvam Suvabrata Behera¹, Sangharatna Godboley², Durga Prasad Mohapatra¹•Institutions (2)

National Institute of Technology, Rourkela¹, National University of Singapore²

01 Aug 2019-International Journal of Systems Assurance Engineering and Management

TL;DR: This paper deals with a new technique for validating object-oriented software at design phase of project development, and focuses on UML Activity Diagram.

...read moreread less

Abstract: This paper deals with a new technique for validating object-oriented software at design phase of project development. There are several modeling diagrams used at design phase of Software Development Life Cycle. But in this paper, we focus on UML Activity Diagram. In our work, first we construct the UML activity diagram for the given system using ArgoUML. Then, the XML (“EXtensible Markup Language”) code is generated for the constructed activity diagram. Next, this XML code is translated to XSD (“XML Schema Definition”) code. This XSD code is given as input to JAXB (“Java Architecture for XML Binding”), which generates the Java template. Then, this Java template is instrumented to a complete Java program with minimal manual effort. Next, we carryout concolic testing of this Java code using jCUTE. This tool generates test cases by taking the Java program as input. Then, the obtained test suite and generated Java source code are inputed into our in-house developed tool named COPECA (COverage PErcentage CAlculator) to calculate MC/DC (Modified Condition/Decision Coverage) score. We have achieved 56.31% MC/DC on doing experiment with fourteen activity diagrams, which is a fair (moderate) achievement compared to the existing work.

...read moreread less

Journal Article•DOI•

An approach to XBRL interoperability based on Ant Colony Optimization algorithm

[...]

Kamaleddin Yaghoobirafi¹, Eslam Nazemi¹•Institutions (1)

Shahid Beheshti University¹

01 Jan 2019-Knowledge Based Systems

TL;DR: A novel approach is proposed which utilizes Ant Colony Optimization (ACO) in order to detect best semantic mappings between inconsistent concepts of two XBRL documents, capable of finding mappings, which were not easily discoverable otherwise.

...read moreread less

Abstract: Extensible Business Reporting Language (XBRL) is an XML-based language developed for enhancing interoperability among the entities involved in process of business reporting. Although this language is adopted by various regulators all around the world and has contributed greatly to semantic interoperability in this field, the variations between taxonomies and also between elements of instance documents, still cause many inconsistencies between elements. Although some existing approaches suppose the conversion of XBRL to ontologies and then resolve the inconsistencies by applying some mapping techniques, it does not seem practical because of low precision and incompleteness of these conversions. In this paper, a novel approach is proposed which utilizes Ant Colony Optimization (ACO) in order to detect best semantic mappings between inconsistent concepts of two XBRL documents. This approach analyzes the possible mappings with respect to various factors like concept names, all label texts, presentation and calculation hierarchies and so on. This makes the approach capable of finding mappings, which were not easily discoverable otherwise. The proposed approach is implemented and applied to actual XBRL reports. The results are measured with aid of well-known criteria (precision, recall and F-measure) and are compared with the well-known Hungarian algorithm and illustrate the better performance in accordance with these three criteria.

...read moreread less

Proceedings Article•DOI•

A High-Quality Multilingual Dataset for Structured Documentation Translation

[...]

Kazuma Hashimoto, Raffaella Buschiazzo, James Bradbury, Teresa Marshall, Richard Socher, Caiming Xiong - Show less +2 more

01 Aug 2019

TL;DR: The authors presented a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text, which collected XML-structured parallel text segments from the online documentation for an enterprise software platform.

...read moreread less

Abstract: This paper presents a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text. Unlike widely-used datasets for translation of plain text, we collect XML-structured parallel text segments from the online documentation for an enterprise software platform. These Web pages have been professionally translated from English into 16 languages and maintained by domain experts, and around 100,000 text segments are available for each language pair. We build and evaluate translation models for seven target languages from English, with several different copy mechanisms and an XML-constrained beam search. We also experiment with a non-English pair to show that our dataset has the potential to explicitly enable 17 × 16 translation settings. Our experiments show that learning to translate with the XML tags improves translation accuracy, and the beam search accurately generates XML structures. We also discuss trade-offs of using the copy mechanisms by focusing on translation of numerical words and named entities. We further provide a detailed human analysis of gaps between the model output and human translations for real-world applications, including suitability for post-editing.

...read moreread less

Proceedings Article•DOI•

Taxonomy of Attacks on Web Based Applications

[...]

Ankit Singh, Aditi Sharma, Nikhil Sharma, Ila Kaushik¹, Bharat Bhushan - Show less +1 more•Institutions (1)

Krishna Institute of Engineering and Technology¹

05 Jul 2019

TL;DR: Various web-based attacks along with security measures to decrease the impact of attack in the network are discussed.

...read moreread less

Abstract: Computers are being reasonably important part of our day to day lives. Important information is being shared and received via web, which are now a day's very much vulnerable to attacks. Web based attacks are considered as most important aspect in breaching network security. Various applications of web comprise of health care, banking and e- business operations. Key elements of security include confidentiality, integrity, availability which must be preserved in all aspects. Some common attacks include Malformed XML, SQL Injection, XML Bomb, XPath Injection. During transmission of data attacker impersonates itself as an authorize user and try to conceal the confidential information. In this paper we will discuss various web-based attacks along with security measures to decrease the impact of attack in the network.

...read moreread less

Journal Article•DOI•

Educating the Information Integration Using Contextual Knowledge and Ontology Merging in Advanced Levels

[...]

Husnul Qodim, Herningsih Herningsih, Phong Thanh Nguyen, Quyen Le Hoang Thuy To Nguyen, Apriana Toding - Show less +1 more

23 Dec 2019-The International Journal of Higher Education

TL;DR: This paper defines the methods of educating the information integration by the use of ontologies, and uses various ontology languages like xml, rdf, daml+oil, owl etc.

...read moreread less

Abstract: This paper defines the methods of educating the information integration by the use of ontologies. For this there are two various architecture are central and peer-to-peer data integration. A ciis generally has a worldwide mapping, which gives the client a uniform interface to get to data put away in the information sources. Conversely, in piis, there are no worldwide purposes of control on the information sources. Such systems enable developers to develop an integrated hybrid contextual based system and new concepts to be introduced. This enables the retrival of the information is easier and faster. The two most significant methodologies for structure an information integration framework is global as view & local as view (lav). In the gav method, each substance in the worldwide pattern is related nearby outline. In this paper we use various ontology languages like xml, rdf, daml+oil, owl etc.

...read moreread less

Book Chapter•DOI•

A unified fuzzy ontology for distributed electronic health record semantic interoperability

[...]

Ebtsam Adel¹, Shaker El-Sappagh², Sherif Barakat¹, Mohammed Elmogy¹•Institutions (2)

Mansoura University¹, Banha University²

01 Jan 2019

TL;DR: A unified semantic interoperability framework for distributed EHR based on fuzzy ontology is proposed, which has many benefits and advantages over frameworks that rely on crisp ontology only and supports the idea of plug and play where any system with any structure can be integrated anonymously with existing systems without affecting the current working environment.

...read moreread less

Abstract: Electronic health records (EHR) provide efficient management of clinical information in any healthcare organization. It is a complete and longitudinal electronic registration of all occasions and data identified with the person's health status, from birth to death. Medical data are growing rapidly. These data are heterogeneous, distributed, and nonstructured. Each data element can have its schema, structure, standard, format, coding system, level of abstraction, and semantic. Medical personnel need to query the distributed EHR systems anonymously by using a single language. Combination and integration of the data are vital to recover the history of patients, to share information, and to elicit queries. Semantic interoperability provides a meaningful exchange and the use of clinical data between many healthcare systems. Physicians often send fuzzy questions to EHR systems and need answers from distributed systems. In this chapter, a unified semantic interoperability framework for distributed EHR based on fuzzy ontology is proposed. The framework architecture consists of three main layers. The lowest layer (local ontologies construction) stores the EHRs heterogeneous data with different database schemas, standards, terminologies, purposes, locations, and formats. The sources of this information may be different databases (e.g., MySQL, SqlServer, DB2, Access, and Oracle) in heterogeneous schemas, EHR standards, XML files, spreadsheet files, or archetype definition language (ADL) files. These different inputs are transformed into crisp ontology using a mediator (e.g., DB2OWL, X2OWL or ADL2OntoModule) suitable for each type. In the middle layer (global ontology construction), the local ontologies are mapped (using mapping algorithms or human experts with the help of common terminology vocabularies) to a crisp global one. The global reference ontology combines and integrates all local ontologies and therefore describes all data. Then this crisp ontology is converted to a unified fuzzy ontology. Finally, the third layer is the user interface in which a doctor or any specialist can ask any linguistic or semantic queries by dealing with only the global reference fuzzy ontology. That ontology is more dynamic and helps in understanding natural language deep medical queries. The result is a global and robust semantic interoperability technique. The proposed solution is based on a fuzzy ontology semantic to integrate different healthcare systems. That framework has many benefits and advantages over frameworks that rely on crisp ontology only, including: (1) it moves toward achieving full semantic interoperability of heterogeneous EHRs, (2) it supports the idea of plug and play where any system with any structure can be integrated anonymously with existing systems without affecting the current working environment, and (3) it is an expandable and designed in a modular way as it based on using ontologies and terminologies; the functionality of the proposed framework can be extended uniformly. We expect that our framework will handle the current EHR semantic interoperability challenges, reduce the cost of the integration process, and get a higher acceptance and accuracy rate than previous studies.

...read moreread less

Journal Article•DOI•

Transforming XML schemas into OWL ontologies using formal concept analysis

[...]

Mokhtaria Hacherouf¹, Safia Nait-Bahloul, Christophe Cruz²•Institutions (2)

University of Science and Technology of Oran Mohamed-Boudiaf¹, Centre national de la recherche scientifique²

01 Jun 2019-Software and Systems Modeling

TL;DR: A formal method to transform XSD schemas into OWL schemas using transformation patterns and a set of existing transformation patterns to allow the maximum transformation of XSD schema constructions is proposed.

...read moreread less

Abstract: Ontology Web Language (OWL) is considered as a data representation format exploited by the Extensible Markup Language (XML) format. OWL extends XML by providing properties to further express the semantics of data. To this effect, transforming XML data into OWL proves important and constitutes an added value for indexing XML documents and re-engineering ontologies. In this paper, we propose a formal method to transform XSD schemas into OWL schemas using transformation patterns. To achieve this end, we extend at the beginning, a set of existing transformation patterns to allow the maximum transformation of XSD schema constructions. In addition, a formal method is presented to transform an XSD schema using the extended patterns. This method named PIXCO comprises several processes. The first process models both the transformation patterns and all the constructions of XSD schema to be transformed. The patterns are modeled using the context of Formal Concept Analysis. The XSD constructions are modeled using a proposed mathematical model. This modeling will be used in the design of the following process. The second process identifies the most appropriate patterns to transform each construction set of XSD schema. The third process generates for each XSD construction set an OWL model according to the pattern that is identified. Finally, it creates the OWL file encompassing the generated OWL models.

...read moreread less

Collapse