
Showing papers on "XML published in 2022"


Journal ArticleDOI
TL;DR: The Spoken British National Corpus 2014 (SBNC 2014) as discussed by the authors is an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK.
Abstract: This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014’s creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented to attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.

27 citations


Journal ArticleDOI
TL;DR: In this article, a complete approach for integrating dynamic sets of heterogeneous datasets along the lines described above is presented, as well as the challenges faced to make such graphs useful, allow their integration to scale, and the solutions proposed for these problems.

13 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a case-based reasoning and reinforcement learning approach that analyses building-design data (provided in an Extensible Markup Language (XML) file or other compatible formats) and checks and validates the applicable regulations.
Abstract: Health care building design is a very difficult process in which a great number of parameters and variables must be considered, and the resulting building should meet the needs of the population it serves. Additionally, the design of this type of building and the related facilities involves a great number of regulations, which vary between countries. Thus, health care facilities must be designed according to numerous and complex regulations. Checking these regulations is very difficult and usually involves large teams of specialized engineers and architects. The proposed Case-Based Reasoning (CBR) and Reinforcement Learning approach can analyse the data about the building design (provided in an Extensible Markup Language (XML) file or other compatible formats), checking and validating the regulations. This approach reduces the need for specialized, highly qualified personnel by providing a report with the checked regulations and the traceability of warnings and faults in the application of the regulations.

12 citations


Journal ArticleDOI
TL;DR: This paper investigates the performances of an MMDBMS when used to store multidimensional data for OLAP analyses, and proposes and compares three logical solutions implemented on the PostgreSQL multi-model DBMS.

10 citations


Journal ArticleDOI
TL;DR: The PDFDataExtractor tool is presented, which can act as a plug-in to ChemDataExtractor and outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor.
Abstract: The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, document object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
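
As an illustration of the JSON output described above, the following sketch shows what such a metadata record could look like and how it might be consumed in Python; the key names mirror the fields listed in the abstract, but the exact schema of PDFDataExtractor's real output is an assumption here.

    import json

    # Hypothetical metadata record of the kind a PDF-extraction tool could emit;
    # the keys follow the fields listed in the abstract, not the actual tool schema.
    record = {
        "title": "Example article title",
        "authors": ["A. Author", "B. Author"],
        "affiliation": "Example University",
        "email": "a.author@example.org",
        "abstract": "Example abstract text ...",
        "keywords": ["chemistry", "text mining"],
        "journal": "Example Journal",
        "year": 2022,
        "doi": "10.0000/example.doi",
        "issue": "4",
        "references": ["Ref 1", "Ref 2"],
    }

    # Downstream code could load and filter such records, e.g. by year or journal.
    print(json.dumps(record, indent=2))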

8 citations


Journal ArticleDOI
01 Jul 2022-Sensors
TL;DR: A novel pipeline for XML imaging is described that uses radiomics data and Shapley values as tools to explain outcome predictions from complex prediction models built with medical imaging with well-defined predictors.
Abstract: Machine learning (ML) models have been shown to predict the presence of clinical factors from medical imaging with remarkable accuracy. However, these complex models can be difficult to interpret and are often criticized as “black boxes”. Prediction models that provide no insight into how their predictions are obtained are difficult to trust for making important clinical decisions, such as medical diagnoses or treatment. Explainable machine learning (XML) methods, such as Shapley values, have made it possible to explain the behavior of ML algorithms and to identify which predictors contribute most to a prediction. Incorporating XML methods into medical software tools has the potential to increase trust in ML-powered predictions and aid physicians in making medical decisions. Specifically, in the field of medical imaging analysis the most used methods for explaining deep learning-based model predictions are saliency maps that highlight important areas of an image. However, they do not provide a straightforward interpretation of which qualities of an image area are important. Here, we describe a novel pipeline for XML imaging that uses radiomics data and Shapley values as tools to explain outcome predictions from complex prediction models built with medical imaging with well-defined predictors. We present a visualization of XML imaging results in a clinician-focused dashboard that can be generalized to various settings. We demonstrate the use of this workflow for developing and explaining a prediction model using MRI data from glioma patients to predict a genetic mutation.
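
As a rough sketch of the kind of pipeline described above — tabular radiomics features feeding a tree-based model whose predictions are explained with Shapley values — the following Python example uses xgboost and the shap library on synthetic data. It is not the authors' pipeline, and the feature names are invented.

    import numpy as np
    import shap
    import xgboost

    # Synthetic stand-in for a radiomics feature table (rows = patients).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    feature_names = ["tumor_volume", "sphericity", "glcm_contrast", "mean_intensity"]  # invented
    y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    # Tree-based classifier predicting a binary outcome (e.g. mutation status).
    model = xgboost.XGBClassifier(n_estimators=100, max_depth=3)
    model.fit(X, y)

    # Shapley values attribute each prediction to individual radiomics features.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X, feature_names=feature_names)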

6 citations


Journal ArticleDOI
TL;DR: ProSight Annotator solves these issues by providing a graphical interface for adding user‐defined features to UniProt‐formatted XML files for better informed proteoform searches.
Abstract: The effectiveness of any proteomics database search depends on the theoretical candidate information contained in the protein database. Unfortunately, candidate entries from protein databases such as UniProt rarely contain all the post‐translational modifications (PTMs), disulfide bonds, or endogenous cleavages of interest to researchers. These omissions can limit discovery of novel and biologically important proteoforms. Conversely, searching for a specific proteoform becomes a computationally difficult task for heavily modified proteins. Both situations require updates to the database through user‐annotated entries. Unfortunately, manually creating properly formatted UniProt Extensible Markup Language (XML) files is tedious and prone to errors. ProSight Annotator solves these issues by providing a graphical interface for adding user‐defined features to UniProt‐formatted XML files for better informed proteoform searches. It can be downloaded from http://prosightannotator.northwestern.edu.
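
To give a flavour of the kind of edit the tool automates, the sketch below appends a user-defined modification feature to a UniProt-style XML entry using Python's standard library. The element and attribute names follow the publicly documented UniProt XML layout, but the snippet is illustrative only and is not ProSight Annotator's output.

    import xml.etree.ElementTree as ET

    NS = "http://uniprot.org/uniprot"
    ET.register_namespace("", NS)

    # Tiny stand-in for a UniProt-formatted entry (a real file would come from UniProt).
    ENTRY_XML = f'<uniprot xmlns="{NS}"><entry dataset="Swiss-Prot"><accession>P12345</accession></entry></uniprot>'
    root = ET.fromstring(ENTRY_XML)
    entry = root.find(f"{{{NS}}}entry")

    # Append a user-defined PTM as a <feature> element (illustrative values).
    feature = ET.SubElement(entry, f"{{{NS}}}feature",
                            {"type": "modified residue", "description": "Phosphoserine"})
    location = ET.SubElement(feature, f"{{{NS}}}location")
    ET.SubElement(location, f"{{{NS}}}position", {"position": "15"})

    ET.ElementTree(root).write("P12345_annotated.xml", xml_declaration=True, encoding="UTF-8")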

6 citations


Proceedings ArticleDOI
02 May 2022
TL;DR: In this article, the authors propose using Unreal Engine to create an interface that includes as-designed models obtained from the building information modeling (BIM) and as-built models corresponding to different steps during the construction.
Abstract: Digital twins (DTs) represent an emerging technology that allows interaction between assets and their virtual replicas and enclose geometry from modeling procedures and dynamism from AI. DTs serve different purposes, e.g., testing how devices behave under diverse conditions or monitoring processes and supporting improvement. However, until now, the use of DTs for monitoring constructions has been limited, as they are frequently used only as a high-quality 3D digital representation without connecting to other systems, dynamic analysis, or simulation. This work proposes creating a DT for monitoring the construction of a wind farm. It draws a comparison between the as-designed models (from the design phase) and the as-built models (that represent the actual construction at different times). As a result, the DT can help to control deviations that may occur during construction. The authors propose using Unreal Engine to create an interface that includes as-designed models obtained from the building information modeling (BIM) and as-built models corresponding to different steps during the construction. The result is a video game-type interactive application with a timeline tool that allows going through the construction stages recorded in the as-built models and comparing them to the as-designed model.

5 citations


Journal ArticleDOI
TL;DR: It is suggested that synoptic reporting in Pathology has a moderate positive impact on predicted patient survival, as adding an explainability layer to predictive ML models can open their black box, making them more accessible and easier to understand by the user.
Abstract: Machine learning (ML) models have proven to be an attractive alternative to traditional statistical methods in oncology. However, they are often regarded as black boxes, hindering their adoption for answering real-life clinical questions. In this paper, we show a practical application of explainable machine learning (XML). Specifically, we explored the effect that synoptic reporting (SR; i.e., reports where data elements are presented as discrete data items) in Pathology has on the survival of a population of 14,878 Dutch prostate cancer patients. We compared the performance of a Cox Proportional Hazards model (CPH) against that of an eXtreme Gradient Boosting model (XGB) in predicting patient ranked survival. We found that the XGB model (c-index = 0.67) performed significantly better than the CPH (c-index = 0.58). Moreover, we used Shapley Additive Explanations (SHAP) values to generate a quantitative mathematical representation of how features—including usage of SR—contributed to the models’ output. The XGB model in combination with SHAP visualizations revealed interesting interaction effects between SR and the rest of the most important features. These results hint that SR has a moderate positive impact on predicted patient survival. Moreover, adding an explainability layer to predictive ML models can open their black box, making them more accessible and easier to understand by the user. This can make XML-based techniques appealing alternatives to the classical methods used in oncological research and in health care in general.
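
A compressed sketch of the kind of comparison reported above — a Cox Proportional Hazards model versus a gradient-boosted model, both scored by concordance index — is shown below using lifelines and xgboost on synthetic data; the variables and settings are placeholders, not those of the study.

    import numpy as np
    import pandas as pd
    import xgboost
    from lifelines import CoxPHFitter
    from lifelines.utils import concordance_index

    # Synthetic survival data: one clinical feature plus a synoptic-reporting flag.
    rng = np.random.default_rng(1)
    n = 1000
    df = pd.DataFrame({
        "age": rng.normal(65, 8, n),
        "synoptic_report": rng.integers(0, 2, n),
    })
    df["time"] = rng.exponential(scale=np.exp(-0.02 * df["age"] + 0.3 * df["synoptic_report"]) * 60)
    df["event"] = rng.integers(0, 2, n)

    # Cox Proportional Hazards baseline.
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    print("CPH c-index:", cph.concordance_index_)

    # Gradient boosting with a Cox objective (labels: +time for events, -time for censored).
    y = np.where(df["event"] == 1, df["time"], -df["time"])
    xgb = xgboost.XGBRegressor(objective="survival:cox", n_estimators=200)
    xgb.fit(df[["age", "synoptic_report"]], y)
    risk = xgb.predict(df[["age", "synoptic_report"]])
    print("XGB c-index:", concordance_index(df["time"], -risk, df["event"]))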

5 citations


Journal ArticleDOI
TL;DR: The experimental results show that, compared with the traditional integration algorithm, the proposed algorithm better solves the problem of information islands among traditional ethnic sports culture resources at all levels and effectively maintains the stability of the storage environment of traditional ethnic sports culture resources while realizing the real-time interconnection of resources.
Abstract: Information islands exist among traditional national sports culture resources, preventing the efficient real-time interconnection of these resources. Therefore, this study proposes an intelligent integration algorithm for traditional national sports culture resources that uses big data to eliminate these information islands. Firstly, the complete data set is obtained by determining the time attenuation period of the weighted samples, and the mining parameters are based on the real values to realize in-depth mining of traditional ethnic sports culture resources. Then, the query set of big data is constructed based on the results of weakly associated data mining, and the query of weakly associated data is completed through data repair. Finally, XML technology is used to run the schema and build a resource integration model. The experimental results show that, compared with the traditional integration algorithm, the proposed algorithm better solves the problem of information islands among traditional ethnic sports culture resources at all levels and effectively maintains the stability of the storage environment of these resources while realizing their real-time interconnection.

Book ChapterDOI
01 Jan 2022
TL;DR: In this article, the authors propose a smart legal contract markup language (SLCML), an XML-based smart-contract language with pattern and transformation rules that automatically convert XML code to the Solidity language.
Abstract: Smart contracts are a means of facilitating, verifying and enforcing digital agreements. Blockchain technology, which includes an inherent consensus mechanism and programming languages, enables the concept of smart contracts. However, smart contracts written in an existing language, such as Solidity, Vyper, and others, are difficult for domain stakeholders and programmers to understand in order to develop code efficiently and without error, owing to a conceptual gap between the contractual provisions and the respective code. Our study addresses the problem by creating smart legal contract markup language (SLCML), an XML-based smart-contract language with pattern and transformation rules that automatically convert XML code to the Solidity language. In particular, we develop an XML schema (SLCML schema) that is used to instantiate any type of business contract understandable to IT and non-IT practitioners and is processed by computers. To reduce the effort and risk associated with smart contract development, we advocate a pattern for converting SLCML contracts to Solidity smart contracts, a smart contractual oriented computer language. We exemplify and assess our SLCML and transformation approach by defining a dairy supply chain contract based on real-world data.
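
The SLCML schema itself is not reproduced in the abstract, so the following is a purely hypothetical sketch of the transformation idea: a small XML contract description is turned into a Solidity skeleton by a Python script. Both the XML element names and the generated code are invented for illustration.

    import xml.etree.ElementTree as ET

    # Hypothetical XML contract description (not the actual SLCML schema).
    SLCML = """
    <contract name="DairySupply">
      <party role="buyer" account="0xBuyer"/>
      <party role="seller" account="0xSeller"/>
      <obligation name="payOnDelivery" amountWei="1000000"/>
    </contract>
    """

    root = ET.fromstring(SLCML)
    lines = [f"contract {root.get('name')} {{"]
    for party in root.findall("party"):
        lines.append(f"    address public {party.get('role')} = {party.get('account')};")
    for ob in root.findall("obligation"):
        lines.append(f"    uint public {ob.get('name')}Amount = {ob.get('amountWei')};")
    lines.append("}")
    print("\n".join(lines))  # a Solidity skeleton to be completed by hand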

Journal ArticleDOI
TL;DR: This study has provided a transparent and flexible solution to the difficulty of making the processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible.
Abstract: Background: During the COVID-19 pandemic, several methodologies were designed for obtaining electronic health record (EHR)-derived datasets for research. These processes are often based on black boxes, in which clinical researchers are unaware of how the data were recorded, extracted, and transformed. To solve this, it is essential that extract, transform, and load (ETL) processes are based on transparent, homogeneous, and formal methodologies, making them understandable, reproducible, and auditable. Objectives: This study aims to design and implement a methodology, in accordance with the FAIR Principles, for building ETL processes (focused on data extraction, selection, and transformation) for EHR reuse in a transparent and flexible manner, applicable to any clinical condition and health care organization. Methods: The proposed methodology comprises four stages: (1) analysis of secondary use models and identification of data operations, based on internationally used clinical repositories, case report forms, and aggregated datasets; (2) modeling and formalization of data operations, through the paradigm of the Detailed Clinical Models; (3) agnostic development of data operations, selecting SQL and R as programming languages; and (4) automation of the ETL instantiation, building a formal configuration file with XML. Results: First, four international projects were analyzed to identify 17 operations necessary to obtain datasets from the EHR according to the specifications of these projects. With this, each of the data operations was formalized, using the ISO 13606 reference model, specifying the valid data types as arguments, inputs and outputs, and their cardinality. Then, an agnostic catalog of data was developed using the previously selected data-oriented programming languages. Finally, an automated ETL instantiation process was built from a formally defined ETL configuration file. Conclusions: This study has provided a transparent and flexible solution to the difficulty of making the processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible. Moreover, the abstraction carried out in this study means that any previous EHR reuse methodology can incorporate these results.
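
The study's configuration format is not reproduced here, so the sketch below only illustrates the general idea of an XML-driven ETL instantiation: a configuration file listing data operations is parsed and dispatched to operation-specific code. The element names and operation set are hypothetical.

    import xml.etree.ElementTree as ET

    # Hypothetical ETL configuration (element names invented for illustration).
    CONFIG = """
    <etl>
      <operation type="extract" source="ehr_db" table="lab_results"/>
      <operation type="select" field="loinc_code" equals="2160-0"/>
      <operation type="transform" field="creatinine" from="mg/dL" to="umol/L"/>
    </etl>
    """

    def run_operation(op):
        # In a real pipeline each branch would call generated SQL or R code.
        print(f"running {op.get('type')} with", dict(op.attrib))

    for op in ET.fromstring(CONFIG).findall("operation"):
        run_operation(op)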

Journal ArticleDOI
TL;DR: An approach for constructing ontologies by mining deep semantics from eXtensible Markup Language (XML) Schemas and XML instance documents and an ontology population approach at the instance level based on the XML instance document is proposed.
Abstract: With the development of the Semantic Web and Artificial Intelligence techniques, ontology has become a very powerful way of representing not only knowledge but also their semantics. Therefore, how to construct ontologies from existing data sources has become an important research topic. In this paper, an approach for constructing ontologies by mining deep semantics from eXtensible Markup Language (XML) Schemas (including XML Schema 1.0 and XML Schema 1.1) and XML instance documents is proposed. Given an XML Schema and its corresponding XML instance document, 34 rules are first defined to mine deep semantics from the XML Schema. The mined semantics is formally stored in an intermediate conceptual model and then is used to generate an ontology at the conceptual level. Further, an ontology population approach at the instance level based on the XML instance document is proposed. Now, a complete ontology is formed. Also, some corresponding core algorithms are provided. Finally, a prototype system is implemented, which can automatically generate ontologies from XML Schemas and populate ontologies from XML instance documents. The paper also classifies and summarizes the existing work and makes a detailed comparison. Case studies on real XML data sets verify the effectiveness of the approach.
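
As a toy illustration of one mapping of the kind the paper defines (its 34 rules are far richer), the sketch below turns each named complexType of an XML Schema into an owl:Class using rdflib; the example schema and ontology namespace are invented.

    import xml.etree.ElementTree as ET
    from rdflib import Graph, Namespace, RDF
    from rdflib.namespace import OWL

    XSD_NS = "http://www.w3.org/2001/XMLSchema"

    # Tiny example schema (invented).
    SCHEMA = """
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="Book"/>
      <xs:complexType name="Author"/>
    </xs:schema>
    """

    ONT = Namespace("http://example.org/ontology#")   # invented ontology namespace
    g = Graph()

    # One simplified rule: named complexType -> owl:Class.
    for ct in ET.fromstring(SCHEMA).findall(f"{{{XSD_NS}}}complexType"):
        g.add((ONT[ct.get("name")], RDF.type, OWL.Class))

    print(g.serialize(format="turtle"))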

Journal ArticleDOI
TL;DR: Experimental results show that the correcting assistant system based on the Microsoft.

Journal ArticleDOI
TL;DR: In this article, the authors present the guidance to reporting European Union (EU) Member States and non-Member States in data transmission using extensible markup language (XML) data transfer covering the reporting of prevalence data on zoonoses and microbiological agents and contaminants in food, foodborne outbreak data, animal population data and disease status data.
Abstract: This technical report of the European Food Safety Authority (EFSA) presents the guidance to reporting European Union (EU) Member States and non-Member States in data transmission using extensible markup language (XML) data transfer covering the reporting of prevalence data on zoonoses and microbiological agents and contaminants in food, foodborne outbreak data, animal population data and disease status data. For data collection purposes, EFSA has created the Data Collection Framework (DCF) application. The present report provides data dictionaries to guide the reporting of information deriving from 2021 under the framework of Directive 2003/99/EC, Regulation (EU) 2017/625 and Commission Implementing Regulation (EU) 2019/627. The objective is to explain in detail the individual data elements that are included in the EFSA data models to be used for XML data transmission through the DCF. In particular, the data elements to be reported are explained, including information about the data type, a reference to the list of allowed terms and any additional business rule or requirement that may apply.

Journal ArticleDOI
TL;DR: Inishell as discussed by the authors is a C++/Qt tool that can populate a graphical user interface (GUI) based on an XML description of the required numerical model configuration elements (i.e., the data model of the configuration data).
Abstract: As numerical model developers, we have experienced first hand how most users struggle with the configuration of the models, leading to numerous support requests. Such issues are usually mitigated by offering a graphical user interface (GUI) that flattens the learning curve. Developing a GUI, however, requires a significant investment for the model developers, as well as a specific skill set. Moreover, this does not fit with the daily duties of model developers. As a consequence, when a GUI has been created – usually within a specific project and often relying on an intern – the maintenance either constitutes a major burden or is not performed. This also tends to limit the evolution of the numerical models themselves, since the model developers try to avoid having to change the GUI. In this paper we describe an approach based on an XML description of the required numerical model configuration elements (i.e., the data model of the configuration data) and a C++/Qt tool (Inishell) that populates a GUI based on this description on the fly. This makes the maintenance of the GUI very simple and enables users to easily get an up-to-date GUI for configuring the numerical model. The first version of this tool was written almost 10 years ago and showed that the concept works very well for our own surface process models. A full rewrite offering a more modern interface and extended capabilities is presented in this paper.
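
The general idea — a GUI populated on the fly from an XML description of configuration elements — can be sketched in a few lines of Python/Tkinter; the XML element names below are invented and do not reproduce Inishell's actual format.

    import tkinter as tk
    import xml.etree.ElementTree as ET

    # Hypothetical description of two configuration keys (not Inishell's real schema).
    DESCRIPTION = """
    <parameters>
      <parameter key="STATION_FILE" type="path" label="Station file"/>
      <parameter key="TIME_STEP" type="number" label="Time step (s)" default="3600"/>
    </parameters>
    """

    root = tk.Tk()
    entries = {}
    for i, param in enumerate(ET.fromstring(DESCRIPTION).findall("parameter")):
        tk.Label(root, text=param.get("label")).grid(row=i, column=0, sticky="w")
        entry = tk.Entry(root)
        entry.insert(0, param.get("default", ""))
        entry.grid(row=i, column=1)
        entries[param.get("key")] = entry   # read back later to write the configuration file

    root.mainloop()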

Journal ArticleDOI
TL;DR: The overall model design of the TCMAB bibliographic abstracts database system is proposed based on the construction process of the knowledge map, and a genetic algorithm and a BP neural network are used for knowledge mining and discovery.
Abstract: With the rapid development of modern science and Internet technology, the establishment of a unified and standardized bibliographic summary database to realize the exchange and sharing of ancient Chinese medicine bibliographic resources is the inevitable trend of digital services for ancient Chinese medicine bibliography. Firstly, this paper formulates the bibliographic metadata specification of traditional Chinese medicine ancient books (TCMAB) and extracts each cataloging file into an XML document in line with this specification. Secondly, this paper realizes the unified description of ancient book resources in the TCMAB database system, uses the native XML database eXist to store and manage the XML documents of all traditional Chinese medicine ancient book resources, and integrates the multimedia data with the XML data. Finally, a genetic algorithm and a BP neural network are used for knowledge mining and discovery, and the overall model design of the TCMAB bibliographic abstracts database system is proposed based on the construction process of the knowledge map. The system platform adopts the B/S mode, the eXist database management system, and PowerSSP streaming media and video servers for audio and video processing.
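
For readers unfamiliar with eXist, the snippet below sketches how a stored collection of such XML bibliographic records might be queried over eXist's REST interface from Python; the collection path, element names, and query are invented, and the URL and parameter conventions should be checked against the eXist version in use.

    import requests

    # Hypothetical XQuery against an eXist collection of TCM ancient-book records;
    # the element names ("record", "title") and the collection path are invented.
    xquery = """
    for $r in collection('/db/tcmab')//record
    where contains($r/title, '本草')
    return $r/title
    """

    resp = requests.get(
        "http://localhost:8080/exist/rest/db/tcmab",   # eXist REST endpoint (default port assumed)
        params={"_query": xquery, "_howmany": 20},
        auth=("admin", ""),
    )
    print(resp.text)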

Journal ArticleDOI
TL;DR: By capitalizing on the rich corpus of the 870K Sysmon logs collected, an extensible Python .evtx file analyzer, dubbed PeX, is created and evaluated, which can be used towards automatizing the parsing and scrutiny of such voluminous files.
Abstract: This work attempts to answer in a clear way the following key questions regarding the optimal initialization of the Sysmon tool for the identification of Lateral Movement in the MS Windows ecosystem. First, from an expert’s standpoint and with reference to the relevant literature, what are the criteria for determining the possibly optimal initialization features of the Sysmon event monitoring tool, which are also applicable as custom rules within the config.xml configuration file? Second, based on the identified features, how can a functional configuration file, able to identify as many LM variants as possible, be generated? To answer these questions, we relied on the MITRE ATT&CK knowledge base of adversary tactics and techniques and focused on the execution of the nine commonest LM methods. The conducted experiments, performed on a properly configured testbed, suggested a great number of interrelated networking features that were implemented as custom rules in the Sysmon’s config.xml file. Moreover, by capitalizing on the rich corpus of the 870K Sysmon logs collected, we created and evaluated, in terms of TP and FP rates, an extensible Python .evtx file analyzer, dubbed PeX, which can be used towards automatizing the parsing and scrutiny of such voluminous files. Both the .evtx logs dataset and the developed PeX tool are provided publicly for further propelling future research in this interesting and rapidly evolving field.
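
To give a flavour of what such custom rules look like, the snippet below writes a minimal Sysmon configuration that logs network connections to a few ports commonly associated with lateral movement (SMB, RDP, WinRM); the port selection and schema version are illustrative and do not reproduce the paper's rule set.

    # Minimal, illustrative Sysmon config.xml (not the paper's rule set).
    CONFIG = """<Sysmon schemaversion="4.90">
      <EventFiltering>
        <RuleGroup name="lateral-movement-ports" groupRelation="or">
          <NetworkConnect onmatch="include">
            <DestinationPort condition="is">445</DestinationPort>
            <DestinationPort condition="is">3389</DestinationPort>
            <DestinationPort condition="is">5985</DestinationPort>
          </NetworkConnect>
        </RuleGroup>
      </EventFiltering>
    </Sysmon>
    """

    with open("config.xml", "w", encoding="utf-8") as f:
        f.write(CONFIG)
    # Applied on the monitored host with: sysmon64 -c config.xml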

Journal ArticleDOI
TL;DR: In this article, the authors proposed an integrated IDM and MVD development method using bSDD as a lexicon based on three international standards (ISO 12006-3, ISO 16739-1, and ISO 29481-3).

Journal ArticleDOI
TL;DR: In this article, the authors present a technical report aimed at guiding the reporting of data on analytical test results, and related metadata, to EFSA in the context of the activities for the surveillance and monitoring of African Swine Fever.
Abstract: This technical report is aimed at guiding the reporting of data on analytical test results, and related metadata, to EFSA in the context of the activities for the surveillance and monitoring of African Swine Fever. The objective is to explain in detail the individual data elements that are included in the EFSA Standard Sample Description version 2 (SSD2) data model. The guidance is intended to support the reporting countries in data transmission using eXtensible Markup Language (XML) data file transfer through the Data Collection Framework (DCF) according to the protocol described in the EFSA Guidance on Data Exchange version 2 (GDE2). The data elements are explained, including information about data type, list of allowed terms and associated business rules. Instructions about how to report common sampling schemes are also provided to ensure harmonised reporting among the countries.

Journal ArticleDOI
TL;DR: Xia et al. as mentioned in this paper proposed a framework named XML2HBase to address the problem of storing and querying large collections of small XML documents using HBase, a widely deployed NoSQL database.

Proceedings ArticleDOI
04 Oct 2022
TL;DR: In this article, the authors proposed a unified vocabulary for Official Gazettes' procurement data; a systematic literature mapping was conducted to discover what types of notices are published in Official Gazettes and which structured formats and patterns are used.
Abstract: The heterogeneity of notices, the poor quality or the total absence of metadata, and the lack of format standardization make it difficult to unlock the total potential value of open data published in Official Gazettes, inhibiting civic engagement. To address these issues, we propose a unified vocabulary for Official Gazettes’ procurement data. A systematic literature mapping was conducted to discover what types of notices are published in Official Gazettes and which structured formats and patterns are used. We also analyzed the publication patterns of 16 Official Gazettes of different countries, extracting the concepts used in the unified vocabulary. Official Gazettes publish laws, resolutions, announcements, financial statements, government biddings, and contracts. In this study, we focused specifically on procurement data, i.e., biddings and contracts. For these official notices, an XML publication pattern is proposed. Using common publication standards improves the interoperability and integration of Official Gazettes and Open Government Data portals, allows the systematic analysis of public procurement data, and leads to increased control of governments by society.
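
The proposed publication pattern is not reproduced in the abstract, so the fragment below is a hypothetical example of what a structured bidding notice could look like, parsed with Python's standard library.

    import xml.etree.ElementTree as ET

    # Hypothetical procurement notice (element names invented, not the proposed pattern).
    NOTICE = """
    <notice type="bidding">
      <gazette country="BR" date="2022-10-04"/>
      <buyer name="Municipality of Example"/>
      <object>Acquisition of school meals</object>
      <estimatedValue currency="BRL">150000.00</estimatedValue>
      <openingDate>2022-11-01</openingDate>
    </notice>
    """

    n = ET.fromstring(NOTICE)
    print(n.get("type"), "-", n.findtext("object"), "-",
          n.find("estimatedValue").text, n.find("estimatedValue").get("currency"))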

Proceedings ArticleDOI
01 May 2022
TL;DR: An evaluation of multiple XML techniques found that, other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%-10% in terms of F1-scores on Top-k (k=1,2,3) predictions, with significant improvements in both the training and prediction time of these XML models.
Abstract: Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify vulnerable libraries used by an application. A key challenge is the identification of libraries related to a given reported vulnerability in the National Vulnerability Database (NVD), which may not explicitly indicate the affected libraries. Recently, researchers have tried to address the problem of identifying the libraries from an NVD report by treating it as an extreme multi-label learning (XML) problem, characterized by its large number of possible labels and severe data sparsity. As input, the NVD report is provided, and as output, a set of relevant libraries is returned. In this work, we evaluated multiple XML techniques. While previous work only evaluated a traditional XML technique, FastXML, we trained four other traditional XML models (DiSMEC, Parabel, Bonsai, ExtremeText) as well as two deep learning-based models (XML-CNN and LightXML). We compared both their effectiveness and the time cost of training and using the models for predictions. We find that other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%-10% in terms of F1-scores on Top-k (k=1,2,3) predictions. Furthermore, we observe significant improvements in both the training and prediction time of these XML models, with the Bonsai and Parabel models achieving 627x and 589x faster training time and 12x faster prediction time than the FastXML baseline. We discuss the implications of our experimental results and highlight limitations for future work to address.
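
To make the task framing concrete, the sketch below poses library identification as multi-label text classification with a simple TF-IDF and one-vs-rest baseline from scikit-learn; this is a toy stand-in, not one of the XML models (FastXML, Parabel, Bonsai, and so on) evaluated in the paper, and the example data are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    # Invented NVD-style descriptions and their affected libraries.
    reports = [
        "Deserialization of untrusted data in the logging component allows remote code execution.",
        "Improper input validation in the XML parser leads to denial of service.",
    ]
    labels = [["log4j-core"], ["xerces"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
    clf.fit(reports, Y)

    pred = clf.predict(["Remote code execution via crafted log message deserialization."])
    print(mlb.inverse_transform(pred))  # predicted set of affected libraries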

Journal ArticleDOI
TL;DR: In this paper, the authors proposed an efficient distortion-free watermarking of large-scale data sets in various formats by exploiting the power of parallel and distributed computing environment, in particular, MapReduce, Pig and Hive paradigms for the data in CSV, XML and JSON formats.
Abstract: In this paper, we propose an efficient distortion-free watermarking of large-scale data sets in various formats by exploiting the power of parallel and distributed computing environment. In particular, we adapt MapReduce, Pig and Hive paradigms for the data in CSV, XML and JSON formats by identifying key computational steps involved in the sequential watermarking algorithms. Following this, we design a middleware which allows watermark generation and verification (under any computing paradigm of user’s choice) of large-scale data sets (in any suitable format of user’s interest) and their conversion without affecting the watermark. The experimental evaluation on large-scale benchmark data sets shows a significant reduction of watermark generation and verification times. Interestingly, in case of XML and JSON formats, Pig and Hive outperform the MapReduce paradigm, whereas MapReduce shows better performance in case of CSV format. To the best of our knowledge, this is the first proposal towards large-scale data sets watermarking, considering popular distributed computing paradigms and data formats.

Proceedings ArticleDOI
20 Jan 2022
TL;DR: QuickJEDI is developed, an algorithm that computes JEDI by leveraging a new technique to prune expensive sibling matchings and it outperforms a baseline algorithm by an order of magnitude in runtime.
Abstract: The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold τ, retrieve all documents that are within τ from the query document. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document which poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON. We develop QuickJEDI, an algorithm that computes JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and an effective upper bound based on tree sorting. Our upper bound algorithm runs in O(nτ) time and O(n+τ log n) space, which substantially improves the previous best bound of O(n²) time and O(n log n) space (where n is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.
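
The paper's JSON tree is a lossless tree representation of a document; without reproducing its exact definition, the sketch below shows one straightforward way to flatten a JSON value into labelled nodes (distinguishing objects, arrays, keys, and literals), which is the kind of structure an edit-distance computation would operate on. The labelling scheme is an assumption made for illustration.

    import json

    def to_tree(value):
        """Convert a parsed JSON value into (label, children) tuples (illustrative scheme)."""
        if isinstance(value, dict):        # unordered collection of key: value pairs
            return ("{}", [(k, [to_tree(v)]) for k, v in value.items()])
        if isinstance(value, list):        # ordered collection
            return ("[]", [to_tree(v) for v in value])
        return (json.dumps(value), [])     # literal leaf (string, number, bool, null)

    doc = json.loads('{"name": "alice", "tags": ["a", "b"], "age": 30}')
    print(to_tree(doc))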

Journal ArticleDOI
Luyi Bai
TL;DR: The authors propose the concept of query time series (QTS) according to the data revision degree to analyze the logical relationship between keywords of spatiotemporal data; a calculation method of keyword similarity is proposed, and the best parameter of the method is found through experiments.
Abstract: With the increasing popularity of XML for data representations, there is a lot of interest in keyword query on XML. Many algorithms have been proposed for XML keyword queries. But the existing approaches fall short in their abilities to analyze the logical relationship between keywords of spatiotemporal data. To overcome this limitation, in this paper, we firstly propose the concept of query time series (QTS) according to the data revision degree. For the logical relationship of keywords in QTS, we study the intra-coupling logic relationship and the inter-coupling logic relationship separately. Then a calculation method of keyword similarity is proposed and the best parameter in the method is found through experiment. Finally, we compare this method with others. Experimental results show that our method is superior to previous approaches.

Journal ArticleDOI
TL;DR: A novel method is also proposed to create a dataset from scratch using a predefined structure filled with predefined keywords, creating unique combinations for the training dataset.
Abstract: There are many different programming languages, and each has its own structure or way of writing code, so it becomes difficult to learn and frequently switch between different programming languages. For this reason, a person working with multiple programming languages needs to consult documentation frequently, which costs time and effort. In the past few years, there has been a significant increase in the number of papers published on this topic, each providing a unique solution to the problem. Many of these papers are based on applying NLP concepts in unique configurations to get the desired results. Some have used AI along with NLP to train a system to generate source code in a specific language, and some have trained the AI directly without pre-processing the dataset with NLP. All of these papers face two problems: a lack of a proper dataset for this particular application, and the fact that each approach can convert natural language into source code for only one specified programming language. The proposed system shows that a language-independent solution is a feasible alternative for writing source code without having full knowledge of a programming language. The proposed system uses Natural Language Processing to convert natural language into programming-language-independent pseudo code using custom Named Entity Recognition and saves it in XML (eXtensible Markup Language) format as an intermediate step. Then, using traditional programming, the system converts the generated pseudo code into programming-language-dependent source code. In this paper, another novel method is proposed to create a dataset from scratch using a predefined structure filled with predefined keywords, creating unique combinations for the training dataset. Keywords—Natural Language Processing (NLP); Natural Language Interface (NLI); Entity Recognition (ER); Artificial Intelligence (AI); source code generation; pseudocode generation
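
The intermediate XML pseudo-code format is not specified in the abstract, so the following is a hypothetical sketch of the second stage only: a small language-independent XML description is turned into Python source with the standard library. All element names are invented.

    import xml.etree.ElementTree as ET

    # Hypothetical language-independent pseudo code (invented schema).
    PSEUDO = """
    <function name="greet">
      <param name="user"/>
      <print>
        <concat><literal>Hello, </literal><var name="user"/></concat>
      </print>
    </function>
    """

    fn = ET.fromstring(PSEUDO)
    params = ", ".join(p.get("name") for p in fn.findall("param"))
    concat = fn.find("./print/concat")
    expr = " + ".join(
        repr(part.text) if part.tag == "literal" else part.get("name")
        for part in concat
    )
    print(f"def {fn.get('name')}({params}):\n    print({expr})")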

Journal ArticleDOI
TL;DR: A general framework for building a scalable search engine, based on a music description language that represents music content independently from a specific encoding, an extendible list of feature-extraction functions, and indexing, searching, and ranking procedures designed to be integrated into the standard architecture of a text-oriented search engine is proposed.
Abstract: We address the problem of scalable content-based search in large collections of music documents. Music content is highly complex and versatile and presents multiple facets that can be considered independently or in combination. Moreover, music documents can be digitally encoded in many ways. We propose a general framework for building a scalable search engine, based on (i) a music description language that represents music content independently from a specific encoding, (ii) an extendible list of feature-extraction functions, and (iii) indexing, searching, and ranking procedures designed to be integrated into the standard architecture of a text-oriented search engine. As a proof of concept, we also detail an actual implementation of the framework for searching in large collections of XML-encoded music scores, based on the popular ElasticSearch system. It is released as open-source in GitHub, and available as a ready-to-use Docker image for communities that manage large collections of digitized music documents.
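
Since the engine builds on ElasticSearch, the kind of query its text-oriented architecture supports can be sketched with a plain REST call; the index name and the field (an invented melodic-interval encoding standing in for an extracted feature) are not the project's actual schema.

    import requests

    # Hypothetical search over an index of extracted music features
    # (index and field names are invented).
    query = {
        "query": {
            "match": {
                "melodic_intervals": "+2 +2 -4 +2"   # an example feature string
            }
        },
        "size": 10,
    }

    resp = requests.post("http://localhost:9200/music-scores/_search", json=query)
    for hit in resp.json().get("hits", {}).get("hits", []):
        print(hit["_score"], hit["_source"].get("title"))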

Proceedings ArticleDOI
05 Mar 2022
TL;DR: FPP as mentioned in this paper is a new open-source modeling language for F Prime, a flight software framework developed at JPL and deployed, among other places, on NASA's Mars helicopter Ingenuity; it has a succinct and readable syntax, a well-defined semantics, and robust error checking and reporting.
Abstract: We present F Prime Prime (FPP), a new open-source modeling language for F Prime. F Prime is an open-source flight software framework developed at JPL and deployed, among other places, on the Mars helicopter Ingenuity. FPP provides a convenient way to model the architectural elements of an F Prime application, e.g., components, ports, and their connections. It has a succinct and readable syntax, a well-defined semantics, and robust error checking and reporting. The FPP tool suite, written in Scala, analyzes FPP models, reports errors, and translates correct FPP models to a combination of XML and C++. Existing F Prime tools translate the XML to a partial implementation in C++, to be completed by the developers. The model elements have clean interfaces and are highly reusable. An accompanying visualization tool constructs diagrams of components and connections that FSW developers can use to understand and communicate their designs, for example at reviews. We discuss the design and implementation of FPP and the integration of FPP into F Prime. We also discuss our experience using FPP to construct F Prime models. Finally, we discuss our plans for future work, including improved code generation, improved visualization, and more advanced analysis capabilities.