
Showing papers on "XML published in 2022"


Journal ArticleDOI
TL;DR: The Spoken British National Corpus 2014 (SBNC 2014) as discussed by the authors is an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK.
Abstract: This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014’s creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented to attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.

27 citations


Journal ArticleDOI
TL;DR: In this article, a complete approach for integrating dynamic sets of heterogeneous datasets along the lines described above is presented, as well as the challenges faced to make such graphs useful, allow their integration to scale, and the solutions proposed for these problems.

13 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a case-based reasoning and reinforcement learning approach that analyses building-design data (provided in an Extensible Markup Language (XML) file or other compatible formats) and checks and validates the applicable regulations.
Abstract: Health care building design is a very difficult process in which a great number of parameters and variables must be considered, and the resulting building should meet the needs of the population it serves. Additionally, the design of this type of building and the related facilities involves a great number of regulations, which vary between countries. Thus, health care facilities must be designed according to numerous and complex regulations. Checking these regulations is very difficult and usually involves large teams of specialized engineers and architects. The proposed Case-Based Reasoning (CBR) and Reinforcement Learning approach can analyse the data about the building design (provided in an Extensible Markup Language (XML) file or other compatible formats), checking and validating the regulations. This approach reduces the need for specialized, highly qualified personnel by providing a report with the checked regulations and the traceability of warnings and faults in the application of the regulations.

12 citations


Journal ArticleDOI
TL;DR: This paper investigates the performances of an MMDBMS when used to store multidimensional data for OLAP analyses, and proposes and compares three logical solutions implemented on the PostgreSQL multi-model DBMS.

10 citations


Journal ArticleDOI
TL;DR: The PDFDataExtractor tool is presented, which can act as a plug-in to ChemDataExtractor and outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor.
Abstract: The layout of portable document format (PDF) files is constant to any screen, and the metadata therein are latent, compared to mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical-named entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are much improved. The system features a template-based architecture. This enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of articles. While other existing PDF-extracting tools focus on quantity mining, this template-based system is more focused on quality mining on different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, document object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.
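
As an illustration of the JSON output described above, the following sketch shows what such a metadata record could look like and how it might be consumed in Python; the key names mirror the fields listed in the abstract, but the exact schema of PDFDataExtractor's real output is an assumption here.

    import json

    # Hypothetical metadata record of the kind a PDF-extraction tool could emit;
    # the keys follow the fields listed in the abstract, not the actual tool schema.
    record = {
        "title": "Example article title",
        "authors": ["A. Author", "B. Author"],
        "affiliation": "Example University",
        "email": "a.author@example.org",
        "abstract": "Example abstract text ...",
        "keywords": ["chemistry", "text mining"],
        "journal": "Example Journal",
        "year": 2022,
        "doi": "10.0000/example.doi",
        "issue": "4",
        "references": ["Ref 1", "Ref 2"],
    }

    # Downstream code could load and filter such records, e.g. by year or journal.
    print(json.dumps(record, indent=2))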

8 citations


Journal ArticleDOI
01 Jul 2022-Sensors
TL;DR: A novel pipeline for XML imaging is described that uses radiomics data and Shapley values as tools to explain outcome predictions from complex prediction models built with medical imaging with well-defined predictors.
Abstract: Machine learning (ML) models have been shown to predict the presence of clinical factors from medical imaging with remarkable accuracy. However, these complex models can be difficult to interpret and are often criticized as “black boxes”. Prediction models that provide no insight into how their predictions are obtained are difficult to trust for making important clinical decisions, such as medical diagnoses or treatment. Explainable machine learning (XML) methods, such as Shapley values, have made it possible to explain the behavior of ML algorithms and to identify which predictors contribute most to a prediction. Incorporating XML methods into medical software tools has the potential to increase trust in ML-powered predictions and aid physicians in making medical decisions. Specifically, in the field of medical imaging analysis the most used methods for explaining deep learning-based model predictions are saliency maps that highlight important areas of an image. However, they do not provide a straightforward interpretation of which qualities of an image area are important. Here, we describe a novel pipeline for XML imaging that uses radiomics data and Shapley values as tools to explain outcome predictions from complex prediction models built with medical imaging with well-defined predictors. We present a visualization of XML imaging results in a clinician-focused dashboard that can be generalized to various settings. We demonstrate the use of this workflow for developing and explaining a prediction model using MRI data from glioma patients to predict a genetic mutation.
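
As a rough sketch of the kind of pipeline described above — tabular radiomics features feeding a tree-based model whose predictions are explained with Shapley values — the following Python example uses xgboost and the shap library on synthetic data. It is not the authors' pipeline, and the feature names are invented.

    import numpy as np
    import shap
    import xgboost

    # Synthetic stand-in for a radiomics feature table (rows = patients).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    feature_names = ["tumor_volume", "sphericity", "glcm_contrast", "mean_intensity"]  # invented
    y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

    # Tree-based classifier predicting a binary outcome (e.g. mutation status).
    model = xgboost.XGBClassifier(n_estimators=100, max_depth=3)
    model.fit(X, y)

    # Shapley values attribute each prediction to individual radiomics features.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    shap.summary_plot(shap_values, X, feature_names=feature_names)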

6 citations


Journal ArticleDOI
TL;DR: ProSight Annotator solves these issues by providing a graphical interface for adding user‐defined features to UniProt‐formatted XML files for better informed proteoform searches.
Abstract: The effectiveness of any proteomics database search depends on the theoretical candidate information contained in the protein database. Unfortunately, candidate entries from protein databases such as UniProt rarely contain all the post‐translational modifications (PTMs), disulfide bonds, or endogenous cleavages of interest to researchers. These omissions can limit discovery of novel and biologically important proteoforms. Conversely, searching for a specific proteoform becomes a computationally difficult task for heavily modified proteins. Both situations require updates to the database through user‐annotated entries. Unfortunately, manually creating properly formatted UniProt Extensible Markup Language (XML) files is tedious and prone to errors. ProSight Annotator solves these issues by providing a graphical interface for adding user‐defined features to UniProt‐formatted XML files for better informed proteoform searches. It can be downloaded from http://prosightannotator.northwestern.edu.
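
To give a flavour of the kind of edit the tool automates, the sketch below appends a user-defined modification feature to a UniProt-style XML entry using Python's standard library. The element and attribute names follow the publicly documented UniProt XML layout, but the snippet is illustrative only and is not ProSight Annotator's output.

    import xml.etree.ElementTree as ET

    NS = "http://uniprot.org/uniprot"
    ET.register_namespace("", NS)

    # Tiny stand-in for a UniProt-formatted entry (a real file would come from UniProt).
    ENTRY_XML = f'<uniprot xmlns="{NS}"><entry dataset="Swiss-Prot"><accession>P12345</accession></entry></uniprot>'
    root = ET.fromstring(ENTRY_XML)
    entry = root.find(f"{{{NS}}}entry")

    # Append a user-defined PTM as a <feature> element (illustrative values).
    feature = ET.SubElement(entry, f"{{{NS}}}feature",
                            {"type": "modified residue", "description": "Phosphoserine"})
    location = ET.SubElement(feature, f"{{{NS}}}location")
    ET.SubElement(location, f"{{{NS}}}position", {"position": "15"})

    ET.ElementTree(root).write("P12345_annotated.xml", xml_declaration=True, encoding="UTF-8")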

6 citations


Proceedings ArticleDOI
02 May 2022
TL;DR: In this article, the authors propose using Unreal Engine to create an interface that includes as-designed models obtained from the building information modeling (BIM) and as-built models corresponding to different steps during the construction.
Abstract: Digital twins (DTs) represent an emerging technology that allows interaction between assets and their virtual replicas and enclose geometry from modeling procedures and dynamism from AI. DTs serve different purposes, e.g., testing how devices behave under diverse conditions or monitoring processes and supporting improvement. However, until now, the use of DTs for monitoring constructions has been limited, as they are frequently used only as a high-quality 3D digital representation without connecting to other systems, dynamic analysis, or simulation. This work proposes creating a DT for monitoring the construction of a wind farm. It draws a comparison between the as-designed models (from the design phase) and the as-built models (that represent the actual construction at different times). As a result, the DT can help to control deviations that may occur during construction. The authors propose using Unreal Engine to create an interface that includes as-designed models obtained from the building information modeling (BIM) and as-built models corresponding to different steps during the construction. The result is a video game-type interactive application with a timeline tool that allows going through the construction stages recorded in the as-built models and comparing them to the as-designed model.

5 citations


Journal ArticleDOI
TL;DR: It is suggested that synoptic reporting in Pathology has a moderate positive impact on predicted patient survival, as adding an explainability layer to predictive ML models can open their black box, making them more accessible and easier to understand by the user.
Abstract: Machine learning (ML) models have proven to be an attractive alternative to traditional statistical methods in oncology. However, they are often regarded as black boxes, hindering their adoption for answering real-life clinical questions. In this paper, we show a practical application of explainable machine learning (XML). Specifically, we explored the effect that synoptic reporting (SR; i.e., reports where data elements are presented as discrete data items) in Pathology has on the survival of a population of 14,878 Dutch prostate cancer patients. We compared the performance of a Cox Proportional Hazards model (CPH) against that of an eXtreme Gradient Boosting model (XGB) in predicting patient ranked survival. We found that the XGB model (c-index = 0.67) performed significantly better than the CPH (c-index = 0.58). Moreover, we used Shapley Additive Explanations (SHAP) values to generate a quantitative mathematical representation of how features—including usage of SR—contributed to the models’ output. The XGB model in combination with SHAP visualizations revealed interesting interaction effects between SR and the rest of the most important features. These results hint that SR has a moderate positive impact on predicted patient survival. Moreover, adding an explainability layer to predictive ML models can open their black box, making them more accessible and easier to understand by the user. This can make XML-based techniques appealing alternatives to the classical methods used in oncological research and in health care in general.
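
A compressed sketch of the kind of comparison reported above — a Cox Proportional Hazards model versus a gradient-boosted model, both scored by concordance index — is shown below using lifelines and xgboost on synthetic data; the variables and settings are placeholders, not those of the study.

    import numpy as np
    import pandas as pd
    import xgboost
    from lifelines import CoxPHFitter
    from lifelines.utils import concordance_index

    # Synthetic survival data: one clinical feature plus a synoptic-reporting flag.
    rng = np.random.default_rng(1)
    n = 1000
    df = pd.DataFrame({
        "age": rng.normal(65, 8, n),
        "synoptic_report": rng.integers(0, 2, n),
    })
    df["time"] = rng.exponential(scale=np.exp(-0.02 * df["age"] + 0.3 * df["synoptic_report"]) * 60)
    df["event"] = rng.integers(0, 2, n)

    # Cox Proportional Hazards baseline.
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    print("CPH c-index:", cph.concordance_index_)

    # Gradient boosting with a Cox objective (labels: +time for events, -time for censored).
    y = np.where(df["event"] == 1, df["time"], -df["time"])
    xgb = xgboost.XGBRegressor(objective="survival:cox", n_estimators=200)
    xgb.fit(df[["age", "synoptic_report"]], y)
    risk = xgb.predict(df[["age", "synoptic_report"]])
    print("XGB c-index:", concordance_index(df["time"], -risk, df["event"]))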

5 citations


Journal ArticleDOI
TL;DR: The experimental results show that, compared with the traditional integration algorithm, the proposed algorithm better solves the problem of information islands among traditional ethnic sports culture resources at all levels and effectively maintains the stability of the storage environment of traditional ethnic sports culture resources while realizing the real-time interconnection of resources.
Abstract: Information islands exist among traditional national sports culture resources, preventing the efficient real-time interconnection of these resources. Therefore, this study proposes an intelligent integration algorithm for traditional national sports culture resources that uses big data to eliminate these information islands. Firstly, the complete data set is obtained by determining the time attenuation period of the weighted samples, and the mining parameters are based on the real values to realize in-depth mining of traditional ethnic sports culture resources. Then, the query set of big data is constructed based on the results of weakly associated data mining, and the query of weakly associated data is completed through data repair. Finally, XML technology is used to run the schema and build a resource integration model. The experimental results show that, compared with the traditional integration algorithm, the proposed algorithm better solves the problem of information islands among traditional ethnic sports culture resources at all levels and effectively maintains the stability of the storage environment of these resources while realizing their real-time interconnection.

Book ChapterDOI
01 Jan 2022
TL;DR: In this article, the authors propose a smart legal contract markup language (SLCML), an XML-based smart-contract language with pattern and transformation rules that automatically convert XML code to the Solidity language.
Abstract: Smart contracts are a means of facilitating, verifying and enforcing digital agreements. Blockchain technology, which includes an inherent consensus mechanism and programming languages, enables the concept of smart contracts. However, smart contracts written in an existing language, such as Solidity, Vyper, and others, are difficult for domain stakeholders and programmers to understand in order to develop code efficiently and without error, owing to a conceptual gap between the contractual provisions and the respective code. Our study addresses the problem by creating smart legal contract markup language (SLCML), an XML-based smart-contract language with pattern and transformation rules that automatically convert XML code to the Solidity language. In particular, we develop an XML schema (SLCML schema) that is used to instantiate any type of business contract understandable to IT and non-IT practitioners and is processed by computers. To reduce the effort and risk associated with smart contract development, we advocate a pattern for converting SLCML contracts to Solidity smart contracts, a smart contractual oriented computer language. We exemplify and assess our SLCML and transformation approach by defining a dairy supply chain contract based on real-world data.
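
The SLCML schema itself is not reproduced in the abstract, so the following is a purely hypothetical sketch of the transformation idea: a small XML contract description is turned into a Solidity skeleton by a Python script. Both the XML element names and the generated code are invented for illustration.

    import xml.etree.ElementTree as ET

    # Hypothetical XML contract description (not the actual SLCML schema).
    SLCML = """
    <contract name="DairySupply">
      <party role="buyer" account="0xBuyer"/>
      <party role="seller" account="0xSeller"/>
      <obligation name="payOnDelivery" amountWei="1000000"/>
    </contract>
    """

    root = ET.fromstring(SLCML)
    lines = [f"contract {root.get('name')} {{"]
    for party in root.findall("party"):
        lines.append(f"    address public {party.get('role')} = {party.get('account')};")
    for ob in root.findall("obligation"):
        lines.append(f"    uint public {ob.get('name')}Amount = {ob.get('amountWei')};")
    lines.append("}")
    print("\n".join(lines))  # a Solidity skeleton to be completed by hand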

Journal ArticleDOI
TL;DR: This study has provided a transparent and flexible solution to the difficulty of making the processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible.
Abstract: Background: During the COVID-19 pandemic, several methodologies were designed for obtaining electronic health record (EHR)-derived datasets for research. These processes are often based on black boxes, in which clinical researchers are unaware of how the data were recorded, extracted, and transformed. To solve this, it is essential that extract, transform, and load (ETL) processes are based on transparent, homogeneous, and formal methodologies, making them understandable, reproducible, and auditable. Objectives: This study aims to design and implement a methodology, in accordance with the FAIR Principles, for building ETL processes (focused on data extraction, selection, and transformation) for EHR reuse in a transparent and flexible manner, applicable to any clinical condition and health care organization. Methods: The proposed methodology comprises four stages: (1) analysis of secondary use models and identification of data operations, based on internationally used clinical repositories, case report forms, and aggregated datasets; (2) modeling and formalization of data operations, through the paradigm of the Detailed Clinical Models; (3) agnostic development of data operations, selecting SQL and R as programming languages; and (4) automation of the ETL instantiation, building a formal configuration file with XML. Results: First, four international projects were analyzed to identify 17 operations necessary to obtain datasets from the EHR according to the specifications of these projects. With this, each of the data operations was formalized, using the ISO 13606 reference model, specifying the valid data types as arguments, inputs and outputs, and their cardinality. Then, an agnostic catalog of data was developed using the previously selected data-oriented programming languages. Finally, an automated ETL instantiation process was built from a formally defined ETL configuration file. Conclusions: This study has provided a transparent and flexible solution to the difficulty of making the processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible. Moreover, the abstraction carried out in this study means that any previous EHR reuse methodology can incorporate these results.
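
The study's configuration format is not reproduced here, so the sketch below only illustrates the general idea of an XML-driven ETL instantiation: a configuration file listing data operations is parsed and dispatched to operation-specific code. The element names and operation set are hypothetical.

    import xml.etree.ElementTree as ET

    # Hypothetical ETL configuration (element names invented for illustration).
    CONFIG = """
    <etl>
      <operation type="extract" source="ehr_db" table="lab_results"/>
      <operation type="select" field="loinc_code" equals="2160-0"/>
      <operation type="transform" field="creatinine" from="mg/dL" to="umol/L"/>
    </etl>
    """

    def run_operation(op):
        # In a real pipeline each branch would call generated SQL or R code.
        print(f"running {op.get('type')} with", dict(op.attrib))

    for op in ET.fromstring(CONFIG).findall("operation"):
        run_operation(op)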

Journal ArticleDOI
TL;DR: An approach for constructing ontologies by mining deep semantics from eXtensible Markup Language (XML) Schemas and XML instance documents and an ontology population approach at the instance level based on the XML instance document is proposed.
Abstract: With the development of the Semantic Web and Artificial Intelligence techniques, ontology has become a very powerful way of representing not only knowledge but also their semantics. Therefore, how to construct ontologies from existing data sources has become an important research topic. In this paper, an approach for constructing ontologies by mining deep semantics from eXtensible Markup Language (XML) Schemas (including XML Schema 1.0 and XML Schema 1.1) and XML instance documents is proposed. Given an XML Schema and its corresponding XML instance document, 34 rules are first defined to mine deep semantics from the XML Schema. The mined semantics is formally stored in an intermediate conceptual model and then is used to generate an ontology at the conceptual level. Further, an ontology population approach at the instance level based on the XML instance document is proposed. Now, a complete ontology is formed. Also, some corresponding core algorithms are provided. Finally, a prototype system is implemented, which can automatically generate ontologies from XML Schemas and populate ontologies from XML instance documents. The paper also classifies and summarizes the existing work and makes a detailed comparison. Case studies on real XML data sets verify the effectiveness of the approach.
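
As a toy illustration of one mapping of the kind the paper defines (its 34 rules are far richer), the sketch below turns each named complexType of an XML Schema into an owl:Class using rdflib; the example schema and ontology namespace are invented.

    import xml.etree.ElementTree as ET
    from rdflib import Graph, Namespace, RDF
    from rdflib.namespace import OWL

    XSD_NS = "http://www.w3.org/2001/XMLSchema"

    # Tiny example schema (invented).
    SCHEMA = """
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="Book"/>
      <xs:complexType name="Author"/>
    </xs:schema>
    """

    ONT = Namespace("http://example.org/ontology#")   # invented ontology namespace
    g = Graph()

    # One simplified rule: named complexType -> owl:Class.
    for ct in ET.fromstring(SCHEMA).findall(f"{{{XSD_NS}}}complexType"):
        g.add((ONT[ct.get("name")], RDF.type, OWL.Class))

    print(g.serialize(format="turtle"))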

Journal ArticleDOI
TL;DR: Experimental results show that the correcting assistant system based on the Microsoft.

Journal ArticleDOI
TL;DR: In this article, the authors present the guidance to reporting European Union (EU) Member States and non-Member States in data transmission using extensible markup language (XML) data transfer covering the reporting of prevalence data on zoonoses and microbiological agents and contaminants in food, foodborne outbreak data, animal population data and disease status data.
Abstract: This technical report of the European Food Safety Authority (EFSA) presents the guidance to reporting European Union (EU) Member States and non-Member States in data transmission using extensible markup language (XML) data transfer covering the reporting of prevalence data on zoonoses and microbiological agents and contaminants in food, foodborne outbreak data, animal population data and disease status data. For data collection purposes, EFSA has created the Data Collection Framework (DCF) application. The present report provides data dictionaries to guide the reporting of information deriving from 2021 under the framework of Directive 2003/99/EC, Regulation (EU) 2017/625 and Commission Implementing Regulation (EU) 2019/627. The objective is to explain in detail the individual data elements that are included in the EFSA data models to be used for XML data transmission through the DCF. In particular, the data elements to be reported are explained, including information about the data type, a reference to the list of allowed terms and any additional business rule or requirement that may apply.

Journal ArticleDOI
TL;DR: Inishell as discussed by the authors is a C++/Qt tool that can populate a graphical user interface (GUI) based on an XML description of the required numerical model configuration elements (i.e., the data model of the configuration data).
Abstract: As numerical model developers, we have experienced first hand how most users struggle with the configuration of the models, leading to numerous support requests. Such issues are usually mitigated by offering a graphical user interface (GUI) that flattens the learning curve. Developing a GUI, however, requires a significant investment for the model developers, as well as a specific skill set. Moreover, this does not fit with the daily duties of model developers. As a consequence, when a GUI has been created – usually within a specific project and often relying on an intern – the maintenance either constitutes a major burden or is not performed. This also tends to limit the evolution of the numerical models themselves, since the model developers try to avoid having to change the GUI. In this paper we describe an approach based on an XML description of the required numerical model configuration elements (i.e., the data model of the configuration data) and a C++/Qt tool (Inishell) that populates a GUI based on this description on the fly. This makes the maintenance of the GUI very simple and enables users to easily get an up-to-date GUI for configuring the numerical model. The first version of this tool was written almost 10 years ago and showed that the concept works very well for our own surface process models. A full rewrite offering a more modern interface and extended capabilities is presented in this paper.
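
The general idea — a GUI populated on the fly from an XML description of configuration elements — can be sketched in a few lines of Python/Tkinter; the XML element names below are invented and do not reproduce Inishell's actual format.

    import tkinter as tk
    import xml.etree.ElementTree as ET

    # Hypothetical description of two configuration keys (not Inishell's real schema).
    DESCRIPTION = """
    <parameters>
      <parameter key="STATION_FILE" type="path" label="Station file"/>
      <parameter key="TIME_STEP" type="number" label="Time step (s)" default="3600"/>
    </parameters>
    """

    root = tk.Tk()
    entries = {}
    for i, param in enumerate(ET.fromstring(DESCRIPTION).findall("parameter")):
        tk.Label(root, text=param.get("label")).grid(row=i, column=0, sticky="w")
        entry = tk.Entry(root)
        entry.insert(0, param.get("default", ""))
        entry.grid(row=i, column=1)
        entries[param.get("key")] = entry   # read back later to write the configuration file

    root.mainloop()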

Journal ArticleDOI
TL;DR: The overall model design of the TCMAB bibliographic abstracts database system is proposed based on the construction process of the knowledge map, and a genetic algorithm and a BP neural network are used for knowledge mining and discovery.
Abstract: With the rapid development of modern science and Internet technology, the establishment of a unified and standardized bibliographic summary database to realize the exchange and sharing of ancient Chinese medicine bibliographic resources is the inevitable trend of digital services for ancient Chinese medicine bibliography. Firstly, this paper formulates the bibliographic metadata specification of traditional Chinese medicine ancient books (TCMAB) and extracts each cataloging file into an XML document in line with this specification. Secondly, this paper realizes the unified description of ancient book resources in the TCMAB database system, uses the native XML database eXist to store and manage the XML documents of all traditional Chinese medicine ancient book resources, and integrates the multimedia data with the XML data. Finally, a genetic algorithm and a BP neural network are used for knowledge mining and discovery, and the overall model design of the TCMAB bibliographic abstracts database system is proposed based on the construction process of the knowledge map. The system platform adopts the B/S mode, the eXist database management system, and PowerSSP streaming media and video servers for audio and video processing.
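
For readers unfamiliar with eXist, the snippet below sketches how a stored collection of such XML bibliographic records might be queried over eXist's REST interface from Python; the collection path, element names, and query are invented, and the URL and parameter conventions should be checked against the eXist version in use.

    import requests

    # Hypothetical XQuery against an eXist collection of TCM ancient-book records;
    # the element names ("record", "title") and the collection path are invented.
    xquery = """
    for $r in collection('/db/tcmab')//record
    where contains($r/title, '本草')
    return $r/title
    """

    resp = requests.get(
        "http://localhost:8080/exist/rest/db/tcmab",   # eXist REST endpoint (default port assumed)
        params={"_query": xquery, "_howmany": 20},
        auth=("admin", ""),
    )
    print(resp.text)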

Journal ArticleDOI
TL;DR: By capitalizing on the rich corpus of the 870K Sysmon logs collected, an extensible Python .evtx file analyzer, dubbed PeX, is created and evaluated, which can be used towards automatizing the parsing and scrutiny of such voluminous files.
Abstract: This work attempts to answer in a clear way the following key questions regarding the optimal initialization of the Sysmon tool for the identification of Lateral Movement in the MS Windows ecosystem. First, from an expert’s standpoint and with reference to the relevant literature, what are the criteria for determining the possibly optimal initialization features of the Sysmon event monitoring tool, which are also applicable as custom rules within the config.xml configuration file? Second, based on the identified features, how can a functional configuration file, able to identify as many LM variants as possible, be generated? To answer these questions, we relied on the MITRE ATT&CK knowledge base of adversary tactics and techniques and focused on the execution of the nine commonest LM methods. The conducted experiments, performed on a properly configured testbed, suggested a great number of interrelated networking features that were implemented as custom rules in the Sysmon’s config.xml file. Moreover, by capitalizing on the rich corpus of the 870K Sysmon logs collected, we created and evaluated, in terms of TP and FP rates, an extensible Python .evtx file analyzer, dubbed PeX, which can be used towards automatizing the parsing and scrutiny of such voluminous files. Both the .evtx logs dataset and the developed PeX tool are provided publicly for further propelling future research in this interesting and rapidly evolving field.
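
To give a flavour of what such custom rules look like, the snippet below writes a minimal Sysmon configuration that logs network connections to a few ports commonly associated with lateral movement (SMB, RDP, WinRM); the port selection and schema version are illustrative and do not reproduce the paper's rule set.

    # Minimal, illustrative Sysmon config.xml (not the paper's rule set).
    CONFIG = """<Sysmon schemaversion="4.90">
      <EventFiltering>
        <RuleGroup name="lateral-movement-ports" groupRelation="or">
          <NetworkConnect onmatch="include">
            <DestinationPort condition="is">445</DestinationPort>
            <DestinationPort condition="is">3389</DestinationPort>
            <DestinationPort condition="is">5985</DestinationPort>
          </NetworkConnect>
        </RuleGroup>
      </EventFiltering>
    </Sysmon>
    """

    with open("config.xml", "w", encoding="utf-8") as f:
        f.write(CONFIG)
    # Applied on the monitored host with: sysmon64 -c config.xml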

Journal ArticleDOI
TL;DR: In this article, the authors proposed an integrated IDM and MVD development method using bSDD as a lexicon based on three international standards (ISO 12006-3, ISO 16739-1, and ISO 29481-3).

Journal ArticleDOI
TL;DR: In this article, the authors present a technical report aimed at guiding the reporting of data on analytical test results, and related metadata, to EFSA in the context of the activities for the surveillance and monitoring of African Swine Fever.
Abstract: This technical report is aimed at guiding the reporting of data on analytical test results, and related metadata, to EFSA in the context of the activities for the surveillance and monitoring of African Swine Fever. The objective is to explain in detail the individual data elements that are included in the EFSA Standard Sample Description version 2 (SSD2) data model. The guidance is intended to support the reporting countries in data transmission using eXtensible Markup Language (XML) data file transfer through the Data Collection Framework (DCF) according to the protocol described in the EFSA Guidance on Data Exchange version 2 (GDE2). The data elements are explained, including information about data type, list of allowed terms and associated business rules. Instructions about how to report common sampling schemes are also provided to ensure harmonised reporting among the countries.

Journal ArticleDOI
TL;DR: Xia et al. as mentioned in this paper proposed a framework named XML2HBase to address the problem of storing and querying large collections of small XML documents using HBase, a widely deployed NoSQL database.

Proceedings ArticleDOI
04 Oct 2022
TL;DR: In this article, the authors proposed a unified vocabulary for Official Gazettes' procurement data; a systematic literature mapping was conducted to discover what types of notices are published in Official Gazettes and which structured formats and patterns are used.
Abstract: The heterogeneity of notices, the poor quality or the total absence of metadata, and the lack of format standardization make it difficult to unlock the total potential value of open data published in Official Gazettes, inhibiting civic engagement. To address these issues, we propose a unified vocabulary for Official Gazettes’ procurement data. A systematic literature mapping was conducted to discover what types of notices are published in Official Gazettes and which structured formats and patterns are used. We also analyzed the publication patterns of 16 Official Gazettes of different countries, extracting the concepts used in the unified vocabulary. Official Gazettes publish laws, resolutions, announcements, financial statements, government biddings, and contracts. In this study, we focused specifically on procurement data, i.e., biddings and contracts. For these official notices, an XML publication pattern is proposed. Using common publication standards improves the interoperability and integration of Official Gazettes and Open Government Data portals, allows the systematic analysis of public procurement data, and leads to increased control of governments by society.
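
The proposed publication pattern is not reproduced in the abstract, so the fragment below is a hypothetical example of what a structured bidding notice could look like, parsed with Python's standard library.

    import xml.etree.ElementTree as ET

    # Hypothetical procurement notice (element names invented, not the proposed pattern).
    NOTICE = """
    <notice type="bidding">
      <gazette country="BR" date="2022-10-04"/>
      <buyer name="Municipality of Example"/>
      <object>Acquisition of school meals</object>
      <estimatedValue currency="BRL">150000.00</estimatedValue>
      <openingDate>2022-11-01</openingDate>
    </notice>
    """

    n = ET.fromstring(NOTICE)
    print(n.get("type"), "-", n.findtext("object"), "-",
          n.find("estimatedValue").text, n.find("estimatedValue").get("currency"))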

Proceedings ArticleDOI
01 May 2022
TL;DR: An evaluation of multiple XML techniques found that, other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%-10% in terms of F1-scores on Top-k (k=1,2,3) predictions, with significant improvements in both the training and prediction time of these XML models.
Abstract: Software engineers depend heavily on software libraries and have to update their dependencies once vulnerabilities are found in them. Software Composition Analysis (SCA) helps developers identify vulnerable libraries used by an application. A key challenge is the identification of libraries related to a given reported vulnerability in the National Vulnerability Database (NVD), which may not explicitly indicate the affected libraries. Recently, researchers have tried to address the problem of identifying the libraries from an NVD report by treating it as an extreme multi-label learning (XML) problem, characterized by its large number of possible labels and severe data sparsity. As input, the NVD report is provided, and as output, a set of relevant libraries is returned. In this work, we evaluated multiple XML techniques. While previous work only evaluated a traditional XML technique, FastXML, we trained four other traditional XML models (DiSMEC, Parabel, Bonsai, ExtremeText) as well as two deep learning-based models (XML-CNN and LightXML). We compared both their effectiveness and the time cost of training and using the models for predictions. We find that other than DiSMEC and XML-CNN, recent XML models outperform the FastXML model by 3%-10% in terms of F1-scores on Top-k (k=1,2,3) predictions. Furthermore, we observe significant improvements in both the training and prediction time of these XML models, with the Bonsai and Parabel models achieving 627x and 589x faster training time and 12x faster prediction time than the FastXML baseline. We discuss the implications of our experimental results and highlight limitations for future work to address.
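
To make the task framing concrete, the sketch below poses library identification as multi-label text classification with a simple TF-IDF and one-vs-rest baseline from scikit-learn; this is a toy stand-in, not one of the XML models (FastXML, Parabel, Bonsai, and so on) evaluated in the paper, and the example data are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    # Invented NVD-style descriptions and their affected libraries.
    reports = [
        "Deserialization of untrusted data in the logging component allows remote code execution.",
        "Improper input validation in the XML parser leads to denial of service.",
    ]
    labels = [["log4j-core"], ["xerces"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
    clf.fit(reports, Y)

    pred = clf.predict(["Remote code execution via crafted log message deserialization."])
    print(mlb.inverse_transform(pred))  # predicted set of affected libraries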

Journal ArticleDOI
TL;DR: In this paper, the authors proposed an efficient distortion-free watermarking of large-scale data sets in various formats by exploiting the power of parallel and distributed computing environment, in particular, MapReduce, Pig and Hive paradigms for the data in CSV, XML and JSON formats.
Abstract: In this paper, we propose an efficient distortion-free watermarking of large-scale data sets in various formats by exploiting the power of parallel and distributed computing environment. In particular, we adapt MapReduce, Pig and Hive paradigms for the data in CSV, XML and JSON formats by identifying key computational steps involved in the sequential watermarking algorithms. Following this, we design a middleware which allows watermark generation and verification (under any computing paradigm of user’s choice) of large-scale data sets (in any suitable format of user’s interest) and their conversion without affecting the watermark. The experimental evaluation on large-scale benchmark data sets shows a significant reduction of watermark generation and verification times. Interestingly, in case of XML and JSON formats, Pig and Hive outperform the MapReduce paradigm, whereas MapReduce shows better performance in case of CSV format. To the best of our knowledge, this is the first proposal towards large-scale data sets watermarking, considering popular distributed computing paradigms and data formats.

Proceedings ArticleDOI
20 Jan 2022
TL;DR: QuickJEDI is developed, an algorithm that computes JEDI by leveraging a new technique to prune expensive sibling matchings and it outperforms a baseline algorithm by an order of magnitude in runtime.
Abstract: The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold τ, retrieve all documents that are within τ from the query document. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document which poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON. We develop QuickJEDI, an algorithm that computes JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and an effective upper bound based on tree sorting. Our upper bound algorithm runs in O(nτ) time and O(n+τ log n) space, which substantially improves the previous best bound of O(n²) time and O(n log n) space (where n is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.
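
The paper's JSON tree is a lossless tree representation of a document; without reproducing its exact definition, the sketch below shows one straightforward way to flatten a JSON value into labelled nodes (distinguishing objects, arrays, keys, and literals), which is the kind of structure an edit-distance computation would operate on. The labelling scheme is an assumption made for illustration.

    import json

    def to_tree(value):
        """Convert a parsed JSON value into (label, children) tuples (illustrative scheme)."""
        if isinstance(value, dict):        # unordered collection of key: value pairs
            return ("{}", [(k, [to_tree(v)]) for k, v in value.items()])
        if isinstance(value, list):        # ordered collection
            return ("[]", [to_tree(v) for v in value])
        return (json.dumps(value), [])     # literal leaf (string, number, bool, null)

    doc = json.loads('{"name": "alice", "tags": ["a", "b"], "age": 30}')
    print(to_tree(doc))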

Journal ArticleDOI
Luyi Bai
TL;DR: The authors propose the concept of query time series (QTS) according to the data revision degree to analyze the logical relationship between keywords of spatiotemporal data; a calculation method of keyword similarity is proposed, and the best parameter of the method is found through experiments.
Abstract: With the increasing popularity of XML for data representations, there is a lot of interest in keyword query on XML. Many algorithms have been proposed for XML keyword queries. But the existing approaches fall short in their abilities to analyze the logical relationship between keywords of spatiotemporal data. To overcome this limitation, in this paper, we firstly propose the concept of query time series (QTS) according to the data revision degree. For the logical relationship of keywords in QTS, we study the intra-coupling logic relationship and the inter-coupling logic relationship separately. Then a calculation method of keyword similarity is proposed and the best parameter in the method is found through experiment. Finally, we compare this method with others. Experimental results show that our method is superior to previous approaches.

Journal ArticleDOI
TL;DR: A novel method is also proposed to create a dataset from scratch using a predefined structure filled with predefined keywords, creating unique combinations for the training dataset.
Abstract: There are many different programming languages, and each has its own structure or way of writing code, so it becomes difficult to learn and frequently switch between different programming languages. For this reason, a person working with multiple programming languages needs to consult documentation frequently, which costs time and effort. In the past few years, there has been a significant increase in the number of papers published on this topic, each providing a unique solution to the problem. Many of these papers are based on applying NLP concepts in unique configurations to get the desired results. Some have used AI along with NLP to train a system to generate source code in a specific language, and some have trained the AI directly without pre-processing the dataset with NLP. All of these papers face two problems: a lack of a proper dataset for this particular application, and the fact that each approach can convert natural language into source code for only one specified programming language. The proposed system shows that a language-independent solution is a feasible alternative for writing source code without having full knowledge of a programming language. The proposed system uses Natural Language Processing to convert natural language into programming-language-independent pseudo code using custom Named Entity Recognition and saves it in XML (eXtensible Markup Language) format as an intermediate step. Then, using traditional programming, the system converts the generated pseudo code into programming-language-dependent source code. In this paper, another novel method is proposed to create a dataset from scratch using a predefined structure filled with predefined keywords, creating unique combinations for the training dataset. Keywords—Natural Language Processing (NLP); Natural Language Interface (NLI); Entity Recognition (ER); Artificial Intelligence (AI); source code generation; pseudocode generation
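
The intermediate XML pseudo-code format is not specified in the abstract, so the following is a hypothetical sketch of the second stage only: a small language-independent XML description is turned into Python source with the standard library. All element names are invented.

    import xml.etree.ElementTree as ET

    # Hypothetical language-independent pseudo code (invented schema).
    PSEUDO = """
    <function name="greet">
      <param name="user"/>
      <print>
        <concat><literal>Hello, </literal><var name="user"/></concat>
      </print>
    </function>
    """

    fn = ET.fromstring(PSEUDO)
    params = ", ".join(p.get("name") for p in fn.findall("param"))
    concat = fn.find("./print/concat")
    expr = " + ".join(
        repr(part.text) if part.tag == "literal" else part.get("name")
        for part in concat
    )
    print(f"def {fn.get('name')}({params}):\n    print({expr})")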

Journal ArticleDOI
TL;DR: A general framework for building a scalable search engine, based on a music description language that represents music content independently from a specific encoding, an extendible list of feature-extraction functions, and indexing, searching, and ranking procedures designed to be integrated into the standard architecture of a text-oriented search engine is proposed.
Abstract: We address the problem of scalable content-based search in large collections of music documents. Music content is highly complex and versatile and presents multiple facets that can be considered independently or in combination. Moreover, music documents can be digitally encoded in many ways. We propose a general framework for building a scalable search engine, based on (i) a music description language that represents music content independently from a specific encoding, (ii) an extendible list of feature-extraction functions, and (iii) indexing, searching, and ranking procedures designed to be integrated into the standard architecture of a text-oriented search engine. As a proof of concept, we also detail an actual implementation of the framework for searching in large collections of XML-encoded music scores, based on the popular ElasticSearch system. It is released as open-source in GitHub, and available as a ready-to-use Docker image for communities that manage large collections of digitized music documents.
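
Since the engine builds on ElasticSearch, the kind of query its text-oriented architecture supports can be sketched with a plain REST call; the index name and the field (an invented melodic-interval encoding standing in for an extracted feature) are not the project's actual schema.

    import requests

    # Hypothetical search over an index of extracted music features
    # (index and field names are invented).
    query = {
        "query": {
            "match": {
                "melodic_intervals": "+2 +2 -4 +2"   # an example feature string
            }
        },
        "size": 10,
    }

    resp = requests.post("http://localhost:9200/music-scores/_search", json=query)
    for hit in resp.json().get("hits", {}).get("hits", []):
        print(hit["_score"], hit["_source"].get("title"))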

Proceedings ArticleDOI
05 Mar 2022
TL;DR: FPP as mentioned in this paper is a new open-source modeling language for F Prime, a flight software framework developed at JPL and deployed, among other places, on NASA's Mars helicopter Ingenuity; it has a succinct and readable syntax, a well-defined semantics, and robust error checking and reporting.
Abstract: We present F Prime Prime (FPP), a new open-source modeling language for F Prime. F Prime is an open-source flight software framework developed at JPL and deployed, among other places, on the Mars helicopter Ingenuity. FPP provides a convenient way to model the architectural elements of an F Prime application, e.g., components, ports, and their connections. It has a succinct and readable syntax, a well-defined semantics, and robust error checking and reporting. The FPP tool suite, written in Scala, analyzes FPP models, reports errors, and translates correct FPP models to a combination of XML and C++. Existing F Prime tools translate the XML to a partial implementation in C++, to be completed by the developers. The model elements have clean interfaces and are highly reusable. An accompanying visualization tool constructs diagrams of components and connections that FSW developers can use to understand and communicate their designs, for example at reviews. We discuss the design and implementation of FPP and the integration of FPP into F Prime. We also discuss our experience using FPP to construct F Prime models. Finally, we discuss our plans for future work, including improved code generation, improved visualization, and more advanced analysis capabilities.