
Showing papers on "Serialization" published in 2020


Proceedings ArticleDOI
01 Nov 2020
TL;DR: The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages.
Abstract: The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software are available from the task web site at: http://mrp.nlpl.eu

42 citations
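
The MRP task distributes all five frameworks in a uniform graph abstraction and serialization. The sketch below is only an illustration of what such a per-sentence JSON graph serialization can look like; the field names approximate, but are not guaranteed to match, the official MRP schema documented at http://mrp.nlpl.eu.

    import json

    # One directed graph per sentence, serialized as one JSON object per line.
    graph = {
        "id": "example-0001",
        "input": "The cat sleeps.",
        "framework": "eds",
        "tops": [0],
        "nodes": [
            {"id": 0, "label": "sleep", "anchors": [{"from": 8, "to": 14}]},
            {"id": 1, "label": "cat",   "anchors": [{"from": 4, "to": 7}]},
        ],
        "edges": [
            {"source": 0, "target": 1, "label": "ARG1"},
        ],
    }

    # One graph per line ("JSON Lines") keeps large training sets streamable.
    line = json.dumps(graph, ensure_ascii=False)
    assert json.loads(line)["nodes"][1]["label"] == "cat"
    print(line)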


Posted Content
TL;DR: MOTION is built in a user-friendly, modular, and extensible way, intended to be used as a tool in MPC research and to increase adoption of MPC protocols in practice and is shown to be highly efficient for privacy-preserving neural network inference.
Abstract: We present MOTION, an efficient and generic framework for mixed-protocol secure multi-party computation (MPC). Our framework is built from the ground up and incorporates several important engineering decisions such as full communication serialization, which enables MPC over arbitrary messaging interfaces and removes the need to own network sockets. It is available under the liberal MIT license and independent of external MPC libraries, which often have stricter licenses. MOTION is extensive and thoroughly tested: it currently consists of more than 36 000 lines of code, 20% of which are unit and component tests. It is built in a user-friendly, modular, and extensible way, intended to be used as a tool in MPC research and to increase adoption of MPC protocols in practice. MOTION incorporates several novel performance optimizations that improve the communication complexity and latency, e.g., 2× better online round complexity of precomputed correlated Oblivious Transfer (OT). We instantiate our framework with protocols for N parties and security against up to N−1 passive corruptions: the MPC protocols of Goldreich-Micali-Wigderson (GMW) in their arithmetic and Boolean versions and oblivious transfer (OT)-based BMR (Ben-Efraim et al., CCS'16), as well as novel and highly efficient conversions between them, including a non-interactive conversion from BMR to arithmetic GMW. Moreover, we design a novel garbling technique that saves 20% of communication in the BMR protocol. MOTION is highly efficient, which we demonstrate in our experiments by measuring its run-times in various network settings with different numbers of parties. For secure evaluation of AES-128 with N=3 parties in the high-latency network setting from the OT-based BMR paper, we achieve a 16× better throughput of 16 AES/s using BMR. This shows that the BMR protocol is much more competitive than previously assumed. For N=3 parties and full-threshold protocols in the LAN setting, MOTION is 10×–18× faster than the previous best passively secure implementation from the MP-SPDZ framework, and 190×–586× faster than the actively secure SCALE-MAMBA framework. Finally, we show that our framework is highly efficient for privacy-preserving neural network inference.

39 citations
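
The key engineering decision highlighted in the MOTION abstract is full communication serialization, which decouples the MPC engine from any particular transport. The following is a minimal Python sketch of that design idea only; it is not MOTION's actual (C++) API, and all class and function names are invented for illustration.

    import json
    from abc import ABC, abstractmethod

    class Transport(ABC):
        """Any medium that can move opaque bytes between parties."""
        @abstractmethod
        def send(self, party_id: int, payload: bytes) -> None: ...

    class InMemoryTransport(Transport):
        def __init__(self):
            self.outbox = []   # stand-in for sockets, files, message queues, ...
        def send(self, party_id, payload):
            self.outbox.append((party_id, payload))

    def serialize_share(gate_id: int, share: int) -> bytes:
        # Fully serialized messages mean the protocol engine never touches a socket itself.
        return json.dumps({"gate": gate_id, "share": share}).encode()

    transport = InMemoryTransport()
    transport.send(party_id=1, payload=serialize_share(gate_id=42, share=7))
    print(transport.outbox)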


Journal ArticleDOI
TL;DR: A critical review of data models, annotation frameworks, knowledge organization systems, serialization syntaxes, and algebras that enable provenance-aware RDF statements and their limitations can serve as the basis for novel approaches in RDF-powered applications with increasing provenance needs.
Abstract: Expressing machine-interpretable statements in the form of subject-predicate-object triples is a well-established practice for capturing semantics of structured data. However, the standard used for representing these triples, RDF, inherently lacks the mechanism to attach provenance data, which would be crucial to make automatically generated and/or processed data authoritative. This paper is a critical review of data models, annotation frameworks, knowledge organization systems, serialization syntaxes, and algebras that enable provenance-aware RDF statements. The various approaches are assessed in terms of standard compliance, formal semantics, tuple type, vocabulary term usage, blank nodes, provenance granularity, and scalability. This can be used to advance existing solutions and help implementers to select the most suitable approach (or a combination of approaches) for their applications. Moreover, the analysis of the mechanisms and their limitations highlighted in this paper can serve as the basis for novel approaches in RDF-powered applications with increasing provenance needs.

39 citations
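
One family of approaches surveyed above attaches provenance at the granularity of named graphs rather than individual triples. As an illustration only (not taken from the paper), the rdflib sketch below places a statement in a named graph and annotates that graph with provenance in the default graph, serialized as TriG; the example IRIs are invented.

    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    ds = Dataset()

    # Put the statement into a named graph...
    g = ds.graph(URIRef("http://example.org/graph/1"))
    g.add((EX.alice, EX.worksFor, EX.acme))

    # ...and attach provenance to the graph name in the default graph.
    ds.add((URIRef("http://example.org/graph/1"), PROV.wasAttributedTo, EX.crawler))
    ds.add((URIRef("http://example.org/graph/1"), EX.retrievedOn, Literal("2020-05-01")))

    print(ds.serialize(format="trig"))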


Proceedings ArticleDOI
11 May 2020
TL;DR: A checkpointing technique is proposed that is specifically designed to address the limitations of simple checkpointing techniques, introducing efficient asynchronous methods to hide the overhead of serialization and I/O and to distribute the load over all participating processes.
Abstract: In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.

21 citations
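
The central idea described above is to take serialization and I/O off the training loop's critical path. The sketch below is a generic, single-process illustration of asynchronous checkpointing using a background thread and the standard pickle module; it is not the authors' system, and the class and file names are invented.

    import pickle, threading, queue

    class AsyncCheckpointer:
        """Serialize and write model state in a background thread so the
        training loop is only blocked for the cost of a snapshot copy."""
        def __init__(self, path_template="ckpt-{step}.pkl"):
            self.path_template = path_template
            self.work = queue.Queue()
            self.worker = threading.Thread(target=self._drain, daemon=True)
            self.worker.start()

        def save(self, step, state):
            # Take a cheap shallow copy now; heavy pickling happens off the critical path.
            self.work.put((step, dict(state)))

        def _drain(self):
            while True:
                step, state = self.work.get()
                with open(self.path_template.format(step=step), "wb") as f:
                    pickle.dump(state, f)
                self.work.task_done()

    ckpt = AsyncCheckpointer()
    ckpt.save(step=10, state={"weights": [0.1, 0.2], "epoch": 1})
    ckpt.work.join()   # wait only when durability is actually needed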


Proceedings ArticleDOI
03 Sep 2020
TL;DR: This work is the first to tackle the problem of schema inference for property graphs, presenting a novel end-to-end inference method that handles complex and nested property values, multi-labeled nodes, and node hierarchies.
Abstract: Property graph instances are typically populated without defining a schema beforehand. Although this ensures great flexibility, the lack of a schema means missing opportunities for query optimization, data integration and analytics, to name a few. Since several graph instances exist prior to the schema definition, extracting the schema from those instances in a principled way might become a significant yet daunting task. In this paper, we present a novel end-to-end schema inference method for property graph schemas that tackles complex and nested property values, multi-labeled nodes and node hierarchies. Our method consists of three main steps, the first of which builds upon Cypher queries to extract the node and edge serialization of a property graph. The second step builds over a MapReduce type inference system, working on the serialized output obtained during the first step. The third step analyzes subtypes and supertypes to infer node hierarchies. We describe our schema inference pipeline and its implementation in two variants, one labels-oriented and one properties-oriented. Finally, we experimentally evaluate and compare the scalability and accuracy of our approaches on several real-life datasets. To the best of our knowledge, our work is the first to tackle the problem of schema inference for property graphs.

21 citations
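
The first step described above extracts a node and edge serialization of the property graph via Cypher. The snippet below is only a hedged illustration of what such an extraction pass could look like with the official Neo4j Python driver; the connection details are placeholders and the queries are generic, not the paper's actual pipeline.

    from neo4j import GraphDatabase  # URI and credentials below are placeholders

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    NODE_DUMP = """
    MATCH (n)
    RETURN id(n) AS id, labels(n) AS labels, properties(n) AS props
    """

    EDGE_DUMP = """
    MATCH (a)-[r]->(b)
    RETURN id(a) AS src, type(r) AS rel, id(b) AS dst, properties(r) AS props
    """

    with driver.session() as session:
        nodes = [record.data() for record in session.run(NODE_DUMP)]
        edges = [record.data() for record in session.run(EDGE_DUMP)]

    # 'nodes' and 'edges' are now plain dictionaries, ready for a type-inference pass.
    print(len(nodes), len(edges))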


Proceedings ArticleDOI
30 May 2020
TL;DR: Cereal, a specialized hardware accelerator for memory object serialization, is proposed by co-designing the serialization format with hardware architecture and effectively utilizes abundant parallelism in the S/D process to deliver high throughput.
Abstract: Object serialization and deserialization (S/D) is an essential feature for efficient communication between distributed computing nodes with potentially non-uniform execution environments. S/D operations are widely used in big data analytics frameworks for remote procedure calls and massive data transfers like shuffles. However, frequent S/D operations incur significant performance and energy overheads as they must traverse and process a large object graph. Prior approaches improve S/D throughput by effectively hiding disk or network I/O latency with computation, increasing compression ratio, and/or application-specific customization. However, inherent dependencies in the existing (de)serialization formats and algorithms eventually become the major performance bottleneck. Thus, we propose Cereal, a specialized hardware accelerator for memory object serialization. By co-designing the serialization format with hardware architecture, Cereal effectively utilizes abundant parallelism in the S/D process to deliver high throughput. Cereal also employs an efficient object packing scheme to compress metadata such as object reference offsets and a space-efficient bitmap representation for the object layout. Our evaluation of Cereal using both a cycle-level simulator and synthesizable Chisel RTL demonstrates that Cereal delivers 43.4× higher average S/D throughput than 88 other S/D libraries on Java Serialization Benchmark Suite. For six Spark applications Cereal achieves 7.97× and 4.81× speedups on average for S/D operations over Java built-in serializer and Kryo, respectively, while saving S/D energy by 227.75× and 136.28×.

20 citations


Journal ArticleDOI
TL;DR: This work proposes a novel technique, called adaptive two-layer serialization algorithm, which can achieve good performance in communication for different kinds of messages in autonomous robot systems and shows significant performance improvement over traditional methods in ROS2.
Abstract: With the development of deep learning, autonomous robot systems grow rapidly and require better performance. Robot Operating System 2 (ROS2) has been widely adopted as the main communication framework in autonomous robot systems. However, the performance of ROS2 has become the bottleneck of these real-time systems. From our observations, we find that it can take a large amount of time to serialize complex messages in communication, especially for high-level programming languages such as Python and Java. To address this challenge, we propose a novel technique, called the adaptive two-layer serialization algorithm, which can achieve good performance in communication for different kinds of messages. Experimental results show that our algorithm can achieve significant performance improvement over traditional methods in ROS2, up to 93% improvement in our framework. We have successfully applied our proposed techniques in a real autonomous robot system.

18 citations
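
The general intuition behind a two-layer scheme is to use a cheap fixed-layout encoding for simple messages and fall back to a general-purpose serializer for complex ones. The sketch below illustrates only that intuition in plain Python; it is not the authors' algorithm and does not use the ROS2 API, and the prefix bytes and selection rule are invented.

    import json, struct

    def serialize(msg: dict) -> bytes:
        """Illustrative two-layer scheme: a compact fixed layout for small, flat
        numeric messages, falling back to a general serializer for everything else."""
        if all(isinstance(v, float) for v in msg.values()) and len(msg) <= 4:
            # Layer 1: pack values directly; field order is fixed by sorted keys.
            payload = struct.pack(f"<{len(msg)}d", *(msg[k] for k in sorted(msg)))
            return b"\x01" + payload
        # Layer 2: general-purpose (slower, but handles nested structures).
        return b"\x02" + json.dumps(msg).encode()

    print(serialize({"x": 1.0, "y": 2.0}))                   # fast path
    print(serialize({"pose": {"x": 1.0}, "frame": "map"}))   # fallback path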


Posted Content
TL;DR: This work develops a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format and introduces relaxations to common analytical data formats to efficiently update records.
Abstract: The proliferation of modern data processing tools has given rise to open-source columnar data formats. The advantage of these formats is that they help organizations avoid repeatedly converting data to a new format for each application. These formats, however, are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. We aim to reduce or even eliminate this process by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the DB-X DBMS. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods.

17 citations
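
The abstract above argues for emitting storage blocks in a universal open-source columnar format so external tools can read them with minimal serialization overhead. As a small illustration of that consumption path (not DB-X's interface), the pyarrow snippet below builds an Arrow record batch and hands it to pandas without a row-by-row serialization step; the data is invented.

    import pyarrow as pa

    # A columnar block in the Arrow format (illustrative data only).
    batch = pa.record_batch(
        [pa.array([1, 2, 3], type=pa.int64()),
         pa.array(["a", "b", "c"])],
        names=["id", "payload"],
    )

    # Analytical tools can consume the same columnar buffers directly,
    # e.g. as a pandas DataFrame:
    table = pa.Table.from_batches([batch])
    df = table.to_pandas()
    print(df)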


Journal ArticleDOI
01 Sep 2020
TL;DR: This paper uses blockchains to develop a product serialization method that solves the above security issues in a multi-party perishable good supply chain and proposes a secure serialization protocol to verify the authenticity of serial numbers despite not frequently engaging with the blockchain.
Abstract: Product serialization aims to allocate unique serial numbers to products in a supply chain. The security challenges to product serialization are: (1) valid serial numbers can be stolen and used to label fake products, so the uniqueness of a serial number should be verifiable at any stage of its lifecycle in a supply chain; (2) a planned change of custody of a product in distribution can be corrupted by a few intimidatory nodes, so compliance with the planned change of custody should be verifiable; (3) the manufacturer and the consumer should be able to verify that perishable food products with expired shelf life are discarded. In this paper, we use blockchains to develop a product serialization method that solves the above security issues in a multi-party perishable goods supply chain. Blockchains can revolutionize security and transparency in supply chains by providing a secure data-sharing platform in a multi-party environment. Although blockchains can provide secure data storage of change-of-custody events of products in a supply chain, a high volume of such events poses scalability problems for blockchains. In this paper, we solve the product serialization problem using blockchain offline channels. Our solution significantly reduces the number of transactions that need to be recorded in the blockchain. We propose a secure serialization protocol to verify the authenticity of serial numbers despite not frequently engaging with the blockchain.

17 citations


Journal ArticleDOI
TL;DR: The results show that the proposed model can validly recognize the human motion serialization and achieve 93% recognition accuracy within the initial 20% duration of the activities, which is of great significance for real-time human motion recognition.
Abstract: Motivated by the intrinsic dynamics of physical motion as well as establishment of target motion model, this article addresses the problem of human motion recognition with ultra wide band (UWB) through-the-wall radar (TWR) in a novel view of range profile serialization. Specifically, we first convert the original radar echoes into range profiles. Then, an auto-encoder network (AEN) with three dense layers is adopted to reduce the dimension and extract the features of each range profile. After that, a gated recurrent unit (GRU) network with two hidden layers is employed to deal with the features of each time-range slice and output the recognition results at each slice in real time. Finally, experimental data with respect to four different behind-wall human motions is collected by self-developed UWB TWR to validate the effectiveness of the proposed model. The results show that the proposed model can validly recognize the human motion serialization and achieve 93% recognition accuracy within the initial 20% duration of the activities (the average durations are 4s, 5.5s, 3s and 4.5s), which is of great significance for real-time human motion recognition.

16 citations
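
The pipeline described above feeds per-profile features from a three-dense-layer auto-encoder into a two-layer GRU that emits a prediction at every time slice. The PyTorch sketch below reproduces only that overall shape; all layer sizes and the number of classes are illustrative, not the values used in the paper.

    import torch
    import torch.nn as nn

    class MotionRecognizer(nn.Module):
        """Sketch of the pipeline shape described above: a dense encoder per
        range profile followed by a two-layer GRU and a per-slice classifier."""
        def __init__(self, range_bins=256, feat_dim=32, num_classes=4):
            super().__init__()
            self.encoder = nn.Sequential(          # three dense layers
                nn.Linear(range_bins, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, feat_dim),
            )
            self.gru = nn.GRU(feat_dim, 64, num_layers=2, batch_first=True)
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, x):                      # x: (batch, time, range_bins)
            feats = self.encoder(x)                # encode each range profile
            out, _ = self.gru(feats)               # temporal modelling
            return self.classifier(out)            # a prediction at every time slice

    model = MotionRecognizer()
    logits = model(torch.randn(2, 50, 256))        # 2 sequences, 50 slices each
    print(logits.shape)                            # torch.Size([2, 50, 4])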


Journal ArticleDOI
TL;DR: This contribution presents the status and plans of the future ROOT 7 event I/O, and shows how the new, optimized physical data layout speeds up serialization and deserialization and facilitates parallel, vectorized and bulk operations.
Abstract: The ROOT TTree data format encodes hundreds of petabytes of High Energy and Nuclear Physics events. Its columnar layout drives rapid analyses, as only those parts ("branches") that are really used in a given analysis need to be read from storage. Its unique feature is the seamless C++ integration, which allows users to directly store their event classes without explicitly defining data schemas. In this contribution, we present the status and plans of the future ROOT 7 event I/O. Along with the ROOT 7 interface modernization, we aim for robust, where possible compile-time safe C++ interfaces to read and write event data. On the performance side, we show first benchmarks using ROOT's new experimental I/O subsystem that combines the best of TTrees with recent advances in columnar data formats. A core ingredient is a strong separation of the high-level logical data layout (C++ classes) from the low-level physical data layout (storage backed nested vectors of simple types). We show how the new, optimized physical data layout speeds up serialization and deserialization and facilitates parallel, vectorized and bulk operations. This lets ROOT I/O run optimally on the upcoming ultra-fast NVRAM storage devices, as well as file-less storage systems such as object stores.

Book ChapterDOI
22 Jun 2020
TL;DR: This paper proposes PROV-JSONLD, a JSON-LD serialization for the PROV Data Model that provides a lightweight yet natural encoding of PROV expressions, can be processed directly as Linked Data, and is suitable for interchanging provenance information in Web and Linked Data applications.
Abstract: Provenance is information about entities, activities, and people involved in producing a piece of data or a thing, which can be used to form assessments about the data or the thing’s quality, reliability, or trustworthiness. PROV-DM is the conceptual data model that forms the basis for the W3C provenance (PROV) family of specifications. In this paper, we propose a new serialization for PROV in JSON called PROV-JSONLD. It provides a lightweight representation of PROV expressions in JSON, which is suitable to be processed by Web applications, while maintaining a natural encoding that is familiar with PROV practitioners. In addition, PROV-JSONLD exploits JSON-LD to define a semantic mapping that conforms to the PROV-O specification and, hence, the encoded PROV expressions can be readily processed as Linked Data. Finally, we show that the serialization is also efficiently processable in our evaluation. Overall, PROV-JSONLD is designed to be suitable for interchanging provenance information in Web and Linked Data applications, to offer a natural encoding of provenance for its targeted audience, and to allow for fast processing.
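
To make the idea concrete, the snippet below builds a JSON-LD-style provenance document in the spirit of what the abstract describes: ordinary JSON that also carries a semantic mapping via an @context. It is illustrative only; the exact context and key layout of the PROV-JSONLD specification may differ.

    import json

    doc = {
        "@context": {"prov": "http://www.w3.org/ns/prov#",
                     "ex": "http://example.org/"},
        "@graph": [
            {"@id": "ex:report", "@type": "prov:Entity",
             "prov:wasGeneratedBy": {"@id": "ex:analysis"}},
            {"@id": "ex:analysis", "@type": "prov:Activity",
             "prov:wasAssociatedWith": {"@id": "ex:alice"}},
            {"@id": "ex:alice", "@type": "prov:Agent"},
        ],
    }
    print(json.dumps(doc, indent=2))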

Journal ArticleDOI
01 Apr 2020
TL;DR: This work presents an approach to directly model uni- and bidirectional noncontainment relations in RAGs and provide efficient means for navigating and editing them and discusses the efficient and inter-operable serialization and deserialization of such model instances.
Abstract: Just like current software systems, conceptual models are characterised by increasing complexity and rate of change. Yet, these models only become useful if they can be continuously evaluated, validated and serialized. To achieve sufficiently low response times for large models, incremental analysis is required. Reference Attribute Grammars (RAGs) offer mechanisms to perform incremental analysis efficiently using dynamic dependency tracking. However, not all features used in conceptual modelling are directly available in RAGs. In particular, support for noncontainment model relations is only available through encodings. We present an approach called Relational RAGs to directly model uni- and bidirectional noncontainment relations in RAGs and provide efficient means for navigating and editing them. Furthermore, we discuss the efficient and inter-operable serialization and deserialization of such model instances. This approach is evaluated using a scalable benchmark for incremental model editing and the JastAdd RAG system. Our work demonstrates the suitability of RAGs for validating complex and continuously changing models of current software systems.

Journal ArticleDOI
TL;DR: BinaryCIF, a serialization of Crystallographic Information File (CIF) format files that maintains full compatibility with related data schemas while reducing file sizes by more than a factor of two versus gzip-compressed CIF files, is introduced.
Abstract: 3D macromolecular structural data is growing ever more complex and plentiful in the wake of substantive advances in experimental and computational structure determination methods including macromolecular crystallography, cryo-electron microscopy, and integrative methods. Efficient means of working with 3D macromolecular structural data for archiving, analyses, and visualization are central to facilitating interoperability and reusability in compliance with the FAIR Principles. We address two challenges posed by growth in data size and complexity. First, data size is reduced by bespoke compression techniques. Second, complexity is managed through improved software tooling and fully leveraging available data dictionary schemas. To this end, we introduce BinaryCIF, a serialization of Crystallographic Information File (CIF) format files that maintains full compatibility with related data schemas, such as PDBx/mmCIF, while reducing file sizes by more than a factor of two versus gzip-compressed CIF files. Moreover, for the largest structures, BinaryCIF provides even better compression: factors of ten and four versus CIF files and gzipped CIF files, respectively. Herein, we describe CIFTools, a set of libraries in Java and TypeScript for generic and typed handling of CIF and BinaryCIF files. Together, BinaryCIF and CIFTools enable lightweight, efficient, and extensible handling of 3D macromolecular structural data.
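
The "bespoke compression techniques" mentioned above operate column-wise on CIF categories. The sketch below illustrates the general flavor of such encodings (delta encoding followed by run-length encoding of a slowly varying integer column); it is not the actual BinaryCIF codec set, and the sample values are invented.

    def delta_encode(values):
        # Store the first value and successive differences; slowly varying
        # columns (e.g. atom serial numbers) become highly compressible.
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def run_length_encode(values):
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    serials = [1001, 1002, 1003, 1004, 1010]
    print(delta_encode(serials))                      # [1001, 1, 1, 1, 6]
    print(run_length_encode(delta_encode(serials)))   # [[1001, 1], [1, 3], [6, 1]]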

Journal ArticleDOI
TL;DR: A binary serialization format originating from the family of EXPRESS standards is assessed; it is based on an existing open, binary, hierarchical data format called HDF5 that allows random access to specific instances and therefore efficient retrieval of relevant subsets.
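
The key property named above is random access: a relevant subset can be read without deserializing the whole file. The h5py snippet below demonstrates that property of HDF5 in general terms; the group and dataset names are invented and unrelated to the assessed EXPRESS-based format.

    import h5py
    import numpy as np

    # Write a toy hierarchical file (names are invented).
    with h5py.File("instances.h5", "w") as f:
        f.create_dataset("model/coordinates", data=np.random.rand(100_000, 3))

    # Random access: only the requested slice is read from disk.
    with h5py.File("instances.h5", "r") as f:
        subset = f["model/coordinates"][500:510]
    print(subset.shape)   # (10, 3)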

Journal ArticleDOI
TL;DR: This research proposes a model that can automatically transform a SysML Requirement Diagram into an OWL file so that system designs can be easily understood by both humans and machines.
Abstract: Requirement Diagrams are used by the Systems Modeling Language (SysML) to depict and model non-functional requirements, such as response time, size, or system functionality, which cannot be accommodated in the Unified Modeling Language (UML). Nevertheless, SysML still lacks the capability to represent the semantic contexts within the design. Web Ontology Language (OWL) can be used to capture the semantic context of a system design; hence, the transformation of SysML diagrams into OWL is needed. SysML diagrams are currently transformed into OWL manually, which is error-prone and requires considerable time and effort from system engineers. This research proposes a model that can automatically transform a SysML Requirement Diagram into an OWL file so that system designs can be easily understood by both humans and machines. It also allows users to extract knowledge contained in previous diagrams. The transformation process makes use of a transformation rule and an algorithm that can be used to change a SysML Requirement Diagram into an OWL ontology file. XML Metadata Interchange (XMI) serialization is used as the bridge to perform the transformation. The produced ontology can be viewed in Protégé. The class and subclass hierarchy, as well as the object properties and data properties, are clearly shown. In the experiment, it is also shown that the model can conduct the transformation correctly.

Journal ArticleDOI
02 Jan 2020-PLOS ONE
TL;DR: The Core Scientific Dataset model with JavaScript Object Notation (JSON) serialization is presented as a lightweight, portable, and versatile standard for intra- and interdisciplinary scientific data exchange.
Abstract: The Core Scientific Dataset (CSD) model with JavaScript Object Notation (JSON) serialization is presented as a lightweight, portable, and versatile standard for intra- and interdisciplinary scientific data exchange. This model supports datasets with a p-component dependent variable, {U0, …, Uq, …, Up-1}, discretely sampled at M unique points in a d-dimensional independent variable (X0, …, Xk, …, Xd-1) space. Moreover, this sampling is over an orthogonal grid, regular or rectilinear, where the principal coordinate axes of the grid are the independent variables. It can also hold correlated datasets assuming the different physical quantities (dependent variables) are sampled on the same orthogonal grid of independent variables. The model encapsulates the dependent variables' sampled data values and the minimum metadata needed to accurately represent this data in an appropriate coordinate system of independent variables. The CSD model can serve as a re-usable building block in the development of more sophisticated portable scientific dataset file standards.
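
As a concrete illustration of the model described above, the snippet below serializes a one-dimensional dataset (d = 1) with a single scalar dependent variable (p = 1) sampled at M = 5 grid points. The key names are chosen to mirror the description in the abstract and may not match the exact CSD model schema.

    import json

    dataset = {
        "version": "1.0",
        "dimensions": [                      # the independent-variable grid (d = 1)
            {"type": "linear", "count": 5, "increment": "1.0 s", "origin_offset": "0.0 s"}
        ],
        "dependent_variables": [             # p = 1 component, sampled at M = 5 points
            {"type": "internal", "quantity_type": "scalar", "unit": "V",
             "components": [[0.0, 0.7, 1.0, 0.7, 0.0]]}
        ],
    }
    print(json.dumps(dataset, indent=2))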

22 Jun 2020
TL;DR: Comparing the performance tradeoffs of popular application-layer messaging protocols and binary serialization formats in the context of vehicle-to-cloud communication for maintaining digital twins shows that CoAP has the lowest latency and overhead, but is not able to guarantee reliable transfer, even when using its confirmable message feature.
Abstract: This paper compares the performance tradeoffs of popular application-layer messaging protocols and binary serialization formats in the context of vehicle-to-cloud communication for maintaining digital twins. Of particular interest are solutions that enable emerging delay-sensitive Intelligent Transport System (ITS) features while reducing data usage in mobile networks. The evaluated protocols are Constrained Application Protocol (CoAP), Advanced Message Queuing Protocol (AMQP), and Message Queuing Telemetry Transport (MQTT), and the serialization formats studied are Protobuf and Flatbuffers. The results show that CoAP – the only User Datagram Protocol (UDP) based protocol evaluated – has the lowest latency and overhead, but is not able to guarantee reliable transfer, even when using its confirmable message feature. For our context, the best performer that guarantees reliable transfer is MQTT. For the serialization formats, Protobuf is shown to have faster serialization speed and three times smaller serialized message size than Flatbuffers. In contrast, Flatbuffers uses less memory and has shorter deserialization time, making it an interesting alternative for applications where the vehicle is the receiver of time-sensitive information. Finally, insights and implications for ITS communication are discussed.
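
Since the study identifies MQTT as the best performer that guarantees reliable transfer, the sketch below shows a minimal QoS-1 ("at-least-once") publish of a binary payload with the paho-mqtt library. The broker hostname and topic are placeholders, and the payload uses the standard-library struct module as a self-contained stand-in for a Protobuf or Flatbuffers message.

    import struct
    import paho.mqtt.publish as publish   # pip install paho-mqtt; broker and topic are placeholders

    # A fixed-layout binary payload standing in for a serialized message:
    # latitude, longitude and speed as three little-endian doubles.
    payload = struct.pack("<3d", 57.7089, 11.9746, 13.9)

    # QoS 1 gives at-least-once delivery, i.e. the reliable transfer that the
    # evaluated CoAP configuration could not guarantee in the study above.
    publish.single("vehicle/123/twin", payload=payload, qos=1,
                   hostname="broker.example.com")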

Posted ContentDOI
16 Nov 2020-bioRxiv
TL;DR: The ISA API provides users with rich programmatic metadata handling functionality to support automation, a common interface and an interoperable medium between the two ISA formats, as well as with other life science data formats required for depositing data in public databases.
Abstract: Background The Investigation/Study/Assay (ISA) Metadata Framework is an established and widely used set of open-source community specifications and software tools for enabling discovery, exchange and publication of metadata from experiments in the life sciences. The original ISA software suite provided a set of user-facing Java tools for creating and manipulating the information structured in ISA-Tab – a now widely used tabular format. To make the ISA framework more accessible to machines and enable programmatic manipulation of experiment metadata, a JSON serialization ISA-JSON was developed. Results In this work, we present the ISA API, a Python library for the creation, editing, parsing, and validating of ISA-Tab and ISA-JSON formats by using a common data model engineered as Python class objects. We describe the ISA API feature set, early adopters and its growing user community. Conclusions The ISA API provides users with rich programmatic metadata handling functionality to support automation, a common interface and an interoperable medium between the two ISA formats, as well as with other life science data formats required for depositing data in public databases.

Posted Content
TL;DR: A novel refinement technique to progressively improve the CFG, which obtains counterexamples from CFG inclusion queries and uses them to introduce new non-terminals and productions to the grammar while still over-approximating the program’s relevant behavior.
Abstract: Several real-world libraries (e.g., reentrant locks, GUI frameworks, serialization libraries) require their clients to use the provided API in a manner that conforms to a context-free specification. Motivated by this observation, this paper describes a new technique for verifying the correct usage of context-free API protocols. The key idea underlying our technique is to over-approximate the program's feasible API call sequences using a context-free grammar (CFG) and then check language inclusion between this grammar and the specification. However, since this inclusion check may fail due to imprecision in the program's CFG abstraction, we propose a novel refinement technique to progressively improve the CFG. In particular, our method obtains counterexamples from CFG inclusion queries and uses them to introduce new non-terminals and productions to the grammar while still over-approximating the program's relevant behavior. We have implemented the proposed algorithm in a tool called CFPChecker and evaluate it on 10 popular Java applications that use at least one API with a context-free specification. Our evaluation shows that CFPChecker is able to verify correct usage of the API in clients that use it correctly and produces counterexamples for those that do not. We also compare our method against three relevant baselines and demonstrate that CFPChecker enables verification of safety properties that are beyond the reach of existing tools.

Journal ArticleDOI
TL;DR: This paper shows how to write a software program so as to avoid the port-conflict serialization problem in high-level synthesis, applies the method to two image processing programs, and evaluates its effect on them.
Abstract: High-level synthesis (HLS), which automatically converts software into hardware, is a promising technology for reducing the design burden significantly. However, to use HLS technology efficiently, the software program must be described with the hardware organization that the HLS tool will generate in mind. We are developing an HLS image processing library. However, some caution is required when using HLS for programs that read images. When the same image is read through an argument of a function, the input port corresponding to this argument on the hardware generated by the HLS tool may cause a port conflict. As a result, image reading is serialized, and this serialization disturbs the performance of the data path that the HLS tool has otherwise pipelined well. This paper shows how to write a software program to avoid this problem. In addition, we apply this method to two image processing programs and evaluate the effect of our proposal on them.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: The implementation of continuation marks for Chez Scheme (in support of Racket) makes dynamic binding and lookup constant-time and fast, preserves the performance of Chez scheme's first-class continuations, and imposes negligible overhead on program fragments that do not use first- class continuations or marks.
Abstract: Continuation marks enable dynamic binding and context inspection in a language with proper handling of tail calls and first-class, multi-prompt, delimited continuations. The simplest and most direct use of continuation marks is to implement dynamically scoped variables, such as the current output stream or the current exception handler. Other uses include stack inspection for debugging or security checks, serialization of an in-progress computation, and run-time elision of redundant checks. By exposing continuation marks to users of a programming language, more kinds of language extensions can be implemented as libraries without further changes to the compiler. At the same time, the compiler and runtime system must provide an efficient implementation of continuation marks to ensure that library-implemented language extensions are as effective as changing the compiler. Our implementation of continuation marks for Chez Scheme (in support of Racket) makes dynamic binding and lookup constant-time and fast, preserves the performance of Chez Scheme's first-class continuations, and imposes negligible overhead on program fragments that do not use first-class continuations or marks.

Journal ArticleDOI
TL;DR: This paper demonstrates how HDT - a compressed serialization format for RDF - can be extended to support encryption and proposes a number of different graph partitioning strategies, discussing the benefits and tradeoffs of each approach.
Abstract: The publication and interchange of RDF datasets online has experienced significant growth in recent years, promoted by different but complementary efforts, such as Linked Open Data, the Web of Things and RDF stream processing systems. However, the current Linked Data infrastructure does not cater for the storage and exchange of sensitive or private data. On the one hand, data publishers need means to limit access to confidential data (e.g. health, financial, personal, or other sensitive data). On the other hand, the infrastructure needs to compress RDF graphs in a manner that minimises the amount of data that is both stored and transferred over the wire. In this paper, we demonstrate how HDT - a compressed serialization format for RDF - can be extended to support encryption. We propose a number of different graph partitioning strategies and discuss the benefits and tradeoffs of each approach.

ReportDOI
20 Jan 2020
TL;DR: The JCS specification defines how to create a canonical representation of JSON data by building on the strict serialization methods for JSON primitives defined by ECMAScript, constraining JSON data to the I-JSON subset, and by using deterministic property sorting.
Abstract: Cryptographic operations like hashing and signing need the data to be expressed in an invariant format so that the operations are reliably repeatable. One way to address this is to create a canonical representation of the data. Canonicalization also permits data to be exchanged in its original form on the "wire" while cryptographic operations performed on the canonicalized counterpart of the data at the producer and consumer endpoints generate consistent results. This document describes the JSON Canonicalization Scheme (JCS). The JCS specification defines how to create a canonical representation of JSON data by building on the strict serialization methods for JSON primitives defined by ECMAScript, constraining JSON data to the I-JSON subset, and by using deterministic property sorting.
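
The snippet below is a rough Python approximation of the canonicalization idea: sorted object keys, no insignificant whitespace, UTF-8 output, so that equivalent JSON documents hash identically. Note that it does not reproduce JCS's ECMAScript-style number serialization for all values, so it only approximates the specification.

    import hashlib
    import json

    def approx_jcs(obj) -> bytes:
        """Approximate canonical form: lexicographically sorted keys, minimal
        separators, UTF-8.  JCS additionally mandates ECMAScript number
        formatting, which json.dumps does not fully reproduce."""
        return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                          ensure_ascii=False).encode("utf-8")

    a = {"b": 2, "a": 1, "nested": {"y": True, "x": None}}
    b = {"nested": {"x": None, "y": True}, "a": 1, "b": 2}

    # The same logical content hashes identically regardless of original key order.
    assert hashlib.sha256(approx_jcs(a)).hexdigest() == hashlib.sha256(approx_jcs(b)).hexdigest()
    print(approx_jcs(a).decode())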

Book ChapterDOI
Zihao Zhang, Huiqi Hu, Yang Yu, Weining Qian, Ke Shu
24 Sep 2020
Abstract: Modern databases are commonly deployed on multiple commercial machines with quorum-based replication to provide high availability and guarantee strong consistency. A widely adopted consensus protocol is Raft because it is easy to understand and implement. However, Raft's strict serialization limits the concurrency of the system, making it unable to reflect the capability of highly concurrent transaction processing brought by new hardware and concurrency control technologies. Upon realizing this, this work targets improving the parallelism of replication. We propose a variant of the Raft protocol named DP-Raft to support parallel replication of database logs so that it can match the speed of transaction execution. Our key contributions are: (1) we define the rules for using log dependencies to commit and apply logs out of order; (2) DP-Raft is proposed for replicating logs in parallel; it preserves log dependencies to ensure the safety of parallel replication and uses additional data structures to reduce the cost of state maintenance; (3) experiments on the YCSB benchmark show that our method improves throughput and reduces the latency of transaction processing in database systems compared with existing Raft-based solutions.

Journal ArticleDOI
TL;DR: It is found that specific forms of constrained RSD ensure fairness under certain assumptions about the content of those orders but that the general case nevertheless requires unconstrained RSD.
Abstract: The emergence of high frequency trading has resulted in 'bursts' of orders arriving at an exchange (nearly) simultaneously, yet most electronic financial exchanges implement the continuous limit order book, which requires processing of orders serially. Contrary to an assumption that appears throughout the economics literature, the technology that performs serialization provides only constrained random serial dictatorship (RSD), in the sense that not all priority orderings of agents are possible. We provide necessary and sufficient conditions for fairness under different market conditions on orders for constrained RSD mechanisms. Our results show that exchanges relying on the current serialization technology cannot ensure fairness, including exchanges using 'speed bumps.' We find that specific forms of constrained RSD ensure fairness under certain assumptions about the content of those orders but that the general case nevertheless requires unconstrained RSD. Our results have implications for the design of trading exchanges.

Posted Content
01 May 2020
TL;DR: It is shown that SPADE (SPatial DEpendency parser), an end-to-end spatial dependency parser that is serializer-free and capable of modeling an arbitrary number of information layers, outperforms the previous BIO tagging-based approach on name card parsing task and achieves comparable performance on receipt parsing task.
Abstract: Information Extraction (IE) for document images is often approached as a BIO tagging problem, where the model sequentially goes through and classifies each recognized input token into one of the information categories. However, such a problem setup has two inherent limitations: (1) it can only extract a flat list of information and (2) it assumes that the input data is serialized, often by a simple rule-based script. Nevertheless, real-world documents often contain hierarchical information in the form of two-dimensional language data in which the serialization can be highly non-trivial. To tackle these issues, we propose SPADE♠ (SPatial DEpendency parser), an end-to-end spatial dependency parser that is serializer-free and capable of modeling an arbitrary number of information layers, making it suitable for parsing structure-rich documents such as receipts and multimodal documents such as name cards. We show that SPADE♠ outperforms the previous BIO tagging-based approach on the name card parsing task and achieves comparable performance on the receipt parsing task. In particular, when the receipt images have a non-flat manifold representing physical distortion of the receipt paper in the real world, SPADE♠ outperforms the tagging-based method by a large margin of 25.8%, highlighting the strong performance of SPADE♠ on spatially complex documents.

Book ChapterDOI
01 Jan 2020
TL;DR: A new, simpler, non-epigenetic alternative to Plush, called Plushy, is presented that appears to maintain all of the advantages of Plush while providing additional benefits and illustrates the virtues of unconstrained linear genome representations more generally.
Abstract: In many genetic programming systems, the program variation and execution processes operate on different program representations. The representations on which variation operates are referred to as genomes. Unconstrained linear genome representations can provide a variety of advantages, including reduced complexity of program generation, variation, simplification and serialization operations. The Plush genome representation, which uses epigenetic markers on linear genomes to express nonlinear structures, has supported the production of state-of-the-art results in program synthesis with the PushGP genetic programming system. Here we present a new, simpler, non-epigenetic alternative to Plush, called Plushy, that appears to maintain all of the advantages of Plush while providing additional benefits. These results illustrate the virtues of unconstrained linear genome representations more generally, and may be transferable to genetic programming systems that target different languages for evolved programs.