
Showing papers on "Serialization" published in 2021


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors observe that widely deployed NICs possess scatter-gather capabilities that can be re-purposed to accelerate serialization's core task of coalescing and flattening in-memory data structures.
Abstract: Microsecond I/O will make data serialization a major bottleneck for datacenter applications. Serialization is fundamentally about data movement: serialization libraries coalesce and flatten in-memory data structures into a single transmittable buffer. CPU-based serialization approaches will hit a performance limit due to data movement overheads and be unable to keep up with modern networks. We observe that widely deployed NICs possess scatter-gather capabilities that can be re-purposed to accelerate serialization's core task of coalescing and flattening in-memory data structures. It is possible to build a completely zero-copy, zero-allocation serialization library with commodity NICs. Doing so introduces many research challenges, including using the hardware capabilities efficiently for a wide variety of non-uniform data structures, making application memory available for zero-copy I/O, and ensuring memory safety.
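
The core primitive being re-purposed here, gather I/O over a list of non-contiguous buffers, can be illustrated in a few lines of ordinary Python; the sketch below uses POSIX scatter-gather via socket.sendmsg on a Unix socket pair and is only an illustration of the concept, not the NIC-offload design itself.

```python
# Gather-style send: the kernel is handed a list of separate buffers (an iovec)
# and transmits them as one message, without the application first coalescing
# them into a single contiguous buffer. Unix-only; buffers are illustrative.
import socket

header = b"\x01\x00\x00\x00"            # e.g. a message-type tag
payload = bytearray(b"hello, world")     # e.g. a string field elsewhere in memory

left, right = socket.socketpair()
left.sendmsg([header, payload])          # two buffers, one transmitted message
print(right.recv(1024))                  # b'\x01\x00\x00\x00hello, world'
left.close(); right.close()
```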

19 citations


Proceedings ArticleDOI
18 Oct 2021
TL;DR: HyperProtoBench, as presented in this paper, is an open-source benchmark representative of key serialization-framework user services at scale, built from a fleet-wide profile of Protocol Buffers (protobuf) usage.
Abstract: Serialization frameworks are a fundamental component of scale-out systems, but introduce significant compute overheads. However, they are amenable to acceleration with specialized hardware. To understand the trade-offs involved in architecting such an accelerator, we present the first in-depth study of serialization framework usage at scale by profiling Protocol Buffers (“protobuf”) usage across Google’s datacenter fleet. We use this data to build HyperProtoBench, an open-source benchmark representative of key serialization-framework user services at scale. In doing so, we identify key insights that challenge prevailing assumptions about serialization framework usage. We use these insights to develop a novel hardware accelerator for protobufs, implemented in RTL and integrated into a RISC-V SoC. Applications can easily harness the accelerator, as it integrates with a modified version of the open-source protobuf library and is wire-compatible with standard protobufs. We have fully open-sourced our RTL, which, to the best of our knowledge, is the only such implementation currently available to the community. We also present a first-of-its-kind, end-to-end evaluation of our entire RTL-based system running hyperscale-derived benchmarks and microbenchmarks. We boot Linux on the system using FireSim to run these benchmarks and implement the design in a commercial 22nm FinFET process to obtain area and frequency metrics. We demonstrate an average 6.2 × to 11.2 × performance improvement vs. our baseline RISC-V SoC with BOOM OoO cores and despite the RISC-V SoC’s weaker uncore/supporting components, an average 3.8 × improvement vs. a Xeon-based server.
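
To make the compute overhead concrete, the sketch below shows the kind of per-field work a software protobuf serializer performs and that a hardware accelerator would offload: standard base-128 varint encoding of an unsigned integer field. This is the public protobuf wire format, not the paper's RTL or modified library.

```python
# Encode field number 1 (wire type 0 = varint) carrying the value 300.
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)      # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def encode_uint_field(field_number: int, value: int) -> bytes:
    key = (field_number << 3) | 0        # tag = field number + wire type
    return encode_varint(key) + encode_varint(value)

assert encode_uint_field(1, 300) == b"\x08\xac\x02"
```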

19 citations


Journal ArticleDOI
TL;DR: PowerSystems.jl implements an abstract hierarchy to represent and customize power systems data and includes data containers for quasi-static and dynamic simulation applications, with efficient management of large quantities of time series data, optimized serialization, and comprehensive validation capabilities.

18 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, the authors argue for offloading serialization logic to the DMA path via specialized hardware, and propose an initial hardware design for such an accelerator, and give preliminary evidence of its feasibility and expected benefits.
Abstract: Achieving zero-copy I/O has long been an important goal in the networking community. However, data serialization obviates the benefits of zero-copy I/O, because it requires the CPU to read, transform, and write message data, resulting in additional memory copies between the real object instances and the contiguous socket buffer. Therefore, we argue for offloading serialization logic to the DMA path via specialized hardware. We propose an initial hardware design for such an accelerator, and give preliminary evidence of its feasibility and expected benefits.

18 citations


Journal ArticleDOI
TL;DR: In this paper, a context-aware recommendation system for improving manufacturing process modeling is proposed, where independent paths and P,Q-grams are efficiently extracted from the manufacturing processes in the repository to represent their typical behavior and structure.
Abstract: Process recommendation is an essential technique to help process modelers effectively and efficiently model a manufacturing process from scratch. However, current process recommendation methods suffer from the following problems: (1) To extract all the execution paths from a manufacturing process, behavior-based methods may run into a state-space explosion problem when unfolding a process with multiple parallel patterns, resulting in low efficiency. (2) Current structure-based methods are inefficient since too many expensive computations of the graph edit distance are involved. (3) Most existing methods manually design their process similarity metrics around a handful of features, so they can only be applied in specific situations. (4) Few works provide visualization tools for process modeling assistance. To resolve these problems, this paper proposes a context-aware recommendation system for improving manufacturing process modeling. First, independent paths and P,Q-grams are efficiently extracted from the manufacturing processes in the repository to represent their typical behavior and structure. Then, the process recommendation problem is transformed into the word prediction problem in natural language processing, where the serialization of an independent path/P,Q-gram is regarded as a sentence and a node in it as a word. The Word2vec model is introduced to automatically learn the relationships among nodes from independent paths and P,Q-grams and to generate vectors with hundreds of context-aware features for nodes in the repository. After that, the top-k similar nodes are recommended for the target node in the process fragment under construction based on the k-nearest neighbors algorithm. Finally, a visualization tool is provided for process modelers to efficiently design a new manufacturing process. Experimental evaluations show that the proposed method performs similarly to or even better than the baseline methods in terms of recommendation quality.
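
The core idea, treating each serialized independent path as a "sentence" of node "words" and learning context-aware node vectors, can be sketched with an off-the-shelf Word2vec implementation; the node names and hyperparameters below are illustrative and not taken from the paper.

```python
# Minimal sketch using gensim's Word2Vec (gensim >= 4 API).
from gensim.models import Word2Vec

paths = [                                # serialized independent paths as "sentences"
    ["cut", "drill", "deburr", "inspect"],
    ["cut", "mill", "deburr", "inspect"],
    ["cast", "mill", "polish", "inspect"],
]
model = Word2Vec(sentences=paths, vector_size=32, window=2, min_count=1, epochs=50)

# Recommend top-k candidate nodes for a node in the fragment under construction.
print(model.wv.most_similar("mill", topn=3))
```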

15 citations


Journal ArticleDOI
TL;DR: A novel extension of Theatre, Parallel Theatre, developed to exploit the computing potential of today's shared-memory multi-core machines, is presented, and the particular control forms developed for untimed and timed parallel systems are described.

14 citations


Journal ArticleDOI
TL;DR: The ISA Metadata Framework as discussed by the authors is a set of open source community specifications and software tools for enabling discovery, exchange, and publication of metadata from experiments in the life sciences.
Abstract: BACKGROUND The Investigation/Study/Assay (ISA) Metadata Framework is an established and widely used set of open source community specifications and software tools for enabling discovery, exchange, and publication of metadata from experiments in the life sciences. The original ISA software suite provided a set of user-facing Java tools for creating and manipulating the information structured in ISA-Tab, a now widely used tabular format. To make the ISA framework more accessible to machines and enable programmatic manipulation of experiment metadata, the JSON serialization ISA-JSON was developed. RESULTS In this work, we present the ISA API, a Python library for the creation, editing, parsing, and validation of the ISA-Tab and ISA-JSON formats by using a common data model engineered as Python object classes. We describe the ISA API feature set, early adopters, and its growing user community. CONCLUSIONS The ISA API provides users with rich programmatic metadata-handling functionality to support automation, a common interface, and an interoperable medium between the 2 ISA formats, as well as with other life science data formats required for depositing data in public databases.
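
A minimal usage sketch of the programmatic metadata handling described above follows; the module, class, and function names (isatools.model.Investigation, Study, isatools.isatab.dumps) are recalled from the isatools documentation and should be treated as assumptions, not a verified API reference.

```python
# Hypothetical ISA API sketch; names are assumptions and may differ by version.
from isatools.model import Investigation, Study
from isatools import isatab

investigation = Investigation(identifier="i1", title="Example investigation")
study = Study(filename="s_example.txt", title="Example study")
investigation.studies.append(study)

# Serialize the common data model to the ISA-Tab investigation format.
print(isatab.dumps(investigation))
```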

10 citations


Journal ArticleDOI
TL;DR: A deterministic rule-based approach is proposed to overcome serialization specificities and enable extraction of characteristic elements from differently serialized process models, and an online web-based model-driven tool named AMADEOS is implemented that automatically derives conceptual database models from process models represented by different notations and serialized in different ways.
Abstract: Existing tools that aim to derive data models from business process models are typically able to process source models represented by one single notation and serialized in one specific way. However, the standards (e.g., BPMN) enable different serialization formats and also provide serialization flexibility, which leads to various implementations of the standard in different modeling tools and results in differently serialized models in practice, significantly constraining the usability of existing model-driven tools. In this article, we present an approach to the automatic derivation of conceptual database models from business process models represented by different notations, with a particular focus on differently serialized process models. A deterministic rule-based approach is proposed to overcome the serialization specificities and to enable extraction of characteristic elements from differently serialized process models. Based on the proposed approach, we implemented an online web-based model-driven tool named AMADEOS, which is able to automatically derive conceptual database models from process models represented by different notations and also differently serialized. The experimental results show that the proposed approach and implemented tool enable successful extraction of specific elements from differently serialized process models and derivation of the target conceptual database models with very high completeness and precision.
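
The "serialization specificities" the approach overcomes can be seen in miniature below: two exports of the same BPMN task differ in namespace prefixes, yet matching on local element names extracts the same characteristic element from both. The XML snippets and the rule are illustrative only, not the paper's rule set.

```python
import xml.etree.ElementTree as ET

tool_a = ('<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL">'
          '<bpmn:process><bpmn:userTask id="t1" name="Enter order"/></bpmn:process>'
          '</bpmn:definitions>')
tool_b = ('<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL">'
          '<process><userTask id="t1" name="Enter order"/></process></definitions>')

def user_tasks(xml_text):
    root = ET.fromstring(xml_text)
    # elem.tag looks like '{namespace}localName'; compare only the local name,
    # so prefixed and default-namespace serializations are treated alike.
    return [e.get("name") for e in root.iter() if e.tag.split("}")[-1] == "userTask"]

assert user_tasks(tool_a) == user_tasks(tool_b) == ["Enter order"]
```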

7 citations


Journal ArticleDOI
04 Jan 2021
TL;DR: CFPChecker as mentioned in this paper is a tool for verifying the correct usage of context-free API protocols by over-approximating the program's feasible API call sequences using a CFG and checking language inclusion between this grammar and the specification.
Abstract: Several real-world libraries (e.g., reentrant locks, GUI frameworks, serialization libraries) require their clients to use the provided API in a manner that conforms to a context-free specification. Motivated by this observation, this paper describes a new technique for verifying the correct usage of context-free API protocols. The key idea underlying our technique is to over-approximate the program’s feasible API call sequences using a context-free grammar (CFG) and then check language inclusion between this grammar and the specification. However, since this inclusion check may fail due to imprecision in the program’s CFG abstraction, we propose a novel refinement technique to progressively improve the CFG. In particular, our method obtains counterexamples from CFG inclusion queries and uses them to introduce new non-terminals and productions to the grammar while still over-approximating the program’s relevant behavior. We have implemented the proposed algorithm in a tool called CFPChecker and evaluate it on 10 popular Java applications that use at least one API with a context-free specification. Our evaluation shows that CFPChecker is able to verify correct usage of the API in clients that use it correctly and produces counterexamples for those that do not. We also compare our method against three relevant baselines and demonstrate that CFPChecker enables verification of safety properties that are beyond the reach of existing tools.
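
For intuition on what makes such a protocol context-free rather than finite-state, consider the reentrant-lock example from the abstract: every lock() must eventually be matched by an unlock(), i.e. the call string must belong to the language generated by S -> lock S unlock S | ε. The trace checker below is only an illustration of that property; CFPChecker instead verifies it statically by over-approximating all call sequences with a CFG and checking language inclusion.

```python
# Dynamic check of the balanced lock/unlock property over one call trace.
def conforms(trace):
    depth = 0
    for call in trace:
        if call == "lock":
            depth += 1
        elif call == "unlock":
            depth -= 1
            if depth < 0:            # unlock without a matching lock
                return False
    return depth == 0                # every lock eventually released

assert conforms(["lock", "lock", "unlock", "unlock"])
assert not conforms(["lock", "unlock", "unlock"])
```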

6 citations


Posted Content
TL;DR: In this paper, a machine learning model is proposed to automatically generate the highly structured 2D sketches that lie at the heart of computer-aided design (CAD) models, which are used in manufacturing to model everything from coffee mugs to sports cars.
Abstract: Computer-Aided Design (CAD) applications are used in manufacturing to model everything from coffee mugs to sports cars. These programs are complex and require years of training and experience to master. A component of all CAD models that is particularly difficult to create is the highly structured 2D sketch that lies at the heart of every 3D construction. In this work, we propose a machine learning model capable of automatically generating such sketches. Through this, we pave the way for developing intelligent tools that help engineers create better designs with less effort. Our method combines a general-purpose language modeling technique with an off-the-shelf data serialization protocol. We show that our approach has enough flexibility to accommodate the complexity of the domain and performs well for both unconditional synthesis and image-to-sketch translation.

6 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this paper, the authors explore multiple ways to reduce the conflicts caused by the serialization of atomic memory accesses that slows voting algorithms such as Connected Component Analysis on many-core architectures like GPUs, an issue made even more critical by the trend of increasing core counts.
Abstract: Connected Component Analysis is vastly used as a building block for many Computer Vision algorithms from many fields like medical image processing, surveillance, or autonomous driving. It extends Connected Component Labeling by computing some features of the connected components like their bounding box or their surface. As such, Connected Component Analysis is a voting algorithm just like histogram computation or Hough transform. Voting algorithms are difficult on many-core architectures like GPUs because of the serialization of atomic memory accesses. The trend to increase the number of cores makes this issue even more critical. This paper explores multiple ways to reduce those conflicts for voting algorithms and especially for Connected Component Analysis. We show that our new algorithm is from 4 up to 10 times faster than State-of-the-Art on average on an Nvidia A100.

Proceedings ArticleDOI
22 Jun 2021
TL;DR: Salsa as discussed by the authors is an approach to complement existing points-to analysis with respect to serialization-related features to enhance the call graph soundness while not greatly affecting its precision.
Abstract: Although call graphs are crucial for inter-procedural analyses, it is challenging to statically compute them for programs with dynamic features. Prior work focused on supporting certain kinds of dynamic features, but serialization-related features are still not very well supported. Therefore, we introduce Salsa, an approach to complement existing points-to analysis with respect to serialization-related features to enhance the call graph’s soundness while not greatly affecting its precision. We evaluate Salsa’s soundness, precision, and performance using 9 programs from the Java Call graph Assessment & Test Suite (CATS) and 4 programs from the XCorpus dataset. We compared Salsa against off-the-shelf call graph construction algorithms available on Soot, Doop, WALA, and OPAL. Our experiments showed that Salsa improved call graphs’ soundness while not greatly affecting their precision. We also observed that Salsa did not incur an extra overhead on the underlying pointer analysis method.

Journal ArticleDOI
02 Jul 2021-Sensors
TL;DR: In this paper, a new serialization format (PSON) is proposed for Internet of Things (IoT) environments, which simplifies the serialization/deserialization tasks and minimizes the messages to be sent/received.
Abstract: In many Internet of Things (IoT) environments, the lifetime of a sensor is linked to its power supply. Sensor devices capture external information and transmit it. They also receive messages with control commands, which means that one of the largest computational overheads of sensor devices is spent on data serialization and deserialization tasks, as well as data transmission. The simpler the serialization/deserialization and the smaller the size of the information to be transmitted, the longer the lifetime of the sensor device and, consequently, the longer it can remain in service. This paper presents a new serialization format (PSON) for these environments, which simplifies serialization/deserialization tasks and minimizes the messages to be sent and received. The paper presents evaluation results against the most popular serialization formats, demonstrating the improvement obtained with the new PSON format.
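
PSON's wire format itself is defined in the paper and is not reproduced here, but the principle the abstract relies on, that a fixed compact binary layout is both smaller and cheaper to (de)serialize than a self-describing text format, can be sketched as follows; the field layout is an assumption for illustration.

```python
import json
import struct

reading = {"id": 17, "temp": 21.5, "hum": 40.2}     # a typical sensor message

text = json.dumps(reading).encode()                  # self-describing text: 37 bytes
binary = struct.pack("<Hff", reading["id"], reading["temp"], reading["hum"])  # 10 bytes

print(len(text), len(binary))
ident, temp, hum = struct.unpack("<Hff", binary)     # trivial, allocation-light decode
```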

Journal ArticleDOI
TL;DR: In this article, a flipped classroom model for a digital micro-video for a big data English course is presented, which uses certain techniques to apply audiovisual language to the production of specific micro-class videos.
Abstract: This paper provides an in-depth analysis and study of the interactive flipped classroom model for a digital micro-video for a big data English course. To improve the learning efficiency of English courses and reduce the learning pressure of students, the thesis also uses certain techniques to apply audiovisual language to the production of specific micro-class videos, broadcast the successfully recorded micro-class courses to students, and then use the questionnaire to randomly distribute the designed audiovisual language use questionnaire. Micro-classes earnestly perform data statistics for students and finally conduct data analysis to summarize and verify the effects of micro-class audiovisual language use. The improved algorithm can effectively reduce the fluctuation of the consumption of various resources in the cluster and make the services in the cluster more stable. The new distributed interprocess communication based on protocol and serialization technology is more efficient than traditional communication based on protocol standards, reduces bandwidth consumption in the cluster, and improves the throughput of each node in the cluster. The content design and scripting of micro-video teaching resources are based on this. Then, the production process of micro-video teaching resources is explained, according to the selection of tools, the preparation, recording, editing, and generation of materials.

Journal ArticleDOI
TL;DR: In this article, the authors propose a method for planning the use of modular equipment in shale gas fields, establishing an optimization model that considers processing capacity, processing cost, floor area, construction cost, and changes in market supply and demand.
Abstract: The potential technical and economic advantages and flexible operability of modular equipment make it increasingly widely used in gas field production and development. In addition to considering the manufacturing process, the selection and serialization of modular equipment should be made according to changes in the gas well productivity curve, so as to meet field demand to the greatest extent and enhance the flexibility of the gathering and transportation system. This paper proposes a method to determine the use planning of modular equipment in shale gas fields. Considering the processing capacity, processing cost, floor area, and construction cost of modular equipment as well as changes in market supply and demand, an optimization model is established. On the basis of this model, a method for the serialization of modular equipment is proposed. The effectiveness of the model is verified by a real case study. It is shown that the model can optimize the layout of modular equipment, make the modular equipment run efficiently and economically, reduce costs, and increase efficiency. This study provides a reference for optimizing equipment management strategy and promoting green production practice in shale gas production.

Journal ArticleDOI
01 Jun 2021
TL;DR: An extension to Silver itself is described that simplifies writing language extensions for the ableC extensible C specification by allowing language engineers to specify C-language syntax trees using the concrete syntax of C (with typed holes) instead of writing abstract syntax trees.
Abstract: This paper shows how reflection on undecorated syntax trees (terms) used in attribute grammars can significantly reduce the amount of boiler-plate specifications that must be written. The proposed reflection system is implemented in the form of a function mapping terms and other values into a generic representation and a function for the inverse mapping. The system is implemented in the Silver attribute grammar system. We demonstrate the usefulness of this approach to reflection in attribute grammars in several ways. The first use is in the serialization and de-serialization of the interface files Silver generates to support separate compilation; a custom interface language was replaced by a generic reflection-based implementation. Secondly, we describe an extension to Silver itself that simplifies writing language extensions for the ableC extensible C specification by allowing language engineers to specify C-language syntax trees using the concrete syntax of C (with typed holes) instead of writing abstract syntax trees. Third, strategic term rewriting in the style of Stratego is implemented using reflection as a library for, and extension to, Silver . Finally, an experimental implementation of staged interpreters for a small staged functional language is discussed.
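
The reflection idea, one generic function from typed terms to a universal representation plus its inverse, is language-agnostic; the sketch below recreates it for Python dataclasses purely as an illustration of why per-type (de)serialization boiler-plate disappears. It is not Silver code.

```python
from dataclasses import dataclass, fields, is_dataclass

@dataclass
class Add:                       # a tiny expression "term"
    left: int
    right: int

def reflect(value):
    """Map a term to a generic constructor/children representation."""
    if is_dataclass(value):
        return {"ctor": type(value).__name__,
                "kids": {f.name: reflect(getattr(value, f.name)) for f in fields(value)}}
    return value                 # leaves (ints, strings, ...) pass through

def reify(generic, ctors):
    """Inverse mapping: rebuild the term from the generic representation."""
    if isinstance(generic, dict):
        kids = {k: reify(v, ctors) for k, v in generic["kids"].items()}
        return ctors[generic["ctor"]](**kids)
    return generic

term = Add(1, 2)
assert reify(reflect(term), {"Add": Add}) == term
```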

Proceedings ArticleDOI
10 Nov 2021
TL;DR: In this paper, the authors propose co-designing the OS and the network around pervasive data identity, combining the code mobility of RPC with first-class data references in a global address space.
Abstract: As data becomes increasingly distributed, traditional RPC and data serialization limit performance, result in rigidity, and hamper expressivity. We believe that technology trends including high-density persistent memory, high-speed networks, and programmable switches make this the right time to revisit prior research on distributed shared memory, global addressing, and content-based networking. Our vision combines the code mobility of RPC with first-class data references in a global address space by co-designing the OS and the network around pervasive data identity. We have initial results showing the promise of the proposed co-design.

Journal ArticleDOI
05 Mar 2021-Langages
TL;DR: In this article, the grammatical status of the V1-V2 (Cy/vy) constructions found in Mbya Guarani is discussed; these constructions can express simultaneous events, among other meanings, and involve a single clause.
Abstract: In this paper, we intend to describe and discuss the grammatical status of the V1-V2 (Cy/vy) constructions found in Mbya Guarani which can express simultaneous events, among other meanings, and which involve a single clause. We suggest here that this verbal complex can be treated as a case of asymmetrical verbal serialization because it contains verbs from a major lexical class, occupying the V1 slot, followed by a more restricted intransitive verbal class, such as movement, postural, or stative verbs, which stands in the V2 position. The curious property of these constructions is that V2 can be transitivized through the attachment of applicative or causative morphemes and “share” its object with transitive V1. “Object sharing” is another property attributed to serialization, as suggested by Baker and Baker and Stewart, which may be seen as a strong argument in favor of the present hypothesis. We will also provide evidence to distinguish Mbya Guarani V1-V2 (Cy/vy) complex from other constructions, such as temporal and purpose subordinate clauses, involving the particle vy.

Posted Content
TL;DR: In this paper, a domain-independent, flexible, and sequence-first Python toolkit for processing and feature extraction is presented, which is capable of handling irregularly-sampled sequences with unaligned measurements.
Abstract: Time series processing and feature extraction are crucial and time-intensive steps in conventional machine learning pipelines. Existing packages are limited in their real-world applicability, as they cannot cope with irregularly-sampled and asynchronous data. We therefore present tsflex, a domain-independent, flexible, and sequence-first Python toolkit for processing & feature extraction that is capable of handling irregularly-sampled sequences with unaligned measurements. This toolkit is sequence-first as (1) sequence-based arguments are leveraged for strided-window feature extraction, and (2) the sequence index is maintained through all supported operations. tsflex is flexible as it natively supports (1) multivariate time series, (2) multiple window-stride configurations, and (3) integration with processing and feature functions from other packages, while (4) making no assumptions about data sampling rate regularity and synchronization. Other functionalities of this package are multiprocessing, in-depth execution time logging, support for categorical & time-based data, chunking sequences, and embedded serialization. tsflex is developed to enable fast and memory-efficient time series processing & feature extraction. Results indicate that tsflex is more flexible than similar packages while outperforming these toolkits in both runtime and memory usage.

Journal ArticleDOI
TL;DR: A novel task-aware, fine-grained storage-scheme auto-selection mechanism is proposed that automatically determines the storage scheme for caching each data block (the smallest unit during computing) and can offer great performance improvement.
Abstract: In-memory big data computing, widely used in hot areas such as deep learning and artificial intelligence, can meet the demands of ultra-low-latency services and real-time data analysis. However, existing in-memory computing frameworks usually use memory in an aggressive way: memory space is quickly exhausted, which leads to great performance degradation or even task failure. On the other hand, the increasing volumes of raw data and intermediate data introduce huge memory demands, which further exacerbate the memory shortage. To relieve the pressure on memory, these in-memory frameworks provide various storage-scheme options for caching data, which determine where and how data is cached. But their storage-scheme selection mechanisms are simple and insufficient, and are always manually set by users. Besides, such coarse-grained data storage mechanisms cannot satisfy the memory access patterns of each computing unit, which works on only part of the data. In this paper, we propose a novel task-aware fine-grained storage-scheme auto-selection mechanism. It automatically determines the storage scheme for caching each data block, which is the smallest unit during computing. The caching decision is made by considering future tasks, real-time resource utilization, and storage costs, including block creation costs, I/O costs, and serialization costs under each storage scenario. The experiments show that our proposed mechanism, compared with the default storage setting, can offer great performance improvement, especially in memory-constrained circumstances, where it can be as much as 78%.
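
The abstract does not tie the discussion to a specific framework at this point, but Spark-style storage levels are a concrete instance of the manually chosen, dataset-wide "storage scheme" that the proposed mechanism replaces with a per-block, task-aware decision. A minimal PySpark sketch, with an illustrative dataset and level choice:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-scheme-demo").getOrCreate()
df = spark.range(10_000_000)

# Today: one coarse-grained, user-picked scheme for the whole dataset.
df.persist(StorageLevel.MEMORY_AND_DISK)   # alternatives: MEMORY_ONLY, DISK_ONLY, ...
print(df.count())

df.unpersist()
spark.stop()
```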

Patent
15 Jan 2021
TL;DR: In this paper, a heterogeneous data real-time synchronization method is proposed in which an extraction thread pulls incremental data from multiple source databases in parallel, serializes it into an intermediate database, and a writing thread then deserializes the intermediate data, applies idempotent operations, and serially writes the resulting data into a target database.
Abstract: The invention discloses a heterogeneous data real-time synchronization method and device, equipment, and a storage medium, involving an extraction thread, a writing thread, and a plurality of source databases. The method comprises the following steps: in response to a data synchronization request input by a user, extracting incremental data from the source databases in parallel through the extraction thread; executing a serialization analysis operation on the incremental data through the extraction thread, generating intermediate data and writing the intermediate data into an intermediate database; obtaining the intermediate data from the intermediate database in parallel through the writing thread; performing a deserialization analysis operation and an idempotent operation on the intermediate data through the writing thread to obtain to-be-written data; and serially writing the to-be-written data into a target database through the writing thread. A universal data synchronization mode is thus provided, the efficiency of heterogeneous data synchronization is effectively improved, and rapid breakpoint resumption of storage is achieved.
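
The shape of the claimed pipeline, extract, serialize into an intermediate store, then deserialize and write idempotently, can be sketched with standard-library threads; the queue stands in for the intermediate database and all names are illustrative, not from the patent.

```python
import json
import queue
import threading

intermediate = queue.Queue()
target = {}                                    # stands in for the target database

def extractor(changes):
    for change in changes:                     # incremental data from a source DB
        intermediate.put(json.dumps(change))   # serialization analysis -> intermediate data
    intermediate.put(None)                     # end-of-stream marker

def writer():
    while (item := intermediate.get()) is not None:
        change = json.loads(item)              # deserialization analysis
        target[change["key"]] = change["value"]  # idempotent: replaying a record is harmless

changes = [{"key": "a", "value": 1}, {"key": "a", "value": 1}, {"key": "b", "value": 2}]
t1 = threading.Thread(target=extractor, args=(changes,))
t2 = threading.Thread(target=writer)
t1.start(); t2.start(); t1.join(); t2.join()
print(target)                                  # {'a': 1, 'b': 2}
```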

Proceedings ArticleDOI
01 May 2021
TL;DR: FlashByte, as discussed by the authors, is a lightweight native storage that efficiently caches intermediate data by keeping concise metadata in the Java heap and raw data in native storage, reducing garbage-collection overhead as well as serialization and de-serialization costs.
Abstract: In-memory caching of intermediate data is effective in reducing re-computation and I/O cost in distributed data-analytics frameworks, but it also generates a large amount of data in Java heap which increases the overhead of garbage collection (GC). An alternative off-heap approach caches data in native storage by transmitting the data from heap to native storage so as to reduce GC overhead. However, it incurs severe serialization and de-serialization overheads. Serialization also generates non-trivial metadata of the cached data in native storage. We propose and develop FlashByte, a lightweight native storage that efficiently caches intermediate data. FlashByte improves memory efficiency by achieving low GC overhead, low data transmission overhead, and low memory consumption. Specifically, the cached data are divided into two parts: metadata stored in Java heap and raw data stored in native storage. The metadata is generated based on the profile of workloads. Its size is trivial because it only contains a concise format of raw data, which achieves low memory consumption in the heap as well as low GC overhead. Native storage only stores the raw data to reduce its memory consumption. According to the metadata, the raw data are efficiently transmitted between the heap and native storage without serialization and de-serialization. We implement FlashByte in Spark and conduct evaluation with benchmark workloads. Experimental results show that, compared with the in-heap approach of Vanilla Spark, FlashByte achieves up to 4x speedup of the job execution time, reduces GC time by up to 96%, and reduces the memory consumption in heap by up to 36%. Compared with the alternative off-heap approach, FlashByte achieves up to 2.3x speedup of the job execution time, reduces the data transmission time by up to 84%, and reduces the cache size in native storage by up to 34%.

DOI
08 Sep 2021
TL;DR: This work evaluates (de)serialization and transmission cost of mqtt.eclipse.org payloads on 8- to 32-bit microcontrollers and finds that Protocol Buffers and the XDR format, dating back to 1987, are most efficient.
Abstract: IoT devices rely on data exchange with gateways and cloud servers. However, the performance of today's serialization formats and libraries on embedded systems with energy and memory constraints is not well-documented and hard to predict. We evaluate (de)serialization and transmission cost of mqtt.eclipse.org payloads on 8- to 32-bit microcontrollers and find that Protocol Buffers (as implemented by NanoPB) and the XDR format, dating back to 1987, are most efficient.
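
For readers unfamiliar with XDR, the round trip being measured looks like the sketch below, which uses CPython's xdrlib purely for illustration (the module is deprecated since Python 3.11 and removed in 3.13; the study itself targets C implementations on microcontrollers).

```python
import xdrlib

p = xdrlib.Packer()
p.pack_int(42)                 # fixed 4-byte big-endian integer
p.pack_float(21.5)             # 4-byte IEEE float
p.pack_string(b"node-7")       # length-prefixed, padded to a 4-byte boundary
payload = p.get_buffer()

u = xdrlib.Unpacker(payload)
print(u.unpack_int(), u.unpack_float(), u.unpack_string())
```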

Book ChapterDOI
01 Jan 2021
TL;DR: This chapter is a critical review of different approaches proposed over the years in the Semantic Web research community to address RDF's lack of a built-in mechanism for statement-level metadata and context; these approaches are used to capture different types of information, such as data provenance, spatiotemporal data, and certainty.
Abstract: The many benefits of knowledge graphs using or based on the Resource Description Framework (RDF) well justify the utilization and wide deployment of a simple yet powerful, formally grounded data model, its serialization formats, vocabulary, and well-defined interpretation to be used for efficient querying, data integration, and automated reasoning. However, the simplicity of RDF comes at a price: there is no built-in mechanism for RDF statements to store metadata and context. This chapter is a critical review of different approaches proposed over the years in the Semantic Web research community to address this limitation, which are used for capturing different types of information, such as data provenance, spatiotemporal data, and certainty, which are crucial in data science applications to make statements context-aware, authoritative, verifiable, and reproducible.
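
The limitation under review is easy to see in code: a plain triple cannot carry metadata directly, so classic reification (one of the approaches the chapter surveys) wraps the statement in four extra triples before provenance can be attached. The sketch below uses rdflib for illustration; the IRIs are made up.

```python
from rdflib import Graph, URIRef, Literal, BNode, RDF

EX = "http://example.org/"
g = Graph()
s, p, o = URIRef(EX + "alice"), URIRef(EX + "worksFor"), URIRef(EX + "acme")
g.add((s, p, o))                                 # the statement itself

stmt = BNode()                                   # reified copy of the statement
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, s))
g.add((stmt, RDF.predicate, p))
g.add((stmt, RDF.object, o))
g.add((stmt, URIRef(EX + "source"), Literal("HR database")))  # metadata attaches here

print(g.serialize(format="turtle"))
```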

Book ChapterDOI
18 May 2021
TL;DR: In this article, the authors propose a snapshot-based technique to migrate a stateful JavaScript program to enable a non-breaking user experience across different devices by profiling and serializing the runtime state.
Abstract: Recently, researchers have proposed application (app) migration approaches for JavaScript programs to enable a non-breaking user experience across different devices. To migrate a stateful JavaScript app's runtime, past studies have proposed snapshot-based techniques in which the app's runtime state is profiled and serialized into a text form that can be restored later. A common limitation of the existing literature, however, is that it is based on old JavaScript specifications. Since the major updates introduced by ECMAScript 2015 (a.k.a. ES6), JavaScript supports various features that cannot be migrated correctly with existing methods. Some of these features are in fact heavily used in today's real-world apps, which greatly reduces the scope of previous works.


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, a C++ support library for ROS is proposed to track the provenance of message data across multiple nodes and apply source changes, reversing any transformation on the tracked data.
Abstract: Interactive controls that enrich visualizations need domain knowledge to create a sensible visual representation, as well as access to parameters and data to manipulate. However, source data and the means to visualize them are often scattered across multiple components, making it hard to link a value change in the interface to the appropriate source data. Provenance, the documentation of the origin and history of message data, can be used to reverse the evaluation of a value and change it at its source. We present a communication pattern as well as a C++ support library for ROS to track the provenance of message data across multiple nodes and apply source changes, reversing any transformation on the tracked data. We demonstrate that it is possible to automatically infer interactive 3D user interfaces from standard, non-interactive ROS visualizations by leveraging this additional tracking information. Preliminary results from a prototypical implementation of multiple origin-tracking-enabled ROS nodes indicate that this tracking introduces a significant but still practicable overhead in message size and serialization performance. To apply this tracking to existing C++ codebases, only small syntactic changes are necessary: a wrapper type around tracked values hides all necessary bookkeeping.


Proceedings ArticleDOI
01 Aug 2021
TL;DR: Match-Extend, as discussed by the authors, is a deterministic serialization algorithm for the derivation of surface forms in Precedence-based (Multiprecedence) phonology.
Abstract: Raimy (1999; 2000a; 2000b) proposed a graphical formalism for modeling reduplication, originally mostly focused on phonological overapplication in a derivational framework. This framework is now known as Precedence-based phonology or Multiprecedence phonology. Raimy's idea is that the segments at the input to the phonology are not totally ordered by precedence. This paper tackles a challenge that arose with Raimy's work: the development of a deterministic serialization algorithm as part of the derivation of surface forms. The Match-Extend algorithm introduced here requires fewer assumptions and adheres more closely to the attested typology. The algorithm also contains no parameter or constraint specific to individual graphs or topologies, unlike previous proposals. Match-Extend requires nothing except knowledge of the last added set of links.

Journal ArticleDOI
15 Oct 2021
TL;DR: In this paper, the authors present an architecture for the efficient storing and querying of large RDF datasets, which is built over HDT, an RDF serialization framework, and its interaction with the Jena query engine.
Abstract: We present an architecture for the efficient storing and querying of large RDF datasets. Our approach seeks to store RDF datasets in very little space while offering complete SPARQL functionality. To achieve this, our proposal was built over HDT, an RDF serialization framework, and its interaction with the Jena query engine. We propose a set of modifications to this framework in order to incorporate a range of space-efficient compact data structures for data storage and access, while using high-level capabilities to answer more complicated SPARQL queries. As a result, our approach provides a standard mechanism for using low-level data structures in complicated query situations requiring SPARQL searches, which are typically not supported by current solutions.