
Showing papers on "Software portability" published in 2018


Proceedings ArticleDOI
08 Oct 2018
TL;DR: TVM as discussed by the authors is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends, such as mobile phones, embedded devices, and accelerators.
Abstract: There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms - such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) - requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.
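For concreteness, the sketch below shows TVM's operator-level workflow in Python: the computation is declared once, a schedule describes how to map it to hardware, and only the target string changes per back-end. API names follow recent Apache TVM releases and may differ from the version described in this paper; this is an illustrative sketch, not code from the paper.

import tvm
from tvm import te

# Declare the algorithm (what to compute) separately from the schedule (how).
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute(A.shape, lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
s[C].parallel(C.op.axis[0])          # one example scheduling decision

# Retargeting is a one-line change: "llvm" (CPU), "cuda", "opencl", ...
fadd_cpu = tvm.build(s, [A, B, C], target="llvm", name="vector_add")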

991 citations


Journal ArticleDOI
TL;DR: This paper provides the first comprehensive overview of RIOT, covering the key components of interest to potential developers and users: the kernel, hardware abstraction, and software modularity, both conceptually and in practice for various example configurations.
Abstract: As the Internet of Things (IoT) emerges, compact operating systems (OSs) are required on low-end devices to ease development and portability of IoT applications. RIOT is a prominent free and open source OS in this space. In this paper, we provide the first comprehensive overview of RIOT. We cover the key components of interest to potential developers and users: the kernel, hardware abstraction, and software modularity, both conceptually and in practice for various example configurations. We explain operational aspects like system boot-up, timers, power management, and the use of networking. Finally, the relevant APIs as exposed by the OS are discussed along with the larger ecosystem around RIOT, including development and open source community aspects.

181 citations


Posted Content
TL;DR: TVM is proposed, an end-to-end optimization stack that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends; the optimization challenges specific to deep learning that TVM solves are also discussed.
Abstract: Scalable frameworks, such as TensorFlow, MXNet, Caffe, and PyTorch drive the current popularity and utility of deep learning. However, these frameworks are optimized for a narrow range of server-class GPUs and deploying workloads to other platforms such as mobile phones, embedded devices, and specialized accelerators (e.g., FPGAs, ASICs) requires laborious manual effort. We propose TVM, an end-to-end optimization stack that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. We discuss the optimization challenges specific to deep learning that TVM solves: high-level operator fusion, low-level memory reuse across threads, mapping to arbitrary hardware primitives, and memory latency hiding. Experimental results demonstrate that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art libraries for low-power CPU and server-class GPUs. We also demonstrate TVM's ability to target new hardware accelerator back-ends by targeting an FPGA-based generic deep learning accelerator. The compiler infrastructure is open sourced.

161 citations


Posted Content
TL;DR: TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
Abstract: There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

136 citations


Journal ArticleDOI
TL;DR: Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends the existing graphics processing unit (GPU) accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability.
Abstract: We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs.
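The vendor independence comes from OpenCL's host/kernel split, sketched below with pyopencl. This is only an illustration of the host-side pattern (the kernel, buffer, and sizes are made up), not the paper's simulation code: the same kernel source is compiled at runtime for whatever CPU or GPU the installed OpenCL platforms expose.

import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void scale(__global float *x, const float a) {
    int i = get_global_id(0);
    x[i] *= a;
}
"""

ctx = cl.create_some_context()        # any available OpenCL device (CPU or GPU)
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, kernel_src).build()

x = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=x)
prog.scale(queue, x.shape, None, buf, np.float32(2.0))
cl.enqueue_copy(queue, x, buf)        # x now holds the scaled values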

130 citations


Proceedings ArticleDOI
Jiawen Liu1, Hengyu Zhao1, Matheus Ogleari1, Dong Li1, Jishen Zhao1 
20 Oct 2018
TL;DR: This work proposes a software/hardware co-design of a heterogeneous processing-in-memory (PIM) system that enables high program portability and easy program maintenance across various heterogeneous hardware, optimizes system energy efficiency, and improves hardware utilization.
Abstract: Neural networks (NNs) have been adopted in a wide range of application domains, such as image classification, speech recognition, object detection, and computer vision. However, training NNs - especially deep neural networks (DNNs) - can be energy and time consuming, because of frequent data movement between processor and memory. Furthermore, training involves massive fine-grained operations with various computation and memory access characteristics. Exploiting high parallelism with such diverse operations is challenging. To address these challenges, we propose a software/hardware co-design of heterogeneous processing-in-memory (PIM) system. Our hardware design incorporates hundreds of fix-function arithmetic units and ARM-based programmable cores on the logic layer of a 3D die-stacked memory to form a heterogeneous PIM architecture attached to CPU. Our software design offers a programming model and a runtime system that program, offload, and schedule various NN training operations across compute resources provided by CPU and heterogeneous PIM. By extending the OpenCL programming model and employing a hardware heterogeneity-aware runtime system, we enable high program portability and easy program maintenance across various heterogeneous hardware, optimize system energy efficiency, and improve hardware utilization.

108 citations


Journal ArticleDOI
TL;DR: In this article, a survey of the state-of-the-art software-defined radio (SDR) platforms in the context of wireless communication protocols is presented, with a focus on programmability, flexibility, portability, and energy efficiency.

91 citations


Journal ArticleDOI
31 Jul 2018
TL;DR: If autotuning is to be widely used in the HPC community, researchers must address the software engineering challenges, manage configuration overheads, and continue to demonstrate significant performance gains and portability across architectures.
Abstract: Autotuning refers to the automatic generation of a search space of possible implementations of a computation that are evaluated through models and/or empirical measurement to identify the most desirable implementation. Autotuning has the potential to dramatically improve the performance portability of petascale and exascale applications. To date, autotuning has been used primarily in high-performance applications through tunable libraries or previously tuned application code that is integrated directly into the application. This paper draws on the authors’ extensive experience applying autotuning to high-performance applications, describing both successes and future challenges. If autotuning is to be widely used in the HPC community, researchers must address the software engineering challenges, manage configuration overheads, and continue to demonstrate significant performance gains and portability across architectures. In particular, tools that configure the application must be integrated into the application build process so that tuning can be reapplied as the application and target architectures evolve.
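As a toy illustration of the empirical-search core of autotuning (the kernel and the tunable parameter below are hypothetical; production autotuners add performance models, search heuristics, and the build-system integration the authors call for):

import time
import numpy as np

def matmul_blocked(A, B, block):
    # Candidate implementation parameterized by a tunable block size.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, block):
        for k in range(0, n, block):
            for j in range(0, n, block):
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
best = None
for block in (16, 32, 64, 128):               # the search space
    t0 = time.perf_counter()
    matmul_blocked(A, B, block)
    elapsed = time.perf_counter() - t0        # empirical measurement
    if best is None or elapsed < best[1]:
        best = (block, elapsed)
print("best block size:", best[0])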

87 citations


Journal ArticleDOI
26 Oct 2018
TL;DR: This paper focuses on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms and correctness guarantees for sophisticated floating-point code.
Abstract: In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms and correctness guarantees for sophisticated floating-point code. The problem is approached as follows: We develop a formal framework to capture computational algorithms, computing platforms, and program transformations of interest, using a unifying mathematical formalism we call operator language (OL). Then we cast the problem of synthesizing highly optimized computational kernels for a given machine as a strongly constrained optimization problem that is solved by search and a multistage rewriting system. Since all rewrite steps are semantics preserving, our approach establishes equivalence between the kernel specification and the synthesized program. This approach is implemented in the SPIRAL system, and we demonstrate it with a selection of computational kernels from the signal and image processing domain, software-defined radio, and robotic vehicle control. Our target platforms range from mobile devices, desktops, and server multicore processors to large-scale high-performance and supercomputing systems, and we demonstrate performance comparable to expertly hand-tuned code across kernels and platforms.
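As an example of the kind of semantics-preserving rule such a rewriting system applies (this is the standard Cooley-Tukey factorization, stated here for illustration rather than quoted from the paper), the DFT of size mn can be broken down as

$$ \mathrm{DFT}_{mn} = (\mathrm{DFT}_m \otimes I_n)\, T^{mn}_n\, (I_m \otimes \mathrm{DFT}_n)\, L^{mn}_m $$

where T^{mn}_n is a diagonal matrix of twiddle factors and L^{mn}_m a stride permutation; recursive, search-guided application of rules like this is what yields the platform-specific implementations.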

84 citations


Proceedings ArticleDOI
24 Feb 2018
TL;DR: This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives, and shows that this approach outperforms existing compiler approaches and hand-tuned codes.
Abstract: Stencil computations are widely used from physical simulations to machine-learning. They are embarrassingly parallel and perfectly fit modern hardware such as Graphic Processing Units. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations and it remains to be seen whether this approach is applicable for other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.
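A rough Python analogue (not Lift syntax; the primitive names here only mirror the idea) of the compositional approach: boundary handling, neighbourhood creation, and the stencil function stay separate, reusable pieces.

def pad(left, right, value, xs):
    # Boundary handling: extend the input with a constant value.
    return [value] * left + list(xs) + [value] * right

def slide(size, step, xs):
    # Neighbourhood creation: overlapping windows over the input.
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]

def jacobi3(window):
    # The stencil function applied to each 3-point neighbourhood.
    return sum(window) / 3.0

def stencil_1d(xs):
    return [jacobi3(w) for w in slide(3, 1, pad(1, 1, 0.0, xs))]

print(stencil_1d([1.0, 2.0, 3.0, 4.0]))   # output has the same length as the input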

83 citations


Journal ArticleDOI
TL;DR: A new graphical user interface for ToxPi (Toxicological Prioritization Index) is presented that provides interactive visualization, analysis, reporting, and portability, and introduces several features, from flexible data import formats and similarity-based clustering to options for high-resolution graphical output.
Abstract: Drawing integrated conclusions from diverse source data requires synthesis across multiple types of information. The ToxPi (Toxicological Prioritization Index) is an analytical framework that was developed to enable integration of multiple sources of evidence by transforming data into integrated, visual profiles. Methodological improvements have advanced ToxPi and expanded its applicability, necessitating a new, consolidated software platform to provide functionality, while preserving flexibility for future updates. We detail the implementation of a new graphical user interface for ToxPi (Toxicological Prioritization Index) that provides interactive visualization, analysis, reporting, and portability. The interface is deployed as a stand-alone, platform-independent Java application, with a modular design to accommodate inclusion of future analytics. The new ToxPi interface introduces several features, from flexible data import formats (including legacy formats that permit backward compatibility) to similarity-based clustering to options for high-resolution graphical output. We present the new ToxPi interface for dynamic exploration, visualization, and sharing of integrated data models. The ToxPi interface is freely-available as a single compressed download that includes the main Java executable, all libraries, example data files, and a complete user manual from http://toxpi.org .

Proceedings ArticleDOI
21 Apr 2018
TL;DR: This work presents Project Zanzibar: a flexible mat that can locate, uniquely identify and communicate with tangible objects placed on its surface, as well as sense a user's touch and hover hand gestures, and describes the underlying technical contributions.
Abstract: We present Project Zanzibar: a flexible mat that can locate, uniquely identify and communicate with tangible objects placed on its surface, as well as sense a user's touch and hover hand gestures. We describe the underlying technical contributions: efficient and localised Near Field Communication (NFC) over a large surface area; object tracking combining NFC signal strength and capacitive footprint detection, and manufacturing techniques for a rollable device form-factor that enables portability, while providing a sizable interaction area when unrolled. In addition, we detail design patterns for tangibles of varying complexity and interactive capabilities, including the ability to sense orientation on the mat, harvest power, provide additional input and output, stack, or extend sensing outside the bounds of the mat. Capabilities and interaction modalities are illustrated with self-generated applications. Finally, we report on the experience of professional game developers building novel physical/digital experiences using the platform.

Journal ArticleDOI
TL;DR: A comprehensive analysis of Pilot-Job systems is presented in this paper, with a focus on the motivations, evolution, properties, and implementation of Pilot-Jobs, and an outline of the Pilot abstraction, its distinguishing logical components and functionalities, its terminology, and its architecture pattern.
Abstract: Pilot-Job systems play an important role in supporting distributed scientific computing. They are used to execute millions of jobs on several cyberinfrastructures worldwide, consuming billions of CPU hours a year. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also witnessing an adoption beyond traditional domains. Notwithstanding the growing impact on scientific research, there is no agreement on a definition of Pilot-Job system and no clear understanding of the underlying abstraction and paradigm. Pilot-Job implementations have proliferated with no shared best practices or open interfaces and little interoperability. Ultimately, this is hindering the realization of the full impact of Pilot-Jobs by limiting their robustness, portability, and maintainability. This article offers a comprehensive analysis of Pilot-Job systems critically assessing their motivations, evolution, properties, and implementation. The three main contributions of this article are as follows: (1) an analysis of the motivations and evolution of Pilot-Job systems; (2) an outline of the Pilot abstraction, its distinguishing logical components and functionalities, its terminology, and its architecture pattern; and (3) the description of core and auxiliary properties of Pilot-Jobs systems and the analysis of six exemplar Pilot-Job implementations. Together, these contributions illustrate the Pilot paradigm, its generality, and how it helps to address some challenges in distributed scientific computing.
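A minimal sketch of the Pilot pattern itself may help fix the idea (illustrative only; the task names are made up and threads stand in for jobs submitted to a batch scheduler): the pilot acquires resources once and then pulls many small application tasks from a queue, decoupling task execution from batch-system scheduling.

import queue
import threading

tasks = queue.Queue()
for i in range(20):
    tasks.put(("simulate", i))             # application tasks, not batch jobs

def pilot(worker_id):
    # In production this loop would run inside a placeholder job that the
    # batch scheduler has already placed on a compute node.
    while True:
        try:
            name, arg = tasks.get_nowait()
        except queue.Empty:
            return
        print(f"pilot {worker_id} runs {name}({arg})")
        tasks.task_done()

pilots = [threading.Thread(target=pilot, args=(w,)) for w in range(4)]
for p in pilots:
    p.start()
for p in pilots:
    p.join()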

Proceedings ArticleDOI
19 Apr 2018
TL;DR: A conceptual framework is proposed to enable analysts to explore data items, track interaction histories, and alter visualization configurations through mechanisms using both devices in combination to support visual data analysis.
Abstract: We explore the combination of smartwatches and a large interactive display to support visual data analysis. These two extremes of interactive surfaces are increasingly popular, but feature different characteristics-display and input modalities, personal/public use, performance, and portability. In this paper, we first identify possible roles for both devices and the interplay between them through an example scenario. We then propose a conceptual framework to enable analysts to explore data items, track interaction histories, and alter visualization configurations through mechanisms using both devices in combination. We validate an implementation of our framework through a formative evaluation and a user study. The results show that this device combination, compared to just a large display, allows users to develop complex insights more fluidly by leveraging the roles of the two devices. Finally, we report on the interaction patterns and interplay between the devices for visual exploration as observed during our study.

Proceedings ArticleDOI
18 Jun 2018
TL;DR: Relay as mentioned in this paper is a purely functional, statically-typed language with the goal of balancing efficient compilation, expressiveness, and portability for machine learning models across an array of heterogeneous hardware devices.
Abstract: Machine learning powers diverse services in industry including search, translation, recommendation systems, and security. The scale and importance of these models require that they be efficient, expressive, and portable across an array of heterogeneous hardware devices. These constraints are often at odds; in order to better accommodate them we propose a new high-level intermediate representation (IR) called Relay. Relay is being designed as a purely-functional, statically-typed language with the goal of balancing efficient compilation, expressiveness, and portability. We discuss the goals of Relay and highlight its important design constraints. Our prototype is part of the open source NNVM compiler framework, which powers Amazon's deep learning framework MxNet.
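A small sketch of what building a model in a Relay-style functional IR looks like (API names follow the relay frontend shipped in later Apache TVM releases; the 2018 prototype described here was part of the NNVM framework and its surface API differed):

import tvm
from tvm import relay

x = relay.var("x", shape=(1, 64), dtype="float32")
w = relay.var("w", shape=(32, 64), dtype="float32")
y = relay.nn.relu(relay.nn.dense(x, w))   # expressions, not mutable graph nodes
f = relay.Function([x, w], y)             # a typed, purely functional program
mod = tvm.IRModule.from_expr(f)
print(mod)                                # the IR can be inspected and optimized as a whole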

Proceedings ArticleDOI
TL;DR: This work proposes a new high-level intermediate representation (IR) called Relay, being designed as a purely-functional, statically-typed language with the goal of balancing efficient compilation, expressiveness, and portability.
Abstract: Machine learning powers diverse services in industry including search, translation, recommendation systems, and security. The scale and importance of these models require that they be efficient, expressive, and portable across an array of heterogeneous hardware devices. These constraints are often at odds; in order to better accommodate them we propose a new high-level intermediate representation (IR) called Relay. Relay is being designed as a purely-functional, statically-typed language with the goal of balancing efficient compilation, expressiveness, and portability. We discuss the goals of Relay and highlight its important design constraints. Our prototype is part of the open source NNVM compiler framework, which powers Amazon's deep learning framework MxNet.

Journal ArticleDOI
TL;DR: A comprehensive survey of noninvasive/pain-free blood glucose monitoring methods from the recent five years is presented; AI-based estimation and decision models hold the future of noninvasive glucose monitoring in terms of accuracy, cost effectiveness, portability, and efficiency.
Abstract: Keeping track of blood glucose levels non-invasively is now possible due to diverse breakthroughs in wearable sensors technology coupled with advanced biomedical signal processing. However, each user might have different requirements and priorities when it comes to selecting a self-monitoring solution. After extensive research and careful selection, we have presented a comprehensive survey on noninvasive/pain-free blood glucose monitoring methods from the recent five years (2012–2016). Several techniques, from bioinformatics, computer science, chemical engineering, microwave technology, etc., are discussed in order to cover a wide variety of solutions available for different scales and preferences. We categorize the noninvasive techniques into nonsample- and sample-based techniques, which we further grouped into optical, nonoptical, intermittent, and continuous. The devices manufactured or being manufactured for noninvasive monitoring are also compared in this paper. These techniques are then analyzed based on certain constraints, which include time efficiency, comfort, cost, portability, power consumption, etc., a user might experience. Recalibration, time, and power efficiency are the biggest challenges that require further research in order to satisfy a large number of users. In order to solve these challenges, artificial intelligence (AI) has been employed by many researchers. AI-based estimation and decision models hold the future of noninvasive glucose monitoring in terms of accuracy, cost effectiveness, portability, efficiency, etc. The significance of this paper is twofold: first, to bridge the gap between IT and medical field; and second, to bridge the gap between end users and the solutions (hardware and software).

Journal ArticleDOI
TL;DR: This article presents SkePU 2, the next generation of the SkePU C++ skeleton programming framework for heterogeneous parallel systems, and proposes a new skeleton, Call, unique in the sense that it does not impose any predefined skeleton structure and can encapsulate arbitrary user-defined multi-backend computations.
Abstract: In this article we present SkePU 2, the next generation of the SkePU C++ skeleton programming framework for heterogeneous parallel systems. We critically examine the design and limitations of the SkePU 1 programming interface. We present a new, flexible and type-safe, interface for skeleton programming in SkePU 2, and a source-to-source transformation tool which knows about SkePU 2 constructs such as skeletons and user functions. We demonstrate how the source-to-source compiler transforms programs to enable efficient execution on parallel heterogeneous systems. We show how SkePU 2 enables new use-cases and applications by increasing the flexibility from SkePU 1, and how programming errors can be caught earlier and easier thanks to improved type safety. We propose a new skeleton, Call, unique in the sense that it does not impose any predefined skeleton structure and can encapsulate arbitrary user-defined multi-backend computations. We also discuss how the source-to-source compiler can enable a new optimization opportunity by selecting among multiple user function specializations when building a parallel program. Finally, we show that the performance of our prototype SkePU 2 implementation closely matches that of SkePU 1.

Journal ArticleDOI
TL;DR: Some of the most recent advances, as well as the remaining challenges and future prospects, in electrochemical biosensing development that could make an impact on the future global market are discussed.

Proceedings ArticleDOI
10 Feb 2018
TL;DR: It is concluded that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.
Abstract: We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling; previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. As a virtual ISA, it can be used to ship executable programs, in order to achieve both functional portability and performance portability across such systems. At runtime, HPVM enables flexible scheduling policies, both through the graph structure and the ability to compile individual nodes in a program to any of the target devices on a system. We have implemented a prototype HPVM system, defining the HPVM IR as an extension of the LLVM compiler IR, compiler optimizations that operate directly on HPVM graphs, and code generators that translate the virtual ISA to NVIDIA GPUs, Intel's AVX vector units, and to multicore X86-64 processors. Experimental results show that HPVM optimizations achieve significant performance improvements, HPVM translators achieve performance competitive with manually developed OpenCL code for both GPUs and vector hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.

Proceedings ArticleDOI
02 Jun 2018
TL;DR: The benefits of XMem are demonstrated using two use cases: improving the performance portability of software-based cache optimization by expressing the semantics of data locality in the optimization, and improving the performance of OS-based page placement in DRAM by leveraging the semantics of data structures and their access properties.
Abstract: This paper makes a case for a new cross-layer interface, Expressive Memory (XMem), to communicate higher-level program semantics from the application to the system software and hardware architecture. XMem provides (i) a flexible and extensible abstraction, called an Atom, enabling the application to express key program semantics in terms of how the program accesses data and the attributes of the data itself, and (ii) new cross-layer interfaces to make the expressed higher-level information available to the underlying OS and architecture. By providing key information that is otherwise unavailable, XMem exposes a new, rich view of the program data to the OS and the different architectural components that optimize memory system performance (e.g., caches, memory controllers). By bridging the semantic gap between the application and the underlying memory resources, XMem provides two key benefits. First, it enables architectural/system-level techniques to leverage key program semantics that are challenging to predict or infer. Second, it improves the efficacy and portability of software optimizations by alleviating the need to tune code for specific hardware resources (e.g., cache space). While XMem is designed to enhance and enable a wide range of memory optimizations, we demonstrate the benefits of XMem using two use cases: (i) improving the performance portability of software-based cache optimization by expressing the semantics of data locality in the optimization and (ii) improving the performance of OS-based page placement in DRAM by leveraging the semantics of data structures and their access properties.

Proceedings ArticleDOI
22 Jul 2018
TL;DR: A performance evaluation of Docker and Singularity on bare metal nodes in the Chameleon cloud and an analysis of mapping elements of parallel workloads to the containers for optimal resource management with container-ready orchestration tools show that scientific workloads for both Docker- and Singularity-based containers can achieve near-native performance.
Abstract: The HPC community is actively researching and evaluating tools to support execution of scientific applications in cloud-based environments. Among the various technologies, containers have recently gained importance as they have significantly better performance compared to full-scale virtualization, support for microservices and DevOps, and work seamlessly with workflow and orchestration tools. Docker is currently the leader in containerization technology because it offers low overhead, flexibility, portability of applications, and reproducibility. Singularity is another container solution that is of interest as it is designed specifically for scientific applications. It is important to conduct performance and feature analysis of the container technologies to understand their applicability for each application and target execution environment. This paper presents a (1) performance evaluation of Docker and Singularity on bare metal nodes in the Chameleon cloud (2) mechanism by which Docker containers can be mapped with InfiniBand hardware with RDMA communication and (3) analysis of mapping elements of parallel workloads to the containers for optimal resource management with container-ready orchestration tools. Our experiments are targeted toward application developers so that they can make informed decisions on choosing the container technologies and approaches that are suitable for their HPC workloads on cloud infrastructure. Our performance analysis shows that scientific workloads for both Docker and Singularity based containers can achieve near-native performance. Singularity is designed specifically for HPC workloads. However, Docker still has advantages over Singularity for use in clouds as it provides overlay networking and an intuitive way to run MPI applications with one container per rank for fine-grained resources allocation. Both Docker and Singularity make it possible to directly use the underlying network fabric from the containers for coarse-grained resource allocation.

Journal ArticleDOI
TL;DR: A novel deep learning network, the Hybrid Deep Learning Network (HDLN), is constructed and used to detect code injection attacks on mobile phones; it outperforms traditional classifiers and achieves higher average precision than other detection methods.

Journal ArticleDOI
TL;DR: From these results, platform providers can not only obtain an understanding of how investments in interoperability and portability impact cost, enable cost-effective service integration, and create value, but also design new strategies for optimizing investments.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: The Roofline model is extended so that it empirically captures a more realistic set of performance bounds for CPUs and GPUs, factors in the true cost of different floating-point instructions when counting FLOPs, incorporates the effects of different memory access patterns, and facilitates the performance portability analysis.
Abstract: System and node architectures continue to diversify to better balance on-node computation, memory capacity, memory bandwidth, interconnect bandwidth, power, and cost for specific computational workloads. For many application developers, achieving performance portability (effectively exploiting the capabilities of multiple architectures) is a desired goal. Unfortunately, dramatically different per-node performance coupled with differences in machine balance can lead to developers being unable to determine whether they have attained performance portability or simply written portable code. The Roofline model provides a means of quantitatively assessing how well a given application makes use of a target platform’s computational capabilities. In this paper, we extend the Roofline model so that it 1) empirically captures a more realistic set of performance bounds for CPUs and GPUs, 2) factors in the true cost of different floating-point instructions when counting FLOPs, 3) incorporates the effects of different memory access patterns, and 4) with appropriate pairing of code performance and Roofline ceiling, facilitates the performance portability analysis.
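For reference, the classical Roofline bound that the paper refines states that a kernel with arithmetic intensity I (FLOPs per byte moved to or from memory) can attain at most

$$ P_{\mathrm{attainable}} = \min\bigl(P_{\mathrm{peak}},\; I \cdot B_{\mathrm{peak}}\bigr) $$

the extensions above replace the theoretical P_peak and B_peak with empirically measured ceilings and make the FLOP count sensitive to the instruction mix and access pattern.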

Proceedings ArticleDOI
21 May 2018
TL;DR: In this paper, a convolutional neural network is trained to provide accurate hand gesture recognition, and a finite-state machine based deterministic model performs efficient gesture-to-instruction mapping and further improves robustness of the interaction scheme.
Abstract: This paper presents a real-time programming and parameter reconfiguration method for autonomous underwater robots in human-robot collaborative tasks. Using a set of intuitive and meaningful hand gestures, we develop a syntactically simple framework that is computationally more efficient than a complex, grammar-based approach. In the proposed framework, a convolutional neural network is trained to provide accurate hand gesture recognition; subsequently, a finite-state machine-based deterministic model performs efficient gesture-to-instruction mapping and further improves robustness of the interaction scheme. The key aspect of this framework is that it can be easily adopted by divers for communicating simple instructions to underwater robots without using artificial tags such as fiducial markers or requiring memorization of a potentially complex set of language rules. Extensive experiments are performed both on field-trial data and through simulation, which demonstrate the robustness, efficiency, and portability of this framework in a number of different scenarios. Finally, a user interaction study is presented that illustrates the gain in the ease of use of our proposed interaction framework compared to the existing methods for the underwater domain.
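To make the gesture-to-instruction idea concrete (the states, gesture labels, and instructions below are hypothetical, not the paper's vocabulary), a deterministic transition table is enough once the CNN has produced a gesture label per frame:

# Hypothetical finite-state machine: (state, gesture) -> (next state, instruction)
TRANSITIONS = {
    ("IDLE", "ok"):         ("RECORDING", None),
    ("RECORDING", "left"):  ("RECORDING", "turn_left"),
    ("RECORDING", "right"): ("RECORDING", "turn_right"),
    ("RECORDING", "ok"):    ("IDLE", "execute_program"),
}

def gestures_to_program(gestures):
    state, program = "IDLE", []
    for g in gestures:                     # g would come from the CNN classifier
        state, instruction = TRANSITIONS.get((state, g), (state, None))
        if instruction:
            program.append(instruction)
    return program

print(gestures_to_program(["ok", "left", "left", "right", "ok"]))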

Proceedings ArticleDOI
19 Mar 2018
TL;DR: This work outlines a novel 5G PPP-compliant software framework specifically tailored to the energy domain, which combines i) trusted, scalable and lock-in free plug ‘n’ play support for a variety of constrained devices and ii) 5G devices’ abstractions.
Abstract: The energy sector represents undoubtedly one of the most significant “test cases” for 5G enabling technologies, due to the need of addressing a huge range of very diverse requirements to deal with across a variety of applications (stringent capacity for smart metering/AMI versus latency for supervisory control and fault localization). However, to effectively support energy utilities along their transition towards more decentralized renewable-oriented systems, several open issues still remain as to 5G networks management automation, security, resilience, scalability and portability. To face these issues, we outline a novel 5G PPP-compliant software framework specifically tailored to the energy domain, which combines i) trusted, scalable and lock-in free plug ‘n’ play support for a variety of constrained devices; ii) 5G devices’ abstractions to demonstrate mMTC (massive Machine Type Communications), uMTC (critical MTC) and xMBB (Extended Massive BroadBand) communications coupled with partially distributed, trusted, end-to-end security and MCM to enable secure, scalable and energy efficient communications; iii) extended Mobile Edge Computing (xMEC) micro-clouds to reduce backhaul load, increase the overall network capacity and reduce delays, while facilitating the deployment of generic MTC related NFVs (Network Function Virtualisation) and utility-centric VNFs (Virtual Network Functions).

Journal ArticleDOI
01 Jul 2018
TL;DR: AIDA emulates the syntax and semantics of popular data science packages but transparently executes the required transformations and computations inside the RDBMS, and supports the seamless use of both relational and linear algebra operations using a unified abstraction.
Abstract: With the tremendous growth in data science and machine learning, it has become increasingly clear that traditional relational database management systems (RDBMS) are lacking appropriate support for the programming paradigms required by such applications, whose developers prefer tools that perform the computation outside the database system. While the database community has attempted to integrate some of these tools in the RDBMS, this has not swayed the trend as existing solutions are often not convenient for the incremental, iterative development approach used in these fields. In this paper, we propose AIDA - an abstraction for advanced in-database analytics. AIDA emulates the syntax and semantics of popular data science packages but transparently executes the required transformations and computations inside the RDBMS. In particular, AIDA works with a regular Python interpreter as a client to connect to the database. Furthermore, it supports the seamless use of both relational and linear algebra operations using a unified abstraction. AIDA relies on the RDBMS engine to efficiently execute relational operations and on an embedded Python interpreter and NumPy to perform linear algebra operations. Data reformatting is done transparently and avoids data copy whenever possible. AIDA does not require changes to statistical packages or the RDBMS facilitating portability.

Journal ArticleDOI
TL;DR: Important TOSCA concepts and benefits in the context of commonly understood cloud use cases are introduced as a foundation to future discussions regarding advanced TOSCA concepts and additional breakthrough issues.
Abstract: TOSCA, the Topology and Orchestration Specification for Cloud Applications offers an OASIS-recognized, open standard domain-specific language (DSL) that enables portability and automated management of applications, services, and resources regardless of underlying cloud platform, software defined environment, or infrastructure. With a growing, interoperable eco-system of open source projects, solutions from leading cloud platform and service providers, and research, TOSCA empowers the definition and modeling of applications and their services (microservices or traditional services) across their entire lifecycle by describing their components, relationships, dependencies, requirements, and capabilities for orchestrating software in the context of associated operational policies. The authors introduce important TOSCA concepts and benefits in the context of commonly understood cloud use cases as a foundation to future discussions regarding advanced TOSCA concepts and additional breakthrough issues.

Journal ArticleDOI
TL;DR: A novel automated, modular, multi-layer, and portable cloud monitoring framework is presented that is capable of automatically adapting when elasticity actions are enforced on either the cloud service or the monitoring topology, and that is recoverable from faults introduced in the monitoring configuration, with proven scalability and a low runtime footprint.
Abstract: Automatic resource provisioning is a challenging and complex task. It requires for applications, services and underlying platforms to be continuously monitored at multiple levels and time intervals. The complex nature of this task lays in the ability of the monitoring system to automatically detect runtime configurations in a cloud service due to elasticity action enforcement. Moreover, with the adoption of open cloud standards and library stacks, cloud consumers are now able to migrate their applications or even distribute them across multiple cloud domains. However, current cloud monitoring tools are either bounded to specific cloud platforms or limit their portability to provide elasticity support. In this article, we describe the challenges when monitoring elastically adaptive multi-cloud services. We then introduce a novel automated, modular, multi-layer and portable cloud monitoring framework. Experiments on multiple clouds and real-life applications show that our framework is capable of automatically adapting when elasticity actions are enforced to either the cloud service or to the monitoring topology. Furthermore, it is recoverable from faults introduced in the monitoring configuration with proven scalability and low runtime footprint. Most importantly, our framework is able to reduce network traffic by 41 percent, and consequently the monitoring cost, which is both billable and noticeable in large-scale multi-cloud services.