Showing papers on "Software portability" published in 2019


Proceedings ArticleDOI
01 Nov 2019
TL;DR: RAJA, a portability layer that enables C++ applications to leverage various programming models, and thus architectures, from a single-source codebase, is described, along with preliminary results from its use in three large production codes.
Abstract: Modern high-performance computing systems are diverse, with hardware designs ranging from homogeneous multi-core CPUs to GPU or FPGA accelerated systems. Achieving desirable application performance often requires choosing a programming model best suited to a particular platform. For large codes used daily in production that are under continual development, architecture-specific ports are untenable. Maintainability requires single-source application code that is performance portable across a range of architectures and programming models. In this paper we describe RAJA, a portability layer that enables C++ applications to leverage various programming models, and thus architectures, with a single-source codebase. We describe preliminary results using RAJA in three large production codes at Lawrence Livermore National Laboratory, observing 17×, 13× and 12× speedup on GPU-only over CPU-only nodes with single-source application code in each case.
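As a rough illustration of the single-source style RAJA enables (our own sketch, not code from the paper), the same loop body can be retargeted by swapping the execution policy template parameter; the policy names shown are standard RAJA policies, while the surrounding setup is assumed:

    #include <RAJA/RAJA.hpp>
    #include <vector>

    int main() {
      const int N = 1 << 20;
      std::vector<double> x(N, 1.0), y(N, 2.0);
      const double a = 3.0;
      double* xp = x.data();
      double* yp = y.data();

      // Sequential execution: the loop body is an ordinary C++ lambda.
      RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=](int i) {
        yp[i] += a * xp[i];
      });

      // Retargeting is a policy swap, not a rewrite (both variants assume RAJA
      // was built with the corresponding back-end; the CUDA case also needs
      // device-visible memory, e.g. unified memory):
      //   RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N),
      //     [=](int i) { yp[i] += a * xp[i]; });
      //   RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, N),
      //     [=] RAJA_DEVICE (int i) { yp[i] += a * xp[i]; });
      return 0;
    }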

123 citations


Proceedings ArticleDOI
22 Jun 2019
TL;DR: Triton is presented, a language and compiler centered around the concept of the tile, i.e., statically shaped multi-dimensional sub-arrays; it combines a C-based language for expressing tensor programs in terms of operations on parametric tile variables with a set of novel tile-level optimization passes for compiling these programs into efficient GPU code.
Abstract: The validation and deployment of novel research ideas in the field of Deep Learning is often limited by the availability of efficient compute kernels for certain basic primitives. In particular, operations that cannot leverage existing vendor libraries (e.g., cuBLAS, cuDNN) are at risk of facing poor device utilization unless custom implementations are written by experts – usually at the expense of portability. For this reason, the development of new programming abstractions for specifying custom Deep Learning workloads at a minimal performance cost has become crucial. We present Triton, a language and compiler centered around the concept of tile, i.e., statically shaped multi-dimensional sub-arrays. Our approach revolves around (1) a C-based language and an LLVM-based intermediate representation (IR) for expressing tensor programs in terms of operations on parametric tile variables and (2) a set of novel tile-level optimization passes for compiling these programs into efficient GPU code. We demonstrate how Triton can be used to build portable implementations of matrix multiplication and convolution kernels on par with hand-tuned vendor libraries (cuBLAS / cuDNN), or for efficiently implementing recent research ideas such as shift convolutions.

59 citations


Journal ArticleDOI
TL;DR: A new GMQL‐based system with enhanced accessibility, portability, scalability and performance in genomic data management, based on the Genomic Data Model and the GenoMetric Query Language is presented.
Abstract: Motivation We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance. Results The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work. Availability and implementation The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/. Supplementary information Supplementary data are available at Bioinformatics online.

54 citations


Journal ArticleDOI
TL;DR: This article presents a concise definition for performance portability and an associated metric that accurately capture the performance and portability of an application across different platforms and suggests tractable approaches to code specialization which can aid the community in developing highly performance-portable applications with minimal impact to productivity.
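For reference, the performance portability metric widely attributed to Pennycook et al. (and applied in the cross-application study summarized later on this page) is usually stated as the harmonic mean of an application's performance efficiency over a set of platforms, dropping to zero if any platform is unsupported; the notation below is a sketch reproduced from memory rather than a quotation of the article:

    \text{PP}(a, p, H) =
      \begin{cases}
        \dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a, p)}} & \text{if application } a \text{ is supported on every platform } i \in H, \\[1ex]
        0 & \text{otherwise,}
      \end{cases}

where H is the set of platforms of interest, p is the problem being solved, and e_i(a, p) is the performance efficiency of application a solving p on platform i (achieved performance relative to a best-known or architectural-peak baseline).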

53 citations


Proceedings Article
24 Feb 2019
TL;DR: FreeFlow is a software-based RDMA virtualization framework designed for containerized clouds that fully satisfies the requirements from cloud environments, such as isolation for multi-tenancy, portability for container migrations, and controllability for control and data plane policies.
Abstract: Many popular large-scale cloud applications are increasingly using containerization for high resource efficiency and lightweight isolation. In parallel, many data-intensive applications (e.g., data analytics and deep learning frameworks) are adopting or looking to adopt RDMA for high networking performance. Industry trends suggest that these two approaches are on an inevitable collision course. In this paper, we present FreeFlow, a software-based RDMA virtualization framework designed for containerized clouds. FreeFlow realizes virtual RDMA networking purely with a software-based approach using commodity RDMA NICs. Unlike existing RDMA virtualization solutions, FreeFlow fully satisfies the requirements from cloud environments, such as isolation for multi-tenancy, portability for container migrations, and controllability for control and data plane policies. FreeFlow is also transparent to applications and provides networking performance close to bare-metal RDMA with low CPU overhead. In our evaluations with TensorFlow and Spark, FreeFlow provides almost the same application performance as bare-metal RDMA.

53 citations


Proceedings ArticleDOI
17 Nov 2019
TL;DR: Stateful DataFlow multiGraph (SDFG) as discussed by the authors is a data-centric intermediate representation that enables separating program definition from its optimization by combining fine-grained data dependencies with high-level control-flow.
Abstract: The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization. By combining fine-grained data dependencies with high-level control-flow, SDFGs are both expressive and amenable to program transformations, such as tiling and double-buffering. These transformations are applied to the SDFG in an interactive process, using extensible pattern matching, graph rewriting, and a graphical user interface. We demonstrate SDFGs on CPUs, GPUs, and FPGAs over various motifs --- from fundamental computational kernels to graph analytics. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.
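To give a flavor of one transformation named above (tiling), here is a generic C++ sketch of tiling a loop nest for cache locality; it is our own illustration of the general technique, not SDFG or DaCe code:

    #include <algorithm>
    #include <vector>

    // Naive n x n matrix multiply: the k-loop streams through B with poor locality.
    void matmul_naive(const std::vector<double>& A, const std::vector<double>& B,
                      std::vector<double>& C, int n) {
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
          for (int k = 0; k < n; ++k)
            C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    // Tiled variant: the iteration space is split into T x T x T blocks so that
    // each block's working set fits in cache; the arithmetic is unchanged.
    void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                      std::vector<double>& C, int n, int T) {
      for (int ii = 0; ii < n; ii += T)
        for (int jj = 0; jj < n; jj += T)
          for (int kk = 0; kk < n; kk += T)
            for (int i = ii; i < std::min(ii + T, n); ++i)
              for (int j = jj; j < std::min(jj + T, n); ++j)
                for (int k = kk; k < std::min(kk + T, n); ++k)
                  C[i * n + j] += A[i * n + k] * B[k * n + j];
    }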

52 citations


Posted ContentDOI
16 Apr 2019-bioRxiv
TL;DR: The nf-core framework as discussed by the authors provides a community-driven platform for the creation and development of best practice analysis pipelines written in the Nextflow language, which can be used across various institutions and research facilities.
Abstract: The standardization, portability, and reproducibility of analysis pipelines is a renowned problem within the bioinformatics community. Bioinformatic analysis pipelines are often designed for execution on-premise, and this inevitably leads to a level of customisation and integration that is only applicable to the local infrastructure. More notably, the software required to run these pipelines is also tightly coupled with the local compute environment, and this leads to poor pipeline portability and reproducibility of the ensuing results - both of which are fundamental requirements for the validation of scientific findings. Here we introduce nf-core, a framework that provides a community-driven platform for the creation and development of best-practice analysis pipelines written in the Nextflow language. Nextflow has built-in support for pipeline execution on most computational infrastructures, as well as automated deployment using container technologies such as Conda, Docker, and Singularity. Therefore, key obstacles in pipeline development such as portability, reproducibility, scalability and unified parallelism are inherently addressed by all nf-core pipelines. Furthermore, to ensure that new pipelines can be added seamlessly and existing pipelines are able to inherit up-to-date functionality, the nf-core community is actively developing a suite of tools that automate pipeline creation, testing, deployment and synchronization. The peer-review process during pipeline development ensures that best practices and common usage patterns are imposed and therefore adhered to, in line with community guidelines. Our primary goal is to provide a community-driven platform for high-quality, well-documented and reproducible bioinformatics pipelines that can be utilized across various institutions and research facilities.

50 citations


Proceedings ArticleDOI
28 Feb 2019
TL;DR: The high potential of deep learning attacks against secure implementations of RSA is shown, raising the need for dedicated countermeasures.
Abstract: This paper presents the results of several successful profiled side-channel attacks against a secure implementation of the RSA algorithm. The implementation was running on an ARM Core SC 100 completed with a certified EAL4+ arithmetic co-processor. The analyses were conducted by three expert teams, each working on a specific attack path and exploiting information extracted either from the electromagnetic emanation or from the power consumption. Particular attention is paid to the description of all the steps usually followed during a security evaluation by a laboratory, including the acquisitions and the preprocessing of observations, practical issues usually put aside in the literature. Remarkably, the profiling portability issue is also taken into account, and different device samples are used for the profiling and testing phases. Among other aspects, this paper shows the high potential of deep learning attacks against secure implementations of RSA and raises the need for dedicated countermeasures.

47 citations


Book ChapterDOI
TL;DR: This chapter lays out a research agenda in the sociology of work for a type of data and organizational intermediary, the work platform, and employs a case study of the adoption of automated hiring platforms in which the authors distinguish between promises and existing practices.
Abstract: This chapter lays out a research agenda in the sociology of work for a type of data and organizational intermediary: work platforms. As an example, the authors employ a case study of the adoption of automated hiring platforms (AHPs) in which the authors distinguish between promises and existing practices. The authors draw on two main methods to do so: critical discourse analysis and affordance critique. The authors collected and examined a mix of trade, popular press, and corporate archives; 135 texts in total. The analysis reveals that work platforms offer five core affordances to management: (1) structured data fields optimized for capture and portability within organizations; (2) increased legibility of activity qua data captured inside and outside the workplace; (3) information asymmetry between labor and management; (4) an “ecosystem” design that supports the development of limited-use applications for specific domains; and (5) the standardization of managerial techniques between workplaces. These combine to create a managerial frame for workers as fungible human capital, available on demand and easily ported between job tasks and organizations. While outlining the origin of platform studies within media and communication studies, the authors demonstrate the specific tools the sociology of work brings to the study of platforms within the workplace. The authors conclude by suggesting avenues for future sociological research not only on hiring platforms, but also on other work platforms such as those supporting automated scheduling and customer relationship management.

46 citations


Journal ArticleDOI
TL;DR: This paper proposes MO2R2S, a multiobjective optimization approach for the development of reconfigurable real-time systems, focusing on three optimization criteria: response time, memory allocation, and energy consumption.
Abstract: This paper deals with reconfigurable real-time systems that must adapt to their environment under real-time constraints. Reconfiguration allows moving from one implementation to another by adding/removing/modifying parameters of real-time software tasks, which must meet their deadlines. Implementing such systems as threads generates complex system code due to the large number of threads, which may lead to reconfiguration time overhead as well as increased energy consumption and memory allocation. This paper therefore proposes MO2R2S, a multiobjective optimization approach for the development of reconfigurable real-time systems. Given a specification, the proposed approach aims to produce an optimal design while ensuring system feasibility. We focus on three optimization criteria: 1) response time; 2) memory allocation; and 3) energy consumption. To address the portability issue, the optimal design is then transformed into an abstract code that may in turn be transformed into a concrete code specific to a procedural programming language (i.e., POSIX) or an object-oriented language (i.e., RT-Java). The MO2R2S approach reduces the number of threads by minimizing the redundancy between the implementation sets. In an experimental study, this optimization decreases memory allocation by 28.89%, energy consumption by 40.2%, and response time by 61.32%.
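Stated generically (our notation, not the paper's), the design problem the abstract describes is a multiobjective minimization over candidate implementations subject to schedulability:

    \min_{x \in \mathcal{X}} \; \bigl( R(x),\; M(x),\; E(x) \bigr)
    \quad \text{subject to} \quad R_i(x) \le D_i \ \text{for every task } \tau_i,

where \mathcal{X} is the set of feasible reconfigurable designs, R, M and E denote response time, memory allocation and energy consumption, and D_i is the deadline of task \tau_i.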

46 citations


Proceedings ArticleDOI
01 Feb 2019
TL;DR: This work introduces Arbor, a performance-portable library for simulating large networks of multi-compartment neurons on HPC systems; portability is achieved through back-end-specific optimizations for x86 multicore, Intel KNL, and NVIDIA GPUs.
Abstract: We introduce Arbor, a performance portable library for simulation of large networks of multi-compartment neurons on HPC systems. Arbor is open source software, developed under the auspices of the HBP. Performance portability is achieved through back-end-specific optimizations for x86 multicore, Intel KNL, and NVIDIA GPUs. When coupled with low memory overheads, these optimizations make Arbor an order of magnitude faster than the most widely-used comparable simulation software. The single-node performance can be scaled out to run very large models at extreme scale with efficient weak scaling.

Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this paper, an end-to-end approach to extract semantic concepts directly from the speech audio signal is presented; it relies on a transfer learning strategy based on the principles of curriculum learning, which can exploit out-of-domain data to help prepare a fully neural architecture.
Abstract: We present an end-to-end approach to extract semantic concepts directly from the speech audio signal. To overcome the lack of data available for this spoken language understanding approach, we investigate the use of a transfer learning strategy based on the principles of curriculum learning. This approach allows us to exploit out-of-domain data that can help to prepare a fully neural architecture. Experiments are carried out on the French MEDIA and PORTMEDIA corpora and show that this end-to-end SLU approach reaches the best results ever published on this task. We compare our approach to a classical pipeline approach that uses ASR, POS tagging, a lemmatizer, a chunker... and other NLP tools that aim to enrich the ASR outputs feeding an SLU text-to-concepts system. Lastly, we explore the promising capacity of our end-to-end SLU approach to address the problem of domain portability.

Journal ArticleDOI
TL;DR: The key idea of MalDy portability is the modeling of the behavioral reports as a sequence of words, along with advanced natural language processing (NLP) and machine learning techniques for the automatic engineering of relevant security features to detect and attribute malware without investigator intervention.

Journal ArticleDOI
TL;DR: LFRic as mentioned in this paper is a weather and climate modelling system developed by the UK Met Office to replace the existing Unified Model in preparation for exascale computing in the 2020s.

Journal ArticleDOI
01 Sep 2019
TL;DR: The design of Apollo is presented, a toolchain for automatically detecting, reporting, and diagnosing performance regressions in DBMSs, and it is demonstrated that Apollo automates the generation of regression-triggering queries, simplifies the bug reporting process for users, and enables developers to quickly pinpoint the root cause of performance regressions.
Abstract: The practical art of constructing database management systems (DBMSs) involves a morass of trade-offs among query execution speed, query optimization speed, standards compliance, feature parity, modularity, portability, and other goals. It is no surprise that DBMSs, like all complex software systems, contain bugs that can adversely affect their performance. The performance of DBMSs is an important metric as it determines how quickly an application can take in new information and use it to make new decisions. Both developers and users face challenges while dealing with performance regression bugs. First, developers usually find it challenging to manually design test cases to uncover performance regressions since DBMS components tend to have complex interactions. Second, users encountering performance regressions are often unable to report them, as the regression-triggering queries could be complex and database-dependent. Third, developers have to expend a lot of effort on localizing the root cause of the reported bugs, due to the system complexity and software development complexity. Given these challenges, this paper presents the design of Apollo, a toolchain for automatically detecting, reporting, and diagnosing performance regressions in DBMSs. We demonstrate that Apollo automates the generation of regression-triggering queries, simplifies the bug reporting process for users, and enables developers to quickly pinpoint the root cause of performance regressions. By automating the detection and diagnosis of performance regressions, Apollo reduces the labor cost of developing efficient DBMSs.

Proceedings ArticleDOI
25 Sep 2019
TL;DR: This study explores the wider landscape of performance portability by considering a number of applications from across the space of dwarfs, written in multiple parallel programming models, and across a diverse set of architectures.
Abstract: Previous studies into performance portability have typically analysed a single application (and its various implementations) in isolation. In this study we explore the wider landscape of performance portability by considering a number of applications from across the space of dwarfs, written in multiple parallel programming models, and across a diverse set of architectures. We apply rigorous performance portability metrics, as defined by Pennycook et al. [1]. We believe this is the broadest and most rigorous performance portability study to date, representing a far-reaching exploration of the state of performance portability that is achievable today. We will present a summary of the performance portability of each application and programming model across our diverse range of twelve computer architectures, including six different server CPUs from five different vendors, five different GPUs from two different vendors, and one vector architecture. We will conclude with an analysis of the performance portability of key programming models in general, across different application spaces as well as across differing architectures, allowing us to comment on more general performance portability principles.
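As a concrete illustration of the kind of single-source kernel such studies compare across programming models (our own example, not one of the study's applications), the same DAXPY-style loop can be expressed with OpenMP target offload directives; with an OpenMP 4.5+ compiler targeting a GPU it runs on the device, and otherwise it falls back to the host:

    #include <cstdio>
    #include <vector>

    // Single-source kernel: the directive offloads to an accelerator when one is
    // available and the compiler supports OpenMP offloading; otherwise the loop
    // simply runs on the host CPU.
    void daxpy(double a, const double* x, double* y, int n) {
      #pragma omp target teams distribute parallel for \
              map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; ++i) {
        y[i] += a * x[i];
      }
    }

    int main() {
      const int n = 1 << 20;
      std::vector<double> x(n, 1.0), y(n, 2.0);
      daxpy(3.0, x.data(), y.data(), n);
      std::printf("y[0] = %f\n", y[0]);  // expect 5.0
      return 0;
    }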

Journal ArticleDOI
TL;DR: A compiler for the Scaffold quantum programming language in which aggressive optimization specifically targets NISQ machines with hundreds of qubits is presented, and it is shown that it is feasible to synthesize near-optimal compiled code for current and small NISQ systems.

Journal ArticleDOI
TL;DR: The aim of this research is to evaluate available Internet of Things (IoT) databases in an edge/cloud platform by applying AHP and to suggest a suitable approach for developing a database application.

Proceedings ArticleDOI
07 Nov 2019
TL;DR: CSPOT is empirically evaluated to find that it implements function invocation with significantly lower latency than other FaaS offerings, while providing portability across tiers and similar data durability characteristics.
Abstract: In this paper, we present CSPOT, a distributed runtime system implementing a functions-as-service (FaaS) programming model for the "Internet of Things" (IoT). With FaaS, developers express arbitrary computations as simple functions that are automatically invoked and managed by a cloud platform in response to events. We extend this FaaS model so that it is suitable for use in all tiers of scale for IoT - sensors, edge devices, and cloud - to facilitate robust, portable, and low-latency IoT application development and deployment. To enable this, we combine the use of Linux containers and namespaces for isolation and portability, an append-only object store for robust persistence, and a causal event log for triggering functions and tracking event dependencies. We present the design and implementation of CSPOT, detail its abstractions and APIs, and overview examples of its use. We empirically evaluate the performance of CSPOT using different devices and applications and find that it implements function invocation with significantly lower latency than other FaaS offerings, while providing portability across tiers and similar data durability characteristics.

Journal ArticleDOI
TL;DR: The role of sensor technology and portable miniaturized systems is considered, with special attention paid to portable sample treatment systems based on microwave and ultrasound technologies and to the use of image processing systems.
Abstract: Recent advances in the portability of analytical equipment are considered in order to highlight the advantages offered by portable instrumentation in greening analytical methods. Its use drastically reduces sampling, sample storage, and transport, thus avoiding environmental side effects and risks while improving decision-making. The fact that portable instrumentation is, in general, less expensive than bench instruments and apparatuses also makes analytical tools available to wider sectors of the population, thus making the advantages derived from analytical methods more accessible. The role of sensor technology and portable miniaturized systems is considered, with special attention paid to portable sample treatment systems based on microwave and ultrasound technologies and to the use of image processing systems.

Proceedings ArticleDOI
20 May 2019
TL;DR: This paper analyzes the productivity advantages of adopting containers for large HPC codes, quantifies the performance overhead induced by three different container technologies compared to native execution, and selects Singularity as the best technology based on performance and portability.
Abstract: Since the appearance of Docker in 2013, container technologies for computers have evolved and gained importance in cloud data centers. However, adoption of containers in High-Performance Computing (HPC) centers is still under discussion: on one hand, the ease in portability is very well accepted; on the other hand, the performance penalties and security issues introduced by the added software layers are often under scrutiny. Since very little evaluation of large production HPC codes running in containers is available, we provide in this paper a comparative study using a production simulation of a biological system. The simulation is performed using Alya, which is a computational fluid dynamics (CFD) code optimized for HPC environments and enabled to run multiphysics problems. In the paper, we analyze the productivity advantages of adopting containers for large HPC codes, and we quantify the performance overhead induced by the use of three different container technologies (Docker, Singularity and Shifter), comparing it to native execution. Given the results of these tests, we selected Singularity as the best technology, based on performance and portability. We show scalability results of Alya using Singularity on up to 256 computational nodes (up to 12k cores) of MareNostrum4 and present a study of performance and portability on three different HPC architectures (Intel Skylake, IBM Power9, and Arm-v8).

Proceedings ArticleDOI
01 Feb 2019
TL;DR: The results reveal that the presence of outdated npm packages in Docker images increases the risk of potential security vulnerabilities, suggesting that Docker maintainers should keep their installed JavaScript packages up to date.
Abstract: Containerized applications, and in particular Docker images, are becoming a common solution in cloud environments to meet ever-increasing demands in terms of portability, reliability and fast deployment. A Docker image includes all environmental dependencies required to run it, such as specific versions of system and third-party packages. Leveraging on its modularity, an image can be easily embedded in other images, thus simplifying the way of sharing dependencies and building new software. However, the dependencies included in an image may be out of date due to backward compatibility requirements, endangering the environments where the image has been deployed with known vulnerabilities. While previous research efforts have focused on studying the impact of bugs and vulnerabilities of system packages within Docker images, no attention has been given to third-party packages. This paper empirically studies the impact of npm JavaScript package vulnerabilities in Docker images. We based our analysis on 961 images from three official repositories that use Node.js, and 1,099 security reports of packages available on npm, the most popular JavaScript package manager. Our results reveal that the presence of outdated npm packages in Docker images increases the risk of potential security vulnerabilities, suggesting that Docker maintainers should keep their installed JavaScript packages up to date.

Book ChapterDOI
09 Apr 2019
TL;DR: TaPaSCo aims to increase the scalability and portability of FPGA designs by performing the construction of heterogeneous many-core architectures from custom processing elements, and providing a simple, uniform programming interface to utilize spatially parallel computation on FPGAs.
Abstract: In this paper we present TaPaSCo – the Task Parallel Systems Composer, an open-source toolflow and software framework for the automated construction of System-on-Chip FPGA designs for task-parallel computation. TaPaSCo aims to increase the scalability and portability of FPGA designs by constructing heterogeneous many-core architectures from custom processing elements and providing a simple, uniform programming interface to utilize spatially parallel computation on FPGAs. A key feature of TaPaSCo is automated design space exploration, which can be performed in parallel on a computing cluster. This greatly simplifies scaling hardware designs, facilitating iterative growth and portability across FPGA devices and families.

Posted Content
TL;DR: In this article, the authors consider realistic side-channel scenarios and commonly used machine learning techniques to evaluate the influence of portability on the efficacy of an attack and show that portability plays an important role and should not be disregarded as it contributes to a significant overestimate of the attack efficiency.
Abstract: Profiled side-channel attacks represent a practical threat to digital devices, thereby having the potential to disrupt the foundation of e-commerce, the Internet of Things (IoT), and smart cities. In a profiled side-channel attack, the adversary gains knowledge about the target device by getting access to a cloned device. Though these two devices are different in real-world scenarios, a large part of research work unfortunately simplifies the setting by using only a single device for both profiling and attacking. There, the portability issue is conveniently ignored to ease the experimental procedure. In parallel to the above developments, machine learning techniques are used in recent literature, demonstrating excellent performance in profiled side-channel attacks. Again, unfortunately, portability is neglected. In this paper, we consider realistic side-channel scenarios and commonly used machine learning techniques to evaluate the influence of portability on the efficacy of an attack. Our experimental results show that portability plays an important role and should not be disregarded, as it contributes to a significant overestimate of the attack efficiency, which can easily be an order of magnitude in size. After establishing the importance of portability, we propose a new model called the Multiple Device Model (MDM) that formally incorporates the device-to-device variation during a profiled side-channel attack. We show through experimental studies how machine learning and MDM significantly enhance the capacity for practical side-channel attacks. More precisely, we demonstrate how MDM can improve the performance of an attack by an order of magnitude, completely negating the influence of portability.

Posted ContentDOI
25 Oct 2019-bioRxiv
TL;DR: DNBelab C4 (C4), a negative pressure orchestrated, portable and cost-effective device that enables high-throughput single-cell transcriptional profiling and can efficiently allow discrimination of species-specific cells at high resolution and dissect tissue heterogeneity in different organs.
Abstract: Single-cell technologies are becoming increasingly widespread and have been revolutionizing our understanding of cell identity, state, diversity and function. However, current platforms can be slow to apply to large-scale studies and resource-limited clinical arenas for a variety of reasons including cost, infrastructure, sample quality and requirements. Here we report DNBelab C4 (C4), a negative-pressure orchestrated, portable and cost-effective device that enables high-throughput single-cell transcriptional profiling. The C4 system can efficiently discriminate species-specific cells at high resolution and dissect tissue heterogeneity in different organs, such as the murine lung and cerebral cortex. Finally, we show that the C4 system is comparable to existing platforms but has huge benefits in cost and portability and, as such, will be of great interest to the wider scientific community.

Journal ArticleDOI
TL;DR: A system of standardization enables a consistent application of numerous rule-based and machine learning based classification techniques downstream across disparate datasets which may originate across different institutions and data systems.
Abstract: This paper presents a portable phenotyping system that is capable of integrating both rule-based and statistical machine learning based approaches. Our system utilizes UMLS to extract clinically relevant features from the unstructured text and then facilitates portability across different institutions and data systems by incorporating OHDSI’s OMOP Common Data Model (CDM) to standardize necessary data elements. Our system can also store the key components of rule-based systems (e.g., regular expression matches) in the format of OMOP CDM, thus enabling the reuse, adaptation and extension of many existing rule-based clinical NLP systems. We experimented with our system on the corpus from i2b2’s Obesity Challenge as a pilot study. Our system facilitates portable phenotyping of obesity and its 15 comorbidities based on the unstructured patient discharge summaries, while achieving a performance that often ranked among the top 10 of the challenge participants. Our system of standardization enables a consistent application of numerous rule-based and machine learning based classification techniques downstream across disparate datasets which may originate across different institutions and data systems.

Journal ArticleDOI
TL;DR: A novel metric (KIP) is introduced to measure the portability of phenotype algorithms and to quantify such efforts across the eMERGE Network; phenotype developers are encouraged to analyze and optimize portability in regard to knowledge, interpretation and programming.

Proceedings ArticleDOI
16 Feb 2019
TL;DR: This work designs a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs, and implements parallel reduction, a fundamental building block used in a wide range of algorithms.
Abstract: Since the advent of GPU computing, GPU hardware has evolved at a fast pace. Since application performance heavily depends on the latest hardware improvements, performance portability is extremely challenging for GPU application library developers. Portability becomes even more difficult when new low-level instructions are added to the ISA (e.g., warp shuffle instructions) or the microarchitectural support for existing instructions is improved (e.g., atomic instructions). Library developers, besides re-tuning the code for new hardware features, deal with the performance portability issue by hand-writing multiple algorithm versions that leverage different instruction sets and microarchitectures. High-level programming frameworks and Domain Specific Languages (DSLs) do not typically support low-level instructions (e.g., warp shuffle and atomic instructions), so it is painful or even impossible for these programming systems to take advantage of the latest architectural improvements. In this work, we design a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs. Our transformations enable warp shuffle instructions and atomic instructions (on global and shared memories) to be easily generated. We show a practical implementation of these transformations by building on Tangram, a high-level kernel synthesis framework. Using our new language and compiler extensions, we implement parallel reduction, a fundamental building block used in a wide range of algorithms. Parallel reduction is representative of the performance portability challenge, as its performance heavily depends on the latest hardware improvements. We compare our synthesized parallel reduction to another high-level programming framework and a hand-written high-performance library across three generations of GPU architectures, and show up to 7.8× speedup (2× on average) over hand-written code.
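As a sketch of the low-level primitives the abstract refers to (our own CUDA C++ example, not Tangram output), a sum reduction can combine warp shuffle instructions for intra-warp reduction with an atomic add that merges per-warp partial sums:

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Warp-level sum using shuffle instructions: each step halves the number of
    // active values until lane 0 of the warp holds the warp's total.
    __inline__ __device__ float warpReduceSum(float val) {
      for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
      return val;
    }

    // Grid-stride reduction: every thread accumulates a private partial sum,
    // each warp reduces it with shuffles, and lane 0 of each warp atomically
    // adds its result to the global accumulator.
    __global__ void reduceSum(const float* in, float* out, int n) {
      float sum = 0.0f;
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += blockDim.x * gridDim.x)
        sum += in[i];
      sum = warpReduceSum(sum);
      if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, sum);
    }

    int main() {
      const int n = 1 << 20;
      std::vector<float> h_in(n, 1.0f);
      float *d_in = nullptr, *d_out = nullptr;
      cudaMalloc(&d_in, n * sizeof(float));
      cudaMalloc(&d_out, sizeof(float));
      cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemset(d_out, 0, sizeof(float));
      reduceSum<<<256, 256>>>(d_in, d_out, n);
      float h_out = 0.0f;
      cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
      std::printf("sum = %.0f (expected %d)\n", h_out, n);
      cudaFree(d_in);
      cudaFree(d_out);
      return 0;
    }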