
Showing papers by Ayse K. Coskun published in 2018


Book Chapter
27 Aug 2018
TL;DR: Identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users.
Abstract: Modern supercomputers are shared among thousands of users running a variety of applications. Knowing which applications are running in the system can bring substantial benefits: knowledge of applications that intensively use shared resources can aid scheduling; unwanted applications such as cryptocurrency mining or password cracking can be blocked; system architects can make design decisions based on system usage. However, identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users.

29 citations


Proceedings Article
21 May 2018
TL;DR: MOCA, a memory object classification and allocation framework, characterizes memory objects and allocates them to their best-fit memory modules to improve performance and energy efficiency.
Abstract: In the era of abundant-data computing, main memory's latency and power significantly impact overall system performance and power. Today's computing systems are typically composed of homogeneous memory modules, which are optimized to provide either low latency, high bandwidth, or low power. Such memory modules do not cater to a wide range of applications with diverse memory access behavior. Thus, heterogeneous memory systems, which include several memory modules with distinct performance and power characteristics, are becoming promising alternatives. In such a system, allocating applications to their best-fitting memory modules improves system performance and energy efficiency. However, such an approach still leaves the full potential of heterogeneous memory systems under-utilized because not only applications, but also the memory objects within that application differ in their memory access behavior. This paper proposes a novel page allocation approach to utilize heterogeneous memory systems at the memory object level. We design a memory object classification and allocation framework (MOCA) to characterize memory objects and then allocate them to their best-fit memory module to improve performance and energy efficiency. We experiment with heterogeneous memory systems that are composed of a Reduced Latency DRAM (RLDRAM) for latency-sensitive objects, a 2.5D-stacked High Bandwidth Memory (HBM) for bandwidth-sensitive objects, and a Low Power DDR (LPDDR) for non-memory-intensive objects. The MOCA framework includes detailed application profiling, a classification mechanism, and an allocation policy to place memory objects. Compared to a system with homogeneous memory modules, we demonstrate that heterogeneous memory systems with MOCA improve memory system energy efficiency by up to 63%. Compared to a heterogeneous memory system with only application-level page allocation, MOCA achieves a 26% memory performance and a 33% energy efficiency improvement for multi-program workloads.
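
To make the object-level idea concrete, below is a minimal sketch of MOCA-style classification, assuming per-object profiling statistics have already been collected. The metric names, thresholds, and the three-way mapping to RLDRAM/HBM/LPDDR are illustrative stand-ins, not MOCA's actual classifier.

from dataclasses import dataclass

@dataclass
class ObjectProfile:
    name: str
    accesses_per_kilo_instr: float  # memory intensity (hypothetical metric)
    concurrent_requests: float      # proxy for bandwidth sensitivity

def classify(obj: ObjectProfile) -> str:
    """Map a memory object to a best-fit memory module class."""
    if obj.accesses_per_kilo_instr < 1.0:
        return "LPDDR"   # not memory-intensive: favor low power
    if obj.concurrent_requests > 4.0:
        return "HBM"     # many outstanding requests: favor bandwidth
    return "RLDRAM"      # otherwise latency-sensitive: favor low latency

objects = [
    ObjectProfile("lookup_table", 0.2, 1.0),
    ObjectProfile("stream_buffer", 25.0, 8.0),
    ObjectProfile("hot_index", 12.0, 2.0),
]
for obj in objects:
    print(f"{obj.name}: allocate pages on {classify(obj)}")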

16 citations


Book Chapter
10 Sep 2018
TL;DR: Proteus automatically identifies instructions that cause divergent behavior between emulated and real CPUs; on a set of 500K test programs it identified 28K divergent instances, and some of the underlying root causes can be fixed without introducing observable performance degradation in the emulator.
Abstract: The popularity of Android and the personal information stored on these devices attract the attention of regular cyber-criminals as well as nation-state adversaries who develop malware that targets this platform. To identify malicious Android apps at scale (e.g., Google Play contains 3.7M apps), state-of-the-art mobile malware analysis systems inspect the execution of apps in emulation-based sandboxes. An emerging class of evasive Android malware, however, can evade detection by such analysis systems by ceasing malicious activities if an emulation sandbox is detected. Thus, systematically uncovering potential methods to detect emulated environments is crucial to stay ahead of adversaries. This work uncovers detection methods based on discrepancies in instruction-level behavior between software-based emulators and the real ARM CPUs that power the vast majority of Android devices. To systematically discover such discrepancies at scale, we propose the Proteus system. Proteus performs large-scale collection of application execution traces (i.e., registers and memory) as they run on an emulator and on accurate software models of ARM CPUs. Proteus automatically identifies the instructions that cause divergent behavior between emulated and real CPUs and, on a set of 500K test programs, identified 28K divergent instances. By inspecting these instances, we reveal 3 major classes of root causes that are responsible for these discrepancies. We show that some of these root causes can be easily fixed without introducing observable performance degradation in the emulator. Thus, we have submitted patches to improve resilience of Android emulators against evasive malware.
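
A minimal sketch of the trace-diffing step is shown below, assuming each backend produces a trace of (program counter, register state) snapshots. The trace format and the example values are hypothetical simplifications of what Proteus actually records.

def find_divergences(emulator_trace, reference_trace):
    """Return program counters where emulated and real CPU state differ."""
    divergent = []
    for (pc_e, regs_e), (pc_r, regs_r) in zip(emulator_trace, reference_trace):
        if pc_e != pc_r or regs_e != regs_r:
            divergent.append(pc_e)
    return divergent

# Two tiny traces: the emulator mishandles the instruction at pc=0x1004.
emu = [(0x1000, {"r0": 1}), (0x1004, {"r0": 99}), (0x1008, {"r0": 3})]
ref = [(0x1000, {"r0": 1}), (0x1004, {"r0": 2}),  (0x1008, {"r0": 3})]
print([hex(pc) for pc in find_divergences(emu, ref)])  # ['0x1004']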

12 citations


Proceedings Article
05 Nov 2018
TL;DR: Using the cross-layer methodology results in more accurate determination of (superior) inter-chiplet network and 2.5D system designs compared to prior methods, and achieves 29% better performance with the same manufacturing cost, or 25% lower cost with the same performance.
Abstract: 2.5D integration technology is gaining popularity in the design of homogeneous and heterogeneous many-core computing systems. 2.5D network design, both inter- and intra-chiplet, impacts overall system performance as well as its manufacturing cost and thermal feasibility. This paper introduces a cross-layer methodology for designing networks in 2.5D systems. We optimize the network design and chiplet placement jointly across logical, physical, and circuit layers to achieve an energy-efficient network, while maximizing system performance, minimizing manufacturing cost, and adhering to thermal constraints. In the logical layer, our co-optimization considers eight different network topologies. In the physical layer, we consider routing, microbump assignment, and microbump pitch constraints to account for the extra costs associated with microbump utilization in the inter-chiplet communication. In the circuit layer, we consider both passive and active links with five different link types, including a gas station link design. Using our cross-layer methodology results in more accurate determination of (superior) inter-chiplet network and 2.5D system designs compared to prior methods. Compared to 2D systems, our approach achieves 29% better performance with the same manufacturing cost, or 25% lower cost with the same performance.
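
The search over design points can be pictured as in the sketch below: enumerate topology and link-type combinations, reject thermally infeasible points, and keep the best performance-per-cost design. The abstract states eight topologies and five link types (including the gas station design); the individual topology names and the toy evaluate() model here are invented placeholders for the paper's logical/physical/circuit-layer co-optimization.

import itertools
import random

topologies = ["mesh", "cmesh", "torus", "folded_torus",
              "tree", "ftree", "butterfly", "kite"]          # 8 candidates
link_types = ["passive_short", "passive_medium", "passive_long",
              "active", "gas_station"]                        # 5 candidates

def evaluate(topo, link):
    """Toy stand-in for simulation: returns (performance, cost, peak_temp_C)."""
    rng = random.Random(hash((topo, link)))  # deterministic within a run
    perf = rng.uniform(0.5, 1.5)    # normalized system performance
    cost = rng.uniform(0.8, 1.2)    # normalized manufacturing cost
    temp = rng.uniform(60.0, 95.0)  # peak temperature under the workload
    return perf, cost, temp

best, best_ratio = None, 0.0
for topo, link in itertools.product(topologies, link_types):
    perf, cost, temp = evaluate(topo, link)
    if temp > 85.0:                 # thermal feasibility constraint
        continue
    if perf / cost > best_ratio:
        best, best_ratio = (topo, link), perf / cost
print("selected design point:", best)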

12 citations


Proceedings Article
01 Sep 2018
TL;DR: Tangram, a framework for colocating applications in HPC clusters, uses prior knowledge of applications, such as whether they are I/O- or CPU-intensive, to predict whether potential colocations improve overall performance, and chooses colocations that reduce makespan.
Abstract: In a cluster that is shared by many users, jobs often need to wait in the queue for a significant amount of time. Much research has been done to reduce this time with scheduling, including aggressive back-filling strategies and sharing nodes among different jobs. Although most resources are shared to some extent in HPC clusters, it is somewhat surprising that a well-known technique used on commercial clouds, i.e., oversubscribing nodes so that CPU cores are shared among jobs, is rather rare. This is partially due to concerns about interference. This paper presents Tangram, a framework for colocating applications in HPC clusters. Tangram uses prior knowledge of applications, such as whether they are I/O or CPU intensive, to predict whether potential colocations improve overall performance. To predict with sufficient accuracy, Tangram uses a combination of performance counter measurements, knowledge of past colocation performance, and machine learning. We show that Tangram can choose colocations to reduce makespan by 19% on average and by 55% in the best case, while limiting the performance degradation caused by colocation from 1598% to 26% in the worst case.
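
As a rough illustration of the prediction step, the sketch below trains a classifier on performance-counter features of job pairs, labeled by whether colocation helped. The feature set and training data are synthetic, and the paper combines counters, past colocation history, and machine learning in far more detail than this toy.

from sklearn.tree import DecisionTreeClassifier

# Features per pair: (job_a_cpu_util, job_a_io_rate, job_b_cpu_util, job_b_io_rate)
X = [
    (0.9, 0.1, 0.9, 0.1),  # two CPU-bound jobs: colocation hurts
    (0.9, 0.1, 0.2, 0.8),  # CPU-bound + I/O-bound: colocation helps
    (0.2, 0.8, 0.2, 0.7),  # two I/O-bound jobs: I/O contention, hurts
    (0.8, 0.2, 0.3, 0.7),  # mixed pair: colocation helps
]
y = [0, 1, 0, 1]  # 1 = colocation predicted to improve overall performance

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([(0.85, 0.15, 0.25, 0.75)]))  # likely [1]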

11 citations


Proceedings Article
21 May 2018
TL;DR: A novel allocation policy called Level-Spread for dragonfly networks spreads jobs within the smallest network level that a given job can fit in at the time of its allocation, which reduces the communication overhead compared to the state-of-the-art allocation policies.
Abstract: The dragonfly network topology has attracted attention in recent years owing to its high radix and constant diameter. However, the influence of job allocation on communication time in dragonfly networks is not fully understood. Recent studies have shown that random allocation is better at balancing the network traffic, while compact allocation is better at harnessing the locality in dragonfly groups. Based on these observations, this paper introduces a novel allocation policy called Level-Spread for dragonfly networks. This policy spreads jobs within the smallest network level that a given job can fit in at the time of its allocation. In this way, it simultaneously harnesses node adjacency and balances link congestion. To evaluate the performance of Level-Spread, we run packet-level network simulations using a diverse set of application communication patterns, job sizes, and communication intensities. We also explore the impact of network properties such as the number of groups, number of routers per group, machine utilization level, and global link bandwidth. Level-Spread reduces the communication overhead by 16% on average (and up to 71%) compared to the state-of-the-art allocation policies.
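
A minimal sketch of the policy's core decision appears below, assuming a toy machine state that only tracks free nodes per router; the representation is hypothetical, and real dragonfly allocation also accounts for link state and the actual node selection within a level.

def level_spread(job_size, machine):
    """machine: list of groups; each group is a list of per-router free-node counts."""
    # Level 1: a single router's free nodes suffice.
    for g, group in enumerate(machine):
        for r, free in enumerate(group):
            if free >= job_size:
                return f"spread within router {r} of group {g}"
    # Level 2: a single group's free nodes suffice.
    for g, group in enumerate(machine):
        if sum(group) >= job_size:
            return f"spread across routers of group {g}"
    # Level 3: fall back to spreading across the whole machine.
    if sum(sum(group) for group in machine) >= job_size:
        return "spread across groups"
    return "insufficient free nodes"

machine = [[2, 1, 0], [4, 4, 3]]  # free nodes per router, two groups
print(level_spread(6, machine))    # no router fits 6 -> spread within group 1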

8 citations


Proceedings Article
Ozan Tuncer, Nilton Bila, Sastry S. Duri, Canturk Isci, Ayse K. Coskun
25 Jun 2018
TL;DR: This paper introduces ConfEx, a framework that enables discovery and extraction of text-based configurations in multi-tenant cloud platforms and cloud image repositories for configuration analysis and validation and demonstrates a use case of ConfEx for detecting injected misconfigurations via outlier analysis.
Abstract: Modern cloud applications are designed in a highly configurable way to ensure increased reusability and portability. With the growing complexity of these applications, configuration errors (i.e., misconfigurations) have become major sources of service outages and disruptions. While some research has so far focused on detecting errors in configurations that are represented as well-structured key-value pairs, the configurations of cloud applications are typically stored in text files with application-specific syntax and in unlabeled file system locations, limiting the use of existing error detection tools. This paper introduces ConfEx, a framework that enables discovery and extraction of text-based configurations in multi-tenant cloud platforms and cloud image repositories for configuration analysis and validation. ConfEx uses a novel vocabulary-based technique to identify text-based configuration files in cloud system instances with unlabeled content, and leverages existing configuration parsers to extract the information in these files. We show that ConfEx achieves over 98% precision and recall in identifying configuration files on 3,893 popular Docker Hub images, and we also demonstrate a use case of ConfEx for detecting injected misconfigurations via outlier analysis.
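
A minimal sketch of the vocabulary-based identification phase, with made-up per-application vocabularies and a simplified scoring rule; ConfEx's actual vocabularies are learned from known configuration files rather than hard-coded.

VOCABS = {
    "httpd": {"listen", "servername", "documentroot", "loadmodule"},
    "mysql": {"datadir", "bind-address", "max_connections", "port"},
}

def identify(file_text, threshold=0.5):
    """Guess which application a text file configures, or None."""
    tokens = {line.strip().lower().split()[0]
              for line in file_text.splitlines() if line.strip()}
    best_app, best_score = None, 0.0
    for app, vocab in VOCABS.items():
        score = len(tokens & vocab) / len(tokens) if tokens else 0.0
        if score > best_score:
            best_app, best_score = app, score
    return best_app if best_score >= threshold else None

sample = "Listen 80\nServerName example.com\nDocumentRoot /var/www\n"
print(identify(sample))  # httpd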

8 citations


Proceedings Article
23 Jul 2018
TL;DR: This work explores the impact of FCA system design on various 3D architectures and proposes a methodology to optimize a 3D MPSoC with integrated FCA to run a given workload in the most energy-efficient way.
Abstract: Integrated flow cell array (FCA) is an emerging technology, targeting the cooling and power delivery challenges of modern 2D/3D Multi-Processor Systems-on-Chip (MPSoCs). In FCA, electrolytic solutions are pumped through microchannels etched in the silicon of the chips, removing heat from the system, while, at the same time, generating power on-chip. In this work, we explore the impact of FCA system design on various 3D architectures and propose a methodology to optimize a 3D MPSoC with integrated FCA to run a given workload in the most energy-efficient way. Our results show that an optimized configuration can save up to 50% energy with respect to sub-optimal 3D MPSoC configurations.
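
The flavor of the optimization can be sketched as a sweep over candidate coolant flow rates that balances pumping energy against recovered on-chip power under a thermal cap. The models and constants below are invented placeholders, not the paper's calibrated electro-thermal models of the flow cell array.

def net_energy(flow_rate, chip_power=100.0):
    pump = 0.5 * flow_rate ** 2         # pumping cost grows superlinearly
    recovered = 8.0 * flow_rate ** 0.5  # electrochemical on-chip generation
    return chip_power + pump - recovered

def peak_temp(flow_rate, chip_power=100.0):
    return 40.0 + chip_power / (5.0 + 3.0 * flow_rate)

candidates = [0.5 * i for i in range(1, 20)]  # flow rates (arbitrary units)
feasible = [f for f in candidates if peak_temp(f) <= 85.0]  # thermal cap
best = min(feasible, key=net_energy)
print(f"best flow rate: {best}, net energy: {net_energy(best):.1f}")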

3 citations


Proceedings Article
02 Jul 2018
TL;DR: This work proposes a novel cross-layer approach that enables accurate power estimation by carefully integrating components from system-level and RTL simulation of the target design, and demonstrates that it improves power estimation accuracy by up to 15% for individual simulation points and by ~9% for the full application compared to a conventional system-level simulation scheme.
Abstract: While state-of-the-art system-level simulators can deliver swift estimation of power dissipation for microprocessor designs, they do so at the expense of reduced accuracy. On the other hand, RTL simulators are typically cycle-accurate but overwhelmingly time-consuming for real-life workloads. Consequently, the design community often has to make a compromise between accuracy and speed. In this work, we propose a novel cross-layer approach that can enable accurate power estimation by carefully integrating components from system-level and RTL simulation of the target design. We first leverage the concept of simulation points to transform the workload application and isolate its most critical segments. We then profile the highest-weighted simulation point (HWSP) with an RTL simulator (AnyCore) for maximum accuracy, while the rest are simulated with a system-level simulator (gem5) to ensure fast evaluation. Finally, we combine the integrated set of profiling data as input to the power simulator (McPAT). Our evaluation results for three different SPEC2006 benchmark applications demonstrate that our proposed cross-layer framework can improve power estimation accuracy by up to 15% for individual simulation points and by ~9% for the full application, compared to a conventional system-level simulation scheme.
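
The final combination step can be illustrated as a weighted sum over simulation points in which only the HWSP carries the RTL-derived power figure while the others keep their gem5-derived estimates; all weights and power values below are illustrative placeholders.

# (weight, gem5_power_W) per SimPoint; weights sum to 1.
simpoints = [(0.45, 12.0), (0.30, 9.5), (0.25, 14.0)]
rtl_power_hwsp = 11.1  # accurate RTL+McPAT power for the heaviest SimPoint

hwsp = max(range(len(simpoints)), key=lambda i: simpoints[i][0])
total = sum(
    w * (rtl_power_hwsp if i == hwsp else p)
    for i, (w, p) in enumerate(simpoints)
)
print(f"estimated full-application power: {total:.2f} W")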

1 citation