
Showing papers by "Ayse K. Coskun published in 2019"


Journal ArticleDOI
TL;DR: This paper presents a novel machine-learning-based framework to automatically diagnose performance anomalies at runtime and demonstrates that this approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
Abstract: As the size and complexity of high performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine-learning-based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
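
The framework's core pipeline, time series in, anomaly label out, can be sketched in a few lines. Below is a minimal illustration under assumed names and placeholder data (the statistics in `extract_features`, the anomaly labels, and the random-forest choice are simplifications of the paper's actual feature set and models):

```python
# Minimal sketch of the diagnosis pipeline: reduce each metric's time
# series to cheap statistics, then classify against signatures learned
# from labeled historical runs. Data below is a random placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(series: np.ndarray) -> np.ndarray:
    """Easy-to-compute statistics for one monitored metric."""
    return np.array([series.mean(), series.std(), series.min(),
                     series.max(), np.percentile(series, 5),
                     np.percentile(series, 95)])

def featurize(run: np.ndarray) -> np.ndarray:
    """run: (n_metrics, n_timesteps) monitoring data for one job."""
    return np.concatenate([extract_features(m) for m in run])

rng = np.random.default_rng(0)
X = np.stack([featurize(rng.normal(size=(4, 256))) for _ in range(100)])
y = rng.choice(["healthy", "memleak", "cache_contention"], size=100)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)

# At runtime, a new monitoring window is featurized and diagnosed cheaply.
print(clf.predict(featurize(rng.normal(size=(4, 256))).reshape(1, -1)))
```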

52 citations


Proceedings ArticleDOI
13 May 2019
TL;DR: This paper analyzes the unique thermal characteristics of Mono3D ICs by simulating a two-tier flip-chip Mono3D IC and highlights the primary differences in comparison to a similarly-sized flip-chip TSV-based 3D IC.
Abstract: Monolithic 3D (Mono3D) is a three-dimensional integration technology that can overcome some of the fundamental limitations faced by traditional, two-dimensional scaling. This paper analyzes the unique thermal characteristics of Mono3D ICs by simulating a two-tier flip-chip Mono3D IC and highlights the primary differences in comparison to a similarly-sized flip-chip TSV-based 3D IC. Specifically, we perform architectural-level thermal simulations for both technologies and demonstrate that vertical thermal coupling is stronger in Mono3D ICs, leading to lower upper tier temperatures. We also investigate the significance of lateral versus vertical flow of heat in Mono3D ICs. We simulate different hot spot scenarios in a two-tier Mono3D IC and show that although the lateral heat flow is limited as compared to TSV-based 3D ICs, ignoring this mechanism can cause nonnegligible error (~4°C) in temperature estimation, particularly for layers farther from the heat sink. In addition, we show that with increasing interconnect utilization (due to the contribution of Joule heating to overall temperature), the on-chip temperatures and the significance of lateral heat flow within the two-tier Mono3D IC also increase. Finally, we discuss potential opportunities in Mono3D ICs to enhance their thermal integrity.
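
A toy two-node resistance network makes the vertical-coupling claim concrete. This is an illustrative back-of-the-envelope model with made-up resistances, not the paper's architectural-level simulator:

```python
# Steady-state thermal network for a two-tier stack: one node per tier,
# coupled vertically to each other and to the heat sink (toy values).
import numpy as np

R_vert = 0.05   # K/W tier-to-tier; small in Mono3D (thin inter-layer
                # dielectric), hence the stronger vertical coupling
R_sink = 0.20   # K/W bottom tier to heat sink/ambient
P = np.array([20.0, 15.0])  # W dissipated in [top, bottom] tier

g_v, g_s = 1.0 / R_vert, 1.0 / R_sink
G = np.array([[ g_v,       -g_v],
              [-g_v, g_v + g_s]])   # conductance matrix, G @ T = P
T = np.linalg.solve(G, P)           # temperatures above ambient
print(f"top: +{T[0]:.1f} K, bottom: +{T[1]:.1f} K")  # +8.0 K, +7.0 K
# Shrinking R_vert pulls the tiers toward a common temperature, which is
# the lower-upper-tier-temperature effect described above.
```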

23 citations


Proceedings ArticleDOI
25 Mar 2019
TL;DR: This paper introduces WAVES, a wavelength selection technique that identifies and activates the minimum number of laser wavelengths needed for an application’s bandwidth requirement, demonstrating an average 23% reduction in PNoC power with only <1% loss in system performance.
Abstract: Photonic Network-on-Chips (PNoCs) offer promising benefits over Electrical Network-on-Chips (ENoCs) in many-core systems owing to their lower latencies, higher bandwidth, and lower energy-per-bit communication with negligible data-dependent power. These benefits, however, are limited by a number of challenges. Microring resonators (MRRs) that are used for photonic communication have high sensitivity to process variations and on-chip thermal variations, giving rise to possible resonant wavelength mismatches. State-of-the-art microheaters, which are used to tune the resonant wavelength of MRRs, have poor efficiency resulting in high thermal tuning power. In addition, laser power and high static power consumption of drivers, serializers, comparators, and arbitration logic partially negate the benefits of the sub-pJ operating regime that can be obtained with PNoCs. To reduce PNoC power consumption, this paper introduces WAVES, a wavelength selection technique to identify and activate the minimum number of laser wavelengths needed, depending on an application’s bandwidth requirement. Our results on a simulated 2.5D manycore system with PNoC demonstrate an average of 23% (resp. 38%) reduction in PNoC power with only <1% (resp. <5%) loss in system performance.
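
The core selection rule is simple enough to state in code. The sketch below assumes a fixed per-wavelength bandwidth and ignores the tuning and serialization overheads that WAVES also accounts for; the function name and numbers are illustrative:

```python
# Activate only as many laser wavelengths as an application phase needs.
import math

def wavelengths_needed(app_bw_gbps: float, per_wavelength_gbps: float,
                       total_wavelengths: int) -> int:
    n = math.ceil(app_bw_gbps / per_wavelength_gbps)
    return min(n, total_wavelengths)

# A 64-wavelength link at 10 Gb/s per wavelength serving an application
# phase that needs 150 Gb/s can keep 49 lasers dark:
print(wavelengths_needed(150, 10, 64))  # -> 15
```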

17 citations


Proceedings ArticleDOI
20 Nov 2019
TL;DR: This paper presents the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications and, in response to newly-observed performance problems, searches the space of possible instrumentation choices to enable the instrumentation needed to diagnose them.
Abstract: Diagnosing performance problems in distributed applications is extremely challenging. A significant reason is that it is hard to know where to place instrumentation a priori to help diagnose problems that may occur in the future. We present the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications. In response to a newly-observed performance problem, Pythia searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose it. Our vision for Pythia builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application's nodes (i.e., records their workflows). It uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly gives insight into where additional instrumentation is needed.
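
The key insight lends itself to a compact sketch: among requests expected to perform similarly, high latency variation marks where instrumentation should be enabled. Grouping by a workflow signature and the coefficient-of-variation threshold below are assumptions for illustration, not Pythia's exact search procedure:

```python
# Flag request groups whose latency variation suggests a hidden problem.
from statistics import mean, stdev

def high_variation_groups(groups: dict[str, list[float]],
                          cv_threshold: float = 0.5) -> list[str]:
    """groups: workflow signature -> observed request latencies (ms)."""
    flagged = []
    for sig, latencies in groups.items():
        if len(latencies) > 1 and mean(latencies) > 0:
            cv = stdev(latencies) / mean(latencies)
            if cv > cv_threshold:
                flagged.append(sig)
    return flagged

traces = {"GET /user -> cache -> db": [12, 11, 13, 12],
          "GET /feed -> db":          [20, 95, 21, 240]}
print(high_variation_groups(traces))  # enable instrumentation here
```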

16 citations


Journal ArticleDOI
TL;DR: This paper introduces the MAESTRO framework to automatically manage QoS at runtime depending on application characteristics and thermal constraints and demonstrates 41% to 6.7× longer durations of sustained QoS compared to state-of-the-art for a set of mobile applications.
Abstract: Power densities of modern mobile system-on-a-chip designs can quickly exceed the thermal design limits during typical application use such as gaming or Web browsing. Resulting high temperatures lead to frequent thermal throttling and significant loss in quality-of-service (QoS) delivered to users. Thus, a joint consideration of thermal constraints and QoS requirements is essential to maximize the overall user experience. Prior techniques either rely on users to determine the best tradeoff point between QoS and temperature, or greedily utilize the thermal headroom to maximize performance, causing QoS to drop below user-tolerable levels over extended durations of use. This paper introduces the MAESTRO framework to automatically manage QoS at runtime depending on application characteristics and thermal constraints. MAESTRO builds on the observation that increased temperatures can be tolerated for applications with bursty compute patterns due to idle periods between activities, while causing large QoS degradations for long-running applications with continuous computations. MAESTRO: 1) detects such continuous computations that are susceptible to throttling; 2) proactively finds a QoS level to balance user experience and temperature; and 3) performs closed-loop DVFS and thermally efficient thread mapping to meet the target QoS on a heterogeneous multicore CPU. Such application-adaptive control of QoS-temperature tradeoffs allows MAESTRO to sustain a target QoS level within a user-tolerable range for longer durations without sacrificing the performance of latency-sensitive bursty computations. Evaluations on a real system prototype validate MAESTRO’s ability to accurately detect potential throttling-induced QoS degradations and demonstrate 41% to 6.7× longer durations of sustained QoS compared to state-of-the-art for a set of mobile applications.
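
The bursty-versus-continuous distinction that MAESTRO builds on can be illustrated with a simple utilization-window detector. The window length, busy threshold, and busy fraction below are hypothetical; MAESTRO's actual detection and closed-loop DVFS are more involved:

```python
# A compute phase with few idle gaps is "continuous" and thus susceptible
# to thermal throttling; bursty phases can ride the thermal headroom.
def is_continuous(cpu_util: list[float], window: int = 30,
                  busy_thresh: float = 0.8, frac: float = 0.9) -> bool:
    """cpu_util: one sample per second, values in [0, 1]."""
    recent = cpu_util[-window:]
    busy = sum(1 for u in recent if u >= busy_thresh)
    return len(recent) == window and busy / window >= frac

print(is_continuous([0.95] * 30))       # True  -> proactively lower QoS
print(is_continuous([0.95, 0.1] * 15))  # False -> run bursts at full speed
```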

14 citations


Proceedings ArticleDOI
10 Nov 2019
TL;DR: RandR provides record and replay of Android applications by capturing and replaying multiple sources of input without requiring source code, administrative device privileges, or any special platform support, and contextualizes UI events as interactions with particular UI components instead of relying on platform-specific features.
Abstract: The ability to repeat the execution of a program is a fundamental requirement in many areas of computing, from computer system evaluation to software engineering. Reproducing executions of mobile apps, in particular, has proven difficult under real-life scenarios due to multiple sources of external inputs and the interactive nature of the apps. Previous works that provide record/replay functionality for mobile apps are restricted to particular input sources (e.g., touchscreen events) and present deployment challenges due to intrusive modifications to the underlying software stack. Moreover, due to their reliance on record and replay of device-specific events, the recorded executions cannot be reliably reproduced across different platforms. In this paper, we present a new practical approach, RandR, for record and replay of Android applications. RandR captures and replays multiple sources of input (i.e., UI and network) without requiring source code (OS or app), administrative device privileges, or any special platform support. RandR achieves these qualities by instrumenting a select set of methods at runtime within an application's own sandbox. In addition, to enable portability of recorded executions across different platforms for replay, RandR contextualizes UI events as interactions with particular UI components (e.g., a button) as opposed to relying on platform-specific features (e.g., screen coordinates). We demonstrate RandR's accurate cross-platform record and replay capabilities using over 30 real-world Android apps across a variety of platforms, including emulators as well as commercial off-the-shelf mobile devices deployed in real life.
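
The portability idea, logging what was touched rather than where the touch landed, can be sketched as a coordinate-free event record. The JSON layout and field names below are hypothetical, not RandR's on-disk format:

```python
# Log UI events against stable component identifiers rather than screen
# coordinates, so one recording replays on devices with different layouts.
import json

def record_event(component_id: str, action: str, payload: str = "") -> str:
    return json.dumps({"component": component_id, "action": action,
                       "payload": payload})

log = [record_event("com.example:id/username", "set_text", "alice"),
       record_event("com.example:id/login_button", "click")]
for entry in log:
    print(entry)  # replay resolves each component id in the live UI tree
```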

14 citations


Journal ArticleDOI
24 Jan 2019
TL;DR: This article proposes EnergyQARE, the Energy and Quality-of-Service (QoS) Aware RSR Enabler, an approach that enables data center RSR provision in real-life scenarios; it provides a bidding strategy and a runtime policy that adaptively modulates data center power through server power management and server provisioning based on workload QoS feedback.
Abstract: Power market operators have recently introduced smart grid demand response (DR), in which electricity consumers regulate their power usage following market requirements. DR helps stabilize the grid and enables integrating a larger amount of intermittent renewable power generation. Data centers provide unique opportunities for DR participation due to their flexibility in both workload servicing and power consumption. While prior studies have focused on data center participation in legacy DR programs such as dynamic energy pricing and peak shaving, this article studies data centers in emerging DR programs, i.e., demand-side capacity reserves. Among different types of capacity reserves, regulation service reserves (RSRs) are especially attractive due to their relatively higher value. This article proposes EnergyQARE, the Energy and Quality-of-Service (QoS) Aware RSR Enabler, an approach that enables data center RSR provision in real-life scenarios. EnergyQARE not only provides a bidding strategy in RSR provision, but also contains a runtime policy that adaptively modulates data center power through server power management and server provisioning based on workload QoS feedback. To reflect real-life scenarios, this runtime policy handles a heterogeneous set of jobs and considers the transition time delay of servers. Simulated numerical results demonstrate that in a general data center scenario, EnergyQARE provides close to 50% of data center average power consumption as reserves to the market and saves up to 44% in data center electricity cost, while still meeting workload QoS constraints. Case studies in this article show that the percentages of savings are not sensitive to a specific type of non-interactive workload, or the size of the data center, although they depend strongly on data center utilization and parameters of server power states.
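
The flavor of EnergyQARE's runtime policy, following the market's regulation signal until QoS feedback pushes back, can be sketched as a single control step. The constants and the 10% slack cutoff are illustrative assumptions, not the paper's tuned policy:

```python
# One control step: track the regulation signal with the cluster power
# cap, but stop shedding load when workload QoS slack runs out.
def next_power_cap(p_avg_w: float, reserve_w: float, reg_signal: float,
                   qos_slack: float) -> float:
    """reg_signal in [-1, 1], updated by the market every few seconds;
    qos_slack in [0, 1], 0 meaning jobs are about to violate QoS."""
    target = p_avg_w + reg_signal * reserve_w   # market-requested power
    if qos_slack < 0.1:                         # QoS feedback overrides
        target = max(target, p_avg_w)
    return target

print(next_power_cap(100_000, 50_000, -0.6, qos_slack=0.8))   # 70000.0
print(next_power_cap(100_000, 50_000, -0.6, qos_slack=0.05))  # 100000.0
```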

13 citations


Proceedings ArticleDOI
05 Aug 2019
TL;DR: This paper introduces HPAS, an HPC Performance Anomaly Suite consisting of anomaly generators for the major subsystems in HPC systems, and demonstrates several use cases, including HPC performance anomaly diagnosis and the design of applications that are resilient to performance variability.
Abstract: Modern high performance computing (HPC) systems, including supercomputers, routinely suffer from substantial performance variations. The same application with the same input can have more than 100% performance variation, and such variations cause reduced efficiency and wasted resources. There have been recent studies on performance variability and on designing automated methods for diagnosing "anomalies" that cause performance variability. These studies either observe data collected from HPC systems, or they rely on synthetic reproduction of performance variability scenarios. However, there is no standardized way of creating performance-variability-inducing synthetic anomalies, so researchers rely on designing ad-hoc methods for reproducing performance variability. This paper addresses this lack of a common method for creating relevant performance anomalies by introducing HPAS, an HPC Performance Anomaly Suite, consisting of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of various analytics methods as well as performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. The paper also provides an analysis of the behavior of the anomaly generators and demonstrates several use cases: (1) performance anomaly diagnosis using HPAS, (2) evaluation of resource management policies under performance variations, and (3) design of applications that are resilient to performance variability.
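
To give a feel for what a synthetic anomaly generator does, here is a generic duty-cycled CPU-contention stressor. This is an illustrative stand-in, not HPAS's implementation or interface (HPAS provides tunable generators per subsystem):

```python
# Steal cycles from a co-located application by spinning for `duty` of
# every period; interference intensity is controlled by the duty cycle.
import time

def cpu_anomaly(duration_s: float, period_s: float = 0.1,
                duty: float = 0.8) -> None:
    end = time.time() + duration_s
    while time.time() < end:
        busy_until = time.time() + duty * period_s
        x = 0
        while time.time() < busy_until:
            x += 1                          # burn cycles
        time.sleep((1 - duty) * period_s)   # let the victim run briefly

if __name__ == "__main__":
    cpu_anomaly(duration_s=5)  # 80% duty-cycle interference for 5 seconds
```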

11 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper develops a strategy for a data center to provide RSR while offering QoS guarantees, expressed in terms of the sojourn time of jobs, through a power regulation policy that follows the spirit of a Generalized Processor Sharing (GPS) policy.
Abstract: Demand response helps stabilize the power grid and offers opportunities for consumers to reduce their cost by regulating their power consumption to follow market requirements. Regulation service reserves (RSRs) are a specific form of demand response program, requiring participants to regulate their power to follow a dynamically-changing target that is updated every few seconds. In return, participants' electricity bill is reduced in proportion to the reserves they provide. Data centers are significant power consumers, and they are good candidates to participate in RSRs because they have the flexibility to regulate their power consumption through various strategies. Previous work in this area has proposed power regulation policies that enable data centers to participate in RSRs, but without providing guarantees on the Quality-of-Service (QoS) of the jobs running on the data center. This paper develops a strategy for a data center to provide RSR while offering QoS guarantees, expressed in terms of the sojourn time of jobs. The proposed policy regulates data center power through power capping and job scheduling, following the spirit of a Generalized Processor Sharing (GPS) policy. Parameters in our policy are calculated so as to minimize the electricity cost under QoS constraints. The key to guaranteeing QoS is to determine an acceptable range for policy parameters using a queueing theoretic result for delay. We evaluate our policy in both large-scale simulations and real-system experiments on a small cluster. We demonstrate that our policy enables data centers to participate in RSRs and reduces their electricity bill by 14-51% while providing guarantees on the QoS of the jobs.
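
A worked single-queue example shows how a sojourn-time bound turns into a floor on service capacity, which is the mechanism behind the acceptable parameter range. The M/M/1 formula below is a simplification of the paper's queueing result, and the numbers are illustrative:

```python
# M/M/1 mean sojourn time E[T] = 1/(mu - lambda) <= T_max implies
# mu >= lambda + 1/T_max: power regulation may slow the data center
# down, but never below this effective service rate.
def min_service_rate(arrival_rate: float, t_max: float) -> float:
    return arrival_rate + 1.0 / t_max

lam, t_max = 80.0, 0.5   # 80 jobs/s arriving; 0.5 s mean sojourn bound
print(f"throttle floor: {min_service_rate(lam, t_max):.0f} jobs/s")  # 82
```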

8 citations


Proceedings ArticleDOI
28 May 2019
TL;DR: In this paper, the authors present a thermal model capable of simulating two-phase vapor chambers with micropillar wick evaporators, an emerging cooling technique that removes heat through the evaporation of a coolant and has the potential to remove high heat fluxes.
Abstract: High power densities lead to thermal hot spots in modern processors. These power densities are expected to reach kW/cm2 scale in future high-performance chips and this increase may significantly degrade performance and reliability, if not handled efficiently. Using two-phase vapor chambers (VCs) with micropillar wick evaporators is an emerging technique that removes heat through the evaporation process of a coolant and has the potential to remove high heat fluxes. In this cooling system, the coolant is supplied passively to the micropillar wick via capillary pumping, eliminating the need for an external pump and ensuring stable thin-film flow. Evaluation of such an emerging cooling technique on realistic chip power densities and micropillar geometries necessitates accurate and fast thermal models. Although multiphysics simulators based on either finite-element or finite-volume methods are highly accurate, they have long design and simulation times. This paper introduces a novel compact thermal model capable of simulating two-phase vapor chambers with micropillar wick evaporators. In comparison to COMSOL, our model shows a competitively low error of 1.25°C and a 214x speedup. We also present a comparison of the cooling performance of different cooling techniques such as a conventional heat sink, liquid cooling via microchannels, hybrid cooling using thermoelectric coolers and liquid cooling via microchannels, and two-phase VCs with micropillar wick evaporators for the first time. Based on our observations, two-phase VCs and microchannel-based two-phase cooling show better cooling performance for hot spot power densities of less than 1500 W/cm2, while hybrid cooling achieves lower hot spot temperature and thermal gradients for hot spot power densities between 1500 and 2000 W/cm2.
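
A one-dimensional finite-difference toy shows the shape of such a compact model: conduction through the die with the evaporator folded into an effective boundary coefficient. The geometry and coefficients are illustrative; the paper's model resolves the micropillar-wick evaporation physics in far more detail:

```python
# Steady-state 1D conduction with an effective evaporative boundary.
import numpy as np

n, dx = 50, 1e-5        # 50 cells through a 0.5 mm silicon slab
k_si = 130.0            # W/(m*K) silicon thermal conductivity
h_evap = 5e5            # W/(m^2*K) effective two-phase coefficient
q_top = 5e6             # W/m^2 (= 500 W/cm^2) hot-spot flux at the top

A = np.zeros((n, n)); b = np.zeros(n)
g = k_si / dx           # conductance per unit area between cells
for i in range(n):
    if i > 0:
        A[i, i] += g; A[i, i - 1] -= g
    if i < n - 1:
        A[i, i] += g; A[i, i + 1] -= g
A[-1, -1] += h_evap     # evaporator acts like a convective boundary
b[0] = q_top            # heat enters at the die top
T = np.linalg.solve(A, b)   # rise above coolant saturation temperature
print(f"peak rise: {T[0]:.1f} K, base rise: {T[-1]:.1f} K")
```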

5 citations


Proceedings ArticleDOI
09 Dec 2019
TL;DR: Praxi is a new software discovery method that builds upon the strengths of prior approaches by combining the accuracy of learning-based methods with the efficiency of practice-based methods.
Abstract: With today's rapidly-evolving cloud software landscape, users of cloud systems must constantly monitor software running on their containers and virtual machines (VMs) to ensure compliance, security, and efficiency. Traditional solutions to this problem rely on manually-created rules that identify software installations and modifications, but these require expert authors and are often unmaintainable. More recent automated techniques leverage knowledge of packaging practices to aid in discovery without requiring any pre-training, but these practice-based methods cannot provide precise-enough information to perform discovery by themselves. Other approaches use machine learning models to facilitate discovery of software present in a training corpus, but prior approaches have high runtime and storage requirements. This demonstration features Praxi, a new software discovery method that builds upon the strengths of prior approaches by combining the accuracy of learning-based methods with the efficiency of practice-based methods. We demonstrate Praxi's training and detection process in real time while allowing laptop-equipped participants to follow along using a provided remote virtual machine.
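
The combination Praxi demonstrates, practice-based change capture feeding a learning-based classifier, can be sketched conceptually as hashing the filesystem paths an installation touches and classifying the result. The encoding and model below are assumptions for illustration, not Praxi's actual features or learner:

```python
# Classify software installs from the paths they touch (toy example).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

changesets = [
    ["/usr/lib/python3/dist-packages/flask/app.py", "/usr/bin/flask"],
    ["/usr/lib/node_modules/express/index.js", "/usr/bin/express"],
]
labels = ["flask", "express"]

# Tokenize each changeset into path components and hash into a vector.
vec = HashingVectorizer(
    n_features=2**12,
    analyzer=lambda paths: [t for p in paths for t in p.split("/") if t])
clf = LogisticRegression().fit(vec.transform(changesets), labels)

new = [["/usr/lib/python3/dist-packages/flask/cli.py"]]
print(clf.predict(vec.transform(new)))  # -> ['flask'] via shared tokens
```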

Proceedings ArticleDOI
29 Jul 2019
TL;DR: This paper presents a fast and accurate compact thermal model for two-phase VCs with micropillar wicks and builds an optimization flow that selects the best cooling solution and its cooling parameters to minimize the cooling power under a temperature constraint for a given processor and power profile.
Abstract: Ultra-high power densities that are expected in future processors cannot be efficiently mitigated by conventional cooling solutions. Using two-phase vapor chambers (VCs) with micropillar wick evaporators is an emerging cooling technique that can effectively remove high heat fluxes through the evaporation process of a coolant. Two-phase VCs with micropillar wicks offer high cooling efficiency by leveraging a capillary-driven flow, where the coolant is passively driven by the wicking structure that eliminates the need for an external pump. Thermal models for such emerging cooling technologies are essential to evaluate their impact on future processors. Existing thermal models for two-phase VCs use computational fluid dynamics (CFD) modules, which incur long design and simulation times. This paper presents a fast and accurate compact thermal model for two-phase VCs with micropillar wicks. Our model achieves a maximum error of 1.25°C with a speedup of 214x in comparison to a CFD model. Using our proposed thermal model, we build an optimization flow that selects the best cooling solution and its cooling parameters to minimize the cooling power under a temperature constraint for a given processor and power profile. We then demonstrate our optimization flow on different chip sizes and hot spot distributions to choose the optimal cooling technique among VCs, microchannel-based two-phase cooling, liquid cooling via microchannels, and a hybrid cooling technique with thermoelectric coolers and liquid cooling with microchannels.
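
The outer loop of such an optimization flow reduces to: evaluate each cooling candidate with the fast thermal model and keep the cheapest feasible one. The candidate list, field names, and temperature limit below are hypothetical:

```python
# Pick the lowest-power cooling configuration that meets the temperature
# constraint; peak_temp stands in for a call to the compact thermal model.
from typing import Callable, Optional

def select_cooling(candidates: list[dict],
                   peak_temp: Callable[[dict], float],
                   t_limit_c: float = 85.0) -> Optional[dict]:
    feasible = [c for c in candidates if peak_temp(c) <= t_limit_c]
    return min(feasible, key=lambda c: c["cooling_power_w"], default=None)

options = [
    {"name": "heat sink",     "cooling_power_w": 3.0,  "t_peak": 96.0},
    {"name": "microchannels", "cooling_power_w": 12.0, "t_peak": 78.0},
    {"name": "two-phase VC",  "cooling_power_w": 5.0,  "t_peak": 82.0},
]
best = select_cooling(options, peak_temp=lambda c: c["t_peak"])
print(best["name"])  # -> two-phase VC, the cheapest feasible option
```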

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This work presents a novel framework for offering incentives to DCs so they can dynamically adjust their electricity consumption and provide DR to the grid, along with an inverse optimization approach to cost function parameter estimation for precise and efficient pricing.
Abstract: In Demand Response (DR), consumers regulate their power based on requests from an energy supplier. Data Centers (DC) are among the promising candidates to perform DR to help stabilize the power grid due to their flexibility and controllability. In this work, we present a novel framework for offering incentives to DCs so they can dynamically adjust their electricity consumption and provide DR to the grid. Coordination between an Independent System Operator (ISO) and DCs is done through pricing where the ISO computes optimal prices which elicit desired responses from the DCs. We model DCs using realistic cost functions based on Quality of Service (QoS) requirements of the DC workloads and present an inverse optimization approach to cost function parameter estimation for precise and efficient pricing along with simulation results that highlight the strength of our approach.
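
A stylized example shows how inverse optimization recovers cost parameters from observed responses. Assume, purely for illustration (the paper's DC cost functions encode QoS in richer detail), a quadratic flexibility cost c(d) = a(d - d0)^2: a DC minimizing c(d) + p·d responds linearly with d*(p) = d0 - p/(2a), so prices and observed demands identify a and d0 by least squares:

```python
# Recover cost parameters (a, d0) from observed (price, demand) pairs.
import numpy as np

a_true, d0_true = 0.05, 1000.0               # toy ground truth
prices = np.array([10.0, 20.0, 30.0, 40.0])
demands = d0_true - prices / (2 * a_true)    # DC's optimal responses

slope, intercept = np.polyfit(prices, demands, 1)
a_est, d0_est = -1 / (2 * slope), intercept
print(f"a = {a_est:.3f}, d0 = {d0_est:.0f}")  # recovers 0.050 and 1000
```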

Proceedings ArticleDOI
02 Jun 2019
TL;DR: In this paper, the authors present a record/replay framework for Android, RandR, which handles multiple sources of input and provides cross-device replay capabilities through a dynamic instrumentation approach.
Abstract: The ability to repeat the execution of a program is a fundamental requirement in evaluating computer systems and apps. Reproducing executions of mobile apps has proven difficult under real-life scenarios due to different sources of external inputs and the interactive nature of the apps. We present a new practical record/replay framework for Android, RandR, which handles multiple sources of input and provides cross-device replay capabilities through a dynamic instrumentation approach. We demonstrate the feasibility of RandR by recording and replaying a set of real-world apps.


Journal ArticleDOI
TL;DR: This work proposes a novel cross-layer power estimation (CAPE) technique that carefully integrates system-level and RTL profiling data for the target design in order to attain better accuracy, and shows that CAPE can improve the power estimation accuracy by up to 15% for individual simulation points and by ∼8% for the full application.
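
The cross-layer idea can be sketched as scaling slow-but-accurate RTL power characterization by activity observed in fast system-level simulation. The interface below is hypothetical; CAPE's actual integration and calibration are more involved:

```python
# Blend RTL-characterized per-component power with system-level activity.
def cross_layer_power(rtl_power_w: dict[str, float],
                      activity: dict[str, float]) -> float:
    """rtl_power_w: per-component power from RTL profiling at full
    activity; activity: per-component utilization in [0, 1] observed in
    fast system-level simulation of the same interval."""
    return sum(p * activity.get(c, 0.0) for c, p in rtl_power_w.items())

rtl = {"alu": 1.2, "fpu": 2.5, "l1": 0.8, "noc": 0.6}   # watts
act = {"alu": 0.7, "fpu": 0.1, "l1": 0.5, "noc": 0.3}   # utilizations
print(f"{cross_layer_power(rtl, act):.2f} W")            # -> 1.67 W
```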