“Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告

We present a library that provides optimized implementations for deep learning primitives. Deep learning workloads are computationally intensive, and optimizing the kernels of deep learning workloads is difficult and time-consuming. As parallel architectures evolve, kernels must be reoptimized for new processors, which makes maintaining codebases difficult over time. Similar issues have long been addressed in the HPC community by libraries such as the Basic Linear Algebra Subroutines (BLAS) [2]. However, there is no analogous library for deep learning. Without such a library, researchers implementing deep learning workloads on parallel processors must create and optimize their own implementations of the main computational kernels, and this work must be repeated as new parallel processors emerge. To address this problem, we have created a library similar in intent to BLAS, with optimized routines for deep learning workloads. Our implementation contains routines for GPUs, and similarly to the BLAS library, could be implemented for other platforms. The library is easy to integrate into existing frameworks, and provides optimized performance and memory usage. For example, integrating cuDNN into Caffe, a popular framework for convolutional networks, improves performance by 36% on a standard model while also reducing memory consumption.

cuDNN: Efficient Primitives for Deep Learning

Memory isolation is a key property of a reliable and secure computing system--an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other. In this paper, we expose the vulnerability of commodity DRAM chips to disturbance errors. By reading from the same address in DRAM, we show that it is possible to corrupt data in nearby addresses. More specifically, activating the same row in DRAM corrupts data in nearby rows. We demonstrate this phenomenon on Intel and AMD systems using a malicious program that generates many DRAM accesses. We induce errors in most DRAM modules (110 out of 129) from three major DRAM manufacturers. From this we conclude that many deployed systems are likely to be at risk. We identify the root cause of disturbance errors as the repeated toggling of a DRAM row's wordline, which stresses inter-cell coupling effects that accelerate charge leakage from nearby rows. We provide an extensive characterization study of disturbance errors and their behavior using an FPGA-based testing platform. Among our key findings, we show that (i) it takes as few as 139K accesses to induce an error and (ii) up to one in every 1.7K cells is susceptible to errors. After examining various potential ways of addressing the problem, we propose a low-overhead solution to prevent the errors

/pdf/flipping-bits-in-memory-without-accessing-them-an-3dirvwezou.pdf

Flipping bits in memory without accessing them: an experimental study of DRAM disturbance errors

Regional climate modeling using convection-permitting models (CPMs; horizontal grid spacing 10 km). CPMs no longer rely on convection parameterization schemes, which had been identified as a major source of errors and uncertainties in LSMs. Moreover, CPMs allow for a more accurate representation of surface and orography fields. The drawback of CPMs is the high demand on computational resources. For this reason, first CPM climate simulations only appeared a decade ago. In this study, we aim to provide a common basis for CPM climate simulations by giving a holistic review of the topic. The most important components in CPMs such as physical parameterizations and dynamical formulations are discussed critically. An overview of weaknesses and an outlook on required future developments is provided. Most importantly, this review presents the consolidated outcome of studies that addressed the added value of CPM climate simulations compared to LSMs. Improvements are evident mostly for climate statistics related to deep convection, mountainous regions, or extreme events. The climate change signals of CPM simulations suggest an increase in flash floods, changes in hail storm characteristics, and reductions in the snowpack over mountains. In conclusion, CPMs are a very promising tool for future climate research. However, coordinated modeling programs are crucially needed to advance parameterizations of unresolved physics and to assess the full potential of CPMs.

https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1002/2014RG000475

A review on regional convection-permitting climate modeling: Demonstrations, prospects, and challenges.

Virtualization is often used in cloud computing platforms for its several advantages in efficiently managing resources. However, virtualization raises certain additional challenges, and one of them is lack of power metering for virtual machines (VMs). Power management requirements in modern data centers have led to most new servers providing power usage measurement in hardware and alternate solutions exist for older servers using circuit and outlet level measurements. However, VM power cannot be measured purely in hardware. We present a solution for VM power metering, named Joulemeter. We build power models to infer power consumption from resource usage at runtime and identify the challenges that arise when applying such models for VM power metering. We show how existing instrumentation in server hardware and hypervisors can be used to build the required power models on real platforms with low error. Our approach is designed to operate with extremely low runtime overhead while providing practically useful accuracy. We illustrate the use of the proposed metering capability for VM power capping, a technique to reduce power provisioning costs in data centers. Experiments are performed on server traces from several thousand production servers, hosting Microsoft's real-world applications such as Windows Live Messenger. The results show that not only does VM power metering allows virtualized data centers to achieve the same savings that non-virtualized data centers achieved through physical server power capping, but also that it enables further savings in provisioning costs with virtualization.

/pdf/virtual-machine-power-metering-and-provisioning-1yj7epvvd8.pdf

Virtual machine power metering and provisioning

Main memory system is a shared resource in modern multicore machines, resulting in serious interference, which causes performance degradation in terms of throughput slowdown and unfairness. Numerous new memory scheduling algorithms have been proposed to address the interference problem. However, these algorithms usually employ complex scheduling logic and need hardware modification to memory controllers, as a result, industrial venders seem to have some hesitation in adopting them.

http://asg.ict.ac.cn/baoyg/downloads/PACT12-BPM.pdf

A software memory partition approach for eliminating bank-level interference in multicore systems

Cloud platform provides great flexibility and cost-efficiency for end-users and cloud operators. However, low resource utilization in modern datacenters brings huge wastes of hardware resources and infrastructure investment. To improve resource utilization, a straightforward way is co-locating different workloads on the same hardware. To figure out the resource efficiency and understand the key characteristics of workloads in co-located cluster, we analyze an 8-day trace from Alibaba's production trace. We reveal three key findings as follows. First, memory becomes the new bottleneck and limits the resource efficiency in Alibaba's datacenter. Second, in order to protect latency-critical applications, batch-processing applications are treated as second-class citizens and restricted to utilize limited resources. Third, more than 90% of latency-critical applications are written in Java applications. Massive self-contained JVMs further complicate resource management and limit the resource efficiency in datacenters.

Who limits the resource efficiency of my datacenter: an analysis of Alibaba datacenter traces

An ever increasing number of configuration parameters are provided to system users. But many users have used one configuration setting across different workloads, leaving untapped the performance potential of systems. A good configuration setting can greatly improve the performance of a deployed system under certain workloads. But with tens or hundreds of parameters, it becomes a highly costly task to decide which configuration setting leads to the best performance. While such task requires the strong expertise in both the system and the application, users commonly lack such expertise. To help users tap the performance potential of systems, we present Best Config, a system for automatically finding a best configuration setting within a resource limit for a deployed system under a given application workload. BestConfig is designed with an extensible architecture to automate the configuration tuning for general systems. To tune system configurations within a resource limit, we propose the divide-and-diverge sampling method and the recursive bound-and-search algorithm. BestConfig can improve the throughput of Tomcat by 75%, that of Cassandra by 63%, that of MySQL by 430%, and reduce the running time of Hive join job by about 50% and that of Spark join job by about 80%, solely by configuration adjustment.

BestConfig: tapping the performance potential of systems via automatic configuration tuning

Network measurement is challenged to fulfill stringent resource requirements in the face of massive network traffic. While approximate measurement can trade accuracy for resource savings, it demands intensive manual efforts to configure the right resource-accuracy trade-offs in real deployment. Such user burdens are caused by how existing approximate measurement approaches inherently deal with resource conflicts when tracking massive network traffic with limited resources. In particular, they tightly couple resource configurations with accuracy parameters, so as to provision sufficient resources to bound the measurement errors. We design SketchLearn, a novel sketch-based measurement framework that resolves resource conflicts by learning their statistical properties to eliminate conflicting traffic components. We prototype SketchLearn on OpenVSwitch and P4, and our testbed experiments and stress-test simulation show that SketchLearn accurately and automatically monitors various traffic statistics and effectively supports network-wide measurement with limited resources.

Sketchlearn: relieving user burdens in approximate measurement with automated statistical inference

In this paper we present a thorough experience on tuning double-precision matrix-matrix multiplication (DGEM-M) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance modeling based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves comparable performance with the latest CUBLAS library1. We further improve upon this with an implementation in the native machine language, leading to 20% increase in performance. That is, the achieved peak performance (efficiency) is improved from 302Gflop/s (58%) to 362Gflop/s (70%).

http://asg.ict.ac.cn/projects/dgemm/sc11_dgemm.pdf

Yungang Bao

Papers

A software memory partition approach for eliminating bank-level interference in multicore systems

Who limits the resource efficiency of my datacenter: an analysis of Alibaba datacenter traces

BestConfig: tapping the performance potential of systems via automatic configuration tuning

Sketchlearn: relieving user burdens in approximate measurement with automated statistical inference

Fast implementation of DGEMM on Fermi GPU