
Showing papers by "Lingjia Tang" published in 2017


Proceedings ArticleDOI
04 Apr 2017
TL;DR: Neurosurgeon is a lightweight scheduler that automatically partitions DNN computation between mobile devices and datacenters at the granularity of neural network layers, motivated by the finding that a fine-grained, layer-level partitioning strategy based on the data and computation variations of each layer within a DNN has significant latency and energy advantages over the status quo cloud-only approach.
Abstract: The computation for today's intelligent personal assistants such as Apple Siri, Google Now, and Microsoft Cortana is performed in the cloud. This cloud-only approach requires significant amounts of data to be sent to the cloud over the wireless network and puts significant computational pressure on the datacenter. However, as the computational resources in mobile devices become more powerful and energy efficient, questions arise as to whether this cloud-only processing is desirable moving forward, and what the implications are of pushing some or all of this compute to the mobile devices on the edge. In this paper, we examine the status quo approach of cloud-only processing and investigate computation partitioning strategies that effectively leverage both the cycles in the cloud and on the mobile device to achieve low latency, low energy consumption, and high datacenter throughput for this class of intelligent applications. Our study uses 8 intelligent applications spanning the computer vision, speech, and natural language domains, all employing state-of-the-art Deep Neural Networks (DNNs) as the core machine learning technique. We find that given the characteristics of DNN algorithms, a fine-grained, layer-level computation partitioning strategy based on the data and computation variations of each layer within a DNN has significant latency and energy advantages over the status quo approach. Using this insight, we design Neurosurgeon, a lightweight scheduler to automatically partition DNN computation between mobile devices and datacenters at the granularity of neural network layers. Neurosurgeon does not require per-application profiling. It adapts to various DNN architectures, hardware platforms, wireless networks, and server load levels, intelligently partitioning computation for best latency or best mobile energy. We evaluate Neurosurgeon on a state-of-the-art mobile development platform and show that it improves end-to-end latency by 3.1X on average and up to 40.7X, reduces mobile energy consumption by 59.5% on average and up to 94.7%, and improves datacenter throughput by 1.5X on average and up to 6.7X.
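The layer-level partitioning decision reduces to a small search over candidate split points. Below is a minimal sketch of that search, not Neurosurgeon itself: the per-layer latency estimates (mobile_ms, cloud_ms), activation sizes, and uplink bandwidth are assumed inputs that the real system would obtain from its prediction models and runtime measurements.

```python
# Minimal sketch of layer-level partition-point selection. All per-layer
# numbers are hypothetical placeholders, not measurements from the paper.

def best_partition(layers, input_bytes, uplink_bytes_per_s):
    """Choose the layer boundary at which to hand off to the datacenter.

    layers: list of dicts with keys 'mobile_ms' (latency on the device),
            'cloud_ms' (latency in the datacenter), and 'out_bytes'
            (size of the layer's output activations).
    Returns (split, total_ms): split == 0 means cloud-only,
    split == len(layers) means mobile-only.
    """
    n = len(layers)
    best_split, best_ms = 0, float("inf")
    for split in range(n + 1):
        mobile_ms = sum(l["mobile_ms"] for l in layers[:split])
        cloud_ms = sum(l["cloud_ms"] for l in layers[split:])
        if split == n:
            xfer_ms = 0.0   # everything stays on-device, nothing to upload
        else:
            sent = input_bytes if split == 0 else layers[split - 1]["out_bytes"]
            xfer_ms = sent / uplink_bytes_per_s * 1000.0
        total = mobile_ms + xfer_ms + cloud_ms
        if total < best_ms:
            best_split, best_ms = split, total
    return best_split, best_ms

# Hypothetical three-layer network and a 1 MB/s uplink: the slow uplink
# makes mobile-only execution win, matching the paper's core insight.
layers = [
    {"mobile_ms": 20.0, "cloud_ms": 2.0, "out_bytes": 500_000},
    {"mobile_ms": 30.0, "cloud_ms": 3.0, "out_bytes": 20_000},
    {"mobile_ms": 10.0, "cloud_ms": 1.0, "out_bytes": 4_000},
]
print(best_partition(layers, input_bytes=600_000, uplink_bytes_per_s=1_000_000))
```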

899 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: Prophet is an approach that precisely predicts the performance degradation of latency-sensitive applications on accelerators due to application co-location, making it possible to identify "safe" co-locations on accelerators that improve utilization without violating the QoS target.
Abstract: Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key challenge is that when applications are co-located on a server, performance interference due to resource contention can be detrimental to the application QoS. Although prior work has proposed techniques to identify "safe" co-locations where application QoS is satisfied by predicting the performance interference on multicores, no such prediction technique exists for accelerators such as GPUs. In this work, we present Prophet, an approach to precisely predict the performance degradation of latency-sensitive applications on accelerators due to application co-location. We analyzed the performance interference on accelerators through a real-system investigation and found that unlike on multicores, where the key contentious resources are shared caches and main memory bandwidth, the key contentious resources on accelerators are instead processing elements, accelerator memory bandwidth, and PCIe bandwidth. Based on this observation, we designed interference models that enable the precise prediction of processing element, accelerator memory bandwidth, and PCIe bandwidth contention on real hardware. By using a novel technique to forecast solo-run execution traces of the co-located applications using interference models, Prophet can accurately predict the performance degradation of latency-sensitive applications on non-preemptive accelerators. Using Prophet, we can identify "safe" co-locations on accelerators to improve utilization without violating the QoS target. Our evaluation shows that Prophet can predict the performance degradation with an average prediction error of 5.47% on real systems. Meanwhile, based on the prediction, Prophet achieves accelerator utilization improvements of 49.9% on average while maintaining the QoS target of latency-sensitive applications.
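To make the co-location screening concrete, here is a toy Prophet-style check. The linear contention model and the 5% QoS slack are stand-in assumptions; the paper builds its interference models from real-hardware profiling of processing elements, accelerator memory bandwidth, and PCIe bandwidth.

```python
# Toy "safe co-location" screen over the three contentious accelerator
# resources the paper identifies. The proportional penalty is illustrative.

def predicted_slowdown(ls_app, batch_app):
    """Estimate latency inflation of a latency-sensitive (LS) kernel when
    co-located with a batch kernel on one accelerator.

    Each app is a dict of hypothetical utilization fractions in [0, 1]:
    'pe' (processing elements), 'membw' (device memory bandwidth),
    'pcie' (PCIe bandwidth).
    """
    slowdown = 1.0
    for res in ("pe", "membw", "pcie"):
        demand = ls_app[res] + batch_app[res]
        if demand > 1.0:          # resource oversubscribed by the pair
            slowdown *= demand    # naive proportional penalty
    return slowdown

def is_safe(ls_app, batch_app, qos_slack=1.05):
    """Admit the co-location only if predicted slowdown fits the QoS slack."""
    return predicted_slowdown(ls_app, batch_app) <= qos_slack

# Example: an LS app using 40% of PEs paired with a batch app using 70%.
ls = {"pe": 0.4, "membw": 0.3, "pcie": 0.2}
batch = {"pe": 0.7, "membw": 0.5, "pcie": 0.1}
print(predicted_slowdown(ls, batch), is_safe(ls, batch))   # 1.1 False
```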

121 citations


Proceedings ArticleDOI
14 Oct 2017
TL;DR: DeftNN is a GPU DNN execution framework that targets the key architectural bottlenecks of DNNs on GPUs to automatically and transparently improve execution performance, using two novel optimization techniques: synapse vector elimination, which identifies non-contributing synapses in the DNN and carefully transforms data to remove their computation and data movement, and near-compute data fission, which scales down on-chip data movement within DNN computations.
Abstract: Deep neural networks (DNNs) are key computational building blocks for emerging classes of web services that interact in real time with users via voice, images and video inputs. Although GPUs have gained popularity as a key accelerator platform for deep learning workloads, the increasing demand for DNN computation leaves a significant gap between the compute capabilities of GPU-enabled datacenters and the compute needed to service demand. The state-of-the-art techniques to improve DNN performance have significant limitations in bridging the gap on real systems. Current network pruning techniques remove computation, but the resulting networks map poorly to GPU architectures, yielding no performance benefit or even slowdowns. Meanwhile, current bandwidth optimization techniques focus on reducing off-chip bandwidth while overlooking on-chip bandwidth, a key DNN bottleneck. To address these limitations, this work introduces DeftNN, a GPU DNN execution framework that targets the key architectural bottlenecks of DNNs on GPUs to automatically and transparently improve execution performance. DeftNN is composed of two novel optimization techniques: (1) synapse vector elimination, a technique that identifies non-contributing synapses in the DNN and carefully transforms data and removes the computation and data movement of these synapses while fully utilizing the GPU to improve performance, and (2) near-compute data fission, a mechanism for scaling down the on-chip data movement requirements within DNN computations. Our evaluation of DeftNN spans 6 state-of-the-art DNNs. By applying both optimizations in concert, DeftNN is able to achieve an average speedup of 2.1× on real GPU hardware. We also introduce a small additional hardware unit per GPU core to facilitate efficient data fission operations, increasing the speedup achieved by DeftNN to 2.6×.
CCS Concepts: • Computing methodologies → Neural networks; • Computer systems organization → Single instruction, multiple data; • General and reference → Performance.
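The shape of synapse vector elimination can be illustrated in a few lines: drop whole weight columns judged non-contributing and slice the activations to match, so the remaining GEMM stays dense and GPU-friendly. Ranking columns by L2 norm and the keep_ratio parameter below are placeholder heuristics, not DeftNN's actual selection criterion.

```python
import numpy as np

def eliminate_synapse_vectors(W, X, keep_ratio=0.9):
    """W: (out, in) weight matrix; X: (in, batch) activations.
    Returns a smaller dense pair (W', X') with the weakest input
    dimensions removed, so the product stays a dense GEMM."""
    norms = np.linalg.norm(W, axis=0)          # contribution per input dim
    k = max(1, int(keep_ratio * W.shape[1]))
    keep = np.sort(np.argsort(norms)[-k:])     # indices of strongest dims
    return W[:, keep], X[keep, :]

# Illustrative shapes: a 256x512 layer applied to a batch of 64 inputs.
W = np.random.randn(256, 512)
X = np.random.randn(512, 64)
Wp, Xp = eliminate_synapse_vectors(W, X)
Y_approx = Wp @ Xp                             # dense, smaller GEMM
```

Keeping the computation dense is the point: unlike unstructured pruning, which leaves sparse matrices that map poorly to GPUs, removing whole vectors preserves the regular memory access pattern the hardware is built for.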

79 citations


Proceedings ArticleDOI
24 Jun 2017
TL;DR: PowerChief is a runtime framework that provides a joint design of service and query monitoring to track latency statistics across service stages, adaptively chooses the boosting technique to accelerate the bottleneck service with improved responsiveness, and dynamically reallocates the constrained power budget across service stages to accommodate the chosen boosting technique.
Abstract: Modern user-facing applications consist of multiple processing stages with a number of service instances in each stage. The latency profile of these multi-stage applications is intrinsically variable, making it challenging to provide satisfactory responsiveness. Given a limited power budget, improving the end-to-end latency requires intelligently boosting the bottleneck service across stages using multiple boosting techniques. However, prior work fails to acknowledge the multi-stage nature of user-facing applications and performs poorly in improving responsiveness on power-constrained CMPs, as it is unable to accurately identify the bottleneck service and apply the boosting techniques adaptively. In this paper, we present PowerChief, a runtime framework that 1) provides a joint design of service and query monitoring to track latency statistics across service stages and accurately identify the bottleneck service during runtime; 2) adaptively chooses the boosting technique to accelerate the bottleneck service with improved responsiveness; 3) dynamically reallocates the constrained power budget across service stages to accommodate the chosen boosting technique. Evaluated with real-world multi-stage applications, PowerChief improves the average latency by 20.3x and 32.4x (99% tail latency by 13.3x and 19.4x) for Sirius and Natural Language Processing applications respectively compared to stage-agnostic power allocation. In addition, for a given QoS target, PowerChief reduces the power consumption of the Sirius and Web Search applications by 23% and 33% respectively over prior work.
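A simplified sketch of the control loop this description implies: identify the bottleneck stage from measured latencies, then shift power budget toward it. The stage names, the 10 W step size, and the donor-selection rule are illustrative assumptions, not PowerChief's actual policy.

```python
# Toy per-interval power rebalancing across pipeline stages.

def rebalance_power(stage_latency_ms, stage_power_w, total_budget_w, step_w=10.0):
    """Move step_w watts from the fastest stage to the slowest one,
    keeping the total within the power budget."""
    bottleneck = max(stage_latency_ms, key=stage_latency_ms.get)
    donor = min(stage_latency_ms, key=stage_latency_ms.get)
    if donor != bottleneck and stage_power_w[donor] > step_w:
        stage_power_w[donor] -= step_w
        stage_power_w[bottleneck] += step_w
    assert sum(stage_power_w.values()) <= total_budget_w
    return stage_power_w

# Hypothetical three-stage pipeline: ASR dominates latency, so it gains power.
power = rebalance_power(
    {"asr": 42.0, "nlp": 17.0, "search": 9.0},    # measured stage latencies
    {"asr": 60.0, "nlp": 60.0, "search": 60.0},   # current per-stage power
    total_budget_w=180.0,
)
print(power)
```

A real controller would run this every monitoring interval and also pick which boosting mechanism (frequency scaling, core allocation, and so on) to apply to the bottleneck, which is where the paper's adaptive technique selection comes in.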

44 citations


Patent
30 Oct 2017
TL;DR: The patent describes a virtual assistant that uses a slot identification machine learning model to segment the text of a query and label each of its slots, generates a slot value for each slot, and then uses the slot values to identify an external data source relevant to the user query, fetch user data from that source, and apply one or more operations to the query to generate response data.
Abstract: Systems and methods for implementing an artificially intelligent virtual assistant include: collecting a user query; using a competency classification machine learning model to generate a competency label for the user query; using a slot identification machine learning model to segment the text of the query and label each of the slots of the query; generating a slot value for each of the slots of the query; generating a handler for each of the slot values; using the slot values to identify an external data source relevant to the user query, fetch user data from the external data source, and apply one or more operations to the query to generate response data; and using the response data to generate a response to the user query.
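The claim language maps naturally onto a pipeline skeleton. The sketch below is schematic only: the model objects (competency_clf, slot_tagger), the handler table, and the source registry are assumed stand-ins, since the patent does not prescribe concrete APIs.

```python
def extract_value(span):
    """Stub: normalize a tagged text span into a typed slot value
    (e.g., parse a date or an account name)."""
    return span.strip()

def answer(query_text, competency_clf, slot_tagger, handlers, sources):
    """Schematic end-to-end flow following the claim steps."""
    competency = competency_clf.predict(query_text)    # competency label
    slots = slot_tagger.tag(query_text)                # [(span, label), ...]
    values = {label: extract_value(span) for span, label in slots}
    user_data = sources[competency].fetch(values)      # external data source
    result = handlers[competency](user_data, values)   # operations on query
    return f"Response: {result}"                       # response data -> reply
```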

13 citations


Journal ArticleDOI
TL;DR: The authors propose using phase-change materials (PCMs) to shape a data center's thermal load, absorbing and releasing heat when it's advantageous, and find that in a data center, PCM can reduce the necessary cooling system size by up to 12 percent without impacting peak throughput.
Abstract: As data centers increase in size and computational capacity, their growth comes at a cost: an increasing thermal load that must be removed to prevent overheating. Here, the authors propose using phase-change materials (PCMs) to shape a data center's thermal load, absorbing and releasing heat when it's advantageous. They evaluate three important opportunities for cost savings. They find that in a data center, PCM can reduce the necessary cooling system size by up to 12 percent without impacting peak throughput, or increase the number of servers by up to 14.6 percent without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69 percent while delaying the onset of thermal limits by over 3 hours, and a wax-aware scheduler enables up to an 11 percent reduction in peak cooling load when batch jobs are added, increasing average daily throughput by 36-52 percent.
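A toy thermal-buffer model makes the load-shaping idea concrete: when server heat output exceeds cooling capacity, the surplus melts the PCM (heat absorbed); when there is slack, the PCM refreezes (heat released to the cooling system). All quantities here are illustrative, not figures from the paper.

```python
def simulate(heat_w, cooling_w, pcm_capacity_j, dt_s=60.0):
    """heat_w: server heat output (W) per interval. Returns the peak heat (W)
    that neither the cooling system nor the PCM could absorb."""
    stored_j = 0.0            # latent heat currently held in the PCM
    peak_overflow_w = 0.0
    for h in heat_w:
        surplus_j = (h - cooling_w) * dt_s
        if surplus_j > 0:     # cooling oversubscribed: melt PCM
            absorbed = min(surplus_j, pcm_capacity_j - stored_j)
            stored_j += absorbed
            peak_overflow_w = max(peak_overflow_w, (surplus_j - absorbed) / dt_s)
        else:                 # spare cooling capacity: refreeze PCM
            stored_j = max(0.0, stored_j + surplus_j)
    return peak_overflow_w

# Illustrative load spike: 60 kW of cooling, 5 MJ of PCM, one-minute steps.
# The PCM fully buffers the spike, so the overflow is 0 W.
print(simulate([50e3, 75e3, 80e3, 55e3], cooling_w=60e3, pcm_capacity_j=5e6))
```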

12 citations


Journal ArticleDOI
TL;DR: This work proposes Adrenaline, an approach to leverage finer-granularity (tens of nanoseconds) voltage boosting to effectively rein in the tail latency with query-level precision, and demonstrates the effectiveness of the methodology under various workload configurations.
Abstract: Reducing the long tail of the query latency distribution in modern warehouse scale computers is critical for improving performance and quality of service (QoS) of workloads such as Web Search and Memcached. Traditional turbo boost increases a processor’s voltage and frequency during a coarse-grained sliding window, boosting all queries that are processed during that window. However, the inability of such a technique to pinpoint tail queries for boosting limits its tail reduction benefit. In this work, we propose Adrenaline, an approach that leverages finer-granularity (tens of nanoseconds) voltage boosting to effectively rein in the tail latency with query-level precision. Two key insights underlie this work. First, emerging finer-granularity voltage/frequency boosting is an enabling mechanism for intelligent allocation of the power budget to precisely boost only the queries that contribute to the tail latency; second, per-query characteristics can be used to design indicators for proactively pinpointing these queries, triggering boosting accordingly. Based on these insights, Adrenaline effectively pinpoints and boosts queries that are likely to increase the tail distribution and can reap more benefit from the voltage/frequency boost. By evaluating under various workload configurations, we demonstrate the effectiveness of our methodology. We achieve up to a 2.50× tail latency improvement for Memcached and up to a 3.03× improvement for Web Search over coarse-grained dynamic voltage and frequency scaling (DVFS), given a fixed boosting power budget. When optimizing for energy reduction, Adrenaline achieves up to a 1.81× improvement for Memcached and up to a 1.99× improvement for Web Search over coarse-grained DVFS. With carefully chosen boost thresholds, Adrenaline further improves the tail latency reduction to 4.82× over coarse-grained DVFS.
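The query-level boosting decision can be sketched as an indicator-plus-budget check. The payload-size indicator, its 4 KB threshold, and the PowerBudget interface below are hypothetical; the paper derives per-query indicators per workload and applies the boost through fine-grained voltage/frequency control.

```python
from dataclasses import dataclass

@dataclass
class Query:
    payload_bytes: int          # cheap feature known before execution
    est_boost_energy_j: float   # estimated extra energy if boosted

class PowerBudget:
    """Tracks a fixed boosting energy budget."""
    def __init__(self, joules: float):
        self.joules = joules

    def try_spend(self, j: float) -> bool:
        if j <= self.joules:
            self.joules -= j
            return True
        return False

BOOST_BYTES = 4096  # hypothetical indicator threshold for likely-tail queries

def should_boost(q: Query, budget: PowerBudget) -> bool:
    """Boost only queries predicted to land in the tail, budget permitting."""
    return q.payload_bytes > BOOST_BYTES and budget.try_spend(q.est_boost_energy_j)

budget = PowerBudget(joules=100.0)
print(should_boost(Query(payload_bytes=8192, est_boost_energy_j=2.0), budget))
```

A serving loop would raise the core's voltage/frequency just before executing a query for which should_boost returns True and drop back to nominal immediately afterward, which is what makes the boost query-granular rather than window-granular.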

11 citations