Yubin Li
Researcher at Tsinghua University
Publications - 7
Citations - 935
Yubin Li is an academic researcher from Tsinghua University. The author has contributed to research in topics: Speedup & Hardware acceleration. The author has an h-index of 4 and has co-authored 7 publications receiving 790 citations.
Papers
Proceedings ArticleDOI
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally +11 more
TL;DR: The authors propose the Efficient Speech Recognition Engine (ESE), built around a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization).
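To make the load-balance-aware pruning idea concrete, here is a minimal NumPy sketch: weight rows are grouped into banks (one per processing element) and every bank is pruned to the same number of surviving weights, so no PE becomes a straggler. The bank count, sparsity level, and `load_balance_prune` helper are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def load_balance_prune(W, num_banks=4, sparsity=0.9):
    """Prune each bank of rows to the same number of nonzeros.

    Illustrative sketch: rows are split into `num_banks` groups (one per PE)
    and each group keeps only its largest-magnitude weights, so every PE
    ends up with the same amount of work.
    """
    W = W.copy()
    rows_per_bank = W.shape[0] // num_banks
    keep = int(round(rows_per_bank * W.shape[1] * (1.0 - sparsity)))
    for b in range(num_banks):
        bank = W[b * rows_per_bank:(b + 1) * rows_per_bank]
        flat = np.abs(bank).ravel()
        if keep < flat.size:
            thresh = np.partition(flat, flat.size - keep)[flat.size - keep]
            bank[np.abs(bank) < thresh] = 0.0
    return W

W = np.random.randn(64, 128).astype(np.float32)
W_pruned = load_balance_prune(W, num_banks=4, sparsity=0.9)
print("nonzeros per bank:",
      [np.count_nonzero(W_pruned[i * 16:(i + 1) * 16]) for i in range(4)])
```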
Posted Content
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally +11 more
TL;DR: This work proposes a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy, and proposes a scheduler that encodes and partitions the compressed model across multiple PEs for parallelism and schedules the complicated LSTM data flow.
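As a rough software analogue of the partitioning step described in the summary, the sketch below encodes each row's nonzeros and interleaves rows across processing elements. The round-robin assignment and the (column, value) encoding are assumptions chosen for simplicity; they stand in for the paper's actual scheduler rather than reproducing it.

```python
import numpy as np

def partition_rows_to_pes(W_pruned, num_pes=4):
    """Encode each row's nonzeros and interleave rows across PEs round-robin.

    Illustrative assumption: each PE receives every `num_pes`-th row in
    (column_index, value) form, approximating an even split of the
    compressed workload.
    """
    pes = [[] for _ in range(num_pes)]
    for r, row in enumerate(W_pruned):
        cols = np.nonzero(row)[0]
        encoded = list(zip(cols.tolist(), row[cols].tolist()))
        pes[r % num_pes].append((r, encoded))
    return pes

# Usage with the pruned matrix from the previous sketch:
# pes = partition_rows_to_pes(W_pruned, num_pes=4)
# print([sum(len(enc) for _, enc in pe) for pe in pes])  # nonzeros per PE
```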
Posted Content
ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally +11 more
TL;DR: This work proposes a load-balance-aware pruning method that can compress the LSTM model size by 20× (10× from pruning and 2× from quantization) with negligible loss of prediction accuracy, and designs the hardware architecture, named Efficient Speech Recognition Engine (ESE), that works directly on the compressed model.
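For intuition on what "works directly on the compressed model" means, here is a hedged software counterpart: a sparse matrix-vector multiply over a CSR-style encoding, so only the stored nonzeros are ever fetched and multiplied. CSR is a common compressed layout used here purely for illustration; the ESE hardware's actual encoding may differ.

```python
import numpy as np

def dense_to_csr(W):
    """Flatten a pruned matrix into CSR-style arrays (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        values.extend(row[nz].tolist())
        col_idx.extend(nz.tolist())
        row_ptr.append(len(values))
    return (np.array(values, dtype=np.float32),
            np.array(col_idx), np.array(row_ptr))

def spmv_compressed(values, col_idx, row_ptr, x):
    """Sparse matrix-vector multiply that touches only stored nonzeros."""
    y = np.zeros(len(row_ptr) - 1, dtype=np.float32)
    for r in range(len(y)):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += values[k] * x[col_idx[k]]
    return y

# Usage with the pruned matrix from the earlier sketch:
# vals, cols, ptr = dense_to_csr(W_pruned)
# y = spmv_compressed(vals, cols, ptr, np.random.randn(W_pruned.shape[1]).astype(np.float32))
```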
Patent
Efficient data access control device for neural network hardware acceleration system
TL;DR: In this article, the authors propose an overall design for a device that handles data receiving, bit-width transformation, and data storing; by employing the disclosed technique, a neural network hardware acceleration system can prevent data access from becoming the bottleneck in neural network computation.
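The summary mentions a bit-width transformation step between receiving data and storing it; the sketch below shows one plausible form of that step, converting 32-bit floats into narrower fixed-point words before they go to on-chip buffers. The 16-bit word, the Q8.8-style scaling, and the helper names are illustrative assumptions, not the patent's specification.

```python
import numpy as np

def to_fixed_point(x, frac_bits=8, word_bits=16):
    """Convert float32 data to signed fixed-point words of a narrower bit width.

    Assumed Q8.8-style format (8 fractional bits in a 16-bit word); the real
    device's bit widths are not specified in the summary above.
    """
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int16)

def from_fixed_point(q, frac_bits=8):
    """Recover approximate float values from the fixed-point words."""
    return q.astype(np.float32) / (1 << frac_bits)

data = np.random.randn(8).astype(np.float32)
stored = to_fixed_point(data)                             # narrower words for on-chip storage
print(np.max(np.abs(data - from_fixed_point(stored))))    # quantization error
```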
Proceedings ArticleDOI
Streaming sorting network based BWT acceleration on FPGA for lossless compression
TL;DR: A novel BWT accelerator based on a streaming sorting network that achieves a 14.3X speedup over the state-of-the-art work when the data block size is 4KB, together with a lossless data compression system built on this accelerator.
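As a software reference point for the transform being accelerated, here is a textbook-style Burrows-Wheeler Transform in Python using full rotation sorting. The FPGA design replaces this sort with a streaming sorting network in hardware; the sentinel-byte convention and the O(n^2) inverse below are just one common formulation, kept simple for clarity.

```python
def bwt(data: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Textbook Burrows-Wheeler Transform via sorted rotations.

    Reference implementation only: the FPGA accelerator performs the
    equivalent sort with a streaming sorting network. Assumes the
    sentinel byte does not occur in the input.
    """
    s = data + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)

def inverse_bwt(last_col: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Invert the transform by repeatedly prepending and sorting (quadratic, for clarity)."""
    table = [b""] * len(last_col)
    for _ in range(len(last_col)):
        table = sorted(last_col[i:i + 1] + table[i] for i in range(len(last_col)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]

original = b"banana"
assert inverse_bwt(bwt(original)) == original
```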