Over the past few decades, interest in theories and algorithms for face recognition has been growing rapidly. Video surveillance, criminal identification, building access control, and unmanned and autonomous vehicles are just a few examples of concrete applications that are gaining attraction among industries. Various techniques are being developed including local, holistic, and hybrid approaches, which provide a face image description using only a few face image features or the whole facial features. The main contribution of this survey is to review some well-known techniques for each approach and to give the taxonomy of their categories. In the paper, a detailed comparison between these techniques is exposed by listing the advantages and the disadvantages of their schemes in terms of robustness, accuracy, complexity, and discrimination. One interesting feature mentioned in the paper is about the database used for face recognition. An overview of the most commonly used databases, including those of supervised and unsupervised learning, is given. Numerical results of the most interesting techniques are given along with the context of experiments and challenges handled by these techniques. Finally, a solid discussion is given in the paper about future directions in terms of techniques to be used for face recognition.

https://www.mdpi.com/1424-8220/20/2/342/pdf

Face Recognition Systems: A Survey

Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense, and computationally expensive networks into small, sparse, and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.

Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going

HW/SW Co-Design of the HOG algorithm on a Xilinx Zynq SoC

Accelerating CNN Inference on ASICs: A Survey

Towards secure intrusion detection systems using deep learning techniques: Comprehensive analysis and review

Use of reduced precisions in Deep Learning (DL) inference tasks has recently been shown to significantly improve accelerator performance and greatly reduce both model memory footprint and the required external memory bandwidth. With appropriate network retuning, reduced precision networks can achieve accuracy close or equal to that of full-precision floating-point models. Given the wide spectrum of precisions used in DL inference, FPGAs' ability to create custom bit-width datapaths gives them an advantage over other acceleration platforms in this domain. However, the embedded DSP blocks in the latest Intel and Xilinx FPGAs do not natively support precisions below 18-bit and thus can not efficiently pack low-precision multiplications, leaving the DSP blocks under-utilized. In this work, we present an enhanced DSP block that can efficiently pack 2× as many 9-bit and 4× as many 4-bit multiplications compared to the baseline Arria-10-like DSP block at the cost of 12% block area overhead which leads to only 0.6% total FPGA core area increase. We quantify the performance gains of using this enhanced DSP block in two state-of-the-art convolutional neural network accelerators on three different models: AlexNet, VGG-16, and ResNet-50. On average, the new DSP block enhanced the computational performance of the 8-bit and 4-bit accelerators by 1.32× and 1.6× and at the same time reduced the utilized chip area by 15% and 30% respectively.

Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs

The growing importance and compute demands of artificial intelligence (AI) have led to the emergence of domain-optimized hardware platforms. For example, Nvidia GPUs introduced specialized tensor cores for matrix operations to speed up deep learning (DL) computation, resulting in very high peak throughput up to 130 int8 TOPS in the T4 GPU. Recently, Intel introduced its first AI-optimized 14nm FPGA, the Stratix 10 NX, with in-fabric AI tensor blocks that offer estimated peak performance up to 143 int8 TOPS, comparable to 12nm GPUs. However, what matters in practice is not the peak performance but the actual achievable performance on target workloads. This depends mainly on the utilization of the tensor units, and the system-level overheads to send data to/from the accelerator. This paper presents the first performance evaluation of Intel's AI-optimized FPGA, the Stratix 10 NX, in comparison to the latest accessible AI-optimized GPUs, the Nvidia T4 and V100, on a large suite of real-time DL inference workloads. We enhance a re-implementation of the Brainwave NPU overlay architecture to utilize the FPGA's AI tensor blocks, and develop toolchain support that allows users to program tensor blocks purely through software, without FPGA EDA tools in the loop. We first compare the Stratix 10 NX NPU against Stratix 10 GX/MX versions with no tensor blocks, and then present detailed core compute and system-level performance comparisons to the T4 and V100 GPUs. We show that our enhanced NPU on Stratix 10 NX achieves better tensor block utilization than GPUs, resulting in 24× and 12× average compute speedups over the T4 and V100 GPUs at batch-6. Even with relaxed latency constraints that allow a batch size of 32, we still achieve average speedups of 5× and 2× against T4 and V100 GPUs, respectively. On a system-level, the FPGA's fine-grained flexibility with its integrated 100 Gbps Ethernet allows for remote access at 10× and 2× less system overhead latency than local access to a V100 GPU via 128 Gbps PCIe for short and long sequence RNNs, respectively.

Beyond Peak Performance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs

Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on dataintensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for realtime services to avoid the expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix®10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA with a more balanced on-chip memory and compute can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as system-in-package to enhance on-chip memory capacity and bandwidth, and provide compute throughput matching the required bandwidth. We show that a small 32 mm2 TensorRAM 10nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× better latency than GPU (FP32) and 34× higher energy efficiency. It has 2× aggregate on-chip memory capacity compared to a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.

Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs

Recently, deep learning (DL) has become best-in-class for numerous applications but at a high computational cost that necessitates high-performance energy-efficient acceleration. The reconfigurability of FPGAs is appealing due to the rapid change in DL models but also causes lower performance and area-efficiency compared to ASICs. In this article, we implement three state-of-the-art computing architectures (CAs) for convolutional neural network (CNN) inference on FPGAs and ASICs. By comparing the FPGA and ASIC implementations, we highlight the area and performance costs of programmability to pinpoint the inefficiencies in current FPGA architectures. We perform our experiments using three variations of these CAs for AlexNet, VGG-16 and ResNet-50 to allow extensive comparisons. We find that the performance gap varies significantly from 2.8× to 6.3×, while the area gap is consistent across CAs with an 8.7 average FPGA-to-ASIC area ratio. Among different blocks of the CAs, the convolution engine, constituting up to 60% of the total area, has a high area ratio ranging from 13 to 31. Motivated by our FPGA vs. ASIC comparisons, we suggest FPGA architectural changes such as increasing DSP block count, enhancing low-precision support in DSP blocks and rethinking the on-chip memories to reduce the programmability gap for DL applications.

Andrew Boutros

Papers

Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs

HW/SW Co-Design of the HOG algorithm on a Xilinx Zynq SoC

Beyond Peak Performance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs

Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs

You Cannot Improve What You Do not Measure: FPGA vs. ASIC Efficiency Gaps for Convolutional Neural Network Inference