
Tuan Nghia Nguyen

Researcher at Seoul National University

Publications: 7
Citations: 261

Tuan Nghia Nguyen is an academic researcher at Seoul National University. His research topics include computer science and dynamic random-access memory. He has an h-index of 1 and has co-authored 4 publications receiving 92 citations.

Papers
Journal ArticleDOI

A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection

TL;DR: This paper presents a Tera-OPS streaming hardware accelerator implementing a you-only-look-once (YOLO) CNN, which outperforms the “one-size-fits-all” designs in both performance and power efficiency.
Proceedings ArticleDOI

A Real-time Super-resolution Accelerator Using a big.LITTLE Core Architecture

TL;DR: This work proposes an SR accelerator using a big.LITTLE core architecture that can execute various networks in real time. It achieves an inference speed of 36.63 frames per second and a throughput of 221.79 GOPS at 200 MHz for ×2 SR, while using only 1,280 eight-bit multipliers and 330 KB of on-chip SRAM.
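A back-of-envelope check of these numbers: assuming the common convention that one multiplier performs one multiply-accumulate (2 ops) per cycle, 1,280 multipliers at 200 MHz give a peak throughput against which the reported 221.79 GOPS can be compared. This is an illustrative calculation, not a figure from the paper:

```python
# Hypothetical peak-throughput estimate for the reported accelerator.
MULTS = 1280          # eight-bit multipliers (from the abstract)
FREQ_HZ = 200e6       # clock frequency (from the abstract)
OPS_PER_MAC = 2       # assumption: multiply + accumulate counted as 2 ops

peak_gops = MULTS * OPS_PER_MAC * FREQ_HZ / 1e9   # 512.0 GOPS peak
utilization = 221.79 / peak_gops                  # ~0.43 achieved/peak

print(f"peak: {peak_gops} GOPS, utilization: {utilization:.1%}")
```

Under these assumptions the reported 221.79 GOPS corresponds to roughly 43% of the theoretical peak, a plausible figure once memory stalls and layer-shape mismatches are accounted for.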
Proceedings ArticleDOI

A Lightweight YOLOv2 Object Detector Using a Dilated Convolution

TL;DR: This study proposes a simple yet effective variant of YOLOv2 that uses dilated convolution to reduce its complexity. Experimental results show that the proposed method achieves up to a 36% reduction in memory accesses and a 9% reduction in computation, with negligible performance loss.
Posted Content

ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data.

TL;DR: ShortcutFusion is an optimization tool for FPGA-based accelerators that uses reuse-aware static memory allocation for shortcut data, maximizing on-chip data reuse under resource constraints.
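The general idea of reuse-aware static allocation is to place tensors whose lifetimes do not overlap into the same on-chip region. A minimal greedy sketch of this style of allocator, assuming hypothetical tensor descriptors; this is an illustration of the technique, not the paper's actual algorithm:

```python
def plan_reuse(tensors):
    """Greedy static allocation with region reuse (illustrative sketch).

    tensors: list of (name, first_use, last_use, size) tuples.
    Returns (plan, total_size): plan maps each tensor to a byte offset,
    reusing a region once the tensor previously occupying it is dead.
    """
    regions = []   # (offset, region_size, last_use_of_current_occupant)
    plan, top = {}, 0
    for name, first, last, size in sorted(tensors, key=lambda t: t[1]):
        for i, (off, rsize, rlast) in enumerate(regions):
            # Reuse a region only if its occupant died before we start
            # and the region is large enough.
            if rlast < first and size <= rsize:
                regions[i] = (off, rsize, last)
                plan[name] = off
                break
        else:
            # No reusable region: extend the address space.
            plan[name] = top
            regions.append((top, size, last))
            top += size
    return plan, top

# Example: "c" can reuse "a"'s region because "a" is dead by layer 2.
plan, total = plan_reuse([("a", 0, 1, 100), ("b", 1, 2, 100), ("c", 2, 3, 80)])
```

With overlap-aware reuse, the three buffers fit in 200 bytes instead of 280; shortcut tensors with long lifetimes are exactly the case where such analysis pays off.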
Proceedings ArticleDOI

Computation-Skipping Mask Generation for Super-Resolution Networks

TL;DR: This study introduces a computation-skipping mask (CSM) generation framework to reduce redundant computation in Super-Resolution (SR) neural networks. It cuts computational cost by up to 58% with negligible performance degradation across various SR models.
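A common way to motivate computation skipping in SR is that flat, low-detail regions need little processing. A toy sketch of a skip mask built from per-patch variance; the threshold and criterion here are hypothetical, not the CSM framework's actual method:

```python
def skip_mask(patches, thresh):
    """Mark which patches warrant full SR computation (illustrative).

    patches: list of flat pixel lists.
    Returns a list of booleans: True = compute, False = skip
    (near-flat patches can be handled by cheap interpolation instead).
    """
    mask = []
    for p in patches:
        mean = sum(p) / len(p)
        var = sum((x - mean) ** 2 for x in p) / len(p)
        mask.append(var > thresh)
    return mask

# A flat patch is skipped; a textured patch is kept for full computation.
mask = skip_mask([[5, 5, 5, 5], [0, 10, 0, 10]], thresh=1.0)
```

The fraction of False entries in such a mask directly bounds the achievable computation savings.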