
Showing papers by "Marcelo A. C. Fernandes published in 2021"


Journal ArticleDOI
TL;DR: In this article, the authors propose an automated framework for the implementation of hardware-accelerated DNN architectures on FPGAs by combining custom hardware scalability with optimization strategies.
Abstract: Deep Learning techniques have been successfully applied to solve many Artificial Intelligence (AI) application problems. However, owing to topologies with many hidden layers, Deep Neural Networks (DNNs) have high computational complexity, which makes their deployment difficult in contexts highly constrained by requirements such as performance, real-time processing, or energy efficiency. Numerous hardware/software optimization techniques using GPUs, ASICs, and reconfigurable computing (i.e., FPGAs) have been proposed in the literature. With FPGAs, very specialized architectures have been developed to provide an optimal balance between high speed and low power. However, when targeting edge computing, user requirements and hardware constraints must be met efficiently. Therefore, this work focuses on reconfigurable embedded systems based on the Xilinx ZYNQ SoC and popular DNNs that can be implemented on embedded edge devices, improving performance per watt while maintaining accuracy. In this context, we propose an automated framework for the implementation of hardware-accelerated DNN architectures. This framework provides an end-to-end solution that facilitates the efficient deployment of topologies on FPGAs by combining custom hardware scalability with optimization strategies. Comparisons with cutting-edge solutions and experimental results demonstrate that the architectures developed by our framework offer the best compromise between performance, energy consumption, and system costs. For instance, the low-power (0.266 W) DNN topologies generated for the MNIST database achieved a high throughput of 3,626 FPS.
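As an illustration only, since the framework itself is not reproduced in this listing, here is a minimal NumPy sketch of the kind of small fixed-point MNIST topology such a framework might map to FPGA logic; the 784-64-10 layer sizes, random weights, and Q-format are assumptions:

```python
import numpy as np

def quantize(x, frac_bits=8):
    """Round values onto a fixed-point grid, as FPGA datapaths typically do."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical 784-64-10 MNIST topology with random weights (illustration only).
rng = np.random.default_rng(0)
W1 = quantize(rng.normal(0.0, 0.1, (64, 784)))
W2 = quantize(rng.normal(0.0, 0.1, (10, 64)))

def forward(image):
    """Fixed-point forward pass: re-quantize activations after every layer."""
    h = quantize(relu(W1 @ image))
    return quantize(W2 @ h)

logits = forward(quantize(rng.random(784)))  # stand-in for a flattened 28x28 image
print(logits.argmax())
```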

13 citations


Journal ArticleDOI
17 Jun 2021-Sensors
TL;DR: In this article, a high-throughput implementation of the Otsu automatic image thresholding algorithm on Field Programmable Gate Array (FPGA), aiming to process high-resolution images in real time, was proposed.
Abstract: This work proposes a high-throughput implementation of the Otsu automatic image thresholding algorithm on a Field Programmable Gate Array (FPGA), aiming to process high-resolution images in real time. The Otsu method is a widely used global thresholding algorithm that defines an optimal threshold between two classes. However, the technique has a high computational cost, making it difficult to use in real-time applications. Thus, this paper proposes a hardware design that exploits parallelization to optimize the system's processing time. The implementation details and an analysis of the synthesis results concerning hardware area occupation, throughput, and dynamic power consumption are presented. Results show that the proposed hardware achieves a high speedup compared to similar works in the literature.
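For reference, a minimal NumPy version of the computation the hardware parallelizes, a sketch rather than the paper's HDL: build a 256-bin histogram, then pick the threshold that maximizes the between-class variance:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance (8-bit image)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class-0 probability per threshold
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean per threshold
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0.0     # empty classes carry no variance
    return int(np.argmax(sigma_b))

img = np.random.default_rng(1).integers(0, 256, (480, 640), dtype=np.uint8)
binary = img > otsu_threshold(img)
```

On FPGA, the histogram accumulation and the 256 candidate-threshold evaluations are natural targets for parallel units; the sketch above evaluates them all at once with vectorized cumulative sums.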

10 citations


Journal ArticleDOI
TL;DR: In this paper, a fully parallel architecture for self-organizing maps (SOMs) is introduced to optimize the system's data processing time; the design is validated on FPGA and evaluated in terms of hardware throughput and resource usage.
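Only the TL;DR is carried in this listing, so purely as orientation, here is a minimal NumPy sketch of the SOM step that a fully parallel architecture accelerates: a best-matching-unit (BMU) search followed by a neighborhood-weighted update. The 8x8 grid, Gaussian neighborhood, and learning parameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
grid_h, grid_w, dim = 8, 8, 3            # hypothetical 8x8 map of 3-D weights
weights = rng.random((grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

def som_step(x, lr=0.1, radius=1.5):
    """One SOM update: BMU search (parallel on FPGA) + neighborhood update."""
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_dist2 / (2 * radius ** 2))[..., None]  # Gaussian kernel
    weights[...] += lr * h * (x - weights)

for x in rng.random((100, dim)):         # toy training loop
    som_step(x)
```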

7 citations


Proceedings ArticleDOI
27 Sep 2021
Abstract: DevOps refers to a set of practices that integrate software development and operations with the primary aim of enabling the continuous delivery of high-quality software. DevOps has also posed several challenges for software engineering teaching. In this paper, we present a preliminary study that analyzes existing teaching strategies reported in the literature. Our findings indicate a set of approaches highlighting the use of supporting environments for teaching. Our work also investigates how these environments can contribute to addressing existing challenges and recommendations in DevOps teaching.

3 citations


Posted ContentDOI
15 Oct 2021-bioRxiv
TL;DR: In this paper, the authors used the stacked sparse autoencoder (SSAE) technique to classify SARS-CoV-2 virus genome sequences during the COVID-19 pandemic.
Abstract: Since December 2019, the world has been intensely affected by the COVID-19 pandemic, caused by the SARS-CoV-2 virus, first identified in Wuhan, China. When a novel virus is identified, early elucidation of the taxonomic classification and origin of the viral genomic sequence is essential for strategic planning, containment, and treatment. Deep learning techniques have been successfully used in many viral classification problems associated with the diagnosis of viral infections, metagenomics, and phylogenetic analysis. This work proposes an efficient viral genome classifier for the SARS-CoV-2 virus using a deep neural network (DNN) based on the stacked sparse autoencoder (SSAE) technique. We performed four different experiments to provide different levels of taxonomic classification of the SARS-CoV-2 virus. Confusion matrices are presented for the validation and test sets, along with the ROC curve for the validation set. In all experiments, the SSAE technique provided strong performance. In this work, we explored the use of image representations of complete genome sequences as the SSAE input to classify SARS-CoV-2. For that, a dataset based on a k-mer image representation, with k = 6, was applied. The results indicate the applicability of this deep learning technique to genome classification problems.
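A minimal sketch of a k-mer image encoding of the kind described above: with k = 6 there are 4^6 = 4096 possible k-mers, whose counts fold into a 64x64 "image". The base-4 pixel ordering here is an assumption; the authors' exact mapping is not given in this listing:

```python
import numpy as np

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_image(seq, k=6):
    """Count all k-mers and reshape the 4**k counts into a square image."""
    counts = np.zeros(4 ** k, dtype=np.float64)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in BASE for c in kmer):   # skip ambiguous bases such as N
            idx = 0
            for c in kmer:                 # base-4 index of the k-mer
                idx = idx * 4 + BASE[c]
            counts[idx] += 1
    counts /= max(counts.max(), 1.0)       # normalize intensities to [0, 1]
    side = int(np.sqrt(4 ** k))            # 64 for k = 6
    return counts.reshape(side, side)

img = kmer_image("ATGGCGTACGTTAGCATCGATCGATCGGCTA" * 50)
print(img.shape)  # (64, 64), ready to feed an image-input network such as an SSAE
```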

3 citations


Proceedings ArticleDOI
01 Apr 2021
TL;DR: In this paper, a parallel FPGA implementation of a massive-array beamformer, composed of a spatial filter and an adaptation unit based on the Least Mean Squares (LMS) algorithm, is proposed.
Abstract: With the rise of 5G networks and the increasing number of communication devices, improving communication quality is essential. One approach is adaptive digital beamforming, which adjusts an antenna array's radiation pattern based on the desired received signal. Adaptation based on the Least Mean Squares (LMS) algorithm and its variants remains one of the most common methods in the literature. Although LMS techniques offer good computational performance, the growing number of antennas demands high-performance hardware. Platforms such as Field Programmable Gate Arrays (FPGAs), designed for massive array systems, enable high-performance, energy-efficient architectures. This work proposes a parallel FPGA implementation of a massive-array beamformer composed of a spatial filter and an LMS-based adaptation unit. The proposed design requires ten times less hardware and consumes 30 times less power than the state of the art.
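A software reference for the adaptation loop the spatial filter and adaptation unit implement, using the standard complex LMS update; the 16-element array, step size, and toy signals are assumptions, not the paper's test setup:

```python
import numpy as np

rng = np.random.default_rng(3)
n_ant, n_samples, mu = 16, 2000, 0.01     # hypothetical 16-element array

# Toy scenario: a narrowband signal arriving from 20 degrees, plus noise.
steer = np.exp(-1j * np.pi * np.arange(n_ant) * np.sin(np.deg2rad(20)))
d = np.exp(1j * 2 * np.pi * 0.05 * np.arange(n_samples))   # reference signal
X = np.outer(steer, d) + 0.3 * (rng.normal(size=(n_ant, n_samples))
                                + 1j * rng.normal(size=(n_ant, n_samples)))

w = np.zeros(n_ant, dtype=complex)
for k in range(n_samples):
    x = X[:, k]
    y = np.vdot(w, x)            # spatial filter output, w^H x
    e = d[k] - y                 # error against the reference
    w += mu * np.conj(e) * x     # LMS weight update

print(abs(np.vdot(w, steer)))    # array gain toward the desired direction
```

On hardware, the w^H x dot product and the per-antenna weight updates are parallel across the array, which is what makes the FPGA mapping attractive.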

2 citations


Journal ArticleDOI
TL;DR: A hardware architecture for the Typicality and Eccentricity Data Analytics (TEDA) algorithm, implemented on Field Programmable Gate Arrays (FPGAs) for anomaly detection in data streams; this work is a pioneer in the hardware implementation of the TEDA technique on FPGA.
Abstract: The amount of real-time data available today, such as time series and streaming data, continues to grow. Being able to analyze this data the moment it arrives can bring immense added value, but it also requires considerable computational effort and new acceleration techniques. As a possible solution to this problem, this paper proposes a hardware architecture for the Typicality and Eccentricity Data Analytics (TEDA) algorithm, implemented on Field Programmable Gate Arrays (FPGAs), for anomaly detection in data streams. TEDA is based on a new approach to outlier detection in the data stream context. The proposed design has a fully parallel input of N elements and a 3-stage pipelined architecture to shorten the critical path and thus optimize throughput. To validate the proposal, results on the occupation and throughput of the proposed hardware are presented. The design achieved a speedup of up to 693x compared to other software platforms, with a throughput of up to 10.96 MSPS (mega samples per second), while using a small portion of the target FPGA's resources. In addition, bit-accurate simulation results are presented. This work is a pioneer in the hardware implementation of the TEDA technique on FPGA. The design targets the Xilinx Virtex-6 xc6vlx240t-1ff1156 FPGA.
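As a software reference for the recursion the pipeline computes (fixed-point details aside), a minimal TEDA sketch following the usual recursive formulation: update the mean and variance per sample, compute the normalized eccentricity, and flag a sample when it exceeds the m-sigma Chebyshev-style threshold:

```python
import numpy as np

def teda_stream(samples, m=3.0):
    """Yield (sample_index, is_outlier) using recursive TEDA statistics."""
    mean, var = None, 0.0
    for k, x in enumerate(samples, start=1):
        x = np.asarray(x, dtype=float)
        if k == 1:
            mean = x.copy()
            continue                       # variance needs at least two samples
        mean = ((k - 1) / k) * mean + x / k
        var = ((k - 1) / k) * var + ((x - mean) @ (x - mean)) / (k - 1)
        ecc = 1.0 / k + ((mean - x) @ (mean - x)) / (k * max(var, 1e-12))
        yield k, (ecc / 2.0) > (m ** 2 + 1) / (2 * k)   # normalized eccentricity

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, (500, 2))
data[250] += 8.0                           # inject one obvious outlier
print([k for k, flag in teda_stream(data) if flag])
```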

2 citations


Proceedings ArticleDOI
18 Jul 2021
TL;DR: In this paper, the authors proposed a novel training strategy that simultaneously minimizes both pruning and quantization losses when training compressed models, reducing deep learning computational complexity for efficient viral classification.
Abstract: Deep learning techniques, such as deep neural networks (DNNs), have been used successfully in many viral classification problems associated with metagenomics, diagnosis of viral infections, pharmacogenomics, phylogenetic analysis, and others. However, deep learning algorithms require a large number of mathematical operations, and these computations can themselves become a bottleneck when processing vast numbers of virus sequences in a short time. Currently, most works in this area use basic DNNs for viral classification that are not optimized for computational efficiency. This paper proposes a novel training strategy that simultaneously minimizes both pruning and quantization losses when training compressed models, reducing deep learning computational complexity. When training a compressed convolutional neural network (CNN), the scheme applies weight quantization followed by pruning in each training iteration, rather than pruning followed by quantization. The proposed training strategy was applied to train compressed models for efficient viral classification of 1,600 sequences of four types of viruses belonging to three families and one realm. A substantial reduction in DNN weights (77%) and operations (58%) is demonstrated while maintaining high classification accuracy. These results show that the proposed training regime, weight quantization followed by weight pruning in each training iteration, is superior to conventional approaches in which weight pruning epochs are followed by weight quantization epochs.
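A minimal sketch of the iteration ordering the paper argues for, quantization applied before magnitude pruning inside each training step, written against a generic gradient update. The 77% sparsity target echoes the reduction reported above; the toy model and 8-bit width are assumptions:

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform symmetric quantization onto a 2**bits-level grid."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    if scale == 0:
        return w
    return np.round(w / scale) * scale

def prune(w, sparsity=0.77):
    """Zero the smallest-magnitude weights (77% reduction, as reported)."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

rng = np.random.default_rng(5)
w = rng.normal(0.0, 0.1, (64, 64))
for step in range(100):
    grad = rng.normal(0.0, 0.01, w.shape)  # placeholder for a real loss gradient
    w -= 0.1 * grad                        # SGD update
    w = prune(quantize(w))                 # key ordering: quantize, THEN prune

print((w == 0).mean())                     # achieved sparsity, roughly 0.77
```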

1 citation


Posted ContentDOI
27 Jul 2021-bioRxiv
TL;DR: In this article, a parallel hardware design for the Smith-Waterman (SW) algorithm with a systolic array structure is proposed to accelerate the Forward and Backtracking steps.
Abstract: In bioinformatics, alignment is an essential technique for finding similarities between biological sequences. The alignment is usually performed with the Smith-Waterman (SW) algorithm, a well-known, high-precision sequence alignment technique based on dynamic programming. However, given the massive data volume in biological databases and its continuous exponential growth, high-speed data processing is necessary. Therefore, this work proposes a parallel hardware design for the SW algorithm with a systolic array structure to accelerate the Forward and Backtracking steps. For this purpose, the architecture calculates and stores the paths during the Forward stage, pre-organizing the alignment and reducing the complexity of the Backtracking stage. Backtracking starts from the maximum-score position in the matrix and generates the optimal SW sequence alignment path. The architecture was validated on a Field-Programmable Gate Array (FPGA), and synthesis analyses show that the proposed design reaches up to 79.5 Giga Cell Updates per Second (GCUPS).
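For orientation, a compact software version of the two stages the systolic array accelerates: the Forward stage fills the score matrix and, as in the proposed architecture, records the move that produced each cell so that Backtracking only has to walk the stored path from the maximum score. The match/mismatch/gap scores are the usual textbook assumptions:

```python
import numpy as np

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Forward fills scores and moves; Backtracking starts at the max score."""
    n, m = len(a), len(b)
    H = np.zeros((n + 1, m + 1), dtype=int)
    move = np.zeros((n + 1, m + 1), dtype=int)   # 0 stop, 1 diag, 2 up, 3 left
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i-1, j-1] + (match if a[i-1] == b[j-1] else mismatch)
            up, left = H[i-1, j] + gap, H[i, j-1] + gap
            H[i, j] = best = max(0, diag, up, left)
            if best > 0:
                move[i, j] = (diag, up, left).index(best) + 1
    i, j = np.unravel_index(np.argmax(H), H.shape)   # Backtracking start
    out_a, out_b = [], []
    while move[i, j]:
        if move[i, j] == 1:
            i, j = i - 1, j - 1
            out_a.append(a[i]); out_b.append(b[j])
        elif move[i, j] == 2:
            i -= 1; out_a.append(a[i]); out_b.append("-")
        else:
            j -= 1; out_a.append("-"); out_b.append(b[j])
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("GGTTGACTA", "TGTTACGG"))
```

In the hardware design, each anti-diagonal of H can be computed in one step by a row of processing elements, which is what the systolic structure buys over this O(nm) sequential loop.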

Proceedings ArticleDOI
29 Apr 2021
TL;DR: This work proposes a Field Programmable Gate Array (FPGA) implementation of Otsu's method applied to real-time tracking of the worm Caenorhabditis elegans, achieving a speedup of up to 5 times over similar works in the literature.
Abstract: This work proposes a Field Programmable Gate Array (FPGA) implementation of Otsu's method applied to real-time tracking of the worm Caenorhabditis elegans. Real-time tracking is necessary to measure changes in the worm's behavior in response to treatment with Ribonucleic Acid (RNA) interference. Otsu's method is a global thresholding algorithm used to define an optimal threshold between two classes. However, in real-time applications involving high-resolution video, the technique has a high computational cost because of the massive amount of data generated. In a real-time analysis of the worm's behavior, Otsu's algorithm must identify the worms in each frame captured by a high-resolution camera. Thus, this work proposes a high-performance implementation of Otsu's algorithm on FPGA. The results show a speedup of up to 5 times over similar works in the literature.
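The thresholding mathematics is the same Otsu computation sketched for the Sensors paper above, so instead of repeating it, here is a sketch of the per-frame usage pattern the hardware accelerates, using OpenCV's built-in Otsu mode as a software stand-in; the video file name, inverted threshold polarity, and blob-area filter are assumptions:

```python
import cv2

cap = cv2.VideoCapture("worms.avi")   # hypothetical recording of C. elegans
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Otsu re-picks the threshold each frame; worms assumed darker than background.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    # Each sufficiently large component is a worm candidate to track.
    worms = [tuple(centroids[i]) for i in range(1, n)
             if stats[i, cv2.CC_STAT_AREA] > 50]
cap.release()
```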