
Showing papers on "Parallel processing (DSP implementation)" published in 2021


Proceedings ArticleDOI
09 Sep 2021
TL;DR: Elf, as discussed by the authors, partitions the video frame and offloads the partial inference tasks to multiple servers for parallel processing, which can accelerate mobile deep vision applications under any server provisioning through parallel offloading.
Abstract: As mobile devices continuously generate streams of images and videos, a new class of mobile deep vision applications is rapidly emerging, which usually involve running deep neural networks on these multimedia data in real-time. To support such applications, having mobile devices offload the computation, especially the neural network inference, to edge clouds has proved effective. Existing solutions often assume there exists a dedicated and powerful server, to which the entire inference can be offloaded. In reality, however, we may not be able to find such a server but need to make do with less powerful ones. To address these more practical situations, we propose to partition the video frame and offload the partial inference tasks to multiple servers for parallel processing. This paper presents the design of Elf, a framework to accelerate the mobile deep vision applications with any server provisioning through the parallel offloading. Elf employs a recurrent region proposal prediction algorithm, a region proposal centric frame partitioning, and a resource-aware multi-offloading scheme. We implement and evaluate Elf upon Linux and Android platforms using four commercial mobile devices and three deep vision applications with ten state-of-the-art models. The comprehensive experiments show that Elf can speed up the applications by 4.85× while saving bandwidth usage by 52.6%.
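
For readers who want a concrete picture of the partition-and-offload idea, the sketch below splits a frame into tiles and dispatches them to several servers concurrently. The partitioning and the per-server inference call are hypothetical stand-ins (Elf's actual pipeline is region-proposal-centric and resource-aware), so treat it as a minimal illustration only.

```python
# Minimal sketch of partition-and-offload for parallel inference.
# The frame partitioning and per-server call are placeholders, not
# Elf's region-proposal-centric algorithm.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def partition_frame(frame, n_parts):
    """Split a frame into n_parts vertical strips (stand-in for the
    region-proposal-centric partitioning)."""
    return np.array_split(frame, n_parts, axis=1)


def infer_partial(server, tile):
    """Placeholder for offloading one tile to one edge server.
    A real system would serialize the tile and issue an RPC."""
    return {"server": server, "detections": int(tile.mean() > 128)}


def offload(frame, servers):
    tiles = partition_frame(frame, len(servers))
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(infer_partial, s, t) for s, t in zip(servers, tiles)]
        partial_results = [f.result() for f in futures]
    # Merge partial inference results (here: just collect them).
    return partial_results


if __name__ == "__main__":
    frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
    print(offload(frame, ["edge-a", "edge-b", "edge-c"]))
```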

60 citations


Journal ArticleDOI
TL;DR: An adaptive parallel processing optimization mechanism (APPM) is proposed to self-adaptively adjust the service graph of SFCs and intelligently solve the joint problem of PSFC deployment and scheduling to reduce the cost of service creation and increase the agility of network operations.
Abstract: By replacing traditional hardware-based middleboxes with software-based Virtual Network Functions (VNFs) running on general-purpose servers, network function virtualization represents a promising technique to reduce the cost of service creation and increase the agility of network operations. Typically, Service Function Chains (SFCs) are adopted to orchestrate dynamical network services and facilitate management of network applications. Recently, SFC parallelism that implements parallel processing of VNFs has been investigated to further improve SFC service quality. However, the unreasonable service graph of parallel processing in existing parallelized SFCs (PSFCs) might cause excessive resource consumption; incoordination between PSFC deployment and scheduling also increases the queuing delay of VNFs and degrades PSFC performance. In this article, an adaptive parallel processing optimization mechanism (APPM) is proposed to self-adaptively adjust the service graph of PSFCs and intelligently solve the joint problem of PSFC deployment and scheduling. Specifically, APPM uses a parallelism optimization algorithm (POA) based on the bin packing problem with soft bin capacity to optimize the structure of the PSFC service graph. Afterward, APPM employs a joint optimization algorithm based on reinforcement learning (JORL) to jointly deploy and schedule the PSFCs optimized by POA via the online perception of environment status. Simulation results showed that POA reduces the SFC parallelism degree and resource consumption by about 35%; JORL lowers SFC delay by reducing the queuing delay and has better overall performance than the state of the art algorithms even with limited resources.
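
The following toy heuristic illustrates the problem class POA is modelled on, bin packing with a soft capacity, by letting bins exceed their nominal capacity up to a hard limit. It is a generic first-fit-decreasing sketch with made-up load values, not the paper's algorithm.

```python
# First-fit-decreasing heuristic for bin packing with a soft capacity:
# a bin may exceed `capacity` up to `capacity * (1 + slack)`.
# Generic sketch only; not POA from the paper.
def pack_vnfs(vnf_loads, capacity, slack=0.2):
    hard_cap = capacity * (1 + slack)
    bins = []  # each bin is a list of (vnf index, load)
    for idx, load in sorted(enumerate(vnf_loads), key=lambda x: -x[1]):
        target = None
        # Prefer a bin that stays within the soft capacity.
        for b in bins:
            if sum(l for _, l in b) + load <= capacity:
                target = b
                break
        # Otherwise tolerate a bin up to the hard capacity.
        if target is None:
            for b in bins:
                if sum(l for _, l in b) + load <= hard_cap:
                    target = b
                    break
        if target is None:
            target = []
            bins.append(target)
        target.append((idx, load))
    return bins


if __name__ == "__main__":
    print(pack_vnfs([0.6, 0.4, 0.7, 0.3, 0.5], capacity=1.0))
```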

45 citations


Journal ArticleDOI
01 Feb 2021
TL;DR: A novel coherent parallel photonic DAC concept is introduced along with an experimental demonstration capable of performing this digital-to-analog conversion without optic-electric-optic domain crossing, which guarantees a linear intensity weighting among bits operating at high sampling rates, yet at a reduced footprint and power consumption compared to other photonic alternatives.
Abstract: Digital-to-analog converters (DAC) are indispensable functional units in signal processing instrumentation and wide-band telecommunication links for both civil and military applications. Since photonic systems are capable of high data throughput and low latency, an increasingly found system limitation stems from the required domain-crossing such as digital-to-analog, and electronic-to-optical. A photonic DAC implementation, in contrast, enables a seamless signal conversion with respect to both energy efficiency and short signal delay; photonic DAC demonstrations to date, however, often require bulky discrete optical components and electric-optic transformation, hence introducing inefficiencies. Here, we introduce a novel coherent parallel photonic DAC concept along with an experimental demonstration capable of performing this digital-to-analog conversion without optic-electric-optic domain crossing. This design hence guarantees a linear intensity weighting among bits operating at high sampling rates, yet at a reduced footprint and power consumption compared to other photonic alternatives. Importantly, this photonic DAC could create seamless interfaces of next-generation data processing hardware for data-centers, task-specific compute accelerators such as neuromorphic engines, and network edge processing applications.

45 citations


Journal ArticleDOI
TL;DR: In this paper, the authors demonstrate a scalable on-chip photonic implementation of a simplified recurrent neural network, called a reservoir computer, using an integrated coherent linear photonic processor, which enables scalable and ultrafast computing beyond the input electrical bandwidth.
Abstract: Photonic neuromorphic computing is of particular interest due to its significant potential for ultrahigh computing speed and energy efficiency. The advantage of photonic computing hardware lies in its ultrawide bandwidth and parallel processing utilizing inherent parallelism. Here, we demonstrate a scalable on-chip photonic implementation of a simplified recurrent neural network, called a reservoir computer, using an integrated coherent linear photonic processor. In contrast to previous approaches, both the input and recurrent weights are encoded in the spatiotemporal domain by photonic linear processing, which enables scalable and ultrafast computing beyond the input electrical bandwidth. As the device can process multiple wavelength inputs over the telecom C-band simultaneously, we can use ultrawide optical bandwidth (~5 terahertz) as a computational resource. Experiments for the standard benchmarks showed good performance for chaotic time-series forecasting and image classification. The device is considered to be able to perform 21.12 tera multiplication–accumulation operations per second (MAC·s−1) for each wavelength and can reach petascale computation speed on a single photonic chip by using wavelength division multiplexing. Our results are challenging for conventional Turing–von Neumann machines, and they confirm the great potential of photonic neuromorphic processing towards peta-scale neuromorphic super-computing on a photonic chip. Optical computing holds promise for high-speed, low-energy information processing due to its large bandwidth and ability to multiplex signals. The authors propose a recurrent neural network implementation using reservoir computing architecture in an integrated photonic processor capable of performing ~10 tera multiplication–accumulation operations per second for each wavelength channel.
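
As a software analogue of the scheme, the snippet below trains a toy echo-state reservoir with a ridge-regression readout on a one-step-ahead prediction task. All sizes and matrices are arbitrary; the photonic processor realizes the input and recurrent mixing optically rather than in NumPy.

```python
# Toy software reservoir computer (echo state network) with a linear
# ridge-regression readout. Only the training scheme is analogous to the
# paper; the device performs the linear mixing in the optical domain.
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 2000                      # reservoir size, number of time steps
u = np.sin(np.linspace(0, 60, T))     # toy input signal
y_target = np.roll(u, -1)             # task: predict the next sample

W_in = rng.uniform(-0.5, 0.5, N)
W = rng.normal(0, 1, (N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # keep the spectral radius below 1

x = np.zeros(N)
states = np.empty((T, N))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])  # reservoir update
    states[t] = x

# Ridge-regression readout trained on the first 1500 steps.
lam = 1e-6
A = states[:1500]
W_out = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y_target[:1500])
pred = states[1500:] @ W_out
print("test NMSE:", np.mean((pred - y_target[1500:]) ** 2) / np.var(y_target[1500:]))
```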

44 citations


Journal ArticleDOI
TL;DR: This article proposes a new paradigm of parallel EC algorithms by making the first attempt to parallelize the algorithm at the generation level, inspired by the industrial pipeline technique, and shows that generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
Abstract: Due to the population-based and iterative-based characteristics of evolutionary computation (EC) algorithms, parallel techniques have been widely used to speed up the EC algorithms. However, the parallelism is usually performed at the population level, where multiple populations (or subpopulations) run in parallel, or at the individual level, where the individuals are distributed to multiple resources. That is, different populations or different individuals can be executed simultaneously to reduce running time. However, research into generation-level parallelism for EC algorithms has seldom been reported. In this article, we propose a new paradigm of the parallel EC algorithm by making the first attempt to parallelize the algorithm at the generation level. This idea is inspired by the industrial pipeline technique. Specifically, a kind of EC algorithm called local version particle swarm optimization (PSO) is adopted to implement a pipeline-based parallel PSO (PPPSO, i.e., P3SO). Due to the generation-level parallelism in P3SO, when some particles are still performing their evolutionary operations in the current generation, other particles can simultaneously move on to the next generation to carry out new evolutionary operations, or even proceed to further generation(s). The experimental results show that the problem-solving ability of P3SO is not affected while its evolutionary speed is substantially accelerated. Therefore, generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
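
A minimal way to see generation-level (pipeline) parallelism is the threaded local-best PSO below, in which a particle may advance to generation g+1 as soon as its two ring neighbours have completed generation g. It is a toy sketch under those assumptions, not the P3SO implementation evaluated in the paper.

```python
# Toy generation-level (pipeline) parallel local-best PSO on a ring
# topology: particle i only waits for its two neighbours, not the swarm.
import threading
import numpy as np

N_PARTICLES, DIM, GENS = 8, 5, 50
W, C1, C2 = 0.7, 1.5, 1.5

def sphere(x):
    return float(np.sum(x * x))

init_rng = np.random.default_rng(0)
pos = init_rng.uniform(-5, 5, (N_PARTICLES, DIM))
vel = np.zeros((N_PARTICLES, DIM))
pbest = pos.copy()
pbest_val = np.array([sphere(p) for p in pos])
done = [0] * N_PARTICLES          # last completed generation per particle
cv = threading.Condition()

def run_particle(i):
    rng = np.random.default_rng(100 + i)
    left, right = (i - 1) % N_PARTICLES, (i + 1) % N_PARTICLES
    for g in range(1, GENS + 1):
        with cv:
            # Pipeline constraint: both ring neighbours finished generation g-1.
            cv.wait_for(lambda: done[left] >= g - 1 and done[right] >= g - 1)
            lbest = pbest[min((left, i, right), key=lambda j: pbest_val[j])].copy()
        r1, r2 = rng.random(DIM), rng.random(DIM)
        vel[i] = W * vel[i] + C1 * r1 * (pbest[i] - pos[i]) + C2 * r2 * (lbest - pos[i])
        pos[i] = pos[i] + vel[i]
        val = sphere(pos[i])
        with cv:
            if val < pbest_val[i]:
                pbest_val[i], pbest[i] = val, pos[i].copy()
            done[i] = g               # this particle has completed generation g
            cv.notify_all()

threads = [threading.Thread(target=run_particle, args=(i,)) for i in range(N_PARTICLES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("best value found:", float(pbest_val.min()))
```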

40 citations



Journal ArticleDOI
TL;DR: Bayesian models provide recursive inference naturally because they can formally reconcile new data and existing scientific information as discussed by the authors, however, popular use of Bayesian methods often avoids priors and thus avoids the need for prior knowledge.
Abstract: Bayesian models provide recursive inference naturally because they can formally reconcile new data and existing scientific information. However, popular use of Bayesian methods often avoids priors ...
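
A standard beta-binomial example of this recursion, in which each posterior becomes the prior for the next data batch, is sketched below; it is generic textbook material rather than the models considered in the paper.

```python
# Recursive Bayesian updating with a conjugate beta-binomial model:
# yesterday's posterior is today's prior, so streaming data are
# reconciled with existing information without refitting from scratch.
a, b = 1.0, 1.0                        # weakly informative Beta(1, 1) prior
batches = [(7, 3), (4, 6), (9, 1)]     # (successes, failures) arriving over time

for successes, failures in batches:
    a, b = a + successes, b + failures     # conjugate update: posterior Beta(a, b)
    print(f"after batch: Beta({a:.0f}, {b:.0f}), posterior mean = {a / (a + b):.3f}")
```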

38 citations


Journal ArticleDOI
17 Jun 2021
TL;DR: In this article, state-of-the-art methods for processing remotely sensed big data are surveyed and existing parallel implementations on diverse popular high-performance computing platforms are thoroughly investigated and discussed in terms of capability, scalability, reliability, and ease of use.
Abstract: This article gives a survey of state-of-the-art methods for processing remotely sensed big data and thoroughly investigates existing parallel implementations on diverse popular high-performance computing platforms. The pros/cons of these approaches are discussed in terms of capability, scalability, reliability, and ease of use. Among existing distributed computing platforms, cloud computing is currently the most promising solution to efficient and scalable processing of remotely sensed big data due to its advanced capabilities for high-performance and service-oriented computing. We further provide an in-depth analysis of state-of-the-art cloud implementations that seek to exploit the parallelism of distributed processing of remotely sensed big data. In particular, we study a series of scheduling algorithms (GSs) aimed at distributing the computation load across multiple cloud computing resources in an optimized manner. We conduct a thorough review of different GSs and reveal the significance of employing scheduling strategies to fully exploit parallelism during the remotely sensed big data processing flow. We present a case study on large-scale remote sensing datasets to evaluate the parallel and distributed approaches and algorithms. Evaluation results demonstrate the advanced capabilities of cloud computing in processing remotely sensed big data and the improvements in computational efficiency obtained by employing scheduling strategies.
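
As a flavour of what such scheduling strategies do, the sketch below assigns remote sensing tiles to workers with a longest-processing-time-first heuristic; the cost values and worker count are invented, and none of the surveyed GSs are reproduced here.

```python
# Toy longest-processing-time-first scheduler that balances estimated
# processing costs across cloud workers. Illustrative only.
import heapq

def schedule(task_costs, n_workers):
    # Min-heap of (accumulated load, worker id); assign heaviest tasks first.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for task, cost in sorted(enumerate(task_costs), key=lambda x: -x[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    return assignment

if __name__ == "__main__":
    tile_costs = [12, 7, 33, 5, 21, 18, 9, 27]   # hypothetical per-tile costs
    print(schedule(tile_costs, n_workers=3))
```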

34 citations


Journal ArticleDOI
TL;DR: In this article, the use of Fabry-Perot (FP) lasers as potential neuromorphic computing machines with parallel processing capabilities was introduced for signal equalization in 25 Gbaud intensity modulation direct detection optical communication systems.
Abstract: We introduce the use of Fabry-Perot (FP) lasers as potential neuromorphic computing machines with parallel processing capabilities. With the use of optical injection between a master FP laser and a slave FP laser under feedback, we demonstrate the potential for scaling up the processing power at longitudinal mode granularity and perform real-time processing for signal equalization in 25 Gbaud intensity modulation direct detection optical communication systems. We demonstrate the improvement of classification performance as the number of modes multiplies the number of virtual nodes, which also offers the capability of simultaneously processing arbitrary data streams. Extensive numerical simulations show that up to 8 longitudinal modes in typical Fabry-Perot lasers can be leveraged to enhance classification performance.

28 citations


Journal ArticleDOI
TL;DR: In this article, a multi-context based ESM (MC-ESM) is proposed to measure the similarity of two entities by taking their semantic contexts into consideration, and a parallel compact Differential Evolution with Adaptive Step Length (pcDE-ASL) is also proposed to find all entity mappings; pcDE-ASL uses a probability representation of the population to reduce memory consumption, while the parallel processing mechanism and ASL help the algorithm efficiently converge on the global optima.
Abstract: Sensor ontology is able to resolve the sensor data heterogeneity issue among the Cybertwin-driven 6G based Internet of Everything (IoE) systems. However, due to human subjectivity, the sensor ontologies also suffer from the heterogeneity problem. To address this problem, it is necessary to execute the sensor Ontology Matching (OM) process, i.e., finding the identical entity pairs between two ontologies. To this end, we first propose a Multi-Context based ESM (MC-ESM) to measure the similarity of two entities by taking into consideration their semantic contexts. After that, a parallel compact Differential Evolution with Adaptive Step Length (pcDE-ASL) is proposed to find all entity mappings, which uses a probability representation of the population to reduce memory consumption, and the parallel processing mechanism and ASL to help the algorithm efficiently converge on the global optima. The experimental results show that pcDE-ASL is both effective and efficient.

27 citations


Journal ArticleDOI
TL;DR: In this article, few-layer optical diffractive neural networks (ODNNs) are proposed to perform optical logical operations by independently manipulating the mode and spatial position of multiple orbital angular momentum (OAM) modes.
Abstract: Optical logical operations demonstrate the key role of optical digital computing, which can perform general-purpose calculations and possess fast processing speed, low crosstalk, and high throughput. The logic states usually refer to linear momentums that are distinguished by intensity distributions, which blur the discrimination boundary and limit its sustainable applications. Here, we introduce orbital angular momentum (OAM) mode logical operations performed by optical diffractive neural networks (ODNNs). Using the OAM mode as a logic state not only can improve the parallel processing ability but also enhance the logic distinction and robustness of logical gates owing to the mode infinity and orthogonality. ODNN combining scalar diffraction theory and deep learning technology is designed to independently manipulate the mode and spatial position of multiple OAM modes, which allows for complex multilight modulation functions to respond to logic inputs. We show that few-layer ODNNs successfully implement the logical operations of AND, OR, NOT, NAND, and NOR in simulations. The logic units of XNOR and XOR are obtained by cascading the basic logical gates of AND, OR, and NOT, which can further constitute logical half-adder gates. Our demonstrations may provide a new avenue for optical logical operations and are expected to promote the practical application of optical digital computing.

Journal ArticleDOI
TL;DR: The results show the proposed platform can run deep learning algorithms on embedded devices while meeting the high scalability and fault tolerance required for real-time video processing.
Abstract: Real-time intelligent video processing on embedded devices with low power consumption can be useful for applications like drone surveillance, smart cars, and more. However, the limited resources of embedded devices are a challenging issue for effective embedded computing. Most of the existing work on this topic focuses on single device based solutions, without the use of cloud computing mechanisms for parallel processing to boost performance. In this paper, we propose a cloud platform for real-time video processing based on embedded devices. Eight NVIDIA Jetson TX1 and three Jetson TX2 GPUs are used to construct a streaming embedded cloud platform (SECP), on which Apache Storm is deployed as the cloud computing environment for deep learning algorithms (Convolutional Neural Networks - CNNs) to process video streams. Additionally, self-managing services are designed to ensure that this platform can run smoothly and stably, in the form of a metric sensor, a bottleneck detector and a scheduler. This platform is evaluated in terms of processing speed, power consumption, and network throughput by running various deep learning algorithms for object detection. The results show the proposed platform can run deep learning algorithms on embedded devices while meeting the high scalability and fault tolerance required for real-time video processing.
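
The queue-based sketch below mimics the spout/bolt structure of such a topology with Python processes: one source feeds frames, several workers run a stub "detection", and a sink collects results. It is only a structural illustration, not the SECP platform or its self-managing services.

```python
# Minimal spout -> parallel bolts -> sink pipeline using multiprocessing
# queues, loosely mirroring a Storm topology. Frame capture and the CNN
# are stubs; this is not the SECP implementation.
import multiprocessing as mp


def spout(frame_q, n_frames, n_workers):
    for i in range(n_frames):
        frame_q.put(i)                 # stand-in for a captured video frame
    for _ in range(n_workers):
        frame_q.put(None)              # poison pills to stop the bolts


def detect_bolt(frame_q, result_q):
    while True:
        frame = frame_q.get()
        if frame is None:
            result_q.put(None)
            break
        result_q.put((frame, f"objects_in_frame_{frame}"))  # stub inference


if __name__ == "__main__":
    n_workers, n_frames = 3, 10
    frame_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=spout, args=(frame_q, n_frames, n_workers))]
    procs += [mp.Process(target=detect_bolt, args=(frame_q, result_q))
              for _ in range(n_workers)]
    for p in procs:
        p.start()
    finished = 0
    while finished < n_workers:
        item = result_q.get()
        if item is None:
            finished += 1
        else:
            print("sink received:", item)
    for p in procs:
        p.join()
```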

Journal ArticleDOI
TL;DR: This paper solves 2D and 3D second-order partial differential equations with the Generalized Finite Difference Method using third- and fourth-order approximations, and proposes a strategy that gives excellent results both for detecting ill-conditioned stars and for increasing the accuracy of the numerical approximation.
Abstract: In this paper, we solve 2D and 3D second-order partial differential equations considering the Generalized Finite Difference Method with third- and fourth-order approximations. We analyze the influence of the number of points per star and establish some values as references. We propose a new strategy to deal with ill-conditioned stars, which are frequent in higher-order approximations. This strategy uses a few points per star and presents excellent results both for detecting ill-conditioned stars and for increasing the accuracy of the numerical approximation. We apply parallel processing to the formation of stars and derivatives, and show the speedup achieved.
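
The star-formation step parallelizes naturally because each node's star depends only on the node coordinates. The sketch below forms stars (k nearest neighbours per node) across worker processes with brute-force distances; the node cloud and k are arbitrary, and the derivative computation is omitted.

```python
# Parallel formation of GFDM "stars": each node plus its k nearest
# neighbours, computed chunk-by-chunk in separate processes.
# Brute-force distances for clarity; illustrative sizes only.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def stars_for_chunk(args):
    chunk_idx, nodes, k = args
    stars = {}
    for i in chunk_idx:
        d2 = np.sum((nodes - nodes[i]) ** 2, axis=1)
        # k nearest neighbours, excluding the node itself
        stars[i] = np.argsort(d2)[1:k + 1].tolist()
    return stars

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nodes = rng.random((2000, 3))            # 3D cloud of GFDM nodes
    k = 25                                   # points per star
    chunks = np.array_split(np.arange(len(nodes)), 4)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = pool.map(stars_for_chunk, [(c, nodes, k) for c in chunks])
    stars = {}
    for r in results:
        stars.update(r)
    print("star of node 0:", stars[0])
```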

Journal ArticleDOI
TL;DR: Powerful parallel processing tools, including modern Graphics Processing Units (GPUs), have enabled new deep learning algorithms such as Convolutional Neural Networks (CNNs).
Abstract: Due to the advent of powerful parallel processing tools, including modern Graphics Processing Units (GPU), new deep learning algorithms, such as Convolutional Neural Networks (CNNs), have significa...

Journal ArticleDOI
TL;DR: In this article, a novel highly parallel and flexible hardware architecture for the 5G LDPC decoder is proposed, targeting field-programmable gate array (FPGA) devices.
Abstract: The quasi-cyclic (QC) low-density parity-check (LDPC) code is a key error correction code for the fifth generation (5G) of cellular network technology. Designed to support several frame sizes and code rates, the 5G LDPC code structure allows high parallelism to deliver the high demanding data rate of 10 Gb/s. This impressive performance introduces challenging constraints on the hardware design. Particularly, allowing such high flexibility can introduce processing rate penalties on some configurations. In this context, a novel highly parallel and flexible hardware architecture for the 5G LDPC decoder is proposed, targeting field-programmable gate array (FPGA) devices. The architecture supports frame parallelism to maximize the utilization of the processing units, significantly improving the processing rate. The controller unit was carefully designed to support all 5G configurations and to avoid update conflicts. Furthermore, an efficient data scheduling is proposed to increase the processing rate. Compared to the recent related state of the art, the proposed FPGA prototype achieves a higher processing rate per hardware resource for most configurations.

Journal ArticleDOI
TL;DR: A particular approach for the Jordan network will be shown; however, the presented idea is applicable to other RNN structures and can be implemented in digital hardware.

Proceedings ArticleDOI
22 May 2021
TL;DR: This work proposes a multi-level parallel hardware accelerator for homomorphic computations in machine learning; to address the core computation in neural networks, it is designed to support Multiply-Accumulate operations natively between ciphertexts.
Abstract: Homomorphic Encryption (HE) allows untrusted parties to process encrypted data without revealing its content. People could encrypt the data locally and send it to the cloud to conduct neural network training or inferencing, which achieves data privacy in AI. However, the combined AI and HE computation could be extremely slow. To deal with this, we propose a multi-level parallel hardware accelerator for homomorphic computations in machine learning. The vectorized Number Theoretic Transform (NTT) unit is designed to form the low-level parallelism, and we apply a Residue Number System (RNS) to form the mid-level parallelism in one polynomial. Finally, a fully pipelined and parallel accelerator for two ciphertext operands is proposed to form the high-level parallelism. To address the core computation (matrix-vector multiplication) in neural networks, our work is designed to support Multiply-Accumulate (MAC) operations natively between ciphertexts. We have analyzed our design on the ZCU102 FPGA, and experimental results show that it outperforms previous works and achieves more than an order of magnitude acceleration over software implementations.
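
For reference, the snippet below gives a plain-software iterative NTT over a common NTT-friendly prime and uses it for a polynomial product, the kernel that the vectorized NTT unit accelerates; it makes no attempt to model the RNS decomposition or the pipelined ciphertext datapath.

```python
# Reference iterative number-theoretic transform (NTT) modulo
# p = 998244353 (primitive root 3) and a polynomial product built on it.
P = 998244353
G = 3

def ntt(a, invert=False):
    a = list(a)
    n = len(a)                      # n must be a power of two dividing p - 1
    j = 0
    for i in range(1, n):           # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)
        if invert:
            w_len = pow(w_len, P - 2, P)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % P
                a[k], a[k + length // 2] = (u + v) % P, (u - v) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a

def poly_mul(f, g):
    """Polynomial product via forward NTTs, pointwise multiply, inverse NTT."""
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = ntt(f + [0] * (n - len(f)))
    gb = ntt(g + [0] * (n - len(g)))
    prod = ntt([x * y % P for x, y in zip(fa, gb)], invert=True)
    return prod[:len(f) + len(g) - 1]

print(poly_mul([1, 2, 3], [4, 5]))   # -> [4, 13, 22, 15]
```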

Journal ArticleDOI
TL;DR: In this paper, the authors study the parallel k-means algorithm in MapReduce and parallelize the distance calculation process that provides independence between the data objects to perform cluster analysis in parallel.
Abstract: At present, the explosive growth of data and the mass storage state have brought many problems such as computational complexity and insufficient computational power to clustering research. The distributed computing platform through load balancing dynamically configures a large number of virtual computing resources, effectively breaking through the bottleneck of time and energy consumption, and embodies its unique advantages in massive data mining. This paper studies parallel k-means extensively. It first performs random sampling for initialization and then parallelizes the distance calculation, which is independent across data objects, to perform cluster analysis in parallel. After the parallel processing with MapReduce, many nodes calculate distances, which speeds up the algorithm. Finally, the clustering of data objects is parallelized. Results show that our method can provide services efficiently and stably and has good convergence.
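
The map/reduce split described here can be sketched in a few lines: each "map" task assigns its chunk of points to the nearest centroid and returns partial sums, and the "reduce" step merges them into new centroids. This is a generic illustration with synthetic data, not the paper's MapReduce job.

```python
# Map/reduce-style k-means iteration: parallel assignment ("map") and
# merged centroid update ("reduce"). Generic sketch with synthetic data.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def map_assign(args):
    chunk, centroids = args
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    for c in range(k):
        mask = labels == c
        sums[c] = chunk[mask].sum(axis=0)
        counts[c] = mask.sum()
    return sums, counts

def kmeans_step(data, centroids, n_workers=4):
    chunks = np.array_split(data, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_assign, [(c, centroids) for c in chunks]))
    total_sums = sum(s for s, _ in partials)
    total_counts = sum(c for _, c in partials)
    return total_sums / np.maximum(total_counts, 1)[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(10000, 2)) + rng.integers(0, 3, (10000, 1)) * 5
    centroids = data[rng.choice(len(data), 3, replace=False)]
    for _ in range(10):
        centroids = kmeans_step(data, centroids)
    print(centroids.round(2))
```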

Journal ArticleDOI
Abstract: Sequential region labelling, also known as connected components labelling, is a standard image segmentation problem that joins contiguous foreground pixels into blobs. Despite its long development ...

Proceedings ArticleDOI
01 Feb 2021
TL;DR: The content-addressable parallel processing paradigm (CAPP) as discussed by the authors is an in-situ PIM architecture that leverages content addressable memories to realize bit-serial arithmetic and logic operations via sequences of search and update operations over multiple memory rows in parallel.
Abstract: Processing-in-memory (PIM) architectures attempt to overcome the von Neumann bottleneck by combining computation and storage logic into a single component. The content-addressable parallel processing paradigm (CAPP) from the seventies is an in-situ PIM architecture that leverages content-addressable memories to realize bit-serial arithmetic and logic operations, via sequences of search and update operations over multiple memory rows in parallel. In this paper, we set out to investigate whether the concepts behind classic CAPP can be used successfully to build an entirely CMOS-based, general-purpose microarchitecture that can deliver manyfold speedups while remaining highly programmable. We conduct a full-stack design of a Content-Addressable Processing Engine (CAPE), built out of dense push-rule 6T SRAM arrays. CAPE is programmable using the RISC-V ISA with standard vector extensions. Our experiments show that CAPE achieves an average speedup of 14× (up to 254×) over an area-equivalent (slightly under 9 mm² at 7 nm) out-of-order processor core with three levels of caches.
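
The toy model below captures the CAPP flavour of computation: operands sit in rows, and addition proceeds bit-serially as a fixed sequence of pattern searches and row updates applied to all rows at once. It is a conceptual sketch in NumPy, not CAPE's microarchitecture.

```python
# Toy CAPP-style bit-serial addition: for each bit position, "search" for
# every full-adder input pattern across all rows in parallel, then
# "update" the sum bit and carry in the matched rows.
import numpy as np

BITS, ROWS = 8, 6
rng = np.random.default_rng(0)
a = rng.integers(0, 2, (ROWS, BITS))        # a[:, i] is bit i (LSB first)
b = rng.integers(0, 2, (ROWS, BITS))
s = np.zeros((ROWS, BITS), dtype=int)       # sum bits
carry = np.zeros(ROWS, dtype=int)

# Full-adder truth table: (a_bit, b_bit, carry_in) -> (sum_bit, carry_out)
TRUTH = {(x, y, c): (x ^ y ^ c, int(x + y + c >= 2))
         for x in (0, 1) for y in (0, 1) for c in (0, 1)}

for i in range(BITS):
    new_s = np.zeros(ROWS, dtype=int)
    new_c = np.zeros(ROWS, dtype=int)
    for (x, y, c), (sb, cb) in TRUTH.items():
        # "Search": all rows whose (a_i, b_i, carry) match this pattern.
        match = (a[:, i] == x) & (b[:, i] == y) & (carry == c)
        # "Update": write the sum bit and next carry into matched rows.
        new_s[match], new_c[match] = sb, cb
    s[:, i], carry = new_s, new_c

def to_int(bits):
    return bits @ (1 << np.arange(BITS))

print(to_int(a) + to_int(b))                   # expected sums
print(to_int(s) + carry * (1 << BITS))         # CAPP-style result (matches)
```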

Journal ArticleDOI
TL;DR: It is formally shown that, while the maximum number of tasks that can be performed simultaneously grows linearly with network size, under realistic scenarios (e.g. in an unpredictable environment), the expected number that could be performed concurrently grows radically sub-linearly with network size.
Abstract: The ability to learn new tasks and generalize to others is a remarkable characteristic of both human brains and recent artificial intelligence systems. The ability to perform multiple tasks simultaneously is also a key characteristic of parallel architectures, as is evident in the human brain and exploited in traditional parallel architectures. Here we show that these two characteristics reflect a fundamental tradeoff between interactive parallelism, which supports learning and generalization, and independent parallelism, which supports processing efficiency through concurrent multitasking. Although the maximum number of possible parallel tasks grows linearly with network size, under realistic scenarios their expected number grows sublinearly. Hence, even modest reliance on shared representations, which support learning and generalization, constrains the number of parallel tasks. This has profound consequences for understanding the human brain’s mix of sequential and parallel capabilities, as well as for the development of artificial intelligence systems that can optimally manage the tradeoff between learning and processing efficiency. The ability to perform multiple tasks simultaneously is a key characteristic of parallel architectures. Using methods from statistical physics, this study provides analytical results that quantify the limitations of processing capacity for different types of tasks in neural networks.

Journal ArticleDOI
TL;DR: Using C# under the .NET Framework, six design principles and six design patterns are applied to the user-friendly GNSSer software, with the purpose of bridging the gap between the design and implementation of object-oriented methods.
Abstract: To cope with the construction and upgrading of GNSS, critical attention must be given to the maintainability, reusability, extensibility, portability, loose coupling capacity and flexibility of GNS...

Proceedings ArticleDOI
01 Jul 2021
TL;DR: This paper investigates the complexity of VVC decoder processing blocks and presents a highly optimized decoder implementation that can achieve 4K 60fps VVC real-time decoding on an x86-based CPU, using the SIMD instruction extensions of the processor and additional parallel processing, including data- and task-level parallelism.

Journal ArticleDOI
TL;DR: In this paper, a 10T static random access memory (SRAM) bit-cell is proposed for fully parallel computing and high throughput using 32 parallel binary MAC operations, which achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and 169.9 TOPS/mm² throughput area efficiency.
Abstract: Computing-in-memory (CIM) is a promising approach to reduce latency and improve the energy efficiency of the multiply-and-accumulate (MAC) operation under a memory wall constraint for artificial intelligence (AI) edge processors. This paper proposes an approach focusing on scalable CIM designs using a new ten-transistor (10T) static random access memory (SRAM) bit-cell. Using the proposed 10T SRAM bit-cell, we present two SRAM-based CIM (SRAM-CIM) macros supporting multibit and binary MAC operations. The first design achieves fully parallel computing and high throughput using 32 parallel binary MAC operations. Advanced circuit techniques such as an input-dependent dynamic reference generator and an input-boosted sense amplifier are presented. Fabricated in a 28 nm CMOS process, this design achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and a 169.9 TOPS/mm² throughput area efficiency. The proposed approach effectively solves previous problems such as writing disturb, throughput, and the power consumption of an analog to digital converter (ADC). The second design supports multibit MAC operation (4-b weight, 4-b input, and 8-b output) to increase the inference accuracy. We propose an architecture that divides the 4-b weight and 4-b input multiplication into four 2-b multiplications in parallel, which increases the signal margin by 16× compared to conventional 4-b multiplication. Besides, the capacitive digital-to-analog converter (CDAC) area issue is effectively addressed using the intrinsic bit-line capacitance existing in the SRAM-CIM architecture. The proposed approach of realizing four 2-b parallel multiplications using the CDAC is successfully demonstrated with a modified LeNet-5 neural network. These results demonstrate that the proposed 10T bit-cell is promising for realizing robust and scalable SRAM-CIM designs, which is essential for realizing fully parallel edge computing.
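
The 2-b decomposition the second macro relies on is just an arithmetic identity, verified below: a 4-b × 4-b product equals four 2-b × 2-b partial products combined with shifts. The check says nothing about the analog CDAC behaviour, only the arithmetic.

```python
# Numerical check: a 4-bit x 4-bit product reconstructed from four
# 2-bit x 2-bit partial products with the appropriate binary shifts.
def mul4_from_2bit_parts(w, x):
    assert 0 <= w < 16 and 0 <= x < 16
    w_hi, w_lo = w >> 2, w & 0b11          # split the 4-bit weight into 2-bit halves
    x_hi, x_lo = x >> 2, x & 0b11          # split the 4-bit input into 2-bit halves
    partial = (w_lo * x_lo,                # weight 2^0
               w_lo * x_hi,                # weight 2^2
               w_hi * x_lo,                # weight 2^2
               w_hi * x_hi)                # weight 2^4
    return partial[0] + ((partial[1] + partial[2]) << 2) + (partial[3] << 4)

assert all(mul4_from_2bit_parts(w, x) == w * x for w in range(16) for x in range(16))
print("4-b x 4-b product reconstructed from four 2-b x 2-b partial products")
```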

Journal ArticleDOI
TL;DR: This study proposes a secure storage, processing and transmission (SSPT) technique based on two modules, the Advanced Encryption Standard (AES) in Electronic Code Book (ECB) mode and AES in Cipher-based Message Authentication Code (CMAC) mode, which are verified against the test vectors given by international standards.
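
A generic way to combine the two named modules in software is shown below using the third-party `cryptography` package: AES-ECB for block encryption and AES-CMAC for an integrity tag. Keys and data are placeholders and the sketch is not the SSPT design (note also that ECB reveals repeated plaintext blocks and is normally avoided for general data).

```python
# Illustrative AES-ECB encryption plus AES-CMAC tag with the `cryptography`
# package. Placeholder keys/data; generic sketch, not the SSPT technique.
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.cmac import CMAC

enc_key = os.urandom(16)
mac_key = os.urandom(16)
plaintext = b"sensor frame 0001: temperature=21.5C pressure=1013hPa"

# Pad to the 128-bit block size required by ECB, then encrypt block by block.
padder = padding.PKCS7(128).padder()
padded = padder.update(plaintext) + padder.finalize()
encryptor = Cipher(algorithms.AES(enc_key), modes.ECB()).encryptor()
ciphertext = encryptor.update(padded) + encryptor.finalize()

# Authenticate the ciphertext with AES-CMAC under a separate key.
c = CMAC(algorithms.AES(mac_key))
c.update(ciphertext)
tag = c.finalize()
print(ciphertext.hex(), tag.hex(), sep="\n")
```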

Book ChapterDOI
13 Apr 2021
TL;DR: In this article, a multi-sensor Schmidt-Kalman filter (MSSKF) based coupled bias estimation problem is considered for the single-target, multiple-sensor case, where the MSSKF augments the state vector with the bias vector for bias estimation.
Abstract: Accelerators are gaining predominant attention in HW/SW and embedded designs due to their lower power consumption and parallel data processing capabilities compared to standard microprocessors and FPGAs. In this paper, an MSSKF (Multi-sensor Schmidt–Kalman filter) based coupled bias estimation problem is considered for the single-target, multiple-sensor case. Here the MSSKF augments the state vector with the bias vector for bias estimation, which becomes computationally expensive as the dimensions of the state and the number of sensors increase. Hence, to address the computational complexity, digital signal processing (DSP) architectures are proposed to accelerate the algorithm and meet real-time constraints. In the MSSKF algorithm, most of the computational load is due to state covariance prediction and innovation covariance prediction. To realize the state covariance and innovation covariance, a folded DSP architecture and a parallel processing based folded DSP architecture are proposed, respectively. The matrix multiplications are addressed with systolic arrays to gain the advantage of latency and parallel processing. Moreover, the MSSKF using systolic array architectures was simulated and synthesized in Vivado 2018.1 using Verilog and implemented on an FPGA Zynq-7000 board. The performance of the systolic-based accelerator realization was compared with normal matrix multiplication.
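
The two products named as the bottleneck, state covariance prediction and innovation covariance, are written out below for a bias-augmented state in NumPy with arbitrary toy dimensions; the paper instead maps these multiplications onto folded and systolic DSP architectures.

```python
# The covariance products that dominate the MSSKF load, shown for a
# bias-augmented state. Matrices and sizes are arbitrary placeholders.
import numpy as np

n_state, n_sensors, n_meas = 6, 3, 2          # toy sizes
n_aug = n_state + n_sensors * n_meas          # state augmented with sensor biases

rng = np.random.default_rng(0)
F = np.eye(n_aug)                             # biases modelled as (near-)constant
F[:n_state, :n_state] += 0.01 * rng.standard_normal((n_state, n_state))
Q = 0.01 * np.eye(n_aug)
P = np.eye(n_aug)

H = np.zeros((n_meas, n_aug))
H[:, :n_state] = rng.standard_normal((n_meas, n_state))
H[:, n_state:n_state + n_meas] = np.eye(n_meas)   # first sensor's bias in the measurement
R = 0.1 * np.eye(n_meas)

P_pred = F @ P @ F.T + Q                      # state covariance prediction
S = H @ P_pred @ H.T + R                      # innovation covariance
print("P_pred shape:", P_pred.shape, "S shape:", S.shape)
```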

Journal ArticleDOI
01 Jan 2021
TL;DR: The proposed design offers sufficient processing power for the execution of state-of-the-art CNNs in real-time by utilizing a combination of data-level parallelism (DLP), instruction-level parallelism (ILP), and subword parallelism, while consuming between 972 mW and 340 mW of power.
Abstract: ConvAix is an application-specific instruction-set processor (ASIP) that enables the energy-efficient processing of convolutional neural networks (CNNs) while retaining substantial flexibility through its instruction-set architecture (ISA) based design. By utilizing a combination of data-level parallelism (DLP), instruction-level parallelism (ILP), and subword parallelism, the proposed design offers sufficient processing power for the execution of state-of-the-art CNNs in real-time. ConvAix’s arithmetic logic units (ALUs) are C-programmable, thereby offering the degree of flexibility required to implement many different convolution layer types, e.g., depthwise-separable convolutions and residual blocks, as well as fully-connected and pooling layers. It comprises a total of 256 ALUs and leverages low-precision computations down to 4 bits. Furthermore, it exploits sparsity in feature maps and weights via zero-guarding of redundant computations to maximize its energy efficiency. The processor was implemented in a modern 28 nm CMOS technology operating at a 1 V supply voltage with a resulting clock frequency of 513 MHz. The final design offers a precision-dependent peak throughput between 263 GOP/s (int16) and 1.1 TOP/s (int4), while consuming between 972 mW and 340 mW of power, resulting in effective energy-efficiencies ranging from 176 GOP/s/W to 2 TOP/s/W. Well-known CNNs, such as AlexNet, MobileNet, and ResNet-18, are simulated based on the placed and routed netlist, achieving between 233 (AlexNet) and 69 (ResNet-18) frames-per-second for a batch-size of 1, including times for off-chip transfers.

Journal ArticleDOI
TL;DR: In this article, the authors calculate the fractal dimension of the border of India and the coastline of India using a novel multicore parallel processing algorithm, by both the divider method and the box-counting method.
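
A box-counting estimate parallelizes naturally over box sizes, as in the sketch below, which counts occupied boxes of several sizes in separate processes and fits the log-log slope. The curve is a synthetic random walk, not the border or coastline data, and the divider method is not shown.

```python
# Box-counting dimension of a planar curve with the per-scale counts
# evaluated in parallel processes. Synthetic data; illustrative only.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def count_boxes(args):
    points, eps = args
    # Number of occupied eps-sized boxes covering the curve.
    boxes = {tuple(b) for b in np.floor(points / eps).astype(int)}
    return eps, len(boxes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    curve = np.cumsum(rng.standard_normal((200000, 2)), axis=0)
    curve = (curve - curve.min(0)) / (curve.max(0) - curve.min(0)).max()
    sizes = [2.0 ** -k for k in range(2, 9)]
    with ProcessPoolExecutor() as pool:
        counts = dict(pool.map(count_boxes, [(curve, e) for e in sizes]))
    log_inv_eps = np.log([1 / e for e in sizes])
    log_n = np.log([counts[e] for e in sizes])
    slope, _ = np.polyfit(log_inv_eps, log_n, 1)   # dimension = slope of log N vs log 1/eps
    print(f"estimated box-counting dimension: {slope:.2f}")
```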

Journal ArticleDOI
TL;DR: This paper addresses polystore issues by using the polyglot approach of the CloudMdsQL query language, which allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration, thus allowing native scripts to be processed in parallel at data store shards.
Abstract: The blooming of different data stores has made polystores a major topic in the cloud and big data landscape. As the amount of data grows rapidly, it becomes critical to exploit the inherent parallel processing capabilities of underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store’s native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e. joins, on top of parallel retrieval of underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language that allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration and (ii) incorporating the approach within the LeanXcale distributed query engine, thus allowing for native scripts to be processed in parallel at data store shards. In addition, (iii) efficient optimization techniques, such as bind join, can take place to improve the performance of selective joins. We evaluate the performance benefits of exploiting parallelism in combination with high expressivity and optimization through our experimental validation.
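
The bind join mentioned in point (iii) can be illustrated with two in-memory "stores": the selective outer side is evaluated first and its keys are bound into the probe of the large inner side. The dictionaries and column names below are invented for illustration.

```python
# Minimal bind-join illustration: evaluate the selective outer side first,
# then retrieve only inner rows whose join keys were "bound" by it.
def bind_join(outer_rows, inner_store, key):
    bound_keys = {row[key] for row in outer_rows}
    # In a real polystore this set would be pushed down as an IN-list
    # inside the inner store's native query.
    inner_rows = [r for r in inner_store if r[key] in bound_keys]
    index = {r[key]: r for r in inner_rows}
    return [{**o, **index[o[key]]} for o in outer_rows if o[key] in index]

customers = [{"cid": 1, "name": "Ada"}, {"cid": 7, "name": "Lin"}]      # selective side
orders = [{"cid": c, "total": 10 * c} for c in range(1, 1000)]          # large side
print(bind_join(customers, orders, key="cid"))
```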

Journal ArticleDOI
TL;DR: In this paper, the authors present NPB-CPP, a fully C++ translated version of NPB consisting of all the NPB kernels and pseudo-applications developed using OpenMP, Intel TBB, and FastFlow parallel frameworks for multicores.