
Showing papers on "Parallel processing (DSP implementation)" published in 2021


Proceedings ArticleDOI
09 Sep 2021
TL;DR: Elf, as discussed by the authors, partitions the video frame and offloads the partial inference tasks to multiple servers for parallel processing, which can accelerate mobile deep vision applications under any server provisioning through parallel offloading.
Abstract: As mobile devices continuously generate streams of images and videos, a new class of mobile deep vision applications is rapidly emerging, which usually involve running deep neural networks on these multimedia data in real-time. To support such applications, having mobile devices offload the computation, especially the neural network inference, to edge clouds has proved effective. Existing solutions often assume there exists a dedicated and powerful server, to which the entire inference can be offloaded. In reality, however, we may not be able to find such a server but need to make do with less powerful ones. To address these more practical situations, we propose to partition the video frame and offload the partial inference tasks to multiple servers for parallel processing. This paper presents the design of Elf, a framework to accelerate the mobile deep vision applications with any server provisioning through the parallel offloading. Elf employs a recurrent region proposal prediction algorithm, a region proposal centric frame partitioning, and a resource-aware multi-offloading scheme. We implement and evaluate Elf upon Linux and Android platforms using four commercial mobile devices and three deep vision applications with ten state-of-the-art models. The comprehensive experiments show that Elf can speed up the applications by 4.85× while saving bandwidth usage by 52.6%.
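
For readers who want a concrete picture of the partition-and-offload idea, the sketch below splits a frame into tiles and dispatches them to several servers concurrently. The partitioning and the per-server inference call are hypothetical stand-ins (Elf's actual pipeline is region-proposal-centric and resource-aware), so treat it as a minimal illustration only.

```python
# Minimal sketch of partition-and-offload for parallel inference.
# The frame partitioning and per-server call are placeholders, not
# Elf's region-proposal-centric algorithm.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def partition_frame(frame, n_parts):
    """Split a frame into n_parts vertical strips (stand-in for the
    region-proposal-centric partitioning)."""
    return np.array_split(frame, n_parts, axis=1)


def infer_partial(server, tile):
    """Placeholder for offloading one tile to one edge server.
    A real system would serialize the tile and issue an RPC."""
    return {"server": server, "detections": int(tile.mean() > 128)}


def offload(frame, servers):
    tiles = partition_frame(frame, len(servers))
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(infer_partial, s, t) for s, t in zip(servers, tiles)]
        partial_results = [f.result() for f in futures]
    # Merge partial inference results (here: just collect them).
    return partial_results


if __name__ == "__main__":
    frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
    print(offload(frame, ["edge-a", "edge-b", "edge-c"]))
```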

60 citations


Journal ArticleDOI
TL;DR: An adaptive parallel processing optimization mechanism (APPM) is proposed to self-adaptively adjust the service graph of SFCs and intelligently solve the joint problem of PSFC deployment and scheduling to reduce the cost of service creation and increase the agility of network operations.
Abstract: By replacing traditional hardware-based middleboxes with software-based Virtual Network Functions (VNFs) running on general-purpose servers, network function virtualization represents a promising technique to reduce the cost of service creation and increase the agility of network operations. Typically, Service Function Chains (SFCs) are adopted to orchestrate dynamical network services and facilitate management of network applications. Recently, SFC parallelism that implements parallel processing of VNFs has been investigated to further improve SFC service quality. However, the unreasonable service graph of parallel processing in existing parallelized SFCs (PSFCs) might cause excessive resource consumption; incoordination between PSFC deployment and scheduling also increases the queuing delay of VNFs and degrades PSFC performance. In this article, an adaptive parallel processing optimization mechanism (APPM) is proposed to self-adaptively adjust the service graph of PSFCs and intelligently solve the joint problem of PSFC deployment and scheduling. Specifically, APPM uses a parallelism optimization algorithm (POA) based on the bin packing problem with soft bin capacity to optimize the structure of the PSFC service graph. Afterward, APPM employs a joint optimization algorithm based on reinforcement learning (JORL) to jointly deploy and schedule the PSFCs optimized by POA via the online perception of environment status. Simulation results showed that POA reduces the SFC parallelism degree and resource consumption by about 35%; JORL lowers SFC delay by reducing the queuing delay and has better overall performance than the state of the art algorithms even with limited resources.
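
The following toy heuristic illustrates the problem class POA is modelled on, bin packing with a soft capacity, by letting bins exceed their nominal capacity up to a hard limit. It is a generic first-fit-decreasing sketch with made-up load values, not the paper's algorithm.

```python
# First-fit-decreasing heuristic for bin packing with a soft capacity:
# a bin may exceed `capacity` up to `capacity * (1 + slack)`.
# Generic sketch only; not POA from the paper.
def pack_vnfs(vnf_loads, capacity, slack=0.2):
    hard_cap = capacity * (1 + slack)
    bins = []  # each bin is a list of (vnf index, load)
    for idx, load in sorted(enumerate(vnf_loads), key=lambda x: -x[1]):
        target = None
        # Prefer a bin that stays within the soft capacity.
        for b in bins:
            if sum(l for _, l in b) + load <= capacity:
                target = b
                break
        # Otherwise tolerate a bin up to the hard capacity.
        if target is None:
            for b in bins:
                if sum(l for _, l in b) + load <= hard_cap:
                    target = b
                    break
        if target is None:
            target = []
            bins.append(target)
        target.append((idx, load))
    return bins


if __name__ == "__main__":
    print(pack_vnfs([0.6, 0.4, 0.7, 0.3, 0.5], capacity=1.0))
```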

45 citations


Journal ArticleDOI
01 Feb 2021
TL;DR: A novel coherent parallel photonic DAC concept is introduced along with an experimental demonstration capable of performing this digital-to-analog conversion without optic-electric-optic domain crossing, which guarantees a linear intensity weighting among bits operating at high sampling rates, yet at a reduced footprint and power consumption compared to other photonic alternatives.
Abstract: Digital-to-analog converters (DAC) are indispensable functional units in signal processing instrumentation and wide-band telecommunication links for both civil and military applications. Since photonic systems are capable of high data throughput and low latency, an increasingly found system limitation stems from the required domain-crossing such as digital-to-analog, and electronic-to-optical. A photonic DAC implementation, in contrast, enables a seamless signal conversion with respect to both energy efficiency and short signal delay; photonic DAC demonstrations to date, however, often require bulky discrete optical components and electric-optic transformation, hence introducing inefficiencies. Here, we introduce a novel coherent parallel photonic DAC concept along with an experimental demonstration capable of performing this digital-to-analog conversion without optic-electric-optic domain crossing. This design hence guarantees a linear intensity weighting among bits operating at high sampling rates, yet at a reduced footprint and power consumption compared to other photonic alternatives. Importantly, this photonic DAC could create seamless interfaces of next-generation data processing hardware for data-centers, task-specific compute accelerators such as neuromorphic engines, and network edge processing applications.

45 citations


Journal ArticleDOI
TL;DR: In this paper, the authors demonstrate a scalable on-chip photonic implementation of a simplified recurrent neural network, called a reservoir computer, using an integrated coherent linear photonic processor, which enables scalable and ultrafast computing beyond the input electrical bandwidth.
Abstract: Photonic neuromorphic computing is of particular interest due to its significant potential for ultrahigh computing speed and energy efficiency. The advantage of photonic computing hardware lies in its ultrawide bandwidth and parallel processing utilizing inherent parallelism. Here, we demonstrate a scalable on-chip photonic implementation of a simplified recurrent neural network, called a reservoir computer, using an integrated coherent linear photonic processor. In contrast to previous approaches, both the input and recurrent weights are encoded in the spatiotemporal domain by photonic linear processing, which enables scalable and ultrafast computing beyond the input electrical bandwidth. As the device can process multiple wavelength inputs over the telecom C-band simultaneously, we can use ultrawide optical bandwidth (~5 terahertz) as a computational resource. Experiments for the standard benchmarks showed good performance for chaotic time-series forecasting and image classification. The device is considered to be able to perform 21.12 tera multiplication–accumulation operations per second (MAC·s−1) for each wavelength and can reach petascale computation speed on a single photonic chip by using wavelength division multiplexing. Our results are challenging for conventional Turing–von Neumann machines, and they confirm the great potential of photonic neuromorphic processing towards peta-scale neuromorphic super-computing on a photonic chip. Optical computing holds promise for high-speed, low-energy information processing due to its large bandwidth and ability to multiplex signals. The authors propose a recurrent neural network implementation using reservoir computing architecture in an integrated photonic processor capable of performing ~10 tera multiplication–accumulation operations per second for each wavelength channel.
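
As a software analogue of the scheme, the snippet below trains a toy echo-state reservoir with a ridge-regression readout on a one-step-ahead prediction task. All sizes and matrices are arbitrary; the photonic processor realizes the input and recurrent mixing optically rather than in NumPy.

```python
# Toy software reservoir computer (echo state network) with a linear
# ridge-regression readout. Only the training scheme is analogous to the
# paper; the device performs the linear mixing in the optical domain.
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 2000                      # reservoir size, number of time steps
u = np.sin(np.linspace(0, 60, T))     # toy input signal
y_target = np.roll(u, -1)             # task: predict the next sample

W_in = rng.uniform(-0.5, 0.5, N)
W = rng.normal(0, 1, (N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # keep the spectral radius below 1

x = np.zeros(N)
states = np.empty((T, N))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])  # reservoir update
    states[t] = x

# Ridge-regression readout trained on the first 1500 steps.
lam = 1e-6
A = states[:1500]
W_out = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y_target[:1500])
pred = states[1500:] @ W_out
print("test NMSE:", np.mean((pred - y_target[1500:]) ** 2) / np.var(y_target[1500:]))
```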

44 citations


Journal ArticleDOI
TL;DR: This article proposes a new paradigm of parallel EC algorithms by making the first attempt to parallelize the algorithm at the generation level, inspired by the industrial pipeline technique, and shows that generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
Abstract: Due to the population-based and iterative-based characteristics of evolutionary computation (EC) algorithms, parallel techniques have been widely used to speed up the EC algorithms. However, the parallelism is usually performed at the population level, where multiple populations (or subpopulations) run in parallel, or at the individual level, where the individuals are distributed to multiple resources. That is, different populations or different individuals can be executed simultaneously to reduce running time. However, research into generation-level parallelism for EC algorithms has seldom been reported. In this article, we propose a new paradigm of the parallel EC algorithm by making the first attempt to parallelize the algorithm at the generation level. This idea is inspired by the industrial pipeline technique. Specifically, a kind of EC algorithm called local version particle swarm optimization (PSO) is adopted to implement a pipeline-based parallel PSO (PPPSO, i.e., P3SO). Due to the generation-level parallelism in P3SO, when some particles are still performing their evolutionary operations in the current generation, other particles can simultaneously move on to the next generation to carry out new evolutionary operations, or even proceed to further generation(s). The experimental results show that the problem-solving ability of P3SO is not affected while its evolutionary speed is substantially accelerated. Therefore, generation-level parallelism is possible in EC algorithms and may have significant potential applications in time-consuming optimization problems.
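
A minimal way to see generation-level (pipeline) parallelism is the threaded local-best PSO below, in which a particle may advance to generation g+1 as soon as its two ring neighbours have completed generation g. It is a toy sketch under those assumptions, not the P3SO implementation evaluated in the paper.

```python
# Toy generation-level (pipeline) parallel local-best PSO on a ring
# topology: particle i only waits for its two neighbours, not the swarm.
import threading
import numpy as np

N_PARTICLES, DIM, GENS = 8, 5, 50
W, C1, C2 = 0.7, 1.5, 1.5

def sphere(x):
    return float(np.sum(x * x))

init_rng = np.random.default_rng(0)
pos = init_rng.uniform(-5, 5, (N_PARTICLES, DIM))
vel = np.zeros((N_PARTICLES, DIM))
pbest = pos.copy()
pbest_val = np.array([sphere(p) for p in pos])
done = [0] * N_PARTICLES          # last completed generation per particle
cv = threading.Condition()

def run_particle(i):
    rng = np.random.default_rng(100 + i)
    left, right = (i - 1) % N_PARTICLES, (i + 1) % N_PARTICLES
    for g in range(1, GENS + 1):
        with cv:
            # Pipeline constraint: both ring neighbours finished generation g-1.
            cv.wait_for(lambda: done[left] >= g - 1 and done[right] >= g - 1)
            lbest = pbest[min((left, i, right), key=lambda j: pbest_val[j])].copy()
        r1, r2 = rng.random(DIM), rng.random(DIM)
        vel[i] = W * vel[i] + C1 * r1 * (pbest[i] - pos[i]) + C2 * r2 * (lbest - pos[i])
        pos[i] = pos[i] + vel[i]
        val = sphere(pos[i])
        with cv:
            if val < pbest_val[i]:
                pbest_val[i], pbest[i] = val, pos[i].copy()
            done[i] = g               # this particle has completed generation g
            cv.notify_all()

threads = [threading.Thread(target=run_particle, args=(i,)) for i in range(N_PARTICLES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("best value found:", float(pbest_val.min()))
```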

40 citations



Journal ArticleDOI
TL;DR: Bayesian models provide recursive inference naturally because they can formally reconcile new data and existing scientific information as discussed by the authors, however, popular use of Bayesian methods often avoids priors and thus avoids the need for prior knowledge.
Abstract: Bayesian models provide recursive inference naturally because they can formally reconcile new data and existing scientific information. However, popular use of Bayesian methods often avoids priors ...
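
A standard beta-binomial example of this recursion, in which each posterior becomes the prior for the next data batch, is sketched below; it is generic textbook material rather than the models considered in the paper.

```python
# Recursive Bayesian updating with a conjugate beta-binomial model:
# yesterday's posterior is today's prior, so streaming data are
# reconciled with existing information without refitting from scratch.
a, b = 1.0, 1.0                        # weakly informative Beta(1, 1) prior
batches = [(7, 3), (4, 6), (9, 1)]     # (successes, failures) arriving over time

for successes, failures in batches:
    a, b = a + successes, b + failures     # conjugate update: posterior Beta(a, b)
    print(f"after batch: Beta({a:.0f}, {b:.0f}), posterior mean = {a / (a + b):.3f}")
```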

38 citations


Journal ArticleDOI
17 Jun 2021
TL;DR: In this article, state-of-the-art methods for processing remotely sensed big data are surveyed and existing parallel implementations on diverse popular high-performance computing platforms are thoroughly investigated and discussed in terms of capability, scalability, reliability, and ease of use.
Abstract: This article gives a survey of state-of-the-art methods for processing remotely sensed big data and thoroughly investigates existing parallel implementations on diverse popular high-performance computing platforms. The pros/cons of these approaches are discussed in terms of capability, scalability, reliability, and ease of use. Among existing distributed computing platforms, cloud computing is currently the most promising solution to efficient and scalable processing of remotely sensed big data due to its advanced capabilities for high-performance and service-oriented computing. We further provide an in-depth analysis of state-of-the-art cloud implementations that seek to exploit the parallelism of distributed processing of remotely sensed big data. In particular, we study a series of scheduling algorithms (GSs) aimed at distributing the computation load across multiple cloud computing resources in an optimized manner. We conduct a thorough review of different GSs and reveal the significance of employing scheduling strategies to fully exploit parallelism during the remotely sensed big data processing flow. We present a case study on large-scale remote sensing datasets to evaluate the parallel and distributed approaches and algorithms. Evaluation results demonstrate the advanced capabilities of cloud computing in processing remotely sensed big data and the improvements in computational efficiency obtained by employing scheduling strategies.
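
As a flavour of what such scheduling strategies do, the sketch below assigns remote sensing tiles to workers with a longest-processing-time-first heuristic; the cost values and worker count are invented, and none of the surveyed GSs are reproduced here.

```python
# Toy longest-processing-time-first scheduler that balances estimated
# processing costs across cloud workers. Illustrative only.
import heapq

def schedule(task_costs, n_workers):
    # Min-heap of (accumulated load, worker id); assign heaviest tasks first.
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for task, cost in sorted(enumerate(task_costs), key=lambda x: -x[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    return assignment

if __name__ == "__main__":
    tile_costs = [12, 7, 33, 5, 21, 18, 9, 27]   # hypothetical per-tile costs
    print(schedule(tile_costs, n_workers=3))
```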

34 citations


Journal ArticleDOI
TL;DR: In this article, the use of Fabry-Perot (FP) lasers as potential neuromorphic computing machines with parallel processing capabilities was introduced for signal equalization in 25 Gbaud intensity modulation direct detection optical communication systems.
Abstract: We introduce the use of Fabry-Perot (FP) lasers as potential neuromorphic computing machines with parallel processing capabilities. With the use of optical injection between a master FP laser and a slave FP laser under feedback, we demonstrate the potential for scaling up the processing power at longitudinal mode granularity and perform real-time processing for signal equalization in 25 Gbaud intensity modulation direct detection optical communication systems. We demonstrate the improvement of classification performance as the number of modes multiplies the number of virtual nodes, which also offers the capability of simultaneously processing arbitrary data streams. Extensive numerical simulations show that up to 8 longitudinal modes in typical Fabry-Perot lasers can be leveraged to enhance classification performance.

28 citations


Journal ArticleDOI
TL;DR: In this article, a multi-context based ESM (MC-ESM) is proposed to measure the similarity of two entities by taking their semantic contexts into consideration, and a parallel compact Differential Evolution with Adaptive Step Length (pcDE-ASL) is also proposed to find all entity mappings; pcDE-ASL uses a probability representation of the population to reduce memory consumption, while the parallel processing mechanism and ASL help the algorithm efficiently converge on the global optima.
Abstract: Sensor ontology is able to resolve the sensor data heterogeneity issue among the Cybertwin-driven 6G based Internet of Everything (IoE) systems. However, due to human subjectivity, the sensor ontologies also suffer from the heterogeneity problem. To address this problem, it is necessary to execute the sensor Ontology Matching (OM) process, i.e., finding the identical entity pairs between two ontologies. To this end, we first propose a Multi-Context based ESM (MC-ESM) to measure the similarity of two entities by taking into consideration their semantic contexts. After that, a parallel compact Differential Evolution with Adaptive Step Length (pcDE-ASL) is proposed to find all entity mappings, which uses a probability representation of the population to reduce memory consumption, and the parallel processing mechanism and ASL to help the algorithm efficiently converge on the global optima. The experimental results show that pcDE-ASL is both effective and efficient.

27 citations


Journal ArticleDOI
TL;DR: In this article, few-layer optical diffractive neural networks (ODNNs) are proposed to perform optical logical operations by independently manipulating the mode and spatial position of multiple orbital angular momentum (OAM) modes.
Abstract: Optical logical operations demonstrate the key role of optical digital computing, which can perform general-purpose calculations and possess fast processing speed, low crosstalk, and high throughput. The logic states usually refer to linear momentums that are distinguished by intensity distributions, which blur the discrimination boundary and limit its sustainable applications. Here, we introduce orbital angular momentum (OAM) mode logical operations performed by optical diffractive neural networks (ODNNs). Using the OAM mode as a logic state not only can improve the parallel processing ability but also enhance the logic distinction and robustness of logical gates owing to the mode infinity and orthogonality. ODNN combining scalar diffraction theory and deep learning technology is designed to independently manipulate the mode and spatial position of multiple OAM modes, which allows for complex multilight modulation functions to respond to logic inputs. We show that few-layer ODNNs successfully implement the logical operations of AND, OR, NOT, NAND, and NOR in simulations. The logic units of XNOR and XOR are obtained by cascading the basic logical gates of AND, OR, and NOT, which can further constitute logical half-adder gates. Our demonstrations may provide a new avenue for optical logical operations and are expected to promote the practical application of optical digital computing.

Journal ArticleDOI
TL;DR: The results show the proposed platform can run deep learning algorithms on embedded devices while meeting the high scalability and fault tolerance required for real-time video processing.
Abstract: Real-time intelligent video processing on embedded devices with low power consumption can be useful for applications like drone surveillance, smart cars, and more. However, the limited resources of embedded devices are a challenging issue for effective embedded computing. Most of the existing work on this topic focuses on single device based solutions, without the use of cloud computing mechanisms for parallel processing to boost performance. In this paper, we propose a cloud platform for real-time video processing based on embedded devices. Eight NVIDIA Jetson TX1 and three Jetson TX2 GPUs are used to construct a streaming embedded cloud platform (SECP), on which Apache Storm is deployed as the cloud computing environment for deep learning algorithms (Convolutional Neural Networks - CNNs) to process video streams. Additionally, self-managing services are designed to ensure that this platform can run smoothly and stably, in the form of a metric sensor, a bottleneck detector and a scheduler. This platform is evaluated in terms of processing speed, power consumption, and network throughput by running various deep learning algorithms for object detection. The results show the proposed platform can run deep learning algorithms on embedded devices while meeting the high scalability and fault tolerance required for real-time video processing.
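
The queue-based sketch below mimics the spout/bolt structure of such a topology with Python processes: one source feeds frames, several workers run a stub "detection", and a sink collects results. It is only a structural illustration, not the SECP platform or its self-managing services.

```python
# Minimal spout -> parallel bolts -> sink pipeline using multiprocessing
# queues, loosely mirroring a Storm topology. Frame capture and the CNN
# are stubs; this is not the SECP implementation.
import multiprocessing as mp


def spout(frame_q, n_frames, n_workers):
    for i in range(n_frames):
        frame_q.put(i)                 # stand-in for a captured video frame
    for _ in range(n_workers):
        frame_q.put(None)              # poison pills to stop the bolts


def detect_bolt(frame_q, result_q):
    while True:
        frame = frame_q.get()
        if frame is None:
            result_q.put(None)
            break
        result_q.put((frame, f"objects_in_frame_{frame}"))  # stub inference


if __name__ == "__main__":
    n_workers, n_frames = 3, 10
    frame_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=spout, args=(frame_q, n_frames, n_workers))]
    procs += [mp.Process(target=detect_bolt, args=(frame_q, result_q))
              for _ in range(n_workers)]
    for p in procs:
        p.start()
    finished = 0
    while finished < n_workers:
        item = result_q.get()
        if item is None:
            finished += 1
        else:
            print("sink received:", item)
    for p in procs:
        p.join()
```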

Journal ArticleDOI
TL;DR: This paper solves 2D and 3D second-order partial differential equations with the Generalized Finite Difference Method using third- and fourth-order approximations, and proposes a strategy that gives excellent results both for detecting ill-conditioned stars and for increasing the accuracy of the numerical approximation.
Abstract: In this paper, we solve 2D and 3D second-order partial differential equations considering the Generalized Finite Difference Method with third- and fourth-order approximations. We analyze the influence of the number of points per star and establish some values as references. We propose a new strategy to deal with ill-conditioned stars, which are frequent in higher-order approximations. This strategy uses a few points per star and presents excellent results both for detecting ill-conditioned stars and for increasing the accuracy of the numerical approximation. We apply parallel processing to the formation of stars and derivatives, and show the speedup achieved.
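
The star-formation step parallelizes naturally because each node's star depends only on the node coordinates. The sketch below forms stars (k nearest neighbours per node) across worker processes with brute-force distances; the node cloud and k are arbitrary, and the derivative computation is omitted.

```python
# Parallel formation of GFDM "stars": each node plus its k nearest
# neighbours, computed chunk-by-chunk in separate processes.
# Brute-force distances for clarity; illustrative sizes only.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def stars_for_chunk(args):
    chunk_idx, nodes, k = args
    stars = {}
    for i in chunk_idx:
        d2 = np.sum((nodes - nodes[i]) ** 2, axis=1)
        # k nearest neighbours, excluding the node itself
        stars[i] = np.argsort(d2)[1:k + 1].tolist()
    return stars

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nodes = rng.random((2000, 3))            # 3D cloud of GFDM nodes
    k = 25                                   # points per star
    chunks = np.array_split(np.arange(len(nodes)), 4)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = pool.map(stars_for_chunk, [(c, nodes, k) for c in chunks])
    stars = {}
    for r in results:
        stars.update(r)
    print("star of node 0:", stars[0])
```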

Journal ArticleDOI
TL;DR: Powerful parallel processing tools, including modern Graphics Processing Units (GPUs), have enabled new deep learning algorithms such as Convolutional Neural Networks (CNNs).
Abstract: Due to the advent of powerful parallel processing tools, including modern Graphics Processing Units (GPU), new deep learning algorithms, such as Convolutional Neural Networks (CNNs), have significa...

Journal ArticleDOI
TL;DR: In this article, a novel highly parallel and flexible hardware architecture for the 5G LDPC decoder is proposed, targeting field-programmable gate array (FPGA) devices.
Abstract: The quasi-cyclic (QC) low-density parity-check (LDPC) code is a key error correction code for the fifth generation (5G) of cellular network technology. Designed to support several frame sizes and code rates, the 5G LDPC code structure allows high parallelism to deliver the high demanding data rate of 10 Gb/s. This impressive performance introduces challenging constraints on the hardware design. Particularly, allowing such high flexibility can introduce processing rate penalties on some configurations. In this context, a novel highly parallel and flexible hardware architecture for the 5G LDPC decoder is proposed, targeting field-programmable gate array (FPGA) devices. The architecture supports frame parallelism to maximize the utilization of the processing units, significantly improving the processing rate. The controller unit was carefully designed to support all 5G configurations and to avoid update conflicts. Furthermore, an efficient data scheduling is proposed to increase the processing rate. Compared to the recent related state of the art, the proposed FPGA prototype achieves a higher processing rate per hardware resource for most configurations.

Journal ArticleDOI
TL;DR: A particular approach for the Jordan network will be shown; however, the presented idea is applicable to other RNN structures and can be implemented in digital hardware.

Proceedings ArticleDOI
22 May 2021
TL;DR: This work proposes a multi-level parallel hardware accelerator for homomorphic computations in machine learning; to address the core computation in neural networks, it is designed to support Multiply-Accumulate operations natively between ciphertexts.
Abstract: Homomorphic Encryption (HE) allows untrusted parties to process encrypted data without revealing its content. People could encrypt the data locally and send it to the cloud to conduct neural network training or inferencing, which achieves data privacy in AI. However, the combined AI and HE computation could be extremely slow. To deal with this, we propose a multi-level parallel hardware accelerator for homomorphic computations in machine learning. The vectorized Number Theoretic Transform (NTT) unit is designed to form the low-level parallelism, and we apply a Residue Number System (RNS) to form the mid-level parallelism in one polynomial. Finally, a fully pipelined and parallel accelerator for two ciphertext operands is proposed to form the high-level parallelism. To address the core computation (matrix-vector multiplication) in neural networks, our work is designed to support Multiply-Accumulate (MAC) operations natively between ciphertexts. We have analyzed our design on the ZCU102 FPGA, and experimental results show that it outperforms previous works and achieves more than an order of magnitude acceleration over software implementations.
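
For reference, the snippet below gives a plain-software iterative NTT over a common NTT-friendly prime and uses it for a polynomial product, the kernel that the vectorized NTT unit accelerates; it makes no attempt to model the RNS decomposition or the pipelined ciphertext datapath.

```python
# Reference iterative number-theoretic transform (NTT) modulo
# p = 998244353 (primitive root 3) and a polynomial product built on it.
P = 998244353
G = 3

def ntt(a, invert=False):
    a = list(a)
    n = len(a)                      # n must be a power of two dividing p - 1
    j = 0
    for i in range(1, n):           # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)
        if invert:
            w_len = pow(w_len, P - 2, P)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % P
                a[k], a[k + length // 2] = (u + v) % P, (u - v) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a

def poly_mul(f, g):
    """Polynomial product via forward NTTs, pointwise multiply, inverse NTT."""
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa = ntt(f + [0] * (n - len(f)))
    gb = ntt(g + [0] * (n - len(g)))
    prod = ntt([x * y % P for x, y in zip(fa, gb)], invert=True)
    return prod[:len(f) + len(g) - 1]

print(poly_mul([1, 2, 3], [4, 5]))   # -> [4, 13, 22, 15]
```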

Journal ArticleDOI
TL;DR: In this paper, the authors study the parallel k-means algorithm in MapReduce and parallelize the distance calculation process that provides independence between the data objects to perform cluster analysis in parallel.
Abstract: At present, the explosive growth of data and the mass storage state have brought many problems such as computational complexity and insufficient computational power to clustering research. The distributed computing platform through load balancing dynamically configures a large number of virtual computing resources, effectively breaking through the bottleneck of time and energy consumption, and embodies its unique advantages in massive data mining. This paper studies parallel k-means extensively. It first performs random sampling for initialization and then parallelizes the distance calculation, which is independent across data objects, to perform cluster analysis in parallel. After the parallel processing with MapReduce, many nodes calculate distances, which speeds up the algorithm. Finally, the clustering of data objects is parallelized. Results show that our method can provide services efficiently and stably and has good convergence.
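
The map/reduce split described here can be sketched in a few lines: each "map" task assigns its chunk of points to the nearest centroid and returns partial sums, and the "reduce" step merges them into new centroids. This is a generic illustration with synthetic data, not the paper's MapReduce job.

```python
# Map/reduce-style k-means iteration: parallel assignment ("map") and
# merged centroid update ("reduce"). Generic sketch with synthetic data.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def map_assign(args):
    chunk, centroids = args
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    for c in range(k):
        mask = labels == c
        sums[c] = chunk[mask].sum(axis=0)
        counts[c] = mask.sum()
    return sums, counts

def kmeans_step(data, centroids, n_workers=4):
    chunks = np.array_split(data, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_assign, [(c, centroids) for c in chunks]))
    total_sums = sum(s for s, _ in partials)
    total_counts = sum(c for _, c in partials)
    return total_sums / np.maximum(total_counts, 1)[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(10000, 2)) + rng.integers(0, 3, (10000, 1)) * 5
    centroids = data[rng.choice(len(data), 3, replace=False)]
    for _ in range(10):
        centroids = kmeans_step(data, centroids)
    print(centroids.round(2))
```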

Journal ArticleDOI
Abstract: Sequential region labelling, also known as connected components labelling, is a standard image segmentation problem that joins contiguous foreground pixels into blobs. Despite its long development ...

Proceedings ArticleDOI
01 Feb 2021
TL;DR: The content-addressable parallel processing paradigm (CAPP) as discussed by the authors is an in-situ PIM architecture that leverages content addressable memories to realize bit-serial arithmetic and logic operations via sequences of search and update operations over multiple memory rows in parallel.
Abstract: Processing-in-memory (PIM) architectures attempt to overcome the von Neumann bottleneck by combining computation and storage logic into a single component. The content-addressable parallel processing paradigm (CAPP) from the seventies is an in-situ PIM architecture that leverages content-addressable memories to realize bit-serial arithmetic and logic operations, via sequences of search and update operations over multiple memory rows in parallel. In this paper, we set out to investigate whether the concepts behind classic CAPP can be used successfully to build an entirely CMOS-based, general-purpose microarchitecture that can deliver manyfold speedups while remaining highly programmable. We conduct a full-stack design of a Content-Addressable Processing Engine (CAPE), built out of dense push-rule 6T SRAM arrays. CAPE is programmable using the RISC-V ISA with standard vector extensions. Our experiments show that CAPE achieves an average speedup of 14× (up to 254×) over an area-equivalent (slightly under 9 mm² at 7 nm) out-of-order processor core with three levels of caches.
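
The toy model below captures the CAPP flavour of computation: operands sit in rows, and addition proceeds bit-serially as a fixed sequence of pattern searches and row updates applied to all rows at once. It is a conceptual sketch in NumPy, not CAPE's microarchitecture.

```python
# Toy CAPP-style bit-serial addition: for each bit position, "search" for
# every full-adder input pattern across all rows in parallel, then
# "update" the sum bit and carry in the matched rows.
import numpy as np

BITS, ROWS = 8, 6
rng = np.random.default_rng(0)
a = rng.integers(0, 2, (ROWS, BITS))        # a[:, i] is bit i (LSB first)
b = rng.integers(0, 2, (ROWS, BITS))
s = np.zeros((ROWS, BITS), dtype=int)       # sum bits
carry = np.zeros(ROWS, dtype=int)

# Full-adder truth table: (a_bit, b_bit, carry_in) -> (sum_bit, carry_out)
TRUTH = {(x, y, c): (x ^ y ^ c, int(x + y + c >= 2))
         for x in (0, 1) for y in (0, 1) for c in (0, 1)}

for i in range(BITS):
    new_s = np.zeros(ROWS, dtype=int)
    new_c = np.zeros(ROWS, dtype=int)
    for (x, y, c), (sb, cb) in TRUTH.items():
        # "Search": all rows whose (a_i, b_i, carry) match this pattern.
        match = (a[:, i] == x) & (b[:, i] == y) & (carry == c)
        # "Update": write the sum bit and next carry into matched rows.
        new_s[match], new_c[match] = sb, cb
    s[:, i], carry = new_s, new_c

def to_int(bits):
    return bits @ (1 << np.arange(BITS))

print(to_int(a) + to_int(b))                   # expected sums
print(to_int(s) + carry * (1 << BITS))         # CAPP-style result (matches)
```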

Journal ArticleDOI
TL;DR: It is formally shown that, while the maximum number of tasks that can be performed simultaneously grows linearly with network size, under realistic scenarios (e.g. in an unpredictable environment), the expected number that could be performed concurrently grows radically sub-linearly with network size.
Abstract: The ability to learn new tasks and generalize to others is a remarkable characteristic of both human brains and recent artificial intelligence systems. The ability to perform multiple tasks simultaneously is also a key characteristic of parallel architectures, as is evident in the human brain and exploited in traditional parallel architectures. Here we show that these two characteristics reflect a fundamental tradeoff between interactive parallelism, which supports learning and generalization, and independent parallelism, which supports processing efficiency through concurrent multitasking. Although the maximum number of possible parallel tasks grows linearly with network size, under realistic scenarios their expected number grows sublinearly. Hence, even modest reliance on shared representations, which support learning and generalization, constrains the number of parallel tasks. This has profound consequences for understanding the human brain’s mix of sequential and parallel capabilities, as well as for the development of artificial intelligence systems that can optimally manage the tradeoff between learning and processing efficiency. The ability to perform multiple tasks simultaneously is a key characteristic of parallel architectures. Using methods from statistical physics, this study provides analytical results that quantify the limitations of processing capacity for different types of tasks in neural networks.

Journal ArticleDOI
TL;DR: Using C# under the .NET Framework, six design principles and six design patterns are applied to the user-friendly GNSSer software, with the purpose of bridging the gap between the design and implementation of object-oriented methods.
Abstract: To cope with the construction and upgrading of GNSS, critical attention must be given to the maintainability, reusability, extensibility, portability, loose coupling capacity and flexibility of GNS...

Proceedings ArticleDOI
01 Jul 2021
TL;DR: This paper investigates the complexity of VVC decoder processing blocks and presents a highly optimized decoder implementation that can achieve 4K 60fps VVC real-time decoding on an x86-based CPU, using the SIMD instruction extensions of the processor and additional parallel processing, including data- and task-level parallelism.

Journal ArticleDOI
TL;DR: In this paper, a 10T static random access memory (SRAM) bit-cell is proposed for fully parallel computing and high throughput using 32 parallel binary MAC operations, which achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and 169.9 TOPS/mm² throughput area efficiency.
Abstract: Computing-in-memory (CIM) is a promising approach to reduce latency and improve the energy efficiency of the multiply-and-accumulate (MAC) operation under a memory wall constraint for artificial intelligence (AI) edge processors. This paper proposes an approach focusing on scalable CIM designs using a new ten-transistor (10T) static random access memory (SRAM) bit-cell. Using the proposed 10T SRAM bit-cell, we present two SRAM-based CIM (SRAM-CIM) macros supporting multibit and binary MAC operations. The first design achieves fully parallel computing and high throughput using 32 parallel binary MAC operations. Advanced circuit techniques such as an input-dependent dynamic reference generator and an input-boosted sense amplifier are presented. Fabricated in a 28 nm CMOS process, this design achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and a 169.9 TOPS/mm² throughput area efficiency. The proposed approach effectively solves previous problems such as writing disturb, throughput, and the power consumption of an analog to digital converter (ADC). The second design supports multibit MAC operation (4-b weight, 4-b input, and 8-b output) to increase the inference accuracy. We propose an architecture that divides the 4-b weight and 4-b input multiplication into four 2-b multiplications in parallel, which increases the signal margin by 16× compared to conventional 4-b multiplication. Besides, the capacitive digital-to-analog converter (CDAC) area issue is effectively addressed using the intrinsic bit-line capacitance existing in the SRAM-CIM architecture. The proposed approach of realizing four 2-b parallel multiplications using the CDAC is successfully demonstrated with a modified LeNet-5 neural network. These results demonstrate that the proposed 10T bit-cell is promising for realizing robust and scalable SRAM-CIM designs, which is essential for realizing fully parallel edge computing.
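
The 2-b decomposition the second macro relies on is just an arithmetic identity, verified below: a 4-b × 4-b product equals four 2-b × 2-b partial products combined with shifts. The check says nothing about the analog CDAC behaviour, only the arithmetic.

```python
# Numerical check: a 4-bit x 4-bit product reconstructed from four
# 2-bit x 2-bit partial products with the appropriate binary shifts.
def mul4_from_2bit_parts(w, x):
    assert 0 <= w < 16 and 0 <= x < 16
    w_hi, w_lo = w >> 2, w & 0b11          # split the 4-bit weight into 2-bit halves
    x_hi, x_lo = x >> 2, x & 0b11          # split the 4-bit input into 2-bit halves
    partial = (w_lo * x_lo,                # weight 2^0
               w_lo * x_hi,                # weight 2^2
               w_hi * x_lo,                # weight 2^2
               w_hi * x_hi)                # weight 2^4
    return partial[0] + ((partial[1] + partial[2]) << 2) + (partial[3] << 4)

assert all(mul4_from_2bit_parts(w, x) == w * x for w in range(16) for x in range(16))
print("4-b x 4-b product reconstructed from four 2-b x 2-b partial products")
```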

Journal ArticleDOI
TL;DR: This study proposes a secure storage, processing and transmission (SSPT) technique based on two modules, the Advanced Encryption Standard (AES) in Electronic Code Book (ECB) mode and AES in Cipher-based Message Authentication Code (CMAC) mode, which are verified against the test vectors given by international standards.
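
A generic way to combine the two named modules in software is shown below using the third-party `cryptography` package: AES-ECB for block encryption and AES-CMAC for an integrity tag. Keys and data are placeholders and the sketch is not the SSPT design (note also that ECB reveals repeated plaintext blocks and is normally avoided for general data).

```python
# Illustrative AES-ECB encryption plus AES-CMAC tag with the `cryptography`
# package. Placeholder keys/data; generic sketch, not the SSPT technique.
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.cmac import CMAC

enc_key = os.urandom(16)
mac_key = os.urandom(16)
plaintext = b"sensor frame 0001: temperature=21.5C pressure=1013hPa"

# Pad to the 128-bit block size required by ECB, then encrypt block by block.
padder = padding.PKCS7(128).padder()
padded = padder.update(plaintext) + padder.finalize()
encryptor = Cipher(algorithms.AES(enc_key), modes.ECB()).encryptor()
ciphertext = encryptor.update(padded) + encryptor.finalize()

# Authenticate the ciphertext with AES-CMAC under a separate key.
c = CMAC(algorithms.AES(mac_key))
c.update(ciphertext)
tag = c.finalize()
print(ciphertext.hex(), tag.hex(), sep="\n")
```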

Book ChapterDOI
13 Apr 2021
TL;DR: In this article, a multi-sensor Schmidt-Kalman filter (MSSKF) based coupled bias estimation problem is considered for the single-target, multiple-sensor case, where the MSSKF augments the state vector with the bias vector for bias estimation.
Abstract: Accelerators are gaining predominant attention in HW/SW and embedded designs due to their lower power consumption and parallel data processing capabilities compared to standard microprocessors and FPGAs. In this paper, an MSSKF (Multi-sensor Schmidt–Kalman filter) based coupled bias estimation problem is considered for the single-target, multiple-sensor case. Here the MSSKF augments the state vector with the bias vector for bias estimation, which becomes computationally expensive as the dimensions of the state and the number of sensors increase. Hence, to address the computational complexity, digital signal processing (DSP) architectures are proposed to accelerate the algorithm and meet real-time constraints. In the MSSKF algorithm, most of the computational load is due to state covariance prediction and innovation covariance prediction. To realize the state covariance and innovation covariance, a folded DSP architecture and a parallel processing based folded DSP architecture are proposed, respectively. The matrix multiplications are addressed with systolic arrays to gain the advantage of latency and parallel processing. Moreover, the MSSKF using systolic array architectures was simulated and synthesized in Vivado 2018.1 using Verilog and implemented on an FPGA Zynq-7000 board. The performance of the systolic-based accelerator realization was compared with normal matrix multiplication.
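
The two products named as the bottleneck, state covariance prediction and innovation covariance, are written out below for a bias-augmented state in NumPy with arbitrary toy dimensions; the paper instead maps these multiplications onto folded and systolic DSP architectures.

```python
# The covariance products that dominate the MSSKF load, shown for a
# bias-augmented state. Matrices and sizes are arbitrary placeholders.
import numpy as np

n_state, n_sensors, n_meas = 6, 3, 2          # toy sizes
n_aug = n_state + n_sensors * n_meas          # state augmented with sensor biases

rng = np.random.default_rng(0)
F = np.eye(n_aug)                             # biases modelled as (near-)constant
F[:n_state, :n_state] += 0.01 * rng.standard_normal((n_state, n_state))
Q = 0.01 * np.eye(n_aug)
P = np.eye(n_aug)

H = np.zeros((n_meas, n_aug))
H[:, :n_state] = rng.standard_normal((n_meas, n_state))
H[:, n_state:n_state + n_meas] = np.eye(n_meas)   # first sensor's bias in the measurement
R = 0.1 * np.eye(n_meas)

P_pred = F @ P @ F.T + Q                      # state covariance prediction
S = H @ P_pred @ H.T + R                      # innovation covariance
print("P_pred shape:", P_pred.shape, "S shape:", S.shape)
```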

Journal ArticleDOI
01 Jan 2021
TL;DR: The proposed design offers sufficient processing power for the execution of state-of-the-art CNNs in real-time by utilizing a combination of data-level parallelism (DLP), instruction-level parallelism (ILP), and subword parallelism, while consuming between 972 mW and 340 mW of power.
Abstract: ConvAix is an application-specific instruction-set processor (ASIP) that enables the energy-efficient processing of convolutional neural networks (CNNs) while retaining substantial flexibility through its instruction-set architecture (ISA) based design. By utilizing a combination of data-level parallelism (DLP), instruction-level parallelism (ILP), and subword parallelism, the proposed design offers sufficient processing power for the execution of state-of-the-art CNNs in real-time. ConvAix’s arithmetic logic units (ALUs) are C-programmable, thereby offering the degree of flexibility required to implement many different convolution layer types, e.g., depthwise-separable convolutions and residual blocks, as well as fully-connected and pooling layers. It comprises a total of 256 ALUs and leverages low-precision computations down to 4 bits. Furthermore, it exploits sparsity in feature maps and weights via zero-guarding of redundant computations to maximize its energy efficiency. The processor was implemented in a modern 28 nm CMOS technology operating at a 1 V supply voltage with a resulting clock frequency of 513 MHz. The final design offers a precision-dependent peak throughput between 263 GOP/s (int16) and 1.1 TOP/s (int4), while consuming between 972 mW and 340 mW of power, resulting in effective energy-efficiencies ranging from 176 GOP/s/W to 2 TOP/s/W. Well-known CNNs, such as AlexNet, MobileNet, and ResNet-18, are simulated based on the placed and routed netlist, achieving between 233 (AlexNet) and 69 (ResNet-18) frames-per-second for a batch-size of 1, including times for off-chip transfers.

Journal ArticleDOI
TL;DR: In this article, the authors calculate the fractal dimension of the border of India and the coastline of India using a novel multicore parallel processing algorithm, by both the divider method and the box-counting method.
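
A box-counting estimate parallelizes naturally over box sizes, as in the sketch below, which counts occupied boxes of several sizes in separate processes and fits the log-log slope. The curve is a synthetic random walk, not the border or coastline data, and the divider method is not shown.

```python
# Box-counting dimension of a planar curve with the per-scale counts
# evaluated in parallel processes. Synthetic data; illustrative only.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def count_boxes(args):
    points, eps = args
    # Number of occupied eps-sized boxes covering the curve.
    boxes = {tuple(b) for b in np.floor(points / eps).astype(int)}
    return eps, len(boxes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    curve = np.cumsum(rng.standard_normal((200000, 2)), axis=0)
    curve = (curve - curve.min(0)) / (curve.max(0) - curve.min(0)).max()
    sizes = [2.0 ** -k for k in range(2, 9)]
    with ProcessPoolExecutor() as pool:
        counts = dict(pool.map(count_boxes, [(curve, e) for e in sizes]))
    log_inv_eps = np.log([1 / e for e in sizes])
    log_n = np.log([counts[e] for e in sizes])
    slope, _ = np.polyfit(log_inv_eps, log_n, 1)   # dimension = slope of log N vs log 1/eps
    print(f"estimated box-counting dimension: {slope:.2f}")
```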

Journal ArticleDOI
TL;DR: This paper addresses polystore issues by using the polyglot approach of the CloudMdsQL query language, which allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration, thus allowing native scripts to be processed in parallel at data store shards.
Abstract: The blooming of different data stores has made polystores a major topic in the cloud and big data landscape. As the amount of data grows rapidly, it becomes critical to exploit the inherent parallel processing capabilities of underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store’s native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e. joins, on top of parallel retrieval of underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language that allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration and (ii) incorporating the approach within the LeanXcale distributed query engine, thus allowing for native scripts to be processed in parallel at data store shards. In addition, (iii) efficient optimization techniques, such as bind join, can take place to improve the performance of selective joins. We evaluate the performance benefits of exploiting parallelism in combination with high expressivity and optimization through our experimental validation.
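
The bind join mentioned in point (iii) can be illustrated with two in-memory "stores": the selective outer side is evaluated first and its keys are bound into the probe of the large inner side. The dictionaries and column names below are invented for illustration.

```python
# Minimal bind-join illustration: evaluate the selective outer side first,
# then retrieve only inner rows whose join keys were "bound" by it.
def bind_join(outer_rows, inner_store, key):
    bound_keys = {row[key] for row in outer_rows}
    # In a real polystore this set would be pushed down as an IN-list
    # inside the inner store's native query.
    inner_rows = [r for r in inner_store if r[key] in bound_keys]
    index = {r[key]: r for r in inner_rows}
    return [{**o, **index[o[key]]} for o in outer_rows if o[key] in index]

customers = [{"cid": 1, "name": "Ada"}, {"cid": 7, "name": "Lin"}]      # selective side
orders = [{"cid": c, "total": 10 * c} for c in range(1, 1000)]          # large side
print(bind_join(customers, orders, key="cid"))
```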

Journal ArticleDOI
TL;DR: In this paper, the authors present NPB-CPP, a fully C++ translated version of NPB consisting of all the NPB kernels and pseudo-applications developed using OpenMP, Intel TBB, and FastFlow parallel frameworks for multicores.