
Showing papers on "Parallel processing (DSP implementation)" published in 2018


Journal ArticleDOI
TL;DR: Results show that the proposed algorithm reduces the amount of computation during execution, greatly reduces memory consumption, and improves counting speed in railway signal systems.

117 citations


Journal ArticleDOI
TL;DR: An optimization algorithm based on parallel versions of the bat algorithm, random-key encoding scheme, communication strategy scheme and makespan scheme is proposed to solve the NP-hard job shop scheduling problem.
Abstract: Parallel processing plays an important role in efficient and effective computation for function optimization. In this paper, an optimization algorithm based on a parallel version of the bat algorithm (BA), a random-key encoding scheme, a communication strategy, and a makespan scheme is proposed to solve the NP-hard job shop scheduling problem. The aim of the parallel BA with communication strategies is to correlate individuals in swarms and to share the computational load over several processors. Based on the original structure of the BA, the bat population is split into several independent groups. In addition, the communication strategy provides diversity-enhanced bats to speed up convergence. In the experiments, forty-three benchmark instances of the job shop scheduling data set with various sizes are used to test the convergence behavior and accuracy of the proposed method. Comparisons with other methods in the literature show that the proposed scheme achieves better convergence and accuracy than BA and particle swarm optimization.
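
As a rough illustration of the parallel BA idea (independent sub-swarms plus periodic migration of good solutions), here is a minimal Python sketch. The random-key decoding and makespan evaluation used for job shop scheduling are omitted, and all group sizes, rates, and the migration period are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def parallel_ba(obj, dim, n_groups=4, bats_per_group=10,
                iters=200, migrate_every=20, fmin=0.0, fmax=2.0):
    rng = np.random.default_rng(0)
    # Each group evolves an independent sub-swarm (positions, velocities).
    x = rng.uniform(0, 1, (n_groups, bats_per_group, dim))
    v = np.zeros_like(x)
    best = x[:, 0].copy()                      # per-group best position
    best_f = np.array([obj(b) for b in best])  # per-group best fitness
    for t in range(iters):
        for g in range(n_groups):
            freq = fmin + (fmax - fmin) * rng.random((bats_per_group, 1))
            v[g] += (x[g] - best[g]) * freq
            x[g] = np.clip(x[g] + v[g], 0, 1)
            fit = np.array([obj(b) for b in x[g]])
            i = fit.argmin()
            if fit[i] < best_f[g]:
                best_f[g], best[g] = fit[i], x[g][i].copy()
        # Communication strategy: every few iterations each group
        # replaces its worst bat with the current global best solution.
        if (t + 1) % migrate_every == 0:
            donor = best[best_f.argmin()].copy()
            for g in range(n_groups):
                worst = np.array([obj(b) for b in x[g]]).argmax()
                x[g][worst] = donor
    g = best_f.argmin()
    return best[g], best_f[g]

# Example: minimise the sphere function in 5 dimensions.
pos, val = parallel_ba(lambda z: float(np.sum(z ** 2)), dim=5)
```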

106 citations


Proceedings ArticleDOI
02 Jun 2018
TL;DR: This work proposes a two-stage, prediction-based DNN execution model without accuracy loss, together with a uniform serial processing element (USPE) shared by the prediction and execution stages to improve flexibility and minimize area overhead.
Abstract: Recently, deep neural network based approaches have emerged as indispensable tools in many fields, ranging from image and video recognition to natural language processing. However, the large size of such newly developed networks poses both throughput and energy challenges to the underlying processing hardware, which could be a major stumbling block for promising applications such as self-driving cars and smart cities. Existing work proposes to weed out zeros from input neurons to avoid unnecessary DNN computation (zero-valued operand multiplications). However, we observe that many output neurons are still ineffectual even after the zero-removal technique has been applied. These ineffectual output neurons cannot pass their values to the subsequent layer, which means all the computations (both zero-valued and non-zero-valued operand multiplications) related to these output neurons are futile and wasteful. There is therefore an opportunity to significantly improve the performance and efficiency of DNN execution by predicting the ineffectual output neurons and skipping over them, completely avoiding the futile computations. To do so, we propose a two-stage, prediction-based DNN execution model without accuracy loss. We also propose a uniform serial processing element (USPE) shared by the prediction and execution stages to improve flexibility and minimize area overhead. To improve processing throughput, we further present a scale-out design for the USPE. Evaluation results over a set of state-of-the-art DNNs show that our proposed design achieves a 2.5× speedup and 1.9× better energy efficiency on average over a traditional accelerator. Moreover, by stacking with our design, we can improve Cnvlutin and Stripes by 1.9× and 2.0× on average, respectively.
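
A minimal NumPy sketch of the two-stage predict-then-execute idea: a cheap low-precision pass flags output neurons that ReLU would zero out, and only the rest are computed exactly. Unlike the paper's scheme, this simple sign-prediction heuristic can mispredict, and it says nothing about the USPE hardware; all names and bit widths are illustrative assumptions.

```python
import numpy as np

def predict_then_execute(x, W, bits=4):
    # Stage 1 (prediction): a low-precision matmul estimates each output
    # neuron; neurons predicted non-positive would be zeroed by ReLU,
    # so their exact computation is skipped.
    scale = 2 ** (bits - 1) - 1
    xq = np.round(x / (np.abs(x).max() + 1e-12) * scale)
    Wq = np.round(W / (np.abs(W).max() + 1e-12) * scale)
    effectual = (xq @ Wq) > 0       # predicted-effectual output neurons
    # Stage 2 (execution): full-precision compute only where needed.
    y = np.zeros(W.shape[1])
    y[effectual] = x @ W[:, effectual]
    return np.maximum(y, 0), effectual

rng = np.random.default_rng(1)
x, W = rng.standard_normal(256), rng.standard_normal((256, 64))
y, mask = predict_then_execute(x, W)
print(f"computed {mask.sum()} of {mask.size} output neurons")
```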

77 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This paper reviews two of the hottest topics in the area, distributed parallel processing and distributed cloud computing, including the concept of decreasing response time in distributed parallel computing.
Abstract: In this paper, we review two of the hottest topics in this area, namely distributed parallel processing and distributed cloud computing. Various aspects are discussed, such as whether these topics have been treated together in previous work. We also review the algorithms that have been simulated in both distributed parallel computing and distributed cloud computing. The goal is to process tasks over the available resources and then readjust the computation among the servers for the sake of optimization, which helps improve system performance at the desired rates. In our review, we present some articles that explain the design of applications in distributed cloud computing, while others introduce the concept of decreasing response time in distributed parallel computing.

70 citations


Journal ArticleDOI
TL;DR: A parallel image encryption method based on bitplane decomposition is proposed; its ability to encrypt multiple bitplanes in parallel increases encryption speed and makes it suitable for real-time applications.
Abstract: Image encryption is an efficient technique to protect image content from unauthorized parties. In this paper, a parallel image encryption method based on bitplane decomposition is proposed. The original grayscale image is converted to a set of binary images by the local binary pattern (LBP) technique and bitplane decomposition (BPD) methods. Then, permutation and substitution steps are performed by a genetic algorithm (GA) using crossover and mutation operations. Finally, the scrambled bitplanes are combined to obtain the encrypted image. Instead of random population selection in the GA, a deterministic method with security keys is utilized to improve the security level. The proposed method can encrypt multiple bitplanes in parallel; this distributed GA with multiple populations increases encryption speed and makes the method suitable for real-time applications. Simulations and security analysis demonstrate the efficiency of our algorithm.
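
The parallelism comes from treating each bitplane as an independent encryption unit. A minimal sketch follows, with a key-driven permutation standing in for the paper's GA-based crossover/mutation and the LBP step omitted; the pool type and key choice are illustrative assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor  # process pool also works

def bitplanes(img):
    # Decompose an 8-bit grayscale image into 8 binary bitplanes.
    return [(img >> b) & 1 for b in range(8)]

def scramble(args):
    plane, key = args
    # Key-seeded permutation: deterministic, hence invertible with the key.
    rng = np.random.default_rng(key)
    perm = rng.permutation(plane.size)
    return plane.ravel()[perm].reshape(plane.shape)

def encrypt(img, keys):
    planes = bitplanes(img)
    with ThreadPoolExecutor(max_workers=8) as pool:  # one worker per plane
        enc = list(pool.map(scramble, zip(planes, keys)))
    return sum(p << b for b, p in enumerate(enc)).astype(np.uint8)

img = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
cipher = encrypt(img, keys=range(8))
```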

49 citations


Journal ArticleDOI
TL;DR: Four in-memory algorithms for efficient execution of fixed-point multiplication using MAGIC gates achieve much better latency and throughput than previous work, significantly reduce the area cost, and can feasibly be implemented inside size-limited memory arrays.
Abstract: Data-intensive applications such as image processing suffer from massive data movement between memory and processing units. The severe limitations on system performance and energy efficiency imposed by this data movement are further exacerbated by any increase in the distance the data must travel. This data transfer and its associated obstacles could be eliminated by emerging non-volatile resistive memory technologies (memristors) that make it possible to both store and process data within the same memory cells. In this paper, we propose four in-memory algorithms for efficient execution of fixed-point multiplication using MAGIC gates. These algorithms achieve much better latency and throughput than previous work and significantly reduce the area cost; they can thus feasibly be implemented inside size-limited memory arrays. We use these fixed-point multiplication algorithms to efficiently perform more complex in-memory operations such as image convolution, and further show how to partition large images across multiple memory arrays so as to maximize parallelism. All the proposed algorithms are evaluated and verified using a cycle-accurate functional simulator. Our algorithms provide on average 200× better performance than the state-of-the-art APIM, a processing-in-memory architecture for data-intensive applications.
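
The arithmetic being mapped to memory is ordinary shift-and-add fixed-point multiplication; the sketch below shows that arithmetic in plain Python and makes no attempt to model the MAGIC gate mapping, latency, or array layout. The Q8.8 format and word width are illustrative assumptions.

```python
def fixed_mul(a, b, frac_bits=8, width=16):
    # Shift-and-add multiplication of two's-complement fixed-point words:
    # the bit-serial primitive that in-memory (e.g. MAGIC NOR) designs
    # decompose into sequences of memristive logic steps.
    mask = (1 << (2 * width)) - 1
    neg = (a < 0) ^ (b < 0)
    a, b = abs(a), abs(b)
    acc = 0
    for i in range(width):          # one partial product per bit of b
        if (b >> i) & 1:
            acc = (acc + (a << i)) & mask
    acc >>= frac_bits               # rescale back to the Q format
    return -acc if neg else acc

# 1.5 * 2.25 in Q8.8: 384 * 576 -> 864, i.e. 3.375 * 256.
print(fixed_mul(int(1.5 * 256), int(2.25 * 256)))  # 864
```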

44 citations


Journal ArticleDOI
TL;DR: A system architecture is presented that enhances traditional MapReduce by incorporating a parallel processing algorithm, along with a complete four-tier architecture that efficiently aggregates data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm.
Abstract: The growing gap between users and Big Data analytics requires innovative tools that address the challenges of data volume, variety, and velocity: it is becoming computationally inefficient to analyze such massive volumes of data. Moreover, advancements in Big Data applications and data science pose additional challenges, making High-Performance Computing a key concern that has attracted attention in recent years. However, existing systems are either memoryless or computationally inefficient. In view of these needs, a system is required that can efficiently analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that enhances traditional MapReduce by incorporating a parallel processing algorithm. Moreover, a complete four-tier architecture is proposed that efficiently aggregates data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm. The proposed architecture optimizes both read and write operations, enhancing the efficiency of Input/Output. To check the efficiency of the proposed algorithms, we implemented the system using Hadoop and MapReduce, where MapReduce is supported by a parallel algorithm that efficiently processes huge volumes of data. The system is implemented using MapReduce on top of Hadoop parallel nodes to generate and process graphs in near real time, and it is evaluated in terms of throughput and processing time. The results show that the proposed system is scalable and efficient.

39 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: In all tests performed on the GPU and CPU servers, the GPU ran faster than the CPU, in some cases 4-5 times faster.
Abstract: Deep learning approaches are machine learning methods used in many application fields today. Some core mathematical operations performed in deep learning are well suited to parallelization, and parallel processing increases operating speed. Graphics Processing Units (GPUs) are frequently used for parallel processing; their parallelization capacity is higher than that of CPUs because GPUs have far more cores than Central Processing Units (CPUs). In this study, benchmarking tests were performed between a CPU and a GPU: a Tesla K80 GPU and an Intel Xeon Gold 6126 CPU were used. A system for classifying Web pages with a Recurrent Neural Network (RNN) architecture was used to compare performance. CPUs and GPUs running in the cloud were used because the amount of hardware needed for the tests was large. During the tests, several hyperparameters were adjusted and the performance values were compared between CPU and GPU. The GPU ran faster than the CPU in all tests performed; in some cases, the GPU was 4-5 times faster than the CPU. These gains can be further increased by using a GPU server with more features.
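
A minimal CPU-versus-GPU timing harness in the spirit of these tests, sketched with PyTorch as an assumed dependency; the tanh-of-matmul workload is a stand-in for the paper's RNN, and the sizes are arbitrary.

```python
import time
import torch

def bench(device, n=2048, reps=20):
    x = torch.randn(n, n, device=device)
    w = torch.randn(n, n, device=device)
    _ = torch.tanh(x @ w)                 # warm-up (kernel/library init)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(reps):
        x = torch.tanh(x @ w)             # recurrent-style update
    if device == "cuda":
        torch.cuda.synchronize()          # wait for queued GPU kernels
    return time.perf_counter() - t0

cpu = bench("cpu")
if torch.cuda.is_available():
    print(f"speedup: {cpu / bench('cuda'):.1f}x")
```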

38 citations


Journal ArticleDOI
TL;DR: This work applies dynamic parallelism for synaptic updating in SNN simulations on a GPU, eliminating the need to start many parallel applications at each time step and the associated lags of data transfer between CPU and GPU memories.

36 citations


Journal ArticleDOI
TL;DR: In this article, an improved version of the K-means clustering algorithm is augmented by a Tabu Search strategy, which is better adapted to meet the needs of big data applications.

36 citations


Patent
08 Feb 2018
TL;DR: In this article, the authors present a data processing method and device that can segment variable-length packets and perform an alignment operation on each of the segments in parallel, making the design code easy to maintain, increasing code coverage during design code verification, and at the same time markedly improving timing.
Abstract: Provided are a data processing method and device. The data processing method includes: dividing input data containing N data units corresponding to the current clock period into M data segments in order, wherein M and N are both positive integers, N is greater than or equal to 2, and M is less than N; performing an alignment operation on data units of a first type in each of the M data segments in parallel, shifting the data units of the first type to the front of data units of another type, wherein the data units of the other type are all set to an empty data packet type, the first type being a data packet type to be processed and the other type a type not to be processed; and combining the M data segments after alignment into output data containing N data units. The data processing device includes a segmentation unit, a parallel processing unit, and a combination unit. The data processing method and device in the embodiments of the present invention can segment variable-length packets and perform an alignment operation on each of the segments in parallel, making the design code easy to maintain, increasing code coverage during design code verification, and at the same time markedly improving timing.
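
A minimal software sketch of the claimed method: split the N data units of a clock period into M segments, align each segment in parallel so the to-be-processed units move to the front, then recombine. Modeling the "other type" units directly as empty packets is a simplifying assumption.

```python
from concurrent.futures import ThreadPoolExecutor

EMPTY = None  # stand-in for the patent's "empty data packet type"

def align(segment):
    # Shift to-be-processed units to the front; pad with empty units.
    kept = [u for u in segment if u is not EMPTY]
    return kept + [EMPTY] * (len(segment) - len(kept))

def process_clock_period(units, m):
    # Split the N data units of one clock period into M segments,
    # align each segment in parallel, then recombine.
    n = len(units)
    size = -(-n // m)                       # ceil(n / m)
    segments = [units[i:i + size] for i in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        return [u for seg in pool.map(align, segments) for u in seg]

print(process_clock_period(["a", EMPTY, "b", EMPTY, "c", "d"], m=2))
# ['a', 'b', None, 'c', 'd', None]
```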

Book
22 Feb 2018
TL;DR: Algorithmic Aspects of Parallel Data Processing discusses recent algorithmic developments for distributed data processing and uses a theoretical model of parallel processing called the Massively Parallel Computation (MPC) model, which is a simplification of the BSP model.
Abstract: The last decade has seen a huge and growing interest in processing large data sets on large distributed clusters. This trend began with the MapReduce framework and has been widely adopted by several other systems, including PigLatin, Hive, Scope, Dremel, Spark and Myria, to name a few. While the applications of such systems are diverse (for example, machine learning and data analytics), most involve relatively standard data processing tasks: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. This has generated great interest in the study of algorithms for data processing on large distributed clusters. Algorithmic Aspects of Parallel Data Processing discusses recent algorithmic developments for distributed data processing. It uses a theoretical model of parallel processing called the Massively Parallel Computation (MPC) model, a simplification of the BSP model in which the only costs are the amount of communication and the number of communication rounds. The survey studies several algorithms for multi-join queries, sorting, and matrix multiplication, and discusses their relationships and common techniques applied across the different data processing tasks.
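
To illustrate the cost model the MPC setting uses, here is a textbook one-round hash-partitioned join sketch: every tuple is exchanged exactly once, so the communication cost is one round of |R| + |S| tuples. The `p` servers are simulated locally; this is a generic example, not an algorithm taken from the book.

```python
from collections import defaultdict

def mpc_hash_join(R, S, p):
    # Round 1: route every tuple to server hash(join_key) % p.
    servers_R, servers_S = defaultdict(list), defaultdict(list)
    for a, b in R:
        servers_R[hash(b) % p].append((a, b))   # partition R on join key
    for b, c in S:
        servers_S[hash(b) % p].append((b, c))   # partition S on join key
    out = []
    for srv in range(p):            # local joins, no further rounds needed
        index = defaultdict(list)
        for a, b in servers_R[srv]:
            index[b].append(a)
        for b, c in servers_S[srv]:
            out.extend((a, b, c) for a in index[b])
    return out

print(mpc_hash_join(R=[(1, "x"), (2, "y")], S=[("x", 9), ("x", 7)], p=4))
```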

Journal ArticleDOI
TL;DR: This paper proposes parallel issue queueing (PIQ), a novel I/O scheduler at the host system that delivers significant performance improvements, especially for applications with heavy access conflicts.
Abstract: Solid state drives (SSDs) have been widely deployed in personal computers, data centers, and cloud storage. To improve performance, SSDs are usually constructed with a number of channels, each channel connecting to a number of NAND flash chips, each chip consisting of multiple dies, and each die containing multiple planes. Based on this parallel architecture, I/O requests can potentially access parallel units simultaneously. Despite the rich parallelism offered by this architecture, recent studies show that the utilization of flash parallel units is seriously low; this paper shows that the low utilization is largely caused by access conflicts among I/O requests. We propose parallel issue queueing (PIQ), a novel I/O scheduler at the host system. PIQ groups I/O requests without conflicts into the same batch and requests with conflicts into different batches, so the multiple I/O requests in one batch can be fulfilled simultaneously by exploiting the rich parallelism of SSDs. Extensive experimental results show that PIQ delivers significant performance improvements, especially for applications with heavy access conflicts.
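
A minimal sketch of the batching idea (not the exact PIQ scheduler): greedily pack requests into batches such that no two requests in a batch map to the same parallel unit, so each batch can be issued concurrently. The LBA-modulo-4 unit mapping is an illustrative assumption.

```python
def piq_batches(requests, unit_of):
    # Greedy first-fit: a request joins the first batch whose busy-unit
    # set does not already contain its channel/chip/die/plane.
    batches = []          # each batch: (set of busy units, list of requests)
    for req in requests:
        unit = unit_of(req)
        for busy, batch in batches:
            if unit not in busy:             # no access conflict here
                busy.add(unit)
                batch.append(req)
                break
        else:
            batches.append(({unit}, [req]))  # conflicts everywhere: new batch
    return [batch for _, batch in batches]

# Toy mapping: LBA modulo 4 identifies the flash parallel unit.
reqs = [0, 4, 1, 8, 5, 2]
print(piq_batches(reqs, unit_of=lambda lba: lba % 4))
# [[0, 1, 2], [4, 5], [8]]
```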

Patent
07 Dec 2018
TL;DR: In this paper, the authors present methods, systems, and compositions for parallel processing of nucleic acid samples, using sample-specific barcode sequences, which facilitate the multiplexing of samples, detection of discrete cell populations within a pooled population, and detection of partitions comprising more than one cell.
Abstract: The present disclosure provides methods, systems, and compositions for parallel processing of nucleic acid samples. Methods and systems of the present disclosure comprise the use of sample-specific barcode sequences, which facilitate the multiplexing of samples, detection of discrete cell populations within a pooled population, and detection of partitions comprising more than one cell.

Journal ArticleDOI
TL;DR: In this paper, a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerant Hadoop distributed file system, and two spatial decomposition techniques are used to enable efficient parallelization.
Abstract: Light detection and ranging (LiDAR) data are essential for scientific discoveries such as Earth and ecological sciences, environmental applications, and responding to natural disasters. While collecting LiDAR data over large areas is quite feasible, the subsequent processing steps typically involve large computational demands. Efficiently storing, managing, and processing LiDAR data are the prerequisite steps for enabling LiDAR-based applications. However, handling LiDAR data poses grand geoprocessing challenges due to data and computational intensity. To tackle such challenges, we developed a general-purpose scalable framework coupled with a sophisticated data decomposition and parallelization strategy to efficiently handle 'big' LiDAR data collections. The contributions of this research were (1) a tile-based spatial index to manage big LiDAR data in the scalable and fault-tolerant Hadoop distributed file system, (2) two spatial decomposition techniques to enable efficient parallelization o...
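
A minimal sketch of a tile-based spatial index: each point maps to a (col, row) tile key, and tiles become independent units of storage and parallel work. The tile size and origin are illustrative assumptions, not values from the paper.

```python
import math
from collections import defaultdict

def tile_id(x, y, origin=(0.0, 0.0), tile_size=100.0):
    # Map a LiDAR return to the (col, row) tile that indexes it; each
    # tile becomes one HDFS file/split, giving the decomposition for free.
    return (math.floor((x - origin[0]) / tile_size),
            math.floor((y - origin[1]) / tile_size))

def build_index(points, **kw):
    index = defaultdict(list)
    for x, y, z in points:
        index[tile_id(x, y, **kw)].append((x, y, z))
    return index   # tiles can now be processed by independent workers

index = build_index([(12.0, 5.0, 1.2), (250.0, 40.0, 0.8), (99.0, 99.0, 2.0)])
print(sorted(index))   # [(0, 0), (2, 0)]
```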

Journal ArticleDOI
TL;DR: This work developed a method of femtosecond laser (fs-laser) parallel processing assisted by wet etching to fabricate 3D micro-optical components that show a unique imaging property in multiple planes.
Abstract: This work developed a method of femtosecond laser (fs-laser) parallel processing assisted by wet etching to fabricate 3D micro-optical components. A 2D fs-laser spot array with a designed spatial distribution was generated by a spatial light modulator, and a single-pulse exposure of the entire array was used for parallel processing. By subsequent wet etching, a close-packed hexagonal 3D concave microlens array on a curved surface with a radius of approximately 120 μm was fabricated, each unit lens of which has a designable spatial distribution. Imaging characterization was carried out with a microscope and showed a unique imaging property in multiple planes. This method provides a parallel and efficient technique to fabricate 3D micro-optical devices for applications in optofluidics, optical communication, and integrated optics.

Journal ArticleDOI
TL;DR: The memristive Memory Processing Unit (mMPU) is presented, a real processing-in-memory system in which the computation is done directly in the memory cells, thus eliminating the necessity for data transfer.
Abstract: Data movement between processing and memory is the root cause of the limited performance and energy efficiency in modern von Neumann systems. To overcome the data-movement bottleneck, we present the memristive Memory Processing Unit (mMPU), a real processing-in-memory system in which the computation is done directly in the memory cells, thus eliminating the necessity for data transfer. Furthermore, with its enormous inner parallelism, this system is ideal for data-intensive applications based on single instruction, multiple data (SIMD), providing high throughput and energy efficiency.

Proceedings ArticleDOI
20 Jun 2018
TL;DR: In this paper, image processing algorithms capable of executing in parallel on several platforms (CPU and GPU) were evaluated; all algorithms were tested in TensorFlow, a novel framework for deep learning as well as image processing.
Abstract: Signal, image, and Synthetic Aperture Radar imagery algorithms are nowadays used routinely. Due to huge data volumes and complexity, processing them in real time is almost impossible. Image processing algorithms are often inherently parallel in nature, so they fit nicely onto parallel architectures such as multicore Central Processing Units (CPUs) and Graphics Processing Units (GPUs). In this paper, image processing algorithms capable of executing in parallel on several platforms (CPU and GPU) were evaluated. All algorithms were tested in TensorFlow, a novel framework for deep learning as well as image processing. Relative speedups compared to the CPU are given for all algorithms. The TensorFlow GPU implementation can outperform multi-core CPUs for the tested algorithms, with obtained speedups ranging from 3.6 to 15 times.

Journal ArticleDOI
TL;DR: Novel methods, referred to as Opass, are proposed to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems, benefiting parallel data-intensive analysis with balanced data access.
Abstract: The distributed file system HDFS is widely deployed as the bedrock for many parallel big data analyses. However, when multiple parallel applications run over the shared file system, the data requests from different processes/executors are served in a surprisingly imbalanced fashion on the distributed storage servers. These imbalanced access patterns arise because (a) unlike conventional parallel file systems, which use striping policies to distribute data evenly among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk file, in several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it is selected to serve the data. Therefore, on nodes serving multiple chunk files, the data requests from different processes/executors compete for shared resources such as the hard disk head and network bandwidth, resulting in degraded I/O performance. In this paper, we first conduct a complete analysis of how remote and imbalanced read/write patterns occur and how they are affected by the size of the cluster. We then propose novel methods, referred to as Opass, to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems. Our proposed methods can benefit parallel data-intensive analysis with various parallel data access strategies. Opass adopts new matching-based algorithms to match processes to data so as to achieve maximum data locality and balanced data access. Furthermore, to reduce the imbalance of parallel writes, Opass employs a heatmap to monitor the I/O statuses of storage nodes and performs an HM-LRU policy to select a locally optimal storage node to serve write requests. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.
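
A minimal sketch of the heatmap idea for write balancing: route each write to the least-loaded of the nodes that offer local access, then update the heatmap. The real HM-LRU policy also ages and evicts entries, which is omitted here; the node names are hypothetical.

```python
def pick_write_node(local_nodes, heatmap):
    # Consult a heatmap of per-node I/O load and route the write to the
    # least-loaded node with local access (sketch of the Opass idea).
    node = min(local_nodes, key=lambda n: heatmap.get(n, 0))
    heatmap[node] = heatmap.get(node, 0) + 1   # account for the new write
    return node

heat = {"node1": 12, "node2": 3, "node3": 7}
print(pick_write_node(["node1", "node2", "node3"], heat))  # node2
```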

Proceedings ArticleDOI
11 Nov 2018
TL;DR: ParSy is a framework that uses a novel inspection strategy along with a simple code transformation to optimize parallel sparse algorithms for shared-memory processors; its task coarsening strategy creates well-balanced tasks that execute in parallel while maintaining locality of memory accesses.
Abstract: In this work, we describe ParSy, a framework that uses a novel inspection strategy along with a simple code transformation to optimize parallel sparse algorithms for shared memory processors. Unlike existing approaches that can suffer from load imbalance and excessive synchronization, ParSy uses a novel task coarsening strategy to create well-balanced tasks that can execute in parallel, while maintaining locality of memory accesses. Code using the ParSy inspector and transformation outperforms existing highly-optimized sparse matrix algorithms such as Cholesky factorization on multi-core processors, with speedups of 2.8× and 3.1× over the MKL Pardiso and PaStiX libraries respectively.
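
ParSy's inspector builds a task graph before execution. A classic instance of such inspection, sketched below for a sparse lower-triangular solve, computes level sets whose rows may run in parallel; ParSy's task coarsening and load balancing are deliberately not modeled, and SciPy is an assumed dependency.

```python
import numpy as np
from scipy.sparse import csr_matrix

def level_sets(L):
    # Inspector (sketch): row i of a lower-triangular solve depends on
    # every row j < i with L[i, j] != 0. Rows assigned the same level
    # are mutually independent and may execute in parallel.
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        for j in L.indices[L.indptr[i]:L.indptr[i + 1]]:
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(level.max() + 1)]
    for i in range(n):
        sets[level[i]].append(i)
    return sets

L = csr_matrix(np.array([[1., 0., 0., 0.],
                         [2., 1., 0., 0.],
                         [0., 0., 1., 0.],
                         [0., 3., 4., 1.]]))
print(level_sets(L))   # [[0, 2], [1], [3]]
```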

Journal ArticleDOI
TL;DR: A multilevel generalization of the dual‐primal finite element tearing and interconnecting (FETI‐DP) domain decomposition method is proposed for very large‐scale discrete problems to address the bottleneck associated with the solution of the coarse problems at such scales.

Proceedings ArticleDOI
Toru Baji
13 Mar 2018
TL;DR: The details of this continuous performance growth, the constant evolution in transistor count and die size, and the scalable GPU architecture are described.
Abstract: While CPU performance can no longer benefit from Moore's law, GPUs (Graphics Processing Units) continue to increase their performance by 1.5× per year. For this reason, GPUs are now widely used not only for computer graphics but also for massively parallel processing and AI (Artificial Intelligence). In this paper, the details of this continuous performance growth, the constant evolution in transistor count and die size, and the scalable GPU architecture are described.

Patent
13 Feb 2018
TL;DR: In this paper, a parallel transaction execution method based on a blockchain is proposed, in which data units on the chain are assigned index numbers and a user's parallel transaction must provide, in addition to the basic transaction contents, the data indices to be read and written during execution.
Abstract: The invention discloses a parallel transaction execution method based on a blockchain. First, the data units on the blockchain are assigned index numbers; a user's parallel transaction must provide, in addition to the basic transaction contents, the data indices that need to be read and written during transaction execution. A user's serial transaction only needs to provide the basic transaction contents. A node arranges parallel processing according to the data dependency relationships of the parallel transactions; transactions that cannot be executed concurrently, together with serial transactions, are executed in sequence.
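
A minimal sketch of dependency-based scheduling from declared read/write index sets: transactions are greedily packed into "waves" whose members can run concurrently because no write set touches another member's read or write set. The greedy first-fit policy is an assumption for illustration, not the patent's exact arrangement.

```python
def schedule(transactions):
    # Each transaction declares the data indices it reads and writes:
    # tx = (name, read_set, write_set). Two transactions conflict when
    # either one's write set intersects the other's read or write set.
    waves = []
    for tx in transactions:
        name, reads, writes = tx
        placed = False
        for wave in waves:
            conflict = any(writes & (r | w) or w & (reads | writes)
                           for _, r, w in wave)
            if not conflict:
                wave.append(tx)
                placed = True
                break
        if not placed:
            waves.append([tx])
    return waves

txs = [("t1", {1}, {2}), ("t2", {3}, {4}), ("t3", {2}, {5})]
for i, wave in enumerate(schedule(txs)):
    print(i, [name for name, _, _ in wave])
# wave 0: t1, t2 run in parallel; wave 1: t3 (it reads index 2, written by t1)
```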

Journal ArticleDOI
TL;DR: The proposed systolic array architecture for division over GF(2^m), based on the modified Stein's algorithm, has the advantage of reducing the number of flip-flops required to store the intermediate variables of the algorithm, and hence reduces the total gate count to a large extent compared to other related designs.
Abstract: This paper proposes a new systolic array architecture to perform division operations over GF(2^m) based on the modified Stein's algorithm. The systolic structure is extracted by applying a regular approach to the division algorithm: the approach starts by obtaining the dependency graph for the intended algorithm, assigns a time value to each node in the dependency graph using a scheduling function, and ends by projecting several nodes of the dependency graph onto a processing element to constitute the systolic array. The obtained design structure has the advantage of reducing the number of flip-flops required to store the intermediate variables of the algorithm and hence reduces the total gate count to a large extent compared to other related designs. The analytical results show that the proposed design outperforms the related designs in terms of area (at least a 32% reduction) and speed (at least a 60% reduction in total computation time) and has the lowest AT complexity, with reductions ranging from 80% to 94%.
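
For reference, the classical binary (Stein-type) division over GF(2^m) that such designs start from can be written compactly in software; the sketch below computes y/x with field elements packed into Python ints. It illustrates the iteration only, not the paper's dependency-graph scheduling or systolic mapping.

```python
def gf2m_div(y, x, f, m):
    # Binary (Stein-type) division y/x in GF(2^m): elements are
    # polynomials over GF(2) packed into ints; f is the irreducible
    # modulus of degree m (its top bit sits at position m).
    assert x != 0 and f >> m == 1
    a, b, u, v = x, f, y, 0
    while a != 1 and b != 1:
        while a & 1 == 0:                  # a divisible by z
            a >>= 1
            if u & 1:
                u ^= f                     # make u divisible by z too
            u >>= 1
        while b & 1 == 0:
            b >>= 1
            if v & 1:
                v ^= f
            v >>= 1
        if a.bit_length() > b.bit_length():
            a ^= b; u ^= v                 # degree-reduction step
        else:
            b ^= a; v ^= u
    return u if a == 1 else v

# GF(2^4) with f = z^4 + z + 1: (z^2 + z) / (z + 1) = z, i.e. 0b0010.
print(bin(gf2m_div(0b0110, 0b0011, 0b10011, 4)))
```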

Journal ArticleDOI
TL;DR: Voluntarily occurring preferences for serial or overlapping processing seem to depend at least partially on the risk of crosstalk between tasks, and in both crosstalk conditions individual performance efficiency was higher the more participants processed in parallel.
Abstract: The prevalence and efficiency of serial and parallel processing under multiple task demands are highly debated. In the present study, we investigated whether individual preferences for serial or overlapping (parallel) processing represent a permanent predisposition or depend on the risk of crosstalk between tasks. Two groups (n = 91) of participants were tested. One group performed a classical task switching paradigm, enforcing strictly serial processing of tasks. The second group performed the same tasks in a task-switching-with-preview paradigm, recently introduced by Reissland and Manzey (2016), which in principle allows for overlapping processing of both tasks in order to compensate for switch costs. In one condition the tasks included univalent task stimuli, whereas in the other bivalent stimuli were used, increasing the risk of crosstalk and task confusion in case of overlapping processing. The general distinction of voluntarily occurring preferences for serial or overlapping processing when performing task switching with preview was confirmed. Tracking processing-mode adjustments between the low- and high-crosstalk conditions showed that individuals identified as serial processors in the low-crosstalk condition persisted in their processing mode. In contrast, overlapping processors split into a majority who adjusted to a serial processing mode and a minority who persisted in overlapping processing when working with bivalent stimuli. Thus, the voluntarily occurring preferences for serial or overlapping processing seem to depend at least partially on the risk of crosstalk between tasks. Strikingly, in both crosstalk conditions, individual performance efficiency was higher the more participants processed in parallel.

Patent
16 Feb 2018
TL;DR: In this article, the authors propose a method and a system for generating blocks on the basis of a blockchain, together with computer equipment and a computer-readable storage medium, in the technical field of data processing.
Abstract: The invention provides a method and a system for generating blocks on the basis of a blockchain, computer equipment, and a computer-readable storage medium, and relates to the technical field of data processing. The method includes: a consensus node receiving multiple pieces of transaction information sent by a transaction-sending node and putting them into a transaction pool; acquiring the multiple pieces of transaction information from the transaction pool and packaging them into multiple packets; starting multiple threads to carry out parallel processing packet by packet to obtain transaction processing results; and generating the blocks according to the transaction processing results. The method greatly improves the processing efficiency of a blockchain system, so that single-thread running speed is no longer the performance bottleneck of the blockchain system.
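
A minimal sketch of the packet-parallel pipeline in Python: pool transactions, package them into packets, process the packets on a thread pool, and assemble the block. Packet size, thread count, and the placeholder processing function are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def process_packet(packet):
    # Placeholder transaction validation/execution for one packet.
    return [f"result({tx})" for tx in packet]

def generate_block(tx_pool, packet_size=4, threads=4):
    # Package pooled transactions into packets, process the packets on
    # multiple threads in parallel, then assemble the block from the
    # per-packet results (mirroring the patent's pipeline, simplified).
    packets = [tx_pool[i:i + packet_size]
               for i in range(0, len(tx_pool), packet_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(process_packet, packets))
    return {"txs": tx_pool, "results": [r for rs in results for r in rs]}

block = generate_block([f"tx{i}" for i in range(10)])
```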

Journal ArticleDOI
TL;DR: A methodology is developed to assess time-domain power quality state estimation (PQSE) in electrical systems based on the Kalman filter, implemented with parallel processing techniques on graphics processing units (GPUs) to reduce execution time.
Abstract: A methodology is developed to assess time-domain power quality state estimation (PQSE) in electrical systems based on the Kalman filter, implemented with parallel processing techniques on graphics processing units (GPUs) to reduce execution time. The measurements used by the state estimation algorithm are taken from the simulation and transient propagation response of the power network. The parallel Kalman filter (PKF) state estimation obtains the waveforms of busbar voltages and line currents with several sources of time-varying electromagnetic transients. The PKF is evaluated using the compute unified device architecture (CUDA) platform and the CUDA basic linear algebra subprograms (cuBLAS) library; the parallel filter is executed on GPU cards. Case studies solve the time-domain state estimation using the proposed PKF-PQSE method, obtaining a reduction in execution time while including time-varying harmonics, short-circuit faults, and load transient conditions. The speed-up depends on the number of state variables modeling the electrical system under analysis. The PKF-PQSE results are successfully compared and validated against the Power Systems Computer Aided Design/Electromagnetic Transients including DC (PSCAD/EMTDC) simulator.
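
For reference, one predict-and-correct step of a standard linear Kalman filter, batched over many independent estimates with NumPy; the batched matrix products stand in for the paper's CUDA/cuBLAS parallelization, and the constant-velocity model is an illustrative assumption.

```python
import numpy as np

def kf_step(x, P, z, A, H, Q, R):
    # One predict+update step of a linear Kalman filter, batched over
    # the leading axis so many estimators advance at once.
    x = x @ A.T                                  # predict state
    P = A @ P @ A.T + Q                          # predict covariance
    S = H @ P @ H.T + R                          # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain, (n, d, 1)
    innov = (z - x @ H.T)[:, :, None]            # measurement residual
    x = x + (K @ innov)[:, :, 0]                 # correct state
    P = P - K @ H @ P                            # correct covariance
    return x, P

n, d = 1000, 2                                   # 1000 filters, 2-state model
A = np.array([[1.0, 1.0], [0.0, 1.0]])           # constant-velocity dynamics
H = np.array([[1.0, 0.0]])                       # position-only measurement
Q, R = 0.01 * np.eye(d), np.array([[0.1]])
x, P = np.zeros((n, d)), np.tile(np.eye(d), (n, 1, 1))
z = np.random.default_rng(0).standard_normal((n, 1))
x, P = kf_step(x, P, z, A, H, Q, R)
```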

Journal ArticleDOI
TL;DR: The design, implementation and evaluation of G-Storm is presented, a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing.
Abstract: The Single Instruction Multiple Data (SIMD) architecture of Graphic Processing Units (GPUs) makes them perfect for parallel processing of big data. In this paper, we present the design, implementation and evaluation of G-Storm , a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing. G-Storm has the following desirable features: 1) G-Storm is designed to be a general data processing platform as Storm, which can handle various applications and data types. 2) G-Storm exposes GPUs to Storm applications while preserving its easy-to-use programming model. 3) G-Storm achieves high-throughput and low-overhead data processing with GPUs. 4) G-Storm accelerates data processing further by enabling Direct Data Transfer (DDT), between two executors that process data at a common GPU. We implemented G-Storm based on Storm 0.9.2 and tested it using three different applications, including continuous query, matrix multiplication and image resizing. Extensive experimental results show that 1) Compared to Storm, G-Storm achieves over 7× improvement on throughput for continuous query, while maintaining reasonable average tuple processing time. It also leads to 2.3× and 1.3× throughput improvements on the other two applications, respectively. 2) DDT significantly reduces data processing time.

Journal ArticleDOI
TL;DR: This Letter presents the architecture implementation and testing of an single instruction multiple data (SIMD) processor for energy aware embedded morphological visual processing using the simplicial piece-wise linear approximation.
Abstract: This Letter presents the architecture implementation and testing of an single instruction multiple data (SIMD) processor for energy aware embedded morphological visual processing using the simplicial piece-wise linear approximation. The architecture comprises a linear array of 48 × 48 processing elements, each connected to an eight-neighbour clique operating on binary input and state data. The architecture is synthesised from a custom designed ultra low-voltage CMOS library and fabricated in a 55 nm CMOS technology. The chip is capable of dynamic voltage/frequency scaling with power supplies between 0.5 and 1.2 V. The fabricated chip achieves an overall performance of 293 TOPS/W with dynamic energy dissipation efficiency of 3.4 fJ per output operation at 0.6 V.