
Showing papers on "Parallel processing (DSP implementation)" published in 2018


Journal ArticleDOI
TL;DR: Results show that the proposed algorithm reduces the amount of computation during execution, greatly reduces memory consumption, and improves counting speed in railway signal systems.

117 citations


Journal ArticleDOI
TL;DR: An optimization algorithm based on parallel versions of the bat algorithm, random-key encoding scheme, communication strategy scheme and makespan scheme is proposed to solve the NP-hard job shop scheduling problem.
Abstract: Parallel processing plays an important role in efficient and effective computation for function optimization. In this paper, an optimization algorithm based on a parallel version of the bat algorithm (BA), a random-key encoding scheme, a communication strategy, and a makespan scheme is proposed to solve the NP-hard job shop scheduling problem. The aim of the parallel BA with communication strategies is to correlate individuals in swarms and to share the computational load over several processors. Based on the original structure of the BA, the bat population is split into several independent groups. In addition, the communication strategy provides diversity-enhanced bats to speed up convergence. In the experiments, forty-three benchmark instances of the job shop scheduling data set with various sizes are used to test the convergence behavior and accuracy of the proposed method. Comparisons with other methods in the literature show that the proposed scheme achieves better convergence and accuracy than BA and particle swarm optimization.
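
As a rough illustration of the parallel BA idea (independent sub-swarms plus periodic migration of good solutions), here is a minimal Python sketch. The random-key decoding and makespan evaluation used for job shop scheduling are omitted, and all group sizes, rates, and the migration period are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def parallel_ba(obj, dim, n_groups=4, bats_per_group=10,
                iters=200, migrate_every=20, fmin=0.0, fmax=2.0):
    rng = np.random.default_rng(0)
    # Each group evolves an independent sub-swarm (positions, velocities).
    x = rng.uniform(0, 1, (n_groups, bats_per_group, dim))
    v = np.zeros_like(x)
    best = x[:, 0].copy()                      # per-group best position
    best_f = np.array([obj(b) for b in best])  # per-group best fitness
    for t in range(iters):
        for g in range(n_groups):
            freq = fmin + (fmax - fmin) * rng.random((bats_per_group, 1))
            v[g] += (x[g] - best[g]) * freq
            x[g] = np.clip(x[g] + v[g], 0, 1)
            fit = np.array([obj(b) for b in x[g]])
            i = fit.argmin()
            if fit[i] < best_f[g]:
                best_f[g], best[g] = fit[i], x[g][i].copy()
        # Communication strategy: every few iterations each group
        # replaces its worst bat with the current global best solution.
        if (t + 1) % migrate_every == 0:
            donor = best[best_f.argmin()].copy()
            for g in range(n_groups):
                worst = np.array([obj(b) for b in x[g]]).argmax()
                x[g][worst] = donor
    g = best_f.argmin()
    return best[g], best_f[g]

# Example: minimise the sphere function in 5 dimensions.
pos, val = parallel_ba(lambda z: float(np.sum(z ** 2)), dim=5)
```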

106 citations


Proceedings ArticleDOI
02 Jun 2018
TL;DR: This work proposes a two-stage, prediction-based DNN execution model without accuracy loss, together with a uniform serial processing element (USPE) shared by the prediction and execution stages to improve flexibility and minimize area overhead.
Abstract: Recently, deep neural network based approaches have emerged as indispensable tools in many fields, ranging from image and video recognition to natural language processing. However, the large size of such newly developed networks poses both throughput and energy challenges to the underlying processing hardware, which could be a major stumbling block for promising applications such as self-driving cars and smart cities. Existing work proposes to weed out zeros from input neurons to avoid unnecessary DNN computation (zero-valued operand multiplications). However, we observe that many output neurons are still ineffectual even after the zero-removal technique has been applied. These ineffectual output neurons cannot pass their values to the subsequent layer, which means all the computations (both zero-valued and non-zero-valued operand multiplications) related to these output neurons are futile and wasteful. There is therefore an opportunity to significantly improve the performance and efficiency of DNN execution by predicting the ineffectual output neurons and skipping over them, completely avoiding the futile computations. To do so, we propose a two-stage, prediction-based DNN execution model without accuracy loss. We also propose a uniform serial processing element (USPE) shared by the prediction and execution stages to improve flexibility and minimize area overhead. To improve processing throughput, we further present a scale-out design for the USPE. Evaluation results over a set of state-of-the-art DNNs show that our proposed design achieves a 2.5× speedup and 1.9× better energy efficiency on average over a traditional accelerator. Moreover, by stacking with our design, we can improve Cnvlutin and Stripes by 1.9× and 2.0× on average, respectively.
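
A minimal NumPy sketch of the two-stage predict-then-execute idea: a cheap low-precision pass flags output neurons that ReLU would zero out, and only the rest are computed exactly. Unlike the paper's scheme, this simple sign-prediction heuristic can mispredict, and it says nothing about the USPE hardware; all names and bit widths are illustrative assumptions.

```python
import numpy as np

def predict_then_execute(x, W, bits=4):
    # Stage 1 (prediction): a low-precision matmul estimates each output
    # neuron; neurons predicted non-positive would be zeroed by ReLU,
    # so their exact computation is skipped.
    scale = 2 ** (bits - 1) - 1
    xq = np.round(x / (np.abs(x).max() + 1e-12) * scale)
    Wq = np.round(W / (np.abs(W).max() + 1e-12) * scale)
    effectual = (xq @ Wq) > 0       # predicted-effectual output neurons
    # Stage 2 (execution): full-precision compute only where needed.
    y = np.zeros(W.shape[1])
    y[effectual] = x @ W[:, effectual]
    return np.maximum(y, 0), effectual

rng = np.random.default_rng(1)
x, W = rng.standard_normal(256), rng.standard_normal((256, 64))
y, mask = predict_then_execute(x, W)
print(f"computed {mask.sum()} of {mask.size} output neurons")
```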

77 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: This paper reviews two of the hottest topics in the area, distributed parallel processing and distributed cloud computing, including the concept of decreasing response time in distributed parallel computing.
Abstract: In this paper, we review two of the hottest topics in this area, namely distributed parallel processing and distributed cloud computing. Various aspects are discussed, such as whether these topics have been treated together in previous work. We also review the algorithms that have been simulated in both distributed parallel computing and distributed cloud computing. The goal is to process tasks over the available resources and then readjust the computation among the servers for the sake of optimization, which helps improve system performance at the desired rates. In our review, we present some articles that explain the design of applications in distributed cloud computing, while others introduce the concept of decreasing response time in distributed parallel computing.

70 citations


Journal ArticleDOI
TL;DR: A parallel image encryption method based on bitplane decomposition is proposed; its ability to encrypt multiple bitplanes in parallel increases encryption speed and makes it suitable for real-time applications.
Abstract: Image encryption is an efficient technique to protect image content from unauthorized parties. In this paper, a parallel image encryption method based on bitplane decomposition is proposed. The original grayscale image is converted to a set of binary images by the local binary pattern (LBP) technique and bitplane decomposition (BPD) methods. Then, permutation and substitution steps are performed by a genetic algorithm (GA) using crossover and mutation operations. Finally, the scrambled bitplanes are combined to obtain the encrypted image. Instead of random population selection in the GA, a deterministic method with security keys is utilized to improve the security level. The proposed method can encrypt multiple bitplanes in parallel; this distributed GA with multiple populations increases encryption speed and makes the method suitable for real-time applications. Simulations and security analysis demonstrate the efficiency of our algorithm.
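
The parallelism comes from treating each bitplane as an independent encryption unit. A minimal sketch follows, with a key-driven permutation standing in for the paper's GA-based crossover/mutation and the LBP step omitted; the pool type and key choice are illustrative assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor  # process pool also works

def bitplanes(img):
    # Decompose an 8-bit grayscale image into 8 binary bitplanes.
    return [(img >> b) & 1 for b in range(8)]

def scramble(args):
    plane, key = args
    # Key-seeded permutation: deterministic, hence invertible with the key.
    rng = np.random.default_rng(key)
    perm = rng.permutation(plane.size)
    return plane.ravel()[perm].reshape(plane.shape)

def encrypt(img, keys):
    planes = bitplanes(img)
    with ThreadPoolExecutor(max_workers=8) as pool:  # one worker per plane
        enc = list(pool.map(scramble, zip(planes, keys)))
    return sum(p << b for b, p in enumerate(enc)).astype(np.uint8)

img = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
cipher = encrypt(img, keys=range(8))
```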

49 citations


Journal ArticleDOI
TL;DR: Four in-memory algorithms for efficient execution of fixed-point multiplication using MAGIC gates achieve much better latency and throughput than previous work, significantly reduce the area cost, and can feasibly be implemented inside size-limited memory arrays.
Abstract: Data-intensive applications such as image processing suffer from massive data movement between memory and processing units. The severe limitations on system performance and energy efficiency imposed by this data movement are further exacerbated by any increase in the distance the data must travel. This data transfer and its associated obstacles could be eliminated by emerging non-volatile resistive memory technologies (memristors) that make it possible to both store and process data within the same memory cells. In this paper, we propose four in-memory algorithms for efficient execution of fixed-point multiplication using MAGIC gates. These algorithms achieve much better latency and throughput than previous work and significantly reduce the area cost; they can thus feasibly be implemented inside size-limited memory arrays. We use these fixed-point multiplication algorithms to efficiently perform more complex in-memory operations such as image convolution, and further show how to partition large images across multiple memory arrays so as to maximize parallelism. All the proposed algorithms are evaluated and verified using a cycle-accurate functional simulator. Our algorithms provide on average 200× better performance than the state-of-the-art APIM, a processing-in-memory architecture for data-intensive applications.
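
The arithmetic being mapped to memory is ordinary shift-and-add fixed-point multiplication; the sketch below shows that arithmetic in plain Python and makes no attempt to model the MAGIC gate mapping, latency, or array layout. The Q8.8 format and word width are illustrative assumptions.

```python
def fixed_mul(a, b, frac_bits=8, width=16):
    # Shift-and-add multiplication of two's-complement fixed-point words:
    # the bit-serial primitive that in-memory (e.g. MAGIC NOR) designs
    # decompose into sequences of memristive logic steps.
    mask = (1 << (2 * width)) - 1
    neg = (a < 0) ^ (b < 0)
    a, b = abs(a), abs(b)
    acc = 0
    for i in range(width):          # one partial product per bit of b
        if (b >> i) & 1:
            acc = (acc + (a << i)) & mask
    acc >>= frac_bits               # rescale back to the Q format
    return -acc if neg else acc

# 1.5 * 2.25 in Q8.8: 384 * 576 -> 864, i.e. 3.375 * 256.
print(fixed_mul(int(1.5 * 256), int(2.25 * 256)))  # 864
```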

44 citations


Journal ArticleDOI
TL;DR: A system architecture is presented that enhances traditional MapReduce by incorporating a parallel processing algorithm, along with a complete four-tier architecture that efficiently aggregates data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm.
Abstract: The growing gap between users and Big Data analytics requires innovative tools that address the challenges of data volume, variety, and velocity: it is becoming computationally inefficient to analyze such massive volumes of data. Moreover, advancements in Big Data applications and data science pose additional challenges, making High-Performance Computing a key concern that has attracted attention in recent years. However, existing systems are either memoryless or computationally inefficient. In view of these needs, a system is required that can efficiently analyze a stream of Big Data within its requirements. Hence, this paper presents a system architecture that enhances traditional MapReduce by incorporating a parallel processing algorithm. Moreover, a complete four-tier architecture is proposed that efficiently aggregates data, eliminates unnecessary data, and analyzes the data with the proposed parallel processing algorithm. The proposed architecture optimizes both read and write operations, enhancing the efficiency of Input/Output. To check the efficiency of the proposed algorithms, we implemented the system using Hadoop and MapReduce, where MapReduce is supported by a parallel algorithm that efficiently processes huge volumes of data. The system is implemented using MapReduce on top of Hadoop parallel nodes to generate and process graphs in near real time, and it is evaluated in terms of throughput and processing time. The results show that the proposed system is scalable and efficient.

39 citations


Proceedings ArticleDOI
01 Oct 2018
TL;DR: In all tests performed on the GPU and CPU servers, the GPU ran faster than the CPU, in some cases 4-5 times faster.
Abstract: Deep learning approaches are machine learning methods used in many application fields today. Some core mathematical operations performed in deep learning are well suited to parallelization, and parallel processing increases operating speed. Graphics Processing Units (GPUs) are frequently used for parallel processing; their parallelization capacity is higher than that of CPUs because GPUs have far more cores than Central Processing Units (CPUs). In this study, benchmarking tests were performed between a CPU and a GPU: a Tesla K80 GPU and an Intel Xeon Gold 6126 CPU were used. A system for classifying Web pages with a Recurrent Neural Network (RNN) architecture was used to compare performance. CPUs and GPUs running in the cloud were used because the amount of hardware needed for the tests was large. During the tests, several hyperparameters were adjusted and the performance values were compared between CPU and GPU. The GPU ran faster than the CPU in all tests performed; in some cases, the GPU was 4-5 times faster than the CPU. These gains can be further increased by using a GPU server with more features.
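
A minimal CPU-versus-GPU timing harness in the spirit of these tests, sketched with PyTorch as an assumed dependency; the tanh-of-matmul workload is a stand-in for the paper's RNN, and the sizes are arbitrary.

```python
import time
import torch

def bench(device, n=2048, reps=20):
    x = torch.randn(n, n, device=device)
    w = torch.randn(n, n, device=device)
    _ = torch.tanh(x @ w)                 # warm-up (kernel/library init)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(reps):
        x = torch.tanh(x @ w)             # recurrent-style update
    if device == "cuda":
        torch.cuda.synchronize()          # wait for queued GPU kernels
    return time.perf_counter() - t0

cpu = bench("cpu")
if torch.cuda.is_available():
    print(f"speedup: {cpu / bench('cuda'):.1f}x")
```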

38 citations


Journal ArticleDOI
TL;DR: This work applies dynamic parallelism for synaptic updating in SNN simulations on a GPU, eliminating the need to start many parallel applications at each time step and the associated lags of data transfer between CPU and GPU memories.

36 citations


Journal ArticleDOI
TL;DR: In this article, an improved version of the K-means clustering algorithm is augmented by a Tabu Search strategy, which is better adapted to meet the needs of big data applications.

36 citations


Patent
08 Feb 2018
TL;DR: In this article, the authors present a data processing method and device that can segment variable-length packets and perform an alignment operation on each of the segments in parallel, making the design code easy to maintain, increasing code coverage during design code verification, and at the same time markedly improving timing.
Abstract: Provided are a data processing method and device. The data processing method includes: dividing input data containing N data units corresponding to the current clock period into M data segments in order, wherein M and N are both positive integers, N is greater than or equal to 2, and M is less than N; performing an alignment operation on data units of a first type in each of the M data segments in parallel, shifting the data units of the first type to the front of data units of another type, wherein the data units of the other type are all set to an empty data packet type, the first type being a data packet type to be processed and the other type a type not to be processed; and combining the M data segments after alignment into output data containing N data units. The data processing device includes a segmentation unit, a parallel processing unit, and a combination unit. The data processing method and device in the embodiments of the present invention can segment variable-length packets and perform an alignment operation on each of the segments in parallel, making the design code easy to maintain, increasing code coverage during design code verification, and at the same time markedly improving timing.
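
A minimal software sketch of the claimed method: split the N data units of a clock period into M segments, align each segment in parallel so the to-be-processed units move to the front, then recombine. Modeling the "other type" units directly as empty packets is a simplifying assumption.

```python
from concurrent.futures import ThreadPoolExecutor

EMPTY = None  # stand-in for the patent's "empty data packet type"

def align(segment):
    # Shift to-be-processed units to the front; pad with empty units.
    kept = [u for u in segment if u is not EMPTY]
    return kept + [EMPTY] * (len(segment) - len(kept))

def process_clock_period(units, m):
    # Split the N data units of one clock period into M segments,
    # align each segment in parallel, then recombine.
    n = len(units)
    size = -(-n // m)                       # ceil(n / m)
    segments = [units[i:i + size] for i in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        return [u for seg in pool.map(align, segments) for u in seg]

print(process_clock_period(["a", EMPTY, "b", EMPTY, "c", "d"], m=2))
# ['a', 'b', None, 'c', 'd', None]
```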

Book
22 Feb 2018
TL;DR: Algorithmic Aspects of Parallel Data Processing discusses recent algorithmic developments for distributed data processing and uses a theoretical model of parallel processing called the Massively Parallel Computation (MPC) model, which is a simplification of the BSP model.
Abstract: The last decade has seen a huge and growing interest in processing large data sets on large distributed clusters. This trend began with the MapReduce framework and has been widely adopted by several other systems, including PigLatin, Hive, Scope, Dremel, Spark and Myria, to name a few. While the applications of such systems are diverse (for example, machine learning and data analytics), most involve relatively standard data processing tasks: identifying relevant data, cleaning, filtering, joining, grouping, transforming, extracting features, and evaluating results. This has generated great interest in the study of algorithms for data processing on large distributed clusters. Algorithmic Aspects of Parallel Data Processing discusses recent algorithmic developments for distributed data processing. It uses a theoretical model of parallel processing called the Massively Parallel Computation (MPC) model, a simplification of the BSP model in which the only costs are the amount of communication and the number of communication rounds. The survey studies several algorithms for multi-join queries, sorting, and matrix multiplication, and discusses their relationships and common techniques applied across the different data processing tasks.
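
To illustrate the cost model the MPC setting uses, here is a textbook one-round hash-partitioned join sketch: every tuple is exchanged exactly once, so the communication cost is one round of |R| + |S| tuples. The `p` servers are simulated locally; this is a generic example, not an algorithm taken from the book.

```python
from collections import defaultdict

def mpc_hash_join(R, S, p):
    # Round 1: route every tuple to server hash(join_key) % p.
    servers_R, servers_S = defaultdict(list), defaultdict(list)
    for a, b in R:
        servers_R[hash(b) % p].append((a, b))   # partition R on join key
    for b, c in S:
        servers_S[hash(b) % p].append((b, c))   # partition S on join key
    out = []
    for srv in range(p):            # local joins, no further rounds needed
        index = defaultdict(list)
        for a, b in servers_R[srv]:
            index[b].append(a)
        for b, c in servers_S[srv]:
            out.extend((a, b, c) for a in index[b])
    return out

print(mpc_hash_join(R=[(1, "x"), (2, "y")], S=[("x", 9), ("x", 7)], p=4))
```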

Journal ArticleDOI
TL;DR: This paper proposes parallel issue queueing (PIQ), a novel I/O scheduler at the host system that delivers significant performance improvements, especially for applications with heavy access conflicts.
Abstract: Solid state drives (SSDs) have been widely deployed in personal computers, data centers, and cloud storage. To improve performance, SSDs are usually constructed with a number of channels, each channel connecting to a number of NAND flash chips, each chip consisting of multiple dies, and each die containing multiple planes. Based on this parallel architecture, I/O requests can potentially access parallel units simultaneously. Despite the rich parallelism offered by this architecture, recent studies show that the utilization of flash parallel units is seriously low; this paper shows that the low utilization is largely caused by access conflicts among I/O requests. We propose parallel issue queueing (PIQ), a novel I/O scheduler at the host system. PIQ groups I/O requests without conflicts into the same batch and requests with conflicts into different batches, so the multiple I/O requests in one batch can be fulfilled simultaneously by exploiting the rich parallelism of SSDs. Extensive experimental results show that PIQ delivers significant performance improvements, especially for applications with heavy access conflicts.
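
A minimal sketch of the batching idea (not the exact PIQ scheduler): greedily pack requests into batches such that no two requests in a batch map to the same parallel unit, so each batch can be issued concurrently. The LBA-modulo-4 unit mapping is an illustrative assumption.

```python
def piq_batches(requests, unit_of):
    # Greedy first-fit: a request joins the first batch whose busy-unit
    # set does not already contain its channel/chip/die/plane.
    batches = []          # each batch: (set of busy units, list of requests)
    for req in requests:
        unit = unit_of(req)
        for busy, batch in batches:
            if unit not in busy:             # no access conflict here
                busy.add(unit)
                batch.append(req)
                break
        else:
            batches.append(({unit}, [req]))  # conflicts everywhere: new batch
    return [batch for _, batch in batches]

# Toy mapping: LBA modulo 4 identifies the flash parallel unit.
reqs = [0, 4, 1, 8, 5, 2]
print(piq_batches(reqs, unit_of=lambda lba: lba % 4))
# [[0, 1, 2], [4, 5], [8]]
```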

Patent
07 Dec 2018
TL;DR: In this paper, the authors present methods, systems, and compositions for parallel processing of nucleic acid samples, using sample-specific barcode sequences, which facilitate the multiplexing of samples, detection of discrete cell populations within a pooled population, and detection of partitions comprising more than one cell.
Abstract: The present disclosure provides methods, systems, and compositions for parallel processing of nucleic acid samples. Methods and systems of the present disclosure comprise the use of sample-specific barcode sequences, which facilitate the multiplexing of samples, detection of discrete cell populations within a pooled population, and detection of partitions comprising more than one cell.

Journal ArticleDOI
TL;DR: In this paper, a tile-based spatial index is proposed to manage big LiDAR data in the scalable and fault-tolerant Hadoop distributed file system, and two spatial decomposition techniques are used to enable efficient parallelization.
Abstract: Light detection and ranging (LiDAR) data are essential for scientific discoveries such as Earth and ecological sciences, environmental applications, and responding to natural disasters. While collecting LiDAR data over large areas is quite feasible, the subsequent processing steps typically involve large computational demands. Efficiently storing, managing, and processing LiDAR data are the prerequisite steps for enabling LiDAR-based applications. However, handling LiDAR data poses grand geoprocessing challenges due to data and computational intensity. To tackle such challenges, we developed a general-purpose scalable framework coupled with a sophisticated data decomposition and parallelization strategy to efficiently handle 'big' LiDAR data collections. The contributions of this research were (1) a tile-based spatial index to manage big LiDAR data in the scalable and fault-tolerant Hadoop distributed file system, (2) two spatial decomposition techniques to enable efficient parallelization o...
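
A minimal sketch of a tile-based spatial index: each point maps to a (col, row) tile key, and tiles become independent units of storage and parallel work. The tile size and origin are illustrative assumptions, not values from the paper.

```python
import math
from collections import defaultdict

def tile_id(x, y, origin=(0.0, 0.0), tile_size=100.0):
    # Map a LiDAR return to the (col, row) tile that indexes it; each
    # tile becomes one HDFS file/split, giving the decomposition for free.
    return (math.floor((x - origin[0]) / tile_size),
            math.floor((y - origin[1]) / tile_size))

def build_index(points, **kw):
    index = defaultdict(list)
    for x, y, z in points:
        index[tile_id(x, y, **kw)].append((x, y, z))
    return index   # tiles can now be processed by independent workers

index = build_index([(12.0, 5.0, 1.2), (250.0, 40.0, 0.8), (99.0, 99.0, 2.0)])
print(sorted(index))   # [(0, 0), (2, 0)]
```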

Journal ArticleDOI
TL;DR: This work developed a method of femtosecond laser (fs-laser) parallel processing assisted by wet etching to fabricate 3D micro-optical components that show a unique imaging property in multiple planes.
Abstract: This work developed a method of femtosecond laser (fs-laser) parallel processing assisted by wet etching to fabricate 3D micro-optical components. A 2D fs-laser spot array with a designed spatial distribution was generated by a spatial light modulator, and a single-pulse exposure of the entire array was used for parallel processing. By subsequent wet etching, a close-packed hexagonal 3D concave microlens array on a curved surface with a radius of approximately 120 μm was fabricated, each unit lens of which has a designable spatial distribution. Imaging characterization was carried out with a microscope and showed a unique imaging property in multiple planes. This method provides a parallel and efficient technique to fabricate 3D micro-optical devices for applications in optofluidics, optical communication, and integrated optics.

Journal ArticleDOI
TL;DR: The memristive Memory Processing Unit (mMPU) is presented, a real processing-in-memory system in which the computation is done directly in the memory cells, thus eliminating the necessity for data transfer.
Abstract: Data movement between processing and memory is the root cause of the limited performance and energy efficiency in modern von Neumann systems. To overcome the data-movement bottleneck, we present the memristive Memory Processing Unit (mMPU), a real processing-in-memory system in which the computation is done directly in the memory cells, thus eliminating the necessity for data transfer. Furthermore, with its enormous inner parallelism, this system is ideal for data-intensive applications based on single instruction, multiple data (SIMD), providing high throughput and energy efficiency.

Proceedings ArticleDOI
20 Jun 2018
TL;DR: In this paper, image processing algorithms capable of executing in parallel on several platforms (CPU and GPU) were evaluated; all algorithms were tested in TensorFlow, a novel framework for deep learning as well as image processing.
Abstract: Signal, image, and Synthetic Aperture Radar imagery algorithms are nowadays used routinely. Due to huge data volumes and complexity, processing them in real time is almost impossible. Image processing algorithms are often inherently parallel in nature, so they fit nicely onto parallel architectures such as multicore Central Processing Units (CPUs) and Graphics Processing Units (GPUs). In this paper, image processing algorithms capable of executing in parallel on several platforms (CPU and GPU) were evaluated. All algorithms were tested in TensorFlow, a novel framework for deep learning as well as image processing. Relative speedups compared to the CPU are given for all algorithms. The TensorFlow GPU implementation can outperform multi-core CPUs for the tested algorithms, with obtained speedups ranging from 3.6 to 15 times.

Journal ArticleDOI
TL;DR: Novel methods, referred to as Opass, are proposed to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems, benefiting parallel data-intensive analysis with balanced data access.
Abstract: The distributed file system HDFS is widely deployed as the bedrock for many parallel big data analyses. However, when multiple parallel applications run over the shared file system, the data requests from different processes/executors are served in a surprisingly imbalanced fashion on the distributed storage servers. These imbalanced access patterns arise because (a) unlike conventional parallel file systems, which use striping policies to distribute data evenly among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk file, in several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it is selected to serve the data. Therefore, on nodes serving multiple chunk files, the data requests from different processes/executors compete for shared resources such as the hard disk head and network bandwidth, resulting in degraded I/O performance. In this paper, we first conduct a complete analysis of how remote and imbalanced read/write patterns occur and how they are affected by the size of the cluster. We then propose novel methods, referred to as Opass, to optimize parallel data reads and to reduce the imbalance of parallel writes on distributed file systems. Our proposed methods can benefit parallel data-intensive analysis with various parallel data access strategies. Opass adopts new matching-based algorithms to match processes to data so as to achieve maximum data locality and balanced data access. Furthermore, to reduce the imbalance of parallel writes, Opass employs a heatmap to monitor the I/O statuses of storage nodes and performs an HM-LRU policy to select a locally optimal storage node to serve write requests. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.
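
A minimal sketch of the heatmap idea for write balancing: route each write to the least-loaded of the nodes that offer local access, then update the heatmap. The real HM-LRU policy also ages and evicts entries, which is omitted here; the node names are hypothetical.

```python
def pick_write_node(local_nodes, heatmap):
    # Consult a heatmap of per-node I/O load and route the write to the
    # least-loaded node with local access (sketch of the Opass idea).
    node = min(local_nodes, key=lambda n: heatmap.get(n, 0))
    heatmap[node] = heatmap.get(node, 0) + 1   # account for the new write
    return node

heat = {"node1": 12, "node2": 3, "node3": 7}
print(pick_write_node(["node1", "node2", "node3"], heat))  # node2
```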

Proceedings ArticleDOI
11 Nov 2018
TL;DR: ParSy is a framework that uses a novel inspection strategy along with a simple code transformation to optimize parallel sparse algorithms for shared-memory processors; its task coarsening strategy creates well-balanced tasks that execute in parallel while maintaining locality of memory accesses.
Abstract: In this work, we describe ParSy, a framework that uses a novel inspection strategy along with a simple code transformation to optimize parallel sparse algorithms for shared memory processors. Unlike existing approaches that can suffer from load imbalance and excessive synchronization, ParSy uses a novel task coarsening strategy to create well-balanced tasks that can execute in parallel, while maintaining locality of memory accesses. Code using the ParSy inspector and transformation outperforms existing highly-optimized sparse matrix algorithms such as Cholesky factorization on multi-core processors, with speedups of 2.8× and 3.1× over the MKL Pardiso and PaStiX libraries respectively.
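
ParSy's inspector builds a task graph before execution. A classic instance of such inspection, sketched below for a sparse lower-triangular solve, computes level sets whose rows may run in parallel; ParSy's task coarsening and load balancing are deliberately not modeled, and SciPy is an assumed dependency.

```python
import numpy as np
from scipy.sparse import csr_matrix

def level_sets(L):
    # Inspector (sketch): row i of a lower-triangular solve depends on
    # every row j < i with L[i, j] != 0. Rows assigned the same level
    # are mutually independent and may execute in parallel.
    n = L.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        for j in L.indices[L.indptr[i]:L.indptr[i + 1]]:
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(level.max() + 1)]
    for i in range(n):
        sets[level[i]].append(i)
    return sets

L = csr_matrix(np.array([[1., 0., 0., 0.],
                         [2., 1., 0., 0.],
                         [0., 0., 1., 0.],
                         [0., 3., 4., 1.]]))
print(level_sets(L))   # [[0, 2], [1], [3]]
```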

Journal ArticleDOI
TL;DR: A multilevel generalization of the dual‐primal finite element tearing and interconnecting (FETI‐DP) domain decomposition method is proposed for very large‐scale discrete problems to address the bottleneck associated with the solution of the coarse problems at such scales.

Proceedings ArticleDOI
Toru Baji
13 Mar 2018
TL;DR: The details of this continuous performance growth, the constant evolution in transistor count and die size, and the scalable GPU architecture are described.
Abstract: While CPU performance can no longer benefit from Moore's law, GPUs (Graphics Processing Units) continue to increase their performance by 1.5× per year. For this reason, GPUs are now widely used not only for computer graphics but also for massively parallel processing and AI (Artificial Intelligence). In this paper, the details of this continuous performance growth, the constant evolution in transistor count and die size, and the scalable GPU architecture are described.

Patent
13 Feb 2018
TL;DR: In this paper, a parallel transaction execution method based on a blockchain is proposed, in which data units on the chain are assigned index numbers and a user's parallel transaction must provide, in addition to the basic transaction contents, the data indices to be read and written during execution.
Abstract: The invention discloses a parallel transaction execution method based on a blockchain. First, the data units on the blockchain are assigned index numbers; a user's parallel transaction must provide, in addition to the basic transaction contents, the data indices that need to be read and written during transaction execution. A user's serial transaction only needs to provide the basic transaction contents. A node arranges parallel processing according to the data dependency relationships of the parallel transactions; transactions that cannot be executed concurrently, together with serial transactions, are executed in sequence.
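
A minimal sketch of dependency-based scheduling from declared read/write index sets: transactions are greedily packed into "waves" whose members can run concurrently because no write set touches another member's read or write set. The greedy first-fit policy is an assumption for illustration, not the patent's exact arrangement.

```python
def schedule(transactions):
    # Each transaction declares the data indices it reads and writes:
    # tx = (name, read_set, write_set). Two transactions conflict when
    # either one's write set intersects the other's read or write set.
    waves = []
    for tx in transactions:
        name, reads, writes = tx
        placed = False
        for wave in waves:
            conflict = any(writes & (r | w) or w & (reads | writes)
                           for _, r, w in wave)
            if not conflict:
                wave.append(tx)
                placed = True
                break
        if not placed:
            waves.append([tx])
    return waves

txs = [("t1", {1}, {2}), ("t2", {3}, {4}), ("t3", {2}, {5})]
for i, wave in enumerate(schedule(txs)):
    print(i, [name for name, _, _ in wave])
# wave 0: t1, t2 run in parallel; wave 1: t3 (it reads index 2, written by t1)
```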

Journal ArticleDOI
TL;DR: The proposed systolic array architecture for division over GF(2^m), based on the modified Stein's algorithm, has the advantage of reducing the number of flip-flops required to store the intermediate variables of the algorithm, and hence reduces the total gate count to a large extent compared to other related designs.
Abstract: This paper proposes a new systolic array architecture to perform division operations over GF(2^m) based on the modified Stein's algorithm. The systolic structure is extracted by applying a regular approach to the division algorithm: the approach starts by obtaining the dependency graph for the intended algorithm, assigns a time value to each node in the dependency graph using a scheduling function, and ends by projecting several nodes of the dependency graph onto a processing element to constitute the systolic array. The obtained design structure has the advantage of reducing the number of flip-flops required to store the intermediate variables of the algorithm and hence reduces the total gate count to a large extent compared to other related designs. The analytical results show that the proposed design outperforms the related designs in terms of area (at least a 32% reduction) and speed (at least a 60% reduction in total computation time) and has the lowest AT complexity, with reductions ranging from 80% to 94%.
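
For reference, the classical binary (Stein-type) division over GF(2^m) that such designs start from can be written compactly in software; the sketch below computes y/x with field elements packed into Python ints. It illustrates the iteration only, not the paper's dependency-graph scheduling or systolic mapping.

```python
def gf2m_div(y, x, f, m):
    # Binary (Stein-type) division y/x in GF(2^m): elements are
    # polynomials over GF(2) packed into ints; f is the irreducible
    # modulus of degree m (its top bit sits at position m).
    assert x != 0 and f >> m == 1
    a, b, u, v = x, f, y, 0
    while a != 1 and b != 1:
        while a & 1 == 0:                  # a divisible by z
            a >>= 1
            if u & 1:
                u ^= f                     # make u divisible by z too
            u >>= 1
        while b & 1 == 0:
            b >>= 1
            if v & 1:
                v ^= f
            v >>= 1
        if a.bit_length() > b.bit_length():
            a ^= b; u ^= v                 # degree-reduction step
        else:
            b ^= a; v ^= u
    return u if a == 1 else v

# GF(2^4) with f = z^4 + z + 1: (z^2 + z) / (z + 1) = z, i.e. 0b0010.
print(bin(gf2m_div(0b0110, 0b0011, 0b10011, 4)))
```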

Journal ArticleDOI
TL;DR: Voluntarily occurring preferences for serial or overlapping processing seem to depend at least partially on the risk of crosstalk between tasks, and in both crosstalk conditions individual performance efficiency was higher the more participants processed in parallel.
Abstract: The prevalence and efficiency of serial and parallel processing under multiple task demands are highly debated. In the present study, we investigated whether individual preferences for serial or overlapping (parallel) processing represent a permanent predisposition or depend on the risk of crosstalk between tasks. Two groups (n = 91) of participants were tested. One group performed a classical task switching paradigm, enforcing strictly serial processing of tasks. The second group performed the same tasks in a task-switching-with-preview paradigm, recently introduced by Reissland and Manzey (2016), which in principle allows for overlapping processing of both tasks in order to compensate for switch costs. In one condition the tasks included univalent task stimuli, whereas in the other bivalent stimuli were used, increasing the risk of crosstalk and task confusion in case of overlapping processing. The general distinction of voluntarily occurring preferences for serial or overlapping processing when performing task switching with preview was confirmed. Tracking processing-mode adjustments between the low- and high-crosstalk conditions showed that individuals identified as serial processors in the low-crosstalk condition persisted in their processing mode. In contrast, overlapping processors split into a majority who adjusted to a serial processing mode and a minority who persisted in overlapping processing when working with bivalent stimuli. Thus, the voluntarily occurring preferences for serial or overlapping processing seem to depend at least partially on the risk of crosstalk between tasks. Strikingly, in both crosstalk conditions, individual performance efficiency was higher the more participants processed in parallel.

Patent
16 Feb 2018
TL;DR: In this article, the authors propose a method and a system for generating blocks on the basis of a blockchain, together with computer equipment and a computer-readable storage medium, in the technical field of data processing.
Abstract: The invention provides a method and a system for generating blocks on the basis of a blockchain, computer equipment, and a computer-readable storage medium, and relates to the technical field of data processing. The method includes: a consensus node receiving multiple pieces of transaction information sent by a transaction-sending node and putting them into a transaction pool; acquiring the multiple pieces of transaction information from the transaction pool and packaging them into multiple packets; starting multiple threads to carry out parallel processing packet by packet to obtain transaction processing results; and generating the blocks according to the transaction processing results. The method greatly improves the processing efficiency of a blockchain system, so that single-thread running speed is no longer the performance bottleneck of the blockchain system.
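
A minimal sketch of the packet-parallel pipeline in Python: pool transactions, package them into packets, process the packets on a thread pool, and assemble the block. Packet size, thread count, and the placeholder processing function are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def process_packet(packet):
    # Placeholder transaction validation/execution for one packet.
    return [f"result({tx})" for tx in packet]

def generate_block(tx_pool, packet_size=4, threads=4):
    # Package pooled transactions into packets, process the packets on
    # multiple threads in parallel, then assemble the block from the
    # per-packet results (mirroring the patent's pipeline, simplified).
    packets = [tx_pool[i:i + packet_size]
               for i in range(0, len(tx_pool), packet_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(process_packet, packets))
    return {"txs": tx_pool, "results": [r for rs in results for r in rs]}

block = generate_block([f"tx{i}" for i in range(10)])
```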

Journal ArticleDOI
TL;DR: A methodology is developed to assess time-domain power quality state estimation (PQSE) in electrical systems based on the Kalman filter, implemented with parallel processing techniques on graphics processing units (GPUs) to reduce execution time.
Abstract: A methodology is developed to assess time-domain power quality state estimation (PQSE) in electrical systems based on the Kalman filter, implemented with parallel processing techniques on graphics processing units (GPUs) to reduce execution time. The measurements used by the state estimation algorithm are taken from the simulation and transient propagation response of the power network. The parallel Kalman filter (PKF) state estimation obtains the waveforms of busbar voltages and line currents with several sources of time-varying electromagnetic transients. The PKF is evaluated using the compute unified device architecture (CUDA) platform and the CUDA basic linear algebra subprograms (cuBLAS) library; the parallel filter is executed on GPU cards. Case studies solve the time-domain state estimation using the proposed PKF-PQSE method, obtaining a reduction in execution time while including time-varying harmonics, short-circuit faults, and load transient conditions. The speed-up depends on the number of state variables modeling the electrical system under analysis. The PKF-PQSE results are successfully compared and validated against the Power Systems Computer Aided Design/Electromagnetic Transients including DC (PSCAD/EMTDC) simulator.
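
For reference, one predict-and-correct step of a standard linear Kalman filter, batched over many independent estimates with NumPy; the batched matrix products stand in for the paper's CUDA/cuBLAS parallelization, and the constant-velocity model is an illustrative assumption.

```python
import numpy as np

def kf_step(x, P, z, A, H, Q, R):
    # One predict+update step of a linear Kalman filter, batched over
    # the leading axis so many estimators advance at once.
    x = x @ A.T                                  # predict state
    P = A @ P @ A.T + Q                          # predict covariance
    S = H @ P @ H.T + R                          # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain, (n, d, 1)
    innov = (z - x @ H.T)[:, :, None]            # measurement residual
    x = x + (K @ innov)[:, :, 0]                 # correct state
    P = P - K @ H @ P                            # correct covariance
    return x, P

n, d = 1000, 2                                   # 1000 filters, 2-state model
A = np.array([[1.0, 1.0], [0.0, 1.0]])           # constant-velocity dynamics
H = np.array([[1.0, 0.0]])                       # position-only measurement
Q, R = 0.01 * np.eye(d), np.array([[0.1]])
x, P = np.zeros((n, d)), np.tile(np.eye(d), (n, 1, 1))
z = np.random.default_rng(0).standard_normal((n, 1))
x, P = kf_step(x, P, z, A, H, Q, R)
```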

Journal ArticleDOI
TL;DR: The design, implementation and evaluation of G-Storm is presented, a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing.
Abstract: The Single Instruction Multiple Data (SIMD) architecture of Graphic Processing Units (GPUs) makes them perfect for parallel processing of big data. In this paper, we present the design, implementation and evaluation of G-Storm , a GPU-enabled parallel system based on Storm, which harnesses the massively parallel computing power of GPUs for high-throughput online stream data processing. G-Storm has the following desirable features: 1) G-Storm is designed to be a general data processing platform as Storm, which can handle various applications and data types. 2) G-Storm exposes GPUs to Storm applications while preserving its easy-to-use programming model. 3) G-Storm achieves high-throughput and low-overhead data processing with GPUs. 4) G-Storm accelerates data processing further by enabling Direct Data Transfer (DDT), between two executors that process data at a common GPU. We implemented G-Storm based on Storm 0.9.2 and tested it using three different applications, including continuous query, matrix multiplication and image resizing. Extensive experimental results show that 1) Compared to Storm, G-Storm achieves over 7× improvement on throughput for continuous query, while maintaining reasonable average tuple processing time. It also leads to 2.3× and 1.3× throughput improvements on the other two applications, respectively. 2) DDT significantly reduces data processing time.

Journal ArticleDOI
TL;DR: This Letter presents the architecture implementation and testing of an single instruction multiple data (SIMD) processor for energy aware embedded morphological visual processing using the simplicial piece-wise linear approximation.
Abstract: This Letter presents the architecture implementation and testing of an single instruction multiple data (SIMD) processor for energy aware embedded morphological visual processing using the simplicial piece-wise linear approximation. The architecture comprises a linear array of 48 × 48 processing elements, each connected to an eight-neighbour clique operating on binary input and state data. The architecture is synthesised from a custom designed ultra low-voltage CMOS library and fabricated in a 55 nm CMOS technology. The chip is capable of dynamic voltage/frequency scaling with power supplies between 0.5 and 1.2 V. The fabricated chip achieves an overall performance of 293 TOPS/W with dynamic energy dissipation efficiency of 3.4 fJ per output operation at 0.6 V.