
Showing papers on "Parallel algorithm published in 2011"


Journal ArticleDOI
TL;DR: This work investigates two representative ways of approximating the dense similarity matrix, picks the strategy of sparsifying the matrix by retaining nearest neighbors, and parallelizes it so that large clustering problems can be handled effectively.
Abstract: Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863 instances, we show that our parallel algorithm can effectively handle large problems.

591 citations
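A minimal serial sketch of the retained-nearest-neighbor strategy described above, assuming a Gaussian similarity, the normalized spectral embedding, and plain Lloyd's k-means; the parameter names (t, sigma) are illustrative, and the paper's contribution is distributing this work across machines, which the sketch omits:

import numpy as np

def knn_sparse_spectral(X, k, t=10, sigma=1.0, iters=50, seed=0):
    """Toy spectral clustering with a t-nearest-neighbor sparsified similarity.

    X is an (n, d) array; this O(n^2) version only illustrates the strategy.
    """
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    # Sparsify: keep only the t largest similarities per row, then symmetrize.
    keep = np.argsort(S, axis=1)[:, -t:]
    W = np.zeros_like(S)
    rows = np.repeat(np.arange(n), t)
    W[rows, keep.ravel()] = S[rows, keep.ravel()]
    W = np.maximum(W, W.T)
    # Normalized Laplacian; its k smallest eigenvectors form the embedding.
    d = W.sum(1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Plain Lloyd's k-means on the row-normalized spectral embedding.
    rng = np.random.default_rng(seed)
    C = U[rng.choice(n, k, replace=False)]
    for _ in range(iters):
        labels = ((U[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = U[labels == j].mean(0)
    return labels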


Book
22 Aug 2011
TL;DR: The algorithms apply a novel “random-like” deterministic technique that provides for a fast and efficient breaking of an apparently symmetric situation in parallel and distributed computation.
Abstract: The following problem is considered: given a linked list of length n, compute the distance from each element of the linked list to the end of the list. The problem has two standard deterministic algorithms: a linear time serial algorithm, and an O(log n) time parallel algorithm using n processors. We present new deterministic parallel algorithms for the problem. Our strongest results are (1) O(log n log* n) time using n/(log n log* n) processors (this algorithm achieves optimal speed-up); (2) O(log n) time using n log^(k) n/log n processors, for any fixed positive integer k. The algorithms apply a novel “random-like” deterministic technique. This technique provides for a fast and efficient breaking of an apparently symmetric situation in parallel and distributed computation.

474 citations
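The O(log n), n-processor baseline that this paper improves on is pointer jumping; a serial numpy simulation of its rounds, with numpy's vectorized indexing standing in for the processors:

import numpy as np

def list_rank(succ):
    """Distance from each node to the end of a linked list via pointer jumping.

    succ[i] is the successor of node i; the tail points to itself. Each round
    halves the remaining path lengths, so O(log n) rounds suffice, and every
    assignment within a round is independent (one processor per node).
    """
    succ = succ.copy()
    dist = (succ != np.arange(len(succ))).astype(int)  # one hop each, except the tail
    while (succ != succ[succ]).any():
        dist = dist + dist[succ]   # add the distance my successor already covers
        succ = succ[succ]          # jump: my new successor is my successor's successor
    return dist

# list_rank(np.array([1, 2, 3, 3])) -> array([3, 2, 1, 0])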


Book
09 Sep 2011
TL;DR: In this paper, a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time is given, and the constant in the running time is small.
Abstract: We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(log n) time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and O(log n) time. The constant in the running time is still moderate, though not as small.

346 citations


Journal ArticleDOI
TL;DR: This work proposes a new parallel bi-objective hybrid genetic algorithm that takes into account not only makespan but also energy consumption, and focuses on the island parallel model and the multi-start parallel model.

327 citations


Journal ArticleDOI
TL;DR: This PEGA, consisting of two parallel EGAs along with a migration operator, maintains better population diversity, inhibits premature convergence, and preserves parallelism in comparison with conventional GAs, thus significantly expediting computation.
Abstract: This paper presents a parallel elite genetic algorithm (PEGA) and its application to global path planning for autonomous mobile robots navigating in structured environments. This PEGA, consisting of two parallel EGAs along with a migration operator, takes advantage of maintaining better population diversity, inhibiting premature convergence, and keeping parallelism in comparison with conventional GAs. The initial feasible path generated by the PEGA planner is then smoothed using the cubic B-spline technique, in order to construct a near-optimal collision-free continuous path. Both the global path planner and the smoother are implemented in one field-programmable gate array chip utilizing the system-on-a-programmable-chip technology and a pipelined hardware implementation scheme, thus significantly expediting computation speed. Simulation and experimental results show the merit of the proposed PEGA path planner and smoother for global path planning of autonomous mobile robots.

254 citations
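A toy sketch of the island model with a migration operator, in the spirit of the two parallel EGAs described above; the bit-string encoding, the operators, and every parameter here are illustrative, not the paper's:

import random

def island_ga(fitness, dim, pop=30, gens=200, migrate_every=20, elite=2, seed=0):
    """Two-island GA with elitism and periodic migration of the best individual."""
    rng = random.Random(seed)

    def rand_ind():
        return [rng.randint(0, 1) for _ in range(dim)]

    def evolve(pop_):
        pop_.sort(key=fitness, reverse=True)
        next_ = pop_[:elite]                            # elitism
        while len(next_) < len(pop_):
            a, b = rng.sample(pop_[:len(pop_) // 2], 2) # truncation selection
            cut = rng.randrange(1, dim)
            child = a[:cut] + b[cut:]                   # one-point crossover
            child[rng.randrange(dim)] ^= 1              # bit-flip mutation
            next_.append(child)
        return next_

    islands = [[rand_ind() for _ in range(pop)] for _ in range(2)]
    for g in range(gens):
        islands = [evolve(p) for p in islands]          # the two EGAs run independently
        if (g + 1) % migrate_every == 0:                # migration operator: swap elites
            best0 = max(islands[0], key=fitness)
            best1 = max(islands[1], key=fitness)
            islands[0][-1], islands[1][-1] = best1, best0
    return max(islands[0] + islands[1], key=fitness)

# e.g. island_ga(sum, dim=40) maximizes the number of ones in a 40-bit string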


Proceedings ArticleDOI
12 Nov 2011
TL;DR: In this article, the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms, is explored, and two highly-tuned parallel approaches for BFS on large parallel systems are presented: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead.
Abstract: Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.

229 citations
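A serial sketch of the level-synchronous strategy: the frontier is expanded one whole level at a time. In the paper's distributed 1D variant each process owns a slice of the vertices and, after expanding its share of the frontier, sends newly discovered vertices to their owners in one collective exchange per level; the adjacency-dict representation here is purely illustrative:

def bfs_levels(adj, source):
    """Level-synchronous BFS returning each reachable vertex's level.

    adj maps a vertex to an iterable of neighbors. Every frontier vertex in a
    level is independent work, which is where the parallelism comes from.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in level:          # first visit claims the vertex
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

# bfs_levels({0: [1, 2], 1: [3], 2: [3], 3: []}, 0) -> {0: 0, 1: 1, 2: 1, 3: 2}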


Proceedings ArticleDOI
12 Feb 2011
TL;DR: The language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code are discussed and the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations are introduced.
Abstract: Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.

225 citations
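A plain-Python sketch of the nested data-parallel style the abstract describes; the primitive names (pmap, preduce) are my own illustration, not Copperhead's actual API:

def pmap(f, *seqs):
    # elementwise map; every application is independent (the parallelism a
    # compiler like Copperhead exploits when lowering to CUDA)
    return [f(*args) for args in zip(*seqs)]

def preduce(f, seq, init):
    # associative reduction; a compiler may evaluate it as a tree
    acc = init
    for x in seq:
        acc = f(acc, x)
    return acc

def spmv(rows, x):
    """Sparse matrix-vector product over a list-of-rows representation.

    rows is a list of (value, column) lists, one inner sequence per matrix row.
    The outer pmap is flat parallelism over rows; the per-row map+reduce is the
    nested data parallelism the abstract mentions.
    """
    def row_dot(row):
        return preduce(lambda a, b: a + b,
                       pmap(lambda vc: vc[0] * x[vc[1]], row), 0.0)
    return pmap(row_dot, rows)

# spmv([[(2.0, 0), (1.0, 2)], [(3.0, 1)]], [1.0, 1.0, 1.0]) == [3.0, 3.0]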


Proceedings ArticleDOI
21 Aug 2011
TL;DR: In this article, the authors consider the problem of k-center and k-median clustering in MapReduce and develop fast clustering algorithms with constant factor approximation guarantees.
Abstract: Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

215 citations
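A sketch of the overall shape of such algorithms, assuming plain uniform sampling and Gonzalez's greedy 2-approximation as the sequential k-center subroutine; the paper's sampling scheme differs and is what carries the stated guarantees:

import random

def greedy_k_center(points, k, dist):
    """Gonzalez's farthest-point heuristic: a classical 2-approximation for k-center."""
    centers = [points[0]]
    while len(centers) < k:
        # pick the point farthest from its nearest chosen center
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    return centers

def sample_then_cluster(points, k, dist, sample_size, seed=0):
    """MapReduce-style shape: shrink the data by sampling (a map/shuffle round),
    then run the expensive sequential routine on the small sample (one reducer).
    Uniform sampling here is a simplification of the paper's scheme."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return greedy_k_center(sample, k, dist)

# euclid = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
# centers = sample_then_cluster(data, k=5, dist=euclid, sample_size=1000)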


Posted Content
TL;DR: This work designs optimal simulations of the well-established PRAM and BSP models in MapReduce, immediately resulting in optimal solutions to the problems of computing fixed-dimensional linear programming and 2-D and 3-D convex hulls.
Abstract: In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and BSP parallel models, which would benefit both the theory and practice of MapReduce algorithms. We describe efficient MapReduce algorithms for sorting, multi-searching, and simulations of parallel algorithms specified in the BSP and CRCW PRAM models. We also provide some applications of these results to problems in parallel computational geometry for the MapReduce framework, which result in efficient MapReduce algorithms for sorting, 2- and 3-dimensional convex hulls, and fixed-dimensional linear programming. For the case when mappers and reducers have a memory/message-I/O size of $M=\Theta(N^\epsilon)$, for a small constant $\epsilon>0$, all of our MapReduce algorithms for these applications run in a constant number of rounds.

201 citations
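A toy illustration of the simulation idea: one BSP superstep becomes one MapReduce round, with outgoing messages keyed by destination processor so the shuffle realizes the superstep's communication. All names are illustrative, not the paper's construction:

from collections import defaultdict

def mapreduce_round(state, compute):
    """Simulate one BSP superstep as a single MapReduce round.

    state maps processor id -> (local_state, inbox). compute(pid, local, inbox)
    returns (new_local, outgoing), where outgoing is a list of (dest_pid, msg).
    The shuffle groups messages by destination, which is exactly the
    communication a BSP superstep performs; iterating simulates the program.
    """
    # map: every processor computes locally and emits keyed messages
    emitted, new_state = [], {}
    for pid, (local, inbox) in state.items():
        new_local, outgoing = compute(pid, local, inbox)
        new_state[pid] = new_local
        emitted.extend(outgoing)
    # shuffle + reduce: deliver each message to its destination's inbox
    inboxes = defaultdict(list)
    for dest, msg in emitted:
        inboxes[dest].append(msg)
    return {pid: (local, inboxes[pid]) for pid, local in new_state.items()}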


Book
27 Aug 2011
TL;DR: A parallel algorithm for the prefix sums problem which runs in O(log n/log log n) time using n log log n/log n processors (optimal speedup) is presented.
Abstract: We present a parallel algorithm for the prefix sums problem which runs in O(log n/log log n) time using n log log n/log n processors (optimal speedup). This algorithm leads to a parallel list ranking algorithm which runs in O(log n) time using n/log n processors (optimal speedup).

199 citations
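For contrast with the result above, the classical work-efficient exclusive scan (up-sweep/down-sweep), which runs in O(log n) time with n/log n processors; the paper's faster O(log n/log log n) algorithm is more intricate and is not reproduced here. Serial code, with the iterations of each inner loop being the independent per-processor work:

def exclusive_scan(a):
    """Work-efficient exclusive prefix sum; len(a) must be a power of two."""
    t, n = list(a), len(a)
    d = 1
    while d < n:                          # up-sweep: build partial sums in a tree
        for i in range(0, n, 2 * d):      # these iterations are independent
            t[i + 2 * d - 1] += t[i + d - 1]
        d *= 2
    t[n - 1] = 0
    while d > 1:                          # down-sweep: push prefixes back down
        d //= 2
        for i in range(0, n, 2 * d):      # these iterations are independent
            t[i + d - 1], t[i + 2 * d - 1] = (
                t[i + 2 * d - 1],
                t[i + d - 1] + t[i + 2 * d - 1],
            )
    return t

# exclusive_scan([1, 2, 3, 4]) == [0, 1, 3, 6]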


Journal ArticleDOI
TL;DR: A family of very efficient parallel algorithms for radix sorting is presented, together with allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism.
Abstract: The need to rank and order data is pervasive, and many algorithms are fundamentally dependent upon sorting and partitioning operations. Prior to this work, GPU stream processors have been perceived as challenging targets for problems with dynamic and global data-dependences such as sorting. This paper presents: (1) a family of very efficient parallel algorithms for radix sorting; and (2) our allocation-oriented algorithmic design strategies that match the strengths of GPU processor architecture to this genre of dynamic parallelism. We demonstrate multiple factors of speedup (up to 3.8x) compared to state-of-the-art GPU sorting. We also reverse the performance differentials observed between GPU and multi/many-core CPU architectures by recent comparisons in the literature, including those with 32-core CPU-based accelerators. Our average sorting rates exceed 1B 32-bit keys/sec on a single GPU microprocessor. Our sorting passes are constructed from a very efficient parallel prefix scan "runtime" that incorporates three design features: (1) kernel fusion for locally generating and consuming prefix scan data; (2) multi-scan for performing multiple related, concurrent prefix scans (one for each partitioning bin); and (3) flexible algorithm serialization for avoiding unnecessary synchronization and communication within algorithmic phases, allowing us to construct a single implementation that scales well across all generations and configurations of programmable NVIDIA GPUs.
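A serial sketch of the counting/scan/scatter structure a radix sorting pass is built from; the paper fuses these phases into GPU kernels and runs the per-bin scans concurrently (its multi-scan), which this sketch does not attempt:

from itertools import accumulate

def radix_sort(keys, key_bits=32, radix_bits=4):
    """LSD radix sort as histogram + prefix scan + stable scatter per digit.

    keys are non-negative ints below 2**key_bits.
    """
    buckets = 1 << radix_bits
    for shift in range(0, key_bits, radix_bits):
        count = [0] * buckets
        for k in keys:                               # digit histogram
            count[(k >> shift) & (buckets - 1)] += 1
        start = [0] + list(accumulate(count))[:-1]   # exclusive scan: bin offsets
        out = [0] * len(keys)
        for k in keys:                               # stable scatter by digit
            d = (k >> shift) & (buckets - 1)
            out[start[d]] = k
            start[d] += 1
        keys = out
    return keys

# radix_sort([170, 45, 75, 90, 2, 802, 24, 66]) -> [2, 24, 45, 66, 75, 90, 170, 802]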

Journal Article
TL;DR: A Load Balanced Min-Min (LBMM) algorithm is proposed that reduces the makespan and increases resource utilization in grid computing; the proposed method has two phases.
Abstract: Grid computing has become a real alternative to traditional supercomputing environments for developing parallel applications that harness massive computational resources. However, the complexity incurred in building such parallel Grid-aware applications is higher than in traditional parallel computing environments, as issues such as resource discovery, heterogeneity, fault tolerance and task scheduling must be addressed. Load balanced task scheduling is a very important problem in complex grid environments, and task scheduling, which is NP-complete, has therefore become a focus of research in the grid computing area. The traditional Min-Min algorithm is a simple algorithm that produces a schedule with a smaller makespan than the other traditional algorithms in the literature, but it fails to produce a load balanced schedule. In this paper a Load Balanced Min-Min (LBMM) algorithm is proposed that reduces the makespan and increases resource utilization. The proposed method has two phases: in the first phase the traditional Min-Min algorithm is executed, and in the second phase the tasks are rescheduled to use the unutilized resources effectively.
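A sketch of phase one, the traditional Min-Min schedule, assuming an expected-time-to-compute matrix etc[t][m]; LBMM's second, rebalancing phase (moving tasks off the makespan machine) is only noted in a comment:

def min_min(etc):
    """Traditional Min-Min: repeatedly bind the task with the smallest
    earliest completion time to the machine achieving it.

    etc[t][m] is the expected time to compute task t on machine m. LBMM's
    phase two would then reschedule tasks from the most loaded machine onto
    under-used ones, which this sketch omits.
    """
    n_tasks, n_machines = len(etc), len(etc[0])
    ready = [0.0] * n_machines             # machine ready times
    unscheduled = set(range(n_tasks))
    assignment = {}
    while unscheduled:
        best = None                        # (completion_time, task, machine)
        for t in unscheduled:
            for m in range(n_machines):
                c = ready[m] + etc[t][m]
                if best is None or c < best[0]:
                    best = (c, t, m)
        c, t, m = best
        assignment[t] = m
        ready[m] = c
        unscheduled.remove(t)
    return assignment, max(ready)          # the schedule and its makespan

# min_min([[3, 1], [2, 4]]) -> ({0: 1, 1: 0}, 2)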

Journal ArticleDOI
TL;DR: It is suggested that the considerable intellectual effort needed for designing efficient algorithms for multi-core architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model.

Posted Content
TL;DR: This paper designs clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets, and focuses on the practical and popular clustering problems, k-center and k-median.
Abstract: Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [KarloffSV10]. Our algorithms use sampling to decrease the data size and they run a time consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.

Journal ArticleDOI
TL;DR: In this article, the authors discuss possible approaches to parallelizing PSO on graphics hardware within the Compute Unified Device Architecture (CUDA), a GPU programming environment from NVIDIA which supports the company's latest cards.

Journal ArticleDOI
TL;DR: This paper presents a new exact maximum clique algorithm which improves the bounds obtained in state of the art approximate coloring by reordering the vertices at each step, and significantly outperforms a current leading algorithm.

Journal ArticleDOI
TL;DR: In this article, the authors adopt a convex optimization framework where the criterion to be minimized is split into the sum of more than two terms, and an accelerated version of the Parallel Proximal Algorithm is proposed to perform the minimization.
Abstract: Regularization approaches have demonstrated their effectiveness for solving ill-posed problems. However, in the context of variational restoration methods, a challenging question remains, namely how to find a good regularizer. While total variation introduces staircase effects, wavelet-domain regularization brings other artefacts, e.g., ringing. However, a tradeoff can be made by introducing a hybrid regularization including several terms not necessarily acting in the same domain (e.g., spatial and wavelet transform domains). While this approach was shown to provide good results for solving deconvolution problems in the presence of additive Gaussian noise, an important issue is to efficiently deal with this hybrid regularization for more general noise models. To solve this problem, we adopt a convex optimization framework where the criterion to be minimized is split into the sum of more than two terms. For spatial domain regularization, isotropic or anisotropic total variation definitions using various gradient filters are considered. An accelerated version of the Parallel Proximal Algorithm is proposed to perform the minimization. Some difficulties in the computation of the proximity operators involved in this algorithm are also addressed in this paper. Numerical experiments performed in the context of Poisson data recovery show the good behavior of the algorithm as well as promising results concerning the use of hybrid regularization techniques.
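A toy instance of the basic (unaccelerated) Parallel Proximal Algorithm for a sum of m terms, each touched only through its proximity operator; equal weights and unit relaxation are assumed here, and quadratic terms keep the prox closed-form. The paper's accelerated variant and actual regularizers are not reproduced:

import numpy as np

def ppxa(proxes, x0, gamma=1.0, iters=200):
    """Basic PPXA for min_x f_1(x) + ... + f_m(x).

    proxes[i](v, tau) must return prox_{tau * f_i}(v). All prox evaluations in
    a sweep are independent, hence the 'parallel' in the name. Weights
    w_i = 1/m and relaxation lambda_n = 1 are hard-coded for simplicity.
    """
    m = len(proxes)
    y = [x0.astype(float).copy() for _ in range(m)]
    x = x0.astype(float).copy()
    for _ in range(iters):
        p = [proxes[i](y[i], gamma * m) for i in range(m)]  # gamma / w_i, parallelizable
        pbar = sum(p) / m
        y = [y[i] + 2.0 * pbar - x - p[i] for i in range(m)]
        x = pbar
    return x

# Toy check: f_i(z) = 0.5 * ||z - a_i||^2 has prox_{tau f_i}(v) = (v + tau * a_i) / (1 + tau),
# and the minimizer of the sum is the mean of the anchors a_i:
# anchors = [np.array([0.0]), np.array([4.0])]
# proxes = [lambda v, t, a=a: (v + t * a) / (1 + t) for a in anchors]
# ppxa(proxes, np.array([10.0]))  # converges to approximately [2.0]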

Journal ArticleDOI
TL;DR: A new slack reclamation algorithm is proposed by approaching the energy reduction problem from a different angle, and a novel algorithm is presented to find the combination of processor frequencies that yields optimal energy consumption.

Journal ArticleDOI
TL;DR: This work follows carefully chosen data transfer schemes in global memory for the Lattice Boltzmann Method and shows that highly efficient implementations of LBM on GPUs are possible, even for complex models.
Abstract: Emerging many-core processors, like CUDA capable nVidia GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since the global memory for graphic devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performances. Whenever possible, global memory loads and stores should be coalescent and aligned, but the propagation phase in LBM can lead to frequent misaligned memory accesses. Most previous CUDA implementations of 3D LBM addressed this problem by using low latency on chip shared memory. Instead of this, our CUDA implementation of LBM follows carefully chosen data transfer schemes in global memory. For the 3D lid-driven cavity test case, we obtained up to 86% of the global memory maximal throughput on nVidia's GT200. We show that as a consequence highly efficient implementations of LBM on GPUs are possible, even for complex models.
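A numpy sketch of the D2Q9 propagation (streaming) step, the phase whose shifted writes produce the misaligned global-memory accesses discussed above; the lattice layout, array shapes, and periodic boundaries are illustrative:

import numpy as np

# D2Q9 lattice velocities: rest, the four axis directions, the four diagonals.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]

def stream(f):
    """Propagation step of the LBM: distribution i moves one cell along c_i.

    f has shape (9, ny, nx) with periodic boundaries. Each np.roll is the
    shifted copy that, done naively on a GPU, yields the misaligned accesses
    the paper's carefully chosen transfer schemes are designed around.
    """
    return np.stack([np.roll(f[i], shift=(cy, cx), axis=(0, 1))
                     for i, (cx, cy) in enumerate(C)])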

Journal ArticleDOI
TL;DR: This work introduces a class of parallel preconditioners for the FSI problem obtained by exploiting the block-structure of the linear system and shows that the construction and evaluation of the devised preconditioner is modular.
Abstract: The increasing computational load required by most applications and the limits in hardware performances affecting scientific computing contributed in the last decades to the development of parallel software and architectures. In fluid-structure interaction (FSI) for haemodynamic applications, parallelization and scalability are key issues (see [L. Formaggia, A. Quarteroni, and A. Veneziani, eds., Cardiovascular Mathematics: Modeling and Simulation of the Circulatory System, Modeling, Simulation and Applications 1, Springer, Milan, 2009]). In this work we introduce a class of parallel preconditioners for the FSI problem obtained by exploiting the block-structure of the linear system. We stress the possibility of extending the approach to a general linear system with a block-structure, then we provide a bound in the condition number of the preconditioned system in terms of the conditioning of the preconditioned diagonal blocks, and finally we show that the construction and evaluation of the devised preconditioner is modular. The preconditioners are tested on a benchmark three-dimensional (3D) geometry discretized in both a coarse and a fine mesh, as well as on two physiological aorta geometries. The simulations that we have performed show an advantage in using the block preconditioners introduced and confirm our theoretical results.

Proceedings ArticleDOI
16 May 2011
TL;DR: This work gives a new data structure transformation, called bit masked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros, and shows how to incorporate this transformation into existing parallel algorithms without limiting their parallel scalability.
Abstract: On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bit masked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bit masked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
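A serial sketch of the symmetric kernel's arithmetic, assuming the upper triangle of A stored in CSR: each stored entry serves two updates, which is where the bandwidth halving comes from. The scattered y[j] updates are precisely what the paper's multithreaded algorithm must coordinate without serializing:

def sym_spmv(n, rowptr, colidx, vals, x):
    """y = A x for symmetric A stored as its upper triangle in CSR.

    Each stored a_ij (i <= j) feeds both y[i] += a_ij * x[j] and
    y[j] += a_ij * x[i], so only half the matrix crosses the memory bus.
    """
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            j, a = colidx[k], vals[k]
            y[i] += a * x[j]
            if j != i:              # off-diagonal entries act twice
                y[j] += a * x[i]
    return y

# A = [[2, 1], [1, 3]] stored as its upper triangle:
# sym_spmv(2, [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0], [1.0, 1.0]) == [3.0, 4.0]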

Journal ArticleDOI
TL;DR: A message passing interface (MPI) parallel approach designed to increase both the size and the speed with which hydrological proximity measures (HPMs) are computed from a Digital Elevation Model (DEM), allowing efficient analysis of much larger DEMs than was possible using the serial recursive algorithms.
Abstract: Land surface topography is one of the most important terrain properties which impact hydrological, geomorphological, and ecological processes active on a landscape. In our previous efforts to develop a soil depth model based upon topographic and land cover variables, we derived a set of hydrological proximity measures (HPMs) from a Digital Elevation Model (DEM) as potential explanatory variables for soil depth. These HPMs are variations of the distance up to ridge points (cells with no incoming flow) and variations of the distance down to stream points (cells with a contributing area greater than a threshold), following the flow path. The HPMs were computed using the D-infinity flow model that apportions flow between adjacent neighbors based on the direction of steepest downward slope on the eight triangular facets constructed in a 3 x 3 grid cell window using the center cell and each pair of adjacent neighboring grid cells in turn. The D-infinity model typically results in multiple flow paths between two points on the topography, with the result that distances may be computed as the minimum, maximum or average of the individual flow paths. In addition, each of the HPMs is calculated vertically, horizontally, and along the land surface. Previously, these HPMs were calculated using recursive serial algorithms which suffered from stack overflow problems when used to process large datasets, limiting the size of DEMs that could be analyzed. To overcome this limitation, we developed a message passing interface (MPI) parallel approach designed to both increase the size and speed with which these HPMs are computed. The parallel HPM algorithms spatially partition the input grid into stripes which are each assigned to separate processes for computation. Each of those processes then uses a queue data structure to order the processing of cells so that each cell is visited only once and the cross-process communications that are a standard part of MPI are handled in an efficient manner. This parallel approach allows efficient analysis of much larger DEMs than was possible using the serial recursive algorithms. The HPMs given here may also have other, more general modeling applicability in hydrology, geomorphology and ecology, and so are described here from a general perspective. In this paper, we present the definitions of the HPMs, the serial and parallel algorithms used in their computation and their potential applications.
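A much-simplified sketch of one HPM, the distance down to a stream along the flow path, assuming a single downstream neighbor per cell (a D8-style simplification of the paper's multi-path D-infinity model) and unit cell spacing. The explicit stack replaces the recursion whose overflow motivated the parallel redesign:

def distance_to_stream(downstream, is_stream):
    """Flow-path distance from every cell to the nearest downstream stream cell.

    downstream[c] is the single cell that c drains to, or None at an outlet.
    A recursive definition overflows the call stack on large grids, so an
    explicit stack defers each cell until its downstream neighbor is resolved;
    every cell is resolved exactly once, echoing the single-visit queue
    discipline of the MPI algorithm.
    """
    dist = {}
    for start in downstream:
        stack = [start]
        while stack:
            c = stack[-1]
            if c in dist:
                stack.pop()
            elif is_stream(c) or downstream[c] is None:
                dist[c] = 0.0          # stream cell (outlets are also left at 0 here)
                stack.pop()
            elif downstream[c] in dist:
                dist[c] = 1.0 + dist[downstream[c]]
                stack.pop()
            else:
                stack.append(downstream[c])   # resolve the downstream cell first
    return dist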

Book
12 Sep 2011
TL;DR: The mapping of algorithms structured as depth-p nested FOR loops into special-purpose systolic VLSI linear arrays is addressed by using linear functions to transform the original sequential algorithms into a form suitable for parallel execution on linear arrays.
Abstract: The mapping of algorithms structured as depth-p nested FOR loops into special-purpose systolic VLSI linear arrays is addressed. The mappings are done by using linear functions to transform the original sequential algorithms into a form suitable for parallel execution on linear arrays. A feasible mapping is derived by identifying formal criteria to be satisfied by both the original sequential algorithm and the proposed transformation function. The methodology is illustrated by synthesizing algorithms for matrix multiplication and a version of the Warshall-Floyd transitive closure algorithm.

Journal ArticleDOI
TL;DR: The paper presents a systematic framework for exploiting the potential of the decomposition structures as a way to obtain different parallel algorithms, each with a different tradeoff among convergence speed, message passing amount and distributed computation architecture.

Book
19 Aug 2011
TL;DR: Parallel algorithms are presented for important components of computational fluid dynamics algorithms, along with implementations on hypercube computers; these programs are used to solve hyperbolic and other problems.
Abstract: Parallel algorithms are presented for important components of computational fluid dynamics algorithms along with implementations on hypercube computers. These programs, used to solve hyperbolic and...

Journal ArticleDOI
TL;DR: This work demonstrates the implementation of the FETI method in a hybrid CPU–GPU computing environment and reveals the tremendous potential of this type of hybrid computing environment, resulting from the full exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs.

Journal ArticleDOI
TL;DR: This paper analyzes the bag-of-words model for visual categorization in terms of computational cost, identifies two major bottlenecks (the quantization step and the classification step), and proposes two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model.
Abstract: Visual categorization is important to manage large collections of digital images and video, where textual metadata is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the trend to increase computational power in newer CPU and GPU architectures is to increase their level of parallelism, exploiting this parallelism becomes an important direction to handle the computational cost of the bag-of-words approach. When optimizing a system based on the bag-of-words approach, the goal is to minimize the time it takes to process batches of images. In this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to (1) keep categorization accuracy intact, (2) decompose the problem, and (3) give the same numerical results. In the experiments on large scale datasets, it is shown that, by using a parallel implementation on the Geforce GTX260 GPU, classifying unseen images is 4.8 times faster than a quad-core CPU version on the Core i7 920, while giving the exact same numerical results. In addition, we show how the algorithms can be generalized to other applications, such as text retrieval and video retrieval. Moreover, when the obtained speedup is used to process extra video frames in a video retrieval benchmark, the accuracy of visual categorization is improved by 29%.
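A numpy sketch of the quantization bottleneck, using the standard expansion of squared distances so that the dominant cost becomes one dense matrix multiply, the kind of formulation that maps well onto GPU hardware; array shapes are illustrative:

import numpy as np

def quantize(descriptors, codebook):
    """Nearest visual word for each descriptor (the quantization step).

    ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c, and the ||x||^2 term is constant
    per descriptor, so the argmin needs only ||c||^2 and one dense matrix
    multiply over all descriptor/word pairs at once.
    """
    cc = (codebook ** 2).sum(axis=1)          # ||c||^2 for every visual word
    dots = descriptors @ codebook.T           # all inner products in one GEMM
    return (cc[None, :] - 2.0 * dots).argmin(axis=1)

# assignments = quantize(np.random.rand(1000, 128), np.random.rand(4096, 128))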

Proceedings ArticleDOI
29 Jul 2011
TL;DR: The results show that the strategy designed in this paper can achieve higher efficiency when doing frequent item set mining in a cloud computing environment.
Abstract: Cloud computing provides cheap and efficient solutions for storing and analyzing mass data. It is very important to research data mining strategies based on cloud computing from both theoretical and practical viewpoints. In this paper, the strategy of mining association rules in a cloud computing environment is focused on. Firstly, cloud computing, Hadoop, the MapReduce programming model, the Apriori algorithm and parallel association rule mining algorithms are introduced. Then, a parallel association rule mining strategy adapted to the cloud computing environment is designed. It includes a data set division method, a data set allocation method, an improved Apriori algorithm, and the implementation procedure of the improved Apriori algorithm on MapReduce. Finally, the Hadoop platform is built and experiments testing the performance of the strategy as well as the improved algorithm have been done. The results show that the strategy designed in this paper can achieve higher efficiency when doing frequent item set mining in a cloud computing environment.
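A toy simulation of one candidate-counting round, the MapReduce core of a parallel Apriori: mappers count candidates per data split, the shuffle groups counts by itemset, and the reducer filters by support. Function and variable names are mine, not the paper's:

from collections import Counter

def frequent_itemsets_pass(partitions, candidates, min_support):
    """One MapReduce round of candidate counting for parallel Apriori.

    Iterating this round, with candidates generated from the previous level's
    frequent itemsets, gives the full level-wise algorithm.
    """
    # map phase: each partition (one mapper per split) counts locally
    local_counts = []
    for transactions in partitions:
        c = Counter()
        for t in transactions:
            tset = set(t)
            for cand in candidates:
                if set(cand) <= tset:
                    c[cand] += 1
        local_counts.append(c)
    # shuffle + reduce phase: sum counts by itemset key, filter by support
    total = Counter()
    for c in local_counts:
        total.update(c)
    return {iset: n for iset, n in total.items() if n >= min_support}

# parts = [[("a", "b"), ("a", "c")], [("a", "b", "c")]]
# frequent_itemsets_pass(parts, [("a",), ("a", "b")], min_support=2)
# -> {("a",): 3, ("a", "b"): 2}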

Proceedings ArticleDOI
16 May 2011
TL;DR: A new CUDA-aware procedure for pivot selection is designed and selected parallel algorithms are adapted for CUDA-accelerated computation; experiments demonstrate that with a single GTX 480 GPU card, the optimal serial CPU implementation is easily outperformed by an order of magnitude.
Abstract: The problem of decomposing a directed graph into its strongly connected components is a fundamental graph problem inherently present in many scientific and commercial applications. In this paper we show how some of the existing parallel algorithms can be reformulated in order to be accelerated by NVIDIA CUDA technology. In particular, we design a new CUDA-aware procedure for pivot selection and we adapt selected parallel algorithms for CUDA accelerated computation. We also experimentally demonstrate that with a single GTX 480 GPU card we can easily outperform the optimal serial CPU implementation by an order of magnitude in most cases, 40 times on some sufficiently big instances. This is an interesting result as unlike the serial CPU case, the asymptotic complexity of the parallel algorithms is not optimal.
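A serial sketch of the Forward-Backward decomposition such parallel SCC algorithms build on: a pivot's forward and backward reachable sets intersect in one SCC, and the three leftover regions can be processed independently. Pivot choice, the step the paper tunes for CUDA, is arbitrary here:

def reach(adj, pivot, allowed):
    """Vertices of 'allowed' reachable from pivot; the fwd and bwd calls below
    are independent of each other and can run in parallel."""
    seen, stack = {pivot}, [pivot]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def fwd_bwd_scc(vertices, adj, radj):
    """Forward-Backward SCC decomposition.

    adj maps a vertex to its successors, radj to its predecessors. Every SCC
    other than the pivot's lies wholly inside one of the three leftover
    regions, so those recurse (here, via a worklist) independently.
    """
    sccs, work = [], [set(vertices)]
    while work:
        sub = work.pop()
        pivot = next(iter(sub))
        fwd = reach(adj, pivot, sub)
        bwd = reach(radj, pivot, sub)
        scc = fwd & bwd
        sccs.append(scc)
        for region in (fwd - scc, bwd - scc, sub - fwd - bwd):
            if region:
                work.append(region)
    return sccs

# adj  = {1: [2], 2: [3], 3: [1, 4]}
# radj = {2: [1], 3: [2], 1: [3], 4: [3]}
# fwd_bwd_scc([1, 2, 3, 4], adj, radj) -> [{1, 2, 3}, {4}] (in some order)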

Journal ArticleDOI
TL;DR: It is found that the proposed methodology can successfully consider the benefits of all stakeholders in the introduction of transit lanes, and the parallel GA enables the methodology to be used at real-world-network scale with shorter computer processing time.
Abstract: This paper proposes a detailed formulation to optimize transit road space priority at the network level and utilizes an efficient heuristic method to find the optimum solution. Previous approaches to transit priority have a localized focus in which only limited combinations of transit exclusive lanes could be assessed. The aim of this work is to reallocate the road space between private car and transit modes so that the system is optimized. A bilevel programming approach is adapted for this purpose. The upper level involves an objective function from the system managers' perspective, whereas at the lower level, a users' perspective is modeled. To take into account the major effects of a priority provision, three models are used: 1) a modal split; 2) a user equilibrium traffic assignment; and 3) a transit assignment. A genetic algorithm (GA) approach is used, which enables the method to be applied to large networks. Application of a parallel GA is also demonstrated in the solution method, which has a considerably shorter execution time. The methodology is applied to an example network, and results are discussed. It is found that the proposed methodology can successfully consider benefits of all stakeholders in the introduction of transit lanes. Furthermore, using parallel GA enables the methodology to be used for real-world-network scale in a shorter computer processing time.