
Showing papers on "Parallel algorithm published in 2015"


Posted Content
Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, Gang Sun 
TL;DR: A state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning, which achieves excellent results on multiple challenging computer vision benchmarks.
Abstract: We present a state-of-the-art image recognition system, Deep Image, developed using end-to-end deep learning. The key components are a custom-built supercomputer dedicated to deep learning, a highly optimized parallel algorithm using new strategies for data partitioning and communication, larger deep neural network models, novel data augmentation approaches, and usage of multi-scale high-resolution images. Our method achieves excellent results on multiple challenging computer vision benchmarks.

363 citations


Book
Richard Cole
06 Sep 2015
TL;DR: This paper provides a general method that trims a factor of O(log n) time for many applications of this technique.
Abstract: Megiddo introduced a technique for using a parallel algorithm for one problem to construct an efficient serial algorithm for a second problem. We give a general method that trims a factor of O(log n) time (or more) for many applications of this technique.

301 citations


Journal ArticleDOI
TL;DR: Numerical tests show that very few sweeps are needed to construct a factorization that is an effective preconditioner, and the amount of parallelism is large irrespective of the ordering of the matrix, and matrix ordering can be used to enhance the accuracy of the factorization rather than to increase parallelism.
Abstract: This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros in the incomplete factors can be computed in parallel and asynchronously, using one or more sweeps that iteratively improve the accuracy of the factorization. Unlike existing parallel algorithms, the amount of parallelism is large irrespective of the ordering of the matrix, and matrix ordering can be used to enhance the accuracy of the factorization rather than to increase parallelism. Numerical tests show that very few sweeps are needed to construct a factorization that is an effective preconditioner.

162 citations
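
The fixed-point sweep described in the abstract above can be illustrated with a small sketch (a toy dense-matrix version under assumed conventions, not the authors' code: unit-diagonal L, a nonzero diagonal in A, and the ILU(0) sparsity pattern). Every nonzero of L and U is recomputed from the current iterate, so all updates within a sweep are independent and could run in parallel or asynchronously.

```python
import numpy as np

def ilu0_sweeps(A, num_sweeps=3):
    # Fixed-point ILU(0) sweeps: each nonzero of L/U is updated from the
    # current iterate only, so the inner loop is embarrassingly parallel.
    # Assumes A has a nonzero diagonal; L carries a unit diagonal implicitly.
    n = A.shape[0]
    pattern = [(i, j) for i in range(n) for j in range(n) if A[i, j] != 0.0]
    L = np.eye(n) + np.tril(A, -1)          # simple initial guess
    U = np.triu(A).astype(float)
    for _ in range(num_sweeps):
        newL, newU = L.copy(), U.copy()
        for i, j in pattern:                # independent updates within a sweep
            s = sum(L[i, k] * U[k, j] for k in range(min(i, j)))
            if i > j:                       # strictly lower part -> L
                newL[i, j] = (A[i, j] - s) / U[j, j]
            else:                           # upper part (incl. diagonal) -> U
                newU[i, j] = A[i, j] - s
        L, U = newL, newU
    return L, U
```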


Book
09 Sep 2015
TL;DR: In this article, the authors present techniques for parallel divide-and-conquer, resulting in improved parallel algorithms for a number of problems including intersection detection, trapezoidal decomposition, and planar point location.
Abstract: We present techniques for parallel divide-and-conquer, resulting in improved parallel algorithms for a number of problems. The problems for which we give improved algorithms include intersection detection, trapezoidal decomposition (hence, polygon triangulation), and planar point location (hence, Voronoi diagram construction). We also give efficient parallel algorithms for fractional cascading, 3-dimensional maxima, 2-set dominance counting, and visibility from a point. All of our algorithms run in O(log n) time with either a linear or sub-linear number of processors in the CREW PRAM model.

162 citations


Journal Article
TL;DR: This book describes how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach and gives some specific examples using multiple programming models.
Abstract: In this book the authors, who are parallel computing experts and industry insiders, describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give some specific examples using multiple programming models. The book begins with two introductory chapters, one on why it is necessary to "Think Parallel" and one presenting background on the hardware trends that have led to the need for explicit parallel programming.

159 citations


Proceedings ArticleDOI
17 May 2015
TL;DR: This work builds Graph SC, a framework that provides a programming paradigm that allows non-cryptography experts to write secure code, brings parallelism to such secure implementations, and meets the need for obliviousness, thereby not leaking any private information.
Abstract: We propose introducing modern parallel programming paradigms to secure computation, enabling their secure execution on large datasets. To address this challenge, we present Graph SC, a framework that (i) provides a programming paradigm that allows non-cryptography experts to write secure code, (ii) brings parallelism to such secure implementations, and (iii) meets the need for obliviousness, thereby not leaking any private information. Using Graph SC, developers can efficiently implement an oblivious version of graph-based algorithms (including sophisticated data mining and machine learning algorithms) that execute in parallel with minimal communication overhead. Importantly, our secure version of graph-based algorithms incurs a small logarithmic overhead in comparison with the non-secure parallel version. We build Graph SC and demonstrate, using several algorithms as examples, that secure computation can be brought into the realm of practicality for big data analysis. Our secure matrix factorization implementation can process 1 million ratings in 13 hours, which is a multiple order-of-magnitude improvement over the only other existing attempt, which requires 3 hours to process 16K ratings.

152 citations


Journal ArticleDOI
TL;DR: This paper presents an architecture, protocol, and parallel algorithms for collaborative 3D mapping in the cloud with low-cost robots, as well as quantitative evaluation of localization accuracy, bandwidth usage, processing speeds, and map storage.
Abstract: This paper presents an architecture, protocol, and parallel algorithms for collaborative 3D mapping in the cloud with low-cost robots. The robots run a dense visual odometry algorithm on a smartphone-class processor. Key-frames from the visual odometry are sent to the cloud for parallel optimization and merging with maps produced by other robots. After optimization the cloud pushes the updated poses of the local key-frames back to the robots. All processes are managed by Rapyuta, a cloud robotics framework that runs in a commercial data center. This paper includes qualitative visualization of collaboratively built maps, as well as quantitative evaluation of localization accuracy, bandwidth usage, processing speeds, and map storage.

133 citations


Journal ArticleDOI
TL;DR: A new constrained tensor factorization framework is proposed in this paper, building upon the Alternating Direction Method of Multipliers (ADMoM).
Abstract: Tensor factorization has proven useful in a wide range of applications, from sensor array processing to communications, speech and audio signal processing, and machine learning. With few recent exceptions, all tensor factorization algorithms were originally developed for centralized, in-memory computation on a single machine, and the few that break away from this mold do not easily incorporate practically important constraints, such as non-negativity. A new constrained tensor factorization framework is proposed in this paper, building upon the Alternating Direction Method of Multipliers (ADMoM). It is shown that this simplifies computations, bypassing the need to solve constrained optimization problems in each iteration, and it naturally leads to distributed algorithms suitable for parallel implementation. This opens the door for many emerging big data-enabled applications. The methodology is exemplified using non-negativity as a baseline constraint, but the proposed framework can incorporate many other types of constraints. Numerical experiments are encouraging, indicating that ADMoM-based non-negative tensor factorization (NTF) has high potential as an alternative to state-of-the-art approaches.

126 citations
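
To make the "no constrained subproblem per iteration" point above concrete, here is a minimal ADMM sketch for a single nonnegative factor update of the form min over X of ||M - X B^T||_F^2 subject to X >= 0; the function and variable names are illustrative, not the authors' ADMoM code, and every step reduces to a closed-form matrix operation.

```python
import numpy as np

def admm_nls(M, B, rho=1.0, iters=50):
    # ADMM for min_X ||M - X @ B.T||_F^2 s.t. X >= 0, via the splitting X = Z.
    # Each step is closed form: a small linear solve, a projection, a dual update.
    r = B.shape[1]
    Ginv = np.linalg.inv(B.T @ B + rho * np.eye(r))   # r x r Gram matrix inverse
    X = np.zeros((M.shape[0], r))
    Z = np.zeros_like(X)                              # auxiliary nonnegative copy
    U = np.zeros_like(X)                              # scaled dual variable
    for _ in range(iters):
        X = (M @ B + rho * (Z - U)) @ Ginv            # unconstrained least-squares step
        Z = np.maximum(X + U, 0.0)                    # projection enforces X >= 0
        U = U + X - Z                                 # dual update
    return Z
```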


Proceedings ArticleDOI
15 Nov 2015
TL;DR: The compressed sparse fiber (CSF), a data structure for sparse tensors, is introduced along with a novel parallel algorithm for tensor-matrix multiplication; it offers similar operation reductions as existing compressed methods while using only a single tensor structure.
Abstract: The Canonical Polyadic Decomposition (CPD) of tensors is a powerful tool for analyzing multi-way data and is used extensively to analyze very large and extremely sparse datasets. The bottleneck of computing the CPD is multiplying a sparse tensor by several dense matrices. Algorithms for tensor-matrix products fall into two classes. The first class saves floating point operations by storing a compressed tensor for each dimension of the data. These methods are fast but suffer high memory costs. The second class uses a single uncompressed tensor at the cost of additional floating point operations. In this work, we bridge the gap between the two approaches and introduce the compressed sparse fiber (CSF), a data structure for sparse tensors, along with a novel parallel algorithm for tensor-matrix multiplication. CSF offers similar operation reductions as existing compressed methods while using only a single tensor structure. We validate our contributions with experiments comparing against state-of-the-art methods on a diverse set of datasets. Our work uses 58% less memory than the state-of-the-art while achieving 81% of the parallel performance on 16 threads.

125 citations
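
For orientation, a sketch of the uncompressed baseline (the paper's "second class" of methods) is shown below, assuming a three-way tensor in COO form and Python/NumPy as a stand-in for the real shared-memory code; the CSF algorithm reduces the floating point work of exactly this loop by factoring out computation that is shared along fibers.

```python
import numpy as np

def mttkrp_coo(indices, values, B, C, num_rows):
    # Mode-1 tensor-times-matrix chain (MTTKRP) over an uncompressed COO tensor:
    # M[i, :] += val * (B[j, :] * C[k, :]) for every nonzero (i, j, k).
    # Parallelizable across nonzeros (with per-row atomics or privatization).
    # CSF instead nests indices per fiber, so nonzeros sharing (i, j) accumulate
    # over k first and are scaled by B[j, :] only once, saving flops.
    rank = B.shape[1]
    M = np.zeros((num_rows, rank))
    for (i, j, k), v in zip(indices, values):
        M[i, :] += v * (B[j, :] * C[k, :])
    return M
```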


Proceedings ArticleDOI
25 May 2015
TL;DR: This work presents a new primitive, masked matrix multiplication, that can be beneficial especially for the enumeration case and provides results from an initial implementation for the counting case along with various optimizations for communication reduction and load balance.
Abstract: Triangle counting and enumeration are important kernels that are used to characterize graphs. They are also used to compute important statistics such as clustering coefficients. We provide a simple exact algorithm that is based on operations on sparse adjacency matrices. By parallelizing the individual sparse matrix operations, we achieve a parallel algorithm for triangle counting. The algorithm is generalizable to triangle enumeration by modifying the semiring that underlies the matrix algebra. We present a new primitive, masked matrix multiplication, that can be beneficial especially for the enumeration case. We provide results from an initial implementation for the counting case along with various optimizations for communication reduction and load balance.

124 citations
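
A compact sketch of the linear-algebraic formulation described above, using SciPy sparse matrices as a stand-in for the distributed machinery; this is the plain masked A·A variant (each triangle is counted six times in an undirected adjacency matrix), whereas the paper's parallel algorithm refines the formulation and distributes the individual sparse matrix operations.

```python
import numpy as np
from scipy.sparse import csr_matrix

def count_triangles(A):
    # (A @ A)[i, j] counts length-2 paths i -> k -> j; masking by A keeps only
    # pairs (i, j) that are themselves edges, i.e. wedges closed into triangles.
    wedges = (A @ A).multiply(A)      # "masked" sparse matrix product
    return int(wedges.sum()) // 6     # each triangle contributes 6 masked entries

# tiny usage example: a single triangle 0-1-2
A = csr_matrix(np.array([[0, 1, 1],
                         [1, 0, 1],
                         [1, 1, 0]]))
print(count_triangles(A))  # -> 1
```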


Proceedings ArticleDOI
07 Apr 2015
TL;DR: This study studies compression techniques for parallel in-memory graph algorithms, and shows that they can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncompressed graphs.
Abstract: We study compression techniques for parallel in-memory graph algorithms, and show that we can achieve reduced space usage while obtaining competitive or improved performance compared to running the algorithms on uncompressed graphs. We integrate the compression techniques into Ligra, a recent shared-memory graph processing system. This system, which we call Ligra+, is able to represent graphs using about half of the space for the uncompressed graphs on average. Furthermore, Ligra+ is slightly faster than Ligra on average on a 40-core machine with hyper-threading. Our experimental study shows that Ligra+ is able to process graphs using less memory, while performing as well as or faster than Ligra.
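
The space savings come from storing each sorted adjacency list as gaps encoded with variable-length codes. The sketch below shows that idea in Python; the byte layout and the sign handling for the first gap are simplified relative to Ligra+'s actual encoders.

```python
def encode_adjacency(neighbors):
    # Delta + varint encoding of a sorted adjacency list: gaps between
    # consecutive neighbor IDs are small in many real graphs, so most gaps
    # fit in a single byte (7 payload bits, high bit = continuation).
    out, prev = bytearray(), 0
    for v in sorted(neighbors):
        gap, prev = v - prev, v
        while True:
            byte, gap = gap & 0x7F, gap >> 7
            out.append(byte | (0x80 if gap else 0x00))
            if not gap:
                break
    return bytes(out)

def decode_adjacency(data):
    # Inverse of encode_adjacency.
    neighbors, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:               # last byte of this varint
            prev += gap
            neighbors.append(prev)
            gap, shift = 0, 0
    return neighbors
```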

Journal ArticleDOI
TL;DR: It is proposed to construct electron correlation methods that are scalable in both molecule size and aggregated parallel computational power, in the sense that the total elapsed time of a calculation becomes nearly independent of the molecular size when the number of processors grows linearly with the molecular size.
Abstract: We propose to construct electron correlation methods that are scalable in both molecule size and aggregated parallel computational power, in the sense that the total elapsed time of a calculation becomes nearly independent of the molecular size when the number of processors grows linearly with the molecular size. This is shown to be possible by exploiting a combination of local approximations and parallel algorithms. The concept is demonstrated with a linear scaling pair natural orbital local second-order Moller–Plesset perturbation theory (PNO-LMP2) method. In this method, both the wave function manifold and the integrals are transformed incrementally from projected atomic orbitals (PAOs) first to orbital-specific virtuals (OSVs) and finally to pair natural orbitals (PNOs), which allow for minimum domain sizes and fine-grained accuracy control using very few parameters. A parallel algorithm design is discussed, which is efficient for both small and large molecules, and numbers of processors, although tru...

Journal ArticleDOI
TL;DR: This work addresses the ubiquitous case where these QPs are strictly convex and proposes a dual Newton strategy that exploits the block-bandedness similarly to an interior-point method.
Abstract: Quadratic programming problems (QPs) that arise from dynamic optimization problems typically exhibit a very particular structure. We address the ubiquitous case where these QPs are strictly convex and propose a dual Newton strategy that exploits the block-bandedness similarly to an interior-point method. Still, the proposed method features warmstarting capabilities of active-set methods. We give details for an efficient implementation, including tailored numerical linear algebra, step size computation, parallelization, and infeasibility handling. We prove convergence of the algorithm for the considered problem class. A numerical study based on the open-source implementation qpDUNES shows that the algorithm outperforms both well-established general purpose QP solvers as well as state-of-the-art tailored control QP solvers significantly on the considered benchmark problems.

Journal ArticleDOI
TL;DR: This article presents parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann-based simulations on nonuniform grids, and evaluates the performance on two current petascale supercomputers.
Abstract: The lattice Boltzmann method exhibits excellent scalability on current supercomputing systems and has thus increasingly become an alternative method for large-scale non-stationary flow simulations, reaching up to a trillion grid nodes. Additionally, grid refinement can lead to substantial savings in memory and compute time. These savings, however, come at the cost of much more complex data structures and algorithms. In particular, the interface between subdomains with different grid sizes must receive special treatment. In this article, we present parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann-based simulations on non-uniform grids. Additionally, we evaluate the performance of our approach on two current petascale supercomputers. On an IBM Blue Gene/Q system, the largest weak scaling benchmarks with refined grids are executed with almost two million threads, demonstrating not only near-perfect scalability but also an absolute performance of close to a trillion lattice Boltzmann cell updates per second. On an Intel-based system, the strong scaling of a simulation with refined grids and a total of more than 8.5 million cells is demonstrated to reach a performance of less than one millisecond per time step. This enables simulations with complex, non-uniform grids and four million time steps per hour of compute time.

Journal ArticleDOI
TL;DR: A fast parallel SG method, FPSG, for shared memory systems is developed by dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, which is more efficient than state-of-the-art parallel algorithms for matrix factorization.
Abstract: Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, stochastic gradient (SG) method is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SG is difficult to be parallelized for handling web-scale problems. In this article, we develop a fast parallel SG method, FPSG, for shared memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSG is more efficient than state-of-the-art parallel algorithms for matrix factorization.
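
For reference, the underlying SGD updates being parallelized look as follows (a sequential sketch with illustrative parameter names); FPSG's contribution lies in how such updates are scheduled over cache-friendly blocks of the rating matrix so that concurrent threads never touch the same rows of P or Q.

```python
import numpy as np

def sgd_mf(ratings, num_users, num_items, rank=8, lr=0.05, reg=0.05, epochs=10, rng=None):
    # Plain SGD matrix factorization on (user, item, rating) triples.
    # A block-scheduled parallel version assigns threads to blocks of the
    # rating matrix that share no rows or columns, so updates never conflict.
    rng = rng or np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((num_users, rank))
    Q = 0.1 * rng.standard_normal((num_items, rank))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q
```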

Journal ArticleDOI
01 Aug 2015
TL;DR: A parallel scalable algorithm is provided that guarantees a polynomial speedup over sequential algorithms as the number of processors increases, and a parallel algorithm with an accuracy bound is developed for the problem of discovering top-k diversified GPARs.
Abstract: We propose graph-pattern association rules (GPARs) for social media marketing. Extending association rules for item-sets, GPARs help us discover regularities between entities in social graphs, and identify potential customers by exploring social influence. We study the problem of discovering top-k diversified GPARs. While this problem is NP-hard, we develop a parallel algorithm with accuracy bound. We also study the problem of identifying potential customers with GPARs. While it is also NP-hard, we provide a parallel scalable algorithm that guarantees a polynomial speedup over sequential algorithms with the increase of processors. Using real-life and synthetic graphs, we experimentally verify the scalability and effectiveness of the algorithms.

Journal ArticleDOI
TL;DR: An improved variation of CLPSO is proposed, called the parallel comprehensive learning particle swarm optimizer (PCLPSO), which has multiple swarms based on the master-slave paradigm that work cooperatively and concurrently.
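
As a hedged sketch of the building block involved, below is a comprehensive-learning-style velocity/position update for one particle; exemplar construction, the master-slave communication, and the multi-swarm bookkeeping that PCLPSO adds are not shown.

```python
import numpy as np

def pso_step(x, v, exemplar, w=0.7, c=1.5, rng=None):
    # One comprehensive-learning-style PSO update: the particle learns from an
    # exemplar assembled from personal bests rather than a single global best.
    # In a master-slave parallelization, several swarms run such updates
    # concurrently and the master periodically exchanges their best solutions.
    rng = rng or np.random.default_rng()
    r = rng.random(x.shape)                 # per-dimension random coefficients
    v = w * v + c * r * (exemplar - x)
    return x + v, v
```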

Journal ArticleDOI
TL;DR: This work proposes a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors using the CSR format, and proposes an efficient parallel insert method for long rows of the resulting matrix and develops a heuristic-based load balancing strategy.
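
A row-wise (Gustavson-style) CSR SpGEMM sketch in plain Python is shown below; each output row is computed independently, which is the unit of work a GPU framework distributes across threads, with the proposed insert method and load-balancing heuristics handling rows whose accumulators grow long.

```python
def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, num_rows):
    # Row-wise sparse matrix-matrix multiply on CSR arrays: for each row i of A,
    # scale the rows of B selected by A's nonzeros and merge them into row i of C.
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(num_rows):
        acc = {}                                   # sparse accumulator for row i of C
        for t in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[t], a_val[t]
            for s in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[s]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[s]
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```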

Proceedings ArticleDOI
25 May 2015
TL;DR: This paper presents and evaluates a parallel community detection algorithm derived from the state-of-the-art Louvain modularity maximization method, which is able to parallelize graphs with up to 138 billion edges on 8,192 Blue Gene/Q nodes and 1,024 P7-IH nodes.
Abstract: In this paper we present and evaluate a parallel community detection algorithm derived from the state-of-the-art Louvain modularity maximization method. Our algorithm adopts a novel graph mapping and data representation, and relies on an efficient communication runtime, specifically designed for fine-grained applications executed on large-scale supercomputers. We have been able to parallelize graphs with up to 138 billion edges on 8,192 Blue Gene/Q nodes and 1,024 P7-IH nodes. Leveraging the convergence properties of our algorithm and the efficient implementation, we can analyze communities of large-scale graphs in just a few seconds. To the best of our knowledge, this is the first parallel implementation of the Louvain algorithm that scales to these large data and processor configurations.
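
The kernel being parallelized is the Louvain local-moving step; a compact sequential sketch follows, with an assumed data layout (adj maps a vertex to (neighbor, weight) pairs, degree holds weighted degrees, total_weight is the total edge weight). The distributed implementation evaluates many such moves concurrently.

```python
from collections import defaultdict

def local_move_pass(adj, degree, community, total_weight):
    # One Louvain-style local-moving pass: each vertex greedily joins the
    # neighboring community with the largest modularity gain. Returns the
    # number of moves made in this pass.
    comm_degree = defaultdict(float)            # sum of member degrees per community
    for v, c in community.items():
        comm_degree[c] += degree[v]
    moves = 0
    for v, neighbors in adj.items():
        old_c = community[v]
        comm_degree[old_c] -= degree[v]         # temporarily remove v
        links_to = defaultdict(float)           # edge weight from v into each community
        for u, w in neighbors:
            if u != v:
                links_to[community[u]] += w
        best_c, best_gain = old_c, 0.0
        for c, w_in in links_to.items():        # gain is proportional to the
            gain = w_in - degree[v] * comm_degree[c] / (2.0 * total_weight)
            if gain > best_gain:                # modularity change of joining c
                best_c, best_gain = c, gain
        community[v] = best_c
        comm_degree[best_c] += degree[v]
        moves += best_c != old_c
    return moves
```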

Journal ArticleDOI
TL;DR: This paper proposes and analyzes three parallel hybrid extragradient methods for finding a common element of the set of solutions of equilibrium problems involving pseudomonotone bifunctions and theSet of fixed points of nonexpansive mappings in a real Hilbert space based on parallel computation.
Abstract: In this paper we propose and analyze three parallel hybrid extragradient methods for finding a common element of the set of solutions of equilibrium problems involving pseudomonotone bifunctions and the set of fixed points of nonexpansive mappings in a real Hilbert space. Based on parallel computation we can reduce the overall computational effort under widely used conditions on the bifunctions and the nonexpansive mappings. A simple numerical example is given to illustrate the proposed parallel algorithms.
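
For readers unfamiliar with the scheme, the generic two-stage extragradient step for a single bifunction is sketched below; lambda > 0 is a step size, and the paper's parallel hybrid methods execute such steps for all bifunctions simultaneously before combining the results with projection steps that also involve the nonexpansive mappings.

```latex
% Generic extragradient step for one equilibrium bifunction f on a closed
% convex set C (step size \lambda > 0); the parallel hybrid methods perform
% such steps for every bifunction f_i simultaneously.
\begin{aligned}
y_k &= \arg\min_{y \in C}\Bigl\{\lambda\, f(x_k, y) + \tfrac{1}{2}\lVert y - x_k\rVert^2\Bigr\},\\
z_k &= \arg\min_{y \in C}\Bigl\{\lambda\, f(y_k, y) + \tfrac{1}{2}\lVert y - x_k\rVert^2\Bigr\}.
\end{aligned}
```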

Journal ArticleDOI
TL;DR: In this article, a multi-objective optimization method is proposed to model transient stability as an objective function rather than an inequality constraint and consider classic transient stability constrained optimal power flow (TSCOPF) as a tradeoff procedure using Pareto ideology.
Abstract: Stability is an important constraint in power system operation, and transient stability constrained optimal power flow (OPF) has received considerable attention in recent years. In this paper, the defects of the existing models and algorithms for this problem are first analyzed; on that basis, a multi-objective optimization method is proposed. The basic idea of the proposed method is to model transient stability as an objective function rather than an inequality constraint and to treat classic transient stability constrained OPF (TSCOPF) as a tradeoff procedure using Pareto ideology. Second, a master-slave parallel elitist non-dominated sorting genetic algorithm II is used to solve the proposed multi-objective optimization problem; the parallel algorithm shows an excellent acceleration effect and provides a set of Pareto optimal solutions for decision makers to select from. An innovative weight-assigning technique based on fuzzy membership variance is also introduced for a more objective selection of the final solution. Case study results demonstrate that the proposed multi-objective method has many advantages compared with traditional TSCOPF methods.

Journal ArticleDOI
TL;DR: This paper has combined PSO with the gravitational emulation local search (GELS) algorithm to form a new method, PSO–GELS, and experimental results demonstrate the effectiveness of PSO-GELS compared to other algorithms.
Abstract: A grid computing system consists of a group of programs and resources that are spread across machines in the grid. A grid system has a dynamic environment and decentralized distributed resources, so it is important to provide efficient scheduling for applications. Task scheduling is an NP-hard problem; deterministic algorithms are inadequate, and heuristic algorithms such as particle swarm optimization (PSO) are needed to solve the problem. PSO is a simple parallel algorithm that can be applied in different ways to resolve optimization problems. PSO searches the problem space globally and needs to be combined with other methods to search locally as well. In this paper, we propose a hybrid-scheduling algorithm to solve the independent task-scheduling problem in grid computing. We have combined PSO with the gravitational emulation local search (GELS) algorithm to form a new method, PSO-GELS. Our experimental results demonstrate the effectiveness of PSO-GELS compared to other algorithms.

Proceedings ArticleDOI
25 May 2015
TL;DR: In this paper, the label propagation technique was adapted for multilevel graph partitioning, and a highly parallel evolutionary algorithm was applied to the coarsest graph to obtain very high quality.
Abstract: Processing large complex networks like social networks or web graphs has recently attracted considerable interest. To do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners, originally developed for more regular mesh-like networks, do not work well for these networks. This paper addresses this problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable for both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsest graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very large. As an example, our algorithm partitions a web graph with 3.3G edges in 16 seconds using 512 cores of a high-performance cluster while producing a high quality partition -- none of the competing systems can handle this graph on our system.
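
A sketch of size-constrained label propagation, the ingredient that makes the technique usable for both coarsening and refinement; the data layout is assumed (adj maps a vertex to (neighbor, weight) pairs, blocks maps vertices to block ids), and the real system runs this in parallel and applies the evolutionary algorithm at the coarsest level.

```python
from collections import defaultdict

def size_constrained_label_propagation(adj, blocks, max_block_size, rounds=5):
    # A vertex adopts the block that is most common (by edge weight) among its
    # neighbors, but only if that block still has room under the size constraint.
    size = defaultdict(int)
    for v, b in blocks.items():
        size[b] += 1
    for _ in range(rounds):
        changed = False
        for v, neighbors in adj.items():
            votes = defaultdict(float)
            for u, w in neighbors:
                votes[blocks[u]] += w                  # weighted neighbor vote
            for b, _ in sorted(votes.items(), key=lambda kv: -kv[1]):
                if b == blocks[v]:
                    break                              # already in the best feasible block
                if size[b] < max_block_size:           # obey the size constraint
                    size[blocks[v]] -= 1
                    size[b] += 1
                    blocks[v] = b
                    changed = True
                    break
        if not changed:
            break
    return blocks
```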

Journal ArticleDOI
TL;DR: The scaling capability of the Birmingham parallel genetic algorithm is demonstrated through its application to the global optimisation of iridium clusters with 10 to 20 atoms, a catalytically important system with interesting size-specific effects.
Abstract: A new open-source parallel genetic algorithm, the Birmingham parallel genetic algorithm, is introduced for the direct density functional theory global optimisation of metallic nanoparticles. The program utilises a pool genetic algorithm methodology for the efficient use of massively parallel computational resources. The scaling capability of the Birmingham parallel genetic algorithm is demonstrated through its application to the global optimisation of iridium clusters with 10 to 20 atoms, a catalytically important system with interesting size-specific effects. This is the first study of its type on iridium clusters of this size and the parallel algorithm is shown to be capable of scaling beyond previous size restrictions and accurately characterising the structures of these larger system sizes. By globally optimising the system directly at the density functional level of theory, the code captures the cubic structures commonly found in sub-nanometre sized Ir clusters.

Proceedings ArticleDOI
13 Jun 2015
TL;DR: In this paper, the authors use exponential start time clustering to design faster parallel graph algorithms involving distances, and give linear-work parallel algorithms that construct spanners with O(k) stretch and size O(n^(1+1/k)) in unweighted graphs and size O(n^(1+1/k) log k) in weighted graphs, as well as hopsets that yield an O(m polylog n)-work parallel algorithm for approximating shortest paths in undirected graphs.
Abstract: We use exponential start time clustering to design faster parallel graph algorithms involving distances. Previous algorithms usually rely on graph decomposition routines with strict restrictions on the diameters of the decomposed pieces. We weaken these bounds in favor of stronger local probabilistic guarantees. This allows more direct analyses of the overall process, giving: linear-work parallel algorithms that construct spanners with O(k) stretch and size O(n^(1+1/k)) in unweighted graphs and size O(n^(1+1/k) log k) in weighted graphs; and hopsets that lead to the first parallel algorithm for approximating shortest paths in undirected graphs with O(m polylog n) work.
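
The clustering primitive can be pictured as a multi-source search in which every vertex receives an exponentially distributed head start and each vertex joins the source that reaches it first; a small sequential sketch under that reading is below (the parallel algorithms realize this with BFS-like rounds rather than a priority queue).

```python
import heapq
import random

def exponential_start_time_clustering(adj, beta=0.2, seed=0):
    # Every vertex u draws a head start delta_u ~ Exp(beta); vertex v joins the
    # cluster of the source u minimizing dist(u, v) - delta_u, computed here as
    # one Dijkstra-style search with shifted start times (unit-length edges).
    # Smaller beta gives larger head starts and hence fewer, larger clusters.
    rng = random.Random(seed)
    shift = {u: rng.expovariate(beta) for u in adj}
    owner = {}
    heap = [(-shift[u], u, u) for u in adj]           # (start time, source, vertex)
    heapq.heapify(heap)
    while heap:
        d, src, v = heapq.heappop(heap)
        if v in owner:
            continue                                  # already claimed by an earlier arrival
        owner[v] = src
        for w in adj[v]:
            if w not in owner:
                heapq.heappush(heap, (d + 1, src, w))
    return owner                                      # vertex -> cluster center
```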

Journal ArticleDOI
Kun Guo, Wenzhong Guo, Yuzhong Chen, Qirong Qiu, Qishan Zhang
TL;DR: Three strategies, namely, localizing propagation of affinity messages, relaxing self-exemplar constraints, and hierarchical processing, are employed in the algorithm to achieve reasonable time and space complexities in social networks.

Book
01 Jun 2015
TL;DR: This fully revised edition includes the latest enhancements in OpenCL 2.0, including shared virtual memory to increase programming flexibility and reduce data transfers that consume resources, and dynamic parallelism, which reduces processor load and avoids bottlenecks.
Abstract: Heterogeneous Computing with OpenCL 2.0 teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures: multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units (APUs). This fully revised edition includes the latest enhancements in OpenCL 2.0, including shared virtual memory to increase programming flexibility and reduce data transfers that consume resources; dynamic parallelism, which reduces processor load and avoids bottlenecks; and improved imaging support and integration with OpenGL. Designed to work on multiple platforms, OpenCL will help you more effectively program for a heterogeneous future. Written by leaders in the parallel computing and OpenCL communities, this book explores memory spaces, optimization techniques, extensions, debugging and profiling. Multiple case studies and examples illustrate high-performance algorithms, distributing work across heterogeneous systems, and embedded domain-specific languages, and they give you hands-on OpenCL experience to address a range of fundamental parallel algorithms. The book provides updated content covering the latest developments in OpenCL 2.0, including improvements in memory handling, parallelism, and imaging support; explanations of principles and strategies for learning parallel programming with OpenCL, from understanding the abstraction models to thoroughly testing and debugging complete applications; and example code covering image analytics, web plugins, particle simulations, video editing, performance optimization, and more.

Journal ArticleDOI
TL;DR: The scalability performances and the efficiency of the parallel algorithm are shown, and the robustness of the method is tested on complex and strongly connected DFN configurations which would be very difficult to mesh using conventional app...
Abstract: Flows in fractured media have been modeled using many different approaches in order to get reliable and efficient simulations for many critical applications. The common issues to be tackled are the wide range of scales involved in the phenomenon, the complexity of the domain, and the huge computational cost. In the present paper we propose a parallel implementation of the PDE-constrained optimization method presented in [S. Berrone, S. Pieraccini, and S. Scialo, SIAM J. Sci. Comput., 35 (2013), pp. B487--B510; S. Berrone, S. Pieraccini, and S. Scialo, SIAM J. Sci. Comput., 35 (2013), pp. A908--A935; S. Berrone, S. Pieraccini, and S. Scialo, J. Comput. Phys., 256 (2014), pp. 838--853] for dealing with arbitrary discrete fracture networks (DFNs) on nonconforming grids. We show the scalability performances and the efficiency of the parallel algorithm, and we also test the robustness of the method on complex and strongly connected DFN configurations which would be very difficult to mesh using conventional app...

Journal ArticleDOI
TL;DR: The results show that PHPSO, when used to solve the one-dimensional heat conduction equation, outperforms two parallel algorithms as well as HPSO itself, and exhibits strong robustness and high speedup.

Journal ArticleDOI
TL;DR: A parallel CRF algorithm called MapReduce CRF (MRCRF) is proposed in this paper, which contains two parallel sub-algorithms to handle two time-consuming steps of the CRF model and outperforms other competing methods in terms of time efficiency and correctness.
Abstract: Processing large volumes of data has presented a challenging issue, particularly in data-redundant systems. As one of the most recognized models, the conditional random fields (CRF) model has been widely applied in biomedical named entity recognition (Bio-NER). Due to the internally sequential feature, performance improvement of the CRF model is nontrivial, which requires new parallelized solutions. By combining and parallelizing the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and Viterbi algorithms, we propose a parallel CRF algorithm called MapReduce CRF (MRCRF) in this paper, which contains two parallel sub-algorithms to handle two time-consuming steps of the CRF model. The MapReduce L-BFGS (MRLB) algorithm leverages the MapReduce framework to enhance the capability of estimating parameters. Furthermore, the MapReduce Viterbi (MRVtb) algorithm infers the most likely state sequence by extending the Viterbi algorithm with another MapReduce job. Experimental results show that the MRCRF algorithm outperforms other competing methods by exhibiting significant performance improvement in terms of time efficiency as well as preserving a guaranteed level of correctness.
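
As one concrete piece, standard log-space Viterbi decoding for a single sequence is sketched below with assumed NumPy inputs; an MRVtb-style MapReduce job parallelizes exactly this computation by assigning different sequences to different map tasks.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, observations):
    # Viterbi decoding in log space: the most likely state sequence for one
    # observation sequence. log_start: (S,), log_trans: (S, S) indexed as
    # [previous, next], log_emit: (S, V), observations: list of symbol indices.
    log_start, log_trans, log_emit = map(np.asarray, (log_start, log_trans, log_emit))
    num_states, T = len(log_start), len(observations)
    score = np.full((T, num_states), -np.inf)
    back = np.zeros((T, num_states), dtype=int)
    score[0] = log_start + log_emit[:, observations[0]]
    for t in range(1, T):
        for s in range(num_states):
            cand = score[t - 1] + log_trans[:, s]     # best predecessor of state s
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]] + log_emit[s, observations[t]]
    path = [int(np.argmax(score[-1]))]                # best final state, then backtrack
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```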