Book Chapter

Geometric Spanners in the MapReduce Model

TL;DR: This paper proposes an efficient MapReduce algorithm for constructing a geometric spanner in a constant number of rounds, using a linear amount of communication.
Abstract: A geometric spanner on a point set is a sparse graph that approximates the Euclidean distances between all pairs of points in the point set. Here, we intend to construct a geometric spanner for a massive point set, using a distributed algorithm on parallel machines. In particular, we use the MapReduce model of computation to construct spanners in several rounds, with inter-machine communication between rounds. An algorithm in this model is called efficient if it uses a sublinear number of machines and runs in a polylogarithmic number of rounds. In this paper, we propose an efficient MapReduce algorithm for constructing a geometric spanner in a constant number of rounds, using a linear amount of communication. The stretch factor of our spanner is \(1+\epsilon \), for any \(\epsilon >0\).
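For orientation, here is a minimal single-machine sketch of the classical greedy t-spanner with t = 1 + ε, a standard baseline from the spanner literature; it is not the paper's MapReduce construction, which distributes the work across machines.

```python
import heapq
from itertools import combinations
from math import dist

def greedy_spanner(points, eps):
    """Classical sequential greedy t-spanner, t = 1 + eps.

    Examine pairs in increasing distance order; add an edge only if the
    spanner built so far stretches the pair by more than t.  Roughly
    O(n^3 log n) time: a baseline, not the paper's MapReduce method.
    """
    t = 1.0 + eps
    n = len(points)
    pairs = sorted(combinations(range(n), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    adj = [[] for _ in range(n)]  # adjacency list with edge weights

    def spanner_dist(src, dst):
        # Dijkstra over the partial spanner.
        best = [float("inf")] * n
        best[src] = 0.0
        pq = [(0.0, src)]
        while pq:
            d, u = heapq.heappop(pq)
            if u == dst:
                return d
            if d > best[u]:
                continue
            for v, w in adj[u]:
                if d + w < best[v]:
                    best[v] = d + w
                    heapq.heappush(pq, (best[v], v))
        return best[dst]

    edges = []
    for p, q in pairs:
        d_pq = dist(points[p], points[q])
        if spanner_dist(p, q) > t * d_pq:
            adj[p].append((q, d_pq))
            adj[q].append((p, d_pq))
            edges.append((p, q))
    return edges
```

A pair gets an edge only when its current spanner distance still exceeds t times its Euclidean distance, which is what keeps the graph sparse.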
Citations
Journal Article
TL;DR: The key idea is to consider the t-spanner as an approximation of the complete graph of distances among the objects, and use it as a compact device to simulate the large matrix of distances required by successful search algorithms like AESA.
Abstract: A t-spanner, a subgraph that approximates graph distances within a precision factor t, is a well known concept in graph theory. In this paper we use it in a novel way, namely as a data structure for searching metric spaces. The key idea is to consider the t-spanner as an approximation of the complete graph of distances among the objects, and use it as a compact device to simulate the large matrix of distances required by successful search algorithms like AESA [Vidal 1986]. The t-spanner provides a time-space tradeoff where full AESA is just one extreme. We show that the resulting algorithm is competitive against current approaches, e.g., 1.5 times the time cost of AESA using only 3.21% of its space requirement, in a metric space of strings; and 1.09 times the time cost of AESA using only 3.83% of its space requirement, in a metric space of documents. We also show that t-spanners provide better space-time tradeoffs than classical alternatives such as pivot-based indexes. Furthermore, we show that the concept of t-spanners has potential for large improvements.

1 citation
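A minimal sketch of that simulation, under assumptions: dG is a hypothetical precomputed table of t-spanner graph distances, and metric is a callable for true distances. Since d(p,x) ≤ dG[p][x] ≤ t·d(p,x), the triangle inequality gives a lower bound on d(q,x) that discards candidates without evaluating their true distance.

```python
def spanner_lower_bound(d_qp, dG_px, t):
    """Lower bound on d(q, x) via pivot p, using only the t-spanner
    distance dG_px, which satisfies d(p, x) <= dG_px <= t * d(p, x).
    Triangle inequality: d(q, x) >= d(q, p) - d(p, x)  and
                         d(q, x) >= d(p, x) - d(q, p)."""
    return max(d_qp - dG_px,      # uses d(p, x) <= dG_px
               dG_px / t - d_qp,  # uses d(p, x) >= dG_px / t
               0.0)

def aesa_range_query(query, candidates, pivots, dG, metric, t, radius):
    """AESA-style range search where the exact distance matrix is
    replaced by t-spanner graph distances dG[p][x] (sketch only)."""
    alive = set(candidates)
    hits = []
    for p in pivots:
        alive.discard(p)
        d_qp = metric(query, p)
        if d_qp <= radius:
            hits.append(p)
        # Discard every candidate whose lower bound already exceeds r.
        alive = {x for x in alive
                 if spanner_lower_bound(d_qp, dG[p][x], t) <= radius}
    # Survivors still need one real distance evaluation each.
    hits.extend(x for x in alive if metric(query, x) <= radius)
    return hits
```

Full AESA corresponds to t = 1, where dG stores exact distances and the bound collapses to |d(q,p) − d(p,x)|.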

Journal Article
TL;DR: In this paper, packedness (the maximum curve length inside any disk, divided by the disk's radius) is shown to admit a constant factor approximation when a fat shape, such as a square, is used in place of the disk.
Abstract: Packedness is a measure defined for curves as the maximum, over all disks, of the curve length inside the disk divided by the disk's radius. Sparsification allows us to reduce the number of candidate disks for maximum packedness to a polynomial amount in terms of the number of vertices of the polygonal curve. This gives an exact algorithm for computing packedness. We prove that using a fat shape, such as a square, instead of a disk gives a constant factor approximation for packedness. Further sparsification using well-separated pair decomposition improves the time complexity at the cost of losing some accuracy. By adjusting the ratio of the separation factor and the size of the query, we improve the approximation factor of the existing algorithm for packedness using square queries. Our experiments show that uniform sampling works well for finding the average packedness of trajectories with almost constant speed. The empirical results confirm that the sparsification method approximates the maximum packedness for arbitrary polygonal curves. In big data models such as massively parallel computation, both sampling and sparsification are efficient and take a constant number of rounds. Most existing algorithms use line sweeping, which is sequential in nature. Also, we design two data structures for computing the length of the curve inside a query shape: an exact data structure for disks, called hierarchical aggregated queries, and an approximate data structure for a given set of square queries. Using our modified segment tree, we achieve a near-linear time approximation algorithm.
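To make the maximized quantity concrete, here is a brute-force sketch: analytic clipping of each segment against a disk, maximized over a naive candidate set (disks centered at vertices, with radii set by inter-vertex distances). The candidate set is only illustrative; the paper's sparsification is what makes a polynomial candidate set provably sufficient.

```python
from math import dist, sqrt

def length_in_disk(curve, center, r):
    """Total length of the parts of a polygonal curve (list of (x, y)
    vertices) inside the disk of the given center and radius r."""
    total = 0.0
    for (ax, ay), (bx, by) in zip(curve, curve[1:]):
        dx, dy = bx - ax, by - ay
        fx, fy = ax - center[0], ay - center[1]
        a = dx * dx + dy * dy
        if a == 0.0:
            continue                      # degenerate segment
        b = 2.0 * (fx * dx + fy * dy)
        c = fx * fx + fy * fy - r * r
        disc = b * b - 4.0 * a * c
        if disc <= 0.0:
            continue                      # misses (or grazes) the disk
        s1 = (-b - sqrt(disc)) / (2.0 * a)
        s2 = (-b + sqrt(disc)) / (2.0 * a)
        lo, hi = max(0.0, s1), min(1.0, s2)
        if hi > lo:
            total += (hi - lo) * sqrt(a)  # clipped fraction * |segment|
    return total

def packedness(curve):
    """max over candidate disks of length(curve inside disk) / radius,
    with the naive candidate set described in the lead-in."""
    best = 0.0
    for c in curve:
        for v in curve:
            r = dist(c, v)
            if r > 0.0:
                best = max(best, length_in_disk(curve, c, r) / r)
    return best
```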
Posted Content
TL;DR: In this article, a polygonal curve with $n$ vertices is defined to be $c$-packed if the sum of the lengths of the parts of the edges of the curve that are inside any disk of radius $r$ is at most $cr$, for any $r>0$.
Abstract: A polygonal curve $P$ with $n$ vertices is $c$-packed if the sum of the lengths of the parts of the edges of the curve that are inside any disk of radius $r$ is at most $cr$, for any $r>0$. Similarly, the concept of $c$-packedness can be defined for any scaling of a given shape. Assuming $L$ is the diameter of $P$ and $\delta$ is the minimum distance between points on disjoint edges of $P$, we show that the existing $O(\frac{\log (L/\delta)}{\epsilon}n^3)$ time algorithm is a $(1+\epsilon)$-approximation algorithm. The massively parallel versions of these algorithms run in $O(\log (L/\delta))$ rounds. We improve the existing $O((\frac{n}{\epsilon^3})^{\frac 4 3}\operatorname{polylog} \frac n \epsilon)$ time $(6+\epsilon)$-approximation algorithm by providing a $(4+\epsilon)$-approximation $O(n(\log^2 n)(\log^2 \frac{1}{\epsilon})+\frac{n}{\epsilon})$ time algorithm, and we give an $O(n^2)$ time $2$-approximation algorithm, improving the existing $O(n^2\log n)$ time $2$-approximation algorithm. Our exact $c$-packedness algorithm takes $O(n^5)$ time, which is the first exact algorithm for disks. We show that using $\alpha$-fat shapes instead of disks adds a factor $\alpha^2$ to the approximation. We also give a data structure for computing the curve length inside query disks. It has $O(n^6\log n)$ construction time, uses $O(n^6)$ space, and has query time $O(\log n+k)$, where $k$ is the number of segments intersecting the query shape. We also give a massively parallel algorithm for relative $c$-packedness with $O(1)$ rounds.
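The fat-shape (square) queries above can be answered with standard parametric clipping. A minimal sketch, assuming axis-aligned square queries given by center and half side, using Liang-Barsky clipping per segment:

```python
def length_in_square(curve, center, half_side):
    """Length of a polygonal curve inside an axis-aligned square query,
    via Liang-Barsky parametric clipping of each segment."""
    xmin, xmax = center[0] - half_side, center[0] + half_side
    ymin, ymax = center[1] - half_side, center[1] + half_side
    total = 0.0
    for (ax, ay), (bx, by) in zip(curve, curve[1:]):
        dx, dy = bx - ax, by - ay
        t0, t1 = 0.0, 1.0
        ok = True
        # One (p, q) pair per square edge: left, right, bottom, top.
        for p, q in ((-dx, ax - xmin), (dx, xmax - ax),
                     (-dy, ay - ymin), (dy, ymax - ay)):
            if p == 0.0:
                if q < 0.0:          # parallel to this edge and outside
                    ok = False
                    break
            else:
                t = q / p
                if p < 0.0:          # entering: raise lower clip bound
                    t0 = max(t0, t)
                else:                # exiting: lower upper clip bound
                    t1 = min(t1, t)
        if ok and t0 < t1:
            total += (t1 - t0) * (dx * dx + dy * dy) ** 0.5
    return total
```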
Posted Content
TL;DR: This work reviews sample algorithms that exploit symmetry and gives several new ones, for finding lower bounds, beating adversaries in online algorithms, designing parallel algorithms, and summarizing data.
Abstract: We call an objective function or algorithm symmetric with respect to an input if, after swapping two parts of the input, the solution and the output of the algorithm remain the same. More formally, for a permutation $\pi$ of an indexed input and another permutation $\pi'$ of the same input such that swapping two items converts $\pi$ to $\pi'$, $f(\pi)=f(\pi')$, where $f$ is the objective function. After reviewing samples of the algorithms that exploit symmetry, we give several new ones, for finding lower bounds, beating adversaries in online algorithms, designing parallel algorithms, and summarizing data. We show how to use the symmetry between the sampled points to get a lower/upper bound on the solution. This mostly depends on the equivalence class of the parts of the input that, when swapped, do not change the solution or its cost.
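A toy instance of the definition, as a minimal sketch: the sum of pairwise distances is a symmetric objective, so swapping two items of the input leaves f unchanged.

```python
from itertools import combinations
from math import dist

def total_pairwise_distance(points):
    """A symmetric objective: invariant under any permutation of the input."""
    return sum(dist(p, q) for p, q in combinations(points, 2))

pts = [(0, 0), (3, 4), (6, 8)]
swapped = [pts[1], pts[0], pts[2]]  # swap two items of the input
assert total_pairwise_distance(pts) == total_pairwise_distance(swapped)
```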

Cites methods from "Geometric Spanners in the MapReduce..."

  • ...In [1], an O(1)-round algorithm for indexing the vertices based on a grid in a balanced manner is given....


TL;DR: This paper presents a geometric spanner construction in the Massively Parallel Computing (MPC) model: a modified distributed range tree finds the nearest point to the apex of a θ-cone efficiently, forming a (1 + ϵ)-spanner in O(1) rounds and Õ(S) time, where S is the memory size of a single machine.
Abstract: The importance of processing large-scale data is growing rapidly in contemporary computation. In order to design and analyze practical distributed algorithms, recently, the MPC model has been introduced as a theoretical framework. In this paper, we present the geometric spanner construction in the Massively Parallel Computing (MPC) model. Constructing a θ-graph for the given ϵ, we modify a distributed range tree to find the nearest point to the apex of a θ-cone efficiently and form a (1 + ϵ)-spanner in O(1) rounds and Õ(S) time, where S is the memory size of a single machine.
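A minimal sequential sketch of the θ-graph at the core of this construction; the distributed range-tree machinery is the paper's contribution and is not attempted here. The cone count k is assumed proportional to 1/ϵ.

```python
from math import atan2, cos, sin, tau

def theta_graph(points, k):
    """Sequential O(k n^2) theta-graph sketch: for each point and each
    of k cones of angle 2*pi/k, connect to the point in that cone whose
    projection onto the cone's bisector is closest.  For k large enough
    the theta-graph is a (1 + eps)-spanner; the cited paper builds it
    in O(1) MPC rounds instead of this quadratic scan."""
    theta = tau / k
    edges = set()
    for i, (px, py) in enumerate(points):
        best = [None] * k  # nearest-by-projection point per cone
        for j, (qx, qy) in enumerate(points):
            if i == j:
                continue
            ang = atan2(qy - py, qx - px) % tau
            cone = int(ang // theta)
            bis = (cone + 0.5) * theta
            # Distance of q's projection onto the cone's bisector ray.
            proj = (qx - px) * cos(bis) + (qy - py) * sin(bis)
            if best[cone] is None or proj < best[cone][0]:
                best[cone] = (proj, j)
        for slot in best:
            if slot is not None:
                edges.add((min(i, slot[1]), max(i, slot[1])))
    return sorted(edges)
```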
References
Journal Article
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations
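Word counting is the running example in the paper; a minimal single-process simulation of one map-shuffle-reduce round, to fix ideas:

```python
from collections import defaultdict
from itertools import chain

def map_fn(_, text):
    """map: (doc_id, text) -> [(word, 1), ...]"""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """reduce: (word, [1, 1, ...]) -> (word, total)"""
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    """One MapReduce round on a single machine: map every record,
    shuffle intermediate pairs by key, then reduce each key group."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(k, v) for k, v in inputs):
        groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

docs = [(1, "the quick brown fox"), (2, "the lazy dog the end")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('the', 3), ('quick', 1), ('brown', 1), ...]
```

In the real system the shuffle is the distributed step: intermediate pairs are partitioned by key across machines, which is what this single dictionary stands in for.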

Journal Article
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Proceedings Article
17 Jan 2010
TL;DR: A simulation lemma is proved showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce, and it is demonstrated how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds.
Abstract: In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine: our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems such as undirected s-t connectivity in the MapReduce framework.

643 citations
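The two-round MST result rests on a filtering idea: an edge that a machine's local MST discards is the heaviest edge on some cycle, so by the cycle property it can never be in the global MST. A minimal single-process sketch of that filtering step, assuming distinct edge weights; the random edge partition here is a simplification of the paper's scheme.

```python
import random

class DSU:
    """Union-find for Kruskal."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def kruskal(n, edges):
    """Minimum spanning forest of (weight, u, v) edges."""
    dsu, mst = DSU(n), []
    for w, u, v in sorted(edges):
        if dsu.union(u, v):
            mst.append((w, u, v))
    return mst

def two_round_mst(n, edges, machines):
    """Round 1: split edges across machines; each keeps only its local
    MST edges (the rest are safe to drop by the cycle property).
    Round 2: one machine finishes on the surviving edges."""
    buckets = [[] for _ in range(machines)]
    for e in edges:
        buckets[random.randrange(machines)].append(e)
    survivors = [e for b in buckets for e in kruskal(n, b)]
    return kruskal(n, survivors)
```

At most machines × (n − 1) edges survive the first round, which is roughly why two rounds suffice for dense graphs with sublinear memory per machine.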


"Geometric Spanners in the MapReduce..." refers background in this paper

  • ...A class of functions that can be computed with minimum round and communication complexity is known as MRC-parallelizable functions [16]....


  • ...Different theoretical models for MapReduce have been introduced over the years [9,14,16]....


Proceedings Article
13 Apr 2015
TL;DR: SpatialHadoop is a comprehensive extension to Hadoop that injects spatial data awareness in each Hadoop layer, namely, the language, storage, MapReduce, and operations layers, with orders of magnitude better performance than Hadoop for spatial data processing.
Abstract: This paper describes SpatialHadoop, a full-fledged MapReduce framework with native support for spatial data. SpatialHadoop is a comprehensive extension to Hadoop that injects spatial data awareness in each Hadoop layer, namely, the language, storage, MapReduce, and operations layers. In the language layer, SpatialHadoop adds a simple and expressive high level language for spatial data types and operations. In the storage layer, SpatialHadoop adapts traditional spatial index structures, Grid, R-tree and R+-tree, to form a two-level spatial index. SpatialHadoop enriches the MapReduce layer by two new components, SpatialFileSplitter and SpatialRecordReader, for efficient and scalable spatial data processing. In the operations layer, SpatialHadoop is already equipped with a dozen operations, including range query, kNN, and spatial join. Other spatial operations are also implemented following a similar approach. Extensive experiments on a real system prototype and real datasets show that SpatialHadoop achieves orders of magnitude better performance than Hadoop for spatial data processing.

475 citations
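As an illustration of the storage-layer idea only (this is not SpatialHadoop's API), a toy single-machine sketch of grid partitioning with a cell-pruned range query; on a cluster, the cell key would decide which machine or file block stores each point.

```python
from collections import defaultdict

def build_grid(points, cell):
    """Map each point to the key of the grid cell containing it."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x // cell), int(y // cell))].append((x, y))
    return grid

def grid_range_query(grid, cell, xlo, ylo, xhi, yhi):
    """Visit only the cells overlapping the query rectangle, then
    filter the points inside those cells exactly."""
    out = []
    for cx in range(int(xlo // cell), int(xhi // cell) + 1):
        for cy in range(int(ylo // cell), int(yhi // cell) + 1):
            out.extend((x, y) for x, y in grid.get((cx, cy), ())
                       if xlo <= x <= xhi and ylo <= y <= yhi)
    return out
```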


"Geometric Spanners in the MapReduce..." refers methods in this paper

  • ...Moreover, fixed-dimensional linear programming, 1-dimensional all nearest neighbors, and 2-dimensional and 3-dimensional convex hull algorithms were solved in the memory-bound MapReduce model [13], and practically proven algorithms for skyline computation, merging two polygons, diameter, and closest pair problems have been discussed in the MapReduce model [8,10]....


Book
01 Jan 2007
TL;DR: In this book, the authors present rigorous descriptions of the main algorithms and their analyses for different variations of the Geometric Spanner Network Problem, and present several basic principles and results that are used throughout the book.
Abstract: Aimed at an audience of researchers and graduate students in computational geometry and algorithm design, this book uses the Geometric Spanner Network Problem to showcase a number of useful algorithmic techniques, data structure strategies, and geometric analysis techniques with many applications, practical and theoretical. The authors present rigorous descriptions of the main algorithms and their analyses for different variations of the Geometric Spanner Network Problem. Though the basic ideas behind most of these algorithms are intuitive, very few are easy to describe and analyze. For most of the algorithms, nontrivial data structures need to be designed, and nontrivial techniques need to be developed in order for analysis to take place. Still, there are several basic principles and results that are used throughout the book. One of the most important is the powerful well-separated pair decomposition. This decomposition is used as a starting point for several of the spanner constructions.

444 citations
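One of those principles can be stated compactly: the well-separated pair decomposition yields a spanner by taking one edge per pair. The bound below is the standard one from this literature, stated here in its usual form; consult the book for the precise theorem and its proof.

```latex
% For an s-WSPD \{(A_i, B_i)\} of a point set with separation s > 4,
% picking one representative edge (a_i, b_i) per pair gives a
% t-spanner with
\[
  t \;=\; \frac{s+4}{s-4},
  \qquad\text{so stretch } 1+\varepsilon \text{ follows from choosing }
  s \;=\; \frac{4(t+1)}{t-1} \;=\; \frac{4(2+\varepsilon)}{\varepsilon}.
\]
```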