
Showing papers presented at "Parallel and Distributed Computing: Applications and Technologies in 2014"


Proceedings ArticleDOI
09 Dec 2014
TL;DR: A new dynamic periodic decentralized data replication strategy, called RSBMFCP, which considers a set of correlated files as the replication granularity and shows better performance than other strategies in terms of job execution time and effective network usage.
Abstract: Data replication in data grids is an efficient technique that aims to improve response time, reduce bandwidth consumption and maintain reliability. In this context, much work has been done and many strategies have been proposed. Unfortunately, most existing replication techniques are based on single-file granularity and neglect the correlation among different data files. Indeed, file correlations have become an increasingly important consideration for performance enhancement in data grids. In fact, the analysis of real data-intensive grid applications reveals that jobs request groups of correlated files, and suggests that these correlations can be exploited to improve the effectiveness of replication strategies. In this paper, we propose a new dynamic periodic decentralized data replication strategy, called RSBMFCP, which considers a set of correlated files as the replication granularity. Our strategy gathers files according to a relationship of simultaneous access by jobs and stores correlated files at the same site. To find these correlations, a maximal frequent correlated pattern mining algorithm from the data mining field is introduced. We choose all-confidence as the correlation measure. The proposed strategy consists of four steps: storing the file access history, converting the file access history into a logical history file, applying the maximal frequent correlated pattern mining algorithm, and performing replication and replacement. Experiments using the well-known data grid simulator OptorSim show that our proposed strategy performs better than other strategies in terms of job execution time and effective network usage.
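
As a concrete illustration of the correlation measure, the following Python sketch (ours, not the authors' code) computes the all-confidence of a file group from a job access history; the history format and the 0.6 threshold are assumptions.

```python
from itertools import combinations

def all_confidence(group, history):
    """All-confidence of a file group: supp(group) / max supp(single file).
    `history` is a list of sets, each holding the files one job accessed."""
    supp_group = sum(group <= job for job in history) / len(history)
    max_single = max(sum(f in job for job in history) for f in group) / len(history)
    return supp_group / max_single if max_single else 0.0

# Toy access history: each set is the files requested together by one job.
history = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]

# Keep pairs whose all-confidence exceeds a hypothetical threshold of 0.6.
files = sorted(set().union(*history))
correlated = [set(g) for g in combinations(files, 2)
              if all_confidence(frozenset(g), history) >= 0.6]
print(correlated)  # [{'a', 'b'}]: supp({a,b}) = 0.75, max single supp = 0.75
```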

9 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: A scheduling algorithm that assigns rCUDA (a remote CUDA execution technology) to some processes of some jobs, and that can maintain the average job lifetime at around the same level as the scheduling algorithm currently used in TSUBAME2.5's GPU queue.
Abstract: In heterogeneous supercomputers such as TSUBAME2.5, GPUs on some nodes in GPU batch queues are left idle even though there are jobs waiting in the queues; this is caused by the GPU resource-assignment fragmentation problem. For example, when each node has three GPUs, as on TSUBAME2.5, if a node has already been assigned to a job requesting two GPUs per node, that node cannot be assigned to another job requesting more than one GPU per node until the ongoing job finishes; hence, one GPU is left idle on that node. We examine this problem on TSUBAME2.5's GPU batch-queue system and present a scheduling algorithm that assigns rCUDA (a remote CUDA execution technology) to some processes of some jobs. Because rCUDA allows jobs to utilize the idle GPUs, the proposed scheduling algorithm can alleviate the problem. Using a job pattern obtained from a scheduler log of a TSUBAME2.5 GPU queue, our simulation shows that the proposed algorithm can decrease job lifetime (from the time when a job arrives until it finishes) by about 5% on average. Moreover, it can reduce the average number of idle GPUs by about 15%. Also, even when reducing the number of nodes serving jobs by around 4%, the proposed algorithm maintains the average job lifetime at around the same level as the scheduling algorithm currently used in TSUBAME2.5's GPU queue.
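
The fragmentation effect is easy to reproduce in a few lines; this toy Python sketch (ours, not the paper's simulator) shows three-GPU nodes stranding one idle GPU each under per-node allocation, and how pooling idle GPUs across nodes, as rCUDA's remote execution permits, lets a waiting job start. Node counts and the job mix are made up.

```python
nodes = [3, 3, 3]        # free GPUs per node (3 per node, as on TSUBAME2.5)
jobs = [2, 2, 2, 2]      # GPUs-per-node requested by waiting jobs

def place_local(req):
    """Plain scheduler: all of a job's GPUs must come from one node."""
    for i, free in enumerate(nodes):
        if free >= req:
            nodes[i] -= req
            return True
    return False

placed = [place_local(g) for g in jobs]
print(nodes, placed)     # [1, 1, 1] [True, True, True, False]
# One GPU idles on every node, yet the fourth job cannot start.

# rCUDA-style pooling: satisfy the stranded job from idle GPUs on
# different nodes via remote CUDA execution.
need = jobs[-1]
if not placed[-1] and sum(nodes) >= need:
    for i in range(len(nodes)):
        take = min(nodes[i], need)
        nodes[i] -= take
        need -= take
    print("fourth job placed remotely:", need == 0, nodes)  # True [0, 0, 1]
```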

6 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: The new protocol is the first PUF-based ownership transfer protocol for an open environment; it uses the PUF in place of a pseudo-random generator and protects the privacy of both the original owner and the new owner.
Abstract: In supply chains, RFID tags are deployed ever more widely, and over the life of a supply chain the owner of a tag changes frequently. An ownership transfer protocol transfers the access rights of the tag from the original owner to the new owner while protecting the privacy of both. To resist cloning attacks and side-channel analysis attacks, physical unclonable functions (PUFs) have been proposed to enhance the security of tags. Since the PUF of each tag is unique, it is difficult to forge. However, most PUF-based authentication protocols need the response values to be stored in the readers beforehand. On the other hand, most ownership transfer protocols assume that the original owner and the new owner share a secure channel; in an open environment, due to time and space constraints, such a channel often cannot be established quickly. In this paper, we study ownership transfer protocols in an open environment and propose a PUF-based RFID ownership transfer protocol, PROTP. The new protocol is the first ownership transfer protocol based on a PUF in an open environment, and it does not need to store the response values of the PUF. It exploits the randomness of the PUF to replace the pseudo-random generator. Meanwhile, PROTP protects the privacy of the original owner and the new owner. In terms of efficiency, since the protocol is designed to satisfy the requirements of an open environment, its total computation cost is higher than that of other protocols. However, because the PUF replaces the pseudo-random generator, each step of the authentication exchange is better optimized in computational cost.
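
For intuition, here is a heavily simplified Python sketch of PUF-style challenge-response authentication. A real PUF is unclonable hardware; the keyed hash below merely stands in for it, and this flow is our generic illustration rather than PROTP itself. Note in particular that PROTP's contribution is avoiding the pre-stored challenge-response pairs this toy relies on.

```python
import hmac, hashlib, os

class ToyPUF:
    """Software stand-in for a hardware PUF: per-chip randomness that
    never leaves the tag and is queried only via challenge -> response."""
    def __init__(self):
        self._silicon = os.urandom(16)          # models manufacturing variation
    def response(self, challenge: bytes) -> bytes:
        return hmac.new(self._silicon, challenge, hashlib.sha256).digest()

tag = ToyPUF()

# Enrollment: the current owner records one challenge-response pair.
challenge = os.urandom(8)
expected = tag.response(challenge)

# Authentication: only the genuine tag can reproduce the response,
# since the PUF (here, its stand-in secret) cannot be cloned.
assert hmac.compare_digest(tag.response(challenge), expected)
print("tag authenticated")
```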

6 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: A highly distributed QoS-aware replication technique that computes the optimal data center locations for the replicas so that the overall replication cost is minimized, while the replication strategy maximizes QoS satisfaction to improve data availability and reduce access latency.
Abstract: Cloud computing infrastructures capable of providing scalable storage and computing resources can efficiently be used for big data storage and processing. There are growing trends in developing data-intensive (big data) applications in this computing environment that need to access massive datasets. Hence, effective data management, such as data availability and efficient access, has become a critical requirement for these applications. This can be achieved by data replication, which offers reduced data access latency, higher data availability and improved system load balancing. Moreover, different applications may have different quality-of-service (QoS) requirements. To continuously support the QoS requirement of an application, we propose a highly distributed QoS-aware replication technique that computes the optimal data center locations for the replicas so that the overall replication cost is minimized. Further, the replication strategy aims at maximizing QoS satisfaction to improve data availability and reduce access latency. The problem is formulated using dynamic programming. Finally, simulation experiments are performed using widely observed data access patterns to demonstrate the effectiveness of the proposed technique.
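
To give the dynamic-programming formulation some texture, here is a toy Python sketch (ours; the paper's cost model is not reproduced) that picks a minimum-cost set of data centers so that every user region has a replica within its QoS latency bound, using a DP over bitmasks of covered regions.

```python
from functools import lru_cache

# Hypothetical inputs: replica cost per data center, latency (ms) from each
# data center to each user region, and a per-region QoS latency bound.
cost = [4, 3, 5]
latency = [[20, 90, 90],     # latency[i][r]: data center i -> region r
           [90, 25, 80],
           [60, 70, 15]]
qos = [50, 50, 50]
n_dc, n_reg = len(cost), len(qos)

# covers[i]: bitmask of regions data center i serves within their QoS bound.
covers = [sum(1 << r for r in range(n_reg) if latency[i][r] <= qos[r])
          for i in range(n_dc)]

@lru_cache(None)
def best(mask):
    """Minimum cost (and chosen DCs) to cover the regions set in `mask`."""
    if mask == 0:
        return 0, ()
    r = (mask & -mask).bit_length() - 1          # lowest uncovered region
    choices = []
    for i in range(n_dc):
        if covers[i] >> r & 1:
            sub_cost, picks = best(mask & ~covers[i])
            choices.append((sub_cost + cost[i], picks + (i,)))
    return min(choices)

print(best((1 << n_reg) - 1))   # (12, (2, 1, 0)): each region needs its own DC
```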

5 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: The Distributive Interoperable Executive Library (DIEL) is presented, to facilitate the collaboration, exploration, and execution of multiphysics modeling projects suited for a diversified research community on emergent large-scale parallel computing platforms.
Abstract: As HPC capability and software adaptability continue to expand, the interest in performing complex system-wide simulations involving multiple interacting components grows. In this paper, we present a novel integrative software platform -- the Distributive Interoperable Executive Library (DIEL) -- to facilitate the collaboration, exploration, and execution of multiphysics modeling projects suited for a diversified research community on emergent large-scale parallel computing platforms. It does so by providing a managing executive, a layer of numerical libraries, a number of commonly used physics modules, and two sets of native communication protocols. The DIEL allows users to plug in their individual modules, prescribe the interactions between those modules, and schedule communications between them. The DIEL framework is designed to be applicable for preliminary concept design, sensitivity prototyping, and productive simulation of a complex system.

5 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: This work presents an optimized implementation of the FastICA algorithm, specifically tailored for next-generation GPU architectures such as Nvidia Kepler, which achieves a two-digit speedup factor in the prototype implementation compared to a multithreaded CPU implementation.
Abstract: Blind Signal Separation is an algorithmic problem class that deals with the restoration of original signal data from a signal mixture. Implementations such as FastICA are optimized for parallelization on CPU or first-generation GPU hardware. With the advent of modern, compute-centered GPU hardware with powerful features such as dynamic parallelism support, these solutions no longer leverage the available hardware performance in the best possible way. We present an optimized implementation of the FastICA algorithm, which is specifically tailored for next-generation GPU architectures such as Nvidia Kepler. Our proposal achieves a two-digit speedup factor in the prototype implementation, compared to a multithreaded CPU implementation. Our custom matrix multiplication kernels, tailored specifically for the use case, contribute to the speedup by delivering better performance than the state-of-the-art CUBLAS library.
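
For orientation, the fixed-point iteration at the heart of FastICA, the computation such GPU kernels accelerate, looks roughly like this NumPy sketch (the textbook one-unit algorithm with a tanh nonlinearity, not the paper's CUDA code; the input is assumed centered and whitened):

```python
import numpy as np

def fastica_one_unit(X, iters=200, tol=1e-8, seed=0):
    """One independent component from whitened data X of shape (d, n)."""
    d, _ = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(iters):
        wx = w @ X                                  # projections, shape (n,)
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        # Fixed-point step: w+ = E[x g(w^T x)] - E[g'(w^T x)] w
        w_new = (X * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:         # converged up to sign
            return w_new
        w = w_new
    return w

# Demo: two independent uniform sources, mixed, then centered and whitened.
rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, (2, 5000))
X = np.array([[2.0, 1.0], [1.0, 1.5]]) @ S          # mixed signals
X -= X.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(X @ X.T / X.shape[1])
X_white = (U / np.sqrt(s)) @ U.T @ X                # whitening: E[XX^T] = I
print(fastica_one_unit(X_white))                    # one unmixing direction
```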

4 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: For any small real number ε > 0, the algorithm computes a Steiner tree with delay and cost bounded by (1 + ε)D and the optimum cost respectively.
Abstract: Let G = (V, E) be a given graph with non-negative integral edge costs and delays, let S ⊆ V be a terminal set and r ∈ S be the selected root. The shallow-light Steiner tree (SLST) problem is to compute a minimum cost tree spanning the terminals of S, such that the delay between r and every other terminal is bounded by a given delay constraint D ∈ Z₀⁺. It is known that the SLST problem is NP-hard, and unless NP ⊆ DTIME(n^(log log n)) there exists no approximation algorithm with ratio (1, γ·log²n) for some fixed γ > 0 [12]. Nevertheless, under the same assumption it admits no approximation ratio better than (1, γ·log|V|) for some fixed γ > 0 even when D = 2 [2]. This paper first gives an exact algorithm with time complexity O(3^t·nD + 2^t·n²D² + n³D³), where n and t are the numbers of vertices and terminals of the given graph respectively. This is a pseudo-polynomial time parameterized algorithm with respect to the parameterization "number of terminals". Later, this algorithm is improved to a parameterized approximation algorithm with time complexity O(3^t·n²/ε + 2^t·n⁴/ε² + n⁶/ε³) and a bifactor approximation ratio (1 + ε, 1). That is, for any small real number ε > 0, the algorithm computes a Steiner tree with delay and cost bounded by (1 + ε)D and the optimum cost respectively.

4 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: This paper proposes a bandwidth model for volunteer computing projects based on the real trace data taken from the Docking@Home project, with more than 280,000 hosts over a 5-year period, and validates the proposed statistical model using model-based and simulation-based techniques.
Abstract: The emergence of Big Data applications provides new challenges in data management such as processing and movement of masses of data. Volunteer computing has proven itself as a distributed paradigm that can fully support Big Data generation. This paradigm uses a large number of heterogeneous and unreliable Internet-connected hosts to provide Peta-scale computing power for scientific projects. With the increase in data size and number of devices that can potentially join a volunteer computing project, the host bandwidth can become a main hindrance to the analysis of the data generated by these projects, especially if the analysis is a concurrent approach based on either in-situ or in-transit processing. In this paper, we propose a bandwidth model for volunteer computing projects based on the real trace data taken from the Docking@Home project with more than 280,000 hosts over a 5-year period. We validate the proposed statistical model using model-based and simulation-based techniques. Our modeling provides us with valuable insights on the concurrent integration of data generation with in-situ and in-transit analysis in the volunteer computing paradigm.
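
As an illustration of this kind of statistical bandwidth modeling (our sketch; the paper's actual model for the Docking@Home trace is not reproduced here), one might fit a heavy-tailed distribution to per-host bandwidth samples and validate it both against the data and by simulation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical per-host bandwidths (Mbit/s); real input would be trace data.
bw = rng.lognormal(mean=1.5, sigma=0.8, size=5000)

# Fit a log-normal model (location pinned at 0) to the samples.
shape, loc, scale = stats.lognorm.fit(bw, floc=0)

# Model-based validation: KS distance between data and the fitted model.
d, p = stats.kstest(bw, "lognorm", args=(shape, loc, scale))
print(f"sigma={shape:.2f}, median={scale:.1f} Mbit/s, KS D={d:.3f}")

# Simulation-based validation: draw synthetic hosts from the model and
# compare aggregate statistics with the observed trace.
synthetic = stats.lognorm.rvs(shape, loc, scale, size=bw.size, random_state=2)
print(f"observed mean={bw.mean():.1f}, simulated mean={synthetic.mean():.1f}")
```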

4 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: Experiments show the proposed clustering algorithm, Kmms (short for k-Means and Mean Shift), not only costs less initialization time than other density-based algorithms but also achieves better clustering quality and higher efficiency.
Abstract: K-Means, a simple but effective clustering algorithm, is widely used in the data mining, machine learning and computer vision communities. The k-Means algorithm consists of initialization of cluster centers and iteration. The initial cluster centers have a great impact on the clustering result and on algorithm efficiency: more appropriate initial centers let k-Means get closer to the optimum solution, often with much quicker convergence. In this paper, we propose a novel clustering algorithm, Kmms, whose name abbreviates k-Means and Mean Shift. It is a density-based algorithm. Experiments show our algorithm not only costs less initialization time than other density-based algorithms, but also achieves better clustering quality and higher efficiency. Compared with the popular k-Means++ algorithm, our method gets comparable accuracy, often even better. Furthermore, we parallelize the Kmms algorithm with OpenMP in both the initialization and iteration steps and prove the convergence of the algorithm.
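
A minimal NumPy sketch of the general idea of density-based seeding followed by k-means iterations (our reconstruction of the flavor of such algorithms, not the authors' Kmms): seeds are high-density points that are far from already-chosen seeds, then Lloyd steps refine them.

```python
import numpy as np

def density_seeds(X, k, bandwidth):
    """Pick k initial centers: high local density, mutually distant."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    density = (d2 < bandwidth ** 2).sum(1)        # neighbors within the radius
    seeds = [int(np.argmax(density))]
    for _ in range(k - 1):
        # prefer dense points that are far from every current seed
        score = density * d2[:, seeds].min(1)
        seeds.append(int(np.argmax(score)))
    return X[seeds].copy()

def kmeans(X, centers, iters=50):
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == j].mean(0) for j in range(len(centers))])
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in ((0, 0), (3, 0), (0, 3))])
centers, labels = kmeans(X, density_seeds(X, 3, bandwidth=0.5))
print(np.round(centers, 2))   # one center near each of the three blobs
```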

4 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: The hpcwld package is presented, providing an R function for the workload (unfinished work) evaluation of a stochastic model of a supercomputer based on a modified Kiefer-Wolfowitz recursion.
Abstract: We present the hpcwld package, which provides an R [1] function for the workload (unfinished work) evaluation of a stochastic model of a supercomputer based on a modified Kiefer-Wolfowitz recursion. Beyond the workload evaluation function, we provide functions for workload data input and output, as well as a distributional measure of correlation that might be useful for real-system workload analysis. The package can be obtained via the Comprehensive R Archive Network [2].
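
For background, the classical Kiefer-Wolfowitz workload recursion for a G/G/s queue, which the package modifies for supercomputer workloads (where a job may occupy several servers), can be sketched in a few lines of Python (our illustration; hpcwld itself is R and implements the modified recursion):

```python
import numpy as np

def kiefer_wolfowitz(service, interarrival, s):
    """Workload vector seen by each arrival in a G/G/s queue:
    W_{n+1} = sort( max(W_n + S_n * e1 - T_n * 1, 0) ), componentwise."""
    w = np.zeros(s)                      # ascending workloads of the s servers
    e1 = np.eye(1, s)[0]                 # (1, 0, ..., 0): job joins least-loaded server
    out = [w.copy()]
    for S, T in zip(service, interarrival):
        w = np.sort(np.maximum(w + S * e1 - T, 0.0))
        out.append(w.copy())
    return np.array(out)

rng = np.random.default_rng(0)
W = kiefer_wolfowitz(rng.exponential(2.0, 1000),   # service times (mean 2)
                     rng.exponential(1.0, 1000),   # interarrival times (mean 1)
                     s=3)
print("mean waiting-workload ahead of an arrival:", W[:, 0].mean())
```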

3 citations


Proceedings ArticleDOI
09 Dec 2014
TL;DR: This article formulates the Minimum Total Transmission Power problem, which aims to address the issue of constructing a congestion-free convergecast tree in WSNs with adjustable transmission power of sensor nodes, and transforms MTTP into an Integer Linear Programming (ILP) model, by which the optimal solution to MTTP is derived.
Abstract: Convergecast is a critical communication paradigm for data collection in wireless sensor networks, where both energy and bandwidth are scarce resources. Previous convergecast algorithms focused only on minimizing the energy cost without considering the constraint of wireless bandwidth. This article shows that constructing a congestion-free convergecast tree cannot ignore the bandwidth constraint. Moreover, the adjustable transmission power of sensor nodes affects not only the topology of the network but also the bandwidth of wireless links. In this paper, we formulate the Minimum Total Transmission Power (MTTP) problem, which aims to address the issue of constructing a congestion-free convergecast tree in WSNs with adjustable transmission power of sensor nodes. We transform MTTP into an Integer Linear Programming (ILP) model, by which the optimal solution to MTTP is derived. To strike a balance between scheduling overhead and system performance, we propose a heuristic algorithm called Nearest-to-Sink, which searches viable paths in a greedy way and achieves near-optimal performance. We build a simulation model and give a comprehensive performance evaluation, which demonstrates the feasibility and effectiveness of the proposed algorithm.
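
As a rough sketch of the greedy flavor of such heuristics (ours, not the paper's Nearest-to-Sink, whose real version must also respect per-link bandwidth to stay congestion-free), a convergecast tree can be grown from the sink, always attaching the node reachable at the lowest added power:

```python
import heapq

# Toy WSN: node coordinates; node 0 is the sink. The power needed to reach
# distance d is modeled as d**2 (a common path-loss simplification).
pos = {0: (0, 0), 1: (1, 0), 2: (2, 1), 3: (1, 2), 4: (3, 2)}

def power(u, v):
    (x1, y1), (x2, y2) = pos[u], pos[v]
    return (x1 - x2) ** 2 + (y1 - y2) ** 2

# Grow the tree from the sink, always attaching the cheapest unattached
# node (essentially Prim's algorithm on the power graph).
parent = {0: None}
frontier = [(power(0, v), 0, v) for v in pos if v != 0]
heapq.heapify(frontier)
while len(parent) < len(pos):
    p, u, v = heapq.heappop(frontier)
    if v in parent:
        continue
    parent[v] = u
    for w in pos:
        if w not in parent:
            heapq.heappush(frontier, (power(v, w), v, w))

total = sum(power(v, u) for v, u in parent.items() if u is not None)
print("parent map:", parent, "total transmission power:", total)
```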

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A novel protocol identification system, PSKS, which relies on statistical signatures of network packet payloads, based on the key insight that message segmentation patterns can be leveraged for accurate application identification.
Abstract: In-depth understanding of network traffic is important for a variety of applications, such as network management and network security. In this paper, we propose a novel protocol identification system, PSKS, which relies on the statistical signatures of network packet payloads. The proposed approach is based on the key insight that message segmentation patterns can be leveraged for accurate application identification. Specifically, the segmentation possibility for every position of protocol messages exhibits a highly skewed frequency distribution because different protocols have different message formats (i.e., distinct message segmentation patterns). Motivated by this observation, we extract statistical application fingerprints by exploiting the message segmentation patterns. In PSKS, we first extract the message segmentation patterns by scoring the segmentation possibility scale for each position of messages, and then extract statistical signatures by the Kolmogorov-Smirnov test and feed the signatures to tri-training, a collaborative learning algorithm. Tri-training improves the generalization ability of our final classifier. We implemented and evaluated PSKS, and the experimental results show that PSKS achieves an average precision and recall of approximately 98%.
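
The Kolmogorov-Smirnov step can be pictured with a small SciPy sketch (ours, with made-up score vectors): per-position segmentation scores from an unknown flow are compared against each protocol's stored signature distribution, and a smaller KS statistic means a better match.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical "segmentation possibility" scores per byte position,
# accumulated from training flows of two protocols.
sig_a = rng.beta(2, 5, 400)      # stand-in signature for protocol A
sig_b = rng.beta(5, 2, 400)      # stand-in signature for protocol B

# Scores extracted from an unknown flow (here: drawn like protocol A).
unknown = rng.beta(2, 5, 120)

for name, sig in [("A", sig_a), ("B", sig_b)]:
    stat, p = ks_2samp(unknown, sig)
    print(f"protocol {name}: KS={stat:.3f}, p={p:.3g}")
# Protocol A yields the smaller KS distance, so the flow is labeled A.
```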

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A comparative study of parallel and sequential GSA was carried out, showing a significant speedup that re-emphasizes the utility of CUDA-based implementations for complex and computationally intensive parallel applications.
Abstract: Many scientific and technical problems with massive computation requirements could benefit from Graphics Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) for high-speed processing. The Gravitational Search Algorithm (GSA) is a population-based metaheuristic algorithm that can be effectively implemented on a GPU to reduce the execution time. In this paper we discuss possible approaches to parallelizing GSA on graphics hardware using CUDA. An in-depth study of the computational efficiency of the parallel algorithms and their capability to effectively exploit the GPU architecture is performed. Additionally, a comparative study of parallel and sequential GSA was carried out on a set of standard benchmark optimization functions. The results show a significant speedup that re-emphasizes the utility of CUDA-based implementations for complex and computationally intensive parallel applications.
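
For readers unfamiliar with GSA, its per-iteration update (masses from fitness, a decaying gravitational constant, pairwise attractive forces) is sketched below in NumPy; this is the standard textbook algorithm without the Kbest elitism refinement, not the paper's CUDA code.

```python
import numpy as np

def gsa_minimize(f, dim=2, n=30, iters=200, G0=100.0, alpha=20.0, seed=0):
    """Minimal NumPy sketch of the Gravitational Search Algorithm."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, (n, dim))          # agent positions
    V = np.zeros((n, dim))                        # agent velocities
    for t in range(iters):
        fit = np.array([f(x) for x in X])
        best, worst = fit.min(), fit.max()
        m = (worst - fit) / (worst - best + 1e-12)    # better fitness -> larger mass
        M = m / (m.sum() + 1e-12)
        G = G0 * np.exp(-alpha * t / iters)           # decaying gravitational constant
        diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = x_j - x_i
        R = np.linalg.norm(diff, axis=2, keepdims=True)
        # acceleration a_i = G * sum_j rand * M_j * (x_j - x_i) / (R_ij + eps)
        A = (G * rng.random((n, n, 1)) * M[None, :, None] * diff / (R + 1e-12)).sum(axis=1)
        V = rng.random((n, dim)) * V + A
        X = X + V
    fit = np.array([f(x) for x in X])
    return X[fit.argmin()], float(fit.min())

sphere = lambda x: float((x ** 2).sum())
x_best, f_best = gsa_minimize(sphere)
print(x_best, f_best)    # agents collapse toward the sphere minimum at 0
```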

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A message-passing algorithm is designed that forces deterministic behavior for a subset of piecewise-deterministic applications; its main principle is to affix each message with a predetermined tag, prescribing a total order for messages.
Abstract: Ensuring replicable execution of distributed computation has many possible applications in areas of fault-tolerance, debugging, and state-machine based replication. We design a message-passing algorithm which forces a deterministic behavior for a subset of piecewise-deterministic applications. The main principle of the algorithm is to affix each message with a predetermined tag, prescribing a total order for messages.
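
A tiny Python sketch (our illustration of the tagging principle, not the paper's algorithm) of how predetermined tags can force deterministic delivery: messages may arrive in any order, but a receiver buffers them and only delivers the one bearing the next expected tag.

```python
import heapq

class DeterministicInbox:
    """Deliver messages in predetermined-tag order regardless of arrival order."""
    def __init__(self):
        self.next_tag = 0
        self.buffer = []            # min-heap of (tag, payload)

    def receive(self, tag, payload):
        heapq.heappush(self.buffer, (tag, payload))
        delivered = []
        while self.buffer and self.buffer[0][0] == self.next_tag:
            delivered.append(heapq.heappop(self.buffer)[1])
            self.next_tag += 1
        return delivered            # messages now safe to hand to the app

inbox = DeterministicInbox()
# The network reorders: tags 2 and 1 arrive before 0.
print(inbox.receive(2, "c"))       # [] -- held back
print(inbox.receive(1, "b"))       # [] -- held back
print(inbox.receive(0, "a"))       # ['a', 'b', 'c'] -- total order restored
```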

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A fault-tolerant routing algorithm is proposed that establishes a fault-free path between any pair of non-faulty nodes in an Sn,k with faulty nodes by using limited global information called safety vectors.
Abstract: An (n, k)-star graph Sn,k is a promising topology for interconnection networks of parallel processing systems because it inherits the merits of a star graph while providing various network sizes. In this study, we propose a fault-tolerant routing algorithm that establishes a fault-free path between any pair of non-faulty nodes in an Sn,k with faulty nodes by using limited global information called safety vectors. In addition, we carried out a computer experiment to verify its effectiveness.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A 3-step method is proposed that adjusts the memory scheduling algorithm to optimize LLC bypassing performance; results show that after it is applied, the schedulers improve system performance markedly.
Abstract: The shared last-level cache (SLLC) in a heterogeneous multicore system is an important memory component that is shared by, and competed for among, multiple cores, so improving SLLC performance has become an important research area. Last-level cache (LLC) bypassing, a technique that routes a part of the memory requests around the LLC, is one of the most effective methods; the bypassed requests are sent directly to off-chip main memory (DRAM) rather than eliminated. We find that the bypassed requests severely disturb the original scheduling sequence in the Memory Controller (MC). Besides, immoderate bypassing will disturb the MC load balance. We propose a 3-step method that adjusts the memory scheduling algorithm to optimize LLC bypassing performance. The first step is adding an independent bypass stream for bypassed requests. The second step is scheduling the bypass stream with a smaller probability than that of the normal GPU stream. The third step is adding a guard mechanism for the MC. By dynamically setting and revoking the guard, we can avoid unbalanced bypassing. As a case study, we applied the 3-step method to two modern memory schedulers. The experimental results show that with the 3-step method applied, the schedulers improve system performance markedly.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A data mapping method is proposed for SMVM on Network-on-Chip which achieves balanced working load and reduces the communication cost, and an FPGA-based architecture is introduced which is designed to fit the proposed data mapping method.
Abstract: The performance of the sparse matrix-vector multiplication (SMVM) on a parallel system is strongly affected by the distribution of data among its components. Two costs arise as a result of the used data mapping method: arithmetic and communication. The communication cost often dominates the arithmetic cost, and the gap between these costs tends to increase. Therefore, finding a mapping method that reduces the communication cost is of high importance. On the other hand, the load distribution among the processing units must not be sacrificed. In this paper, a data mapping method is proposed for SMVM on Network-on-Chip which achieves balanced working load and reduces the communication cost. Afterwards, an FPGA-based architecture is introduced which is designed to fit with the proposed data mapping method.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A generative model, Topic Block, is presented that clearly pinpoints the latent concept underlying the text corpus and link network, i.e., user inner interests, and can infer the topic and community distributions based on the user inner interests through both content and topology information.
Abstract: Text corpus and link network are interrelated data in social networks. Discovering the inner relationship between these two kinds of data can help better understand the evolution mechanism underneath social networks. Moreover, social networks exhibit unique characteristics such as sparsity and noise in both text and link data. Thus, it is imperative to combine text and link data to complement and correct mining results. However, previous work did not explore a uniform generative model that can unveil their inner relationship, probably because of the difficulty of harnessing the heterogeneous data in social networks. To address this issue, in this paper we present a generative model, Topic Block, that clearly pinpoints the latent concept underlying the text corpus and link network, i.e., user inner interests. In our generative model, user inner interests guide the generation of the topic and community distributions underlying the text corpus and link data. We can infer the topic and community distributions based on the user inner interests through both content and topology information. Compared to existing popular models, our method experimentally outperforms them on three real-world social network data sets.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: This paper formalizes a specific anonymizing model to deal with dart-related attacks, and proposes an efficient method for the dart anonymization problem to protect a combinatorial map from such attacks.
Abstract: The Combinatorial Map (CM) is becoming increasingly popular due to its power in modeling topological structures with subdivided objects, and is widely used in the fields of social networks, computer vision, social media and so on. However, due to its specific structural properties, an unprotected release of a combinatorial map may cause the identity disclosure problem, a major privacy breach revealing the identification of entities to an adversary with certain background knowledge. In this paper, we discuss the privacy-preserving problem in publishing private combinatorial maps. We first formalize a specific anonymizing model to deal with dart-related attacks, and discuss an efficient metric to quantify the information loss incurred in the perturbation. Then we propose an efficient method for the dart anonymization problem to protect a CM from such attacks. Our approaches are efficient and practical, and have been validated by extensive experiments on two sets of synthetic data.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: This contribution aims at dynamically managing load imbalance by allowing multiple hash functions on different peers, while maintaining consistency of the overlay.
Abstract: Storing highly skewed data in a distributed system has become a very frequent issue, in particular with the emergence of semantic web and Big Data. This often leads to biased data dissemination among nodes. Addressing load imbalance is necessary, especially to minimize response time and avoid workload being handled by only one or few nodes. Our contribution aims at dynamically managing load imbalance by allowing multiple hash functions on different peers, while maintaining consistency of the overlay. Our experiments, on highly skewed data sets from the semantic web, show we can distribute data on at least 300 times more peers than when not using any load balancing strategy.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: TOUGH2-PETSc is presented, a parallel implementation of TOUGH2 that uses PETSc to solve the linear systems in TOUGH2, a general-purpose numerical simulation program for multi-dimensional, multiphase, multicomponent fluid flows, heat transfer and contaminant transport in porous and fractured media.
Abstract: TOUGH2 is a general-purpose numerical simulation program for multi-dimensional, multiphase, multicomponent fluid flows, heat transfer and contaminant transport in porous and fractured media. It has been used worldwide for geothermal reservoir engineering, nuclear waste isolation, environmental assessment and remediation, and modeling flow and transport in variably saturated media. TOUGH2 is very computationally intense, and the accuracy and scope of a simulation are limited by the amount of processing power available on a single computer. This makes it an ideal candidate for parallel computing, where more CPU power and memory are available. Furthermore, TOUGH2's main computational unit is a linear equation solver, and in parallel computing much effort has been spent on developing highly efficient parallel linear equation solvers. In this paper, we present TOUGH2-PETSc, a parallel implementation of TOUGH2 that uses PETSc to solve the linear systems in TOUGH2. PETSc is a library of high-performance linear and non-linear equation solvers that has been thoroughly tested at scale. Based on TOUGH2 and PETSc, TOUGH2-PETSc gives TOUGH2 users the potential to perform larger-scale and higher-resolution simulations. Experimental results demonstrate that the parallel TOUGH2-PETSc shows improved performance over the sequential version.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A strategy to calculate the locality of predicted spatial points with temporal and statistical function series, which can find regions at critical levels, supports the choice of prevention methods, and offers prediction analysis based on georeferenced resources.
Abstract: Spatiotemporal data stored in geographic databases provide an evolutionary panorama of the characteristics of a specific region. By integrating prediction concepts and statistical functions with these data, it is possible to make inferences from the obtained information to support many areas, such as management of occupational health, environmental resources and quality of life. In this article we propose a strategy to calculate the locality of predicted spatial points with temporal and statistical function series, which can find regions at critical levels. In concentrations of denser occurrences, this strategy supports the choice of prevention methods and offers prediction analysis based on georeferenced resources. This work contributes to the prediction, analysis and visualization of georeferenced data to reduce costs and improve quality of life.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: Based on a set of experiments with a scientific application on diverse OpenCL devices, major pitfalls and insights are pointed out, and directions for further efforts in developing pattern libraries for OpenCL are outlined.
Abstract: Parallel pattern libraries (e.g., Intel TBB) are popular and useful tools for developing applications in SMP environments at a higher level of abstraction. Such libraries execute user-provided code efficiently on shared memory parallel architectures in accordance with well-defined execution patterns like parallel for-loops or pipelines. For heterogeneous architectures comprised of CPUs and accelerators, OpenCL has gained a lot of momentum. Since accelerated architectures do not provide a shared memory, it is not possible to directly use the approach taken in pattern libraries for SMP systems for OpenCL as well. In this paper, we are exploring issues and opportunities encountered by attempts to provide such patterns in the context of OpenCL. Based on a set of experiments with a scientific application on diverse OpenCL devices, we point out major pitfalls and insights, and outline directions for further efforts in developing pattern libraries for OpenCL.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: This paper considers binding of already-scheduled multiphase applications onto a linear multicore architecture and shows that hierarchical binding using a minimum-cost perfect-matching approach outperforms the rest of the approaches.
Abstract: Almost all applications' run-time characteristics exhibit time-varying phase behavior, so a scheduling and binding strategy that considers this behavior plays an important role in achieving high throughput and low power consumption. In this paper, we consider binding already-scheduled multiphase applications onto a linear multicore architecture. This approach binds the scheduled applications to nearby cores and hence reduces the overall data movement. We model the overall data communication overhead of an application on a linear architecture and use this model in binding. We also propose and evaluate four different approaches for binding multiphase applications on a linear multicore architecture: (a) random iterative refinement, (b) a biggest-block left-right approach, (c) a biggest-block center-center approach, and (d) hierarchical binding using minimum-cost perfect matching. Results show that hierarchical binding using the minimum-cost perfect-matching approach outperforms the rest.
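
As a flavor of the matching step (our sketch; the paper's cost model and hierarchy are not reproduced), SciPy's assignment solver can bind application blocks to core regions at minimum total communication cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost[i][j]: communication overhead of placing application
# block i on core region j of a linear chip (e.g., distance-weighted traffic).
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

rows, cols = linear_sum_assignment(cost)    # minimum-cost perfect matching
for i, j in zip(rows, cols):
    print(f"block {i} -> region {j} (cost {cost[i, j]})")
print("total cost:", cost[rows, cols].sum())
```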

Proceedings ArticleDOI
09 Dec 2014
TL;DR: This work proposes a conflict-free scheduling to avoid register read/write conflicts among concurrently executing code blocks, and finds that the shortfall of conflict-free scheduling could be partially fixed by complementing it with the naive scheduling.
Abstract: Speculative Multi-Threading (SpMT) is a promising technology to harness the growing number of cores in modern CPUs by exploiting the available parallelism in sequential threads. With SpMT, the inter-dependent fine-grain threads on different cores need to sync their input and output registers. Inter-core register sync is quite expensive and heavily impacts the overall performance. We propose a conflict-free scheduling to avoid register read/write conflicts among concurrently executing code blocks. Therefore, while executing, the fine-grain threads do not need to sync the input or output registers with each other. Their output registers are still synced across the cores, but only consumed later. Consequently, the register sync and the execution are decoupled, and the inter-core waiting is eliminated or hidden. However, sometimes the conflict-free scheduling does not perform well when there are insufficient conflict-free code blocks to feed all the available cores. We found that this problem with conflict-free scheduling can be partially fixed by complementing it with the naive scheduling. Experiments with SPEC2006 showed that in most cases (24 out of 29) the improved scheduling (hybrid scheduling), i.e., the combination of the conflict-free scheduling and the naive scheduling, significantly outperformed the naive scheduling.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: This paper presents three scheduling heuristics for running large-scale, data-intensive scientific workflows in clouds designed to leverage slot queue threshold, data locality and data prefetching, respectively.
Abstract: The scale of scientific applications becomes increasingly large not only in computation, but also in data. Many of these applications also involve inter-related tasks with data dependencies; hence, they are scientific workflows. The efficient coordination of executing scientific workflows is of great practical importance, and the core of such coordination is scheduling and resource allocation. In this paper, we present three scheduling heuristics for running large-scale, data-intensive scientific workflows in clouds. In particular, the three heuristic algorithms are designed to leverage slot queue threshold, data locality and data prefetching, respectively. We also demonstrate how these heuristics can be used collectively to tackle different issues in running "data-intensive" workflows in clouds, although each of them can be used independently. The practicality of our algorithms has been realized by implementing and incorporating them into our workflow execution system (DEWE). Using Montage, an astronomical image mosaic engine, as an example workflow, and Amazon EC2 as the cloud environment, we evaluate the performance of our heuristics primarily in terms of completion time (makespan). We also scrutinize workflow execution, showing the different execution phases to identify their impact on performance. Our algorithms scale well and reduce makespan by up to 27%.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A fault-tolerant routing algorithm based on improved safety levels is proposed to attain higher reachability; its time and space complexities are estimated, and a computer experiment verifies its effectiveness.
Abstract: In a parallel processing system, a pancake graph is one of the superior topologies for an interconnection network because of its small diameter and high degree. In previous research, fault-tolerant routing using restricted global information called safety levels in a pancake graph was proposed, but there is room for improvement. Therefore, we propose a fault-tolerant routing algorithm based on improved safety levels to attain higher reachability. In addition, we estimated the time and space complexities of the proposed method and carried out a computer experiment to verify its effectiveness.

Proceedings ArticleDOI
09 Dec 2014
TL;DR: This work finds that when the targeted application has abundant parallelism and high data sharing between tasks, one of the proposed modifications of work stealing outperforms the rest.
Abstract: Classical work stealing is an efficient dynamic load-balancing technique in shared-memory multiprocessor or multicore systems, but its performance on clustered chip multicores is not appreciable, so modifications are necessary to improve performance. In this paper, we discuss many earlier proposed modifications and also propose some simple modifications to suit the targeted clustered environment. We describe a methodology to evaluate all the variations of work stealing analytically and experimentally, on a multiprocessor simulator and on a real platform. Our evaluation methodology includes the design of a novel parametric synthetic benchmark that can mimic the behavior (or profile) of many real-life benchmarks. The designed synthetic benchmark caters to a wide range of application profiles for evaluating the design space of both the work-stealing variations and the clustered chip multiprocessor. We find that when the available parallelism of the targeted application is high and data sharing between tasks is high, one of the proposed modifications of work stealing (probabilistic victim search with a threshold on the size of migratable tasks) outperforms the rest.
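
The two ideas named in the abstract, probabilistic victim search and a migration-size threshold, can be sketched in a few lines of Python (our toy, not the authors' implementation; worker counts, task sizes and the threshold are made up):

```python
import random
from collections import deque

NUM_WORKERS, MIGRATE_THRESHOLD = 4, 2    # hypothetical tuning knobs

# Each worker owns a deque of (task_id, size); size models the cost of
# moving the task's shared data to another cluster.
queues = [deque() for _ in range(NUM_WORKERS)]
random.seed(0)
queues[0].extend(("t%d" % i, random.randint(1, 4)) for i in range(8))

def steal(thief):
    """Probabilistic victim search: sample a victim weighted by queue length,
    then migrate its tail task only if it is below the size threshold."""
    weights = [len(q) if i != thief else 0 for i, q in enumerate(queues)]
    if sum(weights) == 0:
        return None                      # nothing to steal anywhere
    victim = random.choices(range(NUM_WORKERS), weights=weights)[0]
    task = queues[victim][-1]            # peek at the victim's tail
    if task[1] > MIGRATE_THRESHOLD:
        return None                      # too much shared data to migrate
    queues[victim].pop()
    queues[thief].append(task)
    return task

print(steal(3))   # e.g. ('t7', 1), or None if the tail task was too large
```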

Proceedings ArticleDOI
09 Dec 2014
TL;DR: A technique to integrate energy-aware real-time scheduling into battery-powered sensor node platforms that yields efficient performance in terms of low energy consumption and a fast average response time for non-real-time tasks, whilst meeting the timing constraints of real-time tasks.
Abstract: This paper primarily presents a technique to integrate energy-aware real-time scheduling into battery-powered sensor node platforms. The proposed scheduling technique attempts to achieve energy savings while meeting timing requirements of real-time tasks as well as improving the response time of non-real-time tasks. The proposed energy-aware real-time scheduling technique is addressed from two perspectives: a dynamic voltage scaling (DVS)-based scheduling approach and a combined real-time and non-real-time scheduling approach. From the perspective of DVS-based scheduling approach, it reduces the processor frequency and still ensures that no tasks miss their timing constraints. From the perspective of a combined real-time and non-real-time scheduling scheme, tasks are allowed to have timing constraints. Consequently, real-time tasks need to be processed in a timely fashion and delivered to meet their timing constraints, and non-real-time tasks need to be quickly processed to deliver a fast response. An experimental evaluation shows that the proposed scheduling technique yields efficient performance in terms of low energy consumption and a fast average response time for non-real-time tasks, whilst meeting the timing constraints of real-time tasks.
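
To make the DVS idea concrete, here is a minimal Python sketch (ours, not the paper's scheduler) of classic static voltage scaling under EDF: for periodic tasks with worst-case execution time C_i at full speed and period T_i, running at the lowest frequency whose scaled utilization stays at or below 1 preserves EDF schedulability while saving energy. The task set, frequency levels and platform maximum are made up.

```python
tasks = [(2.0, 10.0), (1.0, 20.0), (3.0, 30.0)]   # hypothetical (C_i, T_i)

U = sum(c / t for c, t in tasks)                  # full-speed utilization
f_max = 1000.0                                    # MHz, made-up platform max
levels = [250.0, 500.0, 750.0, 1000.0]            # supported DVS steps

# Pick the slowest supported frequency f with U * (f_max / f) <= 1,
# i.e., the task set remains EDF-schedulable at the reduced speed.
f = min(l for l in levels if U * f_max / l <= 1.0)
print(f"U={U:.2f}, chosen frequency={f:.0f} MHz "
      f"(scaled utilization={U * f_max / f:.2f})")
```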

Proceedings ArticleDOI
09 Dec 2014
TL;DR: An online strategy for mapping an application's tasks and data onto 3D LCMP platforms, using a proposed approximate-nearest approach to allocate resources of one layer with respect to the other resource layer (either processor or memory).
Abstract: The performance of 3D stacked memory in multicore systems is impressive. In a three-dimensional stacked large chip multiprocessor (3D LCMP), memory and the memory network are integrated on top of the processors and the processor network. In this paper, we propose an online strategy for mapping an application's tasks and data onto 3D LCMP platforms. To meet performance constraints, every application demands a set of resources (a number of processors and an amount of memory). An important criterion in allocation is to place all required resources as near to each other as possible, to reduce the communication overhead. We propose an approximate-nearest approach to allocate resources of one layer with respect to the other resource layer (either processor or memory). We compare three strategies for allocating resources to applications: in the first, processor allocation is preferred over memory allocation; in the second, memory allocation is preferred over processor allocation; and in the third, the priority is set depending on the requirements of the application. Our experimental analysis shows a 21% average and up to 30% improvement over state-of-the-art one-layer resource allocation. On-demand priority-based resource allocation improves 26% over simple processor- or memory-priority-based resource allocation.