
Showing papers in "Concurrency and Computation: Practice and Experience in 2012"


Journal ArticleDOI
TL;DR: A competitive analysis is conducted and competitive ratios of optimal online deterministic algorithms for the single VM migration and dynamic VM consolidation problems are proved, and novel adaptive heuristics for dynamic consolidation of VMs are proposed based on an analysis of historical data from the resource usage by VMs.
Abstract: The rapid growth in demand for computational power driven by modern service applications, combined with the shift to the Cloud computing model, has led to the establishment of large-scale virtualized data centers. Such data centers consume enormous amounts of electrical energy, resulting in high operating costs and carbon dioxide emissions. Dynamic consolidation of virtual machines (VMs) using live migration and switching idle nodes to sleep mode allows Cloud providers to optimize resource usage and reduce energy consumption. However, the obligation of providing high quality of service to customers leads to the necessity of dealing with the energy-performance trade-off, as aggressive consolidation may lead to performance degradation. Because of the variability of workloads experienced by modern applications, the VM placement should be optimized continuously in an online manner. To understand the implications of the online nature of the problem, we conduct a competitive analysis and prove competitive ratios of optimal online deterministic algorithms for the single VM migration and dynamic VM consolidation problems. Furthermore, we propose novel adaptive heuristics for dynamic consolidation of VMs based on an analysis of historical data on the resource usage of VMs. The proposed algorithms significantly reduce energy consumption while ensuring a high level of adherence to the service level agreement. We validate the high efficiency of the proposed algorithms by extensive simulations using real-world workload traces from more than a thousand PlanetLab VMs. Copyright © 2011 John Wiley & Sons, Ltd.
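A minimal sketch of the kind of adaptive, history-driven overload detection described above (the MAD-based threshold and all names here are illustrative assumptions, not the authors' exact algorithm):

```python
import statistics

def mad(history):
    """Median absolute deviation of a CPU-utilization history (robust spread)."""
    med = statistics.median(history)
    return statistics.median(abs(u - med) for u in history)

def host_overloaded(history, safety=2.5):
    """Adaptive overload check: the utilization threshold tightens when the
    recent history is volatile (large MAD) and relaxes when it is stable."""
    threshold = 1.0 - safety * mad(history)
    return history[-1] >= threshold

# A host with volatile recent utilization is flagged for VM migration earlier.
print(host_overloaded([0.5, 0.9, 0.4, 0.85, 0.8]))  # True
```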

1,616 citations


Journal ArticleDOI
TL;DR: A novel intermediate data storage strategy that can reduce the cost of scientific cloud workflow systems by automatically storing appropriate intermediate data sets with one cloud service provider is developed.
Abstract: Many scientific workflows are data intensive, where large volumes of intermediate data are generated during their execution. Some valuable intermediate data need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system storage capacity, determined manually. As doing science in the cloud has become popular nowadays, more intermediate data can be stored in scientific cloud workflows based on a pay-for-use model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate data can be regenerated, and as such we develop a novel intermediate data storage strategy that can reduce the cost of scientific cloud workflow systems by automatically storing appropriate intermediate data sets with one cloud service provider. The strategy has significant research merits, in that it achieves a cost-effective trade-off between computation cost and storage cost and is not strongly impacted by inaccuracy in forecasts of data set usage. Meanwhile, the strategy also takes users' tolerance of data access delay into consideration. We utilize Amazon's cost model and apply the strategy to general random as well as specific astrophysics pulsar searching scientific workflows for evaluation. The results show that our strategy can reduce the overall cost of scientific cloud workflow execution significantly. Copyright © 2010 John Wiley & Sons, Ltd. (A preliminary version of this paper was published in the proceedings of IPDPS'2010, Atlanta, U.S.A., April 2010.)
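The core of such a strategy is a per-dataset cost comparison: keep a dataset if storing it costs less per unit time than regenerating it on demand. A minimal sketch under that reading (function and parameter names are assumptions; the paper's strategy additionally walks the IDG, since regenerating a deleted dataset may require regenerating deleted ancestors first):

```python
def should_store(regen_cost, usage_rate, storage_rate):
    """Store a dataset iff keeping it is cheaper per unit time than
    regenerating it on every access.

    regen_cost   -- computation cost to regenerate the dataset once
    usage_rate   -- expected accesses per unit time
    storage_rate -- storage cost per unit time (size x price per GB-time)
    """
    return storage_rate <= regen_cost * usage_rate

# A dataset costing $0.02/day to keep and $1.50 to regenerate, used about
# once every 10 days, is worth storing (0.02 <= 1.50 * 0.1).
print(should_store(regen_cost=1.50, usage_rate=0.1, storage_rate=0.02))  # True
```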

72 citations


Journal ArticleDOI
TL;DR: This paper proposes and investigates the application of market‐oriented mechanisms based on the General Equilibrium Theory of Microeconomics to coordinate the sharing of resources between the Clouds in the Federated Cloud.
Abstract: Cloud Computing is the latest paradigm proposed toward fulfilling the vision of computing being delivered as a utility, such as phone, electricity, gas, and water services. It enables users to have access to computing infrastructure, platform, and software as services over the Internet. The services can be accessed on demand and from anywhere in the world in a quick and flexible manner, and charged for based on their usage, making the rapid and often unpredictable expansion demanded by today's business environment affordable also for small spin-off and start-up companies. In order to be competitive, however, Cloud providers need to be able to adapt to the dynamic loads from users, not only optimizing the local usage and costs but also entering into agreements with other Clouds so as to complement local capacity. The infrastructure in which competing Clouds are able to cooperate to maximize their benefits is called a Federated Cloud. Just as Clouds enable users to cope with unexpected demand loads, a Federated Cloud will enable individual Clouds to cope with unforeseen variations of demand. The definition of the mechanism to ensure mutual benefits for the individual Clouds composing the federation, however, is one of its main challenges. This paper proposes and investigates the application of market-oriented mechanisms based on the General Equilibrium Theory of Microeconomics to coordinate the sharing of resources between the Clouds in the Federated Cloud. Copyright © 2010 John Wiley & Sons, Ltd.
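General Equilibrium coordination is commonly realized as a tâtonnement-style iteration: prices of shared resources rise while aggregate demand from the federated Clouds exceeds supply and fall otherwise, until markets clear. The sketch below shows only that generic scheme; it is an assumption for illustration, not the paper's concrete mechanism:

```python
def tatonnement(excess_demand, price=1.0, step=0.1, tol=1e-4, max_iter=1000):
    """Adjust a resource price until excess demand is (approximately) zero.

    excess_demand(price) -> aggregate demand minus supply at that price.
    """
    for _ in range(max_iter):
        z = excess_demand(price)
        if abs(z) < tol:
            break
        price = max(1e-9, price + step * z)  # raise on shortage, cut on glut
    return price

# Toy market: demand 10/p against a fixed supply of 5 clears at price 2.
print(round(tatonnement(lambda p: 10.0 / p - 5.0), 2))  # 2.0
```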

55 citations


Journal ArticleDOI
TL;DR: SciCumulus is a middleware that manages the parallel execution of scientific workflows in cloud environments that adapts itself according to the availability of resources during workflow execution and dynamically tunes the workflow activity size to achieve better performance.
Abstract: Many of the existing large-scale scientific experiments modeled as scientific workflows are compute-intensive. Some scientific workflow management systems already explore parallel techniques, such as parameter sweep and data fragmentation, to improve performance. In those systems, computing resources are used to accomplish many computational tasks in high performance environments, such as multiprocessor machines or clusters. Meanwhile, cloud computing provides scalable and elastic resources that can be instantiated on demand during the course of a scientific experiment, without requiring its users to acquire expensive infrastructure or to configure many pieces of software. In fact, because of these advantages some scientists have already adopted the cloud model in their scientific experiments. However, this model also raises many challenges. When scientists are executing scientific workflows that require parallelism, it is hard to decide a priori the amount of resources to use and how long they will be needed, because the allocation of these resources is elastic and based on demand. In addition, scientists have to manage new aspects such as the initialization of virtual machines and the impact of data staging. SciCumulus is a middleware that manages the parallel execution of scientific workflows in cloud environments. In this paper, we introduce an adaptive approach for executing parallel scientific workflows in the cloud. This approach adapts itself according to the availability of resources during workflow execution. It checks the available computational power and dynamically tunes the workflow activity size to achieve better performance. Experimental evaluation showed the benefits of parallelizing scientific workflows using the adaptive approach of SciCumulus, which yielded performance improvements of up to 47.1%. Copyright © 2011 John Wiley & Sons, Ltd.

49 citations


Journal ArticleDOI
TL;DR: This paper presents a quantization method for the computation of the conditional expectations that allows a straightforward parallelization on the MC level and develops for first‐order autoregressive processes a further parallelization in the time domain, which makes use of faster memory structures and therefore maximizes parallel execution.
Abstract: The pricing of American style and multiple exercise options is a very challenging problem in mathematical finance. One usually employs a least squares Monte Carlo approach (Longstaff–Schwartz method) for the evaluation of conditional expectations, which arise in the backward dynamic programming principle for such optimal stopping or stochastic control problems in a Markovian framework. Unfortunately, these least squares MC approaches are rather slow and, because of the dependency structure in the backward dynamic programming principle, allow no parallel implementation either on the MC level or on the time-layer level of this problem. We therefore present in this paper a quantization method for the computation of the conditional expectations that allows a straightforward parallelization on the MC level. Moreover, we are able to develop for first-order autoregressive processes a further parallelization in the time domain, which makes use of faster memory structures and therefore maximizes parallel execution. Furthermore, we discuss the generation of random numbers in parallel on a GPGPU architecture, which is the crucial tool for the application of massively parallel computing architectures in mathematical finance. Finally, we present numerical results for a CUDA implementation of these methods. Such an implementation turns out to lead to an impressive speed-up compared with a serial CPU implementation. Copyright © 2011 John Wiley & Sons, Ltd.
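For reference, the backward dynamic programming principle mentioned above reads, for exercise dates $t_0 < \dots < t_N$ and payoff $g$:

```latex
V_{t_N}(x) = g(t_N, x), \qquad
V_{t_j}(x) = \max\!\Bigl( g(t_j, x),\;
  \mathbb{E}\bigl[\, V_{t_{j+1}}(X_{t_{j+1}}) \,\big|\, X_{t_j} = x \bigr] \Bigr).
```

The Longstaff–Schwartz method estimates the conditional expectation by regression over all simulated paths, which couples them; quantization instead restricts $X$ to a finite grid, so the expectation becomes a fixed weighted sum over grid transitions that each Monte Carlo path can evaluate independently, hence the straightforward MC-level parallelization.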

41 citations


Journal ArticleDOI
TL;DR: The paradigm shift towards multicore and manycore technologies coupled with accelerators in a heterogeneous environment offers great potential in computing power for scientific and industrial applications, and holistic approaches coupling expertise ranging from hardware architecture and software design to numerical algorithms are a pressing necessity.
Abstract: In the last few years, the landscape of parallel computing has been subject to profound and highly dynamic changes. The paradigm shift towards multicore and manycore technologies coupled with accelerators in a heterogeneous environment offers great potential in computing power for scientific and industrial applications. However, to take full advantage of these new technologies, holistic approaches coupling the expertise ranging from hardware architecture and software design to numerical algorithms are a pressing necessity. Parallel computing is no longer limited to supercomputers and is now much more diversified – with a multitude of technologies, architectures, and programming approaches leading to increased complexity for developers and engineers. In this work, we give – from the perspective of numerical simulation and applications – an overview of existing and emerging multicore and manycore technologies as well as accelerator concepts. We emphasize the challenges associated with high-performance heterogeneous computing and discuss the interfaces needed to fill the gap between the hardware architecture and the implementation of efficient numerical algorithms. By means of this short survey – which stresses the necessity of hardware-aware computing – we aim to assist users in scientific computing who are entering this fascinating field and to help them understand the associated issues and capabilities. Copyright © 2011 John Wiley & Sons, Ltd.

40 citations


Journal ArticleDOI
TL;DR: Automatic composition results in a table‐driven implementation that, for each parallel call of a performance‐aware component, looks up the expected best implementation variant, processor allocation and schedule given the current problem, and processor group sizes.
Abstract: We describe the principles of a novel framework for performance-aware composition of sequential and explicitly parallel software components with implementation variants. Automatic composition results in a table-driven implementation that, for each parallel call of a performance-aware component, looks up the expected best implementation variant, processor allocation and schedule given the current problem, and processor group sizes. The dispatch tables are computed off-line at component deployment time by an interleaved dynamic programming algorithm from time-prediction meta-code provided by the component supplier. Copyright © 2011 John Wiley & Sons, Ltd.
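A sketch of what such a table-driven dispatch can look like at run time (table contents, bucketing, and names are invented for illustration; in the framework the table is computed off-line by dynamic programming over the supplier's time-prediction meta-code):

```python
# Dispatch table computed at deployment time: keys are (problem-size bucket,
# processor-group size); values are the expected-best variant and allocation.
DISPATCH = {
    (0, 4): ("sequential", 1),
    (1, 4): ("parallel_rowwise", 4),
    (1, 8): ("parallel_blocked", 8),
}

def size_bucket(n, edges=(1024,)):
    """Map a problem size onto a coarse bucket index."""
    return sum(n > e for e in edges)

def call(variants, n, procs):
    """Run-time dispatch: one table lookup per parallel component call."""
    name, alloc = DISPATCH[(size_bucket(n), procs)]
    return variants[name](n, alloc)

variants = {
    "sequential":       lambda n, p: f"sequential on {p} processor",
    "parallel_rowwise": lambda n, p: f"row-wise parallel on {p} processors",
    "parallel_blocked": lambda n, p: f"blocked parallel on {p} processors",
}
print(call(variants, 5000, 4))  # row-wise parallel on 4 processors
```

Note that a lookup may legitimately answer "run sequentially on one processor" even when more processors are available, which is exactly the case performance-aware composition is designed to catch.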

38 citations


Journal ArticleDOI
Kimikazu Kato, Tikara Hosino
TL;DR: This work gives an effective algorithm to solve the k-nearest neighbor problem, and shows that when the size of the problem is large, an implementation of the algorithm on two GPUs runs more than 330 times faster than a single-core implementation on a recent CPU.
Abstract: The recommendation system is a mechanism which automatically recommends items that are likely to be of interest to the user. In a recommendation system, customers' preferences are encoded into vectors, and finding the nearest vectors to each vector is an essential part. This vector-searching part of the problem is called a k-nearest neighbor problem. We give an effective algorithm to solve this problem on multiple graphics processing units (GPUs). Our algorithm consists of two parts: the N-body problem and the partial sort. For the N-body problem, we applied the idea of a known algorithm, although another trick is needed to overcome the problem of small-sized shared memory. For the partial sort, we give a novel GPU algorithm which is effective for small k. In our partial sort algorithm, a heap is accessed in parallel by threads with a low cost of synchronization. We show through an experiment that when the size of the problem is large, an implementation of the algorithm on two GPUs runs more than 330 times faster than a single-core implementation on a recent CPU. Copyright © 2011 John Wiley & Sons, Ltd.
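The "partial sort effective for small k" can be pictured with a bounded max-heap that keeps the k smallest distances seen so far; a candidate enters only if it beats the current worst. This sequential sketch shows the data-structure idea only (the paper's contribution is letting GPU threads share such a heap with cheap synchronization):

```python
import heapq

def k_nearest(query, points, k):
    """Return the k points closest to `query` (squared Euclidean distance),
    keeping a max-heap of the k best: the root is the current worst."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    heap = []  # entries are (-distance, index): largest distance at the root
    for i, p in enumerate(points):
        d = dist(query, p)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif -heap[0][0] > d:
            heapq.heapreplace(heap, (-d, i))  # evict the current worst
    return [points[i] for _, i in sorted(heap, reverse=True)]

print(k_nearest((0, 0), [(3, 4), (1, 1), (0, 2), (5, 5)], k=2))  # [(1, 1), (0, 2)]
```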

37 citations


Journal ArticleDOI
TL;DR: This work extends earlier work in two ways: by applying the same techniques to accelerate the computation of portfolio-level risk for credit derivatives, and to different asset classes using a different type of mathematical model, which together present challenges that are quite different to those dealt with in the earlier work.
Abstract: We report new results from an on-going project to accelerate derivatives computations. Our earlier work was focused on accelerating the valuation of credit derivatives. In this paper, we extend our work in two ways: by applying the same techniques, first, to accelerate the computation of portfolio-level risk for credit derivatives and, second, to different asset classes using a different type of mathematical model, which together present challenges that are quite different to those dealt with in our earlier work. Specifically, we report acceleration of more than 270 times over a single Intel core for a multi-asset Monte Carlo model. We also explore the implications for risk. Copyright © 2011 John Wiley & Sons, Ltd.

37 citations


Journal ArticleDOI
TL;DR: ProvManager is proposed, a provenance management approach that eases the gathering, storage, and analysis of provenance information in a distributed and heterogeneous environment scenario, without putting the burden of adaptations on the scientist.
Abstract: Running scientific workflows in distributed and heterogeneous environments has been a motivating approach for provenance management, which is loosely coupled to the workflow execution engine. This kind of approach is interesting because it allows both storage and access to provenance data in a homogeneous way, even in an environment where different workflow management systems work together. However, current approaches overload scientists with many ad hoc tasks, such as script adaptations and implementations of extra functionalities to provide provenance independence. This paper proposes ProvManager, a provenance management approach that eases the gathering, storage, and analysis of provenance information in a distributed and heterogeneous environment scenario, without putting the burden of adaptations on the scientist. ProvManager leverages the provenance management at the experiment level by integrating different workflow executions from multiple workflow management systems. Copyright © 2011 John Wiley & Sons, Ltd.

37 citations


Journal ArticleDOI
TL;DR: A cold chain logistics system based on cloud computing can be used to connect the databases of cold chain logistics providers and their external customers, so that each database connection terminal can keep track of and update the data.
Abstract: All information technology (IT) resources that cloud computing provides can be seen as services. As a new IT implementation, cloud computing now has a profound impact on technological change. Cloud computing can provide critical software for business management, effectively reducing IT and maintenance costs for hardware and software. It can also enable small and medium enterprises to have access to professional IT solutions with less IT investment. A cold chain logistics system based on cloud computing can be used to connect the databases of cold chain logistics providers and their external customers, so that each database connection terminal can keep track of and update the data. Such a system is designed in this paper. It consists of six functional parts: (1) data collection; (2) data calculation; (3) data updating and transmission; (4) configuration management; (5) control strategy; and (6) business development. It brings better cooperation between cold chain logistics providers and their customers, realizes co-control of product sales information, accelerates cold chain logistics, and maximizes the interests of all parties. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: The proposed mechanism is a practical approach to efficiently coordinate concurrent service negotiations within complex workflows, enabling the iterative and interactive adjustment of the negotiation boundaries for each atomic service in a composition based on the performance of other atomic negotiations.
Abstract: The end-to-end QoS negotiation for service level agreement establishment for composite services involves compound multi-party negotiations in which the composite service provider concurrently negotiates with multiple candidates for each atomic service, selecting the one that best satisfies the atomic service QoS preferences while ensuring that the end-to-end QoS requirements are also fulfilled. In order to be able to negotiate with potential candidates, it is necessary to derive the atomic utility boundaries from the global utility boundary. Additionally, there has to be a mechanism for updating these boundaries in subsequent negotiation rounds based on the individual negotiation outcomes. In this paper, we propose an algorithm for the decomposition of the global utility boundary into atomic service utility boundaries, and the surplus redistribution from successful negotiation outcomes among the remaining negotiations. The proposed mechanism is a practical approach to efficiently coordinate concurrent service negotiations within complex workflows, enabling the iterative and interactive adjustment of the negotiation boundaries for each atomic service in a composition based on the performance of other atomic negotiations. We demonstrate the feasibility of our approach by evaluating it with some popular negotiation strategies using the Specialized Property Search Scenario. Copyright © 2011 John Wiley & Sons, Ltd.
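A minimal sketch of the two coordination steps named above, decomposition and surplus redistribution (the proportional split, even redistribution, and all names are simplifying assumptions; the paper works with utility boundaries rather than a flat price budget):

```python
def decompose(global_budget, weights):
    """Split a global boundary (e.g. maximum total price) into atomic
    service boundaries in proportion to per-service weights."""
    total = sum(weights.values())
    return {s: global_budget * w / total for s, w in weights.items()}

def redistribute(boundaries, settled, agreed, open_services):
    """After `settled` concludes below its boundary, spread the surplus
    over the negotiations that are still running, loosening their limits."""
    surplus = boundaries[settled] - agreed
    boundaries[settled] = agreed
    if surplus > 0 and open_services:
        share = surplus / len(open_services)
        for s in open_services:
            boundaries[s] += share
    return boundaries

b = decompose(100.0, {"flights": 2, "hotel": 2, "car": 1})  # 40 / 40 / 20
b = redistribute(b, "flights", 32.0, ["hotel", "car"])      # surplus 8 -> +4 each
print(b)  # {'flights': 32.0, 'hotel': 44.0, 'car': 24.0}
```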

Journal ArticleDOI
TL;DR: A flexible software architecture and a language for systems of mobile agents are presented, starting from a formalism with timed interactions and explicit locations, and timed migration in a distributed environment is supported.
Abstract: In this paper, we present a flexible software architecture and a language for systems of mobile agents, starting from a formalism with timed interactions and explicit locations. The language supports the specification of a distributed system, that is, agents and their physical distribution, and allows timed migration in a distributed environment. Advanced software technologies are used to define the software architecture and the agent language, also facilitating agent development. We illustrate the system with a dynamic network discovery example in which the agents take into account the latency and the CPU load when choosing where to migrate. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: This work develops highly efficient parallel PDE‐based pricing methods on graphics processing units (GPUs) for multi‐asset American options by pricing American options written on three assets using a combination of a discrete penalty approach and a GPU‐based parallel alternating direction implicit approximate factorization technique.
Abstract: We develop highly efficient parallel PDE-based pricing methods on graphics processing units (GPUs) for multi-asset American options. Our pricing approach is built upon a combination of a discrete penalty approach for the linear complementarity problem arising because of the free boundary and a GPU-based parallel alternating direction implicit approximate factorization technique with finite differences on uniform grids for the solution of the linear algebraic system arising from each penalty iteration. A timestep size selector implemented efficiently on GPUs is used to further increase the efficiency of the methods. We demonstrate the efficiency and accuracy of the parallel numerical methods by pricing American options written on three assets. Copyright © 2011 John Wiley & Sons, Ltd.
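In the (standard) penalty formulation that such approaches build on, the linear complementarity problem for the option value $V$ with payoff $V^*$ is replaced by a nonlinear PDE with a large penalty parameter $\rho$; each penalty iteration then produces the linear system handed to the ADI solver:

```latex
\frac{\partial V}{\partial \tau} \;=\; \mathcal{L}\,V \;+\; \rho \,\max\bigl(V^{*} - V,\; 0\bigr),
\qquad \rho \gg 1,
```

where $\mathcal{L}$ is the multi-asset Black–Scholes operator and the penalty term enforces the early-exercise constraint $V \ge V^{*}$ as $\rho \to \infty$. (This is the generic discrete-penalty setting; the paper's exact discretization is not reproduced here.)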

Journal ArticleDOI
TL;DR: A workload characterization methodology was developed to support NWChem performance engineering on large-scale parallel platforms and has successfully identified several algorithmic bottlenecks, which are already being tackled by computational chemists to improve NWChem performance.
Abstract: The use of global address space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, lack of good evaluative methods to observe multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations of system and application design for future improvement. NWChem is a popular computational chemistry package, which depends on the Global Arrays/Aggregate Remote Memory Copy Interface suite for partitioned global address space functionality to deliver high-end molecular modeling capabilities. A workload characterization methodology was developed to support NWChem performance engineering on large-scale parallel platforms. The research involved both the integration of performance instrumentation and measurement in the NWChem software, as well as the analysis of one-sided communication performance in the context of NWChem workloads. Scaling studies were conducted for NWChem on Blue Gene/P and on two large-scale clusters using different generation Infiniband interconnects and x86 processors. The performance analysis and results show how subtle changes in the runtime parameters related to the communication subsystem could have significant impact on performance behavior. The tool has successfully identified several algorithmic bottlenecks, which are already being tackled by computational chemists to improve NWChem performance. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A framework and software tool intended for simulation of cooperative defence mechanisms against botnets and distributed denial of service attacks is outlined, based on an agent-oriented approach and packet-level network simulation.
Abstract: The paper outlines a framework and software tool intended for simulation of cooperative defence mechanisms against botnets. The framework and software tool are based on an agent-oriented approach and packet-level network simulation. They are intended to evaluate and compare different cooperative distributed attacks and defence mechanisms. Botnet and defence components are represented in the paper as a set of collaborating and counteracting agent teams. Agents are supposed to collect information from various network sources, operate on different situational knowledge, and react to actions of other agents. The paper describes the results of experiments aimed at investigating botnets and distributed denial of service defence mechanisms. We explore various botnet attacks and counteraction against them using the example of defence against distributed denial of service attacks. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: This work presents a model‐driven approach for generating a provably correct implementation of the transaction model of interest, and the specification of nested transactions is verified, because it is the basis for many advanced transaction models.
Abstract: In modern transaction processing software, the ACID properties (atomicity, consistency, isolation, durability) are often relaxed, in order to address requirements that arise in computing environments of today. Typical examples are the long-running transactions in mobile computing, in service-oriented architectures and B2B collaborative applications. These new transaction models are collectively known as advanced or extended transactions. Formal specification and reasoning for transaction properties have been limited to proof-theoretic approaches, despite the recent progress in model checking. In this work, we present a model-driven approach for generating a provably correct implementation of the transaction model of interest. The model is specified by state machines for the transaction participants, which are synchronized on a set of events. All possible execution paths of the synchronized state machines are checked for property violations. An implementation for the verified transaction model is then automatically generated. To demonstrate the approach, the specification of nested transactions is verified, because it is the basis for many advanced transaction models. Concurrency and Computation: Practice and Experience. Copyright © 2012 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: Results show that the new model keeps the accuracy of the underlying bio‐inspired trust model and the level of client satisfaction, while enhancing the interpretability of the model and thus making it closer to the final user.
Abstract: Trust is, in some cases, being considered as a requirement in highly distributed communication scenarios. Before accessing a particular service, a trust model is then used in these scenarios to determine whether the service provider can be trusted or not. This is usually done on behalf of the final user or service customer, with little intervention from him or her, mainly in order to automate the process, and because trust models normally make use of reasoning mechanisms and models that are difficult for humans to understand. In this paper, we propose the adaptation of a bio-inspired trust model to deal with linguistic fuzzy labels, which are closer to the human way of thinking. This Linguistic Fuzzy Trust Model also uses fuzzy reasoning. Results show that the new model keeps the accuracy of the underlying bio-inspired trust model and the level of client satisfaction, while enhancing the interpretability of the model and thus making it closer to the final user. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: This article proposes a methodology that addresses the issues of unreliability and unpredictability such that Cloud software services could be hosted upon volunteered resources, and shows that the backend infrastructure of the Cloud service can be scaled out elastically, opportunistically, and autonomically.
Abstract: Many research institutions and universities own computational capacity that is not effectively utilized, thereby providing an opportunity for such institutions to use such capacity to offer Cloud services (to both internal and external users). However, the unreliability and unpredictability of these resources mean that their use in the context of a Service Level Agreement (SLA) is high risk, leading to a reduction in reputation as well as economic penalties in case of SLA violation. We propose a methodology that addresses the issues of unreliability and unpredictability such that Cloud software services could be hosted upon volunteered resources. To enable the harnessing of these resources we rely on autonomic fault management techniques that allow such systems to independently adapt the resources they use based upon their perception of individual resource reliability. Using our approach we were able to scale out the backend infrastructure of the Cloud service elastically (min 30 s per worker), opportunistically and autonomically. We address two key questions in this article: can a campus volunteer infrastructure be used in Cloud provisioning? What measures are necessary in order to ensure reliability at the resource level? Copyright © 2011 John Wiley & Sons, Ltd.
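A sketch of the reliability-perception idea (scoring rule, prior, and names are assumptions for illustration): each volunteered host carries an exponentially weighted success score updated from observed lease outcomes, and new workers are placed on the currently most reliable hosts.

```python
class ResourcePool:
    """Track per-host reliability and pick hosts for new Cloud workers."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # weight of the newest observation
        self.score = {}     # host -> reliability score in [0, 1]

    def observe(self, host, succeeded):
        """Exponentially weighted update from one lease outcome."""
        old = self.score.get(host, 0.5)  # optimistic prior for new hosts
        self.score[host] = (1 - self.alpha) * old + self.alpha * float(succeeded)

    def pick(self, n):
        """Choose the n most reliable hosts for the next scale-out step."""
        return sorted(self.score, key=self.score.get, reverse=True)[:n]

pool = ResourcePool()
for host, ok in [("lab-pc-1", True), ("lab-pc-2", False), ("lab-pc-1", True)]:
    pool.observe(host, ok)
print(pool.pick(1))  # ['lab-pc-1']
```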

Journal ArticleDOI
TL;DR: The bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable to shared-memory multicore systems, and BSP libraries are useful in scientific computing on these systems.
Abstract: We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable to shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has been implemented in Java, and is used to show that BSP algorithms can attain proper speedups on multicore architectures. This library is based on the BSPlib implementation, adapted to an object-oriented setting. In comparison, the number of function primitives is reduced, while the overall design simplicity is improved. We detail applying the BSP model and library to the sparse matrix–vector (SpMV) multiplication problem, and show by performing numerical experiments that the resulting BSP SpMV algorithm attains speedups, in one case reaching a speedup of 3.5 for 4 threads. Although not described in detail in this paper, algorithms for the fast Fourier transform and the dense LU decomposition are also investigated; in one case, these attain superlinear speedups of 5 for 4 threads. The predictability of BSP algorithms in the case of the SpMV is also investigated. Copyright © 2011 John Wiley & Sons, Ltd.
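The BSP structure of the SpMV is easy to sketch: each thread computes its block of rows in a local computation phase, then all threads meet at a barrier before the result may be read. The sketch below (in Python rather than the paper's Java, so it illustrates structure, not speedup) assumes the matrix in CSR form:

```python
import threading

def bsp_spmv(indptr, indices, data, x, y, nthreads=2):
    """One BSP superstep of CSR sparse matrix-vector multiply y = A*x:
    row-block-parallel local computation, then bulk synchronization."""
    n = len(indptr) - 1
    barrier = threading.Barrier(nthreads)

    def worker(tid):
        lo, hi = tid * n // nthreads, (tid + 1) * n // nthreads
        for i in range(lo, hi):  # local computation on this thread's rows
            y[i] = sum(data[k] * x[indices[k]]
                       for k in range(indptr[i], indptr[i + 1]))
        barrier.wait()           # superstep boundary: y is now consistent

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return y

# A = [[2, 0], [1, 3]] in CSR, x = [1, 1]  ->  y = [2, 4]
print(bsp_spmv([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0], [0.0, 0.0]))
```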

Journal ArticleDOI
TL;DR: Understanding the behavior of large-scale distributed systems is generally extremely difficult, as it requires observing a very large number of components over long periods of time.
Abstract: Understanding the behavior of large-scale distributed systems is generally extremely difficult, as it requires observing a very large number of components over long periods of time. Most analysis tools for distributed systems gather basic information such as individual processor or network utilization. Although scalable because of the data reduction techniques applied before the analysis, these tools are often insufficient to detect or fully understand anomalies in the dynamic behavior of resource utilization and their influence on application performance. In this paper, we propose a methodology for detecting resource usage anomalies in large-scale distributed systems. The methodology relies on four functionalities: characterized trace collection, multi-scale data aggregation, specifically tailored user interaction techniques, and visualization techniques. We show the efficiency of this approach through the analysis of simulations of the volunteer computing Berkeley Open Infrastructure for Network Computing (BOINC) architecture. Three scenarios are analyzed in this paper: analysis of the resource sharing mechanism, resource usage considering response time instead of throughput, and the evaluation of input file size on the BOINC architecture. The results show that our methodology makes it possible to easily identify resource usage anomalies, such as unfair resource sharing, contention, moving network bottlenecks, and harmful short-term resource sharing. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A well‐known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S‐W) algorithm is taken as an example, and approaches that fully exploit its performance potentials on CPU, GPU, and field‐programmable gate array (FPGA) computing platforms are demonstrated.
Abstract: With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well-known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S-W) algorithm, as an example, and demonstrate approaches that fully exploit its performance potential on CPU, GPU, and field-programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction multiple data (SIMD) and multithreading, with compiler options, to gain over 70× speedups over naive CPU versions on quad-core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50× speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12× speedups, instead of the 100× reported by some studies, over the Intel quad-core CPU Q9400, under the same manufacturing technology and both with fully optimized schemes. In addition, for FPGA platforms, we customize a linear systolic array for the S-W algorithm in a 45-nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. Under only a 133 MHz clock rate, the FPGA platform reaches the highest performance and becomes the most power-efficient platform, using only 25 W compared with 190 W of the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.
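For reference, the recurrence all three platforms accelerate: each cell of the S-W scoring matrix depends only on its north, west, and north-west neighbours, so cells on one anti-diagonal are independent, which is exactly what the GPU tiling and the FPGA systolic array exploit. A plain (unoptimized) version:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """O(len(a)*len(b)) Smith-Waterman local-alignment score with linear
    gap penalty; the scoring parameters here are illustrative defaults."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # local alignment may restart
                          H[i - 1][j - 1] + s,  # match/mismatch (north-west)
                          H[i - 1][j] + gap,    # gap in b (north)
                          H[i][j - 1] + gap)    # gap in a (west)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))
```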

Journal ArticleDOI
TL;DR: Although the term itself has recently become less used, middleware still exists in mobile and sensor networks, service-oriented architectures, grid computing, cloud computing, online multi-player games, networked robotics, the Internet of things, and much more.
Abstract: Middleware! It was somewhere, and now it is definitely everywhere. Over the past four decades, the term middleware has been tossed around, picked up, and investigated vigorously. Yet, if you ask 10 different people "what is middleware?", you will most likely get 10 different answers. It started as some additions on top of operating systems to facilitate complex application development, moved on to become data integration features, then became a network application facilitator, and eventually became an important component of every distributed environment, application, system, and platform there is. To date, if you examine any type of distributed system or application, you will find middleware or some middleware functionality involved. Although the term itself has recently become less used, it still exists in mobile and sensor networks, service-oriented architectures, grid computing, cloud computing, online multi-player games, networked robotics, the Internet of things, and much more. So as is, middleware is really still everywhere and most likely will remain everywhere for a very long time.

Journal ArticleDOI
TL;DR: StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity; it deploys a data-flow model that analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible.
Abstract: Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one-sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs, both in terms of programmability and performance and compare it to that of the IBM X10 language. Copyright © 2012 John Wiley & Sons, Ltd.
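The data-flow model can be illustrated by how a StarSs-like runtime derives task dependencies from declared data accesses: a task depends on every earlier task whose writes intersect its reads or writes. A minimal sketch (the real runtimes do this incrementally and per memory region; the names here are invented):

```python
def build_dependencies(tasks):
    """tasks: list of (name, reads, writes). Returns name -> set of earlier
    tasks it must wait for (RAW, WAW, and WAR data-flow edges)."""
    deps = {name: set() for name, _, _ in tasks}
    for i, (name, reads, writes) in enumerate(tasks):
        for prev, prev_reads, prev_writes in tasks[:i]:
            raw_or_waw = (set(reads) | set(writes)) & set(prev_writes)
            war = set(writes) & set(prev_reads)
            if raw_or_waw or war:
                deps[name].add(prev)
    return deps

graph = build_dependencies([
    ("init_a", [], ["a"]),
    ("init_b", [], ["b"]),
    ("sum",    ["a", "b"], ["c"]),  # must wait for both inits
    ("scale",  ["c"], ["c"]),       # must wait for sum
])
print(graph)  # init_a and init_b have no deps and may run concurrently
```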

Journal ArticleDOI
TL;DR: A new object cache design is explored, which is driven by the capabilities of static WCET analysis, and an early architecture exploration by means of static timing analysis techniques helps to identify configurations suitable for hard real‐time systems.
Abstract: Hard real-time systems need a time-predictable computing platform to enable static worst-case execution time (WCET) analysis. All performance-enhancing features need to be WCET analyzable. However, standard data caches containing heap-allocated data are very hard to analyze statically. In this paper we explore a new object cache design, which is driven by the capabilities of static WCET analysis. Simulations of standard benchmarks estimating the expected average case performance usually drive computer architecture design. The design decisions derived from this methodology do not necessarily result in a WCET analysis-friendly design. Aiming for a time-predictable design, we therefore propose to employ WCET analysis techniques for the design space exploration of processor architectures. We evaluated different object cache configurations using static analysis techniques. The number of field accesses that can be statically classified as hits is considerable. The analyzed number of cache miss cycles is 3–46% of the access cycles needed without a cache, which agrees with trends obtained using simulations. Standard data caches perform comparably well in the average case, but accesses to heap data result in overly pessimistic WCET estimations. We therefore believe that an early architecture exploration by means of static timing analysis techniques helps to identify configurations suitable for hard real-time systems. Copyright © 2011 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A combination of a uniform grid mesh with an AMR mesh, and the merger of two different sets of solvers, are proposed to overcome the scalability limitation of the Poisson solver in FLASH.
Abstract: FLASH is a multiphysics multiscale adaptive mesh refinement (AMR) code originally designed for simulation of reactive flows often found in Astrophysics. With its wide user base and flexible applications configuration capability, FLASH has a dual task of maintaining scalability and portability in all its solvers. The scalability of fully explicit solvers in the code is tied very closely to that of the underlying mesh. Others such as the Poisson solver based on a multigrid method have more complex scaling behavior. Multigrid methods suffer from processor starvation and dominating communication costs at coarser grids with increase in the number of processors. In this paper, we propose a combination of uniform grid mesh with AMR mesh, and the merger of two different sets of solvers to overcome the scalability limitation of the Poisson solver in FLASH. The principal challenge in the proposed merger is the efficiency of the communication algorithm to map the mesh back and forth between uniform grid and AMR. We present two different parallel mapping algorithms and also discuss results from performance studies of the two implementations. Copyright © 2012 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: Because more and more large networks aim at leveraging trust, approaches to its assessment have to take into account factors such as efficient distributed implementation and effective security protection against malicious attacks.
Abstract: Currently, several computer-based scenarios leverage the concept of trust as a means to make electronic interactions (e.g., e-commerce transactions) as reliable as possible, allowing them to cope with uncertainty and risks by recommending trusted peers. Generally, the evaluation of trustworthiness can be accomplished according to many principles, from social-based to psychology-based; one of the approaches commonly adopted within peer-to-peer networks, virtual social networks, and recommendation systems is reputation-based trust evaluation. Because more and more large networks (even with millions of nodes) aim at leveraging trust, approaches to its assessment have to take into account factors such as efficient distributed implementation and effective security protection against malicious attacks. In this paper, we present a distributed and secure algorithm based on TrustWebRank, a metric that takes into account both personalized trust evaluation and network dynamics issues. To test our proposal both in terms of complexity and bandwidth usage, we performed simulations on a large, real dataset built from the Epinions.com recommendation system. Results show that the proposed distributed algorithm is effective and efficient, while preserving the original benefits of TrustWebRank. Copyright © 2011 John Wiley & Sons, Ltd.
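The family of metrics TrustWebRank belongs to propagates personalized trust over the weighted trust graph, mixing direct ratings with damped neighbour opinions until a fixed point. The sketch below shows that generic, PageRank-style scheme (not the paper's distributed, secured protocol; normalization details differ in TrustWebRank proper):

```python
def propagate_trust(direct, beta=0.85, iters=50):
    """direct[u][v]: u's normalized direct trust in v. Returns t[u][v]:
    propagated trust, mixing direct ratings with neighbours' opinions."""
    nodes = list(direct)
    t = {u: dict(direct[u]) for u in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            new[u] = {}
            for v in nodes:
                if u == v:
                    continue
                via = sum(direct[u].get(w, 0.0) * t[w].get(v, 0.0)
                          for w in nodes if w not in (u, v))
                new[u][v] = (1 - beta) * direct[u].get(v, 0.0) + beta * via
        t = new
    return t

direct = {"a": {"b": 0.6, "c": 0.4}, "b": {"c": 1.0}, "c": {}}
t = propagate_trust(direct)
print(t["a"]["c"] > 0)  # a's trust in c includes the indirect path via b
```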

Journal ArticleDOI
TL;DR: This work focuses on the use of multiple GPU devices with a single CPU host and the asynchronous CPU/GPU communications issues involved, and obtains more than two orders of magnitude of speedup over a comparable CPU core.
Abstract: Graphics processing units (GPUs) are good data-parallel performance accelerators for solving regular mesh partial differential equations (PDEs), whereby low-latency communications and high compute-to-communications ratios can yield very high levels of computational efficiency. Finite-difference time-domain methods still play an important role for many PDE applications. Iterative multi-grid and multilevel algorithms can converge faster than ordinary finite difference methods but can be much more difficult to parallelise under GPU memory constraints. We report on some practical algorithmic and data layout approaches and on performance data on a range of GPUs with CUDA. We focus on the use of multiple GPU devices with a single CPU host and the asynchronous CPU/GPU communications issues involved. We obtain more than two orders of magnitude of speedup over a comparable CPU core.

Journal ArticleDOI
TL;DR: A fast parallel simulator that solves the acoustic wave equation on a graphics processing unit (GPU) cluster is presented; it handles all the steps of seismic modeling and RTM and is used to solve real-world problems in an industrial production context.
Abstract: We designed a fast parallel simulator that solves the acoustic wave equation on a graphics processing unit (GPU) cluster. Solving the acoustic wave equation in an oil exploration industrial context aims at speeding up seismic modeling and reverse time migration (RTM). We considered a finite difference approach on a regular mesh, in both two-dimensional and three-dimensional cases. The acoustic wave equation is solved in either a constant density or a variable density domain. All the computations were carried out in single precision (both in the CPU reference implementation and in the GPU implementation), because double precision was not required in our context. We used Compute Unified Device Architecture (CUDA) to take advantage of the GPU computational power. We studied different implementations and their impact on the application performance. The described application handles all the steps of seismic modeling and RTM and is used to solve real-world problems in an industrial production context. We obtained a speedup of 16 for RTM and up to 43 for the modeling application over a sequential code running on general-purpose CPUs. A CPU rack versus GPU rack comparison was also made and showed a 4.3× speedup. Copyright © 2011 John Wiley & Sons, Ltd.
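For the constant-density case, the explicit scheme being parallelized is typically the classic second-order-in-time finite-difference update for the pressure field $p$ (shown here as a representative instance; the paper's exact stencil order is not reproduced):

```latex
p^{\,n+1}_{i,j,k} \;=\; 2\,p^{\,n}_{i,j,k} \;-\; p^{\,n-1}_{i,j,k}
  \;+\; c_{i,j,k}^{2}\,\Delta t^{2}\,\nabla_h^{2}\, p^{\,n}_{i,j,k},
```

where $\nabla_h^2$ is a discrete Laplacian (high-order in space in practice) and the source term is omitted. The update touches only a local neighbourhood of each grid point, which is what makes it map so well onto GPU thread blocks.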

Journal ArticleDOI
TL;DR: This work gives a comprehensive description of the Partitioned Global Address Space programming model, and in particular the Unified Parallel C language, demonstrates the efficiency of the state-of-the-art MPI implementation, and shows that one can develop an easy-to-follow yet efficient Unified Parallel C implementation, which is also easy to debug and maintain.
Abstract: The analysis of a huge backlog of ever-accumulating data presents a major challenge in all respects of computing. Inverse covariance matrices are very important in this respect. We target data uncertainty quantification, a very useful measure of which is provided by the diagonal entries of the inverse covariance matrix. In previous work, we introduced a novel method that reduces overall complexity by at least two orders of magnitude. At the same time, a state-of-the-art message-passing interface (MPI) implementation allowed us to reach a sustained performance of up to 73% (730 TFLOPS on the full 72-rack Blue Gene/P configuration at Jülich). Thanks to its reduced complexity, this work has attracted significant interest, and thus we have received numerous requests concerning its exploitation in various fields. A common denominator in these requests is that they almost all came from people with no or, in the best case, limited high-performance computing background. Nevertheless, all interest is in analyzing huge data sets, suitably adapting the method to particular applications. A bottleneck then is that potential users are reluctant to pay for the steep learning curve needed to get proficient in parallel computing using the de facto standard: MPI. Thus, we turned to the Partitioned Global Address Space programming model and in particular the Unified Parallel C language. In this work, we give a comprehensive description of the framework and demonstrate the efficiency of the state-of-the-art MPI implementation. In addition, we show that one can develop an easy-to-follow yet efficient Unified Parallel C implementation, which is also easy to debug and maintain, features that significantly boost overall productivity. Copyright © 2011 John Wiley & Sons, Ltd.