scispace - formally typeset
Search or ask a question

Showing papers on "Multi-core processor published in 2014"


Proceedings ArticleDOI
02 Apr 2014
TL;DR: MICA optimizes for multi-core architectures by enabling parallel access to partitioned data, and for efficient parallel data access, MICA maps client requests directly to specific CPU cores at the server NIC level by using client-supplied information and adopts a light-weight networking stack that bypasses the kernel.
Abstract: MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4-13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.MICA takes a holistic approach that encompasses all aspects of request handling, including parallel data access, network request handling, and data structure design, but makes unconventional choices in each of the three domains. First, MICA optimizes for multi-core architectures by enabling parallel access to partitioned data. Second, for efficient parallel data access, MICA maps client requests directly to specific CPU cores at the server NIC level by using client-supplied information and adopts a light-weight networking stack that bypasses the kernel. Finally, MICA's new data structures--circular logs, lossy concurrent hash indexes, and bulk chaining--handle both read-and write-intensive workloads at low overhead.

446 citations


Journal ArticleDOI
TL;DR: Haswell provides enhancements in power-performance efficiency, power management, form factor and cost, core and uncore microarchitecture, and the core's instruction set.
Abstract: Haswell, Intel's fourth-generation core processor architecture, delivers a range of client parts, a converged core for the client and server, and technologies used across many products. It uses an optimized version of Intel 22-nm process technology. Haswell provides enhancements in power-performance efficiency, power management, form factor and cost, core and uncore microarchitecture, and the core's instruction set.

255 citations


Proceedings ArticleDOI
TL;DR: A key feature is the measurement of latency with sub-microsecond precision and accuracy by using hardware timestamping capabilities of modern commodity NICs.
Abstract: We present MoonGen, a flexible high-speed packet generator. It can saturate 10 GbE links with minimum sized packets using only a single CPU core by running on top of the packet processing framework DPDK. Linear multi-core scaling allows for even higher rates: We have tested MoonGen with up to 178.5 Mpps at 120 Gbit/s. We move the whole packet generation logic into user-controlled Lua scripts to achieve the highest possible flexibility. In addition, we utilize hardware features of Intel NICs that have not been used for packet generators previously. A key feature is the measurement of latency with sub-microsecond precision and accuracy by using hardware timestamping capabilities of modern commodity NICs. We address timing issues with software-based packet generators and apply methods to mitigate them with both hardware support on commodity NICs and with a novel method to control the inter-packet gap in software. Features that were previously only possible with hardware-based solutions are now provided by MoonGen on commodity hardware. MoonGen is available as free software under the MIT license at this https URL

236 citations


Journal ArticleDOI
TL;DR: The strategy is to formulate optimal power allocation and load distribution for multiple servers in a cloud of clouds as optimization problems, i.e., power constrained performance optimization and performance constrained power optimization.
Abstract: For multiple heterogeneous multicore server processors across clouds and data centers, the aggregated performance of the cloud of clouds can be optimized by load distribution and balancing. Energy efficiency is one of the most important issues for large-scale server systems in current and future data centers. The multicore processor technology provides new levels of performance and energy efficiency. The present paper aims to develop power and performance constrained load distribution methods for cloud computing in current and future large-scale data centers. In particular, we address the problem of optimal power allocation and load distribution for multiple heterogeneous multicore server processors across clouds and data centers. Our strategy is to formulate optimal power allocation and load distribution for multiple servers in a cloud of clouds as optimization problems, i.e., power constrained performance optimization and performance constrained power optimization. Our research problems in large-scale data centers are well-defined multivariable optimization problems, which explore the power-performance tradeoff by fixing one factor and minimizing the other, from the perspective of optimal load distribution. It is clear that such power and performance optimization is important for a cloud computing provider to efficiently utilize all the available resources. We model a multicore server processor as a queuing system with multiple servers. Our optimization problems are solved for two different models of core speed, where one model assumes that a core runs at zero speed when it is idle, and the other model assumes that a core runs at a constant speed. Our results in this paper provide new theoretical insights into power management and performance optimization in data centers.

161 citations


Journal ArticleDOI
TL;DR: The paper provides a comprehensive survey of existing or proposed approaches to estimate the power consumption of single-core as well as multicore processors, virtual machines, and an entire server.
Abstract: The power consumption of presently available Internet servers and data centers is not proportional to the work they accomplish. The scientific community is attempting to address this problem in a number of ways, for example, by employing dynamic voltage and frequency scaling, selectively switching off idle or underutilized servers, and employing energy-aware task scheduling. Central to these approaches is the accurate estimation of the power consumption of the various subsystems of a server, particularly, the processor. We distinguish between power consumption measurement techniques and power consumption estimation models. The techniques refer to the art of instrumenting a system to measure its actual power consumption whereas the estimation models deal with indirect evidences (such as information pertaining to CPU utilization or events captured by hardware performance counters) to reason about the power consumption of a system under consideration. The paper provides a comprehensive survey of existing or proposed approaches to estimate the power consumption of single-core as well as multicore processors, virtual machines, and an entire server.

145 citations


Journal ArticleDOI
08 Jul 2014
TL;DR: Motivated by specific threats, this paper describes FPGA security primitives from multiple FPGAs vendors and gives examples of those primitives in use in applications.
Abstract: Since their inception, field-programmable gate arrays (FPGAs) have grown in capacity and complexity so that now FPGAs include millions of gates of logic, megabytes of memory, high-speed transceivers, analog interfaces, and whole multicore processors. Applications running in the FPGA include communications infrastructure, digital cinema, sensitive database access, critical industrial control, and high-performance signal processing. As the value of the applications and the data they handle have grown, so has the need to protect those applications and data. Motivated by specific threats, this paper describes FPGA security primitives from multiple FPGA vendors and gives examples of those primitives in use in applications.

144 citations


Proceedings ArticleDOI
28 Jul 2014
TL;DR: This paper introduces additional phases to state-of-the-art timing analysis techniques to analyse an application's resource usage and compute an interference delay, and implements full transparency to the temporal and functional behaviour of applications, enabling the seamless integration of legacy applications.
Abstract: The performance and power efficiency of multi-core processors are attractive features for safety-critical applications, as in avionics But increased integration and average-case performance optimisations pose challenges when deploying them for such domains In this paper we propose a novel approach to compute an is WCET considering variable access delays due to the concurrent use of shared resources in multi-core processors, particularly focusing on shared interconnects and main memory Thereby we tackle the problem of temporal partitioning as required by safety-critical applications In particular, we introduce additional phases to state-of-the-art timing analysis techniques to analyse an application's resource usage and compute an interference delay We further complement the offline analysis with a runtime monitoring concept to enforce resource usage guarantees The concepts are evaluated on Free scale's P4080 multi-core processor in combination with SYSGO's commercial real-time operating system Pike OS and Abs Int's timing analysis framework aiT We abstract real applications' behaviour using a representative task set of the EEMBC Auto bench benchmark suite Our results show a reduction of up to 53% of the multi-core WCET, while implementing full transparency to the temporal and functional behaviour of applications, enabling the seamless integration of legacy applications

142 citations


Proceedings ArticleDOI
13 Dec 2014
TL;DR: This paper demonstrates through a real- system investigation that the fundamental difference between resource sharing behaviors on CMP and SMT architectures calls for a redesign of the way the authors model interference, and proposes SMiTe, a methodology that enables precise performance prediction for SMT co-location on real-system commodity processors.
Abstract: One of the key challenges for improving efficiency in warehouse scale computers (WSCs) is to improve server utilization while guaranteeing the quality of service (QoS) of latency-sensitive applications. To this end, prior work has proposed techniques to precisely predict performance and QoS interference to identify 'safe' application co-locations. However, such techniques are only applicable to resources shared across cores. Achieving such precise interference prediction on real-system simultaneous multithreading (SMT) architectures has been a significantly challenging open problem due to the complexity introduced by sharing resources within a core. In this paper, we demonstrate through a real-system investigation that the fundamental difference between resource sharing behaviors on CMP and SMT architectures calls for a redesign of the way we model interference. For SMT servers, the interference on different shared resources, including private caches, memory ports, as well as integer and floating-point functional units, do not correlate with each other. This insight suggests the necessity of decoupling interference into multiple resource sharing dimensions. In this work, we propose SMiTe, a methodology that enables precise performance prediction for SMT co-location on real-system commodity processors. With a set of Rulers, which are carefully designed software stressors that apply pressure to a multidimensional space of shared resources, we quantify application sensitivity and contentiousness in a decoupled manner. We then establish a regression model to combine the sensitivity and contentiousness in different dimensions to predict performance interference. Using this methodology, we are able to precisely predict the performance interference in SMT co-location with an average error of 2.80% on SPEC CPU2006 and 1.79% on Cloud Suite. Our evaluation shows that SMiTe allows us to improve the utilization of WSCs by up to 42.57% while enforcing an application's QoS requirements.

141 citations


Proceedings ArticleDOI
24 Aug 2014
TL;DR: A memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring and monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates exceeding applications-specific ranges.
Abstract: Shared caches in multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly-variable performance for individual applications. This is particularly problematic for real-time applications, requiring predictable timing guarantees. Previous work has applied page coloring techniques to partition a shared cache, so that conflict misses are minimized amongst co-running workloads. However, prior page coloring techniques have not addressed the problem of partitioning a cache on over-committed processors where there are more executable threads than cores. Similarly, page coloring techniques have not proven efficient at adapting the cache partition sizes for threads with varying memory demands. This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring. COLORIS supports novel policies to reconfigure the assignment of page colors amongst application threads in over-committed systems. For quality-of-service (QoS), COLORIS monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates exceeding applications-specific ranges. This paper presents the design and evaluation of COLORIS as applied to Linux. We show the efficiency and effectiveness of COLORIS to color memory pages for a set of SPEC CPU2006 workloads, thereby enhancing performance isolation over existing page coloring techniques.

133 citations


Journal ArticleDOI
TL;DR: It is shown how simply increasing the number of cores in a processor can significantly diminish its energy efficiency, and that there is an optimal number of CPU cores that maximize the performance-per-watt.
Abstract: Energy efficiency has taken center stage in all aspects of computing, regardless of whether it is performed on a portable battery-powered device, a desktop PC, on servers in a data center, or on a supercomputer. It is expressed as performance-per-watt (PPW), which is equal to the number of instructions that are executed per Joule of energy. The shift to multicore processors, with tens or hundreds of cores on a single die requires that the operation of the cores be dynamically controlled to maximize the processor's overall energy efficiency. This paper presents a unified formulation and an efficient solution for this problem. The solution considers dynamic frequency and voltage scaling, thread migration, and active cooling as the means to control the cores. The solution method is efficient for a real-time implementation. The formulation includes accurate power and thermal models, temperature constraints, and accounts for the dependence of leakage power and circuit delay on temperature. The PPW metric is extended to Pα PW (performanceα-per-watt), which allows examining the tradeoffs between optimizing for performance versus optimizing for energy by varying . Simulation experiments assuming a four-core processor demonstrate that the derived control strategy can achieve 3.2× greater energy efficiency (i.e., executes more than three times the number of instructions per Joule) over the performance-optimal solution. The formulation and the efficiency of the solution method also allows for fast design space exploration. Specifically, it is shown how simply increasing the number of cores in a processor can significantly diminish its energy efficiency, and that there is an optimal number of cores that maximize the PPW. This number depends on the ratio of how much the power of an individual core is reduced by scaling, i.e., as the number of cores are increased. Finally, the proposed method is implemented on a quad-core Intel Sandy Bridge processor, and verified by running benchmarks. The experiments suggest that the proposed method results in an improvement of 37 percent over the current state-of-the-art energy-efficient schemes.

132 citations


Journal ArticleDOI
14 Jun 2014
TL;DR: SCORPIO is presented, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering, designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns.
Abstract: In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designsWe present SCORPIO, an ordered mesh Network-on-Chip (NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrivein any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36 and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application runtime reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectivelyThe SCORPIO architecture is incorporated in an 11 mm-by-13mm chip prototype, fabricated in IBM 45nm SOI technology, comprising 36 Freescale e200 Power ArchitectureTMcores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post synthesis operating frequency of 1 GHz (833MHz post-layout) with an estimated power of 28.8W (768mW per tile), while the network consumes only 10% of tile area and 19 % of tile power.

Proceedings ArticleDOI
15 Apr 2014
TL;DR: FlexPRET is presented, a processor designed specifically for mixed-criticality systems by allowing each task to make a trade-off between hardware-based isolation and efficient processor utilization.
Abstract: Mixed-criticality systems, in which multiple tasks of varying criticality execute on a single hardware platform, are an emerging research area in real-time embedded systems. High-criticality tasks require spatial and temporal isolation guarantees for independent verification, and the task set should efficiently utilize hardware resources. Hardware-based isolation is desirable but often underutilizes hardware resources, which can consist of multiple single-core, multicore, or multithreaded processors. We present FlexPRET, a processor designed specifically for mixed-criticality systems by allowing each task to make a trade-off between hardware-based isolation and efficient processor utilization. FlexPRET uses fine-grained multithreading with flexible scheduling and timing instructions to provide this functionality.

Journal ArticleDOI
TL;DR: This paper designs a custom computing machine (CCM) called a scalable streaming-array (SSA), for high-performance stencil computations with multiple field-programmable gate arrays (FPGAs) based on a domain-specific programmable concept.
Abstract: Stencil computation is one of the important kernels in scientific computations. However, sustained performance is limited owing to restriction on memory bandwidth, especially on multicore microprocessors and graphics processing units (GPUs) because of their small operational intensity. In this paper, we present a custom computing machine (CCM), called a scalable streaming-array (SSA), for high-performance stencil computations with multiple field-programmable gate arrays (FPGAs). We design SSA based on a domain-specific programmable concept, where CCMs are programmable with the minimum functionality required for an algorithm domain. We employ a deep pipelining approach over successive iterations to achieve linear scalability for multiple devices with a constant memory bandwidth. Prototype implementation using nine FPGAs demonstrates good agreement with a performance model, and achieves 260 and 236 GFlop/s for 2D and 3D Jacobi computation, which are 87.4 and 83.9 percent of the peak, respectively, with a memory bandwidth of only 2.0 GB/s. We also evaluate the performance of SSA for state-of-the-art FPGAs.

Proceedings ArticleDOI
17 Aug 2014
TL;DR: This work presents Sandstorm and Namestorm, web and DNS servers that utilize a clean-slate userspace network stack that exploits knowledge of application-specific workloads that merges application and network-stack memory models, aggressively amortizes protocol-layer costs based on application-layer knowledge, and exploits microarchitectural features.
Abstract: Contemporary network stacks are masterpieces of generality, supporting many edge-node and middle-node functions. Generality comes at a high performance cost: current APIs, memory models, and implementations drastically limit the effectiveness of increasingly powerful hardware. Generality has historically been required so that individual systems could perform many functions. However, as providers have scaled services to support millions of users, they have transitioned toward thousands (or millions) of dedicated servers, each performing a few functions. We argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary. We present Sandstorm and Namestorm, web and DNS servers that utilize a clean-slate userspace network stack that exploits knowledge of application-specific workloads. Based on the netmap framework, our novel approach merges application and network-stack memory models, aggressively amortizes protocol-layer costs based on application-layer knowledge, couples tightly with the NIC event model, and exploits microarchitectural features. Simultaneously, the servers retain use of conventional programming frameworks. We compare our approach with the FreeBSD and Linux stacks using the nginx web server and NSD name server, demonstrating 2--10x and 9x improvements in web-server and DNS throughput, lower CPU usage, linear multicore scaling, and saturated NIC hardware.

Journal ArticleDOI
TL;DR: This paper details the algorithm and techniques proposed to decrease the communication costs of parallel applications by matching the communication pattern to the underlying hardware architecture.
Abstract: Current generations of NUMA node clusters feature multicore or manycore processors. Programming such architectures efficiently is a challenge because numerous hardware characteristics have to be taken into account, especially the memory hierarchy. One appealing idea to improve the performance of parallel applications is to decrease their communication costs by matching the communication pattern to the underlying hardware architecture. In this paper, we detail the algorithm and techniques proposed to achieve such a result: first, we gather both the communication pattern information and the hardware details. Then we compute a relevant reordering of the various process ranks of the application. Finally, those new ranks are used to reduce the communication costs of the application.

Proceedings ArticleDOI
24 Aug 2014
TL;DR: The asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of dataparallel kernels between the CPU and GPU without input from application developers, underscoring the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
Abstract: Many processors today integrate a CPU and GPU on the same die, which allows them to share resources like physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naive and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It does profiling on the CPU and GPU in a way that it doesn't penalize GPU-centric workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance via irregularity caused by, e.g., data-dependent control flow, 2) different amounts of work on each kernel call, and 3) multiple kernels with different characteristics. Unlike many existing approaches primarily targeting NVIDIA discrete GPUs, our scheduling algorithm does not require offline processing. We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th generation Core processor using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput with a perfect CPU-and-GPU oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.

Journal ArticleDOI
TL;DR: A new faster molecular dynamics engine is introduced into the CHARMM software package that is faster both in serial and parallel execution and allows the MD engine to parallelize up to hundreds of CPU cores.
Abstract: We introduce a new faster molecular dynamics (MD) engine into the CHARMM software package. The new MD engine is faster both in serial (i.e., single CPU core) and parallel execution. Serial performance is approximately two times higher than in the previous version of CHARMM. The newly programmed parallelization method allows the MD engine to parallelize up to hundreds of CPU cores.

Journal ArticleDOI
TL;DR: A new robust algorithm to automatically generate hierarchical Cartesian meshes on distributed multicore HPC systems with multiple levels of refinement is presented and the efficiency of the approach is demonstrated by considering human nasal cavity and internal combustion engine flow problems.

Journal ArticleDOI
14 Jun 2014
TL;DR: This architecture has the potential to outperform the best single-ISA heterogeneous architecture by as much as 21%, with 23% energy savings and a reduction of 32% in Energy Delay Product.
Abstract: Heterogeneous multicore architectures have the potential for high performance and energy efficiency. These architectures may be composed of small power-efficient cores, large high-performance cores, and/or specialized cores that accelerate the performance of a particular class of computation. Architects have explored multiple dimensions of heterogeneity, both in terms of micro-architecture and specialization. While early work constrained the cores to share a single ISA, this work shows that allowing heterogeneous ISAs further extends the effectiveness of such architecturesThis work exploits the diversity offered by three modern ISAs: Thumb, x86-64, and Alpha. This architecture has the potential to outperform the best single-ISA heterogeneous architecture by as much as 21%, with 23% energy savings and a reduction of 32% in Energy Delay Product.

Book ChapterDOI
16 Nov 2014
TL;DR: This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism.
Abstract: We present preliminary results of the Roofline Toolkit for multicore, manycore, and accelerated architectures. This paper focuses on the processor architecture characterization engine, a collection of portable instrumented micro benchmarks implemented with Message Passing Interface (MPI), and OpenMP used to express thread-level parallelism. These benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these microbenchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism, instruction-level parallelism and explicit SIMD parallelism, measured in the context of the compilers and run-time environments. We also measure sustained PCIe throughput with four GPU memory managed mechanisms. By combining results from the architecture characterization with the Roofline model based solely on architectural specifications, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline model when run on a Blue Gene/Q architecture.

Proceedings ArticleDOI
16 Oct 2014
TL;DR: The PREESM framework objective is to simplify the programming of multicore DSP systems by building on dataflow programming methods, and the current functionalities of this scalable framework cover memory and time analysis, as well as automatic deadlock-free code generation.
Abstract: The high performance Digital Signal Processors (DSPs) currently manufactured by Texas Instruments are heterogeneous multiprocessor architectures. Programming these architectures is a complex task often reserved to specialized engineers because the bottlenecks of both the algorithm and the architecture need to be deeply understood in order to obtain a fairly parallel execution. The PREESM framework objective is to simplify the programming of multicore DSP systems by building on dataflow programming methods. The current functionalities of this scalable framework cover memory and time analysis, as well as automatic deadlock-free code generation. Several tutorials are provided with the tool for fast initiation of C programmers to multicore DSP programming. This paper demonstrates PREESM capabilities by comparing simulation and execution performances on a stereo matching algorithm prototyped on the TMS320C6678 8-core DSP device.

Proceedings ArticleDOI
01 Jun 2014
TL;DR: This work proposes Panthre, a solution that deploys power-gating to provide long intervals of uninterrupted sleep to selected units and employs a feedback-based distributed mechanism to control the amount of sleeping components and of packets detours, so that performance degradation is kept at a minimum.
Abstract: With the advent of multicore processors and system-on-chip designs, intra-chip communication demands have exacerbated, leading to a growing adoption of scalable networks-on-chip (NoCs) as the interconnect fabric. Today, conventional NoC designs may consume up to 30% of the entire chip's power budget, in large part due to leakage power. In this work, we address this issue by proposing Panthre: our solution deploys power-gating to provide long intervals of uninterrupted sleep to selected units. Packets that would normally use power-gated components are steered away via topology and routing reconfiguration, while Panthre provides low-latency alternate paths to their destinations. The routing reconfiguration operates in a distributed fashion and guarantees that deadlock-free routes are available at all times. At runtime, Panthre adapts to the application's communication patterns by updating its power-gating decisions. It employs a feedback-based distributed mechanism to control the amount of sleeping components and of packets detours, so that performance degradation is kept at a minimum. Our design is flexible, providing a mechanism that designers can use to tradeoff power savings with performance, based on application's requirements.Our experiments on multi-programmed communication-light workloads from the SPEC CPU2006 suite show that Panthre reduces total network power consumption by 14.5% on average, with only a 1.8% degradation in performance, when all processor nodes are active. At times when 15-25% of the processor cores are communication-idle, Panthre enables leakage power savings of 36.9% on average, while still providing connected and deadlock-free routes for all other nodes.

Journal ArticleDOI
TL;DR: An efficient and scalable symmetric iterative eigensolver developed for distributed memory multi‐core platforms is described, with over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix‐vector multiplication and basis orthogonalization tasks.
Abstract: We describe an efficient and scalable symmetric iterative eigensolver developed for distributed memory multi-core platforms. We achieve over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix-vector multiplication and basis orthogonalization tasks. We show that the scalability of the solver is significantly improved compared to an earlier version, after we carefully reorganize the computational tasks and map them to processing units in a way that exploits the network topology. We discuss the advantage of using a hybrid OpenMP/MPI programming model to implement such a solver. We also present strategies for hiding communication on a multi-core platform. We demonstrate the effectiveness of these techniques by reporting the performance improvements achieved when we apply our solver to large-scale eigenvalue problems arising in nuclear structure calculations. Because sparse matrix-vector multiplication and inner product computation constitute the main kernels in most iterative methods, our ideas are applicable in general to the solution of problems involving large-scale symmetric sparse matrices with irregular sparsity patterns. Copyright © 2013 John Wiley & Sons, Ltd.

Journal ArticleDOI
01 Jan 2014
TL;DR: This work analyses the effects of sequential-to-parallel synchronization and inter-core communication on multicore performance, speedup and scaling from Amdahl's law perspective and results show lower than originally predicted speedup in applications with high degree of data sharing.
Abstract: This work analyses the effects of sequential-to-parallel synchronization and inter-core communication on multicore performance, speedup and scaling from Amdahl's law perspective. Analytical modeling supported by simulation leads to a modification of Amdahl's law, reflecting lower than originally predicted speedup, due to these effects. In applications with high degree of data sharing, leading to intense inter-core connectivity requirements, the workload should be executed on a smaller number of larger cores. Applications requiring intense sequential-to-parallel synchronization, even highly parallelizable ones, may better be executed by the sequential core. To improve the scalability and performance speedup of a multicore, it is as important to address the synchronization and connectivity intensities of parallel algorithms as their parallelization factor.

Journal ArticleDOI
TL;DR: This work differs by modeling the interaction of shared cache and shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor) by assuming a timing anomaly free multi-core architecture for computing the WCET.
Abstract: With the advent of multicore architectures, worst-case execution time (WCET) analysis has become an increasingly difficult problem. In this article, we propose a unified WCET analysis framework for multicore processors featuring both shared cache and shared bus. Compared to other previous works, our work differs by modeling the interaction of shared cache and shared bus with other basic microarchitectural components (e.g., pipeline and branch predictor). In addition, our framework does not assume a timing anomaly free multicore architecture for computing the WCET. A detailed experiment methodology suggests that we can obtain reasonably tight WCET estimates in a wide range of benchmark programs.

Proceedings ArticleDOI
09 Jun 2014
TL;DR: Varuna as mentioned in this paper is a system that dynamically, continuously, rapidly and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics.
Abstract: Future multicore processors will be heterogeneous, be increasingly less reliable, and operate in dynamically changing operating conditions. Such environments will result in a constantly varying pool of hardware resources which can greatly complicate the task of efficiently exposing a program's parallelism onto these resources. Coupled with this uncertainty is the diverse set of efficiency metrics that users may desire. This paper proposes Varuna, a system that dynamically, continuously, rapidly and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics. Varuna is applicable to both multithreaded and task-based programs and can be seamlessly inserted between the program and the operating system without needing to change the source code of either. We demonstrate Varuna's effectiveness in diverse execution environments using unaltered C/C++ parallel programs from various benchmark suites. Regardless of the execution environment, Varuna always outperformed the state-of-the-art approaches for the efficiency metrics considered.

Journal ArticleDOI
TL;DR: This work combines the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches, and provides a controllable trade-off between concurrency and memory usage.
Abstract: The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.

Journal ArticleDOI
TL;DR: A compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs that leverages existing transformations to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCLcode on the GPU or OpenMP code on the multicore host.
Abstract: General-purpose GPU-based systems are highly attractive, as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This article presents a compiler-based approach to automatically generate optimized OpenCL code from data parallel OpenMP programs for GPUs. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses automatic machine learning to build a predictive model to determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multicore host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on distinct GPU-based systems. We achieved average (up to) speedups of 4.51× and 4.20× (143× and 67×) on Core i7/NVIDIA GeForce GTX580 and Core i7/AMD Radeon 7970 platforms, respectively, over a sequential baseline. Our approach achieves, on average, greater than 10× speedups over two state-of-the-art automatic GPU code generators.

Proceedings ArticleDOI
23 Mar 2014
TL;DR: Manifold is an open-source parallel simulation framework for multicore architectures that consists of a parallel simulation kernel, a set of microarchitecture components, and an integrated library of power, thermal, reliability, and energy models.
Abstract: This paper presents Manifold, an open-source parallel simulation framework for multicore architectures. It consists of a parallel simulation kernel, a set of microarchitecture components, and an integrated library of power, thermal, reliability, and energy models. Using the components as building blocks, users can assemble multicore architecture simulation models and perform serial or parallel simulations to study the architectural and/or the physical characteristics of the models. Users can also create new components for Manifold or port existing models. Importantly, Manifold's component-based design provides the user with the ability to easily replace a component with another for efficient explorations of the design space. It also allows components to evolve independently and making it easy for simulators to incorporate new components as they become available. The distinguishing features of Manifold include i) transparent parallel execution, ii) integration of power, thermal, reliability, and energy models, iii) full system simulation, e.g., operating system and system binaries, and iv) component-based design. In this paper we provide a description of the software architecture of Manifold, and its main elements - a parallel multicore emulator front-end and a parallel component-based back-end timing model. We describe a few simulators that are built with Manifold components to illustrate its flexibility, and present test results of the scalability obtained on full-system simulation of coherent shared-memory multicore models with 16, 32, and 64 cores executing PARSEC and SPLASH-2 benchmarks.

Journal ArticleDOI
TL;DR: In this article, the authors present a survey of parallel computing platforms, including graphics processing units to multicore CPUs with a fast interconnect, along with effective parallel solvers and associated solver libraries effective for inductive EM modeling and imaging.
Abstract: Many geoscientific applications exploit electrostatic and electromagnetic fields to interrogate and map subsurface electrical resistivity—an important geophysical attribute for characterizing mineral, energy, and water resources. In complex three-dimensional geologies, where many of these resources remain to be found, resistivity mapping requires large-scale modeling and imaging capabilities, as well as the ability to treat significant data volumes, which can easily overwhelm single-core and modest multicore computing hardware. To treat such problems requires large-scale parallel computational resources, necessary for reducing the time to solution to a time frame acceptable to the exploration process. The recognition that significant parallel computing processes must be brought to bear on these problems gives rise to choices that must be made in parallel computing hardware and software. In this review, some of these choices are presented, along with the resulting trade-offs. We also discuss future trends in high-performance computing and the anticipated impact on electromagnetic (EM) geophysics. Topics discussed in this review article include a survey of parallel computing platforms, graphics processing units to multicore CPUs with a fast interconnect, along with effective parallel solvers and associated solver libraries effective for inductive EM modeling and imaging.