scispace - formally typeset
Search or ask a question

Showing papers by "Tatiana Shpeisman published in 2014"


Proceedings ArticleDOI
24 Aug 2014
TL;DR: The asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of dataparallel kernels between the CPU and GPU without input from application developers, underscoring the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.
Abstract: Many processors today integrate a CPU and GPU on the same die, which allows them to share resources like physical memory and lowers the cost of CPU-GPU communication. As a consequence, programmers can effectively utilize both the CPU and GPU to execute a single application. This paper presents novel adaptive scheduling techniques for integrated CPU-GPU processors. We present two online profiling-based scheduling algorithms: naive and asymmetric. Our asymmetric scheduling algorithm uses low-overhead online profiling to automatically partition the work of data-parallel kernels between the CPU and GPU without input from application developers. It does profiling on the CPU and GPU in a way that it doesn't penalize GPU-centric workloads that run significantly faster on the GPU. It adapts to application characteristics by addressing: 1) load imbalance via irregularity caused by, e.g., data-dependent control flow, 2) different amounts of work on each kernel call, and 3) multiple kernels with different characteristics. Unlike many existing approaches primarily targeting NVIDIA discrete GPUs, our scheduling algorithm does not require offline processing. We evaluate our asymmetric scheduling algorithm on a desktop system with an Intel 4th generation Core processor using a set of sixteen regular and irregular workloads from diverse application areas. On average, our asymmetric scheduling algorithm performs within 3.2% of the maximum throughput with a perfect CPU-and-GPU oracle that always chooses the ideal work partitioning between the CPU and GPU. These results underscore the feasibility of online profile-based heterogeneous scheduling on integrated CPU-GPU processors.

106 citations


Proceedings ArticleDOI
24 Aug 2014
TL;DR: In this paper, Invyswell is presented, a novel HyTM that exploits the benefits and manages the limitations of Haswell's RTM and outperforms NOrec, a state-of-the-art STM, by 35%, Hybrid Norec, NOrec's hybrid implementation, by 18%, and haswell's hardware-only lock elision by 25% across all STAMP benchmarks.
Abstract: The Intel Haswell processor includes restricted transactional memory (RTM), which is the first commodity-based hardware transactional memory (HTM) to become publicly available. However, like other real HTMs, such as IBM's Blue Gene/Q, Haswell's RTM is best-effort, meaning it provides no transactional forward progress guarantees. Because of this, a software fallback system must be used in conjunction with Haswell's RTM to ensure transactional programs execute to completion. To complicate matters, Haswell does not provide escape actions. Without escape actions, non-transactional instructions cannot be executed within the context of a hardware transaction, thereby restricting the ways in which a software fallback can interact with the HTM. As such, the challenge of creating a scalable hybrid TM (HyTM) that uses Haswell's RTM and a software TM (STM) fallback is exacerbated. In this paper, we present Invyswell, a novel HyTM that exploits the benefits and manages the limitations of Haswell's RTM. After describing Invyswell's design, we show that it outperforms NOrec, a state-of-the-art STM, by 35%, Hybrid NOrec, NOrec's hybrid implementation, by 18%, and Haswell's hardware-only lock elision by 25% across all STAMP benchmarks.

60 citations


Proceedings ArticleDOI
15 Feb 2014
TL;DR: This work presents a compiler framework with support for C++ features that enables GPU acceleration of a wide range of C++ applications with minimal changes and includes compiler optimizations to improve irregular application performance on GPUs.
Abstract: There is growing interest in using GPUs to accelerate general-purpose computation since they offer the potential of massive parallelism with reduced energy consumption. This interest has been encouraged by the ubiquity of integrated processors that combine a GPU and CPU on the same die, lowering the cost of offloading work to the GPU. However, while the majority of effort has focused on GPU acceleration of regular applications, relatively little is known about the behavior of irregular applications on GPUs. These applications are expected to perform poorly on GPUs without major software engineering effort. We present a compiler framework with support for C++ features that enables GPU acceleration of a wide range of C++ applications with minimal changes. This framework, Concord, includes a low-cost, software SVM implementation that permits seamless sharing of pointer-containing data structures between the CPU and GPU. It also includes compiler optimizations to improve irregular application performance on GPUs. Using Concord, we ran nine irregular C++ programs on two computer systems containing Intel 4th Generation Core processors. One system is an Ultrabook with an integrated HD Graphics 5000 GPU, and the other system is a desktop with an integrated HD Graphics 4600 GPU. The nine applications are pointer-intensive and operate on irregular data structures such as trees and graphs; they include face detection, BTree, single-source shortest path, soft-body physics simulation, and breadth-first search. Our results show that Concord acceleration using the GPU improves energy efficiency by up to 6.04× on the Ultrabook and 3.52× on the desktop over multicore-CPU execution.

26 citations


Patent
26 Mar 2014
TL;DR: In this paper, an execution logic for concurrent execution of at least one first software transaction and at least two second software transactions of a second software transaction mode is presented, where the execution logic is implemented within the processor.
Abstract: In an embodiment of a transactional memory system, an apparatus includes a processor and an execution logic to enable concurrent execution of at least one first software transaction of a first software transaction mode and a second software transaction of a second software transaction mode and at least one hardware transaction of a first hardware transaction mode and at least one second hardware transaction of a second hardware transaction mode. In one example, the execution logic may be implemented within the processor. Other embodiments are described and claimed.

22 citations


Patent
26 Dec 2014
TL;DR: In this paper, the authors present a system for adaptive task assignment among heterogeneous processor cores, including any number of CPUs, a graphics processing unit (GPU) and memory configured to store a pool of work items to be shared by the CPUs and the GPU.
Abstract: Generally, this disclosure provides systems, devices, methods and computer readable media for adaptive scheduling of task assignment among heterogeneous processor cores. The system may include any number of CPUs, a graphics processing unit (GPU) and memory configured to store a pool of work items to be shared by the CPUs and GPU. The system may also include a GPU proxy profiling module associated with one of the CPUs to profile execution of a first portion of the work items on the GPU. The system may further include profiling modules, each associated with one of the CPUs, to profile execution of a second portion of the work items on each of the CPUs. The measured profiling information from the CPU profiling modules and the GPU proxy profiling module is used to calculate a distribution ratio for execution of a remaining portion of the work items between the CPUs and the GPU.

17 citations


Patent
18 Aug 2014
TL;DR: In this article, a method for improving the efficiency with which speculative critical sections are executed within a transactional memory architecture is presented. But it does not specify how to improve the efficiency of the execution of such sections.
Abstract: An apparatus and method for improving the efficiency with which speculative critical sections are executed within a transactional memory architecture. For example, a method in accordance with one embodiment comprises: waiting to execute a speculative critical section of program code until a lock is freed by a current transaction; responsively executing the speculative critical section to completion upon detecting that the lock has been freed, regardless of whether the lock is held by another transaction during the execution of the speculative critical section; once execution of the speculative critical section is complete, determining whether the lock is taken; and if the lock is not taken, then committing the speculative critical section and, if the lock is taken, then aborting the speculative critical section.

15 citations


Patent
27 Dec 2014
TL;DR: In this paper, the authors present a framework for generating composable library functions based on a plurality of abstractions written at different levels of abstraction, such as block-algorithm abstraction at high level, blocked-algorithms abstraction at medium level, and region-based code abstraction at low level.
Abstract: Technologies for generating composable library functions include a first computing device that includes a library compiler configured to compile a composable library and second computing device that includes an application compiler configured to compose library functions of the composable library based on a plurality of abstractions written at different levels of abstractions. For example, the abstractions may include an algorithm abstraction at a high level, a blocked-algorithm abstraction at medium level, and a region-based code abstraction at a low level. Other embodiments are described and claimed herein.

6 citations


Patent
04 Nov 2014
TL;DR: In this paper, a processing device implementing unbounded transactional memory with forward progress guarantees using a hardware global lock is described. But it is not shown how the global hardware lock is read by bounded transactions that abort when the lock is taken.
Abstract: A processing device implementing unbounded transactional memory with forward progress guarantees using a hardware global lock is disclosed A processing device of the disclosure includes a hardware transactional memory (HTM) hardware contention manager to cause a bounded transaction to be translated to an unbounded transaction, the unbounded transaction to acquire a global hardware lock for the unbounded transaction, the global hardware lock read by bounded transactions that abort when the global hardware lock is taken The processing device further includes an execution unit communicably coupled to the HTM hardware contention manager to execute instructions of the unbounded transaction without speculation, the unbounded transaction to release the global hardware lock upon completion of execution of the instructions

4 citations


Patent
29 Jan 2014
TL;DR: In this paper, a method and apparatus for providing efficient strong atomicity is described, where optimized strong atomic operations are inserted at non-transactional read accesses to provide efficient strong atomicicity.
Abstract: A method and apparatus for providing efficient strong atomicity is herein described. Optimized strong atomic operations may be inserted at non-transactional read accesses to provide efficient strong atomicity. A global transaction value is copied at a beginning of a non-transactional function to a local transaction value; essentially creating a local timestamp of the global transaction value. At a non-transactional memory access within the function, a counter value or version value is compared to the LTV to see if a transaction has started updating memory locations, or specifically the memory location accessed. If memory locations have not been updated by a transaction, execution is accelerated by avoiding a full set of slowpath strong atomic operations to ensure validity of data accessed. In contrast, the slowpath operations may be executed to resolve contention between a transactional and non-transaction access contending for the same memory location.

3 citations


Patent
27 Mar 2014
TL;DR: In this article, the authors propose a method for identifying an object in a managed runtime environment and determining an age of the object at a software level of the managed run-time environment.
Abstract: Systems and methods may provide for identifying an object in a managed runtime environment and determining an age of the object at a software level of the managed runtime environment. Additionally, the object may be selectively allocated in one of a dynamic random access memory (DRAM) or a non-volatile random access memory (NVRAM) based at least in part on the age of the object. In one example, the data type of the object is also determined, wherein the object is selectively allocated further based on the data type.

1 citations