Open Access Proceedings Article

Hardware transactional memory for GPU architectures

TL;DR
KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions, is proposed; it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead.
Abstract: 
Graphics processor units (GPUs) are designed to efficiently exploit thread-level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a much smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long-latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations on single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
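
To make the abstract's central mechanism concrete, the following is a minimal CUDA sketch of word-level, value-based conflict detection, written as a software emulation rather than the paper's hardware design; the names TxLog, tx_read, tx_write, tx_validate_and_commit, and commit_lock are hypothetical. Each transaction logs the value of every word it reads and buffers its writes; at commit, it re-reads each logged word and publishes the buffered writes only if every value still matches.

#include <cuda_runtime.h>

#define TX_MAX_ENTRIES 8   // log capacity for this sketch

struct TxLog {
    unsigned *addr[TX_MAX_ENTRIES];  // global-memory words accessed
    unsigned  seen[TX_MAX_ENTRIES];  // value observed at read time
    unsigned  newv[TX_MAX_ENTRIES];  // value the transaction will write
    int       n;                     // number of log entries
};

// Transactional read: return the current value and log (address, value).
__device__ unsigned tx_read(TxLog *log, unsigned *p) {
    unsigned v = *p;
    int i = log->n++;
    log->addr[i] = p;
    log->seen[i] = v;
    log->newv[i] = v;                // unchanged until tx_write
    return v;
}

// Transactional write: buffer the new value in log entry i; global
// memory is not updated until commit.
__device__ void tx_write(TxLog *log, int i, unsigned v) {
    log->newv[i] = v;
}

// Value-based validation and commit. A single global lock serializes
// commits in this sketch (and intra-warp spinning can livelock on
// pre-Volta SIMT hardware); KILO TM instead validates in hardware near
// the memory partitions and commits non-conflicting transactions in
// parallel via its bloom-filter-based speculative validation.
__device__ bool tx_validate_and_commit(TxLog *log, int *commit_lock) {
    while (atomicCAS(commit_lock, 0, 1) != 0) { }  // acquire
    bool ok = true;
    for (int i = 0; i < log->n; ++i)
        if (*log->addr[i] != log->seen[i]) { ok = false; break; }
    if (ok)
        for (int i = 0; i < log->n; ++i)
            *log->addr[i] = log->newv[i];          // publish buffered writes
    __threadfence();                               // make writes visible
    atomicExch(commit_lock, 0);                    // release
    return ok;                                     // false => abort and retry
}

The design choice this illustrates: validation needs only plain re-reads of memory and value comparisons, so no cache coherence or conflict-broadcast hardware is required, which is what lets the scheme scale to 1000s of concurrent transactions.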


Citations
Proceedings Article

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

TL;DR: This paper presents a coordinated CTA-aware scheduling policy that uses four schemes to minimize the impact of long memory latencies, and indicates that the proposed mechanism can provide a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
Proceedings Article

Neither more nor less: optimizing thread-level parallelism for GPGPUs

TL;DR: To reduce resource contention, this paper proposes a dynamic CTA scheduling mechanism, called DYNCTA, which modulates the TLP by allocating an optimal number of CTAs based on application characteristics.
Proceedings Article

Orchestrated scheduling and prefetching for GPGPUs

TL;DR: Presents techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies, and proposes a new prefetch-aware warp scheduling policy that overcomes problems with existing warp scheduling policies.
Patent

Transactional memory that performs a CAMR 32-bit lookup operation

TL;DR: In this article, a transactional memory (TM) receives, across a bus from a processor, a lookup command that includes a base address, a starting bit position, and a mask size.
Journal Article

Fine-grain task aggregation and coordination on GPUs

TL;DR: This work proposes and evaluates the first channel implementation, presents a case study that maps the fine-grain, recursive task spawning of the Cilk programming language to channels by representing it as a flow graph, and proposes a hardware mechanism that allows wavefronts to yield their execution resources.
References
Proceedings Article

Transactional memory: architectural support for lock-free data structures

TL;DR: Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
Proceedings Article

Scalable parallel programming with CUDA

TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Proceedings Article

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Journal Article

NVIDIA Tesla: A Unified Graphics and Computing Architecture

TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.