Open Access Proceedings Article

Hardware transactional memory for GPU architectures

TL;DR
KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions, is proposed; it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead.
Abstract: 
Graphics processor units (GPUs) are designed to efficiently exploit thread-level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a much smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long-latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations on single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
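
To make the abstract's central mechanism concrete, the following is a minimal CUDA sketch of word-level, value-based conflict detection, written as a software emulation rather than the paper's hardware design; the names TxLog, tx_read, tx_write, tx_validate_and_commit, and commit_lock are hypothetical. Each transaction logs the value of every word it reads and buffers its writes; at commit, it re-reads each logged word and publishes the buffered writes only if every value still matches.

#include <cuda_runtime.h>

#define TX_MAX_ENTRIES 8   // log capacity for this sketch

struct TxLog {
    unsigned *addr[TX_MAX_ENTRIES];  // global-memory words accessed
    unsigned  seen[TX_MAX_ENTRIES];  // value observed at read time
    unsigned  newv[TX_MAX_ENTRIES];  // value the transaction will write
    int       n;                     // number of log entries
};

// Transactional read: return the current value and log (address, value).
__device__ unsigned tx_read(TxLog *log, unsigned *p) {
    unsigned v = *p;
    int i = log->n++;
    log->addr[i] = p;
    log->seen[i] = v;
    log->newv[i] = v;                // unchanged until tx_write
    return v;
}

// Transactional write: buffer the new value in log entry i; global
// memory is not updated until commit.
__device__ void tx_write(TxLog *log, int i, unsigned v) {
    log->newv[i] = v;
}

// Value-based validation and commit. A single global lock serializes
// commits in this sketch (and intra-warp spinning can livelock on
// pre-Volta SIMT hardware); KILO TM instead validates in hardware near
// the memory partitions and commits non-conflicting transactions in
// parallel via its bloom-filter-based speculative validation.
__device__ bool tx_validate_and_commit(TxLog *log, int *commit_lock) {
    while (atomicCAS(commit_lock, 0, 1) != 0) { }  // acquire
    bool ok = true;
    for (int i = 0; i < log->n; ++i)
        if (*log->addr[i] != log->seen[i]) { ok = false; break; }
    if (ok)
        for (int i = 0; i < log->n; ++i)
            *log->addr[i] = log->newv[i];          // publish buffered writes
    __threadfence();                               // make writes visible
    atomicExch(commit_lock, 0);                    // release
    return ok;                                     // false => abort and retry
}

The design choice this illustrates: validation needs only plain re-reads of memory and value comparisons, so no cache coherence or conflict-broadcast hardware is required, which is what lets the scheme scale to 1000s of concurrent transactions.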


Citations
Proceedings Article

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

TL;DR: This paper presents a coordinated CTA-aware scheduling policy that uses four schemes to minimize the impact of long memory latencies, and indicates that the proposed mechanism can provide a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
Proceedings Article

Neither more nor less: optimizing thread-level parallelism for GPGPUs

TL;DR: To reduce resource contention, this paper proposes a dynamic CTA scheduling mechanism, called DYNCTA, which modulates the TLP by allocating an optimal number of CTAs based on application characteristics.
Proceedings Article

Orchestrated scheduling and prefetching for GPGPUs

TL;DR: Presents techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies, and proposes a new prefetch-aware warp scheduling policy that overcomes problems with existing warp scheduling policies.
Patent

Transactional memory that performs a CAMR 32-bit lookup operation

TL;DR: In this article, a transactional memory (TM) receives, across a bus from a processor, a lookup command that includes a base address, a starting bit position, and a mask size.
Journal Article

Fine-grain task aggregation and coordination on GPUs

TL;DR: This work proposes and evaluates the first channel implementation, presents a case study that maps the fine-grain, recursive task spawning of the Cilk programming language to channels by representing it as a flow graph, and proposes a hardware mechanism that allows wavefronts to yield their execution resources.
References
Proceedings Article

Transactional memory: architectural support for lock-free data structures

TL;DR: Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
Proceedings Article

Scalable parallel programming with CUDA

TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
Proceedings Article

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Journal Article

NVIDIA Tesla: A Unified Graphics and Computing Architecture

TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.