Author

Peter C. Mills

Bio: Peter C. Mills is an academic researcher from Nvidia. The author has contributed to research in topics: Indirect branch & Shared memory. The author has an h-index of 10 and has co-authored 22 publications receiving 354 citations.

Papers
Patent
15 Dec 2005
TL;DR: In this article, a cooperative thread array (CTA) is defined as a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set.
Abstract: A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reaches the barrier point.
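
A minimal CUDA sketch of the CTA idea follows (the thread block plays the role of the CTA; this illustrates the programming-model view, not the patented hardware mechanism): each thread's ID selects its slice of the data, intermediate results are shared through per-CTA shared memory, and __syncthreads() is the barrier that suspends threads until the whole block reaches it.

```cuda
// Sketch only: a CUDA thread block stands in for a CTA.
__global__ void block_sum(const float* in, float* out, int n)
{
    __shared__ float partial[256];              // per-CTA intermediate results (blockDim.x <= 256 assumed)

    int tid = threadIdx.x;                      // unique thread ID within the CTA
    int gid = blockIdx.x * blockDim.x + tid;    // the slice of the input this thread processes

    partial[tid] = (gid < n) ? in[gid] : 0.0f;

    __syncthreads();                            // barrier: wait until every thread has written its result

    if (tid == 0) {                             // one thread combines the shared intermediate results
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            sum += partial[i];
        out[blockIdx.x] = sum;                  // this CTA's portion of the output
    }
}
```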

70 citations

Patent
Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, John Erik Lindholm
24 Mar 2008
TL;DR: In this article, an indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) architectures.
Abstract: An indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) processor architectures. The indirect branch instruction is used to implement indirect function calls, virtual function calls, and switch statements to improve processing performance compared with using sequential chains of tests and branches.
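
At source level, the capability this instruction provides corresponds to per-thread indirect calls. The hedged CUDA sketch below (op_square and op_negate are hypothetical example functions) shows a call through a data-dependent function pointer instead of a sequential chain of tests and branches; the patent covers the ISA-level mechanism, not this code.

```cuda
typedef float (*op_fn)(float);

__device__ float op_square(float x) { return x * x; }
__device__ float op_negate(float x) { return -x; }

__global__ void apply_ops(const float* in, const int* which, float* out, int n)
{
    op_fn table[2] = { op_square, op_negate }; // per-thread table of device function addresses

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        // Indirect call: the target address comes from a table lookup, so
        // different threads in the same warp may branch to different targets.
        out[gid] = table[which[gid] & 1](in[gid]);
    }
}
```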

44 citations

Patent
24 Mar 2008
TL;DR: In this article, a system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful; read data is returned to the requestor with the lock status.
Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.
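
A software analogue can be built from CUDA atomics. The sketch below assumes a one-word lock guarding a global counter; it is not the patented hardware path, which goes further by returning the read data together with the lock status in a single transaction so no separate lock check precedes the read-modify-write.

```cuda
__global__ void locked_increment(volatile int* value, int* lock)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try-lock: 0 = free, 1 = held
            int v = *value;                 // read
            *value = v + 1;                 // modify, write
            __threadfence();                // publish the write before releasing
            atomicExch(lock, 0);            // unlock
            done = true;
        }
    }
}
```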

39 citations

Patent
06 Nov 2006
TL;DR: In this paper, a multithreaded processing unit is configured to perform the steps of fetching a program instruction, determining that the program instruction is an indirect branch instruction, and processing the indirect branch instruction as a sequence of two-way branches to execute an indirect branch instruction with multiple branch addresses.
Abstract: One embodiment of a computing system configured to manage divergent threads in a thread group includes a stack configured to store at least one token and a multithreaded processing unit. The multithreaded processing unit is configured to perform the steps of fetching a program instruction, determining that the program instruction is an indirect branch instruction, and processing the indirect branch instruction as a sequence of two-way branches to execute an indirect branch instruction with multiple branch addresses. Indirect branch instructions may be used to allow greater flexibility since the branch address or multiple branch addresses do not need to be determined at compile time.
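
The two-way serialization idea can be illustrated in software with warp intrinsics. The sketch below is not the patent's token-stack hardware, and run_target() is a hypothetical stand-in for the code at a branch target: on each pass one outstanding target is chosen, lanes whose target matches take it, and the rest are deferred to a later pass.

```cuda
#include <cstdio>

__device__ void run_target(int target)   // hypothetical stand-in for the branch target's code
{
    printf("lane %d took target %d\n", threadIdx.x % 32, target);
}

__device__ void dispatch_indirect(int my_target)   // 1-D thread blocks assumed
{
    int      lane    = threadIdx.x % 32;
    unsigned active  = __activemask();   // lanes participating in the branch
    unsigned pending = active;           // lanes that have not yet branched

    while (pending) {
        int      leader = __ffs(pending) - 1;                      // pick one pending lane
        int      chosen = __shfl_sync(active, my_target, leader);  // its target address
        unsigned match  = __ballot_sync(active, my_target == chosen) & pending;

        if ((match >> lane) & 1u)        // the two-way decision for this pass
            run_target(chosen);          // "taken" side; non-matching lanes defer

        pending &= ~match;               // matched lanes are done; the rest wait for a later pass
    }
}
```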

35 citations

Patent
21 Jun 2011
TL;DR: In this paper, parallel processing engines generate a group of parallel memory access requests, each specifying a target address that may be the same or different for different requests; serialization logic selects one of the target addresses and determines which of the requests specify the selected target address.
Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.
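
The request-replay behavior can likewise be sketched in software (the patent covers the hardware serialization logic; mem below stands in for the memory it arbitrates): each pass picks one outstanding address, services every lane requesting that address in parallel through a single access plus a broadcast, and defers the remaining requests to the next pass, so each distinct address is accessed exactly once.

```cuda
__device__ int serialized_load(const int* mem, int my_addr)   // 1-D thread blocks assumed
{
    int      lane    = threadIdx.x % 32;
    unsigned active  = __activemask();
    unsigned pending = active;            // requests not yet satisfied
    int      result  = 0;

    while (pending) {
        int      leader = __ffs(pending) - 1;                    // choose one request
        int      addr   = __shfl_sync(active, my_addr, leader);  // its target address
        unsigned served = __ballot_sync(active, my_addr == addr) & pending;

        int loaded = (lane == leader) ? mem[addr] : 0;           // one physical access...
        int value  = __shfl_sync(active, loaded, leader);        // ...broadcast to every taker

        if ((served >> lane) & 1u)
            result = value;              // all requests for this address proceed together

        pending &= ~served;              // the rest are regenerated on the next pass
    }
    return result;
}
```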

32 citations


Cited by
Patent
09 Apr 2008
TL;DR: In this article, a method and an apparatus that schedule a plurality of executables in a schedule queue for execution in one or more physical compute devices such as CPUs or GPUs concurrently are described.
Abstract: A method and an apparatus that schedule a plurality of executables in a schedule queue for execution in one or more physical compute devices such as CPUs or GPUs concurrently are described. One or more executables are compiled online from a source having an existing executable for a type of physical compute device different from the one or more physical compute devices. Dependency relations among elements corresponding to scheduled executables are determined to select an executable to be executed by a plurality of threads concurrently in more than one of the physical compute devices. A thread initialized for executing an executable in a GPU of the physical compute devices is initialized for execution in another CPU of the physical compute devices if the GPU is busy with graphics processing threads. Sources and existing executables for an API function are stored in an API library to execute a plurality of executables in a plurality of physical compute devices, including the existing executables and online compiled executables from the sources.
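
A toy host-side dispatcher conveys the fallback idea from the abstract above. This is not the patent's API, which targets a generic compute framework; the busy flag and the scaling workload are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void scale_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void scale_on_cpu(std::vector<float>& data)
{
    for (float& x : data) x *= 2.0f;       // CPU version of the same executable
}

void dispatch(std::vector<float>& data, bool gpu_busy)
{
    if (gpu_busy) {                        // GPU unavailable (e.g. graphics work): run on the CPU instead
        scale_on_cpu(data);
        return;
    }
    float* d = nullptr;
    size_t bytes = data.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, data.data(), bytes, cudaMemcpyHostToDevice);
    int n = static_cast<int>(data.size());
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(data.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```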

171 citations

Patent
24 Oct 2007
TL;DR: In this article, the power consumption, the performance, and the power/performance value are determined for various computational processes between a plurality of subsystems where each of the subsystems is capable of performing the computational processes.
Abstract: Exemplary embodiments of methods and apparatuses to dynamically redistribute computational processes in a system that includes a plurality of processing units are described. The power consumption, the performance, and the power/performance value are determined for various computational processes between a plurality of subsystems, where each of the subsystems is capable of performing the computational processes. The computational processes are, for example, a graphics rendering process, an image processing process, a signal processing process, a Bayer decoding process, or a video decoding process, which can be performed by a central processing unit, a graphics processing unit, or a digital signal processing unit. In one embodiment, the distribution of computational processes between capable subsystems is based on a power setting, a performance setting, a dynamic setting, or a value setting.
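
A hypothetical decision helper, illustrative only and not the patent's method, shows how the three measures and a policy setting could drive the choice of subsystem. The type names and fields are assumptions for the example.

```cuda
enum class Policy { Power, Performance, Value };

struct Subsystem {
    const char* name;
    float watts;        // estimated power consumption for the process
    float throughput;   // estimated performance for the process
};

static float score(const Subsystem& s, Policy p)
{
    switch (p) {
        case Policy::Power:       return -s.watts;                 // lower power is better
        case Policy::Performance: return  s.throughput;            // higher performance is better
        case Policy::Value:       return  s.throughput / s.watts;  // power/performance value
    }
    return 0.0f;
}

const Subsystem* choose_subsystem(const Subsystem* subs, int count, Policy policy)
{
    const Subsystem* best = subs;
    for (int i = 1; i < count; ++i)
        if (score(subs[i], policy) > score(*best, policy))
            best = &subs[i];
    return best;
}
```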

169 citations

Patent
02 Sep 2011
TL;DR: In this paper, the authors present a method and device for maintaining data in a data storage system comprising a plurality of data storage nodes. The method is employed in a storage node in the data storage system and comprises: monitoring and detecting conditions in the data storage system that imply a need for replication of data between the nodes; and initiating replication processes when such a condition is detected, wherein the replication processes include sending multicast and unicast requests to other storage nodes.
Abstract: A method and device for maintaining data in a data storage system comprising a plurality of data storage nodes, the method being employed in a storage node in the data storage system and comprising: monitoring and detecting conditions in the data storage system that imply a need for replication of data between the nodes in the data storage system; initiating replication processes when such a condition is detected, wherein the replication processes include sending multicast and unicast requests to other storage nodes, said requests including priority flags; receiving multicast and unicast requests from other storage nodes, wherein the received requests include priority flags; ordering the received requests in different queues depending on their priority flags; and dealing with requests in higher-priority queues at a higher frequency than requests in lower-priority queues.
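
A hypothetical sketch of the queueing discipline described above (not the patent's implementation): requests carry a priority flag, are ordered into per-priority queues, and a weighted service round drains the high-priority queue more often than the low-priority one.

```cuda
#include <queue>
#include <string>
#include <vector>

struct ReplicationRequest {
    std::string data_id;   // identifier of the data to replicate (illustrative)
    int priority;          // 0 = high, 1 = low
};

struct RequestScheduler {
    std::queue<ReplicationRequest> queues[2];   // one queue per priority flag
    int weights[2] = {3, 1};                    // high priority served 3x as often per round

    void enqueue(const ReplicationRequest& r) { queues[r.priority].push(r); }

    // One service round: take up to weights[p] requests from each queue, so
    // higher-priority requests are dealt with at a higher frequency.
    void service_round(std::vector<ReplicationRequest>& out) {
        for (int p = 0; p < 2; ++p)
            for (int k = 0; k < weights[p] && !queues[p].empty(); ++k) {
                out.push_back(queues[p].front());
                queues[p].pop();
            }
    }
};
```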

126 citations

Patent
03 May 2007
TL;DR: In this article, a method and an apparatus that allocate one or more physical compute devices such as CPUs or GPUs attached to a host processing unit running an application for executing one or multiple threads of the application are described.
Abstract: A method and an apparatus that allocate one or more physical compute devices such as CPUs or GPUs attached to a host processing unit running an application for executing one or more threads of the application are described. The allocation may be based on data representing a processing capability requirement from the application for executing an executable in the one or more threads. A compute device identifier may be associated with the allocated physical compute devices to schedule and execute the executable in the one or more threads concurrently in one or more of the allocated physical compute devices.
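
A host-side sketch using the CUDA runtime API as a stand-in for the patent's generic compute devices: device identifiers are selected when a device's capabilities meet a requirement supplied by the application. The minimum-memory and compute-capability thresholds are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

std::vector<int> allocate_devices(size_t min_mem_bytes, int min_major)
{
    std::vector<int> ids;
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (prop.totalGlobalMem >= min_mem_bytes && prop.major >= min_major)
            ids.push_back(i);               // compute device identifier used for scheduling
    }
    return ids;
}
```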

112 citations

Journal ArticleDOI
09 Jun 2012
TL;DR: A novel thread reconvergence technique is introduced that ensures threads run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel, and a lane shuffling technique is proposed to allow solution (2) to benefit from inter-warp correlations in divergence patterns.
Abstract: Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions to be scheduled to disjoint subsets of the same row of execution units, instead of one single instruction. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.

101 citations