Author

Peter C. Mills

Bio: Peter C. Mills is an academic researcher from Nvidia. The author has contributed to research in topics: Indirect branch & Shared memory. The author has an h-index of 10 and has co-authored 22 publications receiving 354 citations.

Papers
Patent
15 Dec 2005
TL;DR: In this article, a cooperative thread array (CTA) is defined as a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set.
Abstract: A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reaches the barrier point.
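
A minimal CUDA sketch of the CTA idea follows (the thread block plays the role of the CTA; this illustrates the programming-model view, not the patented hardware mechanism): each thread's ID selects its slice of the data, intermediate results are shared through per-CTA shared memory, and __syncthreads() is the barrier that suspends threads until the whole block reaches it.

```cuda
// Sketch only: a CUDA thread block stands in for a CTA.
__global__ void block_sum(const float* in, float* out, int n)
{
    __shared__ float partial[256];              // per-CTA intermediate results (blockDim.x <= 256 assumed)

    int tid = threadIdx.x;                      // unique thread ID within the CTA
    int gid = blockIdx.x * blockDim.x + tid;    // the slice of the input this thread processes

    partial[tid] = (gid < n) ? in[gid] : 0.0f;

    __syncthreads();                            // barrier: wait until every thread has written its result

    if (tid == 0) {                             // one thread combines the shared intermediate results
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            sum += partial[i];
        out[blockIdx.x] = sum;                  // this CTA's portion of the output
    }
}
```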

70 citations

Patent
Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, John Erik Lindholm
24 Mar 2008
TL;DR: In this article, an indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) architectures.
Abstract: An indirect branch instruction takes an address register as an argument in order to provide indirect function call capability for single-instruction multiple-thread (SIMT) processor architectures. The indirect branch instruction is used to implement indirect function calls, virtual function calls, and switch statements to improve processing performance compared with using sequential chains of tests and branches.
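
At source level, the capability this instruction provides corresponds to per-thread indirect calls. The hedged CUDA sketch below (op_square and op_negate are hypothetical example functions) shows a call through a data-dependent function pointer instead of a sequential chain of tests and branches; the patent covers the ISA-level mechanism, not this code.

```cuda
typedef float (*op_fn)(float);

__device__ float op_square(float x) { return x * x; }
__device__ float op_negate(float x) { return -x; }

__global__ void apply_ops(const float* in, const int* which, float* out, int n)
{
    op_fn table[2] = { op_square, op_negate }; // per-thread table of device function addresses

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        // Indirect call: the target address comes from a table lookup, so
        // different threads in the same warp may branch to different targets.
        out[gid] = table[which[gid] & 1](in[gid]);
    }
}
```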

44 citations

Patent
24 Mar 2008
TL;DR: In this article, a system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful; read data is returned to the requestor with the lock status.
Abstract: A system and method for locking and unlocking access to a shared memory for atomic operations provides immediate feedback indicating whether or not the lock was successful. Read data is returned to the requestor with the lock status. The lock status may be changed concurrently when locking during a read or unlocking during a write. Therefore, it is not necessary to check the lock status as a separate transaction prior to or during a read-modify-write operation. Additionally, a lock or unlock may be explicitly specified for each atomic memory operation. Therefore, lock operations are not performed for operations that do not modify the contents of a memory location.
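
A software analogue can be built from CUDA atomics. The sketch below assumes a one-word lock guarding a global counter; it is not the patented hardware path, which goes further by returning the read data together with the lock status in a single transaction so no separate lock check precedes the read-modify-write.

```cuda
__global__ void locked_increment(volatile int* value, int* lock)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try-lock: 0 = free, 1 = held
            int v = *value;                 // read
            *value = v + 1;                 // modify, write
            __threadfence();                // publish the write before releasing
            atomicExch(lock, 0);            // unlock
            done = true;
        }
    }
}
```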

39 citations

Patent
06 Nov 2006
TL;DR: In this paper, a multithreaded processing unit is configured to perform the steps of fetching a program instruction, determining that the program instruction is an indirect branch instruction, and processing the indirect branch instruction as a sequence of two-way branches to execute an indirect branch instruction with multiple branch addresses.
Abstract: One embodiment of a computing system configured to manage divergent threads in a thread group includes a stack configured to store at least one token and a multithreaded processing unit. The multithreaded processing unit is configured to perform the steps of fetching a program instruction, determining that the program instruction is an indirect branch instruction, and processing the indirect branch instruction as a sequence of two-way branches to execute an indirect branch instruction with multiple branch addresses. Indirect branch instructions may be used to allow greater flexibility since the branch address or multiple branch addresses do not need to be determined at compile time.
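
The two-way serialization idea can be illustrated in software with warp intrinsics. The sketch below is not the patent's token-stack hardware, and run_target() is a hypothetical stand-in for the code at a branch target: on each pass one outstanding target is chosen, lanes whose target matches take it, and the rest are deferred to a later pass.

```cuda
#include <cstdio>

__device__ void run_target(int target)   // hypothetical stand-in for the branch target's code
{
    printf("lane %d took target %d\n", threadIdx.x % 32, target);
}

__device__ void dispatch_indirect(int my_target)   // 1-D thread blocks assumed
{
    int      lane    = threadIdx.x % 32;
    unsigned active  = __activemask();   // lanes participating in the branch
    unsigned pending = active;           // lanes that have not yet branched

    while (pending) {
        int      leader = __ffs(pending) - 1;                      // pick one pending lane
        int      chosen = __shfl_sync(active, my_target, leader);  // its target address
        unsigned match  = __ballot_sync(active, my_target == chosen) & pending;

        if ((match >> lane) & 1u)        // the two-way decision for this pass
            run_target(chosen);          // "taken" side; non-matching lanes defer

        pending &= ~match;               // matched lanes are done; the rest wait for a later pass
    }
}
```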

35 citations

Patent
21 Jun 2011
TL;DR: In this paper, parallel processing engines generate a group of parallel memory access requests, each specifying a target address that may be the same or different for different requests; serialization logic selects one of the target addresses and determines which of the requests specify the selected target address.
Abstract: A memory is used by concurrent threads in a multithreaded processor. Any addressable storage location is accessible by any of the concurrent threads, but only one location at a time is accessible. The memory is coupled to parallel processing engines that generate a group of parallel memory access requests, each specifying a target address that might be the same or different for different requests. Serialization logic selects one of the target addresses and determines which of the requests specify the selected target address. All such requests are allowed to proceed in parallel, while other requests are deferred. Deferred requests may be regenerated and processed through the serialization logic so that a group of requests can be satisfied by accessing each different target address in the group exactly once.
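
The request-replay behavior can likewise be sketched in software (the patent covers the hardware serialization logic; mem below stands in for the memory it arbitrates): each pass picks one outstanding address, services every lane requesting that address in parallel through a single access plus a broadcast, and defers the remaining requests to the next pass, so each distinct address is accessed exactly once.

```cuda
__device__ int serialized_load(const int* mem, int my_addr)   // 1-D thread blocks assumed
{
    int      lane    = threadIdx.x % 32;
    unsigned active  = __activemask();
    unsigned pending = active;            // requests not yet satisfied
    int      result  = 0;

    while (pending) {
        int      leader = __ffs(pending) - 1;                    // choose one request
        int      addr   = __shfl_sync(active, my_addr, leader);  // its target address
        unsigned served = __ballot_sync(active, my_addr == addr) & pending;

        int loaded = (lane == leader) ? mem[addr] : 0;           // one physical access...
        int value  = __shfl_sync(active, loaded, leader);        // ...broadcast to every taker

        if ((served >> lane) & 1u)
            result = value;              // all requests for this address proceed together

        pending &= ~served;              // the rest are regenerated on the next pass
    }
    return result;
}
```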

32 citations


Cited by
Patent
09 Apr 2008
TL;DR: In this article, a method and an apparatus that schedule a plurality of executables in a schedule queue for execution in one or more physical compute devices such as CPUs or GPUs concurrently are described.
Abstract: A method and an apparatus that schedule a plurality of executables in a schedule queue for execution in one or more physical compute devices such as CPUs or GPUs concurrently are described. One or more executables are compiled online from a source having an existing executable for a type of physical compute device different from the one or more physical compute devices. Dependency relations among elements corresponding to scheduled executables are determined to select an executable to be executed by a plurality of threads concurrently in more than one of the physical compute devices. A thread initialized for executing an executable in a GPU of the physical compute devices is initialized for execution in another CPU of the physical compute devices if the GPU is busy with graphics processing threads. Sources and existing executables for an API function are stored in an API library to execute a plurality of executables in a plurality of physical compute devices, including the existing executables and online compiled executables from the sources.
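
A toy host-side dispatcher conveys the fallback idea from the abstract above. This is not the patent's API, which targets a generic compute framework; the busy flag and the scaling workload are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void scale_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void scale_on_cpu(std::vector<float>& data)
{
    for (float& x : data) x *= 2.0f;       // CPU version of the same executable
}

void dispatch(std::vector<float>& data, bool gpu_busy)
{
    if (gpu_busy) {                        // GPU unavailable (e.g. graphics work): run on the CPU instead
        scale_on_cpu(data);
        return;
    }
    float* d = nullptr;
    size_t bytes = data.size() * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, data.data(), bytes, cudaMemcpyHostToDevice);
    int n = static_cast<int>(data.size());
    scale_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(data.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```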

171 citations

Patent
24 Oct 2007
TL;DR: In this article, the power consumption, the performance, and the power/performance value are determined for various computational processes between a plurality of subsystems where each of the subsystems is capable of performing the computational processes.
Abstract: Exemplary embodiments of methods and apparatuses to dynamically redistribute computational processes in a system that includes a plurality of processing units are described. The power consumption, the performance, and the power/performance value are determined for various computational processes between a plurality of subsystems, where each of the subsystems is capable of performing the computational processes. The computational processes are, for example, a graphics rendering process, an image processing process, a signal processing process, a Bayer decoding process, or a video decoding process, which can be performed by a central processing unit, a graphics processing unit, or a digital signal processing unit. In one embodiment, the distribution of computational processes between capable subsystems is based on a power setting, a performance setting, a dynamic setting, or a value setting.
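
A hypothetical decision helper, illustrative only and not the patent's method, shows how the three measures and a policy setting could drive the choice of subsystem. The type names and fields are assumptions for the example.

```cuda
enum class Policy { Power, Performance, Value };

struct Subsystem {
    const char* name;
    float watts;        // estimated power consumption for the process
    float throughput;   // estimated performance for the process
};

static float score(const Subsystem& s, Policy p)
{
    switch (p) {
        case Policy::Power:       return -s.watts;                 // lower power is better
        case Policy::Performance: return  s.throughput;            // higher performance is better
        case Policy::Value:       return  s.throughput / s.watts;  // power/performance value
    }
    return 0.0f;
}

const Subsystem* choose_subsystem(const Subsystem* subs, int count, Policy policy)
{
    const Subsystem* best = subs;
    for (int i = 1; i < count; ++i)
        if (score(subs[i], policy) > score(*best, policy))
            best = &subs[i];
    return best;
}
```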

169 citations

Patent
02 Sep 2011
TL;DR: In this paper, the authors present a method and device for maintaining data in a data storage system comprising a plurality of data storage nodes. The method is employed in a storage node in the data storage system and comprises: monitoring and detecting conditions in the data storage system that imply a need for replication of data between the nodes; and initiating replication processes when such a condition is detected, wherein the replication processes include sending multicast and unicast requests to other storage nodes.
Abstract: A method and device for maintaining data in a data storage system comprising a plurality of data storage nodes, the method being employed in a storage node in the data storage system and comprising: monitoring and detecting conditions in the data storage system that imply a need for replication of data between the nodes in the data storage system; initiating replication processes when such a condition is detected, wherein the replication processes include sending multicast and unicast requests to other storage nodes, said requests including priority flags; receiving multicast and unicast requests from other storage nodes, wherein the received requests include priority flags; ordering the received requests in different queues depending on their priority flags; and dealing with requests in higher-priority queues at a higher frequency than requests in lower-priority queues.
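
A hypothetical sketch of the queueing discipline described above (not the patent's implementation): requests carry a priority flag, are ordered into per-priority queues, and a weighted service round drains the high-priority queue more often than the low-priority one.

```cuda
#include <queue>
#include <string>
#include <vector>

struct ReplicationRequest {
    std::string data_id;   // identifier of the data to replicate (illustrative)
    int priority;          // 0 = high, 1 = low
};

struct RequestScheduler {
    std::queue<ReplicationRequest> queues[2];   // one queue per priority flag
    int weights[2] = {3, 1};                    // high priority served 3x as often per round

    void enqueue(const ReplicationRequest& r) { queues[r.priority].push(r); }

    // One service round: take up to weights[p] requests from each queue, so
    // higher-priority requests are dealt with at a higher frequency.
    void service_round(std::vector<ReplicationRequest>& out) {
        for (int p = 0; p < 2; ++p)
            for (int k = 0; k < weights[p] && !queues[p].empty(); ++k) {
                out.push_back(queues[p].front());
                queues[p].pop();
            }
    }
};
```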

126 citations

Patent
03 May 2007
TL;DR: In this article, a method and an apparatus that allocate one or more physical compute devices such as CPUs or GPUs attached to a host processing unit running an application for executing one or multiple threads of the application are described.
Abstract: A method and an apparatus that allocate one or more physical compute devices such as CPUs or GPUs attached to a host processing unit running an application for executing one or more threads of the application are described. The allocation may be based on data representing a processing capability requirement from the application for executing an executable in the one or more threads. A compute device identifier may be associated with the allocated physical compute devices to schedule and execute the executable in the one or more threads concurrently in one or more of the allocated physical compute devices.
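
A host-side sketch using the CUDA runtime API as a stand-in for the patent's generic compute devices: device identifiers are selected when a device's capabilities meet a requirement supplied by the application. The minimum-memory and compute-capability thresholds are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

std::vector<int> allocate_devices(size_t min_mem_bytes, int min_major)
{
    std::vector<int> ids;
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        if (prop.totalGlobalMem >= min_mem_bytes && prop.major >= min_major)
            ids.push_back(i);               // compute device identifier used for scheduling
    }
    return ids;
}
```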

112 citations

Journal ArticleDOI
09 Jun 2012
TL;DR: A novel thread reconvergence technique is introduced that ensures threads run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel, and a lane shuffling technique is proposed to allow solution (2) to benefit from inter-warp correlations in divergence patterns.
Abstract: Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions to be scheduled to disjoint subsets of the same row of execution units, instead of one single instruction. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.

101 citations