
Showing papers by "Jean-Luc Gaudiot published in 2003"


Proceedings ArticleDOI
22 Apr 2003
TL;DR: The functional model of the ADTS and its software architecture are implemented to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum; performance improvements of as much as 25% are reported.
Abstract: Simultaneous multithreading (SMT) attempts to attain higher processor utilization by allowing instructions from multiple independent threads to coexist in a processor and compete for shared resources. Previous studies have shown, however, that its throughput may be limited by the number of threads. A reason is that a fixed thread scheduling policy cannot be optimal for the varying mixes of threads it may face in an SMT processor. Our adaptive dynamic thread scheduling (ADTS) was previously proposed to achieve higher utilization by allowing a detector thread to make use of wasted pipeline slots with nominal hardware and software costs. The detector thread adaptively switches between various fetch policies. Our previous study showed that a single fixed thread scheduling policy leaves much room for improvement (some 30%) compared to an oracle-scheduled case. In this paper, we take a closer look at ADTS. We implemented the functional model of the ADTS and its software architecture to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum. We report that performance could be improved by as much as 25%.

31 citations
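The core of ADTS is a detector thread that picks a fetch policy per scheduling quantum from observed behavior. A minimal sketch of that control loop is below; the policy names and the stall-share heuristic are illustrative assumptions, not the paper's actual ADTS heuristics.

```python
# Sketch of adaptive fetch-policy switching between scheduling quanta.
# The stall-based heuristic and policy set are assumed for illustration.

def icount(threads):
    # ICOUNT: fetch from the thread with the fewest in-flight instructions.
    return min(threads, key=lambda t: t["in_flight"])

def round_robin(threads, state={"next": 0}):
    # Plain rotation among threads, ignoring pipeline occupancy.
    t = threads[state["next"] % len(threads)]
    state["next"] += 1
    return t

def pick_policy(quantum_stats):
    # Detector-thread heuristic (assumed): if one thread dominated last
    # quantum's stalls, ICOUNT rebalances fetch bandwidth; otherwise
    # round-robin is good enough.
    total = sum(quantum_stats.values()) or 1
    worst = max(quantum_stats.values())
    return icount if worst / total > 0.5 else round_robin

threads = [{"id": 0, "in_flight": 12}, {"id": 1, "in_flight": 3}]
policy = pick_policy({0: 80, 1: 20})    # thread 0 caused 80% of stalls
print(policy is icount)                 # -> True
print(policy(threads)["id"])            # -> 1 (fewest in-flight instructions)
```

The point of the functional model is exactly this separation: the policies themselves are fixed hardware mechanisms, while the detector thread's selection heuristic is software and can be swapped per quantum.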



Proceedings ArticleDOI
24 Jun 2003
TL;DR: A hardware/software co-design paradigm that uses a PIM module to efficiently execute motion estimation operations and an efficient data distribution mechanism to effectively support parallel executions among these smaller PIM modules is presented.
Abstract: Motion estimation is the most time-consuming stage of MPEG-family encoding; it reportedly absorbs up to 90% of the total execution time of MPEG processing. Therefore, we propose a hardware/software co-design paradigm that uses a PIM module to efficiently execute motion estimation operations. We use a PIM module to reduce the penalty caused by the large number of memory accesses. We segment the PIM module into small pieces so that each smaller PIM module can execute the operations in parallel. However, executing the operations in parallel incurs critical overheads: a huge amount of data must be replicated across many of these smaller PIM modules. These replications require not only a huge number of additional memory accesses but also additional address-generation calculations. Therefore, we also present an efficient data distribution mechanism to effectively support parallel execution among these smaller PIM modules. With our paradigm, the host processor is relieved of the computationally intensive and data-intensive workload of motion estimation. We observed up to a 2034× reduction in the number of memory accesses and up to a 439× performance improvement for the execution of motion estimation operations when using our computing paradigm.

11 citations
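The memory-access intensity the abstract describes comes from the block-matching kernel itself. A minimal full-search motion estimation using the sum of absolute differences (SAD) looks like this; frame size, block size, and search range are toy values for illustration, not the paper's configuration.

```python
# Minimal full-search block-matching motion estimation with SAD — the
# kernel that dominates MPEG encoding time and motivates a PIM offload.

def sad(cur, ref, bx, by, dx, dy, bs):
    # Sum of absolute differences between a bs x bs block of the current
    # frame and the same-size block displaced by (dx, dy) in the reference.
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(bs) for x in range(bs))

def full_search(cur, ref, bx, by, bs, rng):
    # Exhaustively test every displacement in [-rng, rng]^2; this loop is
    # the source of the huge memory-access counts cited in the abstract.
    best = None
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            if 0 <= by + dy and by + dy + bs <= len(ref) and \
               0 <= bx + dx and bx + dx + bs <= len(ref[0]):
                cost = sad(cur, ref, bx, by, dx, dy, bs)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

# Reference frame; current frame is the reference shifted right one pixel,
# so the true motion vector for an interior block is (-1, 0).
ref = [[(x * x + y * y * y) % 97 for x in range(8)] for y in range(8)]
cur = [[row[max(x - 1, 0)] for x in range(8)] for row in ref]
print(full_search(cur, ref, 2, 2, 4, 2))  # -> (0, -1, 0): zero-cost match
```

Every candidate displacement re-reads the same reference pixels, which is why moving the loop next to memory (PIM) and distributing blocks carefully pays off so dramatically.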


Journal ArticleDOI
TL;DR: This paper surveys and demonstrates the power of non-strict evaluation in applications executed on distributed architectures, and shows that partial evaluation of memory accesses decreases the traffic in the interconnection network and improves the performance of MPI IS and MPI ISSC applications.
Abstract: This paper surveys and demonstrates the power of non-strict evaluation in applications executed on distributed architectures. We present the design, implementation, and experimental evaluation of single-assignment, incomplete data structures in a distributed memory architecture and Abstract Network Machine (ANM). Incremental Structures (IS), Incremental Structure Software Cache (ISSC), and Dynamic Incremental Structures (DIS) provide non-strict data access and fully asynchronous operations that make them highly suited for the exploitation of fine-grain parallelism in distributed memory systems. We focus on split-phase memory operations and non-strict information processing under a distributed address space to improve the overall system performance. A novel technique of optimization at the communication level is proposed and described. We use partial evaluation of local and remote memory accesses not only to remove much of the excess overhead of message passing, but also to reduce the number of messages when some information about the input or part of the input is known. We show that split-phase transactions of IS, together with the ability to defer reads, allow partial evaluation of distributed programs without losing determinacy. Our experimental evaluation indicates that commodity PC clusters with both IS and a caching mechanism, ISSC, are more robust. The system can deliver speedup for both regular and irregular applications. We also show that partial evaluation of memory accesses decreases the traffic in the interconnection network and improves the performance of MPI IS and MPI ISSC applications.

8 citations
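The key mechanism behind IS is the single-assignment cell with split-phase, deferred reads: a read that arrives before the write is queued instead of blocking, and is satisfied when the producer writes. A minimal sketch, with illustrative API names rather than the paper's:

```python
# Sketch of a single-assignment I-structure element with deferred reads.
# A read before the write is queued, not blocked; a second write is an
# error, which is what preserves determinacy.

class IStructureCell:
    EMPTY = object()

    def __init__(self):
        self.value = IStructureCell.EMPTY
        self.deferred = []            # continuations waiting for the value

    def read(self, consumer):
        # Split-phase read: request now, consume whenever data arrives.
        if self.value is IStructureCell.EMPTY:
            self.deferred.append(consumer)
        else:
            consumer(self.value)

    def write(self, value):
        # Single-assignment: writing twice is a program error.
        if self.value is not IStructureCell.EMPTY:
            raise RuntimeError("I-structure cell written twice")
        self.value = value
        for consumer in self.deferred:   # wake all deferred readers
            consumer(value)
        self.deferred.clear()

results = []
cell = IStructureCell()
cell.read(results.append)   # read issued before the write: deferred
cell.write(42)              # producer arrives, deferred read satisfied
cell.read(results.append)   # read after the write: satisfied immediately
print(results)              # -> [42, 42]
```

Because the final value a reader sees does not depend on the order in which reads and the write arrive, programs built on such cells stay deterministic even under fully asynchronous, split-phase messaging.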


Proceedings ArticleDOI
22 Apr 2003
TL;DR: This paper presents the design and performance evaluation of the HiDISC (Hierarchical Decoupled Instruction Stream Computer), which provides low memory access latency by introducing enhanced data prefetching techniques at both the hardware and the software levels.
Abstract: This paper presents the design and performance evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). HiDISC provides low memory access latency by introducing enhanced data prefetching techniques at both the hardware and the software levels. Three processors, one for each level of the memory hierarchy, act in concert to mask the memory latency. Our performance evaluation benchmarks include the Data-Intensive Systems Benchmark suite and the DIS Stressmark suite. Our simulation results point to a distinct advantage of the HiDISC system over current prevailing superscalar architectures for both benchmark suites. On average, a 12% improvement in performance is achieved while 17% of cache misses are eliminated.

7 citations
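The essence of a decoupled architecture is that an access stream slips ahead of the execute stream, feeding it operands through a queue so that execution rarely waits on memory. The two-stage sketch below illustrates the idea; it is a simplification of HiDISC's actual three-processor hierarchy, and all names are assumed.

```python
# Sketch of the decoupled access/execute idea: the access stream resolves
# addresses and fetches operands early, pushing them into a queue that
# the execute stream drains without ever computing an address itself.

from collections import deque

memory = {addr: addr * 10 for addr in range(100)}   # toy main memory

def access_stream(addresses, queue):
    # Runs ahead of execution; in hardware each append would be a
    # prefetch/load issued long before the operand is needed.
    for addr in addresses:
        queue.append(memory[addr])

def execute_stream(queue, n):
    # Consumes operands in order; no address computation, no misses.
    return sum(queue.popleft() for _ in range(n))

addrs = [3, 1, 4, 1, 5]
q = deque()
access_stream(addrs, q)               # access processor runs ahead
print(execute_stream(q, len(addrs)))  # -> 140, i.e. 10*(3+1+4+1+5)
```

HiDISC extends this split one level further, dedicating a processor to each level of the memory hierarchy so that each stage hides the latency of the level below it.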


Book ChapterDOI
26 Aug 2003
TL;DR: This paper proposes a new clustered SMT architecture which is appropriate for both multiple threads and single thread environments and shows that the approach significantly reduces power consumption without significantly degrading performance.
Abstract: The excessive heat produced by billions of transistors will be one of the most important challenges in the design of future chips. Many low-power architectures have been developed to reduce the power consumption of microprocessors. However, this has come at the expense of performance, and only a few low-power superscalar designs, such as clustered architectures, target high-performance computing applications.

6 citations


Proceedings ArticleDOI
08 Feb 2003
TL;DR: This paper introduces a hybrid model enhanced with novel compiler support for the dynamic pre-execution of a p-thread, a promising prefetching technique which uses an auxiliary assisting thread in addition to the main program flow.
Abstract: Speculative pre-execution is a promising prefetching technique which uses an auxiliary assisting thread in addition to the main program flow. A prefetching thread (p-thread), which contains the probable future cache-miss instructions and their backward slice, can run on a spare hardware context to prefetch data. Recently, various forms of speculative pre-execution have been developed, including hardware-based and software-based approaches. The hardware-based approach has the advantage of using runtime information dynamically. However, it requires a complex implementation and lacks global information such as data and control flow. On the other hand, the software-oriented approach cannot cope with dynamic events and imposes additional software overhead. As a compromise, this paper introduces a hybrid model enhanced with novel compiler support for the dynamic pre-execution of a p-thread.

4 citations
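A p-thread executes only the backward slice of a delinquent load, so it can run far ahead of the main thread and warm the cache. The sketch below simulates this with a pointer-chasing linked list: the slice is just the pointer chase, and the explicit `cache` set stands in for the real data cache. All names are illustrative.

```python
# Sketch of speculative pre-execution: the p-thread runs the backward
# slice of a delinquent load (the pointer chase) ahead of the main
# thread; the main thread then finds every node already "cached".

class Node:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def p_thread(head, cache, distance):
    # Backward slice of the miss: only the pointer chase, no computation.
    node = head
    for _ in range(distance):
        if node is None:
            break
        cache.add(id(node))        # touching the node prefetches it
        node = node.next

def main_thread(head, cache):
    total, misses = 0, 0
    node = head
    while node is not None:
        if id(node) not in cache:  # would stall on a real cache miss
            misses += 1
            cache.add(id(node))
        total += node.value        # the full computation the slice omits
        node = node.next
    return total, misses

head = None
for v in [5, 4, 3, 2, 1]:
    head = Node(v, head)           # list: 1 -> 2 -> 3 -> 4 -> 5

cache = set()
p_thread(head, cache, distance=5)  # slice runs ahead of the main thread
print(main_thread(head, cache))    # -> (15, 0): all misses prefetched away
```

The hybrid model's compiler support addresses exactly the hard part this sketch glosses over: extracting that slice automatically and deciding at run time when and how far ahead to launch it.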


Proceedings Article
01 Jan 2003
TL;DR: D-IS and D-ISSC facilitate programming by relaxing synchronization requirements; the traffic in the interconnection network can also be significantly reduced by partial evaluation of local and remote memory accesses, and the speedup of regular and irregular applications can be increased.
Abstract: This paper focuses on non-strict processing, optimization, and partial evaluation of MPI programs which use incremental data structures (ISs). We describe the design and implementation of Distributed IS Software Caches (D-ISSC), which take advantage of spatial and temporal data locality while maintaining the latency tolerance of the distributed IS memory system (D-IS). We show that D-IS and D-ISSC facilitate programming by relaxing synchronization requirements. Our experimental evaluation indicates that the traffic in the interconnection network can also be significantly reduced by partial evaluation of local and remote memory accesses, and that the speedup of regular and irregular applications can be increased.

2 citations
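What makes a software cache over IS data particularly simple is single assignment: a cached value can never become stale, so no invalidation protocol is needed and every cache hit is one network message saved. A minimal sketch, with assumed names and a counter standing in for network traffic:

```python
# Sketch of the D-ISSC idea: a node-local software cache in front of
# remote I-structure reads. Single-assignment cells never change, so
# caching is always safe and directly cuts interconnect traffic.

remote_reads = 0
remote_store = {("node1", i): i * i for i in range(10)}  # remote IS cells

def remote_read(key):
    # Each call models one request/reply message pair on the network.
    global remote_reads
    remote_reads += 1
    return remote_store[key]

class ISSoftwareCache:
    def __init__(self):
        self.cache = {}

    def read(self, key):
        # No invalidation logic: single-assignment data cannot be stale.
        if key not in self.cache:
            self.cache[key] = remote_read(key)
        return self.cache[key]

issc = ISSoftwareCache()
values = [issc.read(("node1", i % 5)) for i in range(20)]
print(sum(values))       # -> 120: 4 * (0 + 1 + 4 + 9 + 16)
print(remote_reads)      # -> 5: 20 reads cost only 5 network messages
```

This is the temporal-locality half of D-ISSC; the spatial half would fetch a whole neighborhood of cells per miss, amortizing one message over several future reads.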



Book ChapterDOI
TL;DR: It is shown that by incorporating non-strict information processing into FFT MPI, a significant reduction in the number of messages can be achieved and the overall system performance can be improved.
Abstract: This paper focuses on the partial evaluation of local and remote memory accesses of distributed applications, not only to remove much of the excess overhead of message-passing implementations, but also to reduce the number of messages when some information about the input data set is known. The use of split-phase memory operations, the exploitation of spatial data locality, and non-strict information processing are described. Through a detailed performance analysis, we establish conditions under which the technique is beneficial. We show that by incorporating non-strict information processing into FFT MPI, a significant reduction in the number of messages can be achieved and the overall system performance can be improved.

1 citation