
Showing papers by "Jean-Luc Gaudiot published in 2003"


Proceedings ArticleDOI
22 Apr 2003
TL;DR: The functional model of the ADTS and its software architecture are implemented to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum; performance improvements of as much as 25% are reported.
Abstract: Simultaneous multithreading (SMT) attempts to attain higher processor utilization by allowing instructions from multiple independent threads to coexist in a processor and compete for shared resources. Previous studies have shown, however, that its throughput may be limited by the number of threads. A reason is that a fixed thread scheduling policy cannot be optimal for the varying mixes of threads it may face in an SMT processor. Our adaptive dynamic thread scheduling (ADTS) was previously proposed to achieve higher utilization by allowing a detector thread to make use of wasted pipeline slots with nominal hardware and software costs. The detector thread adaptively switches between various fetch policies. Our previous study showed that a single fixed thread scheduling policy leaves much room for improvement (some 30%) compared to an oracle-scheduled case. In this paper, we take a closer look at ADTS. We implemented the functional model of the ADTS and its software architecture to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum. We report that performance could be improved by as much as 25%.

31 citations
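The core of ADTS is a detector thread that picks a fetch policy per scheduling quantum from observed behavior. A minimal sketch of that control loop is below; the policy names and the stall-share heuristic are illustrative assumptions, not the paper's actual ADTS heuristics.

```python
# Sketch of adaptive fetch-policy switching between scheduling quanta.
# The stall-based heuristic and policy set are assumed for illustration.

def icount(threads):
    # ICOUNT: fetch from the thread with the fewest in-flight instructions.
    return min(threads, key=lambda t: t["in_flight"])

def round_robin(threads, state={"next": 0}):
    # Plain rotation among threads, ignoring pipeline occupancy.
    t = threads[state["next"] % len(threads)]
    state["next"] += 1
    return t

def pick_policy(quantum_stats):
    # Detector-thread heuristic (assumed): if one thread dominated last
    # quantum's stalls, ICOUNT rebalances fetch bandwidth; otherwise
    # round-robin is good enough.
    total = sum(quantum_stats.values()) or 1
    worst = max(quantum_stats.values())
    return icount if worst / total > 0.5 else round_robin

threads = [{"id": 0, "in_flight": 12}, {"id": 1, "in_flight": 3}]
policy = pick_policy({0: 80, 1: 20})    # thread 0 caused 80% of stalls
print(policy is icount)                 # -> True
print(policy(threads)["id"])            # -> 1 (fewest in-flight instructions)
```

The point of the functional model is exactly this separation: the policies themselves are fixed hardware mechanisms, while the detector thread's selection heuristic is software and can be swapped per quantum.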



Proceedings ArticleDOI
24 Jun 2003
TL;DR: A hardware/software co-design paradigm that uses a PIM module to efficiently execute motion estimation operations and an efficient data distribution mechanism to effectively support parallel executions among these smaller PIM modules is presented.
Abstract: Motion estimation is the most time-consuming stage of MPEG-family encoding; it reportedly absorbs up to 90% of the total execution time of MPEG processing. Therefore, we propose a hardware/software co-design paradigm that uses a PIM module to efficiently execute motion estimation operations. We use a PIM module to reduce the penalty caused by the large number of memory accesses. We segment the PIM module into small pieces so that each smaller PIM module can execute the operations in parallel. However, executing the operations in parallel incurs critical overheads: a huge amount of data must be replicated across many of these smaller PIM modules. These replications require not only a huge number of additional memory accesses but also additional address-generation calculations. Therefore, we also present an efficient data distribution mechanism to effectively support parallel execution among these smaller PIM modules. With our paradigm, the host processor is relieved of the computationally intensive and data-intensive workload of motion estimation. We observed up to a 2034× reduction in the number of memory accesses and up to a 439× performance improvement for the execution of motion estimation operations when using our computing paradigm.

11 citations
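The memory-access intensity the abstract describes comes from the block-matching kernel itself. A minimal full-search motion estimation using the sum of absolute differences (SAD) looks like this; frame size, block size, and search range are toy values for illustration, not the paper's configuration.

```python
# Minimal full-search block-matching motion estimation with SAD — the
# kernel that dominates MPEG encoding time and motivates a PIM offload.

def sad(cur, ref, bx, by, dx, dy, bs):
    # Sum of absolute differences between a bs x bs block of the current
    # frame and the same-size block displaced by (dx, dy) in the reference.
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(bs) for x in range(bs))

def full_search(cur, ref, bx, by, bs, rng):
    # Exhaustively test every displacement in [-rng, rng]^2; this loop is
    # the source of the huge memory-access counts cited in the abstract.
    best = None
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            if 0 <= by + dy and by + dy + bs <= len(ref) and \
               0 <= bx + dx and bx + dx + bs <= len(ref[0]):
                cost = sad(cur, ref, bx, by, dx, dy, bs)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

# Reference frame; current frame is the reference shifted right one pixel,
# so the true motion vector for an interior block is (-1, 0).
ref = [[(x * x + y * y * y) % 97 for x in range(8)] for y in range(8)]
cur = [[row[max(x - 1, 0)] for x in range(8)] for row in ref]
print(full_search(cur, ref, 2, 2, 4, 2))  # -> (0, -1, 0): zero-cost match
```

Every candidate displacement re-reads the same reference pixels, which is why moving the loop next to memory (PIM) and distributing blocks carefully pays off so dramatically.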


Journal ArticleDOI
TL;DR: This paper surveys and demonstrates the power of non-strict evaluation in applications executed on distributed architectures, and shows that partial evaluation of memory accesses decreases the traffic in the interconnection network and improves the performance of MPI IS and MPI ISSC applications.
Abstract: This paper surveys and demonstrates the power of non-strict evaluation in applications executed on distributed architectures. We present the design, implementation, and experimental evaluation of single-assignment, incomplete data structures in a distributed memory architecture and Abstract Network Machine (ANM). Incremental Structures (IS), Incremental Structure Software Cache (ISSC), and Dynamic Incremental Structures (DIS) provide non-strict data access and fully asynchronous operations that make them highly suited for the exploitation of fine-grain parallelism in distributed memory systems. We focus on split-phase memory operations and non-strict information processing under a distributed address space to improve the overall system performance. A novel technique of optimization at the communication level is proposed and described. We use partial evaluation of local and remote memory accesses not only to remove much of the excess overhead of message passing, but also to reduce the number of messages when some information about the input or part of the input is known. We show that split-phase transactions of IS, together with the ability to defer reads, allow partial evaluation of distributed programs without losing determinacy. Our experimental evaluation indicates that commodity PC clusters with both IS and a caching mechanism, ISSC, are more robust. The system can deliver speedup for both regular and irregular applications. We also show that partial evaluation of memory accesses decreases the traffic in the interconnection network and improves the performance of MPI IS and MPI ISSC applications.

8 citations
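The key mechanism behind IS is the single-assignment cell with split-phase, deferred reads: a read that arrives before the write is queued instead of blocking, and is satisfied when the producer writes. A minimal sketch, with illustrative API names rather than the paper's:

```python
# Sketch of a single-assignment I-structure element with deferred reads.
# A read before the write is queued, not blocked; a second write is an
# error, which is what preserves determinacy.

class IStructureCell:
    EMPTY = object()

    def __init__(self):
        self.value = IStructureCell.EMPTY
        self.deferred = []            # continuations waiting for the value

    def read(self, consumer):
        # Split-phase read: request now, consume whenever data arrives.
        if self.value is IStructureCell.EMPTY:
            self.deferred.append(consumer)
        else:
            consumer(self.value)

    def write(self, value):
        # Single-assignment: writing twice is a program error.
        if self.value is not IStructureCell.EMPTY:
            raise RuntimeError("I-structure cell written twice")
        self.value = value
        for consumer in self.deferred:   # wake all deferred readers
            consumer(value)
        self.deferred.clear()

results = []
cell = IStructureCell()
cell.read(results.append)   # read issued before the write: deferred
cell.write(42)              # producer arrives, deferred read satisfied
cell.read(results.append)   # read after the write: satisfied immediately
print(results)              # -> [42, 42]
```

Because the final value a reader sees does not depend on the order in which reads and the write arrive, programs built on such cells stay deterministic even under fully asynchronous, split-phase messaging.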


Proceedings ArticleDOI
22 Apr 2003
TL;DR: This paper presents the design and performance evaluation of the HiDISC (Hierarchical Decoupled Instruction Stream Computer), which provides low memory access latency by introducing enhanced data prefetching techniques at both the hardware and the software levels.
Abstract: This paper presents the design and performance evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). HiDISC provides low memory access latency by introducing enhanced data prefetching techniques at both the hardware and the software levels. Three processors, one for each level of the memory hierarchy, act in concert to mask the memory latency. Our performance evaluation benchmarks include the Data-Intensive Systems Benchmark suite and the DIS Stressmark suite. Our simulation results point to a distinct advantage of the HiDISC system over current prevailing superscalar architectures for both benchmark suites. On average, a 12% improvement in performance is achieved while 17% of cache misses are eliminated.

7 citations
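The essence of a decoupled architecture is that an access stream slips ahead of the execute stream, feeding it operands through a queue so that execution rarely waits on memory. The two-stage sketch below illustrates the idea; it is a simplification of HiDISC's actual three-processor hierarchy, and all names are assumed.

```python
# Sketch of the decoupled access/execute idea: the access stream resolves
# addresses and fetches operands early, pushing them into a queue that
# the execute stream drains without ever computing an address itself.

from collections import deque

memory = {addr: addr * 10 for addr in range(100)}   # toy main memory

def access_stream(addresses, queue):
    # Runs ahead of execution; in hardware each append would be a
    # prefetch/load issued long before the operand is needed.
    for addr in addresses:
        queue.append(memory[addr])

def execute_stream(queue, n):
    # Consumes operands in order; no address computation, no misses.
    return sum(queue.popleft() for _ in range(n))

addrs = [3, 1, 4, 1, 5]
q = deque()
access_stream(addrs, q)               # access processor runs ahead
print(execute_stream(q, len(addrs)))  # -> 140, i.e. 10*(3+1+4+1+5)
```

HiDISC extends this split one level further, dedicating a processor to each level of the memory hierarchy so that each stage hides the latency of the level below it.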


Book ChapterDOI
26 Aug 2003
TL;DR: This paper proposes a new clustered SMT architecture which is appropriate for both multiple threads and single thread environments and shows that the approach significantly reduces power consumption without significantly degrading performance.
Abstract: The excessive heat produced by billions of transistors will be one of the most important challenges in the design of future chips. Many low-power architectures have been developed to reduce the power consumption of microprocessors. However, this has come at the expense of performance, and only a few low-power superscalar designs, such as clustered architectures, target high-performance computing applications.

6 citations


Proceedings ArticleDOI
08 Feb 2003
TL;DR: This paper introduces a hybrid model enhanced with novel compiler support for the dynamic pre-execution of a p-thread, a promising prefetching technique which uses an auxiliary assisting thread in addition to the main program flow.
Abstract: Speculative pre-execution is a promising prefetching technique which uses an auxiliary assisting thread in addition to the main program flow. A prefetching thread (p-thread), which contains the probable future cache-miss instructions and their backward slice, can run on a spare hardware context to prefetch data. Recently, various forms of speculative pre-execution have been developed, including hardware-based and software-based approaches. The hardware-based approach has the advantage of using runtime information dynamically. However, it requires a complex implementation and lacks global information such as data and control flow. On the other hand, the software-oriented approach cannot cope with dynamic events and imposes additional software overhead. As a compromise, this paper introduces a hybrid model enhanced with novel compiler support for the dynamic pre-execution of a p-thread.

4 citations
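A p-thread executes only the backward slice of a delinquent load, so it can run far ahead of the main thread and warm the cache. The sketch below simulates this with a pointer-chasing linked list: the slice is just the pointer chase, and the explicit `cache` set stands in for the real data cache. All names are illustrative.

```python
# Sketch of speculative pre-execution: the p-thread runs the backward
# slice of a delinquent load (the pointer chase) ahead of the main
# thread; the main thread then finds every node already "cached".

class Node:
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def p_thread(head, cache, distance):
    # Backward slice of the miss: only the pointer chase, no computation.
    node = head
    for _ in range(distance):
        if node is None:
            break
        cache.add(id(node))        # touching the node prefetches it
        node = node.next

def main_thread(head, cache):
    total, misses = 0, 0
    node = head
    while node is not None:
        if id(node) not in cache:  # would stall on a real cache miss
            misses += 1
            cache.add(id(node))
        total += node.value        # the full computation the slice omits
        node = node.next
    return total, misses

head = None
for v in [5, 4, 3, 2, 1]:
    head = Node(v, head)           # list: 1 -> 2 -> 3 -> 4 -> 5

cache = set()
p_thread(head, cache, distance=5)  # slice runs ahead of the main thread
print(main_thread(head, cache))    # -> (15, 0): all misses prefetched away
```

The hybrid model's compiler support addresses exactly the hard part this sketch glosses over: extracting that slice automatically and deciding at run time when and how far ahead to launch it.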


Proceedings Article
01 Jan 2003
TL;DR: D-IS and D-ISSC facilitate programming by relaxing synchronization requirements; the traffic in the interconnection network can also be significantly reduced by partial evaluation of local and remote memory accesses, and the speedup of regular and irregular applications can be increased.
Abstract: This paper focuses on non-strict processing, optimization, and partial evaluation of MPI programs which use incremental data structures (ISs). We describe the design and implementation of Distributed IS Software Caches (D-ISSC), which take advantage of spatial and temporal data locality while maintaining the latency tolerance of the distributed IS memory system (D-IS). We show that D-IS and D-ISSC facilitate programming by relaxing synchronization requirements. Our experimental evaluation indicates that the traffic in the interconnection network can also be significantly reduced by partial evaluation of local and remote memory accesses, and that the speedup of regular and irregular applications can be increased.

2 citations
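What makes a software cache over IS data particularly simple is single assignment: a cached value can never become stale, so no invalidation protocol is needed and every cache hit is one network message saved. A minimal sketch, with assumed names and a counter standing in for network traffic:

```python
# Sketch of the D-ISSC idea: a node-local software cache in front of
# remote I-structure reads. Single-assignment cells never change, so
# caching is always safe and directly cuts interconnect traffic.

remote_reads = 0
remote_store = {("node1", i): i * i for i in range(10)}  # remote IS cells

def remote_read(key):
    # Each call models one request/reply message pair on the network.
    global remote_reads
    remote_reads += 1
    return remote_store[key]

class ISSoftwareCache:
    def __init__(self):
        self.cache = {}

    def read(self, key):
        # No invalidation logic: single-assignment data cannot be stale.
        if key not in self.cache:
            self.cache[key] = remote_read(key)
        return self.cache[key]

issc = ISSoftwareCache()
values = [issc.read(("node1", i % 5)) for i in range(20)]
print(sum(values))       # -> 120: 4 * (0 + 1 + 4 + 9 + 16)
print(remote_reads)      # -> 5: 20 reads cost only 5 network messages
```

This is the temporal-locality half of D-ISSC; the spatial half would fetch a whole neighborhood of cells per miss, amortizing one message over several future reads.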



Book ChapterDOI
TL;DR: It is shown that by incorporating non-strict information processing into FFT MPI, a significant reduction in the number of messages can be achieved and the overall system performance can be improved.
Abstract: This paper focuses on the partial evaluation of local and remote memory accesses of distributed applications, not only to remove much of the excess overhead of message-passing implementations, but also to reduce the number of messages when some information about the input data set is known. The use of split-phase memory operations, the exploitation of spatial data locality, and non-strict information processing are described. Through a detailed performance analysis, we establish conditions under which the technique is beneficial. We show that by incorporating non-strict information processing into FFT MPI, a significant reduction in the number of messages can be achieved and the overall system performance can be improved.

1 citation