
Showing papers by "Jean-Luc Gaudiot published in 2009"


Journal ArticleDOI
TL;DR: The analysis shows that the performance of the PARSEC benchmarks may improve by 180 percent, and that of the SPLASH-2 suite by 230 percent, compared to when only the intrinsic parallelism is considered, demonstrating the immense potential of fine-grained value prediction in reducing the communication latency in many-core architectures.
Abstract: The newly emerging many-core-on-a-chip designs have renewed an intense interest in parallel processing. By applying Amdahl's formulation to the programs in the PARSEC and SPLASH-2 benchmark suites, we find that most applications may not have sufficient parallelism to efficiently utilize modern parallel machines. The long sequential portions in these application programs are caused by computation as well as communication latency. However, value prediction techniques may allow the "parallelization" of the sequential portion by predicting values before they are produced. In conventional superscalar architectures, the computation latency dominates the sequential sections. Thus, value prediction techniques may be used to predict the computation result before it is produced. In many-core architectures, since the communication latency increases with the number of cores, value prediction techniques may be used to reduce both the communication and computation latency. In this paper, we extend Amdahl's formulation to model the data redundancy inherent to each benchmark, thereby identifying the potential of value prediction techniques. Our analysis shows that the performance of the PARSEC benchmarks may improve by 180 percent, and that of the SPLASH-2 suite by 230 percent, compared to when only the intrinsic parallelism is considered. This demonstrates the immense potential of fine-grained value prediction in reducing the communication latency in many-core architectures.
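To make the Amdahl's-law reasoning concrete, here is a minimal sketch of the classic formulation the paper builds on. The extended, data-redundancy-aware model is not reproduced here; the numbers below are illustrative assumptions, showing only how shrinking the sequential fraction (as value prediction would) changes the attainable speedup.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Classic Amdahl's law: speedup is bounded by the sequential portion."""
    sequential = 1.0 - parallel_fraction
    return 1.0 / (sequential + parallel_fraction / cores)

# If value prediction "parallelizes" part of the sequential portion,
# the effective parallel fraction grows and the speedup bound rises.
baseline = amdahl_speedup(0.90, 64)   # hypothetical 90% parallel application
with_vp  = amdahl_speedup(0.97, 64)   # hypothetical: prediction hides most of the serial part
print(f"{baseline:.1f}x -> {with_vp:.1f}x")
```

Note how little the sequential fraction has to shrink for the bound to more than double at 64 cores, which is why the paper's per-benchmark redundancy analysis matters.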

27 citations


Book ChapterDOI
30 Jul 2009
TL;DR: This research shows that, without control over the number of entries each thread can occupy in system resources such as the instruction fetch queue and/or reorder buffer, a scenario called "mutual-hindrance" execution takes place, and demonstrates that active resource sharing control is essential for future multicore multithreading microprocessor design.
Abstract: One major obstacle faced by designers when entering the multicore era is how to harness the massive computing power which these cores provide. Since Instruction-Level Parallelism (ILP) is inherently limited, one single thread is not capable of efficiently utilizing the resources of a single core. Hence, a Simultaneous MultiThreading (SMT) microarchitecture can be introduced in an effort to achieve improved system resource utilization and a correspondingly higher instruction throughput through the exploitation of Thread-Level Parallelism (TLP) as well as ILP. However, when multiple threads execute concurrently in a single core, they automatically compete for system resources. Our research shows that, without control over the number of entries each thread can occupy in system resources such as the instruction fetch queue and/or reorder buffer, a scenario called "mutual-hindrance" execution takes place. Conversely, introducing active resource sharing control mechanisms causes the opposite situation ("mutual-benefit" execution), with a possible significant performance improvement and lower cache miss frequency. This demonstrates that active resource sharing control is essential for future multicore multithreading microprocessor design.
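The resource sharing control idea can be sketched abstractly. The class below is a hypothetical illustration, not the paper's mechanism: it caps each thread's occupancy of a shared structure (think of a reorder buffer) so that a stalled thread cannot monopolize it and hinder its co-runner.

```python
# Hypothetical sketch: capping per-thread occupancy of a shared resource
# so one stalled thread cannot starve the other ("mutual-hindrance").
class SharedBuffer:
    def __init__(self, size: int, per_thread_cap: int):
        self.size = size          # total entries in the shared structure
        self.cap = per_thread_cap # maximum entries any one thread may hold
        self.used = {}            # thread id -> entries currently held

    def try_allocate(self, tid: int) -> bool:
        held = self.used.get(tid, 0)
        if sum(self.used.values()) >= self.size or held >= self.cap:
            return False          # denied: buffer full, or thread over its cap
        self.used[tid] = held + 1
        return True

buf = SharedBuffer(size=32, per_thread_cap=24)
# A stalled thread 0 keeps allocating but is stopped at its cap,
# leaving entries free for thread 1 to make progress.
while buf.try_allocate(0):
    pass
assert buf.try_allocate(1)
```

Without the cap (`per_thread_cap == size`), thread 0 would fill all 32 entries and thread 1 would be denied; this is the uncontrolled sharing scenario the paper argues against.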

4 citations


Proceedings ArticleDOI
20 Jul 2009
TL;DR: RHE (Reduced version of Harmony for Education), a lightweight JVM instructional tool, is introduced, showing that with RHE, engineers with little or no knowledge of JVM design can become familiar with various JVM components within a week.
Abstract: Teaching Java Virtual Machine (JVM) has become essential in training the next generation of Web application engineers, embedded software engineers, as well as virtual machine researchers and practitioners. However, due to the lack of a suitable instructional tool, it is difficult for students to get a sufficiently deep understanding of JVM design. In this paper, we introduce RHE (Reduced version of Harmony for Education), a lightweight JVM instructional tool. Due to its modular design and simple implementation, RHE can also be used as a research prototype with quick turnaround time. Our experience shows that with RHE, engineers with little or no knowledge of JVM design can become familiar with various JVM components within a week.

3 citations


Proceedings ArticleDOI
23 May 2009
TL;DR: Packer is proposed, a space-and-time-efficient parallel garbage collection algorithm based on the novel concept of virtual spaces; to reduce its garbage collection pause time, the compacting GC parallelization problem is reduced to a tree traversal parallelization problem, applied to both normal and large object compaction.
Abstract: The fundamental challenge of garbage collector (GC) design is to maximize the recycled space with minimal time overhead. For efficient memory management, in many GC designs the heap is divided into large object space (LOS) and non-large object space (non-LOS). When one of the spaces is full, garbage collection is triggered even though the other space may still have a lot of free room, thus leading to inefficient space utilization. Also, space partitioning in existing GC designs implies different GC algorithms for different spaces. This not only prolongs the pause time of garbage collection, but also makes collection not efficient on multiple spaces. To address these problems, we propose Packer, a space-and-time-efficient parallel garbage collection algorithm based on the novel concept of virtual spaces. Instead of physically dividing the heap into multiple spaces, Packer manages multiple virtual spaces in one physically shared space. With multiple virtual spaces, Packer offers the advantage of efficient memory management. At the same time, with one physically shared space, Packer avoids the problem of inefficient space utilization. To reduce the garbage collection pause time of Packer, we also propose a novel parallelization method that is applicable to multiple virtual spaces. We reduce the compacting GC parallelization problem into a tree traversal parallelization problem, and apply it to both normal and large object compaction.
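The virtual-space idea can be illustrated with a toy allocator. This is an assumption-laden sketch, not Packer's implementation: two logical spaces (LOS and non-LOS) draw from one physically shared budget, so collection is triggered only when the shared heap is exhausted, not when one statically sized partition fills up.

```python
# Hedged sketch: two "virtual spaces" carved on demand out of one
# physically shared heap. Sizes and names are illustrative only.
class SharedHeap:
    def __init__(self, size: int):
        self.size = size
        self.used = {"los": 0, "non_los": 0}  # per-virtual-space usage

    def allocate(self, space: str, nbytes: int) -> bool:
        if sum(self.used.values()) + nbytes > self.size:
            return False   # GC fires only when the *shared* heap is full
        self.used[space] += nbytes
        return True

heap = SharedHeap(size=100)
assert heap.allocate("los", 70)       # LOS may grow past a 50/50 static split
assert heap.allocate("non_los", 30)   # non-LOS still fits in the remainder
assert not heap.allocate("los", 1)    # only now would collection be triggered
```

Under a static 50/50 physical split, the 70-unit LOS allocation above would have triggered a collection while 50 units of non-LOS space sat idle; sharing the physical space avoids exactly that waste.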

2 citations


Proceedings ArticleDOI
18 May 2009
TL;DR: This paper proposes the Space Tuner, which utilizes the novel concept of allocation speed to reduce wasted space, and a novel parallelization method that reduces the compacting GC parallelization problem to a tree traversal parallelization problem.
Abstract: As multithreaded server applications and runtime systems prevail, garbage collection is becoming an essential feature to support high performance systems. The fundamental issue of garbage collector (GC) design is to maximize the recycled space with minimal time overhead. This paper proposes two innovative solutions: one to improve space efficiency, and the other to improve time efficiency. To achieve space efficiency, we propose the Space Tuner, which utilizes the novel concept of allocation speed to reduce wasted space. Conventional static space partitioning techniques often lead to inefficient space utilization. The Space Tuner adjusts the heap partitioning dynamically such that when a collection is triggered, all space partitions are fully filled. To achieve time efficiency, we propose a novel parallelization method that reduces the compacting GC parallelization problem to a tree traversal parallelization problem. This method can be applied to both normal and large object compaction. Object compaction is hard to parallelize due to strong data dependencies: a source object cannot be moved to its target location until the object originally in the target location has been moved out. Our proposed algorithm overcomes these difficulties by dividing the heap into equal-sized blocks and parallelizing the movement of the independent blocks. It is noteworthy that these proposed algorithms are generic, such that they can be utilized in different GC designs.
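The block-level dependency described above can be sketched as a wave-by-wave parallel traversal. This is an illustrative reconstruction under simplifying assumptions (a fixed block-to-target map with no circular dependencies), not the paper's algorithm: a block is "ready" once its destination block has been evacuated, and all ready blocks in a wave can be moved concurrently.

```python
# Hedged sketch of block-level sliding compaction: block i cannot be
# copied into its target region until the block occupying that region
# has itself been moved out. Blocks with free targets move in parallel.
from concurrent.futures import ThreadPoolExecutor

def compact(blocks, target_of):
    """blocks: ids in heap order; target_of: block id -> destination id.
    Assumes no dependency cycles (sliding compaction toward lower ids)."""
    moved = set()
    waves = []
    while len(moved) < len(blocks):
        # a block is ready if it stays put or its destination is evacuated
        ready = [b for b in blocks
                 if b not in moved and (target_of[b] == b or target_of[b] in moved)]
        with ThreadPoolExecutor() as pool:   # move one "wave" concurrently
            waves.append(sorted(pool.map(lambda b: b, ready)))
        moved.update(ready)
    return waves

# Blocks 0 and 1 stay put; 2 and 3 slide into them; 4 then slides into 2.
waves = compact([0, 1, 2, 3, 4], {0: 0, 1: 1, 2: 0, 3: 1, 4: 2})
print(waves)   # [[0, 1], [2, 3], [4]]
```

Each wave corresponds to one level of the dependency structure; with equal-sized blocks the number of waves stays small, which is what makes the tree-traversal formulation attractive for parallel compaction.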

1 citation


Journal ArticleDOI
01 May 2009
TL;DR: While the DDQ (Decoupled Dispatch Queues) design achieves levels of performance which are comparable to what would be obtained in a superscalar machine with a large dispatch queue, the approach can be designed with small, distributed dispatch queues which consequently can be implemented with low hardware complexity and high clock rates.
Abstract: Continuing demands for high degrees of Instruction Level Parallelism (ILP) require large dispatch queues (or centralized reservation stations) in modern superscalar microprocessors. However, such large dispatch queues are inevitably accompanied by high circuit complexity which would correspondingly limit the pipeline clock rates. In other words, increasing the size of the dispatch queue ultimately hinders attempts at increasing the clock speed. This is due to the fact that most of today's designs are based upon a centralized dispatch queue which itself depends on globally broadcasting operations to wakeup and select the ready instructions. As an alternative to this conventional design, we propose the design of hierarchically distributed dispatch queues, based on the access/execute decoupled architectures. Simulation results based on 14 data intensive benchmarks show that while our DDQ (Decoupled Dispatch Queues) design achieves levels of performance which are comparable to what would be obtained in a superscalar machine with a large dispatch queue, our approach can be designed with small, distributed dispatch queues which consequently can be implemented with low hardware complexity and high clock rates.
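The access/execute decoupling that DDQ builds on can be sketched at a very high level. The queue names and instruction format below are illustrative assumptions, not the paper's microarchitecture: memory and compute instructions dispatch from separate small queues, linked by a value queue, instead of waking up out of one large centralized queue.

```python
# Hedged sketch of access/execute decoupling: two small dispatch queues
# (memory vs. compute) communicate through a value queue, avoiding a
# single large, globally broadcast wakeup/select structure.
from collections import deque

access_q, execute_q, value_q = deque(), deque(), deque()

def dispatch(inst):
    """Steer each instruction to the access or execute stream."""
    (access_q if inst["mem"] else execute_q).append(inst)

def step():
    if access_q:                       # access stream runs ahead, loading data
        value_q.append(access_q.popleft()["addr"])
    if execute_q and value_q:          # execute stream consumes loaded values
        execute_q.popleft(); value_q.popleft()

dispatch({"mem": True, "addr": 0x100})
dispatch({"mem": False})
step()
assert not access_q and not execute_q  # both streams drained independently
```

Because each stream only selects from its own small queue, the wakeup/select logic stays local, which is the hardware-complexity argument the abstract makes against a single large dispatch queue.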

1 citation


Journal ArticleDOI
TL;DR: This work first considers that large-scale systems such as Grids, which are devoted to exploiting parallelism in applications, may entail the coordination of large numbers of processing elements, synchronization and communication at large scale, and new programming models and programming languages.
Abstract: After a decade of visible success stories in building clusters, then clusters of clusters, what we call "Grid systems" are expected to connect large numbers of heterogeneous resources (PCs, databases, HPC clusters, instruments, sensors, visualization tools, etc.), to be accessed by many users, to execute a large variety of applications (number crunching, data access, multimedia, etc.) and to deal with many scientific fields (health, economy, computing, etc.). Such a context forces applications to work in a dynamically changing computing environment, which gives rise to approaches to mastering complexity that are radically different from those found in traditional computing environments. An architecture (in the sense of 'hardware') alone is not sufficient to ensure an efficient use of resources, and as with other fields in computing research, we need also to focus on middleware and languages, to cite but a few of the challenges. We first considered that large-scale systems such as Grids, which are devoted to exploiting parallelism in applications, may entail the coordination of large numbers of processing elements, synchronization and communication at large scale, and new programming models and programming languages.

1 citation