
Showing papers by Jean-Luc Gaudiot published in 2011


Journal ArticleDOI
TL;DR: The results indicate that implementing speculative execution operations in GPU hardware, to reduce the software performance overheads, yields an almost tenfold reduction in control-divergent sequential operations with only moderate hardware and power-consumption overheads.
Abstract: GPUs and CPUs have fundamentally different architectures. It is conventional wisdom that GPUs can accelerate only those applications that exhibit very high parallelism, especially vector parallelism such as image processing. In this paper, we explore the possibility of using GPUs for value prediction and speculative execution: we implement software value prediction techniques to accelerate programs with limited parallelism, and software speculation techniques to accelerate programs that contain runtime parallelism, which are hard to parallelize statically. Our experimental results show that, due to the relatively high overhead, mapping software value prediction techniques onto existing GPUs may not bring any immediate performance gain. On the other hand, although software speculation techniques introduce some overhead as well, mapping them onto existing GPUs can already bring some performance gain over the CPU. Based on these observations, we explore the hardware implementation of speculative execution operations on GPU architectures to reduce the software performance overheads. The results indicate that the hardware extensions yield an almost tenfold reduction in control-divergent sequential operations with only moderate hardware (5–8%) and power consumption (1–5%) overheads.
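
To make the speculation scheme above concrete, the following C sketch illustrates value-predicted, chunk-wise speculative execution of a loop with a carried dependence, followed by validation and sequential recovery. It is an illustrative analogue only, with an assumed naive predictor and invented names; it is not the paper's GPU implementation, where each chunk would run in its own GPU thread.

```c
/*
 * Hypothetical sketch of software value prediction for speculative
 * loop parallelization (illustrative only; not the paper's code).
 *
 * The loop carries a dependence through `sum`. Each chunk is executed
 * speculatively from a *predicted* incoming value; a validation pass
 * re-executes any chunk whose prediction turns out to be wrong.
 */
#include <stdio.h>

#define N      1024
#define CHUNKS 8
#define CSIZE  (N / CHUNKS)

static int f(int sum, int x) { return sum + (x % 7); }  /* loop body */

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    int predicted[CHUNKS];   /* predicted incoming value of `sum` per chunk   */
    int produced[CHUNKS];    /* value of `sum` produced at the end of a chunk */

    /* 1. Predict: a naive predictor that assumes the value does not change. */
    predicted[0] = 0;
    for (int c = 1; c < CHUNKS; c++) predicted[c] = predicted[c - 1];

    /* 2. Speculate: once the incoming value is predicted, the chunks are    */
    /*    independent, so on a GPU each chunk could run in its own thread.   */
    for (int c = 0; c < CHUNKS; c++) {
        int sum = predicted[c];
        for (int i = c * CSIZE; i < (c + 1) * CSIZE; i++) sum = f(sum, a[i]);
        produced[c] = sum;
    }

    /* 3. Validate and recover: a chunk whose prediction mismatches the      */
    /*    actual value produced by its predecessor is re-executed serially.  */
    for (int c = 1; c < CHUNKS; c++) {
        if (predicted[c] != produced[c - 1]) {
            int sum = produced[c - 1];
            for (int i = c * CSIZE; i < (c + 1) * CSIZE; i++) sum = f(sum, a[i]);
            produced[c] = sum;
        }
    }
    printf("sum = %d\n", produced[CHUNKS - 1]);
    return 0;
}
```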

23 citations


Journal ArticleDOI
TL;DR: A centralized approach is proposed that simultaneously reduces power in the processor units with the highest dissipation: the reorder buffer, instruction queue, load/store queue, and register files. It is based on the observation that the utilization of these units varies significantly during cache-miss periods.
Abstract: Power minimization has become a primary concern in microprocessor design. In recent years, many circuit and micro-architectural innovations have been proposed to reduce power in individual processor units. However, many of these prior efforts have concentrated on approaches that require considerable redesign and verification effort, and it has not been investigated whether these techniques can be combined. The challenge, therefore, is to find a simple, centralized algorithm that addresses power in more than one unit, and ultimately in the entire chip, with the least redesign and verification effort, the lowest possible design risk, and the least hardware overhead. This paper proposes such a centralized approach, which attempts to simultaneously reduce power in the processor units with the highest dissipation: the reorder buffer, instruction queue, load/store queue, and register files. It is based on the observation that the utilization of these units varies significantly during cache-miss periods. We therefore propose to dynamically adjust the size, and thus the power dissipation, of these resources during such periods. The circuit-level modifications required for this resource adaptation are presented. Simulation results show a substantial power reduction at the cost of a negligible performance impact and a small hardware overhead.
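
As a rough illustration of the centralized resizing idea, the C sketch below shrinks the effective sizes of the four structures while a long-latency miss is outstanding and restores them afterwards. The structure sizes, the 50% shrink factor, and the miss signal are assumptions made for illustration, not the paper's circuit-level mechanism.

```c
/*
 * Hypothetical sketch: when a long-latency cache miss is outstanding,
 * shrink the effective sizes of the power-hungry structures; restore
 * them when the miss returns. Sizes and shrink factor are assumptions.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int full_size;      /* number of entries when fully enabled     */
    int active_size;    /* entries currently powered/clock-enabled  */
} resource_t;

static void resize(resource_t *r, int entries) {
    r->active_size = entries;
    printf("%-16s active entries: %d/%d\n", r->name, r->active_size, r->full_size);
}

int main(void) {
    resource_t units[] = {
        { "reorder buffer",   128, 128 },
        { "issue queue",       64,  64 },
        { "load/store queue",  48,  48 },
        { "register file",    160, 160 },
    };
    const int n = sizeof(units) / sizeof(units[0]);

    bool miss_outstanding = true;   /* e.g., signalled when an L2 miss is detected */

    if (miss_outstanding) {
        /* Utilization drops while the miss is serviced, so gate half of each unit. */
        for (int i = 0; i < n; i++) resize(&units[i], units[i].full_size / 2);
    }

    /* ... miss data returns ... */
    miss_outstanding = false;
    for (int i = 0; i < n; i++) resize(&units[i], units[i].full_size);
    return 0;
}
```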

14 citations


Book ChapterDOI
21 Oct 2011
TL;DR: This paper studies the impact of memory-side acceleration on performance and evaluates its implementation feasibility, including bus bandwidth utilization, hardware cost, and energy consumption, showing that this technique improves performance by up to 20% and yields up to 12.77% energy savings when implemented in 32 nm technology.
Abstract: As Extensible Markup Language (XML) becomes prevalent in cloud computing environments, it also introduces significant performance overheads. In this paper, we analyze the performance of XML parsing and identify that a significant fraction of the performance overhead is in fact incurred by memory data loading. To address this problem, we propose implementing memory-side acceleration on top of computation-side acceleration of XML parsing. To this end, we study the impact of memory-side acceleration on performance and evaluate its implementation feasibility, including bus bandwidth utilization, hardware cost, and energy consumption. Our results show that this technique improves performance by up to 20% and yields up to 12.77% energy savings when implemented in 32 nm technology.
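
A purely software analogue of taking data loading off the parser's critical path is sketched below in C: the next chunk of the document is prefetched while the current chunk is being scanned. This only illustrates the motivation for memory-side acceleration; the paper's proposal is a hardware mechanism, and the toy scanner and chunk size here are assumptions.

```c
/*
 * Software analogue of overlapping document loading with parsing: while one
 * chunk is being scanned, the next chunk is prefetched so that part of the
 * memory-loading latency is hidden behind computation.
 */
#include <stdio.h>
#include <string.h>

#define CHUNK 4096

/* Toy "parser": counts XML tag openers in a chunk. */
static long scan_chunk(const char *buf, long len) {
    long tags = 0;
    for (long i = 0; i < len; i++)
        if (buf[i] == '<') tags++;
    return tags;
}

int main(void) {
    static char doc[8 * CHUNK];
    memset(doc, 'x', sizeof doc);
    memcpy(doc, "<a><b>text</b></a>", 18);   /* a tiny XML fragment up front */

    long tags = 0;
    long total = sizeof doc;
    for (long off = 0; off < total; off += CHUNK) {
        long len = (total - off < CHUNK) ? (total - off) : CHUNK;

        /* Ask the hardware to start loading the next chunk into the cache   */
        /* before the parser needs it (GCC/Clang builtin).                   */
        if (off + CHUNK < total)
            __builtin_prefetch(doc + off + CHUNK, 0 /* read */, 1);

        tags += scan_chunk(doc + off, len);
    }
    printf("tags seen: %ld\n", tags);
    return 0;
}
```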

13 citations


Journal ArticleDOI
TL;DR: The RHE, or Reduced Harmony for Education, is a lightweight instructional JVM with which engineers who have little or no prior knowledge of JVM design can become familiar with the essential JVM components in less than 40 hours.
Abstract: Java Virtual Machine (JVM) education has become essential in training embedded software engineers as well as virtual machine researchers and practitioners. However, due to the lack of suitable instructional tools, it is difficult for students to obtain hands-on experience or a deep understanding of JVM design. To address this issue, the authors designed the RHE, or Reduced Harmony for Education, a lightweight instructional JVM. The RHE is extremely simple and yet contains all the essential modules of a JVM. Furthermore, it comes with several test programs designed to familiarize users with these modules. In the past few years, the authors have successfully used the RHE to train new engineers in JVM technology. This training experience shows that, with the RHE, it takes less than 40 hours for engineers with little or no prior knowledge of JVM design to become familiar with the essential JVM components.
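
To give a flavor of the kind of "essential JVM component" such an instructional VM exposes, here is a toy fetch-decode-dispatch interpreter loop in C. The opcode set and encoding are invented for illustration; this is not RHE code and not real JVM bytecode.

```c
/*
 * Toy stack-machine interpreter loop, illustrating the execution-engine
 * module a student would study in an instructional VM. Opcodes are invented.
 */
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

int main(void) {
    /* Program: (2 + 3) * 4 */
    int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
    int stack[16], sp = 0, pc = 0;

    for (;;) {
        switch (code[pc++]) {                /* fetch-decode-dispatch */
        case OP_PUSH:  stack[sp++] = code[pc++];            break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp];     break;
        case OP_MUL:   sp--; stack[sp - 1] *= stack[sp];     break;
        case OP_PRINT: printf("%d\n", stack[sp - 1]);        break;
        case OP_HALT:  return 0;
        }
    }
}
```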

10 citations


Proceedings ArticleDOI
30 Sep 2011
TL;DR: This paper selects nine widely adopted cryptography algorithms and considers their overhead not only from the perspective of computation but also in terms of memory access patterns, breaking down function execution time to identify the software bottlenecks suitable for hardware acceleration.
Abstract: Data encryption/decryption has become an essential component of modern information exchange. However, executing these cryptographic algorithms often incurs substantial overhead, and the need to reduce this overhead arises correspondingly. In this paper, we select nine widely adopted cryptography algorithms and study their workload characteristics. Unlike many previous works, we consider the overhead not only from the perspective of computation but also with a focus on the memory access pattern. We break down function execution time to identify the software bottlenecks suitable for hardware acceleration, and we then categorize the operations these algorithms need. In particular, we introduce a concept called the 'Load-Store Block' (LSB) and perform LSB identification on the various algorithms. Our results show that, for these cryptographic algorithms, the execution rate of most hotspot functions is more than 60%, the memory-access instruction ratio is mostly above 60%, and LSB instructions account for more than 30% of the selected benchmarks. Based on these findings, we suggest future directions for designing either a hardware accelerator coupled with a microprocessor or a microprocessor specialized for cryptography applications.
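
Since the abstract only names the Load-Store Block concept, the following C sketch uses an assumed working definition (a run of loads, followed by ALU work, closed by a store) over a toy instruction trace, and reports the memory-access instruction ratio alongside the LSB count. It is a hypothetical illustration of the characterization methodology, not the paper's toolchain.

```c
/*
 * Hedged sketch of workload characterization over a toy instruction trace:
 * memory-access instruction ratio plus candidate "Load-Store Blocks",
 * here taken to mean loads, then ALU work, closed by a store.
 */
#include <stdio.h>

typedef enum { LOAD, STORE, ALU, BRANCH } op_t;

int main(void) {
    op_t trace[] = { LOAD, LOAD, ALU, ALU, STORE,     /* LSB #1 */
                     BRANCH, ALU,
                     LOAD, ALU, STORE,                /* LSB #2 */
                     ALU, BRANCH };
    int n = sizeof trace / sizeof trace[0];

    int mem = 0, lsb = 0;
    int in_block = 0;                 /* saw a load, waiting for the closing store */
    for (int i = 0; i < n; i++) {
        if (trace[i] == LOAD || trace[i] == STORE) mem++;
        if (trace[i] == LOAD)  in_block = 1;
        if (trace[i] == STORE && in_block) { lsb++; in_block = 0; }
        if (trace[i] == BRANCH) in_block = 0;         /* a branch breaks the block */
    }
    printf("memory-access ratio: %.0f%%\n", 100.0 * mem / n);
    printf("load-store blocks  : %d\n", lsb);
    return 0;
}
```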

9 citations


Journal ArticleDOI
TL;DR: Two techniques are proposed: the Space Tuner, which utilizes the novel concept of allocation speed to reduce wasted space, and a novel parallelization method that reduces the compacting-GC parallelization problem to a tree-traversal parallelization problem to improve time efficiency.
Abstract: As multithreaded server applications and runtime systems prevail, garbage collection is becoming an essential feature for supporting high-performance systems, especially those running data-intensive applications. The fundamental issue in garbage collector (GC) design is to maximize the recycled space with minimal time overhead. This paper proposes two innovative solutions: one to improve space efficiency, and the other to improve time efficiency. To achieve space efficiency, we propose the Space Tuner, which utilizes the novel concept of allocation speed to reduce wasted space. Conventional static space-partitioning techniques often lead to inefficient space utilization; the Space Tuner adjusts the heap partitioning dynamically such that, when a collection is triggered, all space partitions are fully filled. To achieve time efficiency, we propose a novel parallelization method that reduces the compacting-GC parallelization problem to a tree-traversal parallelization problem. This method can be applied to both normal and large-object compaction. Object compaction is hard to parallelize due to strong data dependencies: a source object cannot be moved to its target location until the object originally in that location has been moved out. Our proposed algorithm overcomes this difficulty by dividing the heap into equal-sized blocks and parallelizing the movement of the independent blocks, as sketched below. Notably, these algorithms are generic and can be utilized in different GC designs. The proposed techniques have been implemented in the Apache Harmony JVM, and we evaluated them with the SPECjbb and DaCapo benchmark suites. The experimental results demonstrate that our algorithms greatly improve space utilization and that the corresponding parallelization schemes are scalable, which brings time efficiency.
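
The block-wise compaction idea can be sketched as a dependence-respecting schedule: a block may be evacuated only after the block its live data moves into has itself been emptied, and all blocks whose dependence is already satisfied are mutually independent, so a real collector could hand them to parallel GC threads. The C sketch below processes such a ready list sequentially; the block count and dependence layout are invented for illustration and do not come from the paper.

```c
/*
 * Hedged sketch of block-wise compaction scheduling: the heap is split into
 * equal-sized blocks; block b may be evacuated once its destination block
 * has already been emptied. Ready blocks are independent of each other and
 * could be evacuated by parallel GC threads.
 */
#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 6

int main(void) {
    /* dest[b] = block that b's live objects are copied into (-1: compacts in place). */
    int  dest[NBLOCKS] = { -1, 0, 1, 1, 3, 4 };
    bool evacuated[NBLOCKS] = { false };

    int done = 0;
    while (done < NBLOCKS) {
        int progress = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            if (evacuated[b]) continue;
            /* Ready when the destination block has already been emptied. */
            if (dest[b] == -1 || evacuated[dest[b]]) {
                printf("evacuate block %d -> %d\n", b, dest[b] == -1 ? b : dest[b]);
                evacuated[b] = true;
                done++;
                progress++;
            }
        }
        if (!progress) { fprintf(stderr, "cycle in block dependences\n"); break; }
    }
    return 0;
}
```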

5 citations