Author

Jiao-Wei Huang

Bio: Jiao-Wei Huang is an academic researcher from National Taiwan University. The author has contributed to research in topics: Low-power electronics & Non-uniform memory access. The author has an h-index of 2 and has co-authored 3 publications receiving 31 citations.

Papers
Proceedings Article
18 Jan 2010
TL;DR: In this article, a new weight assignment scheme for logic switching activity was proposed, which enhances the IR-drop assessment capability of the existing weighted switching activity (WSA) model by including the power grid network structure information.
Abstract: For two-pattern at-speed scan testing, the excessive power supply noise at the launch cycle may cause the circuit under test to malfunction, leading to yield loss. This paper proposes a new weight assignment scheme for logic switching activity; it enhances the IR-drop assessment capability of the existing weighted switching activity (WSA) model. By including the power grid network structure information, the proposed weight assignment better reflects the regional IR-drop impact of each switching event. For ATPG, such comprehensive information is crucial in determining whether a switching event burdens the IR-drop effect. Simulation results show that, compared with previous weight assignment schemes, the estimated regional IR-drop profiles better correlate with those generated by commercial tools.

21 citations
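
The weighted switching activity (WSA) idea in the abstract above can be illustrated with a minimal sketch: each gate that toggles at the launch cycle contributes its weight to a total. The abstract does not give the weight formula, so the fan-out weights and the per-gate region factor below (standing in for power grid structure information) are assumed placeholder values, not the paper's scheme.

```python
from typing import Dict, Set


def weighted_switching_activity(toggled: Set[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the gates that switch at the launch cycle."""
    return sum(weights[g] for g in toggled)


# Hypothetical inputs: a fan-out-based weight per gate, scaled by an
# assumed region factor that stands in for power-grid information.
fanout_weight = {"g1": 2.0, "g2": 3.0, "g3": 1.0}
region_factor = {"g1": 1.0, "g2": 1.5, "g3": 0.5}
weights = {g: fanout_weight[g] * region_factor[g] for g in fanout_weight}

# Gates g1 and g2 toggle for this (made-up) test pattern.
wsa = weighted_switching_activity({"g1", "g2"}, weights)
```

A pattern generator could compare such a score per region against a threshold to flag patterns likely to cause excessive IR-drop; how the threshold is chosen is outside this sketch.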

Proceedings Article
07 Nov 2010
TL;DR: The experimental results show that the proposed scheduling policy improves system throughput by 21% compared to FR-FCFS (first-ready first-come-first-serve) on an MPSoC for mobile phones with QoS guarantee.
Abstract: Optimizing memory system performance is critical for delivering high system performance for multimedia applications since they are usually memory intensive. As the number of IP cores in a multimedia MPSoC (Multi-Processor System-on-Chip) continues to increase, system performance will be eventually limited by the memory system. In this paper, we tackle the memory performance issue of multimedia MPSoCs through intelligent memory access scheduling. We observe that since memory resources are shared by all processing elements in an MPSoC, interferences among requests from different IP cores cause not only delay in memory accesses but also unfair DRAM accesses among IPs. Traditional memory scheduling policies that only emphasize maximizing memory system throughput do not take into account these interferences. Therefore, in this paper, we propose a hierarchical memory scheduling policy to minimize interferences among requests. The experimental results show that the proposed scheduling policy improves system throughput by 21% compared to FR-FCFS (first-ready first-come-first-serve) on an MPSoC for mobile phones with QoS guarantee.

8 citations
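
For context, the FR-FCFS baseline named in the abstract above can be sketched as follows: requests that hit the currently open DRAM row are served first, and ties are broken by age. This shows only the baseline policy, not the proposed hierarchical scheduler, and the `Request` type and `open_rows` map are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    arrival: int  # arrival time in cycles
    bank: int     # target DRAM bank
    row: int      # target row within the bank


def fr_fcfs_pick(queue: List[Request], open_rows: Dict[int, int]) -> Optional[Request]:
    """Pick the oldest row-hit request; if none hits, pick the oldest request."""
    if not queue:
        return None
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)


queue = [Request(arrival=0, bank=0, row=5), Request(arrival=1, bank=0, row=7)]
picked_with_hit = fr_fcfs_pick(queue, {0: 7})  # row 7 is open in bank 0
picked_no_hit = fr_fcfs_pick(queue, {})        # no row is open
```

Because the policy looks only at row hits and age, it has no notion of which IP core issued a request, which is the interference problem the abstract describes.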

Proceedings Article
09 Mar 2012
TL;DR: A run-time mechanism is proposed that predicts the memory stall cycles of an individual IP and makes the power gating decision based on the predicted memory latency and its break-even time, so that a power-gated IP can be woken up in advance to avoid performance degradation.
Abstract: As technology continues to scale, reducing leakage is critical to achieve energy efficiency. Power gating can potentially save a significant part of leakage but it incurs both energy and performance penalties. Therefore, power gating decisions need to be made carefully. In the current low-power SoC design, an IP core is power gated when it is not operating. In this paper, we explore the IP idle time due to memory accesses for further leakage reduction. In MPSoCs, due to contention among concurrent memory accesses from different IP cores, memory stall cycles vary significantly, ranging from 10 to 600 cycles according to our experiments. We propose a run-time mechanism that predicts the memory stall cycles of an individual IP and makes the power gating decision based on the predicted memory latency and its break-even time. With the predicted memory latency, a power-gated IP can be woken up in advance to avoid performance degradation. The experimental results show that our power management mechanism can achieve 25.3% leakage energy saving within 4% performance penalty.

2 citations
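
The decision described in the abstract above can be sketched as a comparison of the predicted stall against the break-even time, plus an early wake-up computed from the same prediction. The strict threshold comparison, the `wakeup_latency` parameter, and both function names are assumptions for illustration; the abstract does not describe the predictor or the exact decision rule.

```python
def should_power_gate(predicted_stall: int, break_even: int) -> bool:
    """Gate only if the predicted stall outlasts the break-even time (assumed rule)."""
    return predicted_stall > break_even


def wakeup_cycle(stall_start: int, predicted_stall: int, wakeup_latency: int) -> int:
    """Cycle at which to start waking the IP so it is ready when data returns."""
    return stall_start + max(predicted_stall - wakeup_latency, 0)


# Stall lengths taken from the 10-to-600-cycle range in the abstract;
# the break-even time and wake-up latency are made-up values.
gate_long = should_power_gate(predicted_stall=600, break_even=100)
gate_short = should_power_gate(predicted_stall=10, break_even=100)
wake_at = wakeup_cycle(stall_start=1000, predicted_stall=600, wakeup_latency=50)
```

A short stall is left ungated because the energy cost of switching the power state would exceed the leakage saved.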


Cited by
Proceedings Article
03 Dec 2011
TL;DR: This paper proposes a memory scheduling algorithm designed specifically for parallel applications, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers, and shows that it speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.
Abstract: A primary use of chip-multiprocessor (CMP) systems is to speed up a single application by exploiting thread-level parallelism. In such systems, threads may slow each other down by issuing memory requests that interfere in the shared memory subsystem. This inter-thread memory system interference can significantly degrade parallel application performance. Better memory request scheduling may mitigate such performance degradation. However, previously proposed memory scheduling algorithms for CMPs are designed for multi-programmed workloads where each core runs an independent application, and thus do not take into account the inter-dependent nature of threads in a parallel application. In this paper, we propose a memory scheduling algorithm designed specifically for parallel applications. Our approach has two main components, targeting two common synchronization primitives that cause inter-dependence of threads: locks and barriers. First, the runtime system estimates threads holding the locks that cause the most serialization as the set of limiter threads, which are prioritized by the memory scheduler. Second, the memory scheduler shuffles thread priorities to reduce the time threads take to reach the barrier. We show that our memory scheduler speeds up a set of memory-intensive parallel applications by 12.6% compared to the best previous memory scheduling technique.

147 citations

Proceedings Article
16 Jun 2014
TL;DR: GemDroid is designed by integrating the Android open-source emulator for facilitating execution of mobile applications, the GEM5 core simulator for analyzing the CPU and memory centric designs, and models for several IPs to collectively study their impact on system-level performance and power.
Abstract: As the demand for feature-rich mobile systems such as smartphones and tablets has outpaced other computing systems and is expected to continue at a faster rate, it is projected that SoCs with tens of cores and hundreds of IPs (or accelerators) will be designed to provide unprecedented levels of features and functionality in the future. Design of such mobile systems with required QoS and power budgets along with other design constraints will be a daunting task for computer architects since any ad hoc, piece-meal solution is unlikely to result in an optimal design. This requires early exploration of the complete design space to understand the system-level design trade-offs. To the best of our knowledge, there is no such publicly available tool to conduct a holistic evaluation of mobile platforms consisting of cores, IPs and system software. This paper presents GemDroid, a comprehensive simulation infrastructure to address these concerns. GemDroid has been designed by integrating the Android open-source emulator for facilitating execution of mobile applications, the GEM5 core simulator for analyzing the CPU and memory centric designs, and models for several IPs to collectively study their impact on system-level performance and power. Analyzing a spectrum of applications with GemDroid, we observed that the memory subsystem is a vital cog in the mobile platform because it needs to handle both core and IP traffic, which have very different characteristics. Consequently, we present a heterogeneous memory controller (HMC) design, where we divide the memory physically into two address regions, where the first region with one memory controller (MC) handles core-specific application data and the second region with another MC handles all IP related data. The proposed modifications to the memory controller design result in an average 25% reduction in execution time for CPU bound applications, up to 11% reduction in frame drops, and on average 17% reduction in CPU busy time for on-screen (IP bound) applications.

35 citations

Proceedings Article
13 Jun 2015
TL;DR: This paper proposes a novel IP virtualization framework (VIP), involving three key ideas that allow several IPs to be chained together and made to appear to the software as a single device, thereby allowing better energy saving and utilization opportunities.
Abstract: Energy-efficient user-interactive and display-oriented applications on handhelds rely heavily on multiple accelerators (termed IP cores) to meet their periodic frame processing needs. Further, these platforms are starting to host multiple applications concurrently on the multiple CPU cores. Unfortunately, today's hardware exposes an interface that forces the host software (Android drivers) to treat each IP core as an isolated device. Consequently, the host CPU has to get involved in the (i) processing of each frame, (ii) scheduling them to ensure timely progress through the IP cores to meet their QoS needs, and (iii) explicitly having to move data from one IP core to the next, with main memory serving as the common staging area. We show in this paper through measurements on a Nexus 7 platform that the frequent invocation of the CPU for processing these frames and the involvement of main memory as a data flow conduit are serious limitations. Instead, we propose a novel IP virtualization framework (VIP), involving three key ideas that allow several IPs to be chained together and made to appear to the software as a single device. First, chaining of IPs avoids data transfer through the memory system, enhancing the throughput of flows through the IPs. Second, by using a burst-mode, the CPU can initiate the processing of several frames through the virtual IP chain, without getting involved (and interrupted) for each frame, thereby allowing better energy saving and utilization opportunities. Removing the CPU from this loop requires alternate orchestration of frame flows to ensure QoS guarantees for each frame of each application. Our third enhancement in VIP creates several virtual paths, one for each flow, through these IP chains with the hardware scheduling the frames to enforce QoS guarantees despite any contention for resources along the way. Our experimental evaluations demonstrate the effectiveness of VIP on energy consumption and QoS for multiple applications.

30 citations

Proceedings Article
29 Apr 2013
TL;DR: A simulation-based X-filling method, Bit-Flip, is proposed to maximize the power supply noise during PKLPG test and demonstrates that the method can significantly increase effective WSA while limiting the fill rate.
Abstract: Pseudo functional K Longest Path Per Gate (KLPG) test (PKLPG) is proposed to generate delay tests that test the longest paths while having power supply noise similar to that seen during normal functional operation. Our experimental results show that PKLPG is more vulnerable to under-testing than traditional two-cycle transition fault test. In this work, a simulation-based X-filling method, Bit-Flip, is proposed to maximize the power supply noise during PKLPG test. Given a set of partially-specified scan patterns, random filling is done and then an iterative procedure is invoked to flip some of the filled bits, to increase the effective weighted switching activity (WSA). Experimental results on both compacted and uncompacted test patterns are presented. The results demonstrate that our method can significantly increase effective WSA while limiting the fill rate.

23 citations
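
The random-fill-then-flip procedure in the abstract above can be sketched as a greedy local search: fill the don't-care (X) bits randomly, then flip filled bits one at a time and keep only the flips that raise the score. The abstract does not specify the flip order, stopping rule, or how the fill rate is limited, so those choices and the function names here are assumptions. The `toggles` score is a toy stand-in for the simulated effective WSA.

```python
import random
from typing import Callable, List


def toggles(bits: List[int]) -> int:
    """Toy score: number of adjacent bit transitions (stand-in for simulated WSA)."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def bit_flip_fill(pattern: str, score: Callable[[List[int]], float], seed: int = 0) -> List[int]:
    """Randomly fill X bits, then greedily flip filled bits while the score improves."""
    rng = random.Random(seed)
    x_positions = [i for i, c in enumerate(pattern) if c == "X"]
    bits = [rng.randint(0, 1) if c == "X" else int(c) for c in pattern]
    best = score(bits)
    improved = True
    while improved:  # terminates: the score strictly increases on every kept flip
        improved = False
        for i in x_positions:
            bits[i] ^= 1          # tentatively flip one filled bit
            new = score(bits)
            if new > best:
                best = new
                improved = True
            else:
                bits[i] ^= 1      # revert a flip that did not help
    return bits


filled = bit_flip_fill("1XX0X", toggles)
```

Only positions that were X in the input are ever flipped, so the specified care bits of the pattern are preserved.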

Journal Article
TL;DR: The proposed DfT support enables a design partitioning approach, where any given set of patterns, generated in a power-unaware manner, can be utilized to test the design regions one at a time, reducing both launch and capture power in a design-flow-compatible manner.
Abstract: At-speed or even faster-than-at-speed testing of VLSI circuits aims for high-quality screening of the circuits by targeting performance-related faults. On one hand, a compact test set with highly effective patterns, each detecting multiple delay faults, is desirable for lower test costs. On the other hand, such patterns increase switching activity during launch and capture operations. Patterns optimized for quality and cost may thus end up violating peak-power constraints, resulting in yield loss, while pattern generation under low switching activity constraints may lead to loss in test quality and/or pattern count inflation. In this paper, we propose design for testability (DfT) support for enabling the use of a set of patterns optimized for cost and quality as is, yet in a low power manner; we develop three different DfT mechanisms, one for launch-off shift, one for launch-off capture, and one for mixed at-speed testing. The proposed DfT support enables a design partitioning approach, where any given set of patterns, generated in a power-unaware manner, can be utilized to test the design regions one at a time, reducing both launch and capture power in a design-flow-compatible manner. This way, the test pattern count and quality of the optimized test set can be preserved, while lowering the launch/capture power.

20 citations