Author

Wen-Li Shih

Bio: Wen-Li Shih is an academic researcher from National Tsing Hua University. The author has contributed to research in topics: Compiler & Power optimization. The author has an h-index of 3 and has co-authored 6 publications receiving 27 citations.

Papers
Proceedings ArticleDOI
16 Aug 2006
TL;DR: The methods and experiences to develop software and toolkit flows for PAC (parallel architecture core) VLIW DSP processors and the experimental result in the compiler framework by incorporating software pipeline (SWP) policies for distributed register files in PAC architecture are presented.
Abstract: To support high performance and low power for multimedia applications and hand-held devices, embedded VLIW DSP processors are a research focus. With the tight resource constraints, distributed register files, variable-length encodings for instructions, and special data paths are frequently adopted. This creates challenges in deploying software toolkits for new embedded DSP processors. This article presents our methods and experiences in developing software and toolkit flows for PAC (parallel architecture core) VLIW DSP processors. Our toolkits include compilers, assemblers, a debugger, and DSP micro-kernels. We first retarget the open research compiler (ORC) and toolkit chains for the PAC VLIW DSP processor and address the issues of supporting distributed register files and ping-pong data paths for embedded VLIW DSP processors. Second, the linker and assembler are able to support variable-length encoding schemes for DSP instructions. In addition, the debugger and DSP micro-kernel were designed to handle dual-core environments. The footprint of the micro-kernel is also around 10K to address the code-size issues for embedded devices. We also present the experimental results in the compiler framework obtained by incorporating software pipeline (SWP) policies for distributed register files in the PAC architecture. Results indicate that our compiler framework achieves a performance improvement of around 2.5 times over code generated without our proposed optimizations.

11 citations

Journal ArticleDOI
TL;DR: A multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines is presented.
Abstract: Multithread programming is widely adopted in novel embedded system applications due to its high performance and flexibility. This article addresses compiler optimization for reducing the power consumption of multithread programs. A traditional compiler employs energy management techniques that analyze component usage in control-flow graphs with a focus on single-thread programs. In this environment the leakage power can be controlled by inserting on and off instructions based on component usage information generated by flow equations. However, these methods cannot be directly extended to a multithread environment due to concurrent execution issues. This article presents a multithread power-gating framework composed of multithread power-gating analysis (MTPGA) and predicated power-gating (PPG) energy management mechanisms for reducing the leakage power when executing multithread programs on simultaneous multithreading (SMT) machines. Our multithread programming model is based on hierarchical bulk-synchronous parallel (BSP) models. Based on a multithread component analysis with dataflow equations, our MTPGA framework estimates the energy usage of multithread programs and inserts PPG operations as power controls for energy management. We performed experiments by incorporating our power optimization framework into SUIF compiler tools and by simulating the energy consumption with a post-estimated SMT simulator based on Wattch toolkits. The experimental results show that the total energy consumption of a system with PPG support and our power optimization method is reduced by an average of 10.09% for BSP programs relative to a system without a power-gating mechanism when the leakage contribution is set to 30%; and the total energy consumption is reduced by an average of 4.27% when the leakage contribution is set to 10%. The results demonstrate our mechanisms are effective in reducing the leakage energy of BSP multithread programs.

7 citations

Proceedings ArticleDOI
02 Sep 2011
TL;DR: The flow to enable an OpenCL compiler based on Open64 infrastructures for ATI GPUs is described, which includes the extension of the front-end parser for OpenCL, the generation of high-level intermediate representations with OpenCL linguistics, performing high-level optimization, and finally applying OpenCL specific optimization for code generations.
Abstract: As microprocessors evolve into heterogeneous architectures with multi-cores of MPUs and GPUs, programming model supports become important for programming such architectures. To address this issue, OpenCL is proposed. Currently, most of OpenCL implementations take LLVM as their infrastructures. This presents an opportunity to demonstrate whether OpenCL can be effectively implemented on other compiler infrastructures. For example, Open64, which is another open source compiler and known to generate efficient codes for microprocessors, can contribute further to performance improvements and enhancing the adoption of heterogeneous computing based on OpenCL. In this paper, we describe the flow to enable an OpenCL compiler based on Open64 infrastructures for ATI GPUs. Our work includes the extension of the front-end parser for OpenCL, the generation of high-level intermediate representations with OpenCL linguistics, performing high-level optimization, and finally applying OpenCL specific optimization for code generations. Preliminary experimental results show that our compiler based on Open64 is able to generate efficient codes for OpenCL programs.

5 citations

Journal ArticleDOI
01 Sep 2015
TL;DR: This work attempts to devise power optimization schemes in compilers by exploiting the opportunities of the recurring patterns of embedded multicore programs, including Pipe and Filter pattern, MapReduce with Iterator pattern, and Bulk Synchronous Parallel Model.
Abstract: Minimization of power dissipation can be considered at the algorithmic, compiler, architectural, logic, and circuit levels. Recent research trends for multicore programming models suggest that parallel design patterns can be a solution for developing multicore applications. As parallel design patterns exhibit regularity, we view this as a great opportunity to exploit power optimizations in the software layer. In this paper, we investigate compilers for low power with parallel design patterns on embedded multicore systems. We evaluate four major parallel design patterns: Pipe and Filter, MapReduce with Iterator, Puppeteer, and the Bulk Synchronous Parallel (BSP) Model. Our work attempts to devise power optimization schemes in compilers by exploiting the opportunities of the recurring patterns of embedded multicore programs. The proposed optimization schemes are rate-based optimization for the Pipe and Filter pattern, early-exit power optimization for the MapReduce with Iterator pattern, a power-aware mapping algorithm for the Puppeteer pattern, and a multi-phase power gating scheme for the BSP pattern. In our experiments, real-world multicore applications are evaluated on a multicore power simulator. Significant power reductions are observed from the experimental results. We therefore present a direction for power optimization in which additional key design patterns for embedded multicore systems can be identified to explore power optimization opportunities via compilers.

3 citations

Proceedings ArticleDOI
01 Dec 2011
TL;DR: This paper presents a case study on parallelizing a Bokeh application on an embedded multicore platform, which features one MPU and one DSP sub-system consisting of two VLIW DSP processors.
Abstract: Bokeh application presents the blur or the aesthetic quality of blurring in out-of-focus areas of an image. The out-of-focus effect of Bokeh results depends on accuracy of depth information and blurring effects produced by image postprocessing. To obtain accurate depth information, current stereo vision techniques however consume a huge amount of processing time. In this paper, we present a case study on parallelizing a Bokeh application on an embedded multicore platform, which features one MPU and one DSP sub-system consisting of two VLIW DSP processors. The Bokeh application employs a Belief Propagation method to obtain depth information of input images and uses the information to generate output images with out-of-focus effect. This study also illustrates how to deliver performance for applications on embedded multicore systems. To sustain heavy computation requirement of the stereo vision techniques, DSPs with their SIMD instructions are leveraged to exploit data parallelism in critical kernels. In addition, DMAs on the multicore system are also incorporated to facilitate data transmission between processors. The access to SIMD and DMAs is provided by two essential programming models we developed for embedded multicore systems. Our work also gives the firsthand experiences of how C++ classes and abstractions can be used to help parallelization of applications on embedded multicore DSP systems. Finally, in our experiments, we utilize DSPs, SIMD and DMAs to obtain performance for two key components of the Bokeh application with their speedups of 1.67 and 2.75, respectively.

2 citations


Cited by
Proceedings ArticleDOI
23 Apr 2008
TL;DR: The research directions of the second-phase PAC project (PAC II), including multicore architectures, ESL (electronics system-level) technology, and low-power multimedia framework, are also addressed in this paper.
Abstract: The Industrial Technology Research Institute (ITRI) PAC (parallel architecture core) project was initiated in 2003. The target is to develop a low-power and high-performance programmable SoC platform for multimedia applications. In the first PAC project phase (2004-2006), a 5-way VLIW DSP (PACDSP) processor has been developed with our patented distributed & ping-pong register file and variable-length VLIW encoding techniques. A dual-core PAC SoC, which is composed of a PACDSP core and an ARM9 core, has also been designed and fabricated in the TSMC 0.13 μm technology to demonstrate its outstanding performance and energy efficiency for multimedia processing such as real-time H.264 codec. This paper summarizes the technical contents of PACDSP, DVFS (dynamic voltage and frequency scaling)-enabled PAC SoC, and the energy-aware multimedia codec. The research directions of our second-phase PAC project (PAC II), including multicore architectures, ESL (electronics system-level) technology, and low-power multimedia framework, are also addressed in this paper.

47 citations

Journal ArticleDOI
TL;DR: In this paper, the authors focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing static resource analysis (SRA) techniques and a new target-agnostic profiling technique, without hardware energy measurements.
Abstract: Energy transparency is a concept that makes a program’s energy consumption visible, from hardware up to software, through the different system layers. Such transparency can enable energy optimizations at each layer and between layers, as well as help both programmers and operating systems make energy-aware decisions. In this article, we focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing static resource analysis (SRA) techniques and a new target-agnostic profiling technique, without hardware energy measurements. Our novel mapping technique enables software energy consumption estimations at a higher level than the Instruction Set Architecture (ISA), namely the LLVM intermediate representation (IR) level, and therefore introduces energy transparency directly to the LLVM optimizer. We apply our energy estimation techniques to a comprehensive set of benchmarks, including single- and multithreaded embedded programs from two commonly used concurrency patterns: task farms and pipelines. Using SRA, our LLVM IR results demonstrate a high accuracy with a deviation in the range of 1% from the ISA SRA. Our profiling technique captures the actual energy consumption at the LLVM IR level with an average error of 3%.

29 citations

Proceedings ArticleDOI
01 Jan 2019
TL;DR: This work used discriminant correlation analysis (DCA) to fuse features from face and voice and used the K-nearest neighbors (KNN) algorithm to classify the features and showed that fusion increased recognition accuracy by 52.45% compared to using face alone and 81.62% when using voice alone.
Abstract: Biometric authentication is a promising approach to securing the Internet of Things (IoT). Although existing research shows that using multiple biometrics for authentication helps increase recognition accuracy, the majority of biometric approaches for IoT today continue to rely on a single modality. We propose a multimodal biometric approach for IoT based on face and voice modalities that is designed to scale to the limited resources of an IoT device. Our work builds on the foundation of Gofman et al. [7] in implementing face and voice feature-level fusion on mobile devices. We used discriminant correlation analysis (DCA) to fuse features from face and voice and used the K-nearest neighbors (KNN) algorithm to classify the features. The approach was implemented on the Raspberry Pi IoT device and was evaluated on a dataset of face images and voice files acquired using a Samsung Galaxy S5 device in real-world conditions such as dark rooms and noisy settings. The results show that fusion increased recognition accuracy by 52.45% compared to using face alone and 81.62% compared to using voice alone. It took an average of 1.34 seconds to enroll a user and 0.91 seconds to perform the authentication. To further optimize execution speed and reduce power consumption, we implemented classification on a field-programmable gate array (FPGA) chip that can be easily integrated into an IoT device. Experimental results showed that the proposed FPGA-accelerated KNN could achieve 150x faster execution time and 12x lower energy consumption compared to a CPU.

23 citations

Journal ArticleDOI
01 Mar 2011
TL;DR: The first part of the two introductory papers of PAC describes the hardware architecture of the PACDSP core, its software development tools, and the PAC SoC with dynamic voltage and frequency scaling (DVFS).
Abstract: In order to develop a low-power and high-performance SoC platform for multimedia applications, the Parallel Architecture Core (PAC) project was initiated in Taiwan in 2003. A VLIW digital signal processor (PACDSP) has been developed from a proprietary instruction set with multimedia-rich instructions, a complexity-effective microarchitecture with an innovative distributed & ping-pong register organization and variable-length VLIW encoding, to a highly-configurable soft IP with several successful silicon implementations. A complete toolchain with an optimizing C compiler has also been developed for PACDSP. A dual-core PAC SoC has been designed and fabricated, which consists of a PACDSP core, an ARM9 core, scratchpad memories, and various on-chip peripherals, to demonstrate the outstanding performance and energy efficiency for multimedia processing such as the real-time H.264 codec. The first part of the two introductory papers of PAC describes the hardware architecture of the PACDSP core, its software development tools, and the PAC SoC with dynamic voltage and frequency scaling (DVFS).

18 citations

Proceedings ArticleDOI
01 Nov 2009
TL;DR: The experiment results show that the proposed software cache can efficiently reduce the external memory access times and includes pointwise element access and block version of access to software cache.
Abstract: In embedded SoC design, memory hierarchies are playing increasingly important roles for system performances. There is a significant latency gap between internal and external memory accesses. The external memory access might downgrade the performance of embedded systems. Application developers must explicitly handle data transfer between external and internal memories. That is a burden for programmers. In this paper, we propose a software cache API to help programmers to ease this problem. The proposed API includes pointwise element access and block version of access to software cache. We also give a detailed description for design and implementation of software cache API. As a case study, the software cache API is implemented on PAC DSP, a high performance DSP aiming for multi-media applications. We evaluate the implementation with UTDSP benchmark suite. The experiment results show that the proposed software cache can efficiently reduce the external memory access times.

10 citations
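
The software cache idea in the abstract above (pointwise element access plus a block version of access, placed in front of slow external memory) can be illustrated with a minimal sketch. This is not the API from the paper: the class name `SoftwareCache`, the methods `read` and `read_block`, the direct-mapped organization, the read-only behavior, and the line and cache sizes are all assumptions chosen for illustration.

```python
class SoftwareCache:
    """Hypothetical direct-mapped, read-only software cache (illustrative only)."""

    def __init__(self, external, line_size=4, num_lines=8):
        self.external = external          # list standing in for slow external memory
        self.line_size = line_size        # elements per cache line (assumed value)
        self.num_lines = num_lines        # number of cache lines (assumed value)
        self.tags = [None] * num_lines    # which external line each slot holds
        self.lines = [None] * num_lines   # cached copies of external lines
        self.external_transfers = 0       # count of line fills from external memory

    def _fill(self, line_no):
        """Return the cached line, fetching it from external memory on a miss."""
        slot = line_no % self.num_lines
        if self.tags[slot] != line_no:
            start = line_no * self.line_size
            self.lines[slot] = self.external[start:start + self.line_size]
            self.tags[slot] = line_no
            self.external_transfers += 1
        return self.lines[slot]

    def read(self, addr):
        """Pointwise access: return the single element at addr."""
        line = self._fill(addr // self.line_size)
        return line[addr % self.line_size]

    def read_block(self, addr, count):
        """Block access: return count consecutive elements starting at addr."""
        return [self.read(a) for a in range(addr, addr + count)]


if __name__ == "__main__":
    memory = list(range(64))
    cache = SoftwareCache(memory)
    for i in range(16):
        cache.read(i)
    # Sixteen sequential reads touch four lines of four elements each.
    print(cache.external_transfers)
```

In this sketch, sequential reads only go to external memory once per line, which is the kind of reduction in external accesses the abstract reports; a real DSP implementation would presumably use DMA transfers and handle writes, which are omitted here.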