
Showing papers by "Heinrich Meyr published in 2006"


Proceedings ArticleDOI
06 Mar 2006
TL;DR: This paper proposes a framework that enables software development, verification and evaluation from the very beginning of the MP-SoC design cycle, and allows co-development of the hardware and software components in a tightly coupled loop, where the hardware can be refined stepwise by considering the requirements of the software.
Abstract: The increasing demand for high performance in embedded applications under shortening time-to-market has prompted system architects in recent times to opt for multi-processor systems-on-chip (MP-SoCs) employing several programmable devices. The programmable cores provide a high degree of flexibility and reusability, and can be optimized to the requirements of the application to deliver high performance as well. Since application software forms the basis of such designs, the need to tune the underlying SoC architecture to extract maximum performance from the software code has become imperative. In this paper, we propose a framework that enables software development, verification and evaluation from the very beginning of the MP-SoC design cycle. Unlike traditional SoC design flows, where software design starts only after the initial SoC architecture is ready, our framework allows co-development of the hardware and software components in a tightly coupled loop, where the hardware can be refined stepwise by considering the requirements of the software. The key element of this framework is the integration of a fine-grained software instrumentation tool into a system-level design (SLD) environment to obtain accurate software performance and memory access statistics. The accuracy of these statistics is comparable to that obtained through instruction-set simulation (ISS), while the execution speed of the instrumented software is almost an order of magnitude faster than ISS. Such a combined design approach helps system architects optimize both the hardware and the software through fast exploration cycles, and can result in far shorter design cycles and higher productivity. We demonstrate the generality and efficiency of our methodology with two case studies selected from two of the most prominent and computationally intensive embedded application domains.
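
As an illustration of the kind of fine-grained instrumentation described above (the macro names and cycle costs below are hypothetical, not taken from the paper), a C sketch in which each basic block is annotated with a pre-estimated cycle cost and each load/store is counted might look like this:

    /* Minimal sketch of fine-grained software instrumentation: the instrumenter
     * annotates each basic block with an estimated cycle cost and counts memory
     * accesses (names and costs are illustrative, not from the paper). */
    #include <stdio.h>

    static unsigned long long sim_cycles = 0;   /* accumulated cycle estimate */
    static unsigned long long mem_accesses = 0; /* counted memory accesses    */

    #define BB_ENTER(cost)  (sim_cycles += (cost))  /* inserted at basic-block entry */
    #define MEM_ACCESS()    (mem_accesses++)        /* inserted at each load/store   */

    static int sum(const int *a, int n) {
        int s = 0;
        BB_ENTER(3);                 /* loop preheader, estimated cost */
        for (int i = 0; i < n; ++i) {
            BB_ENTER(2);             /* loop body, estimated cost      */
            MEM_ACCESS();            /* load a[i]                      */
            s += a[i];
        }
        return s;
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum=%d cycles=%llu mem=%llu\n", sum(a, 8), sim_cycles, mem_accesses);
        return 0;
    }

Running the instrumented program natively yields performance and memory-access statistics without an instruction-set simulator, which is the source of the reported speed advantage.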

96 citations


Book
01 Jan 2006
TL;DR: This book presents a set of tools for the creation and exploration of timing-approximate SoC platform models and the rigorous definition of a framework for modeling at the timing-approximate level of abstraction.

Abstract: Integrated System-Level Modeling of Network-on-Chip Enabled Multi-Processor Platforms first gives a comprehensive update on recent developments in the area of SoC platforms and ESL design methodologies. The main contribution is the rigorous definition of a framework for modeling at the timing-approximate level of abstraction. Subsequently, this book presents a set of tools for the creation and exploration of timing-approximate SoC platform models.

32 citations


Proceedings ArticleDOI
06 Mar 2006
TL;DR: This paper presents the design and implementation of a floating-point unit (FPU) for an application specific instruction set processor (ASIP) suitable for the embedded systems domain, using a state-of-the-art architecture description language (ADL) based ASIP design framework.

Abstract: Multimedia and communication algorithms from the embedded systems domain often make extensive use of floating-point arithmetic. Due to the complexity and expense of floating-point hardware, final implementations of these algorithms are usually carried out using floating-point emulation in software, or conversion (manual or automatic) of the floating-point operations to fixed-point operations. Such strategies often lead to suboptimal and imprecise software implementations. This paper presents the design and implementation of a floating-point unit (FPU) for an application specific instruction set processor (ASIP) suitable for the embedded systems domain. Using a state-of-the-art architecture description language (ADL) based ASIP design framework, the FPU is implemented in such a modular way that it can easily be adapted to any other RISC-like processor. The implemented operations are fully compliant with the IEEE 754 standard, which facilitates portable software development. The benchmarking of the designed FPU in terms of energy, area and speed highlights the trade-offs of a hardware FPU with respect to software emulation of floating-point operations.
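
For context, the IEEE 754 single-precision format that both a hardware FPU and a software emulation library must handle packs a sign bit, an 8-bit biased exponent and a 23-bit fraction into 32 bits. A minimal C sketch of unpacking these fields (illustrative only, not code from the paper):

    /* Sketch: unpack the sign, biased exponent and fraction fields of an
     * IEEE 754 single-precision value. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = -6.25f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* reinterpret without aliasing issues */
        uint32_t sign = bits >> 31;              /* 1 bit                                */
        uint32_t expo = (bits >> 23) & 0xFF;     /* 8 bits, biased by 127                */
        uint32_t frac = bits & 0x7FFFFF;         /* 23 bits, implicit leading 1 (normals)*/
        printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
               sign, expo, (int)expo - 127, frac);
        return 0;
    }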

15 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: Two ASIP design concepts for the cached FFT algorithm (CFFT) are presented, together with a modified CFFT algorithm that enables better cache utilization and reduces the energy dissipation by up to 10% compared to the original CFFT implementation.

Abstract: Orthogonal Frequency Division Multiplexing (OFDM) is a data transmission technique used in wired and wireless digital communication systems. The fast Fourier transform (FFT) and the inverse FFT (IFFT) are kernel processing blocks in an OFDM system and are used for data (de)modulation. OFDM systems are increasingly required to be flexible to accommodate different standards and operation modes, in addition to being energy-efficient. A trade-off between these two conflicting requirements can be achieved by employing Application-Specific Instruction-Set Processors (ASIPs). In this paper, two ASIP design concepts for the Cached FFT algorithm (CFFT) are presented. A reduction in energy dissipation of up to 25% is achieved compared to an ASIP for the widely used Cooley-Tukey FFT algorithm that was designed using the same design methodology and technology. Furthermore, a modified CFFT algorithm that enables better cache utilization is presented. This modification reduces the energy dissipation by up to 10% compared to the original CFFT implementation.
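
For reference, a compact C implementation of the radix-2 decimation-in-time Cooley-Tukey FFT, i.e. the baseline the CFFT is compared against (illustrative code only; the cached variant additionally groups the butterflies into cache-sized epochs, which is not shown here):

    /* Compact reference radix-2 decimation-in-time FFT (Cooley-Tukey baseline). */
    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    static void fft(double complex *x, unsigned n) {   /* n must be a power of two */
        const double PI = acos(-1.0);
        /* bit-reversal permutation */
        for (unsigned i = 1, j = 0; i < n; ++i) {
            unsigned bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
        }
        /* butterfly stages */
        for (unsigned len = 2; len <= n; len <<= 1) {
            double complex wlen = cexp(-2.0 * I * PI / len);
            for (unsigned i = 0; i < n; i += len) {
                double complex w = 1.0;
                for (unsigned k = 0; k < len / 2; ++k) {
                    double complex u = x[i + k];
                    double complex v = x[i + k + len / 2] * w;
                    x[i + k]           = u + v;   /* butterfly */
                    x[i + k + len / 2] = u - v;
                    w *= wlen;
                }
            }
        }
    }

    int main(void) {
        double complex x[8] = {1, 1, 1, 1, 0, 0, 0, 0};
        fft(x, 8);
        for (int i = 0; i < 8; ++i)
            printf("X[%d] = %6.3f %+.3fi\n", i, creal(x[i]), cimag(x[i]));
        return 0;
    }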

14 citations


Proceedings ArticleDOI
22 Oct 2006
TL;DR: This paper presents an efficient and quickly retargetable SIMD code optimization technique that is integrated into an industrial retargetable C compiler and demonstrates that the proposed technique applies to real-life target machines and produces code quality improvements close to the theoretical limit.

Abstract: Retargetable C compilers are nowadays widely used to quickly obtain compiler support for new embedded processors and to perform early processor architecture exploration. One frequent concern about retargetable compilers, though, is their lack of the machine-specific code optimization techniques needed to achieve the highest code quality. While this problem is partially inherent to the retargetable compilation approach, it can be circumvented by designing flexible, configurable code optimization techniques that apply to a certain range of target architectures. This paper focuses on target machines with SIMD instruction support, which is widespread in embedded processors for multimedia applications. We present an efficient and quickly retargetable SIMD code optimization technique that is integrated into an industrial retargetable C compiler. Experimental results for the Philips Trimedia processor demonstrate that the proposed technique applies to real-life target machines and that it produces code quality improvements close to the theoretical limit.
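
As a generic illustration of the data-level parallelism such SIMD optimizations exploit (a SWAR-style sketch in plain C, not the Trimedia-specific intrinsics used in the paper), two 16-bit additions can be packed into one 32-bit operation while preventing carries from crossing the lane boundary:

    #include <stdint.h>
    #include <stdio.h>

    /* Scalar reference: element-wise 16-bit addition. */
    static void add16_scalar(const uint16_t *a, const uint16_t *b, uint16_t *c, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = (uint16_t)(a[i] + b[i]);
    }

    /* "SIMD" version: two 16-bit lanes per 32-bit word, lane carries isolated. */
    static void add16_packed(const uint32_t *a, const uint32_t *b, uint32_t *c, int nw) {
        const uint32_t H = 0x80008000u;                  /* MSB of each 16-bit lane */
        for (int i = 0; i < nw; ++i) {
            uint32_t s = (a[i] & ~H) + (b[i] & ~H);      /* add without lane overflow */
            c[i] = s ^ ((a[i] ^ b[i]) & H);              /* restore the lane MSBs     */
        }
    }

    int main(void) {
        uint16_t a[4] = {1000, 65000, 7, 200}, b[4] = {24, 700, 9, 55}, c[4];
        add16_scalar(a, b, c, 4);
        printf("scalar: %u %u %u %u\n",
               (unsigned)c[0], (unsigned)c[1], (unsigned)c[2], (unsigned)c[3]);

        uint32_t ap[2], bp[2], cp[2];
        for (int i = 0; i < 2; ++i) {                    /* pack two lanes per word */
            ap[i] = (uint32_t)a[2*i] | ((uint32_t)a[2*i+1] << 16);
            bp[i] = (uint32_t)b[2*i] | ((uint32_t)b[2*i+1] << 16);
        }
        add16_packed(ap, bp, cp, 2);
        printf("packed: %u %u %u %u\n",
               cp[0] & 0xFFFF, cp[0] >> 16, cp[1] & 0xFFFF, cp[1] >> 16);
        return 0;
    }

A SIMD-enabled compiler performs essentially this packing automatically, mapping the packed operation onto a native SIMD instruction instead of the bit manipulation shown here.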

13 citations


Proceedings ArticleDOI
06 Mar 2006
TL;DR: This paper proposes an approach that extracts high-level structural information from the ADL representation and systematically uses the control signals available in pipelined datapaths, showing a significant power reduction.

Abstract: Cutting-edge applications of future embedded systems demand the highest processor performance with low power consumption to achieve acceptable battery lifetimes. Therefore, low power optimization techniques are applied extensively during the development of modern Application Specific Instruction Set Processors (ASIPs). Electronic System Level design tools based on Architecture Description Languages (ADLs) offer a significant reduction in design time and effort by automatically generating the software tool-suite as well as the Register Transfer Level (RTL) description of the processor. In this paper, the automation of power optimization in ADL-based RTL generation is addressed. Operand isolation is a well-known power optimization technique applicable at all stages of processor development. With increasing design complexity, several efforts have been undertaken to automate operand isolation. In pipelined datapaths, where isolating signals are often implicitly available, the traditional RTL-based approach introduces unnecessary overhead. We propose an approach that extracts high-level structural information from the ADL representation and systematically uses the available control signals. Our experiments with state-of-the-art embedded processors show a significant reduction in power consumption and a corresponding improvement in power efficiency.
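
Operand isolation itself is a hardware measure; as a purely behavioral illustration in C (not RTL and not the code generated by the described flow), the idea is to freeze the inputs of a functional unit whenever its result is not consumed, so that its combinational logic does not toggle needlessly:

    /* Behavioral sketch of operand isolation: when the multiplier result is not
     * selected, its operands hold their previous values, so the combinational
     * multiplier sees no switching activity. Illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint32_t a, b; } mul_inputs_t;

    static uint32_t mul_cycle(mul_inputs_t *iso, uint32_t a, uint32_t b, int mul_selected) {
        if (mul_selected) {   /* the decode-stage select signal acts as the isolating signal */
            iso->a = a;       /* pass fresh operands only when the result is actually used   */
            iso->b = b;
        }
        /* otherwise the latched operands are reused and the multiplier does not toggle */
        return iso->a * iso->b;
    }

    int main(void) {
        mul_inputs_t iso = {0, 0};
        printf("%u\n", mul_cycle(&iso, 6, 7, 1));  /* MUL instruction: result used       */
        printf("%u\n", mul_cycle(&iso, 9, 9, 0));  /* other instruction: operands frozen */
        return 0;
    }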

13 citations


Proceedings ArticleDOI
26 Apr 2006
TL;DR: An efficient and universal technique for the automatic insertion of gated clocks during the ADL-based ASIP design flow is described, reducing power consumption by up to 41% compared to naive RTL synthesis from the ADL description, without any incurred overhead in area or speed.

Abstract: The increasing complexity of cutting-edge applications for future embedded systems demands even higher processor performance with a strong consideration for battery life. Low power optimization techniques are, therefore, widely applied in the development of modern Application Specific Instruction-Set Processors (ASIPs). Architecture Description Languages (ADLs) offer ASIP designers quick and optimal design convergence by automatically generating the software tool-suite as well as the Register Transfer Level (RTL) description of the processor. The automatically generated processor description is then subjected to the traditional RTL-based synthesis flow. Power-specific optimizations, often found in RTL-based commercial tools, cannot take full advantage of the architectural knowledge embedded in the ADL description, resulting in sub-optimal power efficiency. In this paper, we address this issue by describing an efficient and universal technique for the automatic insertion of gated clocks during the ADL-based ASIP design flow. Experiments with ASIP benchmarks show the dramatic impact of our approach, reducing power consumption by up to 41 percent compared to naive RTL synthesis from the ADL description, without any incurred overhead in area or speed.
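
As background (the standard CMOS dynamic power model, not a result of the paper), the power component that clock gating attacks is

    P_{dyn} = \alpha \, C_L \, V_{DD}^{2} \, f_{clk}

where \alpha is the switching activity, C_L the switched capacitance, V_{DD} the supply voltage and f_{clk} the clock frequency. Gating the clock of registers whose values need not change in a given cycle reduces the effective \alpha on the clock network and the register inputs, which is where the reported savings come from.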

12 citations


Proceedings ArticleDOI
14 Jun 2006
TL;DR: The verification flow includes the idea of automatic assertion generation during high-level synthesis and support for automatic test generation utilizing the ADL framework for ASIP design, and the benefit of the approach is shown by trapping errors in a pipelined SPARC-compliant processor architecture.

Abstract: Nowadays, architecture description languages (ADLs) are becoming popular for achieving quick and optimal design convergence during the development of application specific instruction-set processors (ASIPs). Verification, in various stages of such ASIP development, is a major bottleneck hindering widespread acceptance of the ADL-based processor design approach. Traditional processor verification is applied only at the register transfer level (RTL) or below. In the context of ADL-based ASIP design, this verification approach is often inconvenient and error-prone, since design and verification are done at different levels of abstraction. In this paper, this problem is addressed by presenting an integrated verification approach during ADL-driven processor design. Our verification flow includes the idea of automatic assertion generation during high-level synthesis and support for automatic test generation utilizing the ADL framework for ASIP design. We show the benefit of our approach by trapping errors in a pipelined SPARC-compliant processor architecture.

10 citations


Proceedings ArticleDOI
17 Jan 2006
TL;DR: A novel, fine-grained memory profiling technique that provides the designer with valuable information such as the total dynamic memory requirement of an application, the most heavily accessed source-level data objects, the most memory-intensive portions of an application, etc.

Abstract: The memory subsystem is the major performance bottleneck in terms of speed and power consumption in today's digital systems. This is especially true for application specific embedded systems, where power consumption due to memory traffic, memory latency and the size of the on-chip caches play a significant role in overall system performance, energy efficiency and cost. There is an urgent need for tools that help designers make informed decisions about memory subsystems for embedded applications. This paper presents a novel, fine-grained memory profiling technique that provides the designer with valuable information such as the total dynamic memory requirement of an application, the most heavily accessed source-level data objects, the most memory-intensive portions of an application, etc. Such information can help designers make decisions about the overall memory subsystem comprising a number of different cache levels, scratch-pad memories and main memory. It can also be used by a compiler to perform advanced compiler-controlled memory assignment techniques, and by the application programmer to optimize an application. Case studies at the end of this paper demonstrate the accuracy of our profiling technique and provide some example usage scenarios.
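
A minimal sketch of the kind of dynamic-memory bookkeeping such a profiler performs (hypothetical wrapper functions, not the actual tool described in the paper):

    /* Sketch: wrap allocations to track live and peak dynamic memory per call site. */
    #include <stdio.h>
    #include <stdlib.h>

    static size_t cur_bytes = 0, peak_bytes = 0, n_allocs = 0;

    static void *prof_malloc(size_t size, const char *site) {
        size_t *p = malloc(size + sizeof(size_t));   /* store size in a small header */
        if (!p) return NULL;
        *p = size;
        cur_bytes += size;
        if (cur_bytes > peak_bytes) peak_bytes = cur_bytes;
        ++n_allocs;
        printf("alloc %zu bytes at %s (live: %zu)\n", size, site, cur_bytes);
        return p + 1;
    }

    static void prof_free(void *ptr) {
        if (!ptr) return;
        size_t *p = (size_t *)ptr - 1;               /* recover the size header */
        cur_bytes -= *p;
        free(p);
    }

    int main(void) {
        void *a = prof_malloc(128, "decode_buffer");
        void *b = prof_malloc(64,  "scratch");
        prof_free(b);
        prof_free(a);
        printf("peak dynamic memory: %zu bytes in %zu allocations\n", peak_bytes, n_allocs);
        return 0;
    }

A real profiler additionally attributes loads and stores to source-level data objects, but the per-site accounting above already illustrates where figures such as the total dynamic memory requirement come from.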

8 citations


Proceedings ArticleDOI
11 Dec 2006
TL;DR: A robust and adaptive structure, based on least mean square (LMS) adaptation, is deduced for the prediction of the channel fading process in the context of a power-controlled code division multiple access (CDMA) system.

Abstract: In this paper we derive an enhanced power control algorithm that fits into the up/down control scheme considered in the frequency division duplex (FDD) mode of the current 3GPP standard. Analysis of the classical up/down power control scheme reveals that, with increasing velocities, the power control performance degrades, as the fixed-step-size power control is not able to track the channel fading properly. For the uplink we derive a nonlinear control algorithm that generates the up/down power control commands while accounting for the future of the channel fading process. Simulations show that this algorithm, in combination with perfect future channel state information, can partially mitigate the drawbacks of a fixed-step-size up/down power control. A prerequisite for predictive power control is the acquisition of the future channel state information. In this paper we deduce a robust and adaptive structure, based on least mean square (LMS) adaptation, for the prediction of the channel fading process in the context of a power-controlled code division multiple access (CDMA) system. Link-level simulations show a signal-to-noise-and-interference ratio (SINR) gain in terms of the block error rate, enabling a decrease of the target SINR and thus leading to enhanced spectral efficiency.
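
The LMS-based predictor mentioned above follows the standard LMS form (notation illustrative, not the specific structure derived in the paper): with x(n) = [h(n-1), ..., h(n-D)]^T collecting the D most recent fading samples, w(n) the adaptive weight vector and \mu the step size,

    \hat{h}(n) = w^{H}(n) \, x(n), \qquad e(n) = h(n) - \hat{h}(n), \qquad w(n+1) = w(n) + \mu \, x(n) \, e^{*}(n)

so the filter is continuously adapted from the prediction error and can track changes in the fading statistics without explicit knowledge of the Doppler spectrum.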

8 citations


Proceedings ArticleDOI
21 May 2006
TL;DR: It is shown that existing techniques are inefficient for high throughput applications such as ultra wideband (UWB), because of memory bottlenecks, and an interleaved execution technique which exploits temporal parallelism is proposed.
Abstract: The fast Fourier transform (FFT) and its inverse (IFFT) are used in orthogonal frequency division multiplexing (OFDM) systems for data (de)modulation. The transformations are the kernel tasks in an OFDM implementation, and are the most processing-intensive ones. Recent trends in the consumer electronics market require OFDM implementations to be flexible, making a trade-off between area, energy efficiency, flexibility and timing a necessity. This has spurred the development of application-specific instruction-set processors (ASIPs) for FFT processing. Parallelization is an architectural parameter that significantly influences these design goals. This paper presents an analysis of the efficiency of parallelization techniques for an FFT ASIP. It is shown that existing techniques are inefficient for high-throughput applications such as ultra wideband (UWB) because of memory bottlenecks. Therefore, an interleaved execution technique that exploits temporal parallelism is proposed. With this technique, it is possible to meet the throughput requirement of UWB (409.6 Msamples/s) with only 4 non-trivial butterfly units for an ASIP running at 400 MHz.
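
The quoted throughput target can be sanity-checked with a short calculation, assuming the 128-point FFT and 312.5 ns OFDM symbol interval of MB-OFDM UWB and a radix-2 decomposition (assumptions not stated in the abstract):

    128 samples / 312.5 ns            = 409.6 Msamples/s
    312.5 ns * 400 MHz                = 125 cycles per OFDM symbol
    (128/2) * log2(128)               = 448 radix-2 butterflies per FFT
    448 butterflies / 4 units         = 112 butterfly cycles  (< 125)

With 4 butterfly units the pure compute time just fits the 125-cycle budget, leaving only a handful of cycles for memory accesses and control; this is why the memory bottleneck dominates and why the proposed interleaved execution over consecutive symbols matters.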

Proceedings ArticleDOI
11 Dec 2006
TL;DR: The results on the data rate agree well with previous results based on non-data-aided channel prediction, especially in the bandwidth range of interest.

Abstract: The achievable data rate of an OFDM system using data-aided channel estimation in the high-bandwidth regime is evaluated under the assumption of a frequency-selective, continuously fading channel. As previous results suggest, the achievable data rate depends on the LMMSE channel estimate, for which a convenient representation is introduced here. The mean square estimation error is derived from this representation, allowing for an analysis with respect to the optimum amount and distribution of the pilot symbols in the wideband regime. The results on the data rate agree well with previous results based on non-data-aided channel prediction, especially in the bandwidth range of interest.
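
The LMMSE channel estimate referred to above has the standard form (notation illustrative, not the specific representation introduced in the paper): for pilot observations y = X h + n with channel covariance R_{hh} and noise variance \sigma_n^2,

    \hat{h}_{LMMSE} = R_{hh} X^{H} \left( X R_{hh} X^{H} + \sigma_n^{2} I \right)^{-1} y

and the mean square estimation error is the trace of the error covariance,

    MSE = \mathrm{tr}\left\{ R_{hh} - R_{hh} X^{H} \left( X R_{hh} X^{H} + \sigma_n^{2} I \right)^{-1} X R_{hh} \right\}

which is the quantity whose dependence on the number and placement of pilot symbols drives the wideband analysis.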

Journal ArticleDOI
01 Jun 2006
TL;DR: A modeling style is presented that captures high- and low-level architectural information at the same time and makes it possible to drive both C compiler and simulator generation without sacrificing modeling flexibility.

Abstract: Today's Application Specific Instruction-set Processor (ASIP) design methodology often employs centralized Architecture Description Language (ADL) processor models, from which software tools, such as the C compiler, assembler, linker, and instruction-set simulator, can be automatically generated. Among these tools, the C compiler is becoming more and more important. However, the generation of C compilers requires high-level architecture information rather than the low-level details needed for simulator generation. This makes it particularly difficult to include different aspects of the target architecture in one single model while keeping it consistent. This paper presents a modeling style that captures high- and low-level architectural information at the same time and makes it possible to drive both C compiler and simulator generation without sacrificing modeling flexibility. The proposed approach has been successfully applied to model a number of contemporary, real-world processor architectures.

03 Apr 2006
TL;DR: In this paper, the authors link the turbo decoding algorithm to maximum likelihood (ML) sequence detection by demonstrating how the turbo decoder can be systematically derived starting from the ML sequence detection criterion.
Abstract: Despite the considerable research effort towards the analysis and understanding of the nature of turbo decoding, a clear identification of the underlying optimization problem the turbo decoder attempts to solve is still missing. In this paper, we link the turbo decoding algorithm to maximum likelihood (ML) sequence detection by demonstrating how the turbo decoder can be systematically derived starting from the ML sequence detection criterion. In particular, we show that a method to solve the ML sequence detection problem is to iteratively solve the corresponding critical point equations of an equivalent unconstrained estimation problem by means of fixed-point iterations. The turbo decoding algorithm is obtained by approximating the overall a posteriori probabilities, such that the fixed-point iteration becomes feasible and the optimum ML solution is still a solution of the corresponding approximate critical point equations.
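
The fixed-point view can be sketched generically (notation illustrative, not the exact derivation of the paper): collecting the extrinsic information exchanged between the two constituent decoders into vectors \Lambda_1 and \Lambda_2, the turbo iteration has the form

    \Lambda_1^{(k+1)} = F_1\left(\Lambda_2^{(k)}\right), \qquad \Lambda_2^{(k+1)} = F_2\left(\Lambda_1^{(k+1)}\right)

where F_1 and F_2 denote the a posteriori probability computations of the two constituent codes. A stationary point \Lambda^{*} of this iteration is, by the construction described above, a solution of the approximate critical-point equations derived from the ML sequence detection criterion.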

03 Apr 2006
TL;DR: Using results from the theory of products of random matrices, a general proof of this property is presented, and it is shown that the rate of decay is exponential with distance along the trellis.

Abstract: From an implementation as well as a theoretical point of view, it is a fundamental property of the symbol-by-symbol MAP decoding algorithm that the dependence of the decoder output on the decoder inputs decays with distance in the code trellis. Using results from the theory of products of random matrices, we present a general proof of this property and show that the rate of decay is exponential with distance along the trellis. Furthermore, we examine how the rate of decay depends on the channel parameter and the a priori information, and how it evolves during the iterative decoding process. Finally, we comment on possible practical implications.
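
To make the setting concrete (standard BCJR notation, not the specific derivation of the paper): the forward recursion of the symbol-by-symbol MAP algorithm can be written as a product of non-negative random matrices,

    \alpha_k^{T} = \alpha_{k-1}^{T} \Gamma_k = \alpha_0^{T} \, \Gamma_1 \Gamma_2 \cdots \Gamma_k

where \Gamma_k collects the branch metrics of trellis step k (and analogously for the backward recursion). The decay property then states, informally, that a perturbation of the decoder input at step k changes the output at step k+d only by a factor on the order of e^{-\gamma d}, with a rate \gamma > 0 governed by the statistics of these matrix products; this is what justifies windowed MAP implementations in practice.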

Proceedings ArticleDOI
06 Mar 2006
TL;DR: A new NPU code optimization technique that uses such HW contexts is presented, minimizing the overhead of saving and reloading register contents for function calls via the runtime stack.

Abstract: Sophisticated C compiler support for network processors (NPUs) is required to improve their usability and, consequently, their acceptance in system design. Nonetheless, high-level code compilation always introduces overhead in code size and performance compared to handwritten assembly code. This overhead results partially from high-level function calls, which usually introduce memory accesses in order to save and reload register contents. A key feature of many NPU architectures is hardware multithreading support, in the form of separate register files, for fast context switching between different application tasks. In this paper, a new NPU code optimization technique is presented that uses such HW contexts to minimize the overhead of saving and reloading register contents for function calls via the runtime stack. The feasibility and the performance gain of this technique are demonstrated for the Infineon Technologies PP32 NPU architecture and typical network application kernels.