
Showing papers by "Heinrich Meyr published in 1996"


Proceedings ArticleDOI
30 Oct 1996
TL;DR: The development of a new language was necessary to bridge the gap between the coarse ISA models used in compilers and instruction set simulators on the one hand, and the detailed models used for hardware design on the other.
Abstract: A machine description language is presented. The language, LISA, and its generic machine model are able to produce bit- and cycle/phase-accurate processor models covering the specific needs of HW/SW codesign and cosimulation environments. The development of a new language was necessary to bridge the gap between the coarse ISA models used in compilers and instruction set simulators on the one hand, and the detailed models used for hardware design on the other. The main part of the paper is devoted to behavioral pipeline modeling. The pipeline controller of the generic machine model is represented as an ASAP (as soon as possible) sequencer parameterized by precedence and resource constraints of the operations of each instruction. The standard pipeline description based on reservation tables and Gantt charts was extended by additional operation descriptors which enable the detection of data and control hazards and permit modeling of pipeline flushes. Using the newly introduced L-charts, we reduced the parameterization of the pipeline controller to a minimum while still covering the typical pipeline controls found in state-of-the-art signal processors. As an example, the application of the LISA model to the TI TMS320C54x signal processor is presented.
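The reservation-table mechanism the abstract builds on can be made concrete with a small sketch. This is an illustrative structural-hazard check over reservation tables, not LISA's actual model; the pipeline stages and tables below are invented for the example.

```python
# Illustrative sketch (not LISA itself): detect a structural hazard by
# overlaying the reservation tables of two instructions issued some
# cycles apart. Stage names and tables are hypothetical.

def conflicts(table_a, table_b, issue_gap):
    """Return True if instruction B, issued `issue_gap` cycles after A,
    needs a resource in the same cycle that A still occupies it."""
    for cycle_a, resources_a in enumerate(table_a):
        cycle_b = cycle_a - issue_gap          # B's local cycle at that time
        if 0 <= cycle_b < len(table_b):
            if resources_a & table_b[cycle_b]:  # shared resource, same cycle
                return True
    return False

# Reservation tables: one set of busy resources per cycle.
mac = [{"FETCH"}, {"DECODE"}, {"MUL"}, {"MUL"}, {"ACC"}]  # 2-cycle multiplier
add = [{"FETCH"}, {"DECODE"}, {"ALU"}]
mac2 = list(mac)

print(conflicts(mac, add, 1))    # no shared stage overlaps -> False
print(conflicts(mac, mac2, 1))   # both need MUL in the same cycle -> True
```

A real pipeline model would add the operation descriptors the paper mentions for data/control hazards and flushes on top of this purely structural check.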

151 citations


Proceedings ArticleDOI
01 Jun 1996
TL;DR: In this paper, the sources of the speedup and the limitations of the technique are analyzed and the realization of the simulation compiler is presented.
Abstract: This paper presents a technique for simulating processors and attached hardware using the principle of compiled simulation. Unlike existing in-house and off-the-shelf hardware/software co-simulators, which use interpretive processor simulation, the proposed technique performs instruction decoding and simulation scheduling at compile time. The technique offers up to three orders of magnitude faster simulation. The high speed allows the user to explore algorithms and hardware/software trade-offs before any hardware implementation. In this paper, the sources of the speedup and the limitations of the technique are analyzed, and the realization of the simulation compiler is presented.
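The core idea — moving instruction decoding out of the simulation loop and doing it once, before the run — can be sketched in a few lines. This toy simulator is not the authors' tool; the ISA and encoding are invented for illustration.

```python
# Toy illustration of the compiled-simulation principle: decode each
# instruction once into a host-level handler, then the run loop only
# dispatches pre-built handlers (no per-cycle decoding).

def decode(word):
    """Decode one 'instruction' into a host-level handler (closure)."""
    op, dst, src = word  # toy encoding: (opcode, dest reg, source/immediate)
    if op == "li":
        return lambda regs: regs.__setitem__(dst, src)
    if op == "add":
        return lambda regs: regs.__setitem__(dst, regs[dst] + regs[src])
    raise ValueError(f"unknown opcode {op!r}")

program = [("li", 0, 5), ("li", 1, 7), ("add", 0, 1)]

# "Compile time": every instruction is decoded exactly once.
compiled = [decode(w) for w in program]

# "Run time": no decoding, just dispatch.
regs = [0] * 4
for handler in compiled:
    handler(regs)

print(regs[0])  # 12
```

An interpretive simulator would re-run `decode` on every executed instruction; hoisting it out of the loop is where the large speedup comes from.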

84 citations


Journal ArticleDOI
TL;DR: It is proved that, because no additional operations are required, DCORDIC compares favorably with the previously known redundant methods in terms of latency and computational complexity.
Abstract: The CORDIC algorithm is a well-known iterative method for the efficient computation of vector rotations, and trigonometric and hyperbolic functions. Basically, CORDIC performs a vector rotation which is not a perfect rotation, since the vector is also scaled by a constant factor. This scaling has to be compensated for following the CORDIC iteration. Since CORDIC implementations using conventional number systems are relatively slow, current research has focused on solutions employing redundant number systems which make a much faster implementation possible. The problem with these methods is that either the scale factor becomes variable, making additional operations necessary to compensate for the scaling, or additional iterations are necessary compared to the original algorithm. In contrast we developed transformations of the usual CORDIC algorithm which result in a constant scale factor redundant implementation without additional operations. The resulting "Differential CORDIC Algorithm" (DCORDIC) makes use of on-line (most significant digit first redundant) computation. We derive parallel architectures for the radix-2 redundant number systems and present some implementation results based on logic synthesis of VHDL descriptions produced by a DCORDIC VHDL generator. We finally prove that, due to the lack of additional operations, DCORDIC compares favorably with the previously known redundant methods in terms of latency and computational complexity.
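To make the scale factor discussed above concrete, here is the textbook conventional CORDIC rotation (not the DCORDIC variant): after the iterations the result is scaled by the constant K, which must be compensated separately.

```python
# Conventional CORDIC in rotation mode. Each iteration rotates by
# +/- atan(2^-i) using only shifts and adds; the side effect is a
# constant scaling by K = prod(sqrt(1 + 2^-2i)).
import math

def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians; result is scaled by K."""
    for i in range(iterations):
        sigma = 1 if angle >= 0 else -1
        x, y = x - sigma * y * 2**-i, y + sigma * x * 2**-i
        angle -= sigma * math.atan(2**-i)
    return x, y

# The constant scale factor that must be compensated:
K = math.prod(math.sqrt(1 + 2**(-2 * i)) for i in range(32))

x, y = cordic_rotate(1.0, 0.0, math.pi / 3)
print(x / K, y / K)  # ~ (cos 60deg, sin 60deg) = (0.5, 0.866...)
```

The redundant-arithmetic methods the abstract surveys speed up the add/compare inside this loop; the paper's contribution is keeping K constant while doing so, without extra operations or iterations.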

84 citations


Book ChapterDOI
01 Jan 1996
TL;DR: The advent of 0.5μ processing that allows for the integration of 5 million transistors on a single integrated circuit has brought forth new challenges and opportunities in embedded-system design.
Abstract: The advent of 0.5μ processing that allows for the integration of 5 million transistors on a single integrated circuit has brought forth new challenges and opportunities in embedded-system design. This high level of integration makes it possible and desirable to integrate a processor core, a program ROM, and an ASIC together on a single IC. To justify the design costs of such an IC, these embedded-system designs must be sold in large volumes and, as a result, they are very cost-sensitive. The cost of an IC is most closely linked to its size, which is derived from the final circuit area. It is not unusual for the ROM that stores the program code to be the largest contributor to the area of such ICs. Thus the incremental value of using logic optimization to reduce the size of the ASIC is smaller because the ASIC takes up a relatively smaller percentage of the final circuit area. On the other hand, the potential for cost reduction through diminishing the size of the program ROM is great. There are also often strong real-time performance requirements on the final code; hence, there is a necessity for producing high-performance code as well.

58 citations


Journal ArticleDOI
TL;DR: The aim of this paper is to describe the system design and VLSI implementation of a complex system of fabricated ASICs for high-speed Viterbi decoding using the "minimized method" (MM) parallelized VA.
Abstract: At present, the Viterbi algorithm (VA) is widely used in communication systems for decoding and equalization. The achievable speed of conventional Viterbi decoders (VDs) is limited by the inherent nonlinear add-compare-select (ACS) recursion. The aim of this paper is to describe the system design and VLSI implementation of a complex system of fabricated ASICs for high-speed Viterbi decoding using the "minimized method" (MM) parallelized VA. We particularly emphasize the interaction between system design, architecture, and VLSI implementation, as well as system partitioning issues and the resulting requirements for the system design flow. Our design objectives were 1) to achieve the same decoding performance as a conventional VD using the parallelized algorithm, 2) to achieve a speed of more than 1 Gb/s, and 3) to realize a system for this task using a single cascadable ASIC. With a minimum system configuration of four identical ASICs produced in 1.0 μm CMOS technology, the design objective of a decoding speed of 1.2 Gb/s is achieved. Compared to previous implementations of Viterbi decoders, this increases the speed by an order of magnitude.
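The ACS recursion that forms the speed bottleneck can be sketched for a single trellis step. The 4-state trellis, metrics, and predecessor table below are invented for illustration and are not taken from the paper's design.

```python
# One add-compare-select (ACS) step of the Viterbi algorithm on a toy
# 4-state trellis. The recursion is nonlinear (min over sums), which is
# why it cannot be pipelined naively -- the motivation for parallelized
# variants such as the paper's "minimized method".

def acs_step(path_metrics, branch_metrics, predecessors):
    """For each state: add branch metrics to predecessor path metrics,
    compare the candidates, and select the minimum (the survivor)."""
    new_metrics, decisions = [], []
    for state, preds in enumerate(predecessors):
        candidates = [path_metrics[p] + branch_metrics[p][state] for p in preds]
        best = min(range(len(candidates)), key=candidates.__getitem__)
        new_metrics.append(candidates[best])
        decisions.append(preds[best])  # survivor, kept for traceback
    return new_metrics, decisions

# Toy data: predecessors of state s in a 4-state shift-register trellis,
# and a made-up branch-metric table bm[pred][state].
predecessors = [[0, 2], [0, 2], [1, 3], [1, 3]]
pm = [0, 1, 2, 3]
bm = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 3, 0, 1], [3, 2, 1, 0]]

new_metrics, decisions = acs_step(pm, bm, predecessors)
print(new_metrics, decisions)  # [0, 1, 4, 3] [0, 0, 1, 1]
```

Because each new metric depends on the previous step's metrics through a min(), the recursion is strictly sequential per step; breaking that dependency is what algorithm-level parallelization has to address.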

43 citations


Proceedings ArticleDOI
18 Nov 1996
TL;DR: Two digital receiver algorithms for processing an extended range of variable sample rates are proposed. One is based on filtering the received samples prior to timing synchronization, whereas the second increases the sample rate in the timing recovery loop and matched filter.
Abstract: The evolving digital television broadcasting standard does not standardize the data rate of the transmitted data but instead leaves it completely unspecified. We propose two digital receiver algorithms for processing an extended range of variable sample rates. The algorithms are compared in terms of their complexity and performance. One algorithm is based on filtering the received samples prior to timing synchronization, whereas the second increases the sample rate in the timing recovery loop and matched filter. Both algorithms can be implemented with negligible loss in a DVB receiver.
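The resampling idea behind both approaches — producing output samples at a rate unrelated to the input rate — can be illustrated with the simplest possible fractional-delay interpolator. A real DVB receiver would use a much better interpolation filter; this linear version only shows the mechanism.

```python
# Minimal fractional-delay resampler: step through the input at a
# non-integer stride and linearly interpolate between adjacent samples.
# This is an illustration of the resampling principle, not the paper's
# receiver structure.

def resample_linear(samples, ratio):
    """Resample by `ratio` (output rate / input rate) using linear
    interpolation between adjacent input samples."""
    out, t = [], 0.0
    step = 1.0 / ratio
    while t <= len(samples) - 1:
        i = int(t)
        mu = t - i                                   # fractional interval
        right = samples[min(i + 1, len(samples) - 1)]
        out.append((1 - mu) * samples[i] + mu * right)
        t += step
    return out

print(resample_linear([0.0, 1.0, 2.0, 3.0], 2.0))
# -> [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

In a timing-recovery loop, the fractional interval `mu` would be driven by the timing-error detector rather than by a fixed ratio.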

28 citations


Proceedings ArticleDOI
06 Nov 1996
TL;DR: Three main components of the exploration environment are covered: the benchmarking methodology (DSPstone), fast processor simulation (SuperSim), and the machine description language (LISA). Together they allow exploration of a much larger design space than was possible with standard processor simulators.
Abstract: In this paper the problem of processor/compiler codesign for digital signal processing and embedded systems is discussed. The main principle we follow is a top-down approach characterized by extensive simulation and quantitative performance evaluation of both processor and compiler. Although well established in the design of state-of-the-art general-purpose processors and compilers, this approach is rarely followed by leading producers of signal and embedded processors. As a consequence, the matching between processor and compiler is poor. We focus on three main components of our exploration environment: the benchmarking methodology (DSPstone), fast processor simulation (SuperSim), and the machine description language (LISA). Most of the paper is devoted to the technique of compiled processor simulation. The speedup obtained allows exploration of a much larger design space than was possible with standard processor simulators.

22 citations


Proceedings ArticleDOI
30 Oct 1996
TL;DR: The automated generation of components for high-throughput, data-flow-dominated VLSI systems in digital communications is described by means of a hierarchically organized library; the design environment ComBox enhances reusability and enables rapid implementation of complex systems starting from a system-level description.
Abstract: We describe the automated generation of components for high-throughput, data-flow-dominated VLSI systems in digital communications. By means of a hierarchically organized library, both behavioural models with high simulation efficiency and corresponding hardware generators that produce sophisticated VHDL descriptions are made easily accessible to the system designer. The structured approach allows the evaluation of trade-offs between alternatives at each design step and guarantees a fast and reliable design flow towards hardware. The design environment ComBox enhances reusability and enables rapid implementation of complex systems starting from a system-level description.

7 citations


Book ChapterDOI
01 Jan 1996
TL;DR: It is shown that the first demodulation algorithm is superior both in performance and computational complexity and exhibits the best robustness properties in the case of signal impairments.
Abstract: For MSK, three demodulators are compared. The first demodulation algorithm, partially coherent demodulation, is based on a classical matched filter approach combined with feedforward phase synchronization; the second algorithm, block demodulation, is based on minimizing a distance measure between the trial symbol vector and the observed differential phase vector. The third algorithm uses the same distance measure, but the minimization is carried out with the Viterbi algorithm. We provide a derivation of the second algorithm. It is shown that the first approach is superior both in performance and computational complexity. The first algorithm also exhibits the best robustness properties in the case of signal impairments.

6 citations


Book ChapterDOI
01 Jan 1996
TL;DR: This chapter addresses the process of implementing complex functions by an appropriate combination of application specific hardware and software modules in telecommunication product design.
Abstract: In the most general terms, telecommunication product design can be defined as the process of implementing complex functions by an appropriate combination of application-specific hardware and software modules. In the future, one of the most important assets of a successful company will be the mastering of this product development process. In this chapter we address this process.

Proceedings ArticleDOI
30 Oct 1996
TL;DR: This work analyzes and compares silicon real-estate and throughput of word-parallel arithmetic circuits (add and shift type arithmetic) based on various redundant number representations and compares these results with the automatically optimized two's complement implementations.
Abstract: All the commercially available logic-synthesis tools currently use only (non-redundant) binary and two's complement number representations for the results of arithmetic operators. We analyze and compare the silicon real estate and throughput of word-parallel arithmetic circuits (add-and-shift-type arithmetic) based on various redundant number representations, and compare these results with automatically optimized two's complement implementations. The literature on redundant number representations typically recommends radix-4 arithmetic for a full-custom or traditional semi-custom design style. We show that the radix-4 implementation is often not optimal for a logic-synthesis-based semi-custom design style; instead, a high-radix or a mixed-radix implementation (which we derive) should be considered.
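Carry-save form is the simplest of the redundant representations such comparisons consider: keeping a result as two words removes carry propagation from every addition except the final conversion. A minimal sketch of the idea (in software, standing in for the hardware adder array):

```python
# Carry-save addition: a full-adder array compresses three operands into
# two (sum word, carry word) with no carry propagation. Only the final
# conversion back to non-redundant form needs a carry-propagate add.

def csa(a, b, c):
    """3:2 compressor over whole words: three inputs -> (sum, carry)."""
    s = a ^ b ^ c                              # bitwise sum, no carries
    cy = ((a & b) | (a & c) | (b & c)) << 1    # carries, shifted into place
    return s, cy

# Accumulate several operands without any carry-propagate additions:
s, c = 0, 0
for operand in [13, 7, 22, 5]:
    s, c = csa(s, c, operand)

# A single carry-propagate add only at the very end:
print(s + c)  # 47
```

In hardware, each `csa` level has constant delay regardless of word length, which is exactly the latency advantage redundant representations trade against the extra wires and the final converter.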

Book ChapterDOI
01 Jan 1996
TL;DR: The process of designing an ASIC implementation of a digital receiver is carried out on different levels of abstraction and often involves the error-prone transition between different description styles which imposes obstacles on the joint optimization of algorithm and architecture.
Abstract: The process of designing an ASIC implementation of a digital receiver is carried out on different levels of abstraction. This often involves the error-prone transition between different description styles which imposes obstacles on the joint optimization of algorithm and architecture.

Journal ArticleDOI
TL;DR: The joint design process leading to an ASIC chipset accelerating the execution of rule-based systems is described, and the interaction between the algorithm used for software implementation and the parallel algorithm suited for hardware implementation is examined.
Abstract: The move towards higher levels of abstraction in hardware design is beginning to blur the difference between hardware and software design. Nevertheless, the attractiveness of a software implementation is still defined by the much smaller abstraction gap between specification and implementation. Hardware design, on the other hand, creates the possibility of exploiting parallelism at a very fine level of granularity and thereby achieving tremendous performance gains with a moderate expenditure of hardware. This paper describes the joint design process leading to an ASIC chipset accelerating the execution of rule-based systems. The interaction between the algorithm used for software implementation and the parallel algorithm suited for hardware implementation is examined. An area-efficient implementation of the programmable hardware was enabled by an application-specific compiler backend. The heuristics applied by the optimising "code" generator are discussed quantitatively.