
Showing papers by "Keshab K. Parhi published in 1991"


Journal ArticleDOI
Abstract: Rate-optimal compile-time multiprocessor scheduling of iterative dataflow programs suitable for real-time signal processing applications is discussed. It is shown that recursions or loops in the programs lead to an inherent lower bound on the achievable iteration period, referred to as the iteration bound. A multiprocessor schedule is rate-optimal if the iteration period equals the iteration bound. Systematic unfolding of iterative dataflow programs is proposed, and properties of unfolded dataflow programs are studied. Unfolding increases the number of tasks in a program, unravels the hidden concurrency in iterative dataflow programs, and can reduce the iteration period. A special class of iterative dataflow programs, referred to as perfect-rate programs, is introduced. Each loop in these programs has a single register. Perfect-rate programs can always be scheduled rate optimally (requiring no retiming or unfolding transformation). It is also shown that unfolding any program by an optimum unfolding factor transforms it into an equivalent perfect-rate program, which can then be scheduled rate optimally. This optimum unfolding factor for an arbitrary program is the least common multiple of the numbers of registers (or delays) in all loops and is independent of the node execution times. An upper bound on the number of processors for rate-optimal scheduling is given.

390 citations
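The two quantities at the heart of this abstract can be computed directly from the loop structure of a dataflow graph. A minimal sketch in Python, using hypothetical loop data (total computation time and delay count per loop are made-up values, not from the paper):

```python
from math import gcd
from functools import reduce

# Hypothetical loops of a dataflow graph: (total loop computation time,
# number of registers/delays in the loop).
loops = [(4, 1), (6, 2), (9, 3)]

# Iteration bound: the maximum over all loops of loop time / loop delays.
iteration_bound = max(t / d for t, d in loops)   # here max(4, 3, 3) = 4.0

def lcm(a, b):
    return a * b // gcd(a, b)

# Optimum unfolding factor: least common multiple of the loop delay
# counts, independent of node execution times.
unfolding_factor = reduce(lcm, (d for _, d in loops))  # lcm(1, 2, 3) = 6
```

Unfolding by this factor leaves every loop of the unfolded graph with a single register, i.e. a perfect-rate program.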


Journal ArticleDOI
TL;DR: In this article, a systematic unfolding transformation technique for transforming bit-serial architectures into equivalent digit-serial ones is presented, where the novel feature of the unfolding technique lies in the generation of functionally correct control circuits in the digit-serial architectures.
Abstract: A systematic unfolding transformation technique for transforming bit-serial architectures into equivalent digit-serial ones is presented. The novel feature of the unfolding technique lies in the generation of functionally correct control circuits in the digit-serial architectures. For some applications bit-serial architectures may be too slow, and bit-parallel architectures may be faster than necessary and may require too much hardware. The desired sample rate in these applications can be achieved using the digit-serial approach, where multiple bits of a sample are processed in a single clock cycle. The number of bits processed in one clock cycle in digit-serial systems is referred to as the digit size; the digit size can be any arbitrary integer (the digit size was restricted to be a divisor of the word length in past ad hoc designs). Digit-serial implementation of two's complement adders and multipliers is described. Least-significant-bit-first bit-serial implementations of two's complement division, square-root, and compare-select operations are presented, and the corresponding digit-serial architectures for these operations are obtained using the unfolding algorithm. Unfolding of multiple-rate operations (such as interpolators and decimators) is also addressed.

189 citations
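A bit-level model can clarify the digit-size idea from this abstract: each clock cycle consumes one digit of each operand and a carry held over from the previous cycle. The sketch below is illustrative only (unsigned words for brevity, and the operand values and digit size are hypothetical), not the paper's control-circuit construction:

```python
def to_digits(x, digit_size, n_digits):
    """Split an unsigned word into least-significant-first digits."""
    base = 1 << digit_size
    return [(x >> (i * digit_size)) % base for i in range(n_digits)]

def digit_serial_add(a_digits, b_digits, digit_size):
    """Model of a digit-serial adder: each 'clock cycle' consumes one
    digit of each operand plus a carry register, and emits one digit."""
    base = 1 << digit_size
    carry, out = 0, []
    for da, db in zip(a_digits, b_digits):
        s = da + db + carry
        out.append(s % base)     # digit produced this cycle
        carry = s // base        # carry held over to the next cycle
    return out, carry

# Digit size 3 for 8-bit words: 3 is not a divisor of 8, which the
# unfolding-based approach permits (3 digits cover the word).
a, b = 181, 110
sum_digits, carry = digit_serial_add(to_digits(a, 3, 3), to_digits(b, 3, 3), 3)
result = sum(d << (3 * i) for i, d in enumerate(sum_digits)) + (carry << 9)
```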


Journal ArticleDOI
TL;DR: In this article, look-ahead computation techniques were successfully applied to create necessary concurrency in linear recursive and some nonlinear recursive operations (such as the add-compare-select operation).
Abstract: High-speed implementation of signal processing algorithms for digital transmission is addressed. The internal feedback or recursion in these algorithms makes it difficult to implement recursive systems concurrently using either pipelining or parallelism. In the past, look-ahead computation techniques were successfully applied to create the necessary concurrency in linear recursive and some nonlinear recursive operations (such as the add-compare-select operation). Novel computation approaches are proposed, and the look-ahead technique is extended to pipeline feedback loops containing finite-level quantizers. Approaches to pipeline piecewise-linear recursive systems are presented. The proposed architectures are suitable for real-time high-speed implementation of quantizer loop operations where the number of quantizer levels and the order of the loops are small.

96 citations
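The classic look-ahead transformation underlying this line of work replaces a one-step recursion with an M-step one, so the feedback loop spans M samples and leaves room for M pipeline stages. A minimal numeric check of the equivalence for the first-order linear recursion y[n] = a·y[n-1] + x[n] (input values hypothetical):

```python
def direct(a, x):
    """First-order recursion y[n] = a*y[n-1] + x[n], zero initial state."""
    y, prev = [], 0.0
    for xn in x:
        prev = a * prev + xn
        y.append(prev)
    return y

def look_ahead(a, x, M):
    """M-step look-ahead form:
    y[n] = a^M * y[n-M] + sum_{i=0}^{M-1} a^i * x[n-i]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = (a ** M) * y[n - M] if n >= M else 0.0
        for i in range(min(M, n + 1)):
            acc += (a ** i) * x[n - i]
        y[n] = acc
    return y

x = [1.0, -0.5, 2.0, 0.25, -1.0, 0.5, 1.5, -2.0]
y1, y2 = direct(0.9, x), look_ahead(0.9, x, 3)
match = all(abs(u - v) < 1e-9 for u, v in zip(y1, y2))
```

The paper's contribution is the harder case in which the loop also contains a finite-level quantizer, where this linear identity no longer applies directly.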


Proceedings ArticleDOI
14 Oct 1991
TL;DR: This paper addresses the design of high-speed architectures for fixed-point, two's-complement, bit-parallel division, square-root, and multiplication operations, and presents a fast, new scheme for converting radix-2 redundant numbers to two's-complement binary numbers, which is used to design a bit-parallel multiplier.
Abstract: The design of high-speed architectures is addressed for fixed-point, two's-complement, bit-parallel, pipelined multiplication, division, and square-root operations. The architectures presented make use of hybrid number representations (i.e., the input and output numbers are represented using two's-complement representation, and the internal numbers are represented using radix-2 redundant representation). A fast, new scheme for converting radix-2 redundant numbers to two's-complement binary numbers is presented, and this is used to design a reduced-latency bit-parallel multiplier. The novel sign-multiplexing scheme helps detect the sign of a redundant number very quickly and is used in combination with the remainder conditioning scheme to achieve very high speed in fixed-point division and square-root operators. These architectures require fewer pipelining latches than their conventional two's-complement counterparts. Reduction in latency without sacrificing clock speed has resulted in reduced computation time for these operations.

42 citations
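The hybrid-number idea rests on converting radix-2 redundant (signed-digit) numbers back to two's complement. The sketch below shows only the arithmetic identity behind such a conversion (positive part minus negative part, wrapped to the word width), not the paper's fast conversion or sign-multiplexing schemes:

```python
def sd_value(digits):
    """Value of a radix-2 signed-digit number, least-significant digit
    first, each digit in {-1, 0, 1}."""
    return sum(d << i for i, d in enumerate(digits))

def sd_to_twos_complement(digits, width):
    """Split the signed-digit number into positive and negative unsigned
    parts and take their difference, wrapped to `width` bits."""
    pos = sum(1 << i for i, d in enumerate(digits) if d == 1)
    neg = sum(1 << i for i, d in enumerate(digits) if d == -1)
    return (pos - neg) & ((1 << width) - 1)

seven = sd_to_twos_complement([1, -1, 0, 1], 8)    # 1 - 2 + 8 = 7
minus5 = sd_to_twos_complement([-1, 0, -1], 8)     # -1 - 4 = -5 -> 0xFB
```

The subtraction is where a naive converter pays a carry-propagation delay; the paper's scheme is aimed at reducing exactly that latency.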


Journal ArticleDOI
TL;DR: The sequential DP algorithm is implemented using fewer, finer-grain pipelined processors, and increased hardware efficiency is achieved by using a novel computation sequence.
Abstract: Novel computation techniques to achieve fine-grain pipelining in forward dynamic programming (DP) architectures are proposed. The sequential DP algorithm is implemented using fewer, finer-grain pipelined processors, and increased hardware efficiency is achieved by using a novel computation sequence. Look-ahead computation is used to obtain a concurrent DP algorithm, and it is used in combination with an appropriate computation sequence to achieve further pipelining in DP architectures. The finer-grain pipelined architectures are mapped to ring and mesh processor arrays, and approximately the same iteration rate is achieved as with the coarse-grain pipelined architectures, but with much less hardware. The design of interleaved architectures using multiple clocks is also outlined.

24 citations


Proceedings ArticleDOI
04 Nov 1991
TL;DR: In this article, a pipelining method in lattice digital filters is introduced, based on a constrained IIR (infinite impulse response) digital filter design method by which pipelined direct-form filters are designed.
Abstract: A pipelining method in lattice digital filters is introduced. This pipelining method is based on a constrained IIR (infinite impulse response) digital filter design method by which pipelined direct-form filters are designed. These direct-form filters are transformed to pipelined lattice digital filters. It is shown that the roundoff error and the number of multiply/add operations of the resulting pipelined lattice filters are smaller than those of the pipelined lattice filters obtained by applying look-ahead on direct-form nonpipelined digital filters.

23 citations


Journal ArticleDOI
TL;DR: Contrary to common belief, it is concluded that pole-zero-canceling scattered look-ahead pipelined recursive filters have good finite-word-length error properties.
Abstract: High-sample-rate recursive filtering can be achieved by transforming the original filters to higher-order filters using the scattered look-ahead computation technique (which relies upon pole-zero cancellation). Finite-word-length implementation of these filters leads to inexact pole-zero cancellation. This necessitates a thorough study of finite-word-length effects in these filters. Theoretical results on roundoff and coefficient quantization errors in these filters are presented. It is shown that, to maintain the same error at the filter output, the word length needs to be increased by at most log2(log2(2M)) bits for a scattered look-ahead decomposed filter (where M is the level of loop pipelining). This worst case corresponds to all poles being close to zero. For M between two and eight, the word length needs to be increased by only 1 or 2 bits. Contrary to common belief, it is concluded that pole-zero-canceling scattered look-ahead pipelined recursive filters have good finite-word-length error properties.

21 citations
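The stated word-length bound is easy to check numerically. A quick sketch evaluating ceil(log2(log2(2M))) for M from 2 to 8, confirming the abstract's "1 or 2 bits" claim:

```python
import math

# Worst-case extra word length for a scattered look-ahead filter:
# at most log2(log2(2M)) bits, where M is the level of loop pipelining.
extra_bits = {M: math.ceil(math.log2(math.log2(2 * M))) for M in range(2, 9)}
# M = 2 gives log2(log2(4)) = log2(2) = 1 bit; M = 8 gives
# log2(log2(16)) = log2(4) = 2 bits.
```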


Proceedings ArticleDOI
14 Apr 1991
TL;DR: Methodologies are addressed for high-level synthesis of dedicated digital signal processing (DSP) architectures using the Minnesota Architecture Synthesis (MARS) design system and algorithms are given for concurrent scheduling and resource allocation for systematic synthesis of DSP architectures.
Abstract: Methodologies are addressed for high-level synthesis of dedicated digital signal processing (DSP) architectures using the Minnesota Architecture Synthesis (MARS) design system. The MARS system is capable of exploring a wide design space because the authors' synthesis algorithms can accommodate multiple implementation styles. Algorithms are given for concurrent scheduling and resource allocation for systematic synthesis of DSP architectures. The algorithms exploit inter-iteration and intra-iteration precedence constraints, and produce results as good as or better than those previously published. Synthesis is accommodated with multiple implementation styles to reduce overall hardware costs; systematic synthesis of such architectures has not been explored before. To improve the quality of the final schedule, the algorithm utilizes implicit retiming and pipelining of the data flow graph.

21 citations


Proceedings ArticleDOI
11 Jun 1991
TL;DR: The authors discuss the iterative algorithm which calculates the minimum unfolding factor necessary to achieve a given sample rate with and without retiming, utilized within the MARS (Minnesota Architecture Synthesis) design system.
Abstract: A method of determining the minimum unfolding factor needed to synthesize a data path for a given sample rate is presented. Minimizing the unfolding factor is important because the time complexity of scheduling and allocation increases linearly with the unfolding factor. The authors discuss the iterative algorithm which calculates the minimum unfolding factor necessary to achieve a given sample rate with and without retiming. This algorithm is utilized within the MARS (Minnesota Architecture Synthesis) design system to preprocess a dataflow graph prior to resource scheduling and allocation.

21 citations
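One simplified way to model such a search: find the smallest unfolding factor J for which an integer-cycle schedule of J consecutive iterations meets the target period. This is a hedged sketch of the feasibility test only (the condition and values are illustrative assumptions, not the paper's exact algorithm):

```python
import math

def min_unfolding_factor(iteration_bound, target_period, j_max=64):
    """Smallest J such that a schedule with an integer number of cycles
    over J iterations meets the target: ceil(J * T_inf) <= J * T_target.
    A simplified model of the search, not the paper's algorithm."""
    for J in range(1, j_max + 1):
        if math.ceil(J * iteration_bound) <= J * target_period:
            return J
    return None

# Iteration bound 7/3, target period 2.5 cycles per sample:
# J = 1 needs ceil(7/3) = 3 > 2.5; J = 2 gives ceil(14/3) = 5 <= 5.
J = min_unfolding_factor(7 / 3, 2.5)
```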


Proceedings ArticleDOI
04 Nov 1991
TL;DR: A comparison with previous work indicates that the novel architecture has the least increase in hardware requirements and at the same time has the highest convergence speed in seconds.
Abstract: A fine-grain pipelined architecture for least-mean-square (LMS) filtering is developed by employing a stochastic form of look-ahead. With the stochastic form of look-ahead, one can look for acceptable convergence behavior rather than invariance with respect to the input-output mapping. This architecture offers a trade-off between a variable output latency and adaptation accuracy. Analytical expressions describing the convergence properties are provided. A comparison with previous work indicates that the novel architecture has the least increase in hardware requirements and at the same time the highest convergence speed in seconds. Simulation results confirm the analytical expressions.

20 citations
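A common concrete instance of such relaxed (stochastic) look-ahead is a delayed coefficient update, where the weights are corrected with an error computed several samples earlier so the update loop can be pipelined. The sketch below is a generic delayed-LMS model with hypothetical signals and parameters, not the paper's architecture:

```python
import random

def delayed_lms(x, d, mu, D, taps):
    """LMS with a delay-relaxed update: coefficients are corrected with
    the error from D samples ago, opening the feedback loop for
    pipelining at the cost of slightly altered convergence."""
    w = [0.0] * taps
    xs, es, err = [], [], []
    for n in range(len(x)):
        xn = [x[n - i] if n - i >= 0 else 0.0 for i in range(taps)]
        y = sum(wi * xi for wi, xi in zip(w, xn))
        e = d[n] - y
        xs.append(xn); es.append(e); err.append(abs(e))
        if n >= D:  # gradient computed D cycles earlier
            w = [wi + mu * es[n - D] * xi for wi, xi in zip(w, xs[n - D])]
    return w, err

random.seed(1)
x = [random.uniform(-1, 1) for _ in range(600)]
d = [0.5 * x[n] - 0.3 * (x[n - 1] if n > 0 else 0.0) for n in range(600)]
w, err = delayed_lms(x, d, mu=0.05, D=2, taps=2)
```

With a small step size the delayed update still converges toward the true weights (here 0.5 and -0.3), which is the "acceptable convergence behavior" trade-off the abstract describes.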


Proceedings ArticleDOI
14 Apr 1991
TL;DR: It is shown that register allocation techniques can lead to up to 50% savings in hardware area, as compared with converter architectures designed in a straightforward manner.
Abstract: The authors use lifetime analysis and propose systematic register allocation techniques to reuse registers; register reuse leads to data format converter architectures with fewer registers. A simple forward-circulate allocation scheme is proposed to motivate the use of register allocation, and a more efficient forward-backward register allocation scheme is then proposed. Examples of data converters presented include matrix transposers and serial-to-parallel and parallel-to-serial converters. General m-to-n bit-parallel and bit-serial converters are also studied. It is shown that register allocation techniques can lead to up to 50% savings in hardware area compared with converter architectures designed in a straightforward manner.
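The lower bound that lifetime analysis provides is the maximum number of simultaneously live values: no allocation scheme can use fewer registers than that. A minimal sketch with hypothetical lifetimes (the allocation schemes themselves are not reproduced here):

```python
def min_registers(lifetimes):
    """Minimum registers = peak number of simultaneously live values.
    lifetimes: (birth, death) time steps; a value is live on [birth, death)."""
    events = []
    for b, d in lifetimes:
        events.append((b, +1))   # value becomes live
        events.append((d, -1))   # value dies (death sorts before same-time birth)
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

# Hypothetical lifetimes for a small format converter.
regs = min_registers([(0, 3), (1, 4), (2, 5), (4, 6)])
```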

Proceedings ArticleDOI
04 Nov 1991
TL;DR: The author presents pipelined and parallel architectures for high-speed implementation of Huffman decoders using look-ahead computation techniques, and suggests a solution to the problem posed by the unequal lengths of the Huffman code words.
Abstract: The author presents pipelined and parallel architectures for high-speed implementation of Huffman decoders using look-ahead computation techniques. Huffman decoders are used in high-definition television, video, and other data compression systems. The achievable speed in these decoders is inherently limited by the sequential nature of the computation. The unequal lengths of the Huffman code words make it difficult to apply look-ahead. This problem is overcome by representing Huffman decoders as finite state machines, which can exploit look-ahead. The proposed approach is useful for high-speed Huffman decoder implementations where the number of symbols of the decoder is low.
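The finite-state-machine view can be sketched directly: states are positions in the code trie, each input bit is a transition, and reaching a leaf emits a symbol and resets to the root (look-ahead would then precompute multi-bit composite transitions). The code table below is hypothetical, not from the paper:

```python
def build_decoder(codebook):
    """Represent a Huffman decoder as a state machine over the code trie:
    one input bit per transition; a leaf emits its symbol and resets the
    state to the root."""
    root = {}
    for sym, code in codebook.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = sym              # leaf: emit this symbol

    def decode(bits):
        out, state = [], root
        for bit in bits:
            nxt = state[bit]
            if isinstance(nxt, dict):
                state = nxt               # stay inside the trie
            else:
                out.append(nxt)           # emit symbol, return to root
                state = root
        return out

    return decode

# Hypothetical prefix code with three symbols.
decode = build_decoder({'a': '0', 'b': '10', 'c': '11'})
decoded = decode('0100110')
```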

Proceedings ArticleDOI
11 Jun 1991
TL;DR: The author addresses register minimization in the design of digital signal processing (DSP) data format converter architectures using a novel forward-backward register allocation scheme; examples of converters presented include matrix transposers and general data format converters.
Abstract: The author addresses register minimization in the design of digital signal processing (DSP) data format converter architectures. Systematic lifetime analysis is used to calculate the minimum number of registers needed for any arbitrarily specified data format converter. The minimum number of registers can be used to design a data format converter architecture using a novel forward-backward register allocation scheme. The number of registers needed in this scheme is about half of that needed in the forward register allocation scheme. Examples of converters presented include matrix transposers and general (m, d1)-to-(n, d2) data format converters. The (m, d1)-to-(n, d2) converter inputs m words of d1 bits per word in one input cycle and outputs n words of d2 bits per word in one output cycle (d1 and d2 lie between 1 and the word length w).


Proceedings ArticleDOI
14 Apr 1991
TL;DR: Using Gaussian-Markov source and speech signal benchmarks, it is shown that these new approaches lead to distortion as good as or better than that obtained using the LBG and Kohonen approaches.
Abstract: Many techniques for quantizing large sets of input vectors into much smaller sets of output vectors have been developed. Various neural-network-based techniques for generating the output vectors via system training are studied. The variations are centered around a neural net vector quantization (NNVQ) method which combines the well-known conventional Linde, Buzo, and Gray (1980) (LBG) technique and the neural-net-based Kohonen (1984) technique. Sequential and parallel learning techniques for designing efficient NNVQs are given. The schemes presented require less computation time due to a new modified gain formula, partial/zero neighbor updating, and parallel learning of the code vectors. Using Gaussian-Markov source and speech signal benchmarks, it is shown that these new approaches lead to distortion as good as or better than that obtained using the LBG and Kohonen approaches.
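A generic sequential competitive-learning update of the kind combined here can be sketched as follows. The decaying gain is a placeholder, not the paper's modified gain formula; neighbor updating is omitted, and scalar data stands in for vectors:

```python
def train_vq(data, codebook, epochs=20, gain0=0.5):
    """Sequential competitive learning: move the nearest code vector
    toward each training sample with a decaying gain (Kohonen-style
    sketch; the paper's modified gain and partial/zero neighbor
    updating are not reproduced here)."""
    for epoch in range(epochs):
        gain = gain0 / (1 + epoch)        # simple decaying gain schedule
        for s in data:
            # winner: the nearest code vector (squared-error distance)
            j = min(range(len(codebook)), key=lambda k: (codebook[k] - s) ** 2)
            codebook[j] += gain * (s - codebook[j])
    return codebook

# Two well-separated clusters; the code vectors should settle near them.
data = [0.1, -0.2, 0.0, 10.1, 9.9, 10.2]
cb = train_vq(data, [1.0, 9.0])
```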