
Showing papers on "Logarithmic number system" published in 2009


Journal ArticleDOI
TL;DR: The proposed MPC architecture is implemented by means of a hardware description language and then prototyped and emulated on a field-programmable gate array and yields a small-in-size and energy-efficient implementation that is capable of solving the aforementioned problems on the order of milliseconds.
Abstract: This paper presents a hardware architecture for embedded real-time model predictive control (MPC). The computational cost of an MPC problem, which relies on the solution of an optimization problem at every time step, is dominated by operations on real matrices. In order to design an efficient and low-cost application-specific processor, we analyze the computational cost of MPC, and we propose a limited-resource host processor to be connected with an application-specific matrix coprocessor. The coprocessor uses a 16-b logarithmic number system arithmetic unit, which is designed using cotransformation, to carry out the required arithmetic operations. The proposed architecture is implemented by means of a hardware description language and then prototyped and emulated on a field-programmable gate array. Results on computation time and architecture area are presented and analyzed, and the functionality of the proposed architecture is verified using two case studies: a linear problem of a rotating antenna and a nonlinear glucose-regulation problem. The proposed MPC architecture yields a small-in-size and energy-efficient implementation that is capable of solving the aforementioned problems on the order of milliseconds, and we compare its performance and area requirements with other MPC designs that have appeared in the literature.

95 citations


Journal ArticleDOI
TL;DR: The novelty of the approach lies in the fact that it performs interpolation efficiently, without the need to perform multiplication or division, and the method performs both the log() and antilog() operation using the same hardware architecture.
Abstract: The realization of functions such as log() and antilog() in hardware is of considerable relevance, due to their importance in several computing applications. In this paper, we present an approach to compute log() and antilog() in hardware. Our approach is based on a table lookup, followed by an interpolation step. The interpolation step is implemented in combinational logic, in a field-programmable gate array (FPGA), resulting in an area-efficient, fast design. The novelty of our approach lies in the fact that we perform interpolation efficiently, without the need to perform multiplication or division, and our method performs both the log() and antilog() operation using the same hardware architecture. We compare our work with existing methods, and show that our approach results in significantly lower memory resource utilization, for the same approximation errors. Also our method scales very well with an increase in the required accuracy, compared to existing techniques.

89 citations


Journal ArticleDOI
TL;DR: A lower error and ROM-free logarithmic converter that reduces the overhead of computation-intensive operations for real-time digital-signal-processing applications and outperforms previously proposed one-region and two-region conversion methods.
Abstract: In this brief, we propose a lower error and ROM-free logarithmic converter. The proposed converter can lead to area-efficient hardware implementation as it avoids the need for a ROM by employing simple computation units for logarithmic approximation. Our proposed logarithmic conversion algorithm partitions the exact logarithmic curve into two symmetric regions such that the slopes in the two regions that are used for logarithmic approximation are inversed. Simulation results show that the proposed algorithm achieves an error range and percentage error range of only 0.045 and 3.339%, respectively, which outperforms previously proposed one-region and two-region conversion methods. We have implemented the proposed logarithmic converter using 0.13-μm CMOS technology, and the latency is 2.8 ns. The proposed converter can be used to reduce the overhead of computation-intensive operations for real-time digital-signal-processing applications.

64 citations
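
As background for the ROM-free converter above, the sketch below shows Mitchell's classic approximation, log2(1 + m) ≈ m, which needs no lookup table. It is not the two-region algorithm from the abstract, whose details are not given here; the function names `mitchell_log2` and `max_abs_error` are hypothetical and chosen for illustration only.

```python
import math


def mitchell_log2(n: int) -> float:
    """Approximate log2(n) for a positive integer without a lookup table.

    Assumption: this is Mitchell's approximation, shown as background only;
    it is not the two-region converter described in the abstract.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1                 # position of the leading one
    mantissa = (n - (1 << k)) / (1 << k)   # fraction m in [0, 1); a shift in hardware
    return k + mantissa                    # log2(1 + m) is approximated by m


def max_abs_error(bits: int = 12) -> float:
    """Largest absolute error of mitchell_log2 over all positive `bits`-bit inputs."""
    return max(abs(math.log2(n) - mitchell_log2(n)) for n in range(1, 1 << bits))
```

For example, `mitchell_log2(12)` returns 3.5, while the exact value is about 3.585. The worst-case absolute error of this approximation is roughly 0.086, which is the kind of figure that region-based converters aim to reduce.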


Journal ArticleDOI
Byeong-Gyu Nam, Hoi-Jun Yoo
TL;DR: A 4-way 32-bit stream processor core developed for handheld low-power 3-D graphics systems achieves single-cycle throughput for all of its operations except matrix-vector multiplication, which takes 2 cycles per result compared with 4 cycles in the conventional approach.
Abstract: A low-power and high-performance 4-way 32-bit stream processor core is developed for handheld low-power 3-D graphics systems. It contains a floating-point unified matrix, vector, and elementary function unit. By exploiting logarithmic arithmetic and the proposed adaptive number conversion scheme, a 4-way arithmetic unit achieves single-cycle throughput for all these operations except matrix-vector multiplication, which takes 2 cycles per result compared with 4 cycles in the conventional approach. The processor, which features this functional unit and several proposed architectural schemes including embedded register index calculations, functional unit reconfiguration, and operand forwarding in the logarithmic domain, achieves a 19.1% cycle count reduction for the OpenGL transformation and lighting (TnL) operation compared with the latest work. The proposed stream processor core is integrated into a 3-D graphics SoC as a vertex shader to show its effectiveness. The entire SoC is fabricated into a test chip using 1-poly 6-metal 0.18 μm CMOS technology. The 17.2 mm² chip contains 1.57 M transistors and 29 kB SRAM. The stream processor core takes 9.7 mm² and dissipates 86.8 mW at 200 MHz operating frequency. It shows a peak performance of 141 Mvertices/s for geometry transformation (TFM) and achieves a 17.5% performance improvement and 44.7% and 39.4% power and area reductions for the TFM compared with the latest work. For power management of the SoC, the chip is divided into three power domains separately controlled by dynamic voltage and frequency scaling (DVFS). With this scheme, it shows 52.4 mW power consumption at 60 fps, a 50.5% power reduction from the latest work.

50 citations


Journal ArticleDOI
TL;DR: A two-dimensional systolic array QR decomposition is implemented on a Xilinx Virtex5 FPGA using the Givens rotation algorithm, which uses straightforward floating-point divide and square root implementations, which makes it easier to be used within a larger system.
Abstract: We have implemented a two-dimensional systolic array QR decomposition on a Xilinx Virtex5 FPGA using the Givens rotation algorithm. QR decomposition is a key step in many DSP applications including sonar beamforming, channel equalization, and 3G wireless communication. Compared to previous work that implements Givens rotations using a one-dimensional systolic array, our implementation uses a truly two-dimensional systolic array architecture. As a result, latency scales well for larger matrices. In addition, prior work avoids divide and square root operations in the Givens rotation algorithm by using special operations such as CORDIC or special number systems such as the logarithmic number system (LNS). In contrast, our design uses straightforward floating-point divide and square root implementations, which makes it easier to be used within a larger system. In our design, the input matrix size can be configured at compile time to many different sizes, making it easily scalable to future large FPGAs or over multiple FPGAs. The QR module is fully pipelined with a throughput of over 130 MHz for the IEEE single-precision floating-point format. The peak performance for a 12 × 12 input matrix is approximately 35 GFLOPs.

34 citations
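
The Givens rotation algorithm named in the abstract above can be sketched in software as follows. This is a sequential illustration only, assuming a dense matrix stored as a list of rows; it does not model the systolic array or the FPGA pipeline, and the function name `givens_qr` is hypothetical. Note where the square root (`math.hypot`) and the divides appear, since those are the operations the abstract says other designs avoid with CORDIC or LNS.

```python
import math


def givens_qr(a):
    """Return (q, r) such that a = q * r, with r upper triangular.

    Sequential sketch of QR decomposition by Givens rotations; it is not
    the paper's two-dimensional systolic array implementation.
    """
    m, n = len(a), len(a[0])
    r = [[float(x) for x in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the entries below the diagonal in column j, bottom up.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)          # square root step
            if h == 0.0:
                continue                  # nothing to eliminate
            c, s = x / h, y / h           # divide steps
            # Apply the rotation to rows i-1 and i of r.
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # Accumulate the transposed rotation into columns i-1 and i of q.
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
    return q, r
```

Calling `givens_qr([[6, 5, 0], [5, 1, 4], [0, 4, 3]])` returns an orthogonal `q` and an upper-triangular `r` whose product reproduces the input up to floating-point rounding.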


Proceedings ArticleDOI
23 Jun 2009
TL;DR: Utilization of a bag of covariance matrices as the object descriptor improves the object recognition accuracy while speeding up the learning process, and an efficient architecture for a generic object recognition system based on an ensemble classifier in an FPGA environment is described.
Abstract: We describe an efficient architecture for a generic object recognition system based on an ensemble classifier in a Field Programmable Gate Array (FPGA) environment. Utilization of a bag of covariance matrices as the object descriptor improves the object recognition accuracy while speeding up the learning process. We extend this technique and present its hardware architecture, as well as an object classifier based on an on-line variant of random forest (RF) implemented using the Logarithmic Number System (LNS). First, we describe the algorithm and architecture of our model, which comprises several computation modules. We then test and verify the model's functionality using numerical simulation on the GRAZ02 dataset. The proposed system is shown to achieve strong performance compared with floating-point and fixed-point precisions, even when only 10% of the training examples are used, and is reasonably power efficient.

7 citations


Journal ArticleDOI
TL;DR: Formulas are derived for the maximum allowable errors of the exponent and logarithm computations within an LNS unit whose precision is better than or equal to that of a comparable IEEE FLP unit; they can be used in the design of an LNS addition/subtraction unit with the direct-computation implementation method.
Abstract: Logarithmic number system (LNS) arithmetic is more efficient than floating-point (FLP) arithmetic for computing some complex functions. However, computing the log2(1 ± 2^(-v)) function in large word-length LNS addition/subtraction requires a large amount of hardware. Direct computation of the log2(1 ± 2^(-v)) function is a promising method for the practical implementation of large word-length LNS arithmetic. The two most important operations in this method are the exponent and logarithm computations. The authors analysed the precision requirement in computing the exponential and logarithmic functions for the direct computation of LNS addition/subtraction. Formulas have been derived for the maximum allowable errors of the exponent and logarithm computations within an LNS unit whose precision is better than or equal to that of a comparable IEEE FLP unit. The simulation results show that these estimation formulas for the two maximum errors are correct and thus can be used in the design of an LNS addition/subtraction unit with the direct-computation implementation method.

3 citations
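
To make the role of the log2(1 ± 2^(-v)) function concrete, here is a minimal sketch of LNS addition and subtraction for positive values, where a number x is stored as its exponent log2(x). It evaluates the exponential and the logarithm directly in double precision, so it illustrates the identity only and says nothing about the paper's error bounds or hardware. The helper names (`lns_encode`, `lns_decode`, `lns_add`, `lns_sub`) are hypothetical, and sign handling is omitted.

```python
import math


def lns_encode(x: float) -> float:
    """Store a positive real as its base-2 logarithm (sign bit omitted)."""
    if x <= 0.0:
        raise ValueError("this sketch handles positive values only")
    return math.log2(x)


def lns_decode(e: float) -> float:
    """Recover the real value from its LNS exponent."""
    return 2.0 ** e


def lns_add(a: float, b: float) -> float:
    """log2(2^a + 2^b) = max(a, b) + log2(1 + 2^(-v)), with v = |a - b|."""
    hi, lo = max(a, b), min(a, b)
    v = hi - lo
    return hi + math.log2(1.0 + 2.0 ** (-v))   # exponent step, then logarithm step


def lns_sub(a: float, b: float) -> float:
    """log2(2^a - 2^b) = a + log2(1 - 2^(-v)), with v = a - b > 0."""
    v = a - b
    if v <= 0.0:
        raise ValueError("this sketch requires a > b so the result is positive")
    return a + math.log2(1.0 - 2.0 ** (-v))
```

For instance, `lns_decode(lns_add(lns_encode(3.0), lns_encode(5.0)))` is 8 up to rounding. The two inner operations, 2^(-v) and log2(1 ± ·), are the exponent and logarithm computations whose allowable errors the abstract discusses.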


Proceedings ArticleDOI
06 May 2009
TL;DR: The Constraint algorithm is suggested to solve the fan-in problem of the Greedy algorithm in designing the encoder circuit of the flash ADC and shows better performance in terms of layout area, power consumption, and operation speed.
Abstract: The DBNR (Double Base Number Representation) is known to represent the Multidimensional Logarithmic Number System for implementing the multiplier-accumulator architecture of DSP (Digital Signal Processing). This paper also uses the DBNR to relieve the bottleneck of DSP arithmetic circuits with the flash ADC (Analog-to-Digital Converter). The Constraint algorithm is suggested to solve the fan-in problem of the Greedy algorithm in designing the encoder circuit of the flash ADC. The Constraint algorithm shows better performance in terms of layout area, power consumption, and operation speed, compared with the FAT tree encoder, which is known as the fastest encoder circuit yielding binary output.

2 citations


Proceedings ArticleDOI
06 May 2009
TL;DR: The high throughput of the computing system can be attained with the use of digit pipelining in the design of the hardware architecture because the latency of the pipeline is short and the convergence rate of the algorithm is exponential.
Abstract: In this research, a hardware algorithm for digit on-line logarithmic computation is proposed. This algorithm is based on a fast digit-parallel logarithmic algorithm that was proposed previously. The drawback of the previous algorithm is that the computation cannot be digit pipelined with other computations. Our new algorithm generates the partial logarithmic result after only some input digits of the operand are available. Thus, high throughput of the computing system can be attained with the use of digit pipelining in the design of the hardware architecture. Furthermore, the latency of the pipeline is short because the convergence rate of the algorithm is exponential. For example, when the word length of the operand is 24, the number of pipeline stages is only four. Based on our proposed digit on-line method, we have designed the architecture of a 24-bit logarithmic unit. The exhaustive test of the 24-bit unit shows that our algorithm and error analysis are correct.

1 citation


Proceedings ArticleDOI
14 Dec 2009
TL;DR: With this algorithm, the convergence rate of the LNS addition/subtraction unit can be exponential; all the possible cases are thoroughly tested by simulations, and thus the correctness of the algorithm is proved.
Abstract: Very large word-length logarithmic number system (LNS) addition/subtraction requires a lot of hardware and long pipeline latency. In this paper, we propose an algorithm that uses two novel methods to solve these problems. With this algorithm, the convergence rate of the LNS addition/subtraction unit can be exponential. All the possible cases are thoroughly tested by simulations and thus we have proved the correctness of the algorithm.

1 citation


Proceedings ArticleDOI
25 May 2009
TL;DR: The proposed software-implemented 32-bit LNS arithmetic implementation approach is very efficient for computing complex arithmetic functions in an ARM embedded system.
Abstract: Logarithmic number system (LNS) arithmetic is a good alternative to floating-point (FLP) arithmetic. We have implemented 32-bit LNS arithmetic in assembly and C on an ARM processor. Compared to FLP arithmetic, the proposed software-implemented LNS arithmetic achieves speedup factors of 9.12 and 13.45 in multiplication and division, respectively, with only about 34% speed degradation in addition/subtraction. For the A^B function, the proposed LNS arithmetic is 91.06 times faster than the FLP arithmetic. We conclude that our proposed software LNS arithmetic implementation approach is very efficient for computing complex arithmetic functions in an ARM embedded system.
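
The reason multiplication, division, and powers are cheap in LNS can be shown in a few lines. The sketch below uses double-precision Python rather than the 32-bit fixed-point format and ARM assembly of the abstract, handles positive values only, and takes the exponent of the power function as an ordinary real number; all names (`to_lns`, `from_lns`, `lns_mul`, `lns_div`, `lns_pow`) are hypothetical.

```python
import math


def to_lns(x: float) -> float:
    """Represent a positive real by its base-2 logarithm."""
    if x <= 0.0:
        raise ValueError("this sketch handles positive values only")
    return math.log2(x)


def from_lns(e: float) -> float:
    """Convert an LNS exponent back to a real value."""
    return 2.0 ** e


def lns_mul(a: float, b: float) -> float:
    """Multiplication of the represented values is one addition of exponents."""
    return a + b


def lns_div(a: float, b: float) -> float:
    """Division of the represented values is one subtraction of exponents."""
    return a - b


def lns_pow(a: float, exponent: float) -> float:
    """Raising to a real power scales the exponent: log2(A^B) = B * log2(A)."""
    return a * exponent
```

As a usage example, `from_lns(lns_pow(to_lns(2.0), 10.0))` evaluates to 1024.0. Addition and subtraction do not simplify this way, which is consistent with the slowdown the abstract reports for those two operations.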

Proceedings ArticleDOI
11 Oct 2009
TL;DR: A hardware implementation of the parametric image-processing framework that accurately processes images and speeds up addition, subtraction, and multiplication is introduced, together with the design of arithmetic circuits, including parallel counters, adders, and multipliers, based on two high-performance threshold logic gate implementations.
Abstract: The Parameterized Digital Electronic Arithmetic (PDEA) model replaces linear operations with non-linear ones. In this paper we introduce a hardware implementation of the parametric image-processing framework that will accurately process images and speed up computation for addition, subtraction, and multiplication. In particular, the paper presents the design of arithmetic circuits including parallel counters, adders, and multipliers based on two high-performance threshold logic gate implementations that we have developed. We will also explore new microprocessor architectures to take advantage of this arithmetic. The experiments executed have shown that the algorithm provides faster and better enhancements than those described in the literature. Its potential applications include computer graphics, digital signal processing, and other multimedia applications.