Book

Computational Aspects of VLSI

01 Jan 1984
About: The book was published on 1984-01-01 and is currently open access. It has received 862 citations to date. It focuses on the topic of very-large-scale integration (VLSI).
Citations
Journal ArticleDOI
01 Jun 1991
TL;DR: This work proposes a systematic method for synthesizing optimal VLSI architectures using distributed arithmetic, compares distributed arithmetic with more conventional methods for inner-product computation, and shows how area, latency, and period may be traded off while maintaining constant error.
Abstract: Real-time signal processing requires fast computation of inner products. Distributed arithmetic is a method of inner product computation that uses table lookup and addition in place of multiplication. Distributed arithmetic has previously been shown to produce novel and seemingly efficient architectures for a variety of signal processing computations; however, the methods of design, analysis and comparison have been ad hoc. We propose a systematic method for synthesizing optimal VLSI architectures using distributed arithmetic. A partition of the inner product computation at the word and bit level produces a computation consisting of lookups and additions. We study two classes of algorithms to implement this computation, regular iterative algorithms and tree algorithms, each of which can be expressed in the form of a dependency graph. We use linear and nonlinear maps to assign computations to processors in space and time. Expressions are developed for the area, latency, period and arithmetic error for a particular partition and space/time map of the dependency graph. We use these expressions to formulate a constrained optimization problem over a large class of architectures. We compare distributed arithmetic with more conventional methods for inner product computation and show how area, latency and period may be traded off while maintaining constant error.
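
One way to see the table-lookup-and-addition scheme is as a short software sketch. The function below is a hypothetical illustration (names and structure are ours, not the paper's), assuming unsigned fixed-point inputs of nbits bits: it precomputes the 2^N possible coefficient sums once, then scans the inputs bit-serially, replacing the N multiplications of a direct inner product with one table read and a shift-add per bit position.

```python
# Hypothetical sketch of distributed arithmetic for y = sum_i c[i] * x[i].
# Assumes unsigned integer inputs of `nbits` bits; two's-complement inputs
# would additionally need a sign correction on the most significant bit pass.

def da_inner_product(coeffs, xs, nbits):
    n = len(coeffs)
    assert len(xs) == n

    # Lookup table with one entry per subset of coefficients:
    # table[addr] = sum of coeffs[i] over all bits i set in addr.
    table = [sum(coeffs[i] for i in range(n) if addr & (1 << i))
             for addr in range(1 << n)]

    # Bit-serial accumulation: at bit position b, address bit i is bit b of
    # x[i]; the table read replaces n multiplications.
    acc = 0
    for b in range(nbits):
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i
        acc += table[addr] << b   # shift-and-add, as a hardware accumulator would
    return acc


if __name__ == "__main__":
    c, x = [3, 5, 7], [2, 4, 6]
    assert da_inner_product(c, x, nbits=8) == sum(ci * xi for ci, xi in zip(c, x))
```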

28 citations


Cites background or methods from "Computational Aspects of VLSI"

  • ...We measure the complexity of a VLSI architecture in terms of area A, latency L, and period P. VLSI communication models [32], [33] have been used to analyze numerous computations....

  • ...VLSI Complexity Theory [32], [33] uses communication models for the asymptotic analysis of interconnect complexity....

  • ...Ullman [32] proves a number of fundamental theorems for the layout of tree structures in VLSI....

Journal Article
TL;DR: A general network of agents that can be built with Zeus is studied to determine how much discounting the agents can allow and how to control the coordination time.
Abstract: We study a general network of agents that can be built with Zeus [?]. Relationships between agents can be peer, slave, master, discounter, or no relation at all. There are four possible strategies: the cheapest agent is selected; preference is given to slaves first; cut-price discounting is applied based on utility; or the cheapest agent is chosen but preference is given to the cheapest slave. The cost of a task for the agent originating it is the cost of the resources used. The size of the initial endowment is determined so that there are never any lost tasks in the system. We also establish the influence of agent strategies on stability and fairness. We were able to determine how much discounting the agents can allow, and how to control the coordination time. The growth of the maximum communication load with respect to the number of agents is calculated for various topologies of networks of agents. A performance measure related to the speed of the network is also calculated.

28 citations

Proceedings Article
01 Jan 1989
TL;DR: The authors present a parallel merging algorithm that, on an exclusive-read exclusive-write (EREW) parallel random-access machine (PRAM) with k processors merges two sorted lists of total length n in O(n/k+log n) time and constant extra space per processor, and hence is time-space optimal for any value of k.
Abstract: The authors present a parallel merging algorithm that, on an exclusive-read exclusive-write (EREW) parallel random-access machine (PRAM) with k processors merges two sorted lists of total length n in O(n/k+log n) time and constant extra space per processor, and hence is time-space optimal for any value of k >
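
The O(n/k + log n) bound can be illustrated with a block-partitioned merge: each processor binary-searches for its split point (O(log n)) and then merges a contiguous block of roughly n/k output positions. The sketch below is a generic merge-path partition simulated sequentially in Python; it conveys the work bound only and is not the authors' constant-extra-space EREW construction.

```python
# Hypothetical sketch of a block-partitioned merge (merge-path co-ranking),
# simulated sequentially; each iteration of the loop in partitioned_merge is
# independent and could run on its own PRAM processor.
import heapq

def co_rank(d, a, b):
    """Smallest i such that the first d merged elements are exactly a[:i] and b[:d-i]."""
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        i = (lo + hi) // 2            # candidate number of elements taken from a
        if a[i] < b[d - i - 1]:       # too few taken from a: move the split right
            lo = i + 1
        else:
            hi = i
    return lo

def partitioned_merge(a, b, k):
    """Merge two sorted lists using k blocks of about (len(a)+len(b))/k outputs each."""
    n = len(a) + len(b)
    cuts = [p * n // k for p in range(k + 1)]       # output positions per block
    splits = [co_rank(d, a, b) for d in cuts]       # one O(log n) search per block
    out = []
    for p in range(k):                              # independent O(n/k) merges
        i0, i1 = splits[p], splits[p + 1]
        j0, j1 = cuts[p] - i0, cuts[p + 1] - i1
        out.extend(heapq.merge(a[i0:i1], b[j0:j1]))
    return out

if __name__ == "__main__":
    a, b = list(range(0, 20, 2)), list(range(1, 15, 2))
    assert partitioned_merge(a, b, k=4) == sorted(a + b)
```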

27 citations

Journal ArticleDOI
TL;DR: Completely pipelined inner product architectures are presented for FIR filtering and linear transformation, using only full adders, organized to form multipliers.
Abstract: Completely pipelined inner product architectures are presented for FIR filtering and linear transformation. The designs use only full adders, organized to form multipliers. By cascading these multiplier structures, no additional area or time is needed to sum their products. The merits of the FFT are briefly reconsidered in the context of high throughput VLSI structures for digital signal processing.
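
For orientation, the computation these structures pipeline is the running inner product of a fixed coefficient vector with a sliding window of samples. A direct software form, assuming samples before time zero are zero (a common convention, not something stated in the abstract), is:

```python
# Hypothetical direct-form FIR sketch of the inner product the hardware pipelines;
# samples before time zero are assumed to be zero.

def fir_filter(coeffs, samples):
    """y[n] = sum_k coeffs[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out

if __name__ == "__main__":
    # 3-tap running sum: y[n] = x[n] + x[n-1] + x[n-2]
    print(fir_filter([1, 1, 1], [1, 2, 3, 4]))   # [1, 3, 6, 9]
```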

27 citations

Journal ArticleDOI
TL;DR: This paper begins by applying two standard cache-friendly optimizations to the Floyd--Warshall algorithm and shows limited performance improvements, then discusses the unidirectional space time representation (USTR), which can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size.
Abstract: The topic of cache performance has been well studied in recent years. Compiler optimizations exist and optimizations have been done for many problems. Much of this work has focused on dense linear algebra problems. At first glance, the Floyd--Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd--Warshall algorithm and show limited performance improvements. We then discuss the unidirectional space time representation (USTR). We show analytically that the USTR can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size, for a large class of algorithms. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic to intelligently narrow the search space for the tile size that minimizes total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd--Warshall algorithm. We show experimentally that this implementation minimizes the level-1 and level-2 cache misses and TLB misses and, therefore, exhibits the best overall performance. Using this implementation, we show a 2x improvement in performance over the best compiler optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd--Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We show experimental results for the Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We demonstrate improved cache performance using the Simplescalar simulator.
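
As a point of reference for the tiled implementation, a generic three-phase blocked Floyd--Warshall (the standard tiling, not the paper's USTR-derived code or its tile-size heuristic) restructures the computation so that most updates stay within a B x B tile: each stage finalizes the diagonal tile, then the tiles sharing its row and column, then the remaining tiles.

```python
# Hypothetical sketch of a standard blocked (tiled) Floyd--Warshall.
INF = float("inf")

def _relax_tile(d, ib, jb, kb, B, n):
    """Floyd--Warshall updates on tile (ib, jb) using pivots kb .. kb+B-1."""
    for k in range(kb, min(kb + B, n)):
        row_k = d[k]
        for i in range(ib, min(ib + B, n)):
            dik = d[i][k]
            if dik == INF:
                continue
            row_i = d[i]
            for j in range(jb, min(jb + B, n)):
                if dik + row_k[j] < row_i[j]:
                    row_i[j] = dik + row_k[j]

def blocked_floyd_warshall(d, B=64):
    """In-place all-pairs shortest paths on an n x n distance matrix d (tile size B)."""
    n = len(d)
    for kb in range(0, n, B):
        _relax_tile(d, kb, kb, kb, B, n)              # phase 1: diagonal tile
        for jb in range(0, n, B):
            if jb != kb:
                _relax_tile(d, kb, jb, kb, B, n)      # phase 2: tiles in the pivot row
                _relax_tile(d, jb, kb, kb, B, n)      #          and in the pivot column
        for ib in range(0, n, B):
            for jb in range(0, n, B):
                if ib != kb and jb != kb:
                    _relax_tile(d, ib, jb, kb, B, n)  # phase 3: all remaining tiles
    return d

if __name__ == "__main__":
    n = 4
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in [(0, 1, 3), (1, 2, 1), (2, 3, 2), (0, 3, 10)]:
        d[u][v] = w
    blocked_floyd_warshall(d, B=2)
    print(d[0][3])   # 6, via 0 -> 1 -> 2 -> 3
```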

27 citations