scispace - formally typeset
Search or ask a question
Author

Bevan M. Baas

Bio: Bevan M. Baas is an academic researcher from University of California, Davis. The author has contributed to research in topics: Throughput (business) & Clock rate. The author has an hindex of 27, co-authored 85 publications receiving 2541 citations. Previous affiliations of Bevan M. Baas include Stanford University & Qualcomm Atheros.


Papers
More filters
Journal ArticleDOI
TL;DR: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor, which has been fabricated in a standard 0.7 /spl mu/m CMOS process and is fully functional on first-pass silicon.
Abstract: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor. The 460000-transistor design has been fabricated in a standard 0.7 /spl mu/m (L/sub poly/=0.6 /spl mu/m) CMOS process and is fully functional on first-pass silicon. At a supply voltage of 1.1 V, it calculates a 1024-point complex FFT in 330 /spl mu/s while consuming 9.5 mW, resulting in an adjusted energy efficiency more than 16 times greater than the previously most efficient known FFT processor. At 3.3 V, it operates at 173 MHz-which is a clock rate 2.6 times greater than the previously fastest rate.

319 citations

Journal ArticleDOI
TL;DR: This work curve fit second and third-order polynomials to circuit delay, energy, and power dissipation results based on HSpice simulations utilizing the Predictive Technology Model (PTM) and International Technology Roadmap for Semiconductors (ITRS) models.

211 citations

Journal ArticleDOI
TL;DR: A 167-processor computational platform consists of an array of simple programmable processors capable of per-processor dynamic supply voltage and clock frequency scaling, three algorithm-specific processors, and three 16 KB shared memories; and is implemented in 65 nm CMOS as discussed by the authors.
Abstract: A 167-processor computational platform consists of an array of simple programmable processors capable of per-processor dynamic supply voltage and clock frequency scaling, three algorithm-specific processors, and three 16 KB shared memories; and is implemented in 65 nm CMOS. All processors and shared memories are clocked by local fully independent, dynamically haltable, digitally-programmable oscillators and are interconnected by a configurable circuit-switched network which supports long-distance communication. Programmable processors occupy 0.17nmm2 and operate at a maximum clock frequency of 1.2 GHz at 1.3 V. At 1.2 V, they operate at 1.07 GHz and consume 47.5nmW when 100% active, resulting in an energy dissipation of 44 pJ per operation. At 0.675 V, they operate at 66 MHz and consume 608nmuW when 100% active, resulting in a total energy dissipation of 9.2 pJ per ALU or MAC operation.

203 citations

Proceedings ArticleDOI
01 Oct 2006
TL;DR: The proposed split-row method makes column processing parallelism easier to exploit, doubles available row processor parallelism, and significantly simplifies row processors - which results in smaller area, higher speeds, and lower energy dissipation.
Abstract: A reduced complexity LDPC decoding method is presented that dramatically reduces wire interconnect complexity, which is a major issue in LDPC decoders. The proposed split-row method makes column processing parallelism easier to exploit, doubles available row processor parallelism, and significantly simplifies row processors - which results in smaller area, higher speeds, and lower energy dissipation. Simulation results over an additive white Gaussian channel show that the error performance of high row-weight codes with split-row decoding is within 0.3-0.6 dB of the min-sum and sum-product decoding algorithms. A full parallel decoder for a (3,6) LDPC code with a code length of 1536 bits is implemented in a 0.18 mum CMOS technology twice: once using the split-row method, and once using the min-sum algorithm for comparison. The split-row decoder operates at 53 MHz and delivers a throughput of 5.4 Gbps with 15 decoding iterations per block. The split-row decoder is about 1.3 times smaller, has an average wire length 1.5 times shorter, and has a throughput 1.6 times higher than the min-sum decoder.

111 citations

Proceedings ArticleDOI
07 Aug 2002
TL;DR: The 0.25/spl mu/m CMOS mixed-signal baseband and MAC processor for the IEEE 802.11a WLAN standard in 0.8 mm/sup 2/ and contains 4.0M transistors in a 196-pin BGA package.
Abstract: An 0.25 /spl mu/m CMOS mixed-signal baseband and MAC processor for the IEEE 802.11a WLAN standard in 0.25 /spl mu/m CMOS occupies 6.8/spl times/6.8 mm/sup 2/ and contains 4.0M transistors in a 196-pin BGA package. Power consumption for transmit and receive is 326 mW and 452 mW. Additional data rates up to 108 Mb/s are supported. The MAC is implemented using dedicated control and datapath logic, and includes registers that allow host software to configure and control its operation. This yields an overall design that is compact, power-efficient, and requires no off-chip RAM or program storage, yet is very flexible.

107 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: This paper provides a general description of NoC architectures and applications and enumerates several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation.
Abstract: To alleviate the complex communication problems that arise as the number of on-chip components increases, network-on-chip (NoC) architectures have been recently proposed to replace global interconnects. In this paper, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem description, proposed approaches, and open issues are discussed for each problem from system, microarchitecture, and circuit perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective.

733 citations

Journal ArticleDOI
Y. Hoskote1, Sriram R. Vangal1, A. Singh1, Nitin Borkar1, S. Borkar1 
TL;DR: A multicore processor in 65-Nm technology with 80 single-precision, floatingpoint cores delivers performance in excess of a Teraflops while consuming less than 100 W.
Abstract: A multicore processor in 65-Nm technology with 80 single-precision, floatingpoint cores delivers performance in excess of a Teraflops while consuming less than 100 W. A 2D on-die mesh interconnection network operating at 5 GHz provides the high-performance communication fabric to connect the cores. The network delivers a bisection bandwidth of 2.56 Terabits per second and a per hop fall-through latency of 1 nanosecond.

658 citations

Journal ArticleDOI
TL;DR: In this paper, an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz.
Abstract: This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8x10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm2 custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.

645 citations