Author

Valavan Manohararajah

Other affiliations: University of Toronto
Bio: Valavan Manohararajah is an academic researcher from Altera. The author has contributed to research in topics: Field-programmable gate array & Programmable logic array. The author has an h-index of 13 and has co-authored 42 publications receiving 528 citations. Previous affiliations of Valavan Manohararajah include University of Toronto.

Papers
Journal ArticleDOI
TL;DR: In this paper, an iterative technology-mapping tool called IMap is presented; it supports depth-oriented, area-oriented, and duplication-free mapping modes, and the edge-delay model is used throughout.
Abstract: In this paper, an iterative technology-mapping tool called IMap is presented. It supports depth-oriented (area is a secondary objective), area-oriented (depth is a secondary objective), and duplication-free mapping modes. The edge-delay model (as opposed to the more commonly used unit-delay model) is used throughout. Two new heuristics are used to obtain area reductions over previously published methods. The first heuristic predicts the effects of various mapping decisions on the area of the final solution, and the second heuristic bounds the depth of the mapping solution at each node. In depth-oriented mode, when targeting five-input lookup tables (LUTs), IMap obtains depth-optimal solutions that are 44.4%, 19.4%, and 5% smaller than those produced by FlowMap, CutMap, and DAOMap, respectively. Targeting the same LUT size in area-oriented mode, IMap obtains solutions that are 17.5% and 9.4% smaller than those produced by duplication-free mapping and ZMap, respectively. IMap is also shown to be highly efficient: runtime improvements of between 2.3× and 82× are obtained over existing algorithms when targeting five LUTs. Area and runtime results comparing IMap to the other mappers when targeting four and six LUTs are also presented.
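The depth-bounding idea behind mappers like IMap can be illustrated with a toy cut-based sketch: enumerate the K-feasible cuts of each node and label each node with the minimum LUT depth any cut achieves. This is a minimal illustration of cut-based mapping in general, not IMap itself (which additionally uses the edge-delay model and area-prediction heuristics); the netlist representation here is hypothetical.

```python
from itertools import product

def k_feasible_cuts(node, fanin, K, memo=None):
    """Naively enumerate all cuts of `node` with at most K leaves.
    Fine for tiny netlists; real mappers prune aggressively."""
    if memo is None:
        memo = {}
    if node not in memo:
        cuts = {frozenset([node])}                 # the trivial cut
        if fanin[node]:
            child_cuts = [k_feasible_cuts(f, fanin, K, memo)
                          for f in fanin[node]]
            for combo in product(*child_cuts):     # merge one cut per fanin
                merged = frozenset().union(*combo)
                if len(merged) <= K:
                    cuts.add(merged)
        memo[node] = cuts
    return memo[node]

def depth_labels(order, fanin, K):
    """Minimum K-LUT depth per node; `order` is topological and
    primary inputs (empty fanin) have depth 0."""
    label = {}
    for n in order:
        if not fanin[n]:
            label[n] = 0
            continue
        label[n] = min(1 + max(label[leaf] for leaf in cut)
                       for cut in k_feasible_cuts(n, fanin, K)
                       if cut != frozenset([n]))   # a node can't cover itself
    return label
```

With a larger K, a single LUT absorbs more logic and the depth label drops, which is the effect the LUT-size comparisons in the abstract quantify.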

118 citations

Proceedings ArticleDOI
21 Feb 2016
TL;DR: This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, fabricated in the Intel 14nm FinFET process.
Abstract: This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, fabricated in the Intel 14nm FinFET process. Stratix 10 includes ubiquitous flip-flops in the routing to enable a high degree of pipelining. In contrast to the earlier architectural exploration of pipelining in pass-transistor based architectures, the direct drive routing fabric in Stratix-style FPGAs enables an extremely low-cost pipeline register. The presence of ubiquitous flip-flops simplifies circuit retiming and improves performance. The availability of predictable retiming affects all stages of the cluster, place and route flow. Ubiquitous flip-flops require a low-cost clock network with sufficient flexibility to enable pipelining of dozens of clock domains. Different cost/performance tradeoffs in a pipelined fabric and use of a 14nm process lead to other modifications to the routing fabric and the logic element. User modification of the design enables even higher performance, averaging 2.3× faster in a small set of designs.

76 citations

Proceedings ArticleDOI
13 Jun 2005
TL;DR: A new linear-time retiming algorithm that produces near-optimal results is presented; the implementation is specifically targeted at Altera's Stratix FPGA-based designs, although the techniques described are general enough for any implementation medium.
Abstract: In this paper, a new linear-time retiming algorithm that produces near-optimal results is presented. The implementation is specifically targeted at Altera's Stratix FPGA-based designs, although the techniques described are general enough for any implementation medium. The algorithm is able to handle the architectural constraints of the target device, multiple user-assigned timing constraints, and implicit legality constraints. It ensures that register moves do not create asynchronous problems, such as a glitch on a clock/reset signal.
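The basic retiming move the abstract builds on can be sketched in a few lines: registers sit on edges of the circuit graph, the critical path is the longest register-free path, and a backward move pulls one register off every fanout edge of a node onto every fanin edge. This is a minimal sketch of classic Leiserson–Saxe-style retiming, not the paper's algorithm; the legality check here is a stand-in for the device-specific and asynchronous-safety checks the paper describes.

```python
def combinational_delay(order, edges, delay):
    """Longest register-free path delay.  `edges` maps (u, v) -> register
    count; a path is combinational while it crosses zero-register edges."""
    arrival = {}
    for v in order:                                # topological order
        preds = [arrival[u] for (u, w), r in edges.items()
                 if w == v and r == 0]
        arrival[v] = delay[v] + (max(preds) if preds else 0)
    return max(arrival.values())

def retime_backward(v, edges):
    """One backward retiming move at node v: pull a register off every
    fanout edge and push one onto every fanin edge.  Legal only when all
    fanout edges carry at least one register."""
    outs = [e for e in edges if e[0] == v]
    ins = [e for e in edges if e[1] == v]
    if not outs or any(edges[e] == 0 for e in outs):
        return False
    for e in outs:
        edges[e] -= 1
    for e in ins:
        edges[e] += 1
    return True
```

On a chain a→b→c with delays 3/1/1 and a register on b→c, one backward move at b shortens the critical register-free path from 4 to 3.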

32 citations

Patent
07 Feb 2007
TL;DR: In this paper, a method for designing a system on a target device is presented: the system is synthesized, placed, and routed, and during physical synthesis a first descendant thread is spawned to run in parallel with an existing thread, executing a different optimization strategy on the same netlist.
Abstract: A method for designing a system on a target device includes synthesizing the system. The system is mapped, placed on the target device, and routed. Physical synthesis is then performed on the system, during which a first descendant thread is spawned to run in parallel with an existing thread; the descendant thread executes a different optimization strategy than the existing thread, but on the same netlist.
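The descendant-thread idea can be sketched as racing several optimization strategies over copies of one netlist and keeping the best result. This is an illustrative toy, not the patented method; the netlist format, the strategy signature (netlist → (cost, optimized netlist)), and the cost-based selection are all assumptions made for the example.

```python
import copy
import threading

def parallel_physical_synthesis(netlist, strategies):
    """Run each optimization strategy in its own thread on a private
    deep copy of the same starting netlist, then keep the result with
    the lowest cost.  `strategies` maps a name to a function
    netlist -> (cost, optimized_netlist)."""
    results = {}

    def worker(name, strategy):
        # Each descendant thread works on its own copy of the netlist.
        results[name] = strategy(copy.deepcopy(netlist))

    threads = [threading.Thread(target=worker, args=item)
               for item in strategies.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return min(results.items(), key=lambda kv: kv[1][0])
```

Deep-copying the netlist per thread keeps the strategies independent, which is the point of running a different optimization strategy "on a same netlist" without the threads interfering.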

29 citations

Patent
31 May 2011
TL;DR: In this article, the memory interface circuitry is calibrated using a procedure that includes read calibration, write leveling, read latency tuning, and write calibration, which is used to calibrate memory systems supporting a variety of memory communications protocols.
Abstract: Integrated circuits may communicate with off-chip memory. Such integrated circuits may include memory interface circuitry that is used to interface with the off-chip memory. The memory interface circuitry may be calibrated using a procedure that includes read calibration, write leveling, read latency tuning, and write calibration. Read calibration may serve to ensure proper gating of data strobe signals and to center the data strobe signals with respect to read data signals. Write leveling ensures that the data strobe signals are aligned to system clock signals. Read latency tuning adjusts read latency to ensure optimum read performance. Write calibration may serve to center the data strobe signals with respect to write data signals. These calibration operations may be used to calibrate memory systems supporting a variety of memory communications protocols.
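The "centering" step shared by read and write calibration usually amounts to sweeping a delay line tap by tap, recording which taps capture data correctly, and settling on the middle of the widest passing window. The sketch below shows that generic find-the-eye step; it is an assumption-laden illustration, not the patented procedure, and the pass/fail list stands in for real per-tap read-back comparisons.

```python
def center_of_widest_window(pass_fail):
    """Given one pass/fail result per delay tap (True = data captured
    correctly at that tap), return the tap at the center of the widest
    passing window."""
    best_len, best_start = 0, None
    i, n = 0, len(pass_fail)
    while i < n:
        if pass_fail[i]:
            j = i
            while j < n and pass_fail[j]:      # walk to the end of this run
                j += 1
            if j - i > best_len:               # keep the widest run seen
                best_len, best_start = j - i, i
            i = j
        else:
            i += 1
    if best_start is None:
        raise RuntimeError('calibration failed: no passing delay tap')
    return best_start + best_len // 2
```

Picking the widest window (rather than the first passing tap) maximizes the timing margin on both sides of the strobe, which is why the same routine serves both read and write calibration.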

26 citations


Cited by
Book
02 Nov 2007
TL;DR: This book is intended as an introduction to the entire range of issues important to reconfigurable computing, using FPGAs as the context, or "computing vehicles" to implement this powerful technology.
Abstract: The main characteristic of Reconfigurable Computing is the presence of hardware that can be reconfigured to implement specific functionality more suitable for specially tailored hardware than on a simple uniprocessor. Reconfigurable computing systems join microprocessors and programmable hardware in order to take advantage of the combined strengths of hardware and software and have been used in applications ranging from embedded systems to high performance computing. Many of the fundamental theories have been identified and used by the Hardware/Software Co-Design research field. Although the same background ideas are shared in both areas, they have different goals and use different approaches. This book is intended as an introduction to the entire range of issues important to reconfigurable computing, using FPGAs as the context, or "computing vehicles", to implement this powerful technology. It will take a reader with a background in the basics of digital design and software programming and provide them with the knowledge needed to be an effective designer or researcher in this rapidly evolving field.
· Treatment of FPGAs as computing vehicles rather than glue-logic or ASIC substitutes
· Views of FPGA programming beyond Verilog/VHDL
· Broad set of case studies demonstrating how to use FPGAs in novel and efficient ways

531 citations

Proceedings ArticleDOI
02 Jun 2018
TL;DR: This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI, which achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1.
Abstract: Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models—aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling atypically high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
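The single-instruction fan-out the abstract describes (one ISA op dispatching millions of operations across MAC lanes) can be illustrated with a toy matrix-vector multiply whose rows are striped across independent lanes. This is an illustrative sketch of the fan-out concept only, not the actual Brainwave ISA or microarchitecture; the striping scheme and lane count are assumptions for the example.

```python
def mv_dispatch(matrix, vector, lanes):
    """One 'instruction' -- a full matrix-vector multiply -- decoded into
    independent per-lane dot products, with rows striped across lanes so
    each lane accumulates its own outputs with no cross-lane traffic."""
    out = [0] * len(matrix)
    for lane in range(lanes):
        for r in range(lane, len(matrix), lanes):  # rows owned by this lane
            out[r] = sum(m * x for m, x in zip(matrix[r], vector))
    return out
```

Because every lane's work is fixed by the instruction's shape, a hardware decoder can fan the single op out to all lanes at once; that is what lets one instruction keep tens of thousands of MAC units busy at batch size 1.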

498 citations

01 Jun 1961
TL;DR: In this article, Ashenhurst's chart method is generalized to nondisjunctive decompositions by means of the don't-care conditions; such decompositions lead to designs of more economical switching circuits realizing the given switching function.
Abstract: A given switching function of n variables can frequently be decomposed into a composite function of several essentially simpler switching functions. Such decompositions lead to designs of more economical switching circuits to realize the given switching function. Ashenhurst's chart method is generalized to nondisjunctive decompositions by means of the don't care conditions. This extension provides an effective method of constructing all decompositions of switching functions.
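The chart method being generalized here can be sketched for the basic disjoint case: build the decomposition chart with the bound set X indexing columns and the free set Y indexing rows; a simple decomposition f(X, Y) = g(h(X), Y) exists iff at most two distinct column patterns appear. The sketch below implements only that classical test, not the paper's nondisjunctive/don't-care extension, and the function-as-dictionary interface is an assumption for the example.

```python
from itertools import product

def simple_decomposable(f, bound_vars, free_vars):
    """Ashenhurst chart test for f(X, Y) = g(h(X), Y) with disjoint
    variable sets.  `f` evaluates the function on a {name: 0/1} dict.
    Columns are indexed by assignments to the bound set X; a simple
    decomposition exists iff at most two distinct columns appear."""
    columns = set()
    for x in product([0, 1], repeat=len(bound_vars)):
        env_x = dict(zip(bound_vars, x))
        # One chart column: f over all assignments to the free set Y.
        col = tuple(f({**env_x, **dict(zip(free_vars, y))})
                    for y in product([0, 1], repeat=len(free_vars)))
        columns.add(col)
    return len(columns) <= 2
```

For example, a three-input XOR decomposes with bound set {a, b} (take h = a XOR b), while three-input majority does not: its chart has three distinct column patterns.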

227 citations

Book
25 Oct 2006
TL;DR: All major steps in FPGA design flow which includes: routing and placement, circuit clustering, technology mapping and architecture-specific optimization, physical synthesis, RT-level and behavior-level synthesis, and power optimization are covered.
Abstract: Design automation or computer-aided design (CAD) for field programmable gate arrays (FPGAs) has played a critical role in the rapid advancement and adoption of FPGA technology over the past two decades. The purpose of this paper is to meet the demand for an up-to-date comprehensive survey/tutorial for FPGA design automation, with an emphasis on the recent developments within the past 5-10 years. The paper focuses on the theory and techniques that have been, or most likely will be, reduced to practice. It covers all major steps in FPGA design flow which includes: routing and placement, circuit clustering, technology mapping and architecture-specific optimization, physical synthesis, RT-level and behavior-level synthesis, and power optimization. We hope that this paper can be used both as a guide for beginners who are embarking on research in this relatively young yet exciting area, and a useful reference for established researchers in this field.

147 citations