scispace - formally typeset
Author

Kaushik Ravindran

Bio: Kaushik Ravindran is an academic researcher from National Instruments. The author has contributed to research in topics: Dataflow & Model of computation. The author has an h-index of 14, co-authored 30 publications receiving 1769 citations. Previous affiliations of Kaushik Ravindran include University of California, Berkeley & Sadia S.A..

Papers
Proceedings ArticleDOI
07 Jun 2004
TL;DR: In this article, a canonical first-order delay model is proposed, and a linear-time block-based algorithm propagates timing quantities like arrival times and required arrival times through the timing graph in this canonical form; at the end of the statistical timing, the sensitivities of all timing quantities to each of the sources of variation are available.
Abstract: Variability in digital integrated circuits makes timing verification an extremely challenging task. In this paper, a canonical first order delay model is proposed that takes into account both correlated and independent randomness. A novel linear-time block-based statistical timing algorithm is employed to propagate timing quantities like arrival times and required arrival times through the timing graph in this canonical form. At the end of the statistical timing, the sensitivities of all timing quantities to each of the sources of variation are available. Excessive sensitivities can then be targeted by manual or automatic optimization methods to improve the robustness of the design. This paper also reports the first incremental statistical timer in the literature which is suitable for use in the inner loop of physical synthesis or other optimization programs. The third novel contribution of this paper is the computation of local and global criticality probabilities. For a very small cost in CPU time, the probability of each edge or node of the timing graph being critical is computed. Numerical results are presented on industrial ASIC chips with over two million logic gates.
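The canonical form at the heart of this paper expresses each delay as a nominal value plus linear sensitivities to shared (correlated) variation sources and one independent source. A minimal sketch of that representation and its "sum" propagation step, in Python (our illustration; the class and variable names are not from the paper, and the statistical "max" operation via tightness probabilities is omitted):

```python
import math

class CanonicalDelay:
    """First-order canonical form: mean + sum(a_i * dX_i) + r * dR,
    where the dX_i are unit-normal global variation sources shared
    across the design and dR is an independent unit-normal source.
    (Illustrative sketch, not the paper's code.)"""

    def __init__(self, mean, sens, indep):
        self.mean = mean        # nominal delay a_0
        self.sens = list(sens)  # sensitivities a_1..a_n to global sources
        self.indep = indep      # coefficient of the independent term

    def sigma(self):
        # Total standard deviation: RSS of all sensitivities.
        return math.sqrt(sum(a * a for a in self.sens) + self.indep ** 2)

    def add(self, other):
        # 'Sum' propagation (e.g. arrival time + gate delay):
        # correlated sensitivities add linearly, so shared variation can
        # cancel; independent parts combine in RSS.
        return CanonicalDelay(
            self.mean + other.mean,
            [a + b for a, b in zip(self.sens, other.sens)],
            math.hypot(self.indep, other.indep),
        )

a = CanonicalDelay(100.0, [5.0, 2.0], 3.0)
b = CanonicalDelay(50.0, [1.0, -2.0], 4.0)
s = a.add(b)
print(s.mean, s.sens, round(s.sigma(), 3))
```

Note how the second global source cancels exactly in the sum, something a corner-based analysis cannot express.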

703 citations

Journal ArticleDOI
TL;DR: A canonical first-order delay model that takes into account both correlated and independent randomness is proposed, and the first incremental statistical timer in the literature is reported, suitable for use in the inner loop of physical synthesis or other optimization programs.
Abstract: Variability in digital integrated circuits makes timing verification an extremely challenging task. In this paper, a canonical first-order delay model that takes into account both correlated and independent randomness is proposed. A novel linear-time block-based statistical timing algorithm is employed to propagate timing quantities like arrival times and required arrival times through the timing graph in this canonical form. At the end of the statistical timing, the sensitivity of all timing quantities to each of the sources of variation is available. Excessive sensitivities can then be targeted by manual or automatic optimization methods to improve the robustness of the design. This paper also reports the first incremental statistical timer in the literature, which is suitable for use in the inner loop of physical synthesis or other optimization programs. The third novel contribution of this paper is the computation of local and global criticality probabilities. For a very small cost in computer time, the probability of each edge or node of the timing graph being critical is computed. Numerical results are presented on industrial application-specific integrated circuit (ASIC) chips with over two million logic gates, and statistical timing results are compared to exhaustive corner analysis on a chip design whose hardware showed early mode timing violations.

416 citations

Proceedings ArticleDOI
02 Jul 2007
TL;DR: This work proposes a two-step approach: a custom preparsing technique to resolve control dependencies in the input stream and expose MB-level data parallelism, and an MB-level scheduling technique to allocate and load-balance MB rendering tasks.
Abstract: The H.264 decoder has a sequential, control-intensive front end that makes it difficult to leverage the potential performance of emerging manycore processors. Preparsing is a functional parallelization technique to resolve this front-end bottleneck. However, the resulting parallel macroblock (MB) rendering tasks have highly input-dependent execution times and precedence constraints, which make them difficult to schedule efficiently on manycore processors. To address these issues, we propose a two-step approach: (i) a custom preparsing technique to resolve control dependencies in the input stream and expose MB-level data parallelism; (ii) an MB-level scheduling technique to allocate and load-balance MB rendering tasks. The run-time MB-level scheduling increases the efficiency of parallel execution in the rest of the H.264 decoder, providing 60% speedup over greedy dynamic scheduling and 9-15% speedup over static compile-time scheduling for more than four processors. The preparsing technique coupled with run-time MB-level scheduling enables a potential 7× speedup for H.264 decoding.
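The MB-level data parallelism that preparsing exposes follows the well-known 2-D wavefront dependency pattern of H.264. A small sketch of that pattern (our illustration, not the paper's scheduler, which additionally load-balances the input-dependent MB execution times):

```python
# Sketch of the macroblock dependency pattern that makes H.264
# decoding schedulable as a 2-D wavefront (illustrative only).
def wavefront_steps(width, height):
    """Earliest parallel step at which each macroblock can be decoded,
    assuming MB(x, y) must wait for its left neighbour (x-1, y) and
    its top-right neighbour (x+1, y-1), the standard intra-prediction
    and deblocking dependencies."""
    step = {}
    for y in range(height):
        for x in range(width):
            deps = [step[d] for d in ((x - 1, y), (x + 1, y - 1)) if d in step]
            step[(x, y)] = 1 + max(deps, default=-1)
    return step

s = wavefront_steps(4, 3)
# MBs sharing a step value are independent and can run concurrently.
print(max(s.values()) + 1)  # total sequential steps for a 4x3 frame
```

The recurrence yields step(x, y) = x + 2y, so available parallelism grows with frame width but is bounded by these precedence constraints, which is why load balancing the variable MB times matters.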

96 citations

Proceedings ArticleDOI
09 Nov 2003
TL;DR: This paper presents an algorithm for constrained clock skew scheduling which computes, for a given number of clocking domains, the optimal phase shifts for the domains and the assignment of the individual registers to the domains.
Abstract: The application of general clock skew scheduling is practically limited due to the difficulties in implementing a wide spectrum of dedicated clock delays in a reliable manner. This results in a significant limitation of the optimization potential. As an alternative, the application of multiple clocking domains with dedicated phase shifts that are implemented by reliable, possibly expensive design structures can overcome these limitations and substantially increase the implementable optimization potential of clock adjustments. In this paper we present an algorithm for constrained clock skew scheduling which computes, for a given number of clocking domains, the optimal phase shifts for the domains and the assignment of the individual registers to the domains. For the within-domain latency values, the algorithm can assume a zero-skew clock delivery or apply a user-provided upper bound. Our experiments demonstrate that a constrained clock skew schedule using a few clocking domains combined with small within-domain latency can reliably implement the full sequential optimization potential, to date only possible with an unconstrained clock schedule.
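The register-to-domain assignment subproblem can be pictured as one-dimensional clustering of the ideal per-register skews produced by unconstrained scheduling. A brute-force toy sketch (our illustration only; the paper's algorithm is more sophisticated and also bounds within-domain latency):

```python
from itertools import combinations

# Toy version of the domain-assignment subproblem (not the paper's
# algorithm): given each register's ideal clock arrival time from
# unconstrained skew scheduling, pick k domain phase shifts and assign
# every register to its nearest domain so the worst within-domain
# deviation is minimized.
def assign_domains(ideal_skews, k):
    best = None
    # Restrict candidate phases to the observed skews for simplicity.
    for phases in combinations(sorted(set(ideal_skews)), k):
        worst = max(min(abs(s - p) for p in phases) for s in ideal_skews)
        if best is None or worst < best[0]:
            best = (worst, phases)
    return best

# Two natural clusters of ideal skews -> two domains suffice.
worst, phases = assign_domains([0.0, 0.1, 0.9, 1.0, 1.1], 2)
print(phases, worst)
```

With two well-separated clusters of ideal skews, two domain phases recover almost the full benefit, mirroring the paper's observation that a few domains plus small within-domain latency approach the unconstrained optimum.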

79 citations

Proceedings ArticleDOI
19 Sep 2005
TL;DR: In this paper, an exploration framework based on Integer Linear Programming (ILP) is proposed to explore micro-architectures and allocate application tasks to maximize throughput for IPv4 packet forwarding.
Abstract: FPGA-based soft multiprocessors are viable system solutions for high performance applications. They provide a software abstraction to enable quick implementations on the FPGA. The multiprocessor can be customized for a target application to achieve high performance. Modern FPGAs provide the capacity to build a variety of micro-architectures composed of 20-50 processors, complex memory hierarchies, heterogeneous interconnection schemes and custom co-processors for performance critical operations. However, the diversity in the architectural design space makes it difficult to realize the performance potential of these systems. In this paper we develop an exploration framework to build efficient FPGA multiprocessors for a target application. Our main contribution is a tool based on Integer Linear Programming to explore micro-architectures and allocate application tasks to maximize throughput. Using this tool, we implement a soft multiprocessor for IPv4 packet forwarding that achieves a throughput of 2 Gbps, surpassing the performance of a carefully tuned hand design.
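The allocation objective the ILP optimizes, maximizing throughput by minimizing the bottleneck processor's load, can be sketched with brute-force enumeration standing in for the solver (the task names and costs below are invented for illustration):

```python
from itertools import product

# Illustrative reduction of the allocation subproblem the paper solves
# with Integer Linear Programming: map application tasks (with given
# work estimates) onto a fixed set of soft processors so that the
# bottleneck processor's load, and hence the achievable throughput,
# is optimal. Brute force stands in for the ILP solver here.
def best_allocation(task_costs, n_procs):
    tasks = list(task_costs)
    best = None
    for assign in product(range(n_procs), repeat=len(tasks)):
        load = [0.0] * n_procs
        for t, p in zip(tasks, assign):
            load[p] += task_costs[t]
        bottleneck = max(load)  # pipeline throughput ~ 1 / bottleneck
        if best is None or bottleneck < best[0]:
            best = (bottleneck, dict(zip(tasks, assign)))
    return best

# Hypothetical packet-processing tasks with relative work estimates.
costs = {"parse": 2.0, "lookup": 5.0, "checksum": 1.0, "forward": 3.0}
bottleneck, mapping = best_allocation(costs, 2)
print(bottleneck, mapping)
```

An ILP expresses the same objective with binary assignment variables and scales to the 20-50 processor design spaces the paper targets, where enumeration is hopeless.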

76 citations


Cited by
18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Abstract: Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following:
• The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems.
• The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
• Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.)
• “Autotuners” should play a larger role than conventional compilers in translating parallel programs.
• To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
• To be successful, programming models should be independent of the number of processors.
• To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
• Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
• Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.
• To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost.
Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

2,262 citations


Proceedings ArticleDOI
30 Aug 2010
TL;DR: The evaluation results show that GPU brings significantly higher throughput over the CPU-only implementation, confirming the effectiveness of GPU for computation and memory-intensive operations in packet processing.
Abstract: We present PacketShader, a high-performance software router framework for general packet processing with Graphics Processing Unit (GPU) acceleration. PacketShader exploits the massively-parallel processing power of GPU to address the CPU bottleneck in current software routers. Combined with our high-performance packet I/O engine, PacketShader outperforms existing software routers by more than a factor of four, forwarding 64B IPv4 packets at 39 Gbps on a single commodity PC. We have implemented IPv4 and IPv6 forwarding, OpenFlow switching, and IPsec tunneling to demonstrate the flexibility and performance advantage of PacketShader. The evaluation results show that GPU brings significantly higher throughput over the CPU-only implementation, confirming the effectiveness of GPU for computation and memory-intensive operations in packet processing.
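Among the per-packet operations PacketShader batches onto the GPU is IPv4 forwarding-table lookup. A minimal Python sketch of that longest-prefix-match step (illustrative only; the routes are invented, and the real system uses optimized lookup structures and GPU kernels rather than Python):

```python
import ipaddress

# Hypothetical forwarding table: prefix -> next hop.
FIB = {
    ipaddress.ip_network("10.0.0.0/8"): "portA",
    ipaddress.ip_network("10.1.0.0/16"): "portB",
    ipaddress.ip_network("0.0.0.0/0"): "default",
}

def lookup(dst):
    """Longest-prefix match: of all prefixes containing the address,
    pick the most specific one."""
    addr = ipaddress.ip_address(dst)
    matches = [n for n in FIB if addr in n]
    return FIB[max(matches, key=lambda n: n.prefixlen)]

# Processing many packets per call is what amortizes the CPU-GPU
# transfer overhead; here a list comprehension stands in for a batched
# kernel launch.
print([lookup(d) for d in ("10.1.2.3", "10.2.0.1", "8.8.8.8")])
```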

585 citations

Proceedings ArticleDOI
29 May 2013
TL;DR: An extensive survey and categorization of state-of-the-art mapping methodologies for multi/many-core systems, highlighting the emerging trends.
Abstract: The reliance on multi/many-core systems to satisfy the high performance requirement of complex embedded software applications is increasing. This necessitates efficient mapping methodologies for such complex computing platforms. This paper provides an extensive survey and categorization of state-of-the-art mapping methodologies and highlights the emerging trends for multi/many-core systems. The methodologies aim at optimizing a system's resource usage, performance, power consumption, temperature distribution and reliability for varying application models. The methodologies perform design-time and run-time optimization for static and dynamic workload scenarios, respectively. These optimizations are necessary to fulfill the end-user demands. A comparison of the methodologies based on their optimization aim has been provided. The trend followed by the methodologies and open research challenges have also been discussed.

435 citations

Journal ArticleDOI
25 Sep 2006
TL;DR: A brief discussion of key sources of power dissipation and their temperature relation in CMOS VLSI circuits, and techniques for full-chip temperature calculation with special attention to its implications on the design of high-performance, low-power VLSI circuits is presented.
Abstract: The growing packing density and power consumption of very large scale integration (VLSI) circuits have made thermal effects one of the most important concerns of VLSI designers. The increasing variability of key process parameters in nanometer CMOS technologies has resulted in larger impact of the substrate and metal line temperatures on the reliability and performance of the devices and interconnections. Recent data shows that more than 50% of all integrated circuit failures are related to thermal issues. This paper presents a brief discussion of key sources of power dissipation and their temperature relation in CMOS VLSI circuits, and techniques for full-chip temperature calculation with special attention to its implications on the design of high-performance, low-power VLSI circuits. The paper is concluded with an overview of techniques to improve the full-chip thermal integrity by means of off-chip versus on-chip and static versus adaptive methods.
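The electrothermal coupling the paper discusses can be illustrated as a fixed-point computation: leakage power rises with temperature, and temperature rises with total power. A toy sketch with invented coefficients (not a full-chip thermal solver, which discretizes the die into many thermal cells):

```python
# Toy illustration of leakage-temperature feedback: steady-state
# temperature is a fixed point of the coupled power and thermal
# relations. All coefficients below are invented for illustration.
def steady_temperature(p_dyn, t_amb=45.0, r_th=0.5,
                       leak0=10.0, leak_coeff=0.02, iters=50):
    """Iterate T = T_amb + R_th * (P_dyn + P_leak(T)) with a linearized
    leakage model P_leak(T) = leak0 * (1 + leak_coeff * (T - T_amb)).
    p_dyn in watts, temperatures in degrees C, r_th in C/W."""
    t = t_amb
    for _ in range(iters):
        p_leak = leak0 * (1.0 + leak_coeff * (t - t_amb))
        t = t_amb + r_th * (p_dyn + p_leak)
    return t

print(round(steady_temperature(60.0), 2))
```

Because the update is a contraction here (feedback gain 0.1), the iteration converges; with stronger leakage-temperature coupling the same loop can diverge, the thermal-runaway scenario such analyses are meant to catch.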

420 citations