Author

Nitin Chandrachoodan

Bio: Nitin Chandrachoodan is an academic researcher at the Indian Institute of Technology Madras. The author has contributed to research on the topics Computer science and Adder. The author has an h-index of 9 and has co-authored 60 publications receiving 294 citations. Previous affiliations of Nitin Chandrachoodan include Indian Institutes of Technology and University of Maryland, College Park.


Papers
Journal Article
TL;DR: The design outperforms previous hardware implementations as well as tuned software implementations, including the ATLAS and MKL libraries on workstations; it has been synthesized for FPGA targets and can be easily retargeted.
Abstract: Decomposition of a matrix into lower and upper triangular matrices (LU decomposition) is a vital part of many scientific and engineering applications, and the block LU decomposition algorithm is an approach well suited to parallel hardware implementation. This paper presents an approach to speed up implementation of the block LU decomposition algorithm using FPGA hardware. Unlike most previous approaches reported in the literature, the approach does not assume the matrix can be stored entirely on chip. The memory accesses are studied for various FPGA configurations, and a schedule of operations that scales well is shown. The design has been synthesized for FPGA targets and can be easily retargeted. The design outperforms previous hardware implementations, as well as tuned software implementations including the ATLAS and MKL libraries on workstations.
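For readers unfamiliar with the block LU decomposition named in this abstract, the following is a minimal software sketch of a right-looking blocked factorization. It is not the paper's FPGA design or schedule: the function names `block_lu_inplace` and `split_lu`, the in-place list-of-lists storage, and the absence of pivoting (which assumes nonzero pivots, for example a diagonally dominant matrix) are all assumptions made for illustration.

```python
def block_lu_inplace(a, b):
    """Overwrite square matrix a (list of lists of floats) with its LU factors.

    Hypothetical helper: block size b, no pivoting (assumes nonzero pivots).
    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)  # the current diagonal block spans rows/cols k..e-1
        # Step 1: factor the diagonal block A11 = L11 * U11 (unblocked).
        for j in range(k, e):
            for i in range(j + 1, e):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: row panel U12 = inv(L11) * A12 (forward substitution).
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 3: column panel L21 = A21 * inv(U11).
        for j in range(k, e):
            for i in range(e, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 4: trailing update A22 -= L21 * U12 (the matrix-multiply bulk).
        for i in range(e, n):
            for j in range(k, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
    return a


def split_lu(f):
    """Unpack the combined factor matrix into separate L and U matrices."""
    n = len(f)
    lower = [[f[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    upper = [[f[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return lower, upper
```

Calling `block_lu_inplace` on a copy of a matrix and passing the result to `split_lu` yields factors whose product reproduces the input. Most of the arithmetic sits in step 4, a plain matrix multiply-accumulate, which is the kind of regular operation that suits parallel hardware.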

55 citations

Proceedings Article
07 Jan 2012
TL;DR: This paper presents the implementation of a 3GPP standards-compliant configurable turbo decoder on a GPU by suitably parallelizing the Log-MAP decoding algorithm and mapping it onto the GPU in an architecture-aware manner.
Abstract: This paper presents the implementation of a 3GPP standards-compliant configurable turbo decoder on a GPU. The challenge in implementing a turbo decoder on a GPU is in suitably parallelizing the Log-MAP decoding algorithm and doing an architecture-aware mapping of it onto the GPU. The approximations in parallelizing the Log-MAP algorithm come at the cost of reduced BER performance. To mitigate this reduction, different guarding mechanisms of varying computational complexity have been presented. The limited shared memory and registers available on GPUs are carefully allocated to obtain a high real-time decoding rate without requiring several independent data streams in parallel.

25 citations

Proceedings Article
07 May 2001
TL;DR: This work considers the problem of representing timing information associated with functions in a dataflow graph used to represent a signal processing system in the context of high-level hardware (architectural) synthesis, and shows that, with some reasonable assumptions on the way hardware implementations of multirate systems operate, one can derive general hierarchical descriptions of multirate systems similarly to single rate systems.
Abstract: We consider the problem of representing timing information associated with functions in a dataflow graph used to represent a signal processing system in the context of high-level hardware (architectural) synthesis. This information is used for synthesis of appropriate architectures for implementing the graph. Conventional models for timing suffer from shortcomings that make it difficult to represent timing information in a hierarchical manner, especially for multirate signal processing systems. We identify some of these shortcomings, and provide an alternate model that does not have these problems. We show that with some reasonable assumptions on the way hardware implementations of multirate systems operate, we can derive general hierarchical descriptions of multirate systems similarly to single rate systems. Several analytical results, such as the computation of the iteration period bound, that previously applied only to single rate systems can also easily be extended to multirate systems under the new assumptions. We have applied our model to several multirate signal processing applications, and obtained favorable results. We present results of the timing information computed for several multirate DSP applications that show how the new treatment can streamline the problem of performance analysis and synthesis of such systems.

25 citations

Proceedings Article
01 Nov 2012
TL;DR: The BP decoding algorithm is implemented to utilize the parallel computing capability of GPUs. It makes use of parallelism at both the thread level and the block level and, by utilizing the limited shared memory available on GPUs, achieves real-time decoding performance.
Abstract: We present a Graphics Processing Unit (GPU) implementation of a Belief Propagation (BP) based decoder for polar codes. The BP decoding algorithm is implemented to utilize the parallel computing capability of the GPUs. We show how the algorithm can make use of parallelism both at the thread level and block level, and by utilizing the limited shared memory available on GPUs, a real-time decoding performance is achieved. The resulting algorithm is able to achieve a decoding throughput of almost 5 Mbps while maintaining a frame error rate below 10⁻³ on code blocks of 1024 bits.

20 citations

Proceedings Article
17 Apr 2015
TL;DR: An indigenously developed acquisition system based on an Arduino-interfaced ADS1299 with a wearable dry electrode mask is used to record and process EOG signals.
Abstract: In this paper, an EOG-based assistive system for typing text using a virtual keyboard is presented. An indigenously developed acquisition system based on an Arduino-interfaced ADS1299 with a wearable dry electrode mask is used to record and process EOG signals. An accuracy of 100% and an average speed of 1 character per 12 seconds were achieved by an untrained person in an online implementation of this system. This can be further improved with word prediction algorithms.

18 citations


Cited by
01 Jan 2010
TL;DR: This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification, as well as other topics relevant to the design of parallel CAD algorithms and software tools.
Abstract: High-performance parallel computer architecture and systems have been improved at a phenomenal rate. In the meantime, VLSI computer-aided design (CAD) software for multibillion-transistor IC design has become increasingly complex and requires prohibitively high computational resources. Recent studies have shown that numerous CAD problems, with their high computational complexity, can greatly benefit from the fast-increasing parallel computation capabilities. However, parallel programming imposes big challenges for CAD applications. Fully exploiting the computational power of emerging general-purpose and domain-specific multicore/many-core processor systems calls for fundamental research and engineering practice across every stage of parallel CAD design, from algorithm exploration, programming models, design-time and run-time environment, to CAD applications, such as verification, optimization, and simulation. This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification. More specifically, papers with in-depth and extensive coverage of the following topics will be considered, as well as other topics relevant to the design of parallel CAD algorithms and software tools.
1. Parallel algorithm design and specification for CAD applications
2. Parallel programming models and languages of particular use in CAD
3. Runtime support and performance optimization for CAD applications
4. Parallel architecture-specific design and optimization for CAD applications
5. Parallel program debugging and verification techniques particularly relevant for CAD
The papers should be submitted via the Manuscript Central website and should adhere to standard ACM TODAES formatting requirements (http://todaes.acm.org/). The page count limit is 25.

459 citations

Proceedings Article
11 Sep 2010
TL;DR: This work characterizes a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications.
Abstract: Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. In order to develop effective compilation techniques for the streaming domain, it is important to understand the common characteristics of these programs. Prior characterizations of stream programs have examined legacy implementations in C, C++, or FORTRAN, making it difficult to extract the high-level properties of the algorithms. In this work, we characterize a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications. We utilize the StreamIt benchmark suite, consisting of 65 programs and 33,600 lines of code. We characterize the bottlenecks to parallelism, the data reference patterns, the input/output rates, and other properties. The lessons learned have implications for the design of future architectures, languages and compilers for the streaming domain.

179 citations

Journal Article
Ali Dasdan
TL;DR: This article focuses on the fastest OCR algorithms only, provides a unified theoretical framework and a few new results, and runs these algorithms on the largest circuit benchmarks available.
Abstract: Optimum cycle ratio (OCR) algorithms are fundamental to the performance analysis of (digital or manufacturing) systems with cycles. Some applications in the computer-aided design field include cycle time and slack optimization for circuits, retiming, timing separation analysis, and rate analysis. There are many OCR algorithms, and since a superior time complexity in theory does not mean a superior time complexity in practice, or vice-versa, it is important to know how these algorithms perform in practice on real circuit benchmarks. A recently published study experimentally evaluated almost all the known OCR algorithms, and determined the fastest one among them. This article improves on that study in the following ways: (1) it focuses on the fastest OCR algorithms only; (2) it provides a unified theoretical framework and a few new results; (3) it runs these algorithms on the largest circuit benchmarks available; (4) it compares the algorithms in terms of many properties in addition to running times such as operation counts, convergence behavior, space requirements, generality, simplicity, and robustness; (5) it analyzes the experimental results using statistical techniques and provides asymptotic time complexity of each algorithm in practice; and (6) it provides clear guidance on the use and implementation of these algorithms together with our algorithmic improvements.
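To make the quantity in this abstract concrete, the sketch below computes a maximum cycle ratio by brute-force enumeration of simple cycles. It is not one of the algorithms the article evaluates and is only practical for very small graphs. The function name `max_cycle_ratio`, the edge tuple layout `(source, target, time, delay)`, and the use of integer node labels are assumptions made for illustration.

```python
from fractions import Fraction


def max_cycle_ratio(edges):
    """Return the maximum over simple cycles of sum(time) / sum(delay).

    edges: list of (u, v, time, delay) tuples with integer node labels
    (hypothetical layout). Returns a Fraction, or None if the graph has
    no cycle. Raises ValueError on a cycle whose total delay is zero.
    """
    adj = {}
    for u, v, t, d in edges:
        adj.setdefault(u, []).append((v, t, d))

    best = None

    def dfs(start, node, t_sum, d_sum, visited):
        nonlocal best
        for v, t, d in adj.get(node, []):
            if v == start:
                # Closed a cycle back to the start node.
                total_d = d_sum + d
                if total_d == 0:
                    raise ValueError("cycle with zero total delay")
                ratio = Fraction(t_sum + t, total_d)
                if best is None or ratio > best:
                    best = ratio
            elif v not in visited and v > start:
                # Only visit nodes larger than start, so each cycle is
                # enumerated once, rooted at its smallest node.
                dfs(start, v, t_sum + t, d_sum + d, visited | {v})

    for s in adj:
        dfs(s, s, 0, 0, {s})
    return best


# Two cycles: 0->1->2->0 has ratio (2+3+1)/2, and 0->1->0 has ratio (2+2)/1.
example = [(0, 1, 2, 0), (1, 2, 3, 0), (2, 0, 1, 2), (1, 0, 2, 1)]
ratio = max_cycle_ratio(example)
```

Exact rational arithmetic via `Fraction` avoids floating-point ties between cycles. The enumeration is exponential in the worst case, which is why efficient OCR algorithms matter on large circuit benchmarks.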

175 citations

01 Jan 2006
TL;DR: This work presents exact techniques to chart the Pareto space of throughput and storage tradeoffs, which can be used to determine the minimal storage space needed to execute a graph under a given throughput constraint.
Abstract: Multimedia applications usually have throughput constraints. An implementation must meet these constraints, while it minimizes resource usage and energy consumption. The compute intensive kernels of these applications are often specified as Synchronous Dataflow Graphs. Communication between nodes in these graphs requires storage space which influences throughput. We present exact techniques to chart the Pareto space of throughput and storage trade-offs, which can be used to determine the minimal storage space needed to execute a graph under a given throughput constraint. The feasibility of the approach is demonstrated with a number of examples.
Categories and Subject Descriptors: C.3 [Special-purpose and Application-based Systems] Signal processing systems
General Terms: Algorithms, Experimentation, Theory.
...result in an implementation that cannot be executed within these timing constraints. It is necessary to take the timing constraints into account while minimizing the buffers. Several approaches have been proposed for minimizing buffer requirements under a throughput constraint. In [9], a technique based on linear programming is proposed to calculate a schedule that realizes the maximal throughput while it tries to minimize buffer sizes. Hwang et al. propose a heuristic that can take resource constraints into account [10]. This method is targeted towards a-cyclic graphs and it always maximizes throughput rather than using a throughput constraint. Thus, it could lead to additional resource requirements. In [13], buffer minimization for maximal throughput of a subclass of SDFGs (homogeneous SDFGs) is achieved via an integer linear programming approach. In general, the minimal buffer sizes obtained with this approach cannot be translated to exact minimal buffer sizes for arbitrary SDFGs. We propose, in contrast to existing work, an ex...

166 citations

Book Chapter
01 Jan 1999
TL;DR: In a single-phase edge-triggered circuit, in the case where there is no clock skew, the designer must ensure that for correct operation, each input-output path of a combinational subcircuit has a delay that is less than the clock period.
Abstract: Conventional synchronous circuit design is predicated on the assumption that each clock signal of the same phase arrives at each memory element at exactly the same time. In a sequential VLSI circuit, due to differences in interconnect delays on the clock distribution network, this simultaneity is difficult to achieve and clock signals do not arrive at all of the registers at the same time. This is referred to as a skew in the clock. In a single-phase edge-triggered circuit, in the case where there is no clock skew, the designer must ensure that for correct operation, each input-output path of a combinational subcircuit has a delay that is less than the clock period. In the presence of skew, however, the relation grows more complex and the task of designing the combinational subcircuits becomes more involved.
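The condition described in this abstract can be written compactly. The following is a sketch in commonly used notation, not taken from the chapter itself: the symbols $P$ (clock period), $d_{ij}^{\max}$ and $d_{ij}^{\min}$ (longest and shortest combinational delay from register $i$ to register $j$), $s_i$ (clock arrival time at register $i$), and the setup and hold times are all assumptions introduced here for illustration.

```latex
% Zero skew: every combinational path must settle within one clock period.
\[
  d_{ij}^{\max} + t_{\mathrm{setup}} \le P
\]
% With skew: the clock reaches register i at time s_i and register j at s_j.
\begin{align*}
  s_i + d_{ij}^{\max} + t_{\mathrm{setup}} &\le s_j + P
    && \text{(long-path constraint)} \\
  s_i + d_{ij}^{\min} &\ge s_j + t_{\mathrm{hold}}
    && \text{(short-path constraint)}
\end{align*}
```

Setting $s_i = s_j$ recovers the zero-skew condition. With nonzero skew, each register pair contributes two inequalities, which is one way to see why the relation becomes more involved.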

148 citations