Author

Nitin Chandrachoodan

Other affiliations: Indian Institutes of Technology, University of Maryland, College Park

Bio: Nitin Chandrachoodan is an academic researcher from the Indian Institute of Technology Madras. The author has contributed to research on the topics Computer science and Adder. The author has an h-index of 9 and has co-authored 60 publications receiving 294 citations. Previous affiliations of Nitin Chandrachoodan include the Indian Institutes of Technology and the University of Maryland, College Park.

Papers published on a yearly basis

2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2009, 2008, 2007, 2004, 2003, 2002, 2001, 1999

Papers

Journal Article

FPGA-Based High-Performance and Scalable Block LU Decomposition Architecture

Manish Kumar Jaiswal¹, Nitin Chandrachoodan²

ICFAI University, Dehradun¹, Indian Institute of Technology Madras²

01 Jan 2012, IEEE Transactions on Computers

TL;DR: The design, which has been synthesized for FPGA targets and can be easily retargeted, outperforms previous hardware implementations as well as tuned software implementations, including the ATLAS and MKL libraries, on workstations.

Abstract: Decomposition of a matrix into lower and upper triangular matrices (LU decomposition) is a vital part of many scientific and engineering applications, and the block LU decomposition algorithm is an approach well suited to parallel hardware implementation. This paper presents an approach to speed up implementation of the block LU decomposition algorithm using FPGA hardware. Unlike most previous approaches reported in the literature, the approach does not assume the matrix can be stored entirely on chip. The memory accesses are studied for various FPGA configurations, and a schedule of operations for scaling well is shown. The design has been synthesized for FPGA targets and can be easily retargeted. The design outperforms previous hardware implementations, as well as tuned software implementations including the ATLAS and MKL libraries on workstations.

55 citations

Proceedings Article

GPU Implementation of a Programmable Turbo Decoder for Software Defined Radio Applications

Dhiraj Reddy Nallapa Yoge¹, Nitin Chandrachoodan¹

Indian Institute of Technology Madras¹

07 Jan 2012

TL;DR: This paper presents the implementation of a 3GPP standards-compliant configurable turbo decoder on a GPU by suitably parallelizing the Log-MAP decoding algorithm and mapping it onto the GPU in an architecture-aware manner.

Abstract: This paper presents the implementation of a 3GPP standards-compliant configurable turbo decoder on a GPU. The challenge in implementing a turbo decoder on a GPU is in suitably parallelizing the Log-MAP decoding algorithm and doing an architecture-aware mapping of it onto the GPU. The approximations in parallelizing the Log-MAP algorithm come at the cost of reduced BER performance. To mitigate this reduction, different guarding mechanisms of varying computational complexity have been presented. The limited shared memory and registers available on GPUs are carefully allocated to obtain a high real-time decoding rate without requiring several independent data streams in parallel.

25 citations

Proceedings Article

An efficient timing model for hardware implementation of multirate dataflow graphs

Nitin Chandrachoodan¹, S.S. Bhattacharyya, K.J.R. Liu

University of Maryland, College Park¹

07 May 2001

TL;DR: This work considers the problem of representing timing information associated with functions in a dataflow graph used to represent a signal processing system in the context of high-level hardware (architectural) synthesis, and shows that, with some reasonable assumptions on the way hardware implementations of multirate systems operate, one can derive general hierarchical descriptions of multirate systems similarly to single-rate systems.

Abstract: We consider the problem of representing timing information associated with functions in a dataflow graph used to represent a signal processing system in the context of high-level hardware (architectural) synthesis. This information is used for synthesis of appropriate architectures for implementing the graph. Conventional models for timing suffer from shortcomings that make it difficult to represent timing information in a hierarchical manner, especially for multirate signal processing systems. We identify some of these shortcomings, and provide an alternate model that does not have these problems. We show that with some reasonable assumptions on the way hardware implementations of multirate systems operate, we can derive general hierarchical descriptions of multirate systems similarly to single-rate systems. Several analytical results, such as the computation of the iteration period bound, that previously applied only to single-rate systems can also easily be extended to multirate systems under the new assumptions. We have applied our model to several multirate signal processing applications, and obtained favorable results. We present results of the timing information computed for several multirate DSP applications that show how the new treatment can streamline the problem of performance analysis and synthesis of such systems.

25 citations

Proceedings Article

A GPU implementation of belief propagation decoder for polar codes

L Bharath Kumar Reddy¹, Nitin Chandrachoodan¹

Indian Institute of Technology Madras¹

01 Nov 2012

TL;DR: The BP decoding algorithm is implemented to exploit the parallel computing capability of GPUs at both the thread level and the block level; by utilizing the limited shared memory available on GPUs, real-time decoding performance is achieved.

Abstract: We present a Graphics Processing Unit (GPU) implementation of a Belief Propagation (BP) based decoder for polar codes. The BP decoding algorithm is implemented to utilize the parallel computing capability of the GPUs. We show how the algorithm can make use of parallelism both at the thread level and block level, and by utilizing the limited shared memory available on GPUs, a real-time decoding performance is achieved. The resulting algorithm is able to achieve a decoding throughput of almost 5 Mbps while maintaining a frame error rate below 10⁻³ on code blocks of 1024 bits.

20 citations

Proceedings Article

EOG based virtual keyboard

S. Sai Surya Teja¹, Sharat S. Embrandiri¹, Nitin Chandrachoodan¹, M. Ramasubba Reddy¹

Indian Institute of Technology Madras¹

17 Apr 2015

TL;DR: An indigenously developed acquisition system, based on an Arduino-interfaced ADS1299 with a wearable dry electrode mask, is used to record and process EOG signals.

Abstract: In this paper, an EOG-based assistive system for typing text using a virtual keyboard is presented. An indigenously developed acquisition system, based on an Arduino-interfaced ADS1299 with a wearable dry electrode mask, is used to record and process EOG signals. An accuracy of 100% and an average speed of 1 character per 12 seconds were achieved by an untrained person in an online implementation of this system. This can be further improved by word prediction algorithms.

18 citations

Cited by

Parallel CAD: Algorithm Design and Programming Special Section Call for Papers TODAES: ACM Transactions on Design Automation of Electronic Systems

Kurt Keutzer, Peng Li, Li Shang, Hai Zhou

01 Jan 2010

TL;DR: This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification, as well as other topics relevant to the design of parallel CAD algorithms and software tools.

Abstract: High-performance parallel computer architecture and systems have been improved at a phenomenal rate. In the meantime, VLSI computer-aided design (CAD) software for multibillion-transistor IC design has become increasingly complex and requires prohibitively high computational resources. Recent studies have shown that numerous CAD problems, with their high computational complexity, can greatly benefit from the fast-increasing parallel computation capabilities. However, parallel programming imposes big challenges for CAD applications. Fully exploiting the computational power of emerging general-purpose and domain-specific multicore/many-core processor systems calls for fundamental research and engineering practice across every stage of parallel CAD design, from algorithm exploration, programming models, design-time and run-time environment, to CAD applications, such as verification, optimization, and simulation. This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification. More specifically, papers with in-depth and extensive coverage of the following topics will be considered, as well as other topics relevant to the design of parallel CAD algorithms and software tools.
1. Parallel algorithm design and specification for CAD applications
2. Parallel programming models and languages of particular use in CAD
3. Runtime support and performance optimization for CAD applications
4. Parallel architecture-specific design and optimization for CAD applications
5. Parallel program debugging and verification techniques particularly relevant for CAD
The papers should be submitted via the Manuscript Central website and should adhere to standard ACM TODAES formatting requirements (http://todaes.acm.org/). The page count limit is 25.

459 citations

Proceedings Article

An empirical characterization of stream programs and its implications for language and compiler design

William Thies¹, Saman Amarasinghe²

Microsoft¹, Massachusetts Institute of Technology²

11 Sep 2010

TL;DR: This work characterizes a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications.

Abstract: Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. In order to develop effective compilation techniques for the streaming domain, it is important to understand the common characteristics of these programs. Prior characterizations of stream programs have examined legacy implementations in C, C++, or FORTRAN, making it difficult to extract the high-level properties of the algorithms. In this work, we characterize a large set of stream programs that was implemented directly in a stream programming language, allowing new insights into the high-level structure and behavior of the applications. We utilize the StreamIt benchmark suite, consisting of 65 programs and 33,600 lines of code. We characterize the bottlenecks to parallelism, the data reference patterns, the input/output rates, and other properties. The lessons learned have implications for the design of future architectures, languages and compilers for the streaming domain.

179 citations

Journal Article

Experimental analysis of the fastest optimum cycle ratio and mean algorithms

Ali Dasdan¹

Synopsys¹

01 Oct 2004, ACM Transactions on Design Automation of Electronic Systems

TL;DR: This article focuses on the fastest OCR algorithms only, provides a unified theoretical framework and a few new results, and runs these algorithms on the largest circuit benchmarks available.

Abstract: Optimum cycle ratio (OCR) algorithms are fundamental to the performance analysis of (digital or manufacturing) systems with cycles. Some applications in the computer-aided design field include cycle time and slack optimization for circuits, retiming, timing separation analysis, and rate analysis. There are many OCR algorithms, and since a superior time complexity in theory does not mean a superior time complexity in practice, or vice versa, it is important to know how these algorithms perform in practice on real circuit benchmarks. A recently published study experimentally evaluated almost all the known OCR algorithms, and determined the fastest one among them. This article improves on that study in the following ways: (1) it focuses on the fastest OCR algorithms only; (2) it provides a unified theoretical framework and a few new results; (3) it runs these algorithms on the largest circuit benchmarks available; (4) it compares the algorithms in terms of many properties in addition to running times, such as operation counts, convergence behavior, space requirements, generality, simplicity, and robustness; (5) it analyzes the experimental results using statistical techniques and provides asymptotic time complexity of each algorithm in practice; and (6) it provides clear guidance to the use and implementation of these algorithms together with our algorithmic improvements.

175 citations

Exploring Trade-Offs in Buffer Requirements and Throughput Constraints for Synchronous Dataflow Graphs

Sander Stuijk

01 Jan 2006

TL;DR: This work presents exact techniques to chart the Pareto space of throughput and storage tradeoffs, which can be used to determine the minimal storage space needed to execute a graph under a given throughput constraint.

Abstract: Multimedia applications usually have throughput constraints. An implementation must meet these constraints, while it minimizes resource usage and energy consumption. The compute intensive kernels of these applications are often specified as Synchronous Dataflow Graphs. Communication between nodes in these graphs requires storage space which influences throughput. We present exact techniques to chart the Pareto space of throughput and storage trade-offs, which can be used to determine the minimal storage space needed to execute a graph under a given throughput constraint. The feasibility of the approach is demonstrated with a number of examples.

Categories and Subject Descriptors: C.3 [Special-purpose and Application-based Systems] Signal processing systems

General Terms: Algorithms, Experimentation, Theory.

…result in an implementation that cannot be executed within these timing constraints. It is necessary to take the timing constraints into account while minimizing the buffers. Several approaches have been proposed for minimizing buffer requirements under a throughput constraint. In [9], a technique based on linear programming is proposed to calculate a schedule that realizes the maximal throughput while it tries to minimize buffer sizes. Hwang et al. propose a heuristic that can take resource constraints into account [10]. This method is targeted towards a-cyclic graphs and it always maximizes throughput rather than using a throughput constraint. Thus, it could lead to additional resource requirements. In [13], buffer minimization for maximal throughput of a subclass of SDFGs (homogeneous SDFGs) is achieved via an integer linear programming approach. In general, the minimal buffer sizes obtained with this approach cannot be translated to exact minimal buffer sizes for arbitrary SDFGs. We propose, in contrast to existing work, an ex…

166 citations

Book Chapter

Clock Skew Optimization

Naresh Maheshwari¹, Sachin S. Sapatnekar²

Iowa State University¹, University of Minnesota²

01 Jan 1999

TL;DR: In a single-phase edge-triggered circuit, in the case where there is no clock skew, the designer must ensure that for correct operation, each input-output path of a combinational subcircuit has a delay that is less than the clock period.

Abstract: Conventional synchronous circuit design is predicated on the assumption that each clock signal of the same phase arrives at each memory element at exactly the same time. In a sequential VLSI circuit, due to differences in interconnect delays on the clock distribution network, this simultaneity is difficult to achieve and clock signals do not arrive at all of the registers at the same time. This is referred to as a skew in the clock. In a single-phase edge-triggered circuit, in the case where there is no clock skew, the designer must ensure that for correct operation, each input-output path of a combinational subcircuit has a delay that is less than the clock period. In the presence of skew, however, the relation grows more complex and the task of designing the combinational subcircuits becomes more involved.

148 citations
