Author

Y. N. Srikant

Bio: Y. N. Srikant is an academic researcher from the Indian Institute of Science. The author has contributed to research in topics: Cache & Very long instruction word. The author has an h-index of 15, co-authored 98 publications receiving 808 citations.


Papers
BookDOI
01 Aug 2002
TL;DR: A collection of chapters on compiler topics, including Worst-Case Execution Time and Energy Analysis (T. Mitra, A. Roychoudhury), Static Program Analysis for Security (K. Gopinath), Statistical and Machine Learning Techniques in Compiler Design (K. Vaswani), and Type Systems: Advances and Applications (J. Palsberg, T. Millstein).
Abstract: Chapter titles with authors:
Worst-Case Execution Time and Energy Analysis (T. Mitra, A. Roychoudhury)
Static Program Analysis for Security (K. Gopinath)
Compiler Aided Design of Embedded Computers (A. Shrivastava, N. Dutt)
Whole Execution Traces and Their Use in Debugging (X. Zhang, N. Gupta, R. Gupta)
Optimizations for Memory Hierarchies (E. Raman, D. I. August)
Garbage Collection Techniques (A. Sanyal, U. P. Khedker)
Energy-Aware Compiler Optimizations (Y. N. Srikant, K. A. Vardhan)
Statistical and Machine Learning Techniques in Compiler Design (K. Vaswani)
Type Systems: Advances and Applications (J. Palsberg, T. Millstein)
Dynamic Compilation (E. Duesterwald)
The Static Single Assignment Form: Construction and Application to Program Optimization (J. P. Prabhu, P. Shankar, Y. N. Srikant)
Shape Analysis and Applications (T. Reps, M. Sagiv, R. Wilhelm)
Optimizations for Object-Oriented Languages (A. Krall, N. Horspool)
Program Slicing (G. B. Mund, R. Mall)
Computations on Iteration Spaces (S. Rajopadhye, L. Renganarayana, G. Gupta, M. Strout)
Architecture Description Languages for Retargetable Compilation (W. Qin, S. Malik)
Instruction Selection Using Tree Parsing (P. Shankar)
A Retargetable Very Long Instruction Word Compiler Framework for Digital Signal Processors (S. Rajagopalan, S. Malik)
Instruction Scheduling (R. Govindarajan)
Advances in Software Pipelining (H. Rong, R. Govindarajan)
Advances in Register Allocation Techniques (V. K. Nandivada)

111 citations

Proceedings ArticleDOI
30 Sep 2007
TL;DR: Techniques to estimate the worst case execution time of executable code on architectures with data caches, built on Abstract Interpretation, which is used for the dual purposes of tracking address computations and cache behavior.
Abstract: This paper describes techniques to estimate the worst case execution time of executable code on architectures with data caches. The underlying mechanism is Abstract Interpretation, which is used for the dual purposes of tracking address computations and cache behavior. A simultaneous numeric and pointer analysis using an abstraction for discrete sets of values computes safe approximations of access addresses which are then used to predict cache behavior using Must Analysis. A heuristic is also proposed which generates likely worst case estimates. It can be used in soft real time systems and also for reasoning about the tightness of the safe estimate. The analysis methods can handle programs with non-affine access patterns, for which conventional Presburger Arithmetic formulations or Cache Miss Equations do not apply. The precision of the estimates is user-controlled and can be traded off against analysis time. Executables are analyzed directly, which, apart from enhancing precision, renders the method language independent.

55 citations
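
The Must Analysis mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a single fully associative LRU cache set and accesses whose block is known exactly; the paper itself works with sets of possible addresses on executables, which this sketch does not model. The names `update`, `join`, and `always_hit` are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Abstract must-cache state: block name -> upper bound on its LRU age.
# Age 0 is the most recently used position; a block absent from the
# mapping is not guaranteed to be cached.
AbstractCache = Dict[str, int]


def update(state: AbstractCache, block: str, assoc: int) -> AbstractCache:
    """Abstract LRU update for one access to `block` in a set of `assoc` lines."""
    # If the block is not guaranteed to be cached, treat its age as `assoc`.
    old_age = state.get(block, assoc)
    new_state: AbstractCache = {}
    for other, age in state.items():
        if other == block:
            continue
        # Blocks younger than the accessed block age by one step;
        # blocks at least as old keep their bound.
        new_age = age + 1 if age < old_age else age
        if new_age < assoc:  # drop blocks that may have been evicted
            new_state[other] = new_age
    new_state[block] = 0
    return new_state


def join(a: AbstractCache, b: AbstractCache) -> AbstractCache:
    """Merge two control-flow paths: keep blocks present in both, with the larger age."""
    return {blk: max(a[blk], b[blk]) for blk in a.keys() & b.keys()}


def always_hit(state: AbstractCache, block: str) -> bool:
    """An access can be classified as a guaranteed hit if the block is in the must state."""
    return block in state
```

Taking the maximum age at a join is what keeps the result safe: a block is predicted as a hit only if it is guaranteed to be cached along every path reaching that point.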

Journal ArticleDOI
TL;DR: IR2VEC is a distributed embedding infrastructure that combines representation learning methods with flow information to capture the syntax as well as the semantics of the input programs, and it achieves state-of-the-art performance on heterogeneous device mapping and thread coarsening.
Abstract: We propose IR2VEC, a concise and scalable encoding infrastructure to represent programs as a distributed embedding in continuous space. This distributed embedding is obtained by combining representation learning methods with flow information to capture the syntax as well as the semantics of the input programs. As our infrastructure is based on the Intermediate Representation (IR) of the source code, the obtained embeddings are both language and machine independent. The entities of the IR are modeled as relationships, and their representations are learned to form a seed embedding vocabulary. Using this infrastructure, we propose two incremental encodings: Symbolic and Flow-Aware. Symbolic encodings are obtained from the seed embedding vocabulary, and Flow-Aware encodings are obtained by augmenting the Symbolic encodings with the flow information. We show the effectiveness of our methodology on two optimization tasks (heterogeneous device mapping and thread coarsening). Our way of representing the programs enables us to use non-sequential models, resulting in orders of magnitude faster training. Both the encodings generated by IR2VEC outperform the existing methods in both tasks, even while using simple machine learning models. In particular, our results improve or match the state-of-the-art speedup in 11/14 benchmark suites in the device mapping task across two platforms and 53/68 benchmarks in the thread coarsening task across four different platforms. When compared to the other methods, our embeddings are more scalable, are non-data-hungry, and have better Out-Of-Vocabulary (OOV) characteristics.

43 citations

Journal ArticleDOI
TL;DR: A domain-specific language (DSL) is proposed, Falcon, for implementing graph algorithms that abstracts the hardware, provides constructs to write explicitly parallel programs at a higher level, and can work with general algorithms that may change the graph structure.
Abstract: Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy—even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.

43 citations

Proceedings ArticleDOI
20 Mar 2005
TL;DR: A low-overhead, non-intrusive hardware path profiling scheme that can be programmed to detect several types of paths including acyclic, intra-procedural paths, paths for a whole program path and extended paths, enabling context-sensitive performance monitoring and bottleneck analysis.
Abstract: For aggressive path-based program optimizations to be profitable in cost-sensitive environments, accurate path profiles must be available at low overheads. In this paper, we propose a low-overhead, non-intrusive hardware path profiling scheme that can be programmed to detect several types of paths including acyclic, intra-procedural paths, paths for a whole program path and extended paths. The profiler consists of a path stack, which detects paths and generates a sequence of path descriptors using branch information from the processor pipeline, and a hot path table that collects a profile of hot paths for later use by a program optimizer. With assistance from the processor's event detection logic, our profiler can track a host of architectural metrics along paths, enabling context-sensitive performance monitoring and bottleneck analysis. We illustrate the utility of our scheme by associating paths with a power metric that estimates power consumption in the cache hierarchy caused by instructions along the path. Experiments using programs from the SPEC CPU2000 benchmark suite show that our path profiler, occupying 7KB of hardware real-estate, collects accurate path profiles (average overlap of 88% with a perfect profile) at negligible execution time overheads (0.6% on average).

41 citations


Cited by
Journal Article
TL;DR: This book by a teacher of statistics (as well as a consultant for "experimenters") is a comprehensive study of the philosophical background for the statistical design of experiment.
Abstract: THE DESIGN AND ANALYSIS OF EXPERIMENTS. By Oscar Kempthorne. New York, John Wiley and Sons, Inc., 1952. 631 pp. $8.50. This book by a teacher of statistics (as well as a consultant for "experimenters") is a comprehensive study of the philosophical background for the statistical design of experiment. It is necessary to have some facility with algebraic notation and manipulation to be able to use the volume intelligently. The problems are presented from the theoretical point of view, without such practical examples as would be helpful for those not acquainted with mathematics. The mathematical justification for the techniques is given. As a somewhat advanced treatment of the design and analysis of experiments, this volume will be interesting and helpful for many who approach statistics theoretically as well as practically. With emphasis on the "why," and with description given broadly, the author relates the subject matter to the general theory of statistics and to the general problem of experimental inference. MARGARET J. ROBERTSON

13,333 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: It is shown that a realistic hardware implementation that dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes improves performance by an average of 20.7% for an estimated area increase of 4.7%.
Abstract: Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead incurred by control hardware. Scalar threads are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One approach is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance. In this paper, we explore mechanisms for more efficient SIMD branch execution on GPUs. We show that a realistic hardware implementation that dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes improves performance by an average of 20.7% for an estimated area increase of 4.7%.

512 citations

Proceedings ArticleDOI
06 Apr 2008
TL;DR: The complexity involved in optimizing applications for one highly-parallel system and one relatively simple methodology for reducing the workload involved in the optimization process are shown.
Abstract: Program optimization for highly-parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly-parallel applications for these platforms, who lack the substantial experience and knowledge needed to maximize their performance. This creates a need for more structured optimization methods with means to estimate their performance effects. Furthermore, these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and one relatively simple methodology for reducing the workload involved in the optimization process. This work is based on one such highly-parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.

312 citations
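
The pruning step described in the abstract above, keeping only configurations on a Pareto-optimal curve, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the two metrics `metric_a` and `metric_b` are hypothetical placeholders for metrics derived from static code, and both are assumed to be "higher is better".

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    metric_a: float  # hypothetical static metric, higher is better
    metric_b: float  # hypothetical static metric, higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` on both metrics and strictly better on one."""
    no_worse = a.metric_a >= b.metric_a and a.metric_b >= b.metric_b
    strictly_better = a.metric_a > b.metric_a or a.metric_b > b.metric_b
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]
```

Only the surviving configurations would then need to be compiled and timed. The quadratic comparison is adequate for a few hundred candidates.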

Journal ArticleDOI
TL;DR: The paper describes the architectural influence on static timing analysis, gives recommendations as to profitable and unacceptable architectural features, and presents results showing that measurement-based methods still used in industry are not useful for quite commonly used complex processors.
Abstract: Embedded hard real-time systems need reliable guarantees for the satisfaction of their timing constraints. Experience with the use of static timing-analysis methods and the tools based on them in the automotive and the aeronautics industries is positive. However, both the precision of the results and the efficiency of the analysis methods are highly dependent on the predictability of the execution platform. In fact, the architecture determines whether a static timing analysis is practically feasible at all and whether the most precise obtainable results are precise enough. Results contained in this paper also show that measurement-based methods still used in industry are not useful for quite commonly used complex processors. This dependence on the architectural development is of growing concern to the developers of timing-analysis tools and their customers, the developers in industry. The problem reaches a new level of severity with the advent of multicore architectures in the embedded domain. This paper describes the architectural influence on static timing analysis and gives recommendations as to profitable and unacceptable architectural features.

249 citations

Proceedings ArticleDOI
25 Jul 2010
TL;DR: A time-constrained probabilistic factor graph model (TPFG) is proposed, which takes a research publication network as input and models the advisor-advisee relationship mining problem using a joint likelihood objective function; an efficient learning algorithm is designed to optimize the objective function.
Abstract: Information networks contain abundant knowledge about relationships among people or entities. Unfortunately, this kind of knowledge is often hidden in a network where different kinds of relationships are not explicitly categorized. For example, in a research publication network, the advisor-advisee relationships among researchers are hidden in the coauthor network. Discovery of those relationships can benefit many interesting applications such as expert finding and research community analysis. In this paper, we take a computer science bibliographic network as an example, to analyze the roles of authors and to discover the likely advisor-advisee relationships. In particular, we propose a time-constrained probabilistic factor graph model (TPFG), which takes a research publication network as input and models the advisor-advisee relationship mining problem using a joint likelihood objective function. We further design an efficient learning algorithm to optimize the objective function. Based on that, our model suggests and ranks probable advisors for every author. Experimental results show that the proposed approach infers advisor-advisee relationships efficiently and achieves state-of-the-art accuracy (80-90%). We also apply the discovered advisor-advisee relationships to bole search, a specific expert finding task, and an empirical study shows that the search performance can be effectively improved (+4.09% by NDCG@5).

212 citations
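
The abstract above reports its search improvement in NDCG@5. As a reminder of how that metric is computed, here is a minimal sketch. It assumes the common linear-gain, log2-discount definition and assumes that the input list contains the relevance of every candidate result; the paper may use a different variant. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists results in decreasing order of relevance scores 1.0, and pushing relevant results further down the list lowers the score.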