Author

Harini Sriraman

Bio: Harini Sriraman is an academic researcher from VIT University. The author has contributed to research in topics: Multi-core processor & Overhead (computing). The author has an h-index of 1, co-authored 5 publications receiving 3 citations.

Papers
Journal ArticleDOI
TL;DR: An architecture and algorithm are proposed for self-repair of design bugs in the data path using an FPGA, which is reconfigured at run time to take over the functions of the faulty component.
Abstract: With today's transistor densities, it is impossible to verify all components exhaustively for different scenarios. As a result, design bugs, also known as extrinsic hardware faults, escape into the processor chip despite multiple levels of testing. Handling design bugs efficiently in the field is therefore a necessity in modern multi-core processors. This paper proposes an architecture and algorithm for self-repair of design bugs in the data path using an FPGA, which is reconfigured at run time to take over the functions of the faulty component. To verify the effectiveness of the proposed design, a representative sample of five faults is injected and handled. The area and time overheads of the proposed design are measured using Cadence NCVerilog and the gem5 simulator, respectively. The area overhead is < 1%, and performance improves by around 2.5% compared to existing techniques.

2 citations

Book ChapterDOI
01 Jan 2019
TL;DR: Result analysis shows that performance peaks when the pattern size matches the tile size and is less than 64, a restriction due to the warp size considered.
Abstract: Parallelizing pattern matching in multidimensional images is vital to improving performance in many applications. On SIMT architectures, performance can be greatly enhanced if the hardware threads are utilized to the maximum. In pattern matching algorithms, the main bottleneck is the reduction operation that must be performed over the multiple parallel search operations. This can be solved using Shift-Or operations. Recent work has shown improved bit-pattern matching using Shift-Or operations; this needs to be extended to multidimensional images such as hyper-cubes. In this paper, we extend Shift-Or pattern matching to multidimensional images and implement the algorithm for GPU architectures. The complexity of the proposed algorithm is \( m \cdot \frac{\log(n)}{kw} \), where m is the number of dimensions, n is the size of the array when the multidimensional matrix is flattened into a single-dimensional array, k is the size of the pattern, and w is the size of the tile. Result analysis shows that performance peaks when the pattern size matches the tile size and is less than 64; this restriction is due to the warp size considered.
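The 1-D Shift-Or primitive that the chapter extends can be sketched as follows (a minimal Python sketch; the function name and the 64-bit word limit are illustrative, and the chapter's multidimensional tiled GPU version is not reproduced here):

```python
def shift_or_search(text, pattern):
    """Bitwise Shift-Or exact matching on a 1-D sequence.

    A 0-bit at position i of `state` means pattern[:i+1] currently
    matches a suffix of the scanned text; bit m-1 clear signals a hit.
    """
    m = len(pattern)
    assert 0 < m <= 64, "pattern must fit in one machine word"
    # Preprocessing: per-symbol masks with 0-bits at the symbol's positions.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, ~0) & ~(1 << i)
    state, hits = ~0, []
    for j, c in enumerate(text):
        state = (state << 1) | masks.get(c, ~0)
        if state & (1 << (m - 1)) == 0:
            hits.append(j - m + 1)  # start index of the occurrence
    return hits
```

Each text symbol costs one shift and one OR per word, which is what makes the primitive attractive for SIMT lanes once the image is tiled.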

1 citation

Proceedings ArticleDOI
25 Mar 2021
TL;DR: In this paper, staleness among the worker nodes is identified as the main problem caused by stragglers; the different methods used to address this issue are described in detail, and open research problems in this field are highlighted.
Abstract: Deep learning for image analytics is widely used in many real-world applications. Due to the rapid growth in data and model sizes, there is a need to distribute models across multiple nodes. Distributed computation of the model improves scalability, training time, and cost effectiveness. However, distribution can lead to longer computation times when nodes become stale. The computation time of the distributed nodes is affected by many factors, such as communication latency, network connectivity, resource sharing, and computational power. The main problem in distribution is staleness among the worker nodes. The effect of stragglers cannot be completely avoided in distributed clusters. Failures in storage and disks, imbalanced workloads, resource sharing, etc. are the main causes of stragglers. Stragglers can lengthen computation time and reduce the performance of the model. This paper describes in detail the different methods used to address this issue and highlights open research problems in this field.

1 citation

Journal ArticleDOI
TL;DR: The lifetime reliability of processors has become a major design constraint in the dark silicon era; design defects and aging are the main concerns.
Abstract: The lifetime reliability of processors has become a major design constraint in the dark silicon era. Processor reliability issues are mainly due to design defects and aging. Unlike design defects, ...
Book ChapterDOI
01 Jan 2019
TL;DR: The proposed solution maximizes the reliability of the processor core without much area or time overhead and utilizes the reconfigurable hardware already present in newer embedded systems such as the Intel Atom E6x5C series.
Abstract: With the increasing complexity of processor architectures and their vulnerability to hard faults, it is vital to have self-repairing processor architectures. This paper proposes autonomic repair of permanent hard faults in the functional units of an out-of-order processor core using a reconfigurable FPGA. The proposed technique utilizes the reconfigurable hardware already present in newer embedded systems such as the Intel Atom E6x5C series. It adds an on-chip buffer, a fully associative fault status table, and a few control signals to the existing core. To perform self-repair, the decoder identifies references to the faulty unit and initiates configuration of the reconfigurable hardware as that unit. The dispatch unit helps resolve reservation station conflicts for the reconfigurable hardware. Instructions that reference a faulty unit execute in the reconfigurable unit, and the dispatch unit and buffers complete their out-of-order execution and in-order commit. A hypothetical architecture loosely resembling the ALPHA 21264 is designed as a test bed for analyzing the proposed self-repair mechanism. Area and time overhead analyses are done using the Cadence NCVerilog simulator, Xilinx Vivado/ISE, and an FPGA prototype board. The spatial and temporal costs of the proposed design are around 2% and 2.64%, respectively. With the recent increase in hybrid architectures that tightly couple an FPGA with an ASIC processor core, the proposed solution maximizes the reliability of the processor core without much area or time overhead.
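The lookup-and-redirect flow described above can be modeled in software roughly as follows (a hypothetical Python sketch; the class and function names, and the use of strings for unit names and bitstream identifiers, are my own illustrations, not the paper's interfaces):

```python
class FaultStatusTable:
    """Models the paper's fully associative fault status table: it maps a
    faulty functional unit to the bitstream that reconfigures the FPGA
    as that unit. Names and types here are illustrative."""

    def __init__(self):
        self.entries = {}  # unit name -> bitstream identifier

    def mark_faulty(self, unit, bitstream):
        self.entries[unit] = bitstream

    def lookup(self, unit):
        return self.entries.get(unit)  # None means the unit is healthy


def dispatch(instr_unit, fst, fpga_state):
    """Route one instruction: if its functional unit is marked faulty,
    (re)configure the FPGA as that unit and execute there instead."""
    bitstream = fst.lookup(instr_unit)
    if bitstream is None:
        return "execute on " + instr_unit
    if fpga_state.get("configured_as") != instr_unit:
        fpga_state["configured_as"] = instr_unit  # models run-time reconfiguration
    return "execute on FPGA configured as " + instr_unit
```

The sketch only captures the control decision; reservation-station conflict resolution and in-order commit, which the dispatch unit and buffers handle in the paper, are omitted.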

Cited by
01 Jan 1999
TL;DR: In this article, the Shift-And algorithm was used to solve the problem of pattern matching in LZW compressed text, for pattern lengths up to 32, i.e., the machine word length.
Abstract: This paper considers the Shift-And approach to the problem of pattern matching in LZW compressed text and gives a new algorithm that solves it. The algorithm is fast when the pattern length is at most 32, i.e., the word length. After O(m + |Σ|)-time and O(|Σ|)-space preprocessing of the pattern, it scans an LZW compressed text in O(n + r) time and reports all occurrences of the pattern, where n is the compressed text length, m is the pattern length, and r is the number of pattern occurrences. Experimental results show that it runs approximately 1.5 times faster than decompression followed by a simple search using the Shift-And algorithm. Moreover, like the Shift-And algorithm, it extends to generalized pattern matching, pattern matching with k mismatches, and multiple-pattern matching.
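For reference, the plain (uncompressed) Shift-And primitive that the compressed-text algorithm builds on can be sketched as follows (a minimal Python sketch; the function name is illustrative, and the LZW-aware scanning that is the paper's actual contribution is not reproduced):

```python
def shift_and_search(text, pattern):
    """Bitwise Shift-And exact matching: 1-bits in `state` mark
    pattern prefixes that currently match; bit m-1 set signals a hit."""
    m = len(pattern)
    assert 0 < m <= 32, "the paper's fast case: pattern fits in one word"
    # O(m + |sigma|) preprocessing: per-symbol masks with 1-bits at positions.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    state, hits = 0, []
    for j, c in enumerate(text):
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(j - m + 1)  # start index of the occurrence
    return hits
```

The paper's contribution is simulating this automaton directly over LZW codewords, so the O(n + r) scan runs on the compressed length n rather than the decompressed text.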

56 citations

01 Oct 2017
TL;DR: This paper implements a fast GCD coprocessor based on Euclid's method with variable precisions (32-bit to 1024-bit) and shows that the design area is scalable and can be easily increased or embedded with many other design applications.
Abstract: Introduction: Euclid's algorithm is well known for its efficiency and simple iterative structure in computing the greatest common divisor (GCD) of two non-negative integers. It underlies almost all public-key cryptographic algorithms over finite-field arithmetic. This, in turn, has led to increased research in this domain, particularly aimed at improving throughput for many GCD-based applications. Methodology: In this paper, we implement a fast GCD coprocessor based on Euclid's method with variable precisions (32-bit to 1024-bit). The proposed implementation was benchmarked on seven field programmable gate array (FPGA) chip families (one Altera chip and six Xilinx chips) and evaluated on four cost factors: maximum frequency, total delay, hardware utilization, and total FPGA thermal power dissipation. Results: The results show that the XC7VH290T-2-HCG1155 and XC7K70T-2-FBG676 devices recorded the best maximum frequencies, from 243.934 MHz down to 39.94 MHz for 32-bit and 1024-bit precisions, respectively. Additionally, the implementation at different precisions utilized minimal resources of the target device, i.e., at most 2% of device registers and 4% of look-up tables (LUTs). Conclusions: These results imply that the design area is scalable and can easily be increased or embedded within many other design applications. Finally, comparisons with previous designs/implementations show that the proposed coprocessor is faster than many reported state-of-the-art solutions. This paper is an extended version of our conference paper [1].
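As a software reference for the arithmetic the coprocessor implements, Euclid's iterative method can be sketched in a few lines (Python; the function name is illustrative, and the fixed-precision hardware datapath is not modeled):

```python
def gcd_euclid(a, b):
    """Iterative Euclid's algorithm: repeatedly replace the pair (a, b)
    by (b, a mod b) until the second operand reaches zero."""
    while b:
        a, b = b, a % b
    return a
```

Python's arbitrary-precision integers let the same function run at the 32- to 1024-bit operand widths the coprocessor targets, though of course without the hardware's cycle-level behavior.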

1 citation

Journal ArticleDOI
TL;DR: In this paper, performance monitoring counters (PMCs) and machine learning models are used to detect and locate pipeline bugs in a processor, and a bug injection framework is developed to synthetically inject bugs into x86 pipeline stages.
Book ChapterDOI
21 Oct 2022
TL;DR: In this chapter, the deployment of HPC accelerators for CNNs and how acceleration is achieved are discussed, and the leading cloud platforms used in computer vision for acceleration are listed.
Abstract: Image processing combined with computer vision is creating vast breakthroughs in many research, industrial, and social applications. The growth of big data has led to large quantities of high-resolution images that can be used in complex applications and processing. Rapid image processing methods are needed to obtain accurate results faster in time-critical applications. In such cases, the algorithms and models need to be accelerated using HPC systems. This acceleration can be achieved with hardware accelerators such as GPUs, TPUs, and FPGAs. GPUs and TPUs are mainly used to implement and run the algorithms in parallel. Selecting the acceleration method and hardware is challenging, since numerous accelerators are available and deep knowledge and understanding of the algorithms is required. This chapter explains the deployment of HPC accelerators for CNNs and how acceleration is achieved. The leading cloud platforms used in computer vision for acceleration are also listed.