Chokchai Leangsuksun

Researcher at Louisiana Tech University

Publications -  60
Citations -  1098

Chokchai Leangsuksun is an academic researcher from Louisiana Tech University. The author has contributed to research in topics: Fault tolerance & High availability. The author has an h-index of 19 and has co-authored 60 publications receiving 1043 citations. Previous affiliations of Chokchai Leangsuksun include Kent State University.

Papers
Proceedings Article

An optimal checkpoint/restart model for a large scale high performance computing system

TL;DR: This work presents a reliability-aware method for an optimal checkpoint/restart strategy that can deal with a varying checkpoint interval and with different failure distributions, and aims to address the fault tolerance challenge, especially in large-scale HPC systems.
Journal Article

ASC: an associative-computing paradigm

TL;DR: This work presents a parallel programming paradigm called ASC (ASsociative Computing), designed for a wide range of computing engines, that incorporates data parallelism at the base level, so that programmers do not have to specify low-level sequential tasks such as sorting, looping and parallelization.
Proceedings Article

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

TL;DR: This work builds a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and gives a method to find the number of those incremental checkpoints.
Proceedings Article

A Framework for Proactive Fault Tolerance

TL;DR: This document presents a proactive fault tolerance framework that can use different reactive fault tolerance mechanisms, i.e., migration and pause/un-pause, and that allows the implementation of new proactive fault tolerance policies thanks to a modular architecture.
Proceedings Article

Availability modeling and analysis on high performance cluster computing systems

TL;DR: This paper proposes a single framework that coordinates event monitoring, filtering, data analysis and dynamic availability modeling, and presents a sample analysis of real-time event logs from a 512-node cluster from Lawrence Livermore National Laboratory.