Nichamon Naksinehaboon

Researcher at Louisiana Tech University

Publications -  16
Citations -  530

Nichamon Naksinehaboon is an academic researcher at Louisiana Tech University. The author has contributed to research on the topics of fault tolerance and rollback. The author has an h-index of 10 and has co-authored 16 publications receiving 493 citations.

Papers
Proceedings Article

The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications

TL;DR: This work introduces the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large-scale computing systems and applications, and describes its motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters.
Proceedings Article

An optimal checkpoint/restart model for a large scale high performance computing system

TL;DR: This work presents a reliability-aware method for an optimal checkpoint/restart strategy that can handle a varying checkpoint interval and different failure distributions, and aims to address the fault tolerance challenge, especially in large-scale HPC systems.
Proceedings Article

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

TL;DR: This work builds a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and gives a method to find the number of those incremental checkpoints.
Proceedings Article

A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

TL;DR: This work presents a reliability-aware method for an optimal checkpoint/restart strategy that minimizes rollback and checkpoint overheads. It aims to address the fault tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques derived from the actual system reliability.

Blue Gene/L Log Analysis and Time to Interrupt Estimation

TL;DR: System- and application-level failures could be characterized by analyzing relevant log files, and various time-to-repair factors were applied to obtain the application time to interrupt, which will be exploited in further resilience modeling research.