Nichamon Naksinehaboon

Researcher at Louisiana Tech University

Publications - 16

Citations - 530

Nichamon Naksinehaboon is an academic researcher from Louisiana Tech University. The author has contributed to research on fault tolerance and rollback. The author has an h-index of 10 and has co-authored 16 publications receiving 493 citations.

Papers

Proceedings Article

The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications

Anthony Agelastos, +14 more

TL;DR: The Lightweight Distributed Metric Service is introduced for scalable, lightweight monitoring of large-scale computing systems and applications, along with its motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters.

Proceedings Article

An optimal checkpoint/restart model for a large scale high performance computing system

Yudan Liu, +5 more

TL;DR: This work presents a reliability-aware method for an optimal checkpoint/restart strategy that can deal with a varying checkpoint interval and with different failure distributions, and aims to address the fault tolerance challenge, especially in a large-scale HPC system.

Proceedings Article

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Nichamon Naksinehaboon, +5 more

TL;DR: This work builds a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and gives a method to find the number of those incremental checkpoints.

Proceedings Article

A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

Yudan Liu, +5 more

TL;DR: This work presents a reliability-aware method for an optimal checkpoint/restart strategy that minimizes rollback and checkpoint overheads, and aims to address the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques derived from the actual system reliability.

Proceedings Article

Blue Gene/L Log Analysis and Time to Interrupt Estimation

Narate Taerat, +7 more

TL;DR: System- and application-level failures could be characterized by analyzing relevant log files, and various time-to-repair factors were applied to obtain application time to interrupt, which will be exploited in further resilience modeling research.
