Nichamon Naksinehaboon
Researcher at Louisiana Tech University
Publications - 16
Citations - 530
Nichamon Naksinehaboon is an academic researcher from Louisiana Tech University. The author has contributed to research on the topics of fault tolerance and rollback. The author has an h-index of 10 and has co-authored 16 publications receiving 493 citations.
Papers
Proceedings Article
The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications
Anthony Agelastos, Benjamin A. Allan, Jim Brandt, Paul Cassella, Jeremy Enos, Joshi Fullop, Ann C. Gentile, Steve Monk, Nichamon Naksinehaboon, Jeff Ogden, Mahesh Rajan, Michael Showerman, Joel O. Stevenson, Narate Taerat, Thomas Tucker
TL;DR: The Lightweight Distributed Metric Service is introduced for scalable, lightweight monitoring of large-scale computing systems and applications, along with its motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters.
Proceedings Article
An optimal checkpoint/restart model for a large scale high performance computing system
TL;DR: This work presents a reliability-aware method for an optimal checkpoint/restart strategy that can deal with a varying checkpoint interval and with different failure distributions, and aims to address the fault tolerance challenge, especially in a large-scale HPC system.
Proceedings Article
Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments
TL;DR: A model is built that aims to reduce full-checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and a method for finding the number of those incremental checkpoints is given.
Proceedings Article
A reliability-aware approach for an optimal checkpoint/restart model in HPC environments
Yudan Liu, Raja Nassar, Chokchai Leangsuksun, Nichamon Naksinehaboon, Mihaela Paun, Stephen L. Scott
TL;DR: This work presents a reliability-aware method for an optimal checkpoint/restart strategy that minimizes rollback and checkpoint overheads. It aims to address the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques derived from the actual system reliability.
Proceedings Article
Blue Gene/L Log Analysis and Time to Interrupt Estimation
Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai Leangsuksun, George Ostrouchov, Stephen L. Scott, Christian Engelmann
TL;DR: System- and application-level failures could be characterized by analyzing relevant log files, and various time-to-repair factors were applied to obtain the application time to interrupt, which will be exploited in further resilience modeling research.