Conference

Dependable Systems and Networks 

About: Dependable Systems and Networks is an academic conference. The conference publishes mainly in the areas of dependability and fault tolerance. Over its lifetime, the conference has published 2,088 papers, which have received 64,641 citations.


Papers
Proceedings ArticleDOI
23 Jun 2002
TL;DR: An end-to-end model is described and validated that computes soft error rates (SER) for existing and future microprocessor-style designs; it predicts that the SER per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011, at which point it will be comparable to the SER per chip of unprotected memory elements.
Abstract: This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latching-window masking, which inhibit soft errors in combinational logic. We quantify the SER due to high-energy neutrons in SRAM cells, latches, and logic circuits for feature sizes from 600 nm to 50 nm and clock periods from 16 to 6 fan-out-of-4 inverter delays. Our model predicts that the SER per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011 and at that point will be comparable to the SER per chip of unprotected memory elements. Our result emphasizes that computer system designers must address the risks of soft errors in logic circuits for future designs.
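As a rough illustration of the kind of first-order calculation such a model formalizes, the sketch below attenuates a raw strike rate by the two masking probabilities named in the abstract. All names and numbers are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch (not the paper's model): soft error rate (SER) of a
# combinational logic stage as a raw particle-strike rate attenuated by the
# two masking effects the paper models. All numbers below are made up.

def logic_ser(raw_ser_fit, p_electrical_mask, p_latch_window_mask):
    """SER that survives both masking effects, in FIT (failures per 1e9 hours).

    raw_ser_fit          -- rate at which particle strikes create a glitch
    p_electrical_mask    -- probability the glitch attenuates before a latch
    p_latch_window_mask  -- probability the glitch misses the latching window
    """
    return raw_ser_fit * (1 - p_electrical_mask) * (1 - p_latch_window_mask)

# As feature sizes shrink and clock periods tighten, glitches attenuate less
# and are latched more often, so both masking probabilities fall and SER rises.
print(logic_ser(raw_ser_fit=100.0, p_electrical_mask=0.99, p_latch_window_mask=0.90))  # "600 nm"-like
print(logic_ser(raw_ser_fit=100.0, p_electrical_mask=0.20, p_latch_window_mask=0.30))  # "50 nm"-like
```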

1,506 citations

Proceedings ArticleDOI
23 Jun 2002
TL;DR: This work presents a dynamic analysis methodology that automates problem determination in these environments by coarse-grained tagging of numerous real client requests as they travel through the system and using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault.
Abstract: Traditional problem determination techniques rely on static dependency models that are difficult to generate accurately in today's large, distributed, and dynamic application environments such as e-commerce systems. We present a dynamic analysis methodology that automates problem determination in these environments by 1) coarse-grained tagging of numerous real client requests as they travel through the system and 2) using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault. To validate our methodology, we have implemented Pinpoint, a framework for root cause analysis on the J2EE platform that requires no knowledge of the application components. Pinpoint consists of three parts: a communications layer that traces client requests, a failure detector that uses traffic-sniffing and middleware instrumentation, and a data analysis engine. We evaluate Pinpoint by injecting faults into various application components and show that Pinpoint identifies the faulty components with high accuracy and produces few false-positives.
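The core idea, tagging requests with the components they touch and correlating component use with request failure, can be sketched as follows. The trace data and scoring below are hypothetical and far simpler than Pinpoint's actual clustering-based analysis.

```python
# Hypothetical sketch of Pinpoint's core idea (not its implementation): record
# which components each request touched and whether it failed, then rank
# components by how strongly their use correlates with request failure.

from collections import defaultdict

# (request_id, components_touched, failed) -- made-up trace data
traces = [
    ("r1", {"web", "catalog", "db"},   False),
    ("r2", {"web", "cart", "db"},      True),
    ("r3", {"web", "cart", "db"},      True),
    ("r4", {"web", "catalog", "db"},   False),
    ("r5", {"web", "cart", "payment"}, True),
]

used = defaultdict(int)             # requests that touched the component
used_and_failed = defaultdict(int)  # ...and also failed

for _, components, failed in traces:
    for c in components:
        used[c] += 1
        if failed:
            used_and_failed[c] += 1

overall_failure_rate = sum(failed for _, _, failed in traces) / len(traces)

# Components whose conditional failure rate most exceeds the baseline rank
# first ("cart" and "payment" in this made-up data).
ranking = sorted(used,
                 key=lambda c: used_and_failed[c] / used[c] - overall_failure_rate,
                 reverse=True)
print(ranking)
```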

910 citations

Proceedings ArticleDOI
25 Jun 2006
TL;DR: Analysis of failure data collected at a large high-performance computing site finds that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
Abstract: Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available. This paper analyzes failure data recently made publicly available by one of the largest high-performance computing sites. The data has been collected over the past 9 years at Los Alamos National Laboratory and includes 23,000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find for example that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
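A minimal sketch of the distribution fitting the paper reports, run on synthetic data since the LANL dataset is not reproduced here; a fitted Weibull shape parameter below 1 is what corresponds to a decreasing hazard rate.

```python
# Sketch (synthetic data, not the LANL dataset): fit the two distributions the
# paper reports -- Weibull for time between failures, lognormal for repair
# times. A Weibull shape parameter < 1 means a decreasing hazard rate.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
time_between_failures_h = stats.weibull_min.rvs(0.7, scale=200.0, size=1000, random_state=rng)
repair_times_h = stats.lognorm.rvs(1.0, scale=4.0, size=1000, random_state=rng)

shape, loc, scale = stats.weibull_min.fit(time_between_failures_h, floc=0)
print(f"Weibull shape={shape:.2f} (<1 => decreasing hazard rate), scale={scale:.1f} h")

sigma, loc, scale = stats.lognorm.fit(repair_times_h, floc=0)
print(f"lognormal sigma={sigma:.2f}, median={scale:.1f} h")
```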

676 citations

Proceedings ArticleDOI
28 Jun 2004
TL;DR: The results imply that leveraging a single microarchitecture design for multiple remaps across a few technology generations will become increasingly difficult, and motivate a need for workload-specific, microarchitectural lifetime reliability awareness at an early design stage.
Abstract: The relentless scaling of CMOS technology has provided a steady increase in processor performance for the past three decades. However, increased power densities (hence temperatures) and other scaling effects have an adverse impact on long-term processor lifetime reliability. This paper represents a first attempt at quantifying the impact of scaling on lifetime reliability due to intrinsic hard errors, taking workload characteristics into consideration. For our quantitative evaluation, we use RAMP (Srinivasan et al., 2004), a previously proposed industrial-strength model that provides reliability estimates for a workload, but for a given technology. We extend RAMP by adding scaling specific parameters to enable workload-dependent lifetime reliability evaluation at different technologies. We show that (1) scaling has a significant impact on processor hard failure rates - on average, with SPEC benchmarks, we find the failure rate of a scaled 65nm processor to be 316% higher than a similarly pipelined 180nm processor; (2) time-dependent dielectric breakdown and electromigration have the largest increases; and (3) with scaling, the difference in reliability from running at worst-case vs. typical workload operating conditions increases significantly, as does the difference from running different workloads. Our results imply that leveraging a single microarchitecture design for multiple remaps across a few technology generations will become increasingly difficult, and motivate a need for workload specific, microarchitectural lifetime reliability awareness at an early design stage.
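A hedged sketch of the sum-of-failure-rates style of combination that models in this family use: the total failure rate is the sum of per-mechanism rates, each growing by its own factor as technology scales. The mechanisms and numbers below are illustrative assumptions, not RAMP's calibrated values.

```python
# Illustrative sum-of-failure-rates (SOFR) style combination, the general
# approach RAMP-like models build on: total failure rate is the sum of
# per-mechanism rates, each scaled by its own technology/temperature factor.
# All numbers are made up for illustration; RAMP's calibrated model differs.

def total_fit(per_mechanism_fit, scaling_factor):
    """Total FIT for one technology node.

    per_mechanism_fit -- baseline FIT per wear-out mechanism (e.g. TDDB, EM)
    scaling_factor    -- growth factor for each mechanism at this node
    """
    return sum(fit * scaling_factor[m] for m, fit in per_mechanism_fit.items())

baseline_180nm = {"TDDB": 300.0, "EM": 250.0, "SM": 100.0, "TC": 150.0}

fit_180 = total_fit(baseline_180nm, {"TDDB": 1.0, "EM": 1.0, "SM": 1.0, "TC": 1.0})
fit_65  = total_fit(baseline_180nm, {"TDDB": 6.0, "EM": 5.0, "SM": 2.0, "TC": 2.5})

print(f"65 nm failure rate is {100 * (fit_65 / fit_180 - 1):.0f}% higher than 180 nm")
```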

577 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: BFT-SMART is an open-source Java-based library implementing robust BFT state machine replication with improved reliability, modularity as a first-class property, multicore-awareness, reconfiguration support and a flexible programming interface.
Abstract: The last fifteen years have seen an impressive amount of work on protocols for Byzantine fault-tolerant (BFT) state machine replication (SMR). However, there is still a need for practical and reliable software libraries implementing this technique. BFT-SMART is an open-source Java-based library implementing robust BFT state machine replication. Some of the key features of this library that distinguishes it from similar works (e.g., PBFT and UpRight) are improved reliability, modularity as a first-class property, multicore-awareness, reconfiguration support and a flexible programming interface. When compared to other SMR libraries, BFT-SMART achieves better performance and is able to withstand a number of real-world faults that previous implementations cannot.
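The quorum arithmetic underlying any BFT state machine replication library can be sketched in a few lines. This is generic protocol math, not BFT-SMART's API; the function names below are illustrative.

```python
# General BFT state machine replication arithmetic (illustrative; not
# BFT-SMART's API): with n = 3f + 1 replicas a system tolerates f Byzantine
# replicas, and a client accepts a result once f + 1 identical replies arrive,
# since at least one of them must come from a correct replica.

from collections import Counter

def replicas_needed(f):
    """Minimum number of replicas to tolerate f Byzantine faults in BFT SMR."""
    return 3 * f + 1

def vote(replies, f):
    """Return the reply seen at least f + 1 times, or None if no quorum yet."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= f + 1 else None

f = 1
print(replicas_needed(f))                        # 4 replicas
print(vote(["ok", "ok", "ok", "corrupted"], f))  # 'ok' -- tolerates one bad reply
```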

517 citations

Performance Metrics
No. of papers from the Conference in previous years
Year    Papers
2023    1
2022    53
2021    101
2020    103
2019    93
2018    120