Bianca Schroeder

Researcher at University of Toronto

Publications - 68
Citations - 7218

Bianca Schroeder is an academic researcher at the University of Toronto. The author has contributed to research on the topics Scheduling (computing) and Server. The author has an h-index of 29 and has co-authored 64 publications receiving 6684 citations. Previous affiliations of Bianca Schroeder include Sprint Corporation and Microsoft.

Papers
Proceedings Article

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

TL;DR: In this article, the authors present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites, and find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems.
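
For context on the title's question, the short sketch below converts a datasheet MTTF into the annual failure rate it nominally implies, so that it can be set against the field replacement rates quoted above. It assumes a constant failure rate and continuous (24x7) operation, which is a simplification and not necessarily the paper's own method; the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 power-on hours, assuming continuous operation


def nominal_annual_failure_rate(mttf_hours: float) -> float:
    """Approximate annual failure rate implied by a datasheet MTTF.

    Assumes a constant failure rate, so the expected fraction of drives
    failing per year is roughly hours-per-year divided by the MTTF.
    """
    return HOURS_PER_YEAR / mttf_hours


datasheet_afr = nominal_annual_failure_rate(1_000_000)
print(f"Nominal AFR for an MTTF of 1,000,000 hours: {datasheet_afr:.2%}")

# Field replacement rates quoted in the TL;DR, as fractions per year.
for observed in (0.02, 0.04, 0.13):
    ratio = observed / datasheet_afr
    print(f"Observed {observed:.0%} is about {ratio:.1f}x the nominal rate")
```

Under these assumptions the nominal rate comes out just under 1% per year, which is why replacement rates of 2-4% stand out against the datasheet figure.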
Proceedings Article

DRAM errors in the wild: a large-scale field study

TL;DR: Measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode.
Proceedings Article

A large-scale study of failures in high-performance computing systems

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
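
To illustrate what a "decreasing hazard rate" means for a Weibull distribution, the sketch below evaluates the standard Weibull hazard function h(t) = (k/λ)(t/λ)^(k-1). The hazard decreases over time whenever the shape parameter k is below 1. The shape and scale values used here are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def weibull_hazard(t: float, shape: float, scale: float = 1.0) -> float:
    """Hazard rate of a Weibull distribution at time t > 0.

    h(t) = (shape / scale) * (t / scale) ** (shape - 1)
    shape < 1: decreasing hazard; shape == 1: constant (exponential);
    shape > 1: increasing hazard.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical shape below 1: the longer since the last failure,
# the lower the instantaneous failure rate.
for t in (1.0, 10.0, 100.0):
    print(t, weibull_hazard(t, shape=0.7))

# With shape == 1 the hazard is constant, as in an exponential model.
print(weibull_hazard(5.0, shape=1.0, scale=2.0))
```

With a shape below 1, the printed hazard values fall as t grows, in contrast to the constant hazard of an exponential model.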
Journal Article

A Large-Scale Study of Failures in High-Performance Computing Systems

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
Journal Article

Understanding failures in petascale computers

TL;DR: This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault tolerance for supercomputing.