A Case for Redundant Arrays of Inexpensive Disks (RAID)
David A. Patterson, Garth Gibson, and Randy H. Katz
Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720
Abstract. Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.
1. Background: Rising CPU and Memory Performance
The users of computers are currently enjoying unprecedented growth in the speed of computers. Gordon Bell said that between 1974 and 1984, single chip computers improved in performance by 40% per year, about twice the rate of minicomputers [Bell 84]. In the following year Bill Joy predicted an even faster growth [Joy 85].
Mainframe and supercomputer manufacturers, having difficulty keeping pace with the rapid growth predicted by "Joy's Law," cope by offering multiprocessors as their top-of-the-line product.
But a fast CPU does not a fast system make. Gene Amdahl related CPU speed to main memory size using this rule [Siewiorek 82]:

Each CPU instruction per second requires one byte of main memory.

If computer system costs are not to be dominated by the cost of memory, then Amdahl's constant suggests that memory chip capacity should grow at the same rate. Gordon Moore predicted that growth rate over 20 years ago:

transistors/chip = 2^(Year-1964)

As predicted by Moore's Law, RAMs have quadrupled in capacity every two [Moore 75] to three years [Myers 86].
Recently the ratio of megabytes of main memory to MIPS has been defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In part because of the rapid drop of memory prices, main memory sizes have grown faster than CPU speeds and many machines are shipped today with alphas of 3 or higher.
To maintain the balance of costs in computer systems, secondary storage must match the advances in other parts of the system. A key
measure of magnetic disk technology is the growth in the maximum number of bits that can be stored per square inch, or the bits per inch in a track times the number of tracks per inch. Called M.A.D., for maximal areal density, the "First Law in Disk Density" predicts [Frank 87]:

MAD = 10^((Year-1971)/10)
Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of semiconductor memory, and in practice between 1967 and 1979 the disk capacity of the average IBM data processing system more than kept up with its main memory [Stevens 81].
Capacity is not the only memory characteristic that must grow rapidly to maintain system balance, since the speed with which instructions and data are delivered to a CPU also determines its ultimate performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed automatically to contain a substantial fraction of memory references;
(2) the SRAM technology, used to build caches, whose speed has improved at the rate of 40% to 100% per year.
In contrast to primary memory technologies, the performance of single large expensive magnetic disks (SLED) has improved at a modest rate. These mechanical devices are dominated by the seek and the rotation delays: from 1971 to 1981, the raw seek time for a high-end IBM disk improved by only a factor of two while the rotation time did not change [Harker 81]. Greater density means a higher transfer rate when the information is found, and extra heads can reduce the average seek time, but the raw seek time only improved at a rate of 7% per year. There is no reason to expect a faster rate in the near future.
To maintain balance, computer systems have been using even larger main memories or solid state disks to buffer some of the I/O activity. This may be a fine solution for applications whose I/O activity has locality of reference and for which volatility is not an issue, but applications dominated by a high rate of random requests for small pieces of data (such as transaction-processing) or by a low number of requests for massive amounts of data (such as large simulations running on supercomputers) are facing a serious performance limitation.
2. The Pending I/O Crisis
What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law [Amdahl 67]:

S = 1 / ((1 - f) + f/k)

where
S = the effective speedup,
f = fraction of work in faster mode, and
k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster--according to Bill Joy, in just over three years--Amdahl's Law predicts effective speedup will be only 5X. When we have computers 100X faster--via evolution of uniprocessors or by multiprocessors--this application will be less than 10X faster, wasting 90% of the potential speedup.
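To make the arithmetic concrete, the following short Python sketch (ours, not the paper's; the function name and the f = 0.9 assumption for the non-I/O fraction are illustrative) reproduces the 5X and under-10X figures:

def effective_speedup(f, k):
    # Amdahl's Law: f is the fraction of work that is sped up, k is its speedup.
    return 1.0 / ((1.0 - f) + f / k)

# 10% of time in I/O leaves f = 0.9 of the work to be sped up.
print(effective_speedup(0.9, 10))    # ~5.3: only about 5X with a 10X faster CPU
print(effective_speedup(0.9, 100))   # ~9.2: less than 10X with a 100X faster CPU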

While we can imagine improvements in software file systems via buffering for near term I/O demands, we need innovation to avoid an I/O crisis [Boral 83].
3. A Solution: Arrays of Inexpensive Disks
Rapid improvements in capacity of large disks have not been the only target of disk designers, since personal computers have created a market for inexpensive magnetic disks. These lower cost disks have lower performance as well as less capacity. Table I below compares the top-of-the-line IBM 3380 model AK4 mainframe disk, the Fujitsu M2361A "Super Eagle" minicomputer disk, and the Conner Peripherals CP 3100 personal computer disk.
Characteristics                  IBM       Fujitsu   Conner    3380 v.   2361 v.
                                 3380      M2361A    CP3100    3100      3100
                                                               (>1 means 3100 is better)
Disk diameter (inches)           14        10.5      3.5       4         3
Formatted Data Capacity (MB)     7500      600       100       .01       .2
Price/MB (controller incl.)      $18-$10   $20-$17   $10-$7    1-2.5     1.7-3
MTTF Rated (hours)               30,000    20,000    30,000    1         1.5
MTTF in practice (hours)         100,000   ?         ?         ?         ?
No. Actuators                    4         1         1         .2        1
Maximum I/O's/second/Actuator    50        40        30        .6        .8
Typical I/O's/second/Actuator    30        24        20        .7        .8
Maximum I/O's/second/box         200       40        30        .2        .8
Typical I/O's/second/box         120       24        20        .2        .8
Transfer Rate (MB/sec)           3         2.5       1         .3        .4
Power/box (W)                    6,600     640       10        660       64
Volume (cu. ft.)                 24        3.4       .03       800       110
Table I. Comparison of the IBM 3380 disk model AK4 for mainframe computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers, and the Conner Peripherals CP 3100 disk for personal computers. By "Maximum I/O's/second" we mean the maximum number of average seeks and average rotates for a single sector access. Cost and reliability information on the 3380 comes from widespread experience [IBM 87] [Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu 87], while some numbers on the new CP3100 are based on speculation. The price per megabyte is given as a range to allow for different prices for volume discount and different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply, since the other drives contain their own power supplies.)
One surprising fact is that the number of I/Os per second per actuator in an inexpensive disk is within a factor of two of the large disks. In several of the remaining metrics, including price per megabyte, the inexpensive disk is superior or equal to the large disks.
The small size and low power are even more impressive since disks such as the CP3100 contain full track buffers and most functions of the traditional mainframe controller. Small disk manufacturers can provide such functions in high volume disks because of the efforts of standards committees in defining higher level peripheral interfaces, such as the ANSI X3.131-1986 Small Computer System Interface (SCSI). Such standards have encouraged companies like Adaptec to offer SCSI interfaces as single chips, in turn allowing disk companies to embed mainframe controller functions at low cost. Figure 1 compares the traditional mainframe disk approach and the small computer disk approach. The same SCSI interface chip embedded as a controller in every disk can also be used as the direct memory access (DMA) device at the other end of the SCSI bus.
Such characteristics lead to our proposal for building I/O systems as arrays of inexpensive disks, either interleaved for the large transfers of supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many small transfers of transaction processing. Using the information in Table I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the IBM 3380 and the same capacity, with lower power consumption and cost.
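As a rough back-of-the-envelope check of that claim (our own sketch; the figures are simply read from Table I and the variable names are ours):

# CP3100 vs. IBM 3380, using the Table I numbers.
cp3100_capacity_mb, cp3100_max_ios = 100, 30
ibm3380_capacity_mb, ibm3380_max_ios = 7500, 200

n_disks = 75
print(n_disks * cp3100_capacity_mb)                    # 7500 MB: same capacity as one 3380
print(n_disks * cp3100_max_ios / ibm3380_max_ios)      # ~11.3: roughly "12 times" the I/O rate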
4. Caveats
We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on fundamental estimates of price-performance and reliability. Our reasoning is that if there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize a transaction-processing workload to evaluate performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper's scope.
[Figure 1: block diagrams of the Mainframe and Small Computer disk interface organizations.]
Figure 1. Comparison of organizations for typical mainframe and small computer disk interfaces. Single chip SCSI interfaces such as the Adaptec AIC-6250 allow the small computer to use a single chip to be the DMA interface as well as provide an embedded controller for each disk [Adaptec 87]. (The price per megabyte in Table I includes everything in the shaded boxes above.)
5. And Now The Bad News: Reliability
The unreliability of disks forces computer systems managers to make backup versions of information quite frequently in case of failure. What would be the impact on reliability of having a hundredfold increase in disks? Assuming a constant failure rate--that is, an exponentially distributed time to failure--and that failures are independent--both assumptions made by disk manufacturers when calculating the Mean Time To Failure (MTTF)--the reliability of an array of disks is:

MTTF of a Disk Array = (MTTF of a Single Disk) / (Number of Disks in the Array)
Using the information in Table I, the MTTF of 100 CP 3100 disks is 30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000 hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider scaling the array to 1000 disks, then the MTTF is 30 hours or about one day, requiring an adjective worse than dismal.
Without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful.
6. A Better Solution: RAID
To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails. Our acronym for these Redundant Arrays of Inexpensive Disks is RAID. To simplify the explanation of our final proposal and to avoid confusion with previous work, we give a taxonomy of five different organizations of disk arrays, beginning with mirrored disks and progressing through a variety of alternatives with differing performance and reliability. We refer to each organization as a RAID level. The reader should be forewarned that we describe all levels as if implemented in hardware solely to simplify the presentation, for RAID ideas are applicable to software implementations as well as hardware.
Reliability. Our basic approach will be to break the arrays into reliability groups, with each group having extra "check" disks containing redundant information. When a disk fails we assume that within a short time the failed disk will be replaced and the information will be
reconstructed on to the new disk using the redundant information. This time is called the mean time to repair (MTTR). The MTTR can be reduced if the system includes extra disks to act as "hot" standby spares; when a disk fails, a replacement disk is switched in electronically. Periodically a human operator replaces all failed disks. Here are other terms that we use:
D = total number of disks with data (not including extra check disks),
G = number of data disks in a group (not including extra check disks),
C = number of check disks in a group,
nG = D/G = number of groups.
As mentioned above, we make the same assumptions that disk manufacturers make--that failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.) Since these reliability predictions will be very high, we want to emphasize that the reliability is only of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTFs are "overkill"--for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20 year old disks?
The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is:

MTTF_Group = (MTTF_Disk / (G + C)) * (1 / Probability of another failure in a group before repairing the dead disk)

As more formally derived in the appendix, the probability of a second failure before the first has been repaired is:

Probability of Another Failure = MTTR / (MTTF_Disk / (No. Disks - 1)) = MTTR / (MTTF_Disk / (G + C - 1))

The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is:

(X * MTTR) / (MTTF of remaining disks in the group)

The average number of second failures for a single disk is then:

MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.
The second step is the reliability of the whole system, which is approximately (since MTTF_Group is not quite distributed exponentially):

MTTF_RAID = MTTF_Group / nG

Plugging it all together, we get:

MTTF_RAID = (MTTF_Disk / (G + C)) * (MTTF_Disk / ((G + C - 1) * MTTR)) * (1 / nG)
          = (MTTF_Disk)^2 / ((G + C) * nG * (G + C - 1) * MTTR)
Since the formula is the same for each level, we make the abstract numbers concrete using these parameters as appropriate: D = 100 total data disks, G = 10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1 hour, with the check disks per group C determined by the RAID level.
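As a cross-check of the formula, the following Python sketch (ours; the function name is illustrative) reproduces the MTTF figures quoted in Tables II through IV for these parameters:

def mttf_raid(mttf_disk, mttr, d, g, c):
    # (MTTF_Disk)^2 / ((G+C) * nG * (G+C-1) * MTTR), with nG = D/G groups.
    n_groups = d / g
    return mttf_disk**2 / ((g + c) * n_groups * (g + c - 1) * mttr)

print(mttf_raid(30_000, 1, 100, 1, 1))     # Level 1 (mirroring): ~4,500,000 hours
print(mttf_raid(30_000, 1, 100, 10, 4))    # Level 2, G=10, C=4:  ~494,500 hours
print(mttf_raid(30_000, 1, 100, 10, 1))    # Level 3, G=10, C=1:  ~818,000 hours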
Reliability Overhead Cost. This is simply the extra check disks, expressed as a percentage of the number of data disks D. As we shall see below, the cost varies with RAID level from 100% down to 4%.
Useable Storage Capacity Percentage. Another way to express this reliability overhead is in terms of the percentage of the total capacity of data disks and check disks that can be used to store data. Depending on the organization, this varies from a low of 50% to a high of 96%.
Performance. Since supercomputer applications and transaction-processing systems have different access patterns and rates, we need different metrics to evaluate both. For supercomputers we count the number of reads and writes per second for large blocks of data, with large defined as getting at least one sector from each data disk in a group. During large transfers all the disks in a group act as a single unit, each reading or writing a portion of the large data block in parallel.
A better measure for transaction-processing systems is the number of individual reads or writes per second. Since transaction-processing systems (e.g., debits/credits) use a read-modify-write sequence of disk accesses, we include that metric as well. Ideally during small transfers each disk in a group can act independently, either reading or writing independent information. In summary, supercomputer applications need a high data rate while transaction-processing needs a high I/O rate.
For both the large and small transfer calculations we assume the minimum user request is a sector, that a sector is small relative to a track, and that there is enough work to keep every device busy. Thus sector size affects both disk storage efficiency and transfer size. Figure 2 shows the ideal operation of large and small disk accesses in a RAID.
[Figure 2: (a) Single Large or "Grouped" Read (1 read spread over G disks); (b) Several Small or Individual Reads and Writes (G reads and/or writes spread over G disks).]
Figure 2. Large transfers vs. small transfers in a group of G disks.
The six performance metrics are then the number of reads, writes, and read-modify-writes per second for both large (grouped) or small (individual) transfers. Rather than give absolute numbers for each metric, we calculate efficiency: the number of events per second for a RAID relative to the corresponding events per second for a single disk. (This is Boral's I/O bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this paper we are after fundamental differences, so we use simple, deterministic throughput measures for our performance metric rather than latency.
Effective Performance Per Disk. The cost of disks can be a large portion of the cost of a database system, so the I/O performance per disk--factoring in the overhead of the check disks--suggests the cost/performance of a system. This is the bottom line for a RAID.
7. First Level RAID: Mirrored Disks
Mirrored disks are a traditional approach for improving reliability of magnetic disks. This is the most expensive option we consider since all disks are duplicated (G=1 and C=1), and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table II shows the metrics for a Level 1 RAID assuming this optimization.
MTTF                           Exceeds Useful Product Lifetime
                               (4,500,000 hrs or > 500 years)
Total Number of Disks          2D
Overhead Cost                  100%
Useable Storage Capacity       50%

Events/Sec vs. Single Disk     Full RAID    Efficiency Per Disk
Large (or Grouped) Reads       2D/S         1.00/S
Large (or Grouped) Writes      D/S          .50/S
Large (or Grouped) R-M-W       4D/3S        .67/S
Small (or Individual) Reads    2D           1.00
Small (or Individual) Writes   D            .50
Small (or Individual) R-M-W    4D/3         .67
Table II. Characteristics of a Level 1 RAID. Here we assume that writes are not slowed by waiting for the second write to complete because the slowdown for writing 2 disks is minor compared to the slowdown S for writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme with extra disks that are invisible to the software, we assume an optimized scheme with twice as many controllers allowing parallel reads to all disks, giving full disk bandwidth for large reads and allowing the reads of read-modify-writes to occur in parallel.
When individual accesses are distributed across multiple disks, average queueing, seek, and rotate delays may differ from the single disk case. Although bandwidth may be unchanged, it is distributed more evenly, reducing variance in queueing delay and, if the disk load is not too high, also reducing the expected queueing delay through parallelism [Livny 87]. When many arms seek to the same track then rotate to the described sector, the average seek and rotate time will be larger than the average for a single disk, tending toward the worst case times. This effect should not generally more than double the average access time to a single sector while still getting many sectors in parallel. In the special case of mirrored disks with sufficient controllers, the choice between arms that can read any data sector will reduce the time for the average read seek by up to 45% [Bitton 88].
To allow for these factors but to retain our fundamental emphasis we apply a slowdown factor, S, when there are more than two disks in a group. In general, 1 <= S < 2 whenever groups of disks work in parallel. With synchronous disks the spindles of all disks in the group are synchronous so that the corresponding sectors of a group of disks pass under the heads simultaneously [Kurzweil 88], so for synchronous disks there is no slowdown and S = 1. Since a Level 1 RAID has only one data disk in its group, we assume that the large transfer requires the same number of disks acting in concert as found in groups of the higher level RAIDs: 10 to 25 disks.
Duplicating all disks can mean doubling the cost of the database system or using only 50% of the disk storage capacity. Such largess inspires the next levels of RAID.
8. Second Level RAID: Hamming Code for ECC
The history of main memory organizations suggests a way to reduce the cost of reliability. With the introduction of 4K and 16K DRAMs, computer designers discovered that these new devices were subject to losing information due to alpha particles. Since there were many single bit DRAMs in a system and since they were usually accessed in groups of 16 to 64 chips at a time, system designers added redundant chips to correct single errors and to detect double errors in each group. This increased the number of memory chips by 12% to 38%--depending on the size of the group--but it significantly improved reliability.
As long as all the data bits in a group are read or written together, there is no impact on performance. However, reads of less than the group size require reading the whole group to be sure the information is correct, and writes to a portion of the group mean three steps:
1) a read step to get all the rest of the data;
2) a modify step to merge the new and old information;
3) a write step to write the full group, including check information.
Since we have scores of disks in a RAID and since some accesses are to groups of disks, we can mimic the DRAM solution by bit-interleaving the data across the disks of a group and then adding enough check disks to detect and correct a single error. A single parity disk can detect a single error, but to correct an error we need enough check disks to identify the disk with the error. For a group size of 10 data disks (G) we need 4 check disks (C) in total, and if G = 25 then C = 5 [Hamming 50]. To keep down the cost of redundancy, we assume the group size will vary from 10 to 25.
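The check-disk counts just quoted follow from the usual single-error-correcting Hamming bound; the following small Python sketch (ours, for illustration) recovers C = 4 for G = 10 and C = 5 for G = 25:

def check_disks_needed(g):
    # Smallest C such that 2^C >= G + C + 1: enough syndromes to name any failed
    # disk among the G data and C check disks, or to signal "no failure".
    c = 1
    while 2**c < g + c + 1:
        c += 1
    return c

print(check_disks_needed(10))   # 4
print(check_disks_needed(25))   # 5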
Since our individual data transfer unit is just a sector, bit-interleaved disks mean that a large transfer for this RAID must be at least G sectors. Like DRAMs, reads of a smaller amount imply reading a full sector from each of the bit-interleaved disks in a group, and writes of a single unit involve the read-modify-write cycle to all the disks. Table III shows the metrics of this Level 2 RAID.
MTTF                      Exceeds Useful Lifetime
                          G=10                  G=25
                          (494,500 hrs          (103,500 hrs
                          or >50 years)         or 12 years)
Total Number of Disks     1.40D                 1.20D
Overhead Cost             40%                   20%
Useable Storage Capacity  71%                   83%

Events/Sec        Full RAID   Efficiency Per Disk    Efficiency Per Disk
(vs. Single Disk)             L2        L2/L1        L2        L2/L1
Large Reads       D/S         .71/S     71%          .86/S     86%
Large Writes      D/S         .71/S     143%         .86/S     172%
Large R-M-W       D/S         .71/S     107%         .86/S     129%
Small Reads       D/SG        .07/S     6%           .03/S     3%
Small Writes      D/2SG       .04/S     6%           .02/S     3%
Small R-M-W       D/SG        .07/S     9%           .03/S     4%
Table III. Characteristics of a Level 2 RAID. The L2/L1 column gives the % performance of level 2 in terms of level 1 (>100% means L2 is faster). As long as the transfer unit is large enough to spread over all the data disks of a group, the large I/Os get the full bandwidth of each disk, divided by S to allow all disks in a group to complete. Level 1 large reads are faster because data is duplicated and so the redundancy disks can also do independent accesses. Small I/Os still require accessing all the disks in a group, so only D/G small I/Os can happen at a time, again divided by S to allow a group of disks to finish. Small Level 2 writes are like small R-M-W because full sectors must be read before new data can be written onto part of each sector.
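The "Efficiency Per Disk" entries above are simply the full-RAID rate divided by the total number of disks, data plus check; a minimal sketch of that bookkeeping (ours; the slowdown factor S is common to both sides and is factored out) for the G=10, C=4 column:

def per_disk_efficiency(full_raid_rate, d, g, c):
    # Events per second per disk, counting both data and check disks.
    total_disks = d * (g + c) / g            # 1.40D for G=10, C=4
    return full_raid_rate / total_disks

D = 100
print(per_disk_efficiency(D, D, 10, 4))       # large reads, D/S overall  -> ~0.71/S per disk
print(per_disk_efficiency(D / 10, D, 10, 4))  # small reads, D/SG overall -> ~0.07/S per disk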
For large writes, the level 2 system has the same performance as level 1 even though it uses fewer check disks, and so on a per disk basis it outperforms level 1. For small data transfers the performance is dismal either for the whole system or per disk; all the disks of a group must be accessed for a small transfer, limiting the maximum number of simultaneous accesses to D/G. We also include the slowdown factor S since the access must wait for all the disks to complete.
Thus level 2 RAID is desirable for supercomputers but inappropriate for transaction processing systems, with increasing group size increasing the disparity in performance per disk for the two applications. In recognition of this fact, Thinking Machines Incorporated announced a Level 2 RAID this year for its Connection Machine supercomputer called the "Data Vault," with G = 32 and C = 8, including one hot standby spare [Hillis 87].
Before improving small data transfers, we concentrate once more on lowering the cost.
9. Third Level RAID: Single Check Disk Per Group
Most check disks in the level 2 RAID are used to determine which disk failed, for only one redundant parity disk is needed to detect an error. These extra disks are truly "redundant" since most disk controllers can already detect if a disk failed either through special signals provided in the disk interface or the extra checking information at the end of a sector used to detect and correct soft errors. So information on the failed disk can be reconstructed by calculating the parity of the remaining good disks and then comparing bit-by-bit to the parity calculated for the original full group. When these two parities agree, the failed bit was a 0; otherwise it was a 1. If the check disk is the failure, just read all the data disks and store the group parity in the replacement disk.
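The reconstruction just described is an exclusive OR over the surviving disks; the following Python sketch (ours, using toy 4-byte "sectors" rather than real disk sectors) illustrates it for a small level 3 group:

from functools import reduce

def parity(sectors):
    # Bytewise XOR of equal-length byte strings.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), sectors)

# Three data sectors plus one check (parity) sector.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\x00\xff\x0f"]
check = parity(data)

# If data disk 1 fails, XORing the remaining data sectors with the check sector
# reconstructs its contents bit-by-bit.
recovered = parity([data[0], data[2], check])
assert recovered == data[1]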
Reducing the check disks to one per group (C=1) reduces the overhead cost to between 4% and 10% for the group sizes considered here. The performance for the third level RAID system is the same as the Level 2 RAID, but the effective performance per disk increases since it needs fewer check disks. This reduction in total disks also increases reliability, but since it is still larger than the useful lifetime of disks, this is a minor point. One advantage of a level 2 system over level 3 is that the extra check information associated with each sector to correct soft errors is not needed, increasing the capacity per disk by perhaps 10%. Level 2 also allows all soft errors to be corrected "on the fly" without having to reread a sector. Table IV summarizes the third level RAID characteristics and Figure 3 compares the sector layout and check disks for levels 2 and 3.
MTTF                      Exceeds Useful Lifetime
                          G=10                  G=25
                          (820,000 hrs          (346,000 hrs
                          or >90 years)         or 40 years)
Total Number of Disks     1.10D                 1.04D
Overhead Cost             10%                   4%
Useable Storage Capacity  91%                   96%

Events/Sec        Full RAID   Efficiency Per Disk           Efficiency Per Disk
(vs. Single Disk)             L3       L3/L2    L3/L1       L3       L3/L2    L3/L1
Large Reads       D/S         .91/S    127%     91%         .96/S    112%     96%
Large Writes      D/S         .91/S    121%     182%        .96/S    112%     192%
Large R-M-W       D/S         .91/S    127%     136%        .96/S    112%     142%
Small Reads       D/SG        .09/S    127%     8%          .04/S    112%     3%
Small Writes      D/2SG       .05/S    127%     8%          .02/S    112%     3%
Small R-M-W       D/SG        .09/S    127%     11%         .04/S    112%     5%

Table IV. Characteristics of a Level 3 RAID. The L3/L2 column gives the % performance of L3 in terms of L2 and the L3/L1 column gives it in terms of L1 (>100% means L3 is faster). The performance for the full systems is the same in RAID levels 2 and 3, but since there are fewer check disks the performance per disk improves.
Park and Balasubramanian proposed a third level RAID system without suggesting a particular application [Park 86]. Our calculations suggest it is a much better match to supercomputer applications than to transaction processing systems. This year two disk manufacturers have announced level 3 RAIDs for such applications using synchronized 5.25 inch disks with G=4 and C=1: one from Maxtor and one from Micropolis [Maginnis 87].
This third level has brought the reliability overhead cost to its lowest level, so in the last two levels we improve performance of small accesses without changing cost or reliability.
10. Fourth Level RAID: Independent Reads/Writes
Spreading a transfer across all disks within the group has the following advantage:
•  Large or grouped transfer time is reduced because the transfer bandwidth of the entire array can be exploited.
But it has the following disadvantages as well:
•  Reading/writing to a disk in a group requires reading/writing to all the disks in a group; levels 2 and 3 RAIDs can perform only one I/O at a time per group.
•  If the disks are not synchronized, you do not see average seek and rotational delays; the observed delays should move towards the worst case, hence the S factor in the equations above.
This fourth level RAID improves performance of small transfers through parallelism--the ability to do more than one I/O per group at a time. We no longer spread the individual transfer information across several disks, but keep each individual unit in a single disk.
The virtue of bit-interleaving is the easy calculation of the Hamming code needed to detect or correct errors in level 2. But recall that in the third level RAID we rely on the disk controller to detect errors within a single disk sector. Hence, if we store an individual transfer unit in a single sector, we can detect errors on an individual read without accessing any other disk. Figure 3 shows the different ways the information is stored in a sector for RAID levels 2, 3, and 4. By storing a whole transfer unit in a sector, reads can be independent and operate at the maximum rate of a disk yet still detect errors. Thus the primary change between level 3 and 4 is that we interleave data between disks at the sector level rather than at the bit level.
[Figure 3 diagram: for four transfer units a, b, c, and d — Level 2 bit-interleaves each transfer unit across the data disks with several check disks; Level 3 has only one check disk, with check info calculated over each transfer unit; Level 4 places each transfer unit into a single sector, with check info calculated over a piece of each transfer unit.]
Figure 3. Comparison of location of data and check information in sectors for RAID levels 2, 3, and 4 for G=4. Not shown is the small amount of check information per sector added by the disk controller to detect and correct soft errors within a sector. Remember that we use physical sector numbers and hardware control to explain these ideas, but RAID can be implemented in software using logical sectors and disks.
At first thought you might expect that an individual write to a single sector still involves all the disks in a group since (1) the check disk must be rewritten with the new parity data, and (2) the rest of the data disks must be read to be able to calculate the new parity data. Recall that each parity bit is just a single exclusive OR of all the corresponding data bits in a group. In level 4 RAID, unlike level 3, the parity calculation is much simpler since, if we know the old data value and the old parity value as well as the new data value, we can calculate the new parity information as follows:

new parity = (old data xor new data) xor old parity

In level 4 a small write then uses 2 disks to perform 4 accesses--2 reads and 2 writes--while a small read involves only one read on one disk.
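A tiny Python sketch of that update rule (ours, on toy byte strings rather than real sectors):

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = b"\x0f\x33"
new_data   = b"\xf0\x34"
old_parity = b"\x5a\xc3"   # parity over the whole group, including old_data

# Level 4 small write: read old data and old parity, write new data and new parity
# (2 disks, 4 accesses), without touching the other data disks in the group.
new_parity = xor_bytes(xor_bytes(old_data, new_data), old_parity)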
Table V summarizes the fourth level RAID characteristics. Note that all small accesses improve--dramatically for the reads--but the small read-modify-write is still so slow relative to a level 1 RAID that its applicability to transaction processing is doubtful. Recently Salem and Garcia-Molina proposed a Level 4 system [Salem 86].
Before proceeding to the next level we need to explain the performance of small writes in Table V (and hence small read-modify-writes, since they entail the same operations in this RAID). The formula for the small writes divides D by 2 instead of 4 because 2
References
[Hamming 50] R. W. Hamming, "Error detecting and error correcting codes."
[Amdahl 67] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities."
[Moore 75] G. E. Moore, "Progress in digital integrated electronics."
[Bitton 88] D. Bitton and J. Gray, "Disk shadowing."
[Salem 86] K. Salem and H. Garcia-Molina, "Disk striping."