A Case for Redundant Arrays of Inexpensive Disks (RAID)
David A. Patterson, Garth Gibson, and Randy H. Katz
Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720
Abstract. Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.
1. Background: Rising CPU and Memory Performance
The users of computers are currently enjoying unprecedented growth in the speed of computers. Gordon Bell said that between 1974 and 1984, single chip computers improved in performance by 40% per year, about twice the rate of minicomputers [Bell 84]. In the following year Bill Joy predicted an even faster growth [Joy 85].
Mainframe and supercomputer manufacturers, having difficulty keeping pace with the rapid growth predicted by "Joy's Law," cope by offering multiprocessors as their top-of-the-line product.
But a fast CPU does not a fast system make. Gene Amdahl related CPU speed to main memory size using this rule [Siewiorek 82]:

Each CPU instruction per second requires one byte of main memory.

If computer system costs are not to be dominated by the cost of memory, then Amdahl's constant suggests that memory chip capacity should grow at the same rate. Gordon Moore predicted that growth rate over 20 years ago:

transistors/chip = 2^(Year-1964)

As predicted by Moore's Law, RAMs have quadrupled in capacity every two [Moore 75] to three years [Myers 86].
Recently the ratio of megabytes of main memory to MIPS has been defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In part because of the rapid drop of memory prices, main memory sizes have grown faster than CPU speeds and many machines are shipped today with alphas of 3 or higher.
To maintain the balance of costs in computer systems, secondary storage must match the advances in other parts of the system. A key
measure of magnetic disk technology is the growth in the maximum number of bits that can be stored per square inch, or the bits per inch in a track times the number of tracks per inch. Called M.A.D., for maximal areal density, the "First Law in Disk Density" predicts [Frank 87]:

MAD = 10^((Year-1971)/10)
Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of semiconductor memory, and in practice between 1967 and 1979 the disk capacity of the average IBM data processing system more than kept up with its main memory [Stevens 81].
Capacity is not the only memory characteristic that must grow rapidly to maintain system balance, since the speed with which instructions and data are delivered to a CPU also determines its ultimate performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed automatically to contain a substantial fraction of memory references;
(2) the SRAM technology, used to build caches, whose speed has improved at the rate of 40% to 100% per year.
In contrast to primary memory technologies, the performance of single large expensive magnetic disks (SLED) has improved at a modest rate. These mechanical devices are dominated by the seek and the rotation delays: from 1971 to 1981, the raw seek time for a high-end IBM disk improved by only a factor of two while the rotation time did not change [Harker 81]. Greater density means a higher transfer rate when the information is found, and extra heads can reduce the average seek time, but the raw seek time only improved at a rate of 7% per year. There is no reason to expect a faster rate in the near future.
To maintain balance, computer systems have been using even larger main memories or solid state disks to buffer some of the I/O activity. This may be a fine solution for applications whose I/O activity has locality of reference and for which volatility is not an issue, but applications dominated by a high rate of random requests for small pieces of data (such as transaction-processing) or by a low number of requests for massive amounts of data (such as large simulations running on supercomputers) are facing a serious performance limitation.
2. The Pending I/O Crisis
What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law [Amdahl 67]:

S = 1 / ((1 - f) + f/k)

where
S = the effective speedup,
f = fraction of work in faster mode, and
k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster--according to Bill Joy, in just over three years--Amdahl's Law predicts effective speedup will be only 5X. When we have computers 100X faster--via evolution of uniprocessors or by multiprocessors--this application will be less than 10X faster, wasting 90% of the potential speedup.
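To make the arithmetic concrete, the following short Python sketch (ours, not the paper's; the function name and the f = 0.9 assumption for the non-I/O fraction are illustrative) reproduces the 5X and under-10X figures:

def effective_speedup(f, k):
    # Amdahl's Law: f is the fraction of work that is sped up, k is its speedup.
    return 1.0 / ((1.0 - f) + f / k)

# 10% of time in I/O leaves f = 0.9 of the work to be sped up.
print(effective_speedup(0.9, 10))    # ~5.3: only about 5X with a 10X faster CPU
print(effective_speedup(0.9, 100))   # ~9.2: less than 10X with a 100X faster CPU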

While we can imagine improvements in software file systems via buffering for near term I/O demands, we need innovation to avoid an I/O crisis [Boral 83].
3. A Solution: Arrays of Inexpensive Disks
Rapid improvements in capacity of large disks have not been the only target of disk designers, since personal computers have created a market for inexpensive magnetic disks. These lower cost disks have lower performance as well as less capacity. Table I below compares the top-of-the-line IBM 3380 model AK4 mainframe disk, the Fujitsu M2361A "Super Eagle" minicomputer disk, and the Conner Peripherals CP 3100 personal computer disk.
Characteristics                  IBM       Fujitsu   Conner    3380 v.   2361 v.
                                 3380      M2361A    CP3100    3100      3100
                                                               (>1 means 3100 is better)
Disk diameter (inches)           14        10.5      3.5       4         3
Formatted Data Capacity (MB)     7500      600       100       .01       .2
Price/MB (controller incl.)      $18-$10   $20-$17   $10-$7    1-2.5     1.7-3
MTTF Rated (hours)               30,000    20,000    30,000    1         1.5
MTTF in practice (hours)         100,000   ?         ?         ?         ?
No. Actuators                    4         1         1         .2        1
Maximum I/O's/second/Actuator    50        40        30        .6        .8
Typical I/O's/second/Actuator    30        24        20        .7        .8
Maximum I/O's/second/box         200       40        30        .2        .8
Typical I/O's/second/box         120       24        20        .2        .8
Transfer Rate (MB/sec)           3         2.5       1         .3        .4
Power/box (W)                    6,600     640       10        660       64
Volume (cu. ft.)                 24        3.4       .03       800       110
Table I. Comparison of the IBM 3380 disk model AK4 for mainframe computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers, and the Conner Peripherals CP 3100 disk for personal computers. By "Maximum I/O's/second" we mean the maximum number of average seeks and average rotates for a single sector access. Cost and reliability information on the 3380 comes from widespread experience [IBM 87] [Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu 87], while some numbers on the new CP3100 are based on speculation. The price per megabyte is given as a range to allow for different prices for volume discount and different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply, since the other drives contain their own power supplies.)
One surprising fact is that the number of I/Os per second per actuator in an inexpensive disk is within a factor of two of the large disks. In several of the remaining metrics, including price per megabyte, the inexpensive disk is superior or equal to the large disks.
The small size and low power are even more impressive since disks such as the CP3100 contain full track buffers and most functions of the traditional mainframe controller. Small disk manufacturers can provide such functions in high volume disks because of the efforts of standards committees in defining higher level peripheral interfaces, such as the ANSI X3.131-1986 Small Computer System Interface (SCSI). Such standards have encouraged companies like Adaptec to offer SCSI interfaces as single chips, in turn allowing disk companies to embed mainframe controller functions at low cost. Figure 1 compares the traditional mainframe disk approach and the small computer disk approach. The same SCSI interface chip embedded as a controller in every disk can also be used as the direct memory access (DMA) device at the other end of the SCSI bus.
Such characteristics lead to our proposal for building I/O systems as arrays of inexpensive disks, either interleaved for the large transfers of supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many small transfers of transaction processing. Using the information in Table I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the IBM 3380 and the same capacity, with lower power consumption and cost.
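As a rough back-of-the-envelope check of that claim (our own sketch; the figures are simply read from Table I and the variable names are ours):

# CP3100 vs. IBM 3380, using the Table I numbers.
cp3100_capacity_mb, cp3100_max_ios = 100, 30
ibm3380_capacity_mb, ibm3380_max_ios = 7500, 200

n_disks = 75
print(n_disks * cp3100_capacity_mb)                    # 7500 MB: same capacity as one 3380
print(n_disks * cp3100_max_ios / ibm3380_max_ios)      # ~11.3: roughly "12 times" the I/O rate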
4. Caveats
We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on fundamental estimates of price-performance and reliability. Our reasoning is that if there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize a transaction-processing workload to evaluate performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper's scope.
[Figure 1: block diagrams of the Mainframe and Small Computer disk interface organizations.]
Figure 1. Comparison of organizations for typical mainframe and small computer disk interfaces. Single chip SCSI interfaces such as the Adaptec AIC-6250 allow the small computer to use a single chip to be the DMA interface as well as provide an embedded controller for each disk [Adaptec 87]. (The price per megabyte in Table I includes everything in the shaded boxes above.)
5. And Now The Bad News: Reliability
The unreliability of disks forces computer systems managers to make backup versions of information quite frequently in case of failure. What would be the impact on reliability of having a hundredfold increase in disks? Assuming a constant failure rate--that is, an exponentially distributed time to failure--and that failures are independent--both assumptions made by disk manufacturers when calculating the Mean Time To Failure (MTTF)--the reliability of an array of disks is:

MTTF of a Disk Array = (MTTF of a Single Disk) / (Number of Disks in the Array)
Using the information in Table I, the MTTF of 100 CP 3100 disks is 30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000 hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider scaling the array to 1000 disks, then the MTTF is 30 hours or about one day, requiring an adjective worse than dismal.
Without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful.
6. A Better Solution: RAID
To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails. Our acronym for these Redundant Arrays of Inexpensive Disks is RAID. To simplify the explanation of our final proposal and to avoid confusion with previous work, we give a taxonomy of five different organizations of disk arrays, beginning with mirrored disks and progressing through a variety of alternatives with differing performance and reliability. We refer to each organization as a RAID level. The reader should be forewarned that we describe all levels as if implemented in hardware solely to simplify the presentation, for RAID ideas are applicable to software implementations as well as hardware.
Reliability. Our basic approach will be to break the arrays into reliability groups, with each group having extra "check" disks containing redundant information. When a disk fails we assume that within a short time the failed disk will be replaced and the information will be
reconstructed on to the new disk using the redundant information. This time is called the mean time to repair (MTTR). The MTTR can be reduced if the system includes extra disks to act as "hot" standby spares; when a disk fails, a replacement disk is switched in electronically. Periodically a human operator replaces all failed disks. Here are other terms that we use:
D = total number of disks with data (not including extra check disks),
G = number of data disks in a group (not including extra check disks),
C = number of check disks in a group,
nG = D/G = number of groups.
As mentioned above, we make the same assumptions that disk manufacturers make--that failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.) Since these reliability predictions will be very high, we want to emphasize that the reliability is only of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTFs are "overkill"--for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20 year old disks?
The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is:

MTTF_Group = (MTTF_Disk / (G + C)) * (1 / Probability of another failure in a group before repairing the dead disk)

As more formally derived in the appendix, the probability of a second failure before the first has been repaired is:

Probability of Another Failure = MTTR / (MTTF_Disk / (No. Disks - 1)) = MTTR / (MTTF_Disk / (G + C - 1))

The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is:

(X * MTTR) / (MTTF of remaining disks in the group)

The average number of second failures for a single disk is then:

MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.
The second step is the reliability of the whole system, which is approximately (since MTTF_Group is not quite distributed exponentially):

MTTF_RAID = MTTF_Group / nG

Plugging it all together, we get:

MTTF_RAID = (MTTF_Disk / (G + C)) * (MTTF_Disk / ((G + C - 1) * MTTR)) * (1 / nG)
          = (MTTF_Disk)^2 / ((G + C) * nG * (G + C - 1) * MTTR)
Since the formula is the same for each level, we make the abstract numbers concrete using these parameters as appropriate: D = 100 total data disks, G = 10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1 hour, with the check disks per group C determined by the RAID level.
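As a cross-check of the formula, the following Python sketch (ours; the function name is illustrative) reproduces the MTTF figures quoted in Tables II through IV for these parameters:

def mttf_raid(mttf_disk, mttr, d, g, c):
    # (MTTF_Disk)^2 / ((G+C) * nG * (G+C-1) * MTTR), with nG = D/G groups.
    n_groups = d / g
    return mttf_disk**2 / ((g + c) * n_groups * (g + c - 1) * mttr)

print(mttf_raid(30_000, 1, 100, 1, 1))     # Level 1 (mirroring): ~4,500,000 hours
print(mttf_raid(30_000, 1, 100, 10, 4))    # Level 2, G=10, C=4:  ~494,500 hours
print(mttf_raid(30_000, 1, 100, 10, 1))    # Level 3, G=10, C=1:  ~818,000 hours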
Reliability Overhead Cost. This is simply the extra check disks, expressed as a percentage of the number of data disks D. As we shall see below, the cost varies with RAID level from 100% down to 4%.
Useable Storage Capacity Percentage. Another way to express this reliability overhead is in terms of the percentage of the total capacity of data disks and check disks that can be used to store data. Depending on the organization, this varies from a low of 50% to a high of 96%.
Performance. Since supercomputer applications and transaction-processing systems have different access patterns and rates, we need different metrics to evaluate both. For supercomputers we count the number of reads and writes per second for large blocks of data, with large defined as getting at least one sector from each data disk in a group. During large transfers all the disks in a group act as a single unit, each reading or writing a portion of the large data block in parallel.
A better measure for transaction-processing systems is the number of individual reads or writes per second. Since transaction-processing systems (e.g., debits/credits) use a read-modify-write sequence of disk accesses, we include that metric as well. Ideally during small transfers each disk in a group can act independently, either reading or writing independent information. In summary, supercomputer applications need a high data rate while transaction-processing needs a high I/O rate.
For both the large and small transfer calculations we assume the minimum user request is a sector, that a sector is small relative to a track, and that there is enough work to keep every device busy. Thus sector size affects both disk storage efficiency and transfer size. Figure 2 shows the ideal operation of large and small disk accesses in a RAID.
[Figure 2: (a) Single Large or "Grouped" Read (1 read spread over G disks); (b) Several Small or Individual Reads and Writes (G reads and/or writes spread over G disks).]
Figure 2. Large transfers vs. small transfers in a group of G disks.
The six performance metrics are then the number of reads, writes, and read-modify-writes per second for both large (grouped) or small (individual) transfers. Rather than give absolute numbers for each metric, we calculate efficiency: the number of events per second for a RAID relative to the corresponding events per second for a single disk. (This is Boral's I/O bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this paper we are after fundamental differences, so we use simple, deterministic throughput measures for our performance metric rather than latency.
Effective Performance Per Disk. The cost of disks can be a large portion of the cost of a database system, so the I/O performance per disk--factoring in the overhead of the check disks--suggests the cost/performance of a system. This is the bottom line for a RAID.
7. First Level RAID: Mirrored Disks
Mirrored disks are a traditional approach for improving reliability of magnetic disks. This is the most expensive option we consider since all disks are duplicated (G=1 and C=1), and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table II shows the metrics for a Level 1 RAID assuming this optimization.
MTTF                           Exceeds Useful Product Lifetime
                               (4,500,000 hrs or > 500 years)
Total Number of Disks          2D
Overhead Cost                  100%
Useable Storage Capacity       50%

Events/Sec vs. Single Disk     Full RAID    Efficiency Per Disk
Large (or Grouped) Reads       2D/S         1.00/S
Large (or Grouped) Writes      D/S          .50/S
Large (or Grouped) R-M-W       4D/3S        .67/S
Small (or Individual) Reads    2D           1.00
Small (or Individual) Writes   D            .50
Small (or Individual) R-M-W    4D/3         .67
Table II. Characteristics of a Level 1 RAID. Here we assume that writes are not slowed by waiting for the second write to complete because the slowdown for writing 2 disks is minor compared to the slowdown S for writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme with extra disks that are invisible to the software, we assume an optimized scheme with twice as many controllers allowing parallel reads to all disks, giving full disk bandwidth for large reads and allowing the reads of read-modify-writes to occur in parallel.
When individual accesses are distributed across multiple disks, average queueing, seek, and rotate delays may differ from the single disk case. Although bandwidth may be unchanged, it is distributed more evenly, reducing variance in queueing delay and, if the disk load is not too high, also reducing the expected queueing delay through parallelism [Livny 87]. When many arms seek to the same track then rotate to the described sector, the average seek and rotate time will be larger than the average for a single disk, tending toward the worst case times. This effect should not generally more than double the average access time to a single sector while still getting many sectors in parallel. In the special case of mirrored disks with sufficient controllers, the choice between arms that can read any data sector will reduce the time for the average read seek by up to 45% [Bitton 88].
To allow for these factors but to retain our fundamental emphasis we apply a slowdown factor, S, when there are more than two disks in a group. In general, 1 <= S < 2 whenever groups of disks work in parallel. With synchronous disks the spindles of all disks in the group are synchronous so that the corresponding sectors of a group of disks pass under the heads simultaneously [Kurzweil 88], so for synchronous disks there is no slowdown and S = 1. Since a Level 1 RAID has only one data disk in its group, we assume that the large transfer requires the same number of disks acting in concert as found in groups of the higher level RAIDs: 10 to 25 disks.
Duplicating all disks can mean doubling the cost of the database system or using only 50% of the disk storage capacity. Such largess inspires the next levels of RAID.
8. Second Level RAID: Hamming Code for ECC
The history of main memory organizations suggests a way to reduce the cost of reliability. With the introduction of 4K and 16K DRAMs, computer designers discovered that these new devices were subject to losing information due to alpha particles. Since there were many single bit DRAMs in a system and since they were usually accessed in groups of 16 to 64 chips at a time, system designers added redundant chips to correct single errors and to detect double errors in each group. This increased the number of memory chips by 12% to 38%--depending on the size of the group--but it significantly improved reliability.
As long as all the data bits in a group are read or written together, there is no impact on performance. However, reads of less than the group size require reading the whole group to be sure the information is correct, and writes to a portion of the group mean three steps:
1) a read step to get all the rest of the data;
2) a modify step to merge the new and old information;
3) a write step to write the full group, including check information.
Since we have scores of disks in a RAID and since some accesses are to groups of disks, we can mimic the DRAM solution by bit-interleaving the data across the disks of a group and then adding enough check disks to detect and correct a single error. A single parity disk can detect a single error, but to correct an error we need enough check disks to identify the disk with the error. For a group size of 10 data disks (G) we need 4 check disks (C) in total, and if G = 25 then C = 5 [Hamming 50]. To keep down the cost of redundancy, we assume the group size will vary from 10 to 25.
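The check-disk counts just quoted follow from the usual single-error-correcting Hamming bound; the following small Python sketch (ours, for illustration) recovers C = 4 for G = 10 and C = 5 for G = 25:

def check_disks_needed(g):
    # Smallest C such that 2^C >= G + C + 1: enough syndromes to name any failed
    # disk among the G data and C check disks, or to signal "no failure".
    c = 1
    while 2**c < g + c + 1:
        c += 1
    return c

print(check_disks_needed(10))   # 4
print(check_disks_needed(25))   # 5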
Since our individual data transfer unit is just a sector, bit-interleaved disks mean that a large transfer for this RAID must be at least G sectors. Like DRAMs, reads of a smaller amount imply reading a full sector from each of the bit-interleaved disks in a group, and writes of a single unit involve the read-modify-write cycle to all the disks. Table III shows the metrics of this Level 2 RAID.
MTTF                      Exceeds Useful Lifetime
                          G=10                  G=25
                          (494,500 hrs          (103,500 hrs
                          or >50 years)         or 12 years)
Total Number of Disks     1.40D                 1.20D
Overhead Cost             40%                   20%
Useable Storage Capacity  71%                   83%

Events/Sec        Full RAID   Efficiency Per Disk    Efficiency Per Disk
(vs. Single Disk)             L2        L2/L1        L2        L2/L1
Large Reads       D/S         .71/S     71%          .86/S     86%
Large Writes      D/S         .71/S     143%         .86/S     172%
Large R-M-W       D/S         .71/S     107%         .86/S     129%
Small Reads       D/SG        .07/S     6%           .03/S     3%
Small Writes      D/2SG       .04/S     6%           .02/S     3%
Small R-M-W       D/SG        .07/S     9%           .03/S     4%
Table III. Characteristics of a Level 2 RAID. The L2/L1 column gives the % performance of level 2 in terms of level 1 (>100% means L2 is faster). As long as the transfer unit is large enough to spread over all the data disks of a group, the large I/Os get the full bandwidth of each disk, divided by S to allow all disks in a group to complete. Level 1 large reads are faster because data is duplicated and so the redundancy disks can also do independent accesses. Small I/Os still require accessing all the disks in a group, so only D/G small I/Os can happen at a time, again divided by S to allow a group of disks to finish. Small Level 2 writes are like small R-M-W because full sectors must be read before new data can be written onto part of each sector.
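The "Efficiency Per Disk" entries above are simply the full-RAID rate divided by the total number of disks, data plus check; a minimal sketch of that bookkeeping (ours; the slowdown factor S is common to both sides and is factored out) for the G=10, C=4 column:

def per_disk_efficiency(full_raid_rate, d, g, c):
    # Events per second per disk, counting both data and check disks.
    total_disks = d * (g + c) / g            # 1.40D for G=10, C=4
    return full_raid_rate / total_disks

D = 100
print(per_disk_efficiency(D, D, 10, 4))       # large reads, D/S overall  -> ~0.71/S per disk
print(per_disk_efficiency(D / 10, D, 10, 4))  # small reads, D/SG overall -> ~0.07/S per disk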
For large writes, the level 2 system has the same performance as level 1 even though it uses fewer check disks, and so on a per disk basis it outperforms level 1. For small data transfers the performance is dismal either for the whole system or per disk; all the disks of a group must be accessed for a small transfer, limiting the maximum number of simultaneous accesses to D/G. We also include the slowdown factor S since the access must wait for all the disks to complete.
Thus level 2 RAID is desirable for supercomputers but inappropriate for transaction processing systems, with increasing group size increasing the disparity in performance per disk for the two applications. In recognition of this fact, Thinking Machines Incorporated announced a Level 2 RAID this year for its Connection Machine supercomputer called the "Data Vault," with G = 32 and C = 8, including one hot standby spare [Hillis 87].
Before improving small data transfers, we concentrate once more on lowering the cost.
9. Third Level RAID: Single Check Disk Per Group
Most check disks in the level 2 RAID are used to determine which disk failed, for only one redundant parity disk is needed to detect an error. These extra disks are truly "redundant" since most disk controllers can already detect if a disk failed either through special signals provided in the disk interface or the extra checking information at the end of a sector used to detect and correct soft errors. So information on the failed disk can be reconstructed by calculating the parity of the remaining good disks and then comparing bit-by-bit to the parity calculated for the original full group. When these two parities agree, the failed bit was a 0; otherwise it was a 1. If the check disk is the failure, just read all the data disks and store the group parity in the replacement disk.
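The reconstruction just described is an exclusive OR over the surviving disks; the following Python sketch (ours, using toy 4-byte "sectors" rather than real disk sectors) illustrates it for a small level 3 group:

from functools import reduce

def parity(sectors):
    # Bytewise XOR of equal-length byte strings.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), sectors)

# Three data sectors plus one check (parity) sector.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\x00\xff\x0f"]
check = parity(data)

# If data disk 1 fails, XORing the remaining data sectors with the check sector
# reconstructs its contents bit-by-bit.
recovered = parity([data[0], data[2], check])
assert recovered == data[1]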
Reducing the check disks to one per group (C=1) reduces the overhead cost to between 4% and 10% for the group sizes considered here. The performance for the third level RAID system is the same as the Level 2 RAID, but the effective performance per disk increases since it needs fewer check disks. This reduction in total disks also increases reliability, but since it is still larger than the useful lifetime of disks, this is a minor point. One advantage of a level 2 system over level 3 is that the extra check information associated with each sector to correct soft errors is not needed, increasing the capacity per disk by perhaps 10%. Level 2 also allows all soft errors to be corrected "on the fly" without having to reread a sector. Table IV summarizes the third level RAID characteristics and Figure 3 compares the sector layout and check disks for levels 2 and 3.
MTTF                      Exceeds Useful Lifetime
                          G=10                  G=25
                          (820,000 hrs          (346,000 hrs
                          or >90 years)         or 40 years)
Total Number of Disks     1.10D                 1.04D
Overhead Cost             10%                   4%
Useable Storage Capacity  91%                   96%

Events/Sec        Full RAID   Efficiency Per Disk           Efficiency Per Disk
(vs. Single Disk)             L3       L3/L2    L3/L1       L3       L3/L2    L3/L1
Large Reads       D/S         .91/S    127%     91%         .96/S    112%     96%
Large Writes      D/S         .91/S    121%     182%        .96/S    112%     192%
Large R-M-W       D/S         .91/S    127%     136%        .96/S    112%     142%
Small Reads       D/SG        .09/S    127%     8%          .04/S    112%     3%
Small Writes      D/2SG       .05/S    127%     8%          .02/S    112%     3%
Small R-M-W       D/SG        .09/S    127%     11%         .04/S    112%     5%

Table IV. Characteristics of a Level 3 RAID. The L3/L2 column gives the % performance of L3 in terms of L2 and the L3/L1 column gives it in terms of L1 (>100% means L3 is faster). The performance for the full systems is the same in RAID levels 2 and 3, but since there are fewer check disks the performance per disk improves.
Park and Balasubramanian proposed a third level RAID system without suggesting a particular application [Park 86]. Our calculations suggest it is a much better match to supercomputer applications than to transaction processing systems. This year two disk manufacturers have announced level 3 RAIDs for such applications using synchronized 5.25 inch disks with G=4 and C=1: one from Maxtor and one from Micropolis [Maginnis 87].
This third level has brought the reliability overhead cost to its lowest level, so in the last two levels we improve performance of small accesses without changing cost or reliability.
10. Fourth Level RAID: Independent Reads/Writes
Spreading a transfer across all disks within the group has the following advantage:
•  Large or grouped transfer time is reduced because the transfer bandwidth of the entire array can be exploited.
But it has the following disadvantages as well:
•  Reading/writing to a disk in a group requires reading/writing to all the disks in a group; levels 2 and 3 RAIDs can perform only one I/O at a time per group.
•  If the disks are not synchronized, you do not see average seek and rotational delays; the observed delays should move towards the worst case, hence the S factor in the equations above.
This fourth level RAID improves performance of small transfers through parallelism--the ability to do more than one I/O per group at a time. We no longer spread the individual transfer information across several disks, but keep each individual unit in a single disk.
The virtue of bit-interleaving is the easy calculation of the Hamming code needed to detect or correct errors in level 2. But recall that in the third level RAID we rely on the disk controller to detect errors within a single disk sector. Hence, if we store an individual transfer unit in a single sector, we can detect errors on an individual read without accessing any other disk. Figure 3 shows the different ways the information is stored in a sector for RAID levels 2, 3, and 4. By storing a whole transfer unit in a sector, reads can be independent and operate at the maximum rate of a disk yet still detect errors. Thus the primary change between level 3 and 4 is that we interleave data between disks at the sector level rather than at the bit level.
[Figure 3 diagram: for four transfer units a, b, c, and d — Level 2 bit-interleaves each transfer unit across the data disks with several check disks; Level 3 has only one check disk, with check info calculated over each transfer unit; Level 4 places each transfer unit into a single sector, with check info calculated over a piece of each transfer unit.]
Figure 3. Comparison of location of data and check information in sectors for RAID levels 2, 3, and 4 for G=4. Not shown is the small amount of check information per sector added by the disk controller to detect and correct soft errors within a sector. Remember that we use physical sector numbers and hardware control to explain these ideas, but RAID can be implemented in software using logical sectors and disks.
At first thought you might expect that an individual write to a single sector still involves all the disks in a group since (1) the check disk must be rewritten with the new parity data, and (2) the rest of the data disks must be read to be able to calculate the new parity data. Recall that each parity bit is just a single exclusive OR of all the corresponding data bits in a group. In level 4 RAID, unlike level 3, the parity calculation is much simpler since, if we know the old data value and the old parity value as well as the new data value, we can calculate the new parity information as follows:

new parity = (old data xor new data) xor old parity

In level 4 a small write then uses 2 disks to perform 4 accesses--2 reads and 2 writes--while a small read involves only one read on one disk.
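A tiny Python sketch of that update rule (ours, on toy byte strings rather than real sectors):

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = b"\x0f\x33"
new_data   = b"\xf0\x34"
old_parity = b"\x5a\xc3"   # parity over the whole group, including old_data

# Level 4 small write: read old data and old parity, write new data and new parity
# (2 disks, 4 accesses), without touching the other data disks in the group.
new_parity = xor_bytes(xor_bytes(old_data, new_data), old_parity)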
Table V summarizes the fourth level RAID characteristics. Note that all small accesses improve--dramatically for the reads--but the small read-modify-write is still so slow relative to a level 1 RAID that its applicability to transaction processing is doubtful. Recently Salem and Garcia-Molina proposed a Level 4 system [Salem 86].
Before proceeding to the next level we need to explain the performance of small writes in Table V (and hence small read-modify-writes, since they entail the same operations in this RAID). The formula for the small writes divides D by 2 instead of 4 because 2
References
[Hamming 50] R. W. Hamming, "Error detecting and error correcting codes."
[Amdahl 67] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities."
[Moore 75] G. E. Moore, "Progress in digital integrated electronics."
[Bitton 88] D. Bitton and J. Gray, "Disk shadowing."
[Salem 86] K. Salem and H. Garcia-Molina, "Disk striping."