1988

PROCEEDINGS

SIGMOD INTERNATIONAL CONFERENCE
ON MANAGEMENT OF DATA

CHICAGO, ILLINOIS
JUNE 1-3
`
`
`
`
`
`The Association for Computing Machinery, Inc.
`11 West 42nd Street
`New York, New York 10036
`
Copyright © 1988 by the Association for Computing Machinery, Inc. Copying
without fee is permitted provided that the copies are not made or distributed for
direct commercial advantage, and credit to the source is given. Abstracting with
credit is permitted. For other copying of articles that carry a code at the bottom
of the first page, copying is permitted provided that the per-copy fee indicated in
the code is paid through the Copyright Clearance Center, 27 Congress Street,
Salem, MA 01970. For permission to republish write to: Director of Publications,
Association for Computing Machinery. To copy otherwise, or republish, requires
a fee and/or specific permission.
`
`ISBN 0-89791-268-3
`
`Additional copies may be ordered prepaid from:
`
`ACM Order Department
`P.O. Box 64145
`Baltimore, MD 21264
`
`Price:
`Members: ......... $25.00
`All others: ......... $33.00
`
`ACM Order Number: 472880
`
`
`
`
`Table of Contents
`
KEYNOTE ADDRESS

Data With a Point of View ........................................... 1
`
PANELS

Nested Relations - A Step Forward or Backward
Chair: P. Fischer ........................................... 2

Multi-Database Systems
Chair: Marek Rusinkiewicz ........................................... 3

Issues in Building Large Systems ........................................... 6

Logic Programming, Deductive Databases, Expert Database Systems ........................................... 7
`
SESSION: QUERY OPTIMIZATION
Chair: J. Christoph Freytag

Optimizing Large Join Queries
A. Swami and A. Gupta ........................................... 8

Grammar-like Functional Rules for Representing Query Optimization Alternatives
G. M. Lohman ........................................... 18

Equi-depth Multidimensional Histograms
M. Muralikrishna and D. J. DeWitt ........................................... 28
`
SESSION: OBJECT-ORIENTED & SEMANTIC SYSTEMS
Chair: Roger King

Transaction Management in an Object-Oriented Database System
(Invited Paper)
Jorge Garza and Won Kim ........................................... 37

SIM: A Database System Based on the Semantic Data Model
D. Jagannathan, B. L. Fritchman, R. L. Guck, J. P. Thompson, and
D. M. Tolbert ........................................... 46

Contexts and MetaMessages in Object-Oriented Database Programming Language Design
M. Caruso and E. Sciore ........................................... 56
`
`
`
`
SESSION: INCOMPLETE INFORMATION
Chair: Gultekin Ozsoyoglu

Partition Semantics for Incomplete Information in Relational Databases
D. Laurent and N. Spyratos ........................................... 66

A Sound and Complete Query Evaluation Algorithm for Relational Databases with Null Values
L. Y. Yuan and D.-A. Chiang ........................................... 74

The Derivation Problem of Summary Data
F. M. Malvestuto ........................................... 82
`
SESSION: PARALLELISM ISSUES
Chair: Dieter Gawlick

Process and Dataflow Control in Distributed Data-Intensive Systems
W. Alexander and G. Copeland ........................................... 90

Data Placement in Bubba
G. Copeland, W. Alexander, E. Boughter and T. Keller ........................................... 99

A Case for Redundant Arrays of Inexpensive Disks (RAID)
D. A. Patterson, G. Gibson and R. H. Katz ........................................... 109

SESSION: DISTRIBUTED CONCURRENCY CONTROL
Chair: Hank F. Korth

Semantics Based Transaction Management Techniques for Replicated Data
A. Kumar and M. Stonebraker ........................................... 117

The Group Paradigm for Concurrency Control
A. El Abbadi and S. Toueg ........................................... 126

Multidatabase Update Issues
Y. Breitbart and A. Silberschatz ........................................... 135
`
SESSION: LOGIC & DATABASES
Chair: David Stemple

Data Functions, Datalog and Negation
S. Abiteboul and R. Hull ........................................... 143

A Framework for Testing Safety and Effective Computability of Extended Datalog
R. Krishnamurthy, O. Shmueli and R. Ramakrishnan ........................................... 154
`
`
`
`
An Implementation Model for Reasoning with Complex Objects
Q. Chen and G. Gardarin ........................................... 164

SESSION: STORAGE STRUCTURES
Chair: Jack Orenstein

Optimal File Distribution for Partial Match Retrieval
M. H. Kim and S. Pramanik ........................................... 173

Twin Grid Files: Space Optimizing Access Schemes
A. Hutflesz, H.-W. Six and P. Widmayer ........................................... 183

Hashing in Practice: Analysis of Hashing and Universal Hashing
M. V. Ramakrishna ........................................... 191
`
SHORT SESSION: NON-TRADITIONAL APPLICATIONS
Chair: Randy Katz

Data Modeling in DELAB
Yannis E. Ioannidis and Miron Livny ........................................... 200

Data Management of Telecommunications Networks
Kiyoshi Ono, Mikio Aoyama and Hiroshi Fumimoto ........................................... 201

A Visual Interface to Design Data
Kari Alho, Hannu Peltonen and Martti Mantyla ........................................... 202

A Relational Database Design in Support of Standard Medical Terminology in Multi-domain Knowledge Bases
Frank Naeymi-Rad, David A. Trace, Max Harry Weil, Lowell Carmony,
D. Christine Georgakis and Martha Evens ........................................... 203
`
SHORT SESSION: INDUSTRIAL SYSTEMS
Chair: Tom Atwood

Vishnu: An Object-Oriented Database Management System Supporting Rapid Prototyping
P. A. Drew, M. S. Ganti, D. J. Moore, R. J. Nassif, S. Podar, and
D. A. Tacizeros ........................................... 204

Database Services in the German Bildschirmtext System (BTX)
H. J. Brommni ........................................... 205

GAIA: An Object-Oriented Framework for an Ada Environment
Don Vines and Tim King ........................................... 206
`
`
`
`
LCE: An Object Oriented Application Development Tool
Pukniaj Kathhwaha ........................................... 207

Integrated Storage and Manipulation of Data and Text Based on a Relational Database
Helmut Krieger, Dieter Schoenbauer and Bernd Strodthoff ........................................... 208
`
SESSION: DATABASE DESIGN AND NESTED ALGEBRA
Chair: Victor Vianu

A Characterization of Constant-Time Maintainability for BCNF Database Schemes
H. J. Hernandez and E. P. F. Chan ........................................... 209

A Polynomial Time Algorithm for Testing Implications of a Join Dependency and Embodied Functional Dependencies
J. Leuchner, L. Miller and G. Slutzki ........................................... 218

The Powerset Algebra as a Result of Adding Programming Constructs to the Nested Relational Algebra
M. Gyssens and D. Van Gucht ........................................... 225
`
SESSION: DATA MODELS
Chair: Amihai Motro

Resolving the Tension between Integrity and Security Using a Theorem Prover
S. Mazumdar, D. Stemple and T. Sheard ........................................... 233

A Transaction Logic for Database Specification
X. Qian and R. Waldinger ........................................... 243

A Generalized Model for a Relational Temporal Database
S. K. Gadia and C.-S. Yeung ........................................... 251

SESSION: HIGH-PERFORMANCE APPLICATIONS
Chair: Frank Olken

High Contention in a Stock Trading Database: A Case Study
P. Peinl and A. Reuter ........................................... 260

A DBMS for Large Design Automation Databases
M. W. Haynie ........................................... 269

A Specialized Data Management System for Parallel Execution of Particle Physics Codes ........................................... 277
`
`
`
`
SESSION: PERFORMANCE AND INFORMATION MANAGEMENT
Chair: David Lomet

Performance Analysis and Fundamental Performance Tradeoffs for CLV Optical Disks
S. Christodoulakis and D. A. Ford ........................................... 286

Processing Queries Against Database Procedures: A Performance Analysis
E. N. Hanson ........................................... 295

Managing Knowledge About Information System Evolution
M. Jarke and T. Rose ........................................... 303

SESSION: QUERY PROCESSING IN DEDUCTIVE DATABASES
Chair: Dirk Van Gucht

Compiling Separable Recursions
J. F. Naughton ........................................... 312

Classification of Recursive Formulas in Deductive Databases
C. Youn, L. J. Henschen and J. Han ........................................... 320

Distributed Processing of Logic Programs
O. Wolfson and A. Silberschatz ........................................... 329
`
SESSION: HIGH-PERFORMANCE SYSTEMS
Chair: David Schrader

A Benchmark of Non-Stop SQL on the Debit Credit Transaction
(Invited Paper)
The Tandem Performance Group ........................................... 337

High Performance SQL Through Low-Level System Integration ........................................... 342

A Performance Analysis of the Gamma Database Machine
D. J. DeWitt, S. Ghandeharizadeh and D. Schneider ........................................... 350
`
`SESSION: TRANSACTION MANAGEMENT
`Chair: Hector Garcia-Molina
`
Semantic Lock Models in Object-Oriented Distributed Systems and Deadlock Resolution
M. Roesler and W. A. Burkhard ........................................... 361
`
`
`
`
Commitment in a Partitioned Distributed Database
K. V. S. Ramarao ........................................... 371

Formal Model of Correctness Without Serializability
H. F. Korth and G. Speegle ........................................... 379
`
SESSION: RULE PROCESSING
Chair: Andreas Reuter

A Data/Knowledge Base Management Testbed and Experimental Results on Data/Knowledge Base Query and Update Processing
R. Ramnarayan and H. Lu ........................................... 387

A Self-Controlling Interpreter for the Relational Production Language
L. M. L. Delcambre and J. N. Etheredge ........................................... 396

Implementing Large Production Systems in a DBMS Environment: Concepts and Algorithms
T. Sellis, C.-C. Lin and L. Raschid ........................................... 404
`
SESSION: OBJECT-ORIENTED DATA MODELS
Chair: H.-J. Schek

A Data Model and Query Language for EXODUS
M. J. Carey, D. J. DeWitt and S. L. Vandenberg ........................................... 413

O2, an Object-Oriented Data Model
C. Lecluse, P. Richard and F. Velez ........................................... 424

Modeling Class Hierarchies with Contradictions
A. Borgida ........................................... 434

AUTHOR INDEX ........................................... 444
`
`
`
`
`
`
`
`
`A Case for Redundant Arrays of Inexpensive Disks (RAID)
`
`David A. Patterson, Garth Gibson, and Randy H. Katz
`
`Computer Science Division
Department of Electrical Engineering and Computer Sciences
`571 Evans Hall
`University of California
`Berkeley, CA 94720
`(pattrsn@ginger.berkeley.edu)
`
Abstract. Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.

1. Background: Rising CPU and Memory Performance

The users of computers are currently enjoying unprecedented growth in the speed of computers. Gordon Bell said that between 1974 and 1984, single chip computers improved in performance by 40% per year, about twice the rate of minicomputers [Bell 84]. In the following year Bill Joy predicted an even faster growth [Joy 85]:

    MIPS = 2^(Year - 1984)

Mainframe and supercomputer manufacturers, having difficulty keeping pace with the rapid growth predicted by "Joy's Law," cope by offering multiprocessors as their top-of-the-line product.

But a fast CPU does not a fast system make. Gene Amdahl related CPU speed to main memory size using this rule [Siewiorek 82]:

    Each CPU instruction per second requires one byte of main memory;

If computer system costs are not to be dominated by the cost of memory, then Amdahl's constant suggests that memory chip capacity should grow at the same rate. Gordon Moore predicted that growth rate over 20 years ago:

    transistors/chip = 2^(Year - 1964)

As predicted by Moore's Law, RAMs have quadrupled in capacity every two [Moore 75] to three years [Myers 86]. Recently the ratio of megabytes of main memory to MIPS has been defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In part because of the rapid drop of memory prices, main memory sizes have grown faster than CPU speeds and many machines are shipped today with alphas of 3 or higher.

To maintain the balance of costs in computer systems, secondary storage must match the advances in other parts of the system. A key measure of magnetic disk technology is the growth in the maximum number of bits that can be stored per square inch, or the bits per inch in a track times the number of tracks per inch. Called M.A.D., for maximal areal density, the "First Law in Disk Density" predicts [Frank 87]:

    MAD = 10^((Year - 1971) / 10)

Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of semiconductor memory, and in practice between 1967 and 1979 the disk capacity of the average IBM data processing system more than kept up with its main memory [Stevens 81].

Capacity is not the only memory characteristic that must grow rapidly to maintain system balance, since the speed with which instructions and data are delivered to a CPU also determines its ultimate performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed automatically to contain a substantial fraction of memory references;
(2) the SRAM technology, used to build caches, whose speed has improved at the rate of 40% to 100% per year.

In contrast to primary memory technologies, the performance of single large expensive magnetic disks (SLED) has improved at a modest rate. These mechanical devices are dominated by the seek and the rotation delays: from 1971 to 1981, the raw seek time for a high-end IBM disk improved by only a factor of two while the rotation time did not change [Harker 81]. Greater density means a higher transfer rate when the information is found, and extra heads can reduce the average seek time, but the raw seek time only improved at a rate of 7% per year. There is no reason to expect a faster rate in the near future.

To maintain balance, computer systems have been using even larger main memories or solid state disks to buffer some of the I/O activity. This may be a fine solution for applications whose I/O activity has locality of reference and for which volatility is not an issue, but applications dominated by a high rate of random requests for small pieces of data (such as transaction-processing) or by a low number of requests for massive amounts of data (such as large simulations running on supercomputers) are facing a serious performance limitation.
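The three growth rules quoted in this section are easy to evaluate side by side. The following Python sketch is ours, not the paper's (function names are our own), and simply tabulates the laws for a few sample years:

    # Evaluating the growth laws quoted in Section 1 (illustrative sketch).

    def joy_mips(year):
        # Joy's Law: MIPS = 2^(Year - 1984)
        return 2 ** (year - 1984)

    def moore_transistors(year):
        # Moore's prediction: transistors/chip = 2^(Year - 1964)
        return 2 ** (year - 1964)

    def mad(year):
        # "First Law in Disk Density": MAD = 10^((Year - 1971) / 10)
        return 10 ** ((year - 1971) / 10)

    for year in (1984, 1988, 1994):
        print(year, joy_mips(year), moore_transistors(year), round(mad(year), 1))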
2. The Pending I/O Crisis

What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law [Amdahl 67]:

    S = 1 / ((1 - f) + f/k)
`
where:
S = the effective speedup;
f = fraction of work in faster mode; and
k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster--according to Bill Joy in just over three years--then Amdahl's Law predicts effective speedup will be only 5X. When we have computers 100X faster--via evolution of uniprocessors or by multiprocessors--this application will be less than 10X faster, wasting 90% of the potential speedup.
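A one-line check of that arithmetic (a sketch of ours; f = 0.9 because the 90% of the time that is not I/O is what gets faster):

    def amdahl_speedup(f, k):
        # Amdahl's Law: S = 1 / ((1 - f) + f/k)
        return 1.0 / ((1.0 - f) + f / k)

    print(amdahl_speedup(f=0.9, k=10))   # ~5.3: "only 5X" when CPUs are 10X faster
    print(amdahl_speedup(f=0.9, k=100))  # ~9.2: "less than 10X" at 100X faster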
`
`
`
`
`While we can imagine improvements in software file systems via
`buffering for near term I/O demands, we need innovation to avoid an I/O
`crisis [Boral 83].
`
`3. A Solution: Arrays of Inexpensive Disks
Rapid improvements in capacity of large disks have not been the only target of disk designers, since personal computers have created a market for inexpensive magnetic disks. These lower cost disks have lower performance as well as less capacity. Table I below compares the top-of-the-line IBM 3380 model AK4 mainframe disk, Fujitsu M2361A "Super Eagle" minicomputer disk, and the Conner Peripherals CP 3100 personal computer disk.
`
    Characteristics                  IBM      Fujitsu   Conner   3380 v.  2361 v.
                                     3380     M2361A    CP3100   3100     3100
                                                                 (>1 means 3100 is better)
    Disk diameter (inches)           14       10.5      3.5      4        3
    Formatted Data Capacity (MB)     7500     600       100      .01      .2
    Price/MB (controller incl.)      $18-$10  $20-$17   $10-$7   1-2.5    1.7-3
    MTTF Rated (hours)               30,000   20,000    30,000   1        1.5
    MTTF in practice (hours)         100,000  ?         ?        ?        ?
    No. Actuators                    4        1         1        .25      1
    Maximum I/O's/second/Actuator    50       40        30       .6       .8
    Typical I/O's/second/Actuator    30       24        20       .7       .8
    Maximum I/O's/second/box         200      40        30       .2       .8
    Typical I/O's/second/box         120      24        20       .2       .8
    Transfer Rate (MB/sec)           3        2.5       1        .3       .4
    Power/box (W)                    6,600    640       10       660      64
    Volume (cu. ft.)                 24       3.4       .03      800      110
`
`
`
Table I. Comparison of IBM 3380 disk model AK4 for mainframe computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers, and the Conner Peripherals CP 3100 disk for personal computers. By "Maximum I/O's/second" we mean the maximum number of average seeks and average rotates for a single sector access. Cost and reliability information on the 3380 comes from widespread experience [IBM 87] [Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu 87], while some numbers on the new CP3100 are based on speculation. The price per megabyte is given as a range to allow for different prices for volume discount and different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply, since the other drives contain their own power supplies.)
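The two ratio columns of Table I can be recomputed from the raw columns. A small sketch (ours; only a few rows transcribed) showing the convention that ratios are oriented so values greater than 1 favor the CP3100:

    # Recomputing Table I's ratio columns (values transcribed from the table).
    rows = [
        # (metric, IBM 3380, Fujitsu M2361A, CP3100, higher_is_better)
        ("Maximum I/O's/second/Actuator", 50,   40,  30,   True),
        ("Transfer Rate (MB/sec)",        3,    2.5, 1,    True),
        ("Power/box (W)",                 6600, 640, 10,   False),
        ("Volume (cu. ft.)",              24,   3.4, 0.03, False),
    ]

    for metric, ibm, fujitsu, cp, higher_is_better in rows:
        if higher_is_better:
            r3380, r2361 = cp / ibm, cp / fujitsu
        else:  # lower is better, so invert the ratio
            r3380, r2361 = ibm / cp, fujitsu / cp
        print(f"{metric}: 3380 v. 3100 = {r3380:.2g}, 2361 v. 3100 = {r2361:.2g}")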
`
One surprising fact is that the number of I/Os per second per actuator in an inexpensive disk is within a factor of two of the large disks. In several of the remaining metrics, including price per megabyte, the inexpensive disk is superior or equal to the large disks.

The small size and low power are even more impressive since disks such as the CP3100 contain full track buffers and most functions of the traditional mainframe controller. Small disk manufacturers can provide such functions in high volume disks because of the efforts of standards committees in defining higher level peripheral interfaces, such as the ANSI X3.131-1986 Small Computer System Interface (SCSI). Such standards have encouraged companies like Adaptec to offer SCSI interfaces as single chips, in turn allowing disk companies to embed mainframe controller functions at low cost. Figure 1 compares the traditional mainframe disk approach and the small computer disk approach. The same SCSI interface chip embedded as a controller in every disk can also be used as the direct memory access (DMA) device at the other end of the SCSI bus.

[Figure 1: two disk-interface organizations, labeled "Mainframe" and "Small Computer"]

Figure 1. Comparison of organizations for typical mainframe and small computer disk interfaces. Single chip SCSI interfaces such as the Adaptec AIC-6250 allow the small computer to use a single chip to be the DMA interface as well as provide an embedded controller for each disk [Adaptec 87]. (The price per megabyte in Table I includes everything in the shaded boxes above.)

Such characteristics lead to our proposal for building I/O systems as arrays of inexpensive disks, either interleaved for the large transfers of supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many small transfers of transaction processing. Using the information in Table I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the IBM 3380 and the same capacity, with lower power consumption and cost.
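A quick check of the 75-disk claim against Table I (our arithmetic, not the paper's; "I/O bandwidth" is read here as the maximum I/Os per second per box):

    # 75 CP3100s vs. one IBM 3380, using Table I values (illustrative).
    n = 75
    print(n * 100, "MB vs. 7500 MB")           # 7500 MB: the same capacity
    print(n * 30 / 200, "x the max I/O rate")  # ~11x, roughly the "12 times" claimed
    print(n * 10, "W vs. 6,600 W")             # 750 W: much lower power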
`
4. Caveats

We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on fundamental estimates of price-performance and reliability. Our reasoning is that if there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize a transaction-processing workload to evaluate performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper's scope.
5. And Now The Bad News: Reliability

The unreliability of disks forces computer systems managers to make backup versions of information quite frequently in case of failure. What would be the impact on reliability of having a hundredfold increase in disks? Assuming a constant failure rate--that is, an exponentially distributed time to failure--and that failures are independent--both assumptions made by disk manufacturers when calculating the Mean Time To Failure (MTTF)--the reliability of an array of disks is:

    MTTF of a Disk Array = MTTF of a Single Disk / Number of Disks in the Array
`
Using the information in Table I, the MTTF of 100 CP 3100 disks is 30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000 hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider scaling the array to 1000 disks, then the MTTF is 30 hours or about one day, requiring an adjective worse than dismal.

Without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful.
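In code, the no-redundancy reliability model above is a single division (a sketch of ours, under the same exponential-and-independent failure assumptions):

    def array_mttf(disk_mttf_hours, num_disks):
        # MTTF of an array with no redundancy: MTTF of a single disk / N.
        return disk_mttf_hours / num_disks

    print(array_mttf(30000, 100))   # 300.0 hours -- less than 2 weeks
    print(array_mttf(30000, 1000))  # 30.0 hours -- about one day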
6. A Better Solution: RAID

To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails. Our acronym for these Redundant Arrays of Inexpensive Disks is RAID. To simplify the explanation of our final proposal and to avoid confusion with previous work, we give a taxonomy of five different organizations of disk arrays, beginning with mirrored disks and progressing through a variety of alternatives with differing performance and reliability. We refer to each organization as a RAID level.

The reader should be forewarned that we describe all levels as if implemented in hardware solely to simplify the presentation, for RAID ideas are applicable to software implementations as well as hardware.

Reliability. Our basic approach will be to break the arrays into reliability groups, with each group having extra "check" disks containing redundant information. When a disk fails we assume that within a short time the failed disk will be replaced and the information will be
`
`
`
`
`
reconstructed onto the new disk using the redundant information. This time is called the mean time to repair (MTTR). The MTTR can be reduced if the system includes extra disks to act as "hot" standby spares; when a disk fails, a replacement disk is switched in electronically. Periodically a human operator replaces all failed disks. Here are other terms that we use:

D = total number of disks with data (not including extra check disks);
G = number of data disks in a group (not including extra check disks);
C = number of check disks in a group;
nG = D/G = number of groups;

As mentioned above, we make the same assumptions that disk manufacturers make--that failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.) Since these reliability predictions will be very high, we want to emphasize that the reliability is only of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTFs are "overkill"--for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20 year old disks?

The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is:

    MTTF_Group = (MTTF_Disk / (G + C)) * 1 / (Probability of another failure in a group before repairing the dead disk)
`
As more formally derived in the appendix, the probability of a second failure before the first has been repaired is:

    Probability of Another Failure = MTTR / (MTTF_Disk / (No. Disks - 1)) = MTTR / (MTTF_Disk / (G + C - 1))
`
`
The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is

    X * MTTR / (MTTF of remaining disks in the group)

The average number of second failures for a single disk is then

    MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.

The second step is the reliability of the whole system, which is approximately (since MTTF_Group is not quite distributed exponentially):

    MTTF_RAID = MTTF_Group / nG
Plugging it all together, we get:

    MTTF_RAID = (MTTF_Disk / (G + C)) * (MTTF_Disk / ((G + C - 1) * MTTR)) * (1 / nG)

              = (MTTF_Disk)^2 / ((G + C) * nG * (G + C - 1) * MTTR)

              = (MTTF_Disk)^2 / ((D + C * nG) * (G + C - 1) * MTTR)

Since the formula is the same for each level, we make the abstract numbers concrete using these parameters as appropriate: D = 100 total data disks, G = 10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1 hour, with the check disks per group C determined by the RAID level.
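Evaluated with these parameters, the plugged-together formula reproduces the MTTF figures quoted in Tables II and III below; a minimal sketch (ours):

    def raid_mttf(mttf_disk, mttr, d, g, c):
        # MTTF_RAID = MTTF_Disk^2 / ((D + C*nG) * (G + C - 1) * MTTR)
        ng = d / g  # number of groups
        return mttf_disk ** 2 / ((d + c * ng) * (g + c - 1) * mttr)

    # D = 100 data disks, MTTF_Disk = 30,000 hours, MTTR = 1 hour:
    print(raid_mttf(30000, 1, d=100, g=1, c=1))   # 4,500,000 hrs: Level 1 (mirroring)
    print(raid_mttf(30000, 1, d=100, g=10, c=4))  # ~494,500 hrs: Level 2 with G = 10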
Reliability Overhead Cost. This is simply the extra check disks, expressed as a percentage of the number of data disks D. As we shall see below, the cost varies with RAID level from 100% down to 4%.

Useable Storage Capacity Percentage. Another way to express this reliability overhead is in terms of the percentage of the total capacity of data disks and check disks that can be used to store data. Depending on the organization, this varies from a low of 50% to a high of 96%.

Performance. Since supercomputer applications and transaction-processing systems have different access patterns and rates, we need different metrics to evaluate both. For supercomputers we count the number of reads and writes per second for large blocks of data, with large defined as getting at least one sector from each data disk in a group. During large transfers all the disks in a group act as a single unit, each reading or writing a portion of the large data block in parallel.

A better measure for transaction-processing systems is the number of individual reads or writes per second. Since transaction-processing systems (e.g., debits/credits) use a read-modify-write sequence of disk accesses, we include that metric as well. Ideally during small transfers each disk in a group can act independently, either reading or writing independent information. In summary, supercomputer applications need a high data rate while transaction-processing needs a high I/O rate.

For both the large and small transfer calculations we assume the minimum user request is a sector, that a sector is small relative to a track, and that there is enough work to keep every device busy. Thus sector size affects both disk storage efficiency and transfer size. Figure 2 shows the ideal operation of large and small disk accesses in a RAID.

[Figure 2: (a) Single Large or "Grouped" Read (1 read spread over G disks); (b) Several Small or Individual Reads and Writes (G reads and/or writes spread over G disks)]

Figure 2. Large transfer vs. small transfers in a group of G disks.

The six performance metrics are then the number of reads, writes, and read-modify-writes per second for both large (grouped) or small (individual) transfers. Rather than give absolute numbers for each metric, we calculate efficiency: the number of events per second for a RAID relative to the corresponding events per second for a single disk. (This is Boral's I/O bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this paper we are after fundamental differences so we use simple, deterministic throughput measures for our performance metric rather than latency.

Effective Performance Per Disk. The cost of disks can be a large portion of the cost of a database system, so the I/O performance per disk--factoring in the overhead of the check disks--suggests the cost/performance of a system. This is the bottom line for a RAID.
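As a concrete instance of these definitions, per-disk efficiency is the full-RAID rate divided by the total number of disks, check disks included (a sketch of ours; the D and S values are arbitrary placeholders):

    # Per-disk efficiency = full-RAID events/sec (relative to one disk)
    # divided by the total number of disks, data plus check.
    def per_disk(full_raid_rate, total_disks):
        return full_raid_rate / total_disks

    D, S = 100, 10.0  # illustrative: 100 data disks, group slowdown factor S
    # Level 1 mirroring uses 2D disks and does large reads at 2D/S, so per disk:
    print(per_disk(2 * D / S, 2 * D))  # 0.1 = 1.00/S, as Table II lists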
`
`
`
`
7. First Level RAID: Mirrored Disks

Mirrored disks are a traditional approach for improving reliability of magnetic disks. This is the most expensive option we consider since all disks are duplicated (G=1 and C=1), and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table II shows the metrics for a Level 1 RAID assuming this optimization.

    MTTF                            Exceeds Useful Product Lifetime
                                    (4,500,000 hrs or > 500 years)
    Total Number of Disks           2D
    Overhead Cost                   100%
    Useable Storage Capacity        50%

    Events/Sec vs. Single Disk      Full RAID    Efficiency Per Disk
    Large (or Grouped) Reads        2D/S         1.00/S
    Large (or Grouped) Writes       D/S          .50/S
    Large (or Grouped) R-M-W        4D/3S        .67/S
    Small (or Individual) Reads     2D           1.00
    Small (or Individual) Writes    D            .50
    Small (or Individual) R-M-W     4D/3         .67

Table II. Characteristics of Level 1 RAID. Here we assume that writes are not slowed by waiting for the second write to complete because the slowdown for writing 2 disks is minor compared to the slowdown S for writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme with extra disks that are invisible to the software, we assume an optim

8. Second Level RAID: Hamming Code for ECC

1) a read step to get all the rest of the data;
2) a modify step to merge the new and old information;
3) a write step to write the full group, including check information.

Since we have scores of disks in a RAID and since some accesses are to groups of disks, we can mimic the DRAM solution by bit-interleaving the data across the disks of a group and then add enough check disks to detect and correct a single error. A single parity disk can detect a single error, but to correct an error we need enough check disks to identify the disk with the error. For a group size of 10 data disks (G) we need 4 check disks (C) in total, and if G = 25 then C = 5 [Hamming 50]. To keep down the cost of redundancy, we assume the group size will vary from 10 to 25.

Since our individual data transfer unit is just a sector, bit-interleaved disks mean that a large transfer for this RAID must be at least G sectors. Like DRAMs, reads to a smaller amount implies reading a full sector from each of the bit-interleaved disks in a group, and writes of a single unit involve the read-modify-write cycle to all the disks. Table III shows the metrics of this Level 2 RAID.

                                    G = 10            G = 25
    MTTF                            Exceeds Useful Lifetime
                                    (494,500 hrs      (103,500 hrs
                                    or > 50 years)    or 12 years)
    Total Number of Disks           1.40D             1.20D
    Overhead Cost                   40%               20%
    Useable Storage Capacity        71%               83%

    Events/Sec                      G = 10            G = 25
    (vs. Single Disk)               L2      L2/L1     L2      L2/L1
    Large Reads                     .71/S   71%       .86/S   86%
    Large Writes                    .71/S   143%      .86/S   172%
    Large R-M-W                     .71/S   107%      .86/S   129%
    Small Reads                     .07/S   6%        .03/S   3%
    Small Writes                    .04/S   6%        .02/S   3%
    Small R-M-W                     .07/S   9%        .03/S   4%
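The check-disk counts quoted for Level 2 follow from the Hamming condition for single-error correction: the smallest C with 2^C >= G + C + 1. A minimal sketch (ours, not from the paper):

    def hamming_check_disks(g):
        # Smallest C such that 2^C >= G + C + 1: enough check disks to
        # point at any one of the G + C disks (or signal "no error").
        c = 1
        while 2 ** c < g + c + 1:
            c += 1
        return c

    print(hamming_check_disks(10))  # 4 check disks for G = 10
    print(hamming_check_disks(25))  # 5 check disks for G = 25 [Hamming 50]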