1988 PROCEEDINGS

SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA

CHICAGO, ILLINOIS
JUNE 1-3
The Association for Computing Machinery, Inc.
11 West 42nd Street
New York, New York 10036

Copyright © 1988 by the Association for Computing Machinery, Inc. Copying
without fee is permitted provided that the copies are not made or distributed for
direct commercial advantage, and credit to the source is given. Abstracting with
credit is permitted. For other copying of articles that carry a code at the bottom
of the first page, copying is permitted provided that the per-copy fee indicated in
the code is paid through the Copyright Clearance Center, 27 Congress Street,
Salem, MA 01970. For permission to republish write to: Director of Publications,
Association for Computing Machinery. To copy otherwise, or republish, requires
a fee and/or specific permission.
ISBN 0-89791-268-3

Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 64145
Baltimore, MD 21264

Price:
Members: ......... $25.00
All others: ......... $33.00

ACM Order Number: 472880
Table of Contents

KEYNOTE ADDRESS

Data With a Point of View ................................................ 1

PANELS

Nested Relations - A Step Forward or Backward
Chair: … ................................................................. 2

Multi-Database Systems
Chair: Marek Rusinkiewicz ................................................ 3

Issues in Building Large Systems
Chair: … ................................................................. 6

Logic Programming, Deductive Databases, Expert Database Systems
Chair: … ................................................................. 7

SESSION: QUERY OPTIMIZATION
Chair: J. Christoph Freytag

Optimizing Large Join Queries
A. Swami and A. Gupta .................................................... 8

Grammar-like Functional Rules for Representing Query Optimization Alternatives
G. M. Lohman ............................................................ 18

Equi-depth Multidimensional Histograms
M. Muralikrishna and D. J. DeWitt ....................................... 28

SESSION: OBJECT-ORIENTED & SEMANTIC SYSTEMS
Chair: Roger King

Transaction Management in an Object-Oriented Database System
(Invited Paper)
Jorge Garza and Won Kim ................................................. 37

SIM: A Database System Based on the Semantic Data Model
D. Jagannathan, B. L. Fritchman, R. L. Guck, J. P. Thompson, and
D. P. Tolbert ........................................................... 46

Contexts and MetaMessages in Object-Oriented Database Programming Language
Design
M. Caruso and E. Sciore ................................................. 56
SESSION: INCOMPLETE INFORMATION
Chair: Gultekin Ozsoyoglu

Partition Semantics for Incomplete Information in Relational Databases
D. Laurent and N. Spyratos .............................................. 66

A Sound and Complete Query Evaluation Algorithm for Relational Databases
with Null Values
L. Y. Yuan and D.-A. Chiang ............................................. 74

The Derivation Problem of Summary Data
F. M. Malvestuto ........................................................ 82

SESSION: PARALLELISM ISSUES
Chair: Dieter Gawlick

Process and Dataflow Control in Distributed Data-Intensive Systems
W. Alexander and G. Copeland ............................................ 90

Data Placement in Bubba
G. Copeland, W. Alexander, E. Boughter and T. Keller .................... 99

A Case for Redundant Arrays of Inexpensive Disks (RAID)
D. A. Patterson, G. Gibson and R. H. Katz .............................. 109

SESSION: DISTRIBUTED CONCURRENCY CONTROL
Chair: Hank F. Korth

Semantics Based Transaction Management Techniques for Replicated Data
A. Kumar and M. Stonebraker ............................................ 117

The Group Paradigm for Concurrency Control
A. El Abbadi and S. Toueg .............................................. 126

Multidatabase Update Issues
Y. Breitbart and A. Silberschatz ....................................... 135

SESSION: LOGIC & DATABASES
Chair: David Stemple

Data Functions, Datalog and Negation
S. Abiteboul and R. Hull ............................................... 143

A Framework for Testing Safety and Effective Computability of Extended Datalog
R. Krishnamurthy, O. Shmueli and R. Ramakrishnan ....................... 154
An Implementation Model for Reasoning with Complex Objects
Q. Chen and G. Gardarin ................................................ 164

SESSION: STORAGE STRUCTURES
Chair: Jack Orenstein

Optimal File Distribution for Partial Match Retrieval
M. H. Kim and S. Pramanik .............................................. 173

Twin Grid Files: Space Optimizing Access Schemes
A. Hutflesz, H.-W. Six and P. Widmayer ................................. 183

Hashing in Practice: Analysis of Hashing and Universal Hashing
M. V. Ramakrishna ...................................................... 191

SHORT SESSION: NON-TRADITIONAL APPLICATIONS
Chair: Randy Katz

Data Modeling in DELAB
Yannis E. Ioannidis and Miron Livny .................................... 200

Data Management of Telecommunications Networks
Kiyoshi Ono, Mikio Aoyama and Hiroshi Fumimoto ......................... 201

A Visual Interface to Design Data
Kari Alho, Hannu Peltonen and Martti Mantyla ........................... 202

A Relational Database Design in Support of Standard Medical Terminology in
Multi-domain Knowledge Bases
Frank Naeymi-Rad, David A. Trace, Max Harry Weil, Lowell Carmony,
D. Christine Georgakis and Martha Evens ................................ 203

SHORT SESSION: INDUSTRIAL SYSTEMS
Chair: Tom Atwood

Vishnu: An Object-Oriented Database Management System Supporting Rapid
Prototyping
P. A. Drew, M. S. Ganti, D. J. Moore, R. J. Nassif, S. Podar, and
D. A. T… ............................................................... 204

Database Services in the German Bildschirmtext System (BTX)
H. J. Br… .............................................................. 205

GAIA: An Object-Oriented Framework for an Ada Environment
Don Vines and Tim King ................................................. 206
LCE: An Object Oriented Application Development Tool
Pankaj Kathhwaha ....................................................... 207

Integrated Storage and Manipulation of Data and Text Based on a Relational
Database
Helmut Krieger, Dieter Schoenbauer and Bernd Strodthoff ................ 208

SESSION: DATABASE DESIGN AND NESTED ALGEBRA
Chair: Victor Vianu

A Characterization of Constant-Time Maintainability for BCNF Database Schemes
H. J. Hernandez and E. P. F. Chan ...................................... 209

A Polynomial Time Algorithm for Testing Implications of a Join Dependency
and Embodied Functional Dependencies
J. Leuchner, L. Miller and G. Slutzki .................................. 218

The Powerset Algebra as a Result of Adding Programming Constructions to
the Nested Relational Algebra
M. Gyssens and D. Van Gucht ............................................ 225

SESSION: DATA MODELS
Chair: Amihai Motro

Resolving the Tension between Integrity and Security Using a Theorem Prover
S. Mazumdar, D. Stemple and T. Sheard .................................. 233

A Transaction Logic for Database Specification
X. Qian and R. Waldinger ............................................... 243

A Generalized Model for a Relational Temporal Database
S. K. Gadia and C.-S. Yeung ............................................ 251

SESSION: HIGH-PERFORMANCE APPLICATIONS
Chair: Frank Olken

High Contention in a Stock Trading Database: A Case Study
P. Peinl and A. Reuter ................................................. 260

A DBMS for Large Design Automation Databases
M. Haynie .............................................................. 269

A Specialized Data Management System for Parallel Execution of Particle
Physics Codes
… ...................................................................... 277
SESSION: PERFORMANCE AND INFORMATION MANAGEMENT
Chair: David Lomet

Performance Analysis and Fundamental Performance Tradeoffs for CLV
Optical Disks
S. Christodoulakis and D. A. Ford ...................................... 286

Processing Queries Against Database Procedures: A Performance Analysis
E. N. Hanson ........................................................... 295

Managing Knowledge About Information System Evolution
M. Jarke and T. Rose ................................................... 303

SESSION: QUERY PROCESSING IN DEDUCTIVE DATABASES
Chair: Dirk Van Gucht

Compiling Separable Recursions
J. F. Naughton ......................................................... 312

Classification of Recursive Formulas in Deductive Databases
C. Youn, L. J. Henschen and J. Han ..................................... 320

Distributed Processing of Logic Programs
O. Wolfson and A. Silberschatz ......................................... 329

SESSION: HIGH-PERFORMANCE SYSTEMS
Chair: David Schrader

A Benchmark of Non-Stop SQL on the Debit Credit Transaction
(Invited Paper)
The Tandem Performance Group ........................................... 337

High Performance SQL Through Low-Level System Integration
A. Borr ................................................................ 342

A Performance Analysis of the Gamma Database Machine
D. J. DeWitt, S. Ghandeharizadeh and D. Schneider ...................... 350

SESSION: TRANSACTION MANAGEMENT
Chair: Hector Garcia-Molina

Semantic Lock Models in Object-Oriented Distributed Systems and Deadlock
Resolution
M. Roesler and W. A. Burkhard .......................................... 361
Commitment in a Partitioned Distributed Database
K. V. S. Ramarao ....................................................... 371

Formal Model of Correctness Without Serializability
H. F. Korth and G. Speegle ............................................. 379

SESSION: RULE PROCESSING
Chair: Andreas Reuter

A Data/Knowledge Base Management Testbed and Experimental Results on
Data/Knowledge Base Query and Update Processing
R. Ramnarayan and H. Lu ................................................ 387

A Self-Controlling Interpreter for the Relational Production Language
L. M. L. Delcambre and J. N. Etheredge ................................. 396

Implementing Large Production Systems in a DBMS Environment:
Concepts and Algorithms
T. Sellis, C.-C. Lin and L. Raschid .................................... 404

SESSION: OBJECT-ORIENTED DATA MODELS
Chair: H.-J. Schek

A Data Model and Query Language for EXODUS
M. J. Carey, D. J. DeWitt and S. L. Vandenberg ......................... 413

O2, an Object-Oriented Data Model
C. Lecluse, P. Richard and F. Velez .................................... 424

Modeling Class Hierarchies with Contradictions
A. Borgida ............................................................. 434

AUTHOR INDEX ........................................................... 444
A Case for Redundant Arrays of Inexpensive Disks (RAID)

David A. Patterson, Garth Gibson, and Randy H. Katz

Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720
(pattrsn@ginger.berkeley.edu)
Permission to copy without fee all or part of this material is granted provided that the copies
are not made or distributed for direct commercial advantage, the ACM copyright notice
and the title of the publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery. To copy otherwise, or to
republish, requires a fee and/or specific permission.
© 1988 ACM 0-89791-268-3/88/0006/0109 $1.50

Abstract. Increasing performance of CPUs and memories will be
squandered if not matched by a similar performance increase in I/O. While
the capacity of Single Large Expensive Disks (SLED) has grown rapidly,
the performance improvement of SLED has been modest. Redundant
Arrays of Inexpensive Disks (RAID), based on the magnetic disk
technology developed for personal computers, offers an attractive
alternative to SLED, promising improvements of an order of magnitude in
performance, reliability, power consumption, and scalability. This paper
introduces five levels of RAIDs, giving their relative cost/performance, and
compares RAID to an IBM 3380 and a Fujitsu Super Eagle.

1. Background: Rising CPU and Memory Performance

The users of computers are currently enjoying unprecedented growth
in the speed of computers. Gordon Bell said that between 1974 and 1984,
single chip computers improved in performance by 40% per year, about
twice the rate of minicomputers [Bell 84]. In the following year Bill Joy
predicted an even faster growth [Joy 85]:

    MIPS = 2^(Year-1984)

Mainframe and supercomputer manufacturers, having difficulty keeping
pace with the rapid growth predicted by "Joy's Law," cope by offering
multiprocessors as their top-of-the-line product.

But a fast CPU does not a fast system make. Gene Amdahl related
CPU speed to main memory size using this rule [Siewiorek 82]:

    Each CPU instruction per second requires one byte of main memory;

If computer system costs are not to be dominated by the cost of memory,
then Amdahl's constant suggests that memory chip capacity should grow
at the same rate. Gordon Moore predicted that growth rate over 20 years
ago:

    transistors/chip = 2^(Year-1964)

As predicted by Moore's Law, RAMs have quadrupled in capacity every
two [Moore 75] to three years [Myers 86].

Recently the ratio of megabytes of main memory to MIPS has been
defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In
part because of the rapid drop of memory prices, main memory sizes have
grown faster than CPU speeds and many machines are shipped today with
alphas of 3 or higher.

To maintain the balance of costs in computer systems, secondary
storage must match the advances in other parts of the system. A key
measure of magnetic disk technology is the growth in the maximum
number of bits that can be stored per square inch, or the bits per inch in a
track times the number of tracks per inch. Called M.A.D., for maximal
areal density, the "First Law in Disk Density" predicts [Frank87]:

    MAD = 10^((Year-1971)/10)

Magnetic disk technology has doubled capacity and halved price every three
years, in line with the growth rate of semiconductor memory, and in
practice between 1967 and 1979 the disk capacity of the average IBM data
processing system more than kept up with its main memory [Stevens81].

Capacity is not the only memory characteristic that must grow
rapidly to maintain system balance, since the speed with which
instructions and data are delivered to a CPU also determines its ultimate
performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed
automatically to contain a substantial fraction of memory references;
(2) and the SRAM technology, used to build caches, whose speed has
improved at the rate of 40% to 100% per year.

In contrast to primary memory technologies, the performance of
single large expensive magnetic disks (SLED) has improved at a modest
rate. These mechanical devices are dominated by the seek and the rotation
delays: from 1971 to 1981, the raw seek time for a high-end IBM disk
improved by only a factor of two while the rotation time did not
change [Harker81]. Greater density means a higher transfer rate when the
information is found, and extra heads can reduce the average seek time, but
the raw seek time only improved at a rate of 7% per year. There is no
reason to expect a faster rate in the near future.

To maintain balance, computer systems have been using even larger
main memories or solid state disks to buffer some of the I/O activity.
This may be a fine solution for applications whose I/O activity has
locality of reference and for which volatility is not an issue, but
applications dominated by a high rate of random requests for small pieces
of data (such as transaction-processing) or by a low number of requests for
massive amounts of data (such as large simulations running on
supercomputers) are facing a serious performance limitation.

2. The Pending I/O Crisis

What is the impact of improving the performance of some pieces of a
problem while leaving others the same? Amdahl's answer is now known
as Amdahl's Law [Amdahl67]:

    S = 1 / ((1 - f) + f/k)

where:
    S = the effective speedup;
    f = fraction of work in faster mode; and
    k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in
I/O. Then when computers are 10X faster--according to Bill Joy in just
over three years--then Amdahl's Law predicts effective speedup will be only
5X. When we have computers 100X faster--via evolution of uniprocessors
or by multiprocessors--this application will be less than 10X faster,
wasting 90% of the potential speedup.
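To make the two scenarios above concrete, here is a minimal sketch (ours,
not the paper's; the function name effective_speedup and the use of Python
are our choices) that evaluates Amdahl's Law:

    def effective_speedup(f, k):
        # Amdahl's Law: S = 1 / ((1 - f) + f/k), where f is the fraction
        # of work in faster mode and k the speedup while in faster mode.
        return 1.0 / ((1.0 - f) + f / k)

    # 10% of time in I/O means f = 0.9 of the work is sped up by the CPU.
    print(effective_speedup(0.9, 10))   # ~5.3  -> "only 5X"
    print(effective_speedup(0.9, 100))  # ~9.2  -> "less than 10X faster"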

While we can imagine improvements in software file systems via
buffering for near term I/O demands, we need innovation to avoid an I/O
crisis [Boral 83].

3. A Solution: Arrays of Inexpensive Disks

Rapid improvements in capacity of large disks have not been the only
target of disk designers, since personal computers have created a market for
inexpensive magnetic disks. These lower cost disks have lower performance
as well as less capacity. Table I below compares the top-of-the-line IBM
3380 model AK4 mainframe disk, the Fujitsu M2361A "Super Eagle"
minicomputer disk, and the Conner Peripherals CP 3100 personal
computer disk.
Characteristics                   IBM      Fujitsu   Conner   3380 v.  2361 v.
                                  3380     M2361A    CP3100   3100     3100
                                                              (>1 means 3100 is better)
Disk diameter (inches)            14       10.5      3.5      4        3
Formatted Data Capacity (MB)      7500     600       100      .01      .2
Price/MB (controller incl.)       $18-$10  $20-$17   $10-$7   1-2.5    1.7-3
MTTF Rated (hours)                30,000   20,000    30,000   1        1.5
MTTF in practice (hours)          100,000  ?         ?        ?        ?
No. Actuators                     4        1         1        .25      1
Maximum I/O's/second/Actuator     50       40        30       .6       .8
Typical I/O's/second/Actuator     30       24        20       .7       .8
Maximum I/O's/second/box          200      40        30       .2       .8
Typical I/O's/second/box          120      24        20       .2       .8
Transfer Rate (MB/sec)            3        2.5       1        .3       .4
Power/box (W)                     6,600    640       10       660      64
Volume (cu. ft.)                  24       3.4       .03      800      110

Table I. Comparison of the IBM 3380 disk model AK4 for mainframe
computers, the Fujitsu M2361A "Super Eagle" disk for minicomputers,
and the Conner Peripherals CP 3100 disk for personal computers. By
"Maximum I/O's/second" we mean the maximum number of average seeks
and average rotates for a single sector access. Cost and reliability
information on the 3380 comes from widespread experience [IBM 87]
[Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu
87], while some numbers on the new CP3100 are based on speculation.
The price per megabyte is given as a range to allow for different prices for
volume discount and different mark-up practices of the vendors. (The 8
watt maximum power of the CP3100 was increased to 10 watts to allow
for the inefficiency of an external power supply, since the other drives
contain their own power supplies.)
One surprising fact is that the number of I/Os per second per actuator in an
inexpensive disk is within a factor of two of the large disks. In several of
the remaining metrics, including price per megabyte, the inexpensive disk
is superior or equal to the large disks.

The small size and low power are even more impressive since disks
such as the CP3100 contain full track buffers and most functions of the
traditional mainframe controller. Small disk manufacturers can provide
such functions in high volume disks because of the efforts of standards
committees in defining higher level peripheral interfaces, such as the ANSI
X3.131-1986 Small Computer System Interface (SCSI). Such standards
have encouraged companies like Adaptec to offer SCSI interfaces as single
chips, in turn allowing disk companies to embed mainframe controller
functions at low cost. Figure 1 compares the traditional mainframe disk
approach and the small computer disk approach. The same SCSI interface
chip embedded as a controller in every disk can also be used as the direct
memory access (DMA) device at the other end of the SCSI bus.

Such characteristics lead to our proposal for building I/O systems as
arrays of inexpensive disks, either interleaved for the large transfers of
supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many
small transfers of transaction processing. Using the information in Table
I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the
IBM 3380 and the same capacity, with lower power consumption and cost.

4. Caveats

We cannot explore all issues associated with such arrays in the space
available for this paper, so we concentrate on fundamental estimates of
price-performance and reliability. Our reasoning is that if there are no
advantages in price-performance or terrible disadvantages in reliability, then
there is no need to explore further. We characterize a transaction-processing
workload to evaluate performance of a collection of inexpensive disks, but
remember that such a collection is just one hardware component of a
complete transaction-processing system. While designing a complete TPS
based on these ideas is enticing, we will resist that temptation in this
paper. Cabling and packaging, certainly an issue in the cost and reliability
of an array of many inexpensive disks, is also beyond this paper's scope.

[Figure 1: "Mainframe" and "Small Computer" disk-interface organizations;
artwork not reproduced.]

Figure 1. Comparison of organizations for typical mainframe and small
computer disk interfaces. Single chip SCSI interfaces such as the Adaptec
AIC-6250 allow the small computer to use a single chip to be the DMA
interface as well as provide an embedded controller for each disk [Adaptec
87]. (The price per megabyte in Table I includes everything in the shaded
boxes above.)

5. And Now The Bad News: Reliability

The unreliability of disks forces computer systems managers to make
backup versions of information quite frequently in case of failure. What
would be the impact on reliability of having a hundredfold increase in
disks? Assuming a constant failure rate--that is, an exponentially
distributed time to failure--and that failures are independent--both
assumptions made by disk manufacturers when calculating the Mean Time
To Failure (MTTF)--the reliability of an array of disks is:

    MTTF of a Disk Array = MTTF of a Single Disk / Number of Disks in the Array
Using the information in Table I, the MTTF of 100 CP 3100 disks is
30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000
hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider
scaling the array to 1000 disks, then the MTTF is 30 hours or about one
day, requiring an adjective worse than dismal.

Without fault tolerance, large arrays of inexpensive disks are too
unreliable to be useful.
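The arithmetic above is easy to reproduce. A minimal sketch (ours, not
the paper's) for the 100-disk and 1000-disk cases:

    def array_mttf(disk_mttf_hours, n_disks):
        # MTTF of an array of disks, assuming independent failures with a
        # constant (exponential) failure rate, as the paper does.
        return disk_mttf_hours / n_disks

    print(array_mttf(30_000, 100))   # 300.0 hours -- under two weeks
    print(array_mttf(30_000, 1000))  # 30.0 hours  -- about one day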
6. A Better Solution: RAID

To overcome the reliability challenge, we must make use of extra
disks containing redundant information to recover the original information
when a disk fails. Our acronym for these Redundant Arrays of Inexpensive
Disks is RAID. To simplify the explanation of our final proposal and to
avoid confusion with previous work, we give a taxonomy of five different
organizations of disk arrays, beginning with mirrored disks and progressing
through a variety of alternatives with differing performance and reliability.
We refer to each organization as a RAID level.

The reader should be forewarned that we describe all levels as if
implemented in hardware solely to simplify the presentation, for RAID
ideas are applicable to software implementations as well as hardware.

Reliability. Our basic approach will be to break the arrays into
reliability groups, with each group having extra "check" disks containing
redundant information. When a disk fails we assume that within a short
time the failed disk will be replaced and the information will be
reconstructed onto the new disk using the redundant information. This
time is called the mean time to repair (MTTR). The MTTR can be reduced
if the system includes extra disks to act as "hot" standby spares; when a
disk fails, a replacement disk is switched in electronically. Periodically a
human operator replaces all failed disks. Here are other terms that we use:

    D = total number of disks with data (not including extra check disks);
    G = number of data disks in a group (not including extra check disks);
    C = number of check disks in a group;
    n_G = D/G = number of groups;

As mentioned above we make the same assumptions that disk
manufacturers make--that failures are exponential and independent. (An
earthquake or power surge is a situation where an array of disks might not
fail independently.) Since these reliability predictions will be very high,
we want to emphasize that the reliability is only of the disk-head
assemblies with this failure model, and not the whole software and
electronic system. In addition, in our view the pace of technology means
extremely high MTTFs are "overkill"--for, independent of expected lifetime,
users will replace obsolete disks. After all, how many people are still
using 20 year old disks?

The general MTTF calculation for single-error repairing RAID is
given in two steps. First, the group MTTF is:

    MTTF_Group = (MTTF_Disk / (G+C)) * 1 / (Probability of another failure
                 in a group before repairing the dead disk)

As more formally derived in the appendix, the probability of a second
failure before the first has been repaired is:

    Probability of Another Failure = MTTR / (MTTF_Disk / (No. Disks - 1))
                                   = MTTR / (MTTF_Disk / (G+C-1))

The intuition behind the formal calculation in the appendix comes
from trying to calculate the average number of second disk failures during
the repair time for X single disk failures. Since we assume that disk failures
occur at a uniform rate, this average number of second failures during the
repair time for X first failures is

    X * MTTR / (MTTF of remaining disks in the group)

The average number of second failures for a single disk is then

    MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk
divided by the number of good disks in the group, giving the result above.

The second step is the reliability of the whole system, which is
approximately (since MTTF_Group is not quite distributed exponentially):

    MTTF_RAID = MTTF_Group / n_G

Plugging it all together, we get:

    MTTF_RAID = (MTTF_Disk / (G+C)) * (MTTF_Disk / ((G+C-1) * MTTR)) * (1 / n_G)
              = (MTTF_Disk)^2 / ((G+C) * n_G * (G+C-1) * MTTR)
              = (MTTF_Disk)^2 / ((D + C * n_G) * (G+C-1) * MTTR)

Since the formula is the same for each level, we make the abstract
numbers concrete using these parameters as appropriate: D = 100 total data
disks, G = 10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1
hour, with the check disks per group C determined by the RAID level.

Reliability Overhead Cost. This is simply the extra check
disks, expressed as a percentage of the number of data disks D. As we shall
see below, the cost varies with RAID level from 100% down to 4%.

Useable Storage Capacity Percentage. Another way to
express this reliability overhead is in terms of the percentage of the total
capacity of data disks and check disks that can be used to store data.
Depending on the organization, this varies from a low of 50% to a high of
96%.

Performance. Since supercomputer applications and
transaction-processing systems have different access patterns and rates, we
need different metrics to evaluate both. For supercomputers we count the
number of reads and writes per second for large blocks of data, with large
defined as getting at least one sector from each data disk in a group. During
large transfers all the disks in a group act as a single unit, each reading or
writing a portion of the large data block in parallel.

A better measure for transaction-processing systems is the number of
individual reads or writes per second. Since transaction-processing
systems (e.g., debits/credits) use a read-modify-write sequence of disk
accesses, we include that metric as well. Ideally during small transfers each
disk in a group can act independently, either reading or writing independent
information. In summary, supercomputer applications need a high data rate
while transaction-processing needs a high I/O rate.

For both the large and small transfer calculations we assume the
minimum user request is a sector, that a sector is small relative to a track,
and that there is enough work to keep every device busy. Thus sector size
affects both disk storage efficiency and transfer size. Figure 2 shows the
ideal operation of large and small disk accesses in a RAID.

[Figure 2: (a) Single Large or "Grouped" Read (1 read spread over G disks);
(b) Several Small or Individual Reads and Writes (G reads and/or writes
spread over G disks); artwork not reproduced.]

Figure 2. Large transfers vs. small transfers in a group of G disks.

The six performance metrics are then the number of reads, writes, and
read-modify-writes per second for both large (grouped) or small (individual)
transfers. Rather than give absolute numbers for each metric, we calculate
efficiency: the number of events per second for a RAID relative to the
corresponding events per second for a single disk. (This is Boral's I/O
bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this
paper we are after fundamental differences so we use simple, deterministic
throughput measures for our performance metric rather than latency.

Effective Performance Per Disk. The cost of disks can be a
large portion of the cost of a database system, so the I/O performance per
disk--factoring in the overhead of the check disks--suggests the
cost/performance of a system. This is the bottom line for a RAID.
7. First Level RAID: Mirrored Disks

Mirrored disks are a traditional approach for improving reliability of
magnetic disks. This is the most expensive option we consider since all
disks are duplicated (G=1 and C=1), and every write to a data disk is also a
write to a check disk. Tandem doubles the number of controllers for fault
tolerance, allowing an optimized version of mirrored disks that lets reads
occur in parallel. Table II shows the metrics for a Level 1 RAID assuming
this optimization.

MTTF                              Exceeds Useful Product Lifetime
                                  (4,500,000 hrs or > 500 years)
Total Number of Disks             2D
Overhead Cost                     100%
Useable Storage Capacity          50%

Events/Sec vs. Single Disk        Full RAID      Efficiency Per Disk
Large (or Grouped) Reads          2D/S           1.00/S
Large (or Grouped) Writes         D/S            .50/S
Large (or Grouped) R-M-W          4D/3S          .67/S
Small (or Individual) Reads       2D             1.00
Small (or Individual) Writes      D              .50
Small (or Individual) R-M-W       4D/3           .67

Table II. Characteristics of Level 1 RAID. Here we assume that writes
are not slowed by waiting for the second write to complete because the
slowdown for writing 2 disks is minor compared to the slowdown S for
writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme
with extra disks that are invisible to the software, we assume an optim…

8. Second Level RAID: Hamming Code for ECC

    1) a read step to get all the rest of the data;
    2) a modify step to merge the new and old information;
    3) a write step to write the full group, including check information.

Since we have scores of disks in a RAID and since some accesses are
to groups of disks, we can mimic the DRAM solution by bit-interleaving
the data across the disks of a group and then add enough check disks to
detect and correct a single error. A single parity disk can detect a single
error, but to correct an error we need enough check disks to identify the
disk with the error. For a group size of 10 data disks (G) we need 4 check
disks (C) in total, and if G = 25 then C = 5 [Hamming50]. To keep down
the cost of redundancy, we assume the group size will vary from 10 to 25.

Since our individual data transfer unit is just a sector, bit-interleaved
disks mean that a large transfer for this RAID must be at least G sectors.
Like DRAMs, reads to a smaller amount implies reading a full sector from
each of the bit-interleaved disks in a group, and writes of a single unit
involve the read-modify-write cycle to all the disks. Table III shows the
metrics of this Level 2 RAID.

MTTF                      Exceeds Useful Lifetime
                          G=10                  G=25
                          (494,500 hrs          (103,500 hrs
                          or > 50 years)        or 12 years)
Total Number of Disks     1.40D                 1.20D
Overhead Cost             40%                   20%
Useable Storage Capacity  71%                   83%

Events/Sec                G=10                  G=25
(vs. Single Disk)         L2       L2/L1        L2       L2/L1
Large Reads               .71/S    71%          .86/S    86%
Large Writes              .71/S    143%         .86/S    172%
Large R-M-W               .71/S    107%         .86/S    129%
Small Reads               .07/S    6%           .03/S    3%
Small Writes              .04/S    6%           .02/S    3%
Small R-M-W               .07/S    9%           .03/S    4%
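The check-disk counts above (C = 4 for G = 10, C = 5 for G = 25) follow
from the Hamming requirement that the check disks be numerous enough
to name the failed disk or report no failure. A minimal sketch (ours; the
2^C >= G + C + 1 inequality is the standard single-error-correcting
Hamming bound):

    def check_disks(g):
        # Smallest c such that 2**c >= g + c + 1: the c check disks must
        # distinguish "no failure" from a failure in any of the g + c disks.
        c = 1
        while 2 ** c < g + c + 1:
            c += 1
        return c

    print(check_disks(10))  # 4, matching the paper's G = 10 example
    print(check_disks(25))  # 5, matching the paper's G = 25 example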
