`
`The 20th Annual International
`Symposium on
`
`COMPUTER ARCHITECTURE
`
`May 16-19, 1993
`
`San Diego, California
`
`Sponsored by
`
IEEE Computer Society
`Technical Committee on Computer Architecture
`
`Association for Computing Machinery
`SIGARCH
`
`
`IEEE Computer Society Press
`Los Alamitos, California
`
Washington • Brussels • Tokyo
`
`
`
`
`
The papers in this book comprise the proceedings of the meeting mentioned on the
cover and title page. They reflect the authors' opinions and, in the interests of timely
dissemination, are published as presented and without change. Their inclusion in this
publication does not necessarily constitute endorsement by the editors, the IEEE
Computer Society Press, or the Institute of Electrical and Electronics Engineers, Inc.
`
`Published by the
`IEEE Computer Society Press
10662 Los Vaqueros Circle
`PO Box 3014
`Los Alamitos, CA 90720-1264
`
`
© 1993 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
Copyright and Reprint Permissions: Abstracting is permitted with credit to the source.
Libraries are permitted to photocopy beyond the limit of US copyright law, for private use
of patrons, those articles in this volume that carry a code at the bottom of the first page,
provided that the per-copy fee indicated in the code is paid through the Copyright Clearance
Center, 27 Congress Street, Salem, MA 01970. For other copying, reprint, or republication
permission, write to IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, PO
Box 1331, Piscataway, NJ 08855-1331.
`
`IEEE Computer Society Press Order Number 3810-02
`IEEE Catalog Number 93CH3284-7
`Library of Congress Number 85-642899
`ISBN 0-8186-3810-9 (paper)
ISBN 0-8186-3811-7 (microfiche)
`ISBN 0-8186-3812-5 (case)
`ISSN 0884-7495
ACM Order Number 415930
ACM Library Series ISBN 0-89791-579-8 (Hardcover)
ACM SIGARCH ISSN 0163-5964
`
`Additional copies can be ordered from:
`
`IEEE Service Center
`445 Hoes Lane
`PO Box 1331
`Piscataway, NJ 08855-1331
`
`IEEE Computer Society
13, avenue de l'Aquilon
`B-1200 Brussels
`BELGIUM
`
`ACM Order Dept., PO Box 64145, Baltimore, MD 21264
`
IEEE Computer Society
`Ooshima Building
`2-19-1 Minami-Aoyama
`Minato-ku, Tokyo 107
`JAPAN
`
`IEEE Computer Society Press
`Customer Service Center
10662 Los Vaqueros Circle
`PO Box 3014
`Los Alamitos, CA 90720-1264
`
`Production Editors: Mary E. Kavanaugh and Edna Straub
Cover art: Joseph Daigle / Schenk-Daigle Studios
`Printed in the United States of America by Braun-Brumfield, Inc.
`
The Institute of Electrical and Electronics Engineers, Inc.
`
`
`
`
`
`Table of Contents
`
General Chair's Message .......... v
Program Chair's Message .......... vi
Organizing Committee .......... vii
Referees .......... viii
`
`Session 1: Opening Session
`
`Session 2: Architectural Characteristics of Scientific Applications
`
Architectural Requirements of Parallel Scientific Applications
with Explicit Communication .......... 2
`R. Cypher, A. Ho, S. Konstantinidou, and P. Messina
Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors .......... 14
`E. Rothberg, J.P. Singh, and A. Gupta
`
`Session 3: TLBs and Memory Management
`
Design Tradeoffs for Software-Managed TLBs .......... 27
`D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge, and R. Brown
Architectural Support for Translation Table Management in Large Address
Space Machines .......... 39
`J. Huck and J. Hays
`
`Session 4: Input/Output
`
The TickerTAIP Parallel RAID Architecture .......... 52
`P. Cao, S.B. Lim, S. Venkataraman, and J. Wilkes
Parity Logging Overcoming the Small Write Problem in Redundant Disk Arrays .......... 64
`D. Stodolsky, G. Gibson, and M. Holland
The Architecture of a Fault-Tolerant Cached RAID Controller .......... 76
`J. Menon and J. Cortney
`
`Session 5: Multiprocessor Caches
`
The Detection and Elimination of Useless Misses in Multiprocessors .......... 88
M. Dubois, J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Stenström
Adaptive Cache Coherency for Detecting Migratory Shared Data .......... 98
`A.L. Cox and R.J. Fowler
`
An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing .......... 109
P. Stenström, M. Brorsson, and L. Sandberg
`
`Session 6: Panel Session I — High-Performance Computing from
`the Applications Perspective
`D. Kuck (Chair)
`
`
`
`
`Session 7: Multithreading Support
`
Register Relocation: Flexible Contexts for Multithreading .......... 120
`C.A. Waldspurger and W.E. Weihl
Multiple Threads in Cyclic Register Windows .......... 131
`Y. Hidaka, H. Koike, and H. Tanaka
`
Session 8: Mechanisms for Creating Shared Memory
`
Evaluation of Release Consistent Software Distributed Shared Memory on Emerging
Network Technology .......... 144
S. Dwarkadas, P. Keleher, A.L. Cox, and W. Zwaenepoel
Mechanisms for Cooperative Shared Memory .......... 156
`D.A. Wood, S. Chandra, B. Falsafi, M.D. Hill, J.R. Larus, A.R. Lebeck,
`J.C. Lewis, S.S. Mukherjee, S. Palacharla, and S.K. Reinhardt
`
`Session 9: Cache Design
`
A Case for Two-Way Skewed-Associative Caches .......... 169
`A. Seznec
`
Column-Associative Caches: A Technique for Reducing the Miss Rate of
Direct-Mapped Caches .......... 179
`A. Agarwal and S.D. Pudar
Cache Write Policies and Performance .......... 191
`N.P. Jouppi
`
Session 10: Evaluation of Machines I
`
Hierarchical Performance Modeling with MACS: A Case Study of the Convex C-240 .......... 203
`E.L. Boyd and E.S. Davidson
The Cedar System and an Initial Performance Study .......... 213
`D. Kuck, E. Davidson, D. Lawrie, A. Sameh, C.-Q. Zhu, A. Veidenbaum,
`J. Konicek, P. Yew, K. Gallivan, W. Jalby, H. Wijshoff, R. Bramley,
`U.M. Yang, P. Emrath, D. Padua, R. Eigenmann, J. Hoeflinger, G. Jaxon,
`Z. Li, T. Murphy, J. Andrews, and S. Turner
The J-Machine Multicomputer: An Architectural Evaluation .......... 224
`M.D. Noakes, D.A. Wallach, and W.J. Dally
`
`Session 11: Processor Architecture and Implementation
`
16-Bit vs. 32-Bit Instructions for Pipelined Microprocessors .......... 237
`J. Bunda, D. Fussell, R. Jenevein, and W.C. Athas
Register Connection: A New Approach to Adding Registers into Instruction
Set Architectures .......... 247
`T. Kiyohara, S. Mahlke, W. Chen, R. Bringmann, R. Hank,
`S. Anik, and W.-M. Hwu
A Comparison of Dynamic Branch Predictors that Use Two Levels of Branch History .......... 257
T.-Y. Yeh and Y.N. Patt
`
`
`
`
`Session 12: Multiprocessor Memory Systems
`
The Performance of Cache-Coherent Ring-Based Multiprocessors .......... 268
`L.A. Barroso and M. Dubois
`
Limitations of Cache Prefetching on a Bus-Based Multiprocessor .......... 278
`D.M. Tullsen and S.J. Eggers
Transactional Memory: Architectural Support for Lock-Free Data Structures .......... 289
`M. Herlihy and J.E.B. Moss
`
`Session 13: Panel Session II — Experimental Research: How Do We Measure Success?
`Y. Patt and R. Iyer (Co-Chairs)
`
Session 14: Evaluation of Machines II
`
Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine
and the CM-5 .......... 302
`E. Spertus, S.C. Goldstein, K.E. Schauser, T. von Eicken,
`D.E. Culler, and W.J. Dally
Improving AP1000 Parallel Computer Performance with Message Communication .......... 314
`T. Horie, K. Hayashi, T. Shimizu, and H. Ishihata
`
Session 15: Memory Systems and Interconnection
`
Performance of Cached DRAM Organizations in Vector Supercomputers .......... 327
`W.-C. Hsu and J.E. Smith
The Chinese Remainder Theorem and the Prime Memory System .......... 337
`Q.S. Gao
Odd Memory Systems May Be Quite Interesting .......... 341
`A. Seznec and J. Lenfant
A Comparison of Adaptive Wormhole Routing Algorithms .......... 351
`R.V. Boppana and S. Chalasani
`
Author Index .......... 361
`
`
`
`
`
`The Architecture of a Fault-Tolerant Cached RAID Controller
`
`Jai Menon and Jim Cortney
`
`IBM Almaden Research Center
`San Jose, California 95120-6099
`E-Mail: menonjm@almaden.ibm.com
`Telephone: (408) 927-2070
`
Abstract -- RAID-5 arrays need 4 disk accesses to update a data block -- 2 to read
old data and parity, and 2 to write new data and parity. Schemes previously proposed
to improve the update performance of such arrays are the Log-Structured File System
[10] and the Floating Parity Approach [6]. Here, we consider a third approach, called
Fast Write, which eliminates disk time from the host response time to a write, by
using a Non-Volatile Cache in the disk array controller. We examine three alternatives
for handling Fast Writes and describe a hierarchy of destage algorithms with increasing
robustness to failures. These destage algorithms are compared against those that would
be used by a disk controller employing mirroring. We show that array controllers
require considerably more (2 to 3 times more) bus bandwidth and memory bandwidth
than do disk controllers that employ mirroring. So, array controllers that use parity
are likely to be more expensive than controllers that do mirroring, though mirroring
is more expensive when both controllers and disks are considered.
`
1. Introduction

A disk array is a set of disk drives (and controller) which can automatically recover data when one (or more) drives in the set fails by using redundant data that is maintained by the controller on the drives. [8] describes five types of disk arrays called RAID-1 through RAID-5 and [2] describes a sixth type called a parity striped disk array. In this paper, our focus is on RAID-5 and/or parity striped disk arrays which employ a parity technique described in [1,8]. This technique requires fewer disks than mirroring and is therefore more acceptable in many situations.

The main drawback of such arrays is that they need four disk accesses to update a data block -- two to read old data and parity, and two to write new data and parity. [5] showed that the performance degradation can be quite severe in transaction processing environments. Two schemes that have been previously proposed to improve array update performance are the Log-Structured File System [10] and the Floating Parity Approach [6]. In this paper, we consider a third approach, called Fast Write, which eliminates disk time from the host response time to a write, by using Non-Volatile Storage (NVS) in the disk array controller. A block received from a host system is initially written to NVS in the disk array controller and a completion message is sent to the host system at this time. Actual destage of the block from NVS to disk is done asynchronously at a later time. We call a disk array that uses the Fast Write technique a Cached RAID.

The rest of this paper is organized as follows. We first review the parity technique. Then, we describe Fast Write. Next, we give an overview of the architecture of Hagar, a disk array controller prototype developed at the IBM Almaden Research Center. Hagar uses Fast Write. In the last sections of this report, we then analyze several alternatives for destaging blocks from NVS to disk. We show that destage algorithms must be carefully developed because of complex trade-offs between availability and performance goals.

2. Review of Parity Technique

We illustrate the parity technique on a disk array of six data disks and a parity disk. In this diagram, Pi is a parity block that protects the six data blocks labelled Di. Pi and the 6 Dis together constitute a parity group. The Pi of a parity group must always be equal to the parity of the 6 Di blocks in the same parity group as Pi.

    Data Disk 1:  D1 D2 D3 D4
    Data Disk 2:  D1 D2 D3 D4
    Data Disk 3:  D1 D2 D3 D4
    Data Disk 4:  D1 D2 D3 D4
    Data Disk 5:  D1 D2 D3 D4
    Data Disk 6:  D1 D2 D3 D4
    Parity Disk:  P1 P2 P3 P4

We show only one track (of 4 blocks) from each of the disks. In all, we show four parity groups. P1 contains the parity or exclusive OR of the blocks labeled D1 on all the data disks, P2 the exclusive OR of the
`
`
`
`
D2s, and so on. Such an array is robust against single disk crashes; if disk 1 were to fail, data on it can be recreated by reading data from the remaining five data disks and the parity disk and performing the appropriate exclusive OR operations.

Whenever the controller receives a request to write a data block, it must also update the corresponding parity block for consistency. If D1 is to be altered, the new value of P1 is calculated as:

    new P1 = (old D1 XOR new D1 XOR old P1)

Since the parity must be altered each time the data is modified, these arrays require four disk accesses to write a data block - two to read old data and parity, two to write new data and parity.
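A minimal sketch of this read-modify-write in C, assuming a 4 KB block size and illustrative helper names (xor_into, compute_new_parity):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096   /* assumed block size in bytes */

    /* XOR src into dst, byte by byte. */
    void xor_into(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] ^= src[i];
    }

    /*
     * Small-write parity update: new P1 = old D1 XOR new D1 XOR old P1.
     * old_d1 and old_p1 have already been read from disk (two reads);
     * after this computation new_d1 and new_p1 are written back (two
     * writes), giving the four disk accesses per update discussed above.
     */
    void compute_new_parity(uint8_t new_p1[BLOCK_SIZE],
                            const uint8_t old_p1[BLOCK_SIZE],
                            const uint8_t old_d1[BLOCK_SIZE],
                            const uint8_t new_d1[BLOCK_SIZE])
    {
        memcpy(new_p1, old_p1, BLOCK_SIZE);
        xor_into(new_p1, old_d1, BLOCK_SIZE);
        xor_into(new_p1, new_d1, BLOCK_SIZE);
    }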
3. Overview of the Fast Write Technique

In this technique, all disk array controller hardware such as processors, data memory (memory containing cached data blocks and other data buffers), and control memory (memory containing control structures such as request control blocks, cache directories, etc.) are divided into at least two disjoint sets, each set on a different power boundary. The data memory and the control memory are either battery-backed or built using NVS so they can survive power failures. When a disk block to be written to the disk array is received, the block is first written to data memory in the array controller, in two separate locations, on two different power boundaries. At this point, the disk array controller returns successful completion of the write to the host. In this way, from the host's point of view, the write has been completed quickly without requiring any disk access. Since two separate copies of the disk block are made in the disk array controller, no single hardware or power failure can cause a loss of data.
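A minimal sketch of this fast-write path; the region layout and names (nvs_region, fast_write) are illustrative assumptions rather than Hagar's actual interfaces:

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096            /* assumed block size */
    #define NVS_SLOTS  1024            /* assumed capacity per region */

    /* One battery-backed (or NVS) data-memory region; the controller has at
     * least two, each on a separate power boundary. */
    struct nvs_region {
        uint8_t blocks[NVS_SLOTS][BLOCK_SIZE];
    };

    struct array_controller {
        struct nvs_region power_boundary[2];
    };

    /* Fast Write: copy the incoming block into two NVS locations on two
     * different power boundaries, then acknowledge the host.  No disk I/O
     * sits on the host's critical path; destage happens asynchronously. */
    int fast_write(struct array_controller *ctl, int slot,
                   const uint8_t block[BLOCK_SIZE])
    {
        memcpy(ctl->power_boundary[0].blocks[slot], block, BLOCK_SIZE);
        memcpy(ctl->power_boundary[1].blocks[slot], block, BLOCK_SIZE);
        return 0;   /* successful completion is returned to the host here */
    }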
Disk blocks in array controller cache memory that need to be written to disk are called dirty. Such dirty blocks are written to disk in a process we call destaging. When a block is destaged to disk, it is also necessary to update, on disk, the parity block for the data block. This may require the array controller to read the old values of the data block and the parity block from disk, XOR them with the new value of the data block in cache, then write the new value of the data block and of the parity block to disk. Since many applications first read data before updating them, we expect that the old value of the data block might already be in array controller cache. Therefore, the more typical destage operation is expected to require one disk read and two disk writes.

3.1. Overview of Destage

Typically, the disk blocks in the disk array controller (both dirty and clean disk blocks) are organized in Least-Recently-Used (LRU) fashion. When space for a new disk block is needed in the cache, the LRU disk block in cache is examined. If it is clean, the space occupied by that disk block can be immediately used; if it is dirty, the disk block must be destaged before it can be used. While it is not necessary to postpone destaging a dirty block until it becomes the LRU block in the cache, the argument for doing so is that it could avoid unnecessary work. Consider that a particular disk block has the value d. If the host later writes to this disk block and changes its value to d', we would have a dirty block (d') in cache which would have to be destaged later. However, if the host writes to this disk block again, changing its value to d'', before d' became LRU and was destaged, we no longer need to destage d', thus avoiding some work.¹

¹ On the other hand, there are two copies of every dirty disk block in the cache. The longer we delay destaging the dirty blocks, the longer they occupy two cache locations.

When a block is ready to be destaged, the disk array controller may also decide to destage other dirty blocks in the cache that need to be written to the same track, or the same cylinder. This helps minimize disk arm motion, by clustering together many destages to the same disk arm position. However, this also means that some dirty blocks are destaged before they become the LRU disk block, since they will be destaged at the same time as some other dirty block that became LRU and that happened to be on the same track or cylinder. Therefore, the destage algorithm must be carefully chosen to trade off the reduction in destages that can be caused by overwrites of dirty blocks if we wait until dirty blocks become LRU versus the reduction in seeks that can be achieved if we destage multiple blocks at the same track or cylinder position together. An example compromise might be along the following lines: when a dirty block becomes LRU, destage it and all other dirty blocks on the same track (cylinder) as long as these other blocks are in the LRU half of the LRU chain of cached disk blocks.
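A minimal sketch of this compromise policy, with a hypothetical LRU chain representation and a destage_block stand-in for the actual disk write:

    #include <stdbool.h>
    #include <stddef.h>

    struct cache_block {
        int  track;                   /* track (or cylinder) this block maps to */
        bool dirty;
        struct cache_block *next_lru; /* next block toward the MRU end */
    };

    /* Provided elsewhere: issues the disk write (and parity update) for one block. */
    void destage_block(struct cache_block *b);

    /*
     * When the LRU block is dirty, destage it, and also destage any other
     * dirty block on the same track -- but only if that block lies in the
     * LRU half of the LRU chain, so blocks likely to be overwritten soon
     * are left in the cache.
     */
    void destage_lru_and_track_mates(struct cache_block *lru_head, size_t chain_len)
    {
        struct cache_block *victim = lru_head;
        if (victim == NULL || !victim->dirty)
            return;

        destage_block(victim);

        size_t pos = 1;
        for (struct cache_block *b = victim->next_lru;
             b != NULL && pos < chain_len / 2;
             b = b->next_lru, pos++) {
            if (b->dirty && b->track == victim->track)
                destage_block(b);   /* cluster writes at one disk arm position */
        }
    }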
In a practical implementation, we may have a background destage process that continually destages dirty blocks near the LRU end of the LRU list (and others on the same track or cylinder) so that a request that requires cache space (such as a host write that misses in the cache) does not have to wait for destaging to complete in order to find space in the cache. Another option is to trigger destages based on the fraction of dirty blocks in the cache. For example, if the fraction of dirty blocks in the cache exceeds some threshold (say 50%), we may trigger a destage of dirty blocks that are near the LRU end of the LRU chain (and
`
`
`
`
of other dirty blocks on the same tracks as these blocks). This destaging may continue until the number of dirty blocks in cache drops below some reverse threshold (say 40%).
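This trigger amounts to simple hysteresis on the dirty fraction. A sketch, using the 50% and 40% thresholds from the text and otherwise assumed names:

    #include <stdbool.h>

    #define HIGH_WATER 0.50   /* start background destaging above this dirty fraction */
    #define LOW_WATER  0.40   /* stop once the dirty fraction falls below this */

    struct cache_stats {
        unsigned dirty_blocks;
        unsigned total_blocks;
    };

    /* Returns true while the background destage process should keep running;
     * 'destaging' carries the hysteresis state between calls. */
    bool should_destage(const struct cache_stats *s, bool destaging)
    {
        if (s->total_blocks == 0)
            return false;
        double dirty_fraction = (double)s->dirty_blocks / (double)s->total_blocks;
        return destaging ? (dirty_fraction > LOW_WATER)    /* reverse threshold */
                         : (dirty_fraction > HIGH_WATER);  /* trigger threshold */
    }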
Since read requests to the disk are synchronous while destages to the disk are asynchronous, the best destage policy is one that minimizes any impact on read performance. Therefore, the disk controller might delay starting a destage until all waiting reads have completed to the disk, and it may even consider preempting a destage (particularly long destages of many tracks) for subsequent reads.

3.2. Summary of Fast Write Benefits

To summarize, Fast Write: will eliminate disk time from write response time as seen by the host; will eliminate some disk writes due to overwrites caused by later host writes to dirty blocks in cache; will reduce disk seeks because destages will be postponed until many destages can be done to a track or cylinder; and can convert small writes (single block) to large writes (all blocks in parity group) and thus eliminate many disk accesses. Work done by Joe Hyde [3] indicates that, for high-end IBM 370 processor workloads, anywhere from 30% to 60% of the writes to the disk controller cause overwrites of dirty blocks in cache. His work also indicates that even though the host predominantly issues single block writes, anywhere from 2 to 7 dirty blocks can be destaged together when a track is destaged. Together, these results indicate that Fast Write can be an effective technique for improving the write performance of disk arrays that use the parity technique.

4. Overview of Hagar

The Hagar prototype is designed to support very large amounts of disk storage (up to 1 Terabyte); to provide high bandwidth (100 MB/sec); to provide high IOs/sec (5000 IOs/sec at 4 Kbyte transfers); and to provide high availability. It provides for continuous operation through use of battery-backed memory, duplexed hardware components, multiple power boundaries, hot sparing of disks, on-the-fly rebuilding of data lost in a disk crash to a hot spare, and by permitting nondisruptive installation and removal of disks and hardware components.

Hagar is organized around checked and reliable control and data buses on a backplane. The structure of Hagar is shown in Figure 1. The data bus is optimized for high throughput on large data transfers and the control bus is optimized for efficient movement of small control transfers. The Hagar data bus is a multi-destination bus; a block received from the host system or from the disks can be placed in multiple data memory locations even though only one copy of the data block travels on the data bus.

In the idealized Hagar implementation, we would have processor cards, host interface cards, global data memory cards, global control memory cards, and disk controller cards attached to the reliable data and control buses. Cards of each type are divided into at least 2 disjoint sets; each set is on a different power boundary. The disk controller cards would attach to multiple disk strings over a serial link using a logical command structure such as SCSI. For availability reasons, the disks would be dual-ported and would each attach to two serial links originating from two different disk controllers. The data memory cards would provide battery-backed memory, accessible to all processors, for caching, fast write and data buffering. The control memory cards also provide battery-backed memory, accessible to all processors, used for control structures such as cache directories and lock tables. Unlike the data memory, the control memory provides efficient access to small amounts of data (bytes) and supports atomic operations necessary for synchronization between multiple processors.

The XOR hardware needed for performing parity operations is integrated with the data memory. We chose to integrate the XOR logic with the data memory to avoid spending bus bandwidth moving data to a separate XOR unit (such as that used in the Berkeley RAID-II design [9]) during XOR operations. The data memory in Hagar supports two kinds of store operations: a regular store operation and a special store & XOR operation. A store & XOR to location X takes the incoming data, XORs it with data at location X, and stores the result of the XOR back into location X.
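Functionally, the store & XOR operation behaves like the read-modify-write below, performed by the memory cards themselves rather than by software; the names and block size are illustrative assumptions:

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_SIZE 4096   /* assumed block size */

    /* Regular store: overwrite data-memory location X with the incoming block. */
    void store(uint8_t location_x[BLOCK_SIZE], const uint8_t incoming[BLOCK_SIZE])
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            location_x[i] = incoming[i];
    }

    /* Store & XOR: XOR the incoming block with the current contents of
     * location X and store the result back into X.  Because this happens
     * inside the data memory, the old contents of X never have to travel
     * across the data bus. */
    void store_and_xor(uint8_t location_x[BLOCK_SIZE], const uint8_t incoming[BLOCK_SIZE])
    {
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            location_x[i] ^= incoming[i];
    }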
5. Data Memory Management Algorithms

5.1. Four Logical Regions of Data Memory

The data memory in the disk array controller is divided into four logical regions: the free pool, the data cache, the parity cache and the buffer pool. When a block is written by the host, it is placed in the buffer pool, in two separate power boundaries. Subsequently, the two data blocks are moved into the data cache (this is a logical, not a physical move; that is, the cache directories are updated to reflect the fact that the disk block is in cache). After this logical move of the blocks into the data cache, the array controller returns "done" to the host system that did the write. At some subsequent time, the block D is destaged to disk. The data cache region of the data memory contains data blocks from disk and the parity cache region of the data memory contains parity blocks from disk. The parity blocks are useful during destage, since the presence of a parity block in the
`
`
`
`
parity cache would eliminate the need to read it from disk at destage time. There is some argument for not having a parity cache at all and to make the data cache larger. This is because parity blocks in the parity cache only help destage performance, whereas data blocks in the data cache can help both read performance (due to cache hits) and destage performance (by eliminating the need to read old data from disk). Furthermore, data blocks are brought into the data cache naturally as a result of host block requests; parity blocks, on the other hand, must be specially brought into the cache when a particular data block is read in the hope that the host will subsequently write the data block.

[Figure 1: Hagar Array Controller -- logical architecture. Processor cards (which handle requests, manage the cache, manage rebuild, direct the disk controllers, and build responses), host interface cards, global data stores, global control stores, and disk controllers with serial links to the disks all attach to checked, reliable control and data buses. Scaling notes: add microprocessors for performance, add global memory for performance, add host I/F cards for connectivity, add disk controllers for more disks, add disks for capacity.]
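As a rough illustration of the bookkeeping behind the "logical move" of Section 5.1, the sketch below tags each block-sized data-memory frame with a region and changes only the directory entry; the types and names are assumptions, not Hagar's actual data structures:

    #include <stdint.h>

    /* The four logical regions of controller data memory described in 5.1. */
    enum region { FREE_POOL, DATA_CACHE, PARITY_CACHE, BUFFER_POOL };

    struct frame {                 /* one block-sized slot of data memory */
        enum region region;
        uint64_t    disk_address;  /* which disk block it holds, if any */
    };

    /* "Move" one copy of a host-written block from the buffer pool into the
     * data cache.  No data is copied and nothing crosses the data bus: only
     * the region tag (i.e., the cache directory) changes.  The controller
     * does this for both copies, one per power boundary, before reporting
     * "done" to the host. */
    void logical_move_to_data_cache(struct frame *f, uint64_t disk_address)
    {
        f->region       = DATA_CACHE;
        f->disk_address = disk_address;
    }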
`
`
`
`
`
5.2. Details of Write Request Handling

When a block (value Y2 say) is written by the host, it is placed in the buffer pool, in two separate locations. Subsequently, the two copies of Y2 are moved into the data cache. At this point, it is possible that a previous clean version of this block, say Y1, is already in data cache. In this case, there are three different possibilities for what action to take.

The first possibility, which is the one we assume in the rest of this paper, is to leave the old value Y1 in data cache and also create two copies of the new value Y2, for a total of three data cache locations occupied. We call this the save old data method. The old value Y1 is not removed because it will be useful to us in calculating new parity when we are ready to destage Y2. Since the destage of Y2 may not happen until much later, we may be consuming an extra memory location for Y1 for a long time. We have found from simulations that the disk array controller will need about a 20% larger cache to hold the old values of data. A second possibility we considered was to remove Y1 from cache when Y2 is received, giving us a 20% larger effective cache. We call this the overwrite old data method. The drawback is that now, when we are ready to destage Y2, we will need to reaccess Y1 from disk. This possibility may be attractive if the increase in performance from the 20% larger effective data cache offsets the loss in performance due to the need to reaccess old data at destage time.

Finally, we considered and rejected the following third possibility. Instead of leaving the old value (say Y1) of the block in cache and creating two copies of the new value (say Y2) of the block (for a total of three memory locations occupied), XOR the old and new values of the block and store (Y1 XOR Y2) in one memory location and Y2 in a second memory location. We call this the save partial parity method. This has the advantage of requiring only 2 memory locations instead of 3; also we would have already done one of the XOR operations needed to generate new parity. At destage time, we would only need to read old parity, XOR it with (Y1 XOR Y2) to generate new parity, then write new parity to disk. However, the results of [3] indicate that there is a very high probability of receiving another write (say Y3) to the same disk location before we have had a chance to destage Y2. With our currently assumed approach (save old data method), we would merely overwrite the 2 memory locations containing Y2 with the new value Y3. However, if we went with an approach in which we had already XORed Y1 with Y2, we would need to first XOR Y2 to this result to get back Y1, then XOR the new value Y3 to get (Y1 XOR Y3). Because of this complication, we decided not to go with the save partial parity approach.
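A small sketch of the save old data bookkeeping on a host write (the method assumed in the rest of the paper); the structure and field names are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096             /* assumed block size */

    /* Cache state for one disk block under the "save old data" method. */
    struct cache_entry {
        bool    has_old;                /* clean value Y1 still held? */
        uint8_t old_value[BLOCK_SIZE];  /* Y1: avoids a disk read of old data at destage */
        uint8_t new_copy_a[BLOCK_SIZE]; /* Y2, copy on power boundary A */
        uint8_t new_copy_b[BLOCK_SIZE]; /* Y2, copy on power boundary B */
        bool    dirty;
    };

    /* Host writes value y2 to a block whose clean value Y1 may already be
     * cached.  Y1 is kept, and two copies of Y2 are stored; a later overwrite
     * (Y3) simply replaces both Y2 copies, with no XOR to undo. */
    void host_write_save_old_data(struct cache_entry *e,
                                  const uint8_t y2[BLOCK_SIZE])
    {
        memcpy(e->new_copy_a, y2, BLOCK_SIZE);
        memcpy(e->new_copy_b, y2, BLOCK_SIZE);
        e->dirty = true;
        /* e->has_old and e->old_value are left untouched if Y1 was cached. */
    }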
5.3. Organization of Data Cache

There are three types of disk blocks in the data cache - type d, type d', and type d''. A particular block in the data cache is of type d if its value is the same as the value of this block on the disk - in other words, it is a clean block. Blocks of type d' and of type d'' are both dirty blocks. If a block of type d is in the cache and a new block is written by the host to the same disk location, we will create two new blocks of type d'; that is, the cache now contains a block of type d (old value of block) and 2 blocks of type d' (2 copies of new value of block). Only blocks of type d' are destaged from cache to disk. Type d'' is a temporary classification to deal with new host writes received while a block of type d' is being destaged. When a block of type d' is being destaged, it is possible to receive another write from the host to the same disk location. If the host write had been received before the destage started, we would have merely overwritten the dirty block in cache with the new one received, and made the new one received of type d'. However, once we have started the destage and are committed to doing the destage, we mark any new block received to the same disk location as being of type d'' (alternatively, we could reject the request). Once a block of type d' is destaged, it becomes a block of type d. At this time, any blocks of type d'' for the disk location just destaged may be reclassified as blocks of type d'.
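These classification rules read naturally as a small state machine. A sketch of one possible per-location encoding, with assumed names and with the two physical copies of each dirty block elided:

    #include <stdbool.h>

    /* d = clean; d' = dirty and eligible for destage; d'' = dirty, received
     * while a d' block for the same disk location was already being destaged. */
    enum block_type { TYPE_D, TYPE_D_PRIME, TYPE_D_DOUBLE_PRIME };

    struct location_state {
        enum block_type cached;    /* type of the block cached for this location */
        bool destage_in_progress;  /* a d' block for this location is being written */
    };

    /* Host writes a new value for this disk location. */
    void on_host_write(struct location_state *s)
    {
        if (s->destage_in_progress)
            s->cached = TYPE_D_DOUBLE_PRIME;  /* do not disturb the committed destage */
        else
            s->cached = TYPE_D_PRIME;         /* new (or overwritten) dirty block */
    }

    /* The destage of this location's d' block has completed. */
    void on_destage_complete(struct location_state *s)
    {
        s->destage_in_progress = false;
        if (s->cached == TYPE_D_DOUBLE_PRIME)
            s->cached = TYPE_D_PRIME;         /* waiting d'' block is promoted to d' */
        else
            s->cached = TYPE_D;               /* destaged block is now clean (type d) */
    }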
6. Destage Algorithms

If a dirty disk block is destaged to disk, we must also calculate and write the corresponding parity block in order to keep the parity group consistent. When a disk block from a parity group is to be destaged, we lock the parity group for the duration of the destage. The parity group is unlocked only after the disk block and the parity block are both written to disk and the parity group is consistent on disk. The parity group lock prevents more than one destage from being in progress simultaneously to any one parity group. While not explicitly referred to in the algorithms that follow, a parity group is locked before a destage begins and is unlocked after the destage completes.
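A minimal sketch of this locking discipline (lock the parity group, write the data and parity blocks, then unlock); a pthread mutex stands in here for whatever lock primitive the control memory actually provides:

    #include <pthread.h>

    struct parity_group {
        pthread_mutex_t lock;   /* permits only one destage per parity group at a time */
    };

    /* Provided elsewhere: the actual disk writes performed by the destage. */
    void write_data_block_to_disk(int group_id);
    void write_parity_block_to_disk(int group_id);

    /* The group stays locked until both the data block and the parity block
     * are on disk, so the parity group is consistent on disk whenever the
     * lock is free. */
    void destage_with_group_lock(struct parity_group *pg, int group_id)
    {
        pthread_mutex_lock(&pg->lock);
        write_data_block_to_disk(group_id);
        write_parity_block_to_disk(group_id);
        pthread_mutex_unlock(&pg->lock);
    }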
We begin by considering the case where only one of the data blocks of a parity group is dirty in the data cache and needs to be destaged; later we will also consider cases where more than one block of a parity group needs to be destaged. To simplify the discussion, we assume that when a dirty block is to be destaged, other blocks of the parity group are not in the data cache even in clean form. We also assume
`
`
`
`
that the old value of the dirty block is not in cache and needs to be read from disk. Both these assumptions will be relaxed in later sections of this paper.

6.1. Two Data Copies Method (Method 1)

The first part of Figure 2 shows the simplest option available to us in order to destage a dirty block (labelled D1' in the figure). In this figure, the dotted line separates two different power boundaries in the array controller, and we see that the two different copies of D1' are on two different power boundaries. Also, the solid horizontal line separates the array controller from the disk drives themselves. The figure shows six data disk blocks D1, D2, ... D6, on six different disks and a seventh parity disk block P on a seventh disk drive. These seven disk blocks on seven different disks constitute the parity group of interest. D1' is an updated value of block D1 which is to be destaged to disk.

In this option, block D1 and block P are both read from disk and XORed directly into one of the two D1' locations in controller memory (this would use the store & XOR feature of the