`
`The 22nd Annual International
`
`Symposium on
`
`COMPUTER ARCHITECTURE
`
`June 22-24, 1995
`
Santa Margherita Ligure, Italy
`
`Sponsored by
`
`ACM SIGARCH
`IEEE Computer Society, TCCA
With the support of SGS-Thomson Microelectronics and Olivetti
`In cooperation with the University of Genoa
`
`
`
`
`
`
`
`
`
`
`
`
`
The Association for Computing Machinery
`1515 Broadway
`New York, N.Y. 10036
`
Copyright © 1995 by the Association for Computing Machinery, Inc. Copying without fee
is permitted provided that the copies are not made or distributed for direct commercial
advantage, and credit to the source is given. Abstracting with credit is permitted. For other
copying of articles that carry a code at the bottom of the first or last page, copying is permitted
provided that the per-copy fee indicated in the code is paid through the Copyright
Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For permission to republish
write to: Director of Publications, Association for Computing Machinery. To copy otherwise
or republish, requires a fee and/or specific permission.
`
IEEE Catalog Number 95CB35801
ISBN 0-89791-698-0 (softbound)
ISBN 0-7803-3000-5 (casebound)
ISBN 0-7803-3001-3 (microfiche)
Library of Congress Number 85-642899
ISSN: 1063-6897
`
`
`Additional copies may be ordered prepaid from:
`
`ACM Order Department
`P.O. Box 12114
`Church Street Station
`New York, N.Y. 10257
`
IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
Los Alamitos, CA 90720
`
`Phone: 1-800-342-6626
`(U.S.A. and Canada)
`1-212-626-0500
`(All other countries)
`Fax: 1-212-944-1318
`E-mail: acmhelp@acm.org
`acm_europe@acm.org (in Europe)
`
`Printed in the U.S.A.
`
`
`
`
`Table of Contents
`
General Chair's Message .......... v
Program Chair's Message .......... vi
Organizing Committees .......... vii
Referees .......... ix
`
`Session 1: Multiprocessors and Applications
The MIT Alewife Machine: Architecture and Performance .......... 2
A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, D. Yeung
The EM-X Parallel Computer: Architecture and Basic Performance .......... 14
Y. Kodama, H. Sakane, M. Sato, H. Yamana, S. Sakai, Y. Yamaguchi
The SPLASH-2 Programs: Characterization and Methodological Considerations .......... 24
S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, A. Gupta
`
`Session 2A: Cache Coherence
Efficient Strategies for Software-Only Directory Protocols in Shared-Memory Multiprocessors .......... 38
H. Grahn, P. Stenstrom
Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors .......... 48
A. R. Lebeck, D. A. Wood
Boosting the Performance of Hybrid Snooping Cache Protocols .......... 60
F. Dahlgren
`
`Session 2B: Interconnect Technology and I/O
S-Connect: from Networks of Workstations to Supercomputer Performance .......... 71
A. G. Nowatzyk, M. C. Browne, E. J. Kelly, M. Parkin
Destage Algorithms for Disk Arrays with Non-volatile Caches .......... 83
A. Varma, Q. Jacobson
Evaluating Multi-Port Frame Buffer Designs for a Mesh-Connected Multicomputer .......... 96
G. Stoll, B. Wei, D. Clark, E. W. Felten, K. Li, P. Hanrahan
Are Crossbars Really Dead? The Case for Optical Multiprocessor Interconnect Systems .......... 106
A. G. Nowatzyk, P. R. Prucnal
`
`Session 3: Instruction Level Parallelism
Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor .......... 117
`S. Jourdan, P. Sainrat, D. Litaize
`
`HPE, Exh. 1014, p. 3
`
`HPE, Exh. 1014, p. 3
`
`
`
`
`
Unconstrained Speculative Execution with Predicated State Buffering .......... 126
H. Ando, C. Nakanishi, T. Hara, M. Nakaya
A Comparison of Full and Partial Predicated Execution Support for ILP Processors .......... 138
S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, W. W. Hwu

Session 4A: New Microarchitectures
Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High Performance RISC Microprocessor .......... 151
M. Simone, A. Essen, A. Ike, A. Krishnamoorthy, T. Maruyama, N. Patkar, M. Ramaswami, M. Shebanow, V. Thirumalaiswamy, D. Tovey
Performance Evaluation of the PowerPC 620 Microarchitecture .......... 163
T. A. Diep, C. Nelson, J. P. Shen
`
Session 4B: Managing Memory Hierarchies
Reducing TLB and Memory Overhead using Online Superpage Promotion .......... 176
T. H. Romer, W. H. Ohlrich, A. R. Karlin, B. N. Bershad
Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching .......... 188
Z. Zhang, J. Torrellas
`
Session 5A: Interconnection Network Routing
An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA .......... 201
Anjan K. V., T. M. Pinkston
Analysis and Implementation of Hybrid Switching .......... 211
K. G. Shin, S. W. Daniel
Configurable Flow Control Mechanisms for Fault-Tolerant Routing .......... 220
B. V. Dao, J. Duato, S. Yalamanchili
NIFDY: A Low Overhead, High Throughput Network Interface .......... 230
T. Callahan, S. C. Goldstein
`
`
Session 5B: Novel Memory Access Mechanisms
Vector Multiprocessors with Arbitrated Memory Access .......... 243
M. Peiron, M. Valero, E. Ayguadé, T. Lang
Design of Cache Memories for Multi-Threaded Dataflow Architecture .......... 253
K. M. Kavi, A. R. Hurson, P. Patadia, E. Abraham, P. Shanmugam
Skewed Associativity Enhances Performance Predictability .......... 265
F. Bodin, A. Seznec
`
`Session 6: Branch Prediction
`
A Comparative Analysis of Schemes for Correlated Branch Prediction .......... 276
C. Young, N. Gloy, M. D. Smith
`
`
`
`
Next Cache Line and Set Prediction .......... 287
B. Calder, D. Grunwald

Session 7A: System Evaluation
A Comparison of Architectural Support for Messaging on the TMC CM-5 and Cray T3D .......... 298
V. Karamcheti, A. A. Chien
Optimizing Memory System Performance for Communication in Parallel Computers .......... 308
T. Stricker, T. Gross
Empirical Evaluation of the CRAY-T3D: A Compiler Perspective .......... 320
R. H. Arpaci, D. E. Culler, A. Krishnamurthy, S. G. Steinberg, K. Yelick

Session 7B: Instruction Fetching
Optimization of Instruction Fetch Mechanisms for High Issue Rates .......... 333
T. M. Conte, K. N. Menezes, P. M. Mills, B. A. Patel
Instruction Fetching: Coping with Code Bloat .......... 345
R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, J. Emer
Instruction Cache Fetch Policies for Speculative Execution .......... 357
D. Lee, J.-L. Baer, B. Calder, D. Grunwald
`Session 8: Caches
Streamlining Data Cache Access with Fast Address Calculation .......... 369
T. M. Austin, D. N. Pnevmatikatos, G. S. Sohi
CAT — Caching Address Tags: A Technique for Reducing Area Cost of On-Chip Caches .......... 381
H. Wang, T. Sun, Q. Yang

Session 9: Processor Architecture
Simultaneous Multithreading: Maximizing On-Chip Parallelism .......... 392
D. M. Tullsen, S. J. Eggers, H. M. Levy
Architecture Validation for Processors .......... 404
R. C. Ho, C. H. Yang, M. A. Horowitz, D. L. Dill
Multiscalar Processors .......... 414
G. S. Sohi, S. E. Breach, T. N. Vijaykumar
`
`
`
`
`
`Destage Algorithms for Disk Arrays with Non-Volatile Caches
`
`Anujan Varma and Quinn Jacobson
`Computer Engineering Department
`University of California
`Santa Cruz, CA 95064
`
`Abstract
`
In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests from the host, disk utilization, and their ability to tolerate bursts in the workload without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance. An approximate implementation of the linear-threshold scheduling algorithm is also described. The approximate algorithm can be implemented with much lower overhead, yet its performance is virtually identical to that of the ideal algorithm.
`
I. INTRODUCTION
`
A disk array, in general, consists of a group of disk drives together with an associated controller function, organized logically as a single I/O device [2], [4]. Disk arrays, also known as RAIDs (Redundant Arrays of Inexpensive Disks), are capable of providing improved levels of reliability, availability, and/or performance over single disks. A disk array usually provides protection against loss of data from a disk failure by maintaining redundant information within the array. Moreover, data availability can be maintained on a disk failure by using the redundant information to reconstruct data stored on the failed disk in real time. In addition to improving reliability and availability, disk arrays also improve the performance of the storage system by distributing data across multiple disk drives — this is the result of either concurrency in servicing multiple I/O requests or parallelism in the data transfer for a single I/O request.
`This research is supported by NSF Young Investigator Award
`No. MIP-9257103.
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ISCA '95, Santa Margherita Ligure, Italy
© 1995 ACM 0-89791-698-0/95/0006...$3.50
`
` 83
`
Several types of disk array configurations are known in the literature [2], [4], [17]. These configurations vary primarily in the redundancy scheme employed and the data distribution (striping) scheme used to map logical blocks among the individual disks. In a seminal paper, Patterson, Gibson, and Katz [11] introduced a taxonomy of disk arrays, consisting of six different types (or “levels”), which is now widely used in the industry. Among these RAID levels, of particular interest are RAID-4 and RAID-5, which are optimized for transaction processing workloads. These arrays employ coarse-grained striping of data so that small requests can benefit from the concurrency in servicing multiple requests, while a large request can achieve high throughput by transferring data from multiple disks in parallel. RAID-4 organizes the disks in the array into parity groups, with one dedicated parity disk in each group. This has the disadvantage of the parity disks becoming a bottleneck while updating data in the system. RAID-5 eliminates this bottleneck by distributing parity blocks uniformly among all the disks in the group.
Both RAID-4 and RAID-5 provide reliability and availability at a fraction of the storage overhead incurred in disk mirroring. This reduction in storage overhead, however, is achieved at the expense of increasing the number of disk accesses necessary to update data in the system. Every write request to the array that does not update an entire stripe must now update parity by reading the old data and the old parity, and exclusive-ORing them with the new data. This involves a total of four disk accesses — reading the old data and parity, and writing the new data and parity. Menon and Mattson [7] showed that this overhead can degrade the performance of a RAID-5 considerably in a transaction processing environment where small requests dominate the workload. In addition, a small increase in the ratio of writes to reads in the workload can lead to a drastic increase in the response time for both reads and writes.
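The read-modify-write sequence can be made concrete with a short sketch (an illustration added here, not code from the paper; the disk_read/disk_write primitives, the block type, and the 4 KB block size are assumptions):

    /* Illustrative sketch of the four disk accesses for a RAID-5 small
     * write.  disk_read, disk_write, and block_t are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096
    typedef uint8_t block_t[BLOCK_SIZE];

    extern void disk_read(int disk, long blkno, block_t buf);
    extern void disk_write(int disk, long blkno, const block_t buf);

    /* Update one data block and its parity block in a RAID-5 stripe. */
    void raid5_small_write(int data_disk, int parity_disk, long blkno,
                           const block_t new_data)
    {
        block_t old_data, old_parity, new_parity;

        disk_read(data_disk, blkno, old_data);      /* access 1: read old data   */
        disk_read(parity_disk, blkno, old_parity);  /* access 2: read old parity */

        /* new parity = old parity XOR old data XOR new data */
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];

        disk_write(data_disk, blkno, new_data);     /* access 3: write new data   */
        disk_write(parity_disk, blkno, new_parity); /* access 4: write new parity */
    }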
Several solutions have been proposed to reduce the overhead for small writes in a RAID-5 [8], [9], [15]. One approach is parity logging [15], where parity updates are posted into a dedicated log disk instead of updating the parity blocks in the array; updates of parity blocks in the array are performed in the background. This has the advantage of converting small writes of parity into large writes; the scheme also captures some of the temporal locality in parity updates since successive updates of parity can be combined into a single update in the disk array. Another scheme that reduces the overhead in parity updates is floating parity [9], where parity blocks are dynamically remapped within disk cylinders to reduce the rotational latency between reading and writing parity.
Both parity logging and floating parity only attempt to reduce the overhead of parity updates. In both schemes, the old data must be read from the disk and the new data written before signaling completion of the I/O transaction.
`
`
A more general scheme is to use a nonvolatile write cache to reduce write latency. Writes can now be deemed complete after writing the new data into the cache, an operation often referred to as fast write [8]. Both the data and the parity blocks on the disks can then be updated in the background. The process of updating data or parity in the disks from the write cache is referred to as destaging. In addition to the write cache, a larger (volatile) read cache may be used to improve read performance.

A nonvolatile write cache has several advantages [8], the most important of which is the substantially lower service time seen by write requests to the array. In addition, the write cache can also exploit any locality in writes — both temporal and spatial — in the workload. Temporal locality is exploited by capturing successive updates of the same block in the cache. Spatial locality allows many small writes to be aggregated into a single large write to the disk. Finally, the write cache can also lower the response time for read requests serviced by the disks because of the reduced contention with writes to the disk.
`
In spite of the above advantages, a nonvolatile write cache introduces several new problems. First, the write cache must be designed to be at least as reliable as the redundancy scheme used for the disks. The IBM Hagar disk array [8], for example, used duplicate caches with independent power boundaries to protect the contents of the write cache. Second, data losses could still occur while updating the disks from the cache because of a failure in the cache memory, disks, or the datapath between them. The destage algorithm must be designed to leave data in a consistent state should a failure occur while updating the disks. Several such algorithms to prevent data losses while destaging are presented in [8].
A third problem in the use of a write cache is that of scheduling the destages, that is, determining when to destage data from the cache to the disk, as well as which of the dirty blocks is to be destaged next. Ideally, in a disk array with a write cache, the disks see only two types of requests — (i) read requests generated by the workload that miss in the read cache, and (ii) read and write requests originated by the array controller for destaging data from the write cache. Requests of the first category must be serviced as soon as possible while destage requests can usually be serviced in the background. Ideally, destages should appear transparent to the user requests received by the array. In practice, however, because of the non-preemptive nature of disk service, some interference between user requests and destage requests is unavoidable. For example, a read request received while a destage request is being serviced is made to wait until the latter is completed. This interference, however, can be minimized with careful scheduling of the destages.
In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests, as well as their ability to tolerate large bursts of write requests without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance.
The rest of this paper is organized as follows: Section II summarizes the tradeoffs involved in the design of the destage algorithm and describes the algorithms studied in the paper. Section III introduces the disk array model and workload model used in our simulations. Section IV presents simulation results for the algorithms and compares the algorithms in terms of read performance, disk utilization, and tolerance to bursts of write requests. Section V concludes the paper with directions for future research.
`
`II. DESTAGE ALGORITHMS
`
In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously, while read requests from the host system are serviced in the foreground. Thus, the workload seen by an individual disk predominantly consists of (i) reads from the host system that cannot be serviced from the read cache or write cache, and (ii) read and write requests generated by the array controller as part of destage operations. We refer to requests of the first category as host reads and those of the second category as destages. Reads of data or parity information performed as part of a destage are referred to as destage reads and the writes as destage writes. A write request from the host is never directed to the disk except in the event of a write cache overflow or when the size of the block being written is large enough to justify bypassing the write cache.

If host reads are given higher priority over destages, a destage would never be initiated by the array controller when there are requests in the read queue. Still, considerable flexibility exists in scheduling the destages, as the updates to the disk need not be performed in the order the writes were posted to the cache. In addition, the four disk accesses involved in a disk update — reads of old parity and data, and writes of new parity and data — need not be performed as a single indivisible operation.
`
`A. Algorithm Design Tradeoffs
We first examine the tradeoffs involved in the design of the destage algorithm. Assuming that no overflow of the write cache occurs and none of the write requests bypasses the cache, all the write requests from the host system are serviced by the cache. In this case, the scheduling algorithm employed has no direct effect on the response time for a write to the disk array. However, the scheduling algorithm can still affect the response time seen by the host reads serviced by the disks, as well as the maximum sustainable throughput of the array. A scheduling algorithm that optimizes destages reduces the load on the disks due to destages, thus improving their read performance.
There are three ways in which a scheduling algorithm can improve the performance of the disk array: First, the algorithm can reduce the number of destages by capturing most of the re-writes in the write cache, thus exploiting the temporal locality in the workload. This requires maintaining blocks that are likely to be re-written in the near future in the write cache and destaging them only when additional re-writes are unlikely to occur soon. Similarly, if parity is allowed to be cached, successive updates of blocks in the same parity group generate only a single update of the parity block on disk.
Second, the number of destages can also be reduced by aggregating blocks that lie physically close on a disk and destaging them as a large read and/or write. This exploits any spatial locality present in the workload.
Finally, the scheduling algorithm can reduce the average time for a destage by ordering destage requests to the disks such that the service times in the individual disks are minimized. Since the mechanical positioning delays usually dominate the service time, this calls for ordering the requests to minimize the positioning delays in the disk. Many such algorithms for scheduling requests in individual disks taking into account the seek time and the rotational latency of requests have been described in the literature [3], [5], [6], [14], [16]. While similar algorithms could be used to schedule destage requests in a disk array, a basic distinction exists between the problem of scheduling destages in a disk array and that of scheduling host requests in an independent disk. A disk scheduling algorithm for host requests must make a tradeoff between maximizing throughput and providing fairness to user requests; for example, servicing the user requests in FCFS (first-come, first-served) order maximizes fairness, while a scheduling algorithm that reorders requests to minimize the total service time achieves maximum throughput at the expense of increasing the variance of the response time [5]. The destage algorithm in a disk array, on the other hand, does not have to deal with this dilemma: In most cases, the algorithm can schedule destages to maximize throughput without causing starvation of user requests.
All the above objectives in the design of the scheduling algorithm suggest maintaining the occupancy of the write cache close to its capacity. This allows maximum benefit to be obtained from any locality present in the workload and maximizes scheduling flexibility. Maintaining the write cache close to full, however, makes it vulnerable to overflow. Even a short burst of writes in the workload may cause the cache to overflow, forcing subsequent writes to bypass the cache until some destages can be posted to disk. Thus, the scheduling algorithm must strike a compromise between the conflicting goals of being able to exploit maximum locality and scheduling flexibility for destages, and preventing frequent overflows of the write cache.
The scheduling algorithm may be designed to be either work-conserving or non-work-conserving with respect to the disks in the array. A work-conserving algorithm never allows a disk to be idle when one or more destages are pending to the disk. A non-work-conserving algorithm, on the contrary, may refrain from scheduling a destage to the disk even when it is idle, expecting that the same destage could be performed at a lower cost in the future; the goal is to minimize the response time for host reads by reducing the contention between host reads and destages. Both types of algorithms are studied in this paper.
Thus, in summary, the following are the parameters that can be used by the destage algorithm to determine the block to be destaged next (a data-structure sketch follows the list):
1. The probability of the block to be re-written in the near future. This factor is usually accommodated in the algorithm by ordering blocks in the destage queue based on the time of their last access.
2. The number of blocks to be read/updated on the same track (or cylinder) to take advantage of spatial locality.
3. Service times of the requests in the destage queue. The service time includes the delay in positioning the head at the beginning of the block to be read/written and the data transfer time. Different levels of approximations could be used to estimate the positioning delays in the disk [16].
4. The current level of occupancy of the cache.
In addition, other factors such as the type of destage request (parity or data, read or write), may also be used in the scheduling process.
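As a rough illustration (a sketch added here, not a structure from the paper; all field names are hypothetical), the per-request information implied by these parameters might be collected as:

    /* Hypothetical per-request record for a destage scheduler; the
     * numbers in the comments refer to the parameters listed above. */
    struct destage_request {
        long   block_no;          /* target block on the disk                  */
        long   track_no;          /* lets same-track destages be coalesced (2) */
        double last_access_time;  /* recency, a proxy for re-write odds (1)    */
        double est_service_time;  /* positioning delay + transfer time (3)     */
        int    is_parity;         /* parity vs. data destage                   */
        int    is_read;           /* destage read vs. destage write            */
    };

    /* Parameter (4), the cache occupancy, is global state of the write
     * cache, e.g. (double)dirty_blocks / (double)cache_blocks. */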
In the following, we describe four scheduling algorithms for destaging blocks from the write cache in a RAID-5. In all the algorithms, destaging to each disk is handled independently by maintaining a separate queue of pending destages for each disk. In addition, in all the algorithms, a destage to a disk is initiated only when there are no host reads queued for that disk at that time. However, not all the algorithms schedule a destage whenever the read queue is empty. The algorithms differ in the criteria used to determine whether a destage is to be initiated when the disk becomes idle after servicing a request, and in the function used to select the block to be destaged.
`
`B. Least-Cost Scheduling
`
This algorithm is modeled after the well-known shortest-seek-time-first disk scheduling algorithm [3]. After every disk operation, the queue of eligible destages is examined and the request that takes the shortest access time is performed. In addition, to exploit spatial locality, if there are pending destage requests to other blocks on the same track, these are also performed as one large read or write.

The access time of a destage read or write operation includes the seek time (or head-switch time), rotational latency, and controller delays. Precise estimation of these delays is difficult due to many factors such as the physical-to-logical mapping of sectors on the disk and the use of caching within the disk itself. Considerable improvement can still be obtained using approximate models for estimating the delays involved. We implemented such a scheme in our RAID simulator for comparison with other schemes. Details of implementation of the scheme can be found in Section III.
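A minimal sketch of the selection loop is given below, reusing the hypothetical destage_request record from Section A and assuming an est_access_time() helper that approximates seek (or head-switch) time plus rotational latency from the current head position:

    /* Assumed cost model: estimated access time of request r when the
     * head is at head_pos. */
    extern double est_access_time(const struct destage_request *r, long head_pos);

    /* Scan the per-disk destage queue and return the cheapest request. */
    struct destage_request *
    least_cost_select(struct destage_request *queue, int n, long head_pos)
    {
        struct destage_request *best = NULL;
        double best_cost = 1e30;

        for (int i = 0; i < n; i++) {          /* examine the entire queue */
            double c = est_access_time(&queue[i], head_pos);
            if (c < best_cost) {
                best_cost = c;
                best = &queue[i];
            }
        }
        /* The caller also issues any pending destages on best->track_no
         * so that they complete as one large read or write. */
        return best;
    }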
`
`C. High/Low Mark Algorithm
`
This algorithm is designed after the cache purging algorithm proposed by Biswas, Ramakrishnan and Towsley [1] for disks with a nonvolatile write cache. In this algorithm, two cache-occupancy thresholds are used to enable and disable destages. Destages to the disk are disabled when the cache occupancy drops below a “low mark” and turned on when it increases over a “high mark.” Some hysteresis between the two thresholds is used to prevent short bursts in the workload from causing a cache overflow. We chose the high threshold as 70% of cache capacity and the low threshold as 30%. Destage requests were selected using the least-cost policy to minimize the service time of individual disk accesses. In addition, when a block is selected for destage, all pending destage requests to the same track on the disk are scheduled together.
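The gating logic amounts to simple hysteresis; the sketch below (assuming a cache_occupancy() helper that returns a fraction in [0, 1]) illustrates it with the 70%/30% thresholds above:

    #define HIGH_MARK 0.70   /* enable destages above this occupancy  */
    #define LOW_MARK  0.30   /* disable destages below this occupancy */

    extern double cache_occupancy(void);   /* assumed helper, in [0,1] */

    static int destaging_enabled = 0;

    int should_destage(void)
    {
        double occ = cache_occupancy();

        if (occ >= HIGH_MARK)
            destaging_enabled = 1;   /* turn destages on at the high mark */
        else if (occ <= LOW_MARK)
            destaging_enabled = 0;   /* turn them off at the low mark     */

        /* Between the marks the previous state persists (hysteresis). */
        return destaging_enabled;
    }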
`5
`
`85
`
`HPE, Exh. 1014, p. 8
`
`HPE, Exh. 1014, p. 8
`
`
`
`MottLe
`
`
`
`
`D. Linear Threshold Scheduling
The basic principle of this algorithm is to match the rate of destages from the cache to the current level of occupancy of the cache, accelerating destages gradually as the cache occupancy increases and slowing them as the occupancy falls. The primary difference between this algorithm and high/low mark scheduling is that, instead of on/off thresholds, the rate of destages is changed gradually as a function of cache occupancy.

As in other algorithms, each disk in the array is scheduled independently. Both parity and data destages are treated in a similar manner. The destage queue is examined after each disk operation if the host read queue is empty. As in the case of least-cost scheduling, the block to be destaged is chosen as the one with minimum cost among all the eligible destage requests to that disk. However, unlike in least-cost scheduling, the destage is performed only if its cost is within a certain threshold maintained by the algorithm. Should a read request from the host arrive while the destage is in progress, this policy minimizes its waiting time.

The role of the threshold is to reduce the interference between destages and host reads; the threshold sets an upper bound on the waiting time of a read arriving while a destage is in progress. A tradeoff exists in the choice of the threshold: If the threshold is set too low, the destage rate may not be adequate to clear the cache fast enough, causing frequent overflows. A large threshold, on the other hand, may perform destages too soon, degrading both the waiting time for host reads and the hit rate of the write cache, and reducing the level of temporal and spatial localities that may potentially be exploited.

In our implementation, the threshold was selected as a linear function of the instantaneous cache occupancy. The cost of a destage is its service time, computed as in the case of the least-cost scheduling algorithm. As in other algorithms, when a destage request is scheduled, all pending destage requests to the same track are also performed along with it.

The linear threshold scheduling algorithm uses the cache occupancy, service time of destages, and spatial locality as its decision-making criteria, but does not attempt to maximize temporal locality explicitly. Thus, a block that has been written recently may be scheduled for destage if the destage can be performed “cheaply.” This may increase the total number of destage accesses to the disk in comparison to servicing the requests in LRU order, but the total service time can still be lower as a result of minimizing the cost of individual destage requests. Note that temporal locality for host reads can still be exploited by keeping the destaged block in the write cache until the block needs to be replaced.
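The following sketch illustrates the policy, reusing the helpers from the earlier sketches; the linear coefficients MIN_COST and MAX_COST are arbitrary placeholders, not the values used in the simulations of Section IV:

    #define MIN_COST  0.0   /* threshold (ms) when the cache is empty */
    #define MAX_COST 40.0   /* threshold (ms) when the cache is full  */

    extern double cache_occupancy(void);   /* assumed helper, in [0,1] */

    /* Returns the request to destage, or NULL to leave the disk idle. */
    struct destage_request *
    linear_threshold_select(struct destage_request *queue, int n, long head_pos)
    {
        struct destage_request *best = least_cost_select(queue, n, head_pos);

        /* Threshold grows linearly with the instantaneous occupancy. */
        double threshold = MIN_COST + (MAX_COST - MIN_COST) * cache_occupancy();

        if (best != NULL && est_access_time(best, head_pos) <= threshold)
            return best;   /* cheap enough for the current occupancy */
        return NULL;       /* non-work-conserving: wait for a cheaper chance */
    }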
`
`E. Approximation to Linear Threshold Scheduling
A major difficulty in implementing both the least-cost scheduling and the linear threshold scheduling algorithms is the need to scan the entire queue of destage requests to select the minimum-cost request. This operation must be performed at the completion of every disk service. Furthermore, the cost of individual destage requests must be computed afresh if the position of the head has changed.

The need for searching the entire destage queue can be overcome by resorting to an approximate implementation of linear threshold scheduling; the approach works by dividing the disk into regions and maintaining a separate queue of destage requests into each region. When the disk becomes idle, the queues are searched in the order of the closest to the farthest region relative to the current position of the head. The first non-empty queue is found and the request at its head is selected as the minimum-cost destage. As in the ideal linear-threshold scheduling algorithm, the request is scheduled if its estimated service time is within the threshold determined by the current cache occupancy level. The requests within each of the destage queues can be ordered using the LRU policy to allow maximum temporal locality. As in the other scheduling algorithms, when a request is selected for destage, all the pending destage requests to the same track are scheduled along with it.
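A sketch of the region-queue search appears below; the region count and the queue helpers are illustrative assumptions:

    #define NREGIONS 16   /* number of cylinder bands; an illustrative choice */

    extern struct destage_request *region_queue_head(int region); /* NULL if empty */
    extern int region_of(long head_pos);   /* region containing the head */

    struct destage_request *approx_select(long head_pos)
    {
        int home = region_of(head_pos);

        /* Search region queues outward from the head's current position. */
        for (int d = 0; d < NREGIONS; d++) {
            struct destage_request *r;
            int lo = home - d, hi = home + d;

            if (lo >= 0 && (r = region_queue_head(lo)) != NULL)
                return r;   /* LRU head of the nearest non-empty queue */
            if (d > 0 && hi < NREGIONS && (r = region_queue_head(hi)) != NULL)
                return r;
        }
        return NULL;        /* no destages pending anywhere */
    }

The request returned is then admitted or deferred exactly as in the exact algorithm, by comparing its estimated service time against the occupancy-dependent threshold.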