Proceedings

The 22nd Annual International Symposium on
COMPUTER ARCHITECTURE

June 22-24, 1995
Santa Margherita Ligure, Italy

Sponsored by
ACM SIGARCH
IEEE Computer Society, TCCA
With support of SGS-Thomson Microelectronics and Olivetti
In cooperation with the University of Genoa
The Association for Computing Machinery
1515 Broadway
New York, N.Y. 10036
Copyright © 1995 by the Association for Computing Machinery, Inc. Copying without fee
is permitted provided that the copies are not made or distributed for direct commercial
advantage, and credit to the source is given. Abstracting with credit is permitted. For other
copying of articles that carry a code at the bottom of the first or last page, copying is permitted
provided that the per-copy fee indicated in the code is paid through the Copyright
Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For permission to republish,
write to: Director of Publications, Association for Computing Machinery. To copy otherwise,
or republish, requires a fee and/or specific permission.

IEEE Catalog Number 95CB35801
ISBN 0-89791-698-0 (softbound)
ISBN 0-7803-3000-5 (casebound)
ISBN 0-7803-3001-3 (microfiche)
Library of Congress Number 85-642899
ISSN: 1063-6897
Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 12114
Church Street Station
New York, N.Y. 10257

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
Los Alamitos, CA 90720

Phone: 1-800-342-6626 (U.S.A. and Canada)
       1-212-626-0500 (all other countries)
Fax: 1-212-944-1318
E-mail: acmhelp@acm.org
        acm_europe@acm.org (in Europe)

Printed in the U.S.A.
Table of Contents

General Chairs' Message ......... v
Program Chair's Message ......... vi
Organizing Committees ......... vii
Referees ......... ix
Session 1: Multiprocessors and Applications
The MIT Alewife Machine: Architecture and Performance ......... 2
A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz,
B.-H. Lim, K. Mackenzie, D. Yeung
The EM-X Parallel Computer: Architecture and Basic Performance ......... 14
Y. Kodama, H. Sakane, M. Sato, H. Yamana, S. Sakai, Y. Yamaguchi
The SPLASH-2 Programs: Characterization and Methodological Considerations ......... 24
S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, A. Gupta

Session 2A: Cache Coherence
Efficient Strategies for Software-Only Directory Protocols in Shared-Memory
Multiprocessors ......... 38
H. Grahn, P. Stenstrom
Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory
Multiprocessors ......... 48
A. R. Lebeck, D. A. Wood
Boosting the Performance of Hybrid Snooping Cache Protocols ......... 60
F. Dahlgren

Session 2B: Interconnect Technology and I/O
S-Connect: From Networks of Workstations to Supercomputer Performance ......... 71
A. G. Nowatzyk, M. C. Browne, E. J. Kelly, M. Parkin
Destage Algorithms for Disk Arrays with Non-volatile Caches ......... 83
A. Varma, Q. Jacobson
Evaluating Multi-Port Frame Buffer Designs for a Mesh-Connected Multicomputer ......... 96
G. Stoll, B. Wei, D. Clark, E. W. Felten, K. Li, P. Hanrahan
Are Crossbars Really Dead? The Case for Optical Multiprocessor Interconnect Systems ......... 106
A. G. Nowatzyk, P. R. Prucnal

Session 3: Instruction Level Parallelism
Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor ......... 117
S. Jourdan, P. Sainrat, D. Litaize
Unconstrained Speculative Execution with Predicated State Buffering ......... 126
H. Ando, C. Nakanishi, T. Hara, M. Nakaya
A Comparison of Full and Partial Predicated Execution Support for ILP Processors ......... 138
S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, W. W. Hwu

Session 4A: New Microarchitectures
Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High
Performance RISC Microprocessor ......... 151
M. Simone, A. Essen, A. Ike, A. Krishnamoorthy, T. Maruyama, N. Patkar,
M. Ramaswami, M. Shebanow, V. Thirumalaiswamy, D. Tovey
Performance Evaluation of the PowerPC 620 Microarchitecture ......... 163
T. A. Diep, C. Nelson, J. P. Shen

Session 4B: Managing Memory Hierarchies
Reducing TLB and Memory Overhead using Online Superpage Promotion ......... 176
T. H. Romer, W. H. Ohlrich, A. R. Karlin, B. N. Bershad
Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory
Binding and Group Prefetching ......... 188
Z. Zhang, J. Torrellas

Session 5A: Interconnection Network Routing
An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA ......... 201
Anjan K. V., T. M. Pinkston
Analysis and Implementation of Hybrid Switching ......... 211
K. G. Shin, S. W. Daniel
Configurable Flow Control Mechanisms for Fault-Tolerant Routing ......... 220
B. V. Dao, J. Duato, S. Yalamanchili
NIFDY: A Low Overhead, High Throughput Network Interface ......... 230
T. Callahan, S. C. Goldstein
Session 5B: Novel Memory Access Mechanisms
Vector Multiprocessors with Arbitrated Memory Access ......... 243
M. Peiron, M. Valero, E. Ayguadé, T. Lang
Design of Cache Memories for Multi-Threaded Dataflow Architecture ......... 253
K. M. Kavi, A. R. Hurson, P. Patadia, E. Abraham, P. Shanmugam
Skewed Associativity Enhances Performance Predictability ......... 265
F. Bodin, A. Seznec

Session 6: Branch Prediction
A Comparative Analysis of Schemes for Correlated Branch Prediction ......... 276
C. Young, N. Gloy, M. D. Smith
Next Cache Line and Set Prediction ......... 287
B. Calder, D. Grunwald

Session 7A: System Evaluation
A Comparison of Architectural Support for Messaging on the TMC CM-5 and Cray T3D ......... 298
V. Karamcheti, A. A. Chien
Optimizing Memory System Performance for Communication in Parallel Computers ......... 308
T. Stricker, T. Gross
Empirical Evaluation of the CRAY-T3D: A Compiler Perspective ......... 320
R. H. Arpaci, D. E. Culler, A. Krishnamurthy, S. G. Steinberg, K. Yelick

Session 7B: Instruction Fetching
Optimization of Instruction Fetch Mechanisms for High Issue Rates ......... 333
T. M. Conte, K. N. Menezes, P. M. Mills, B. A. Patel
Instruction Fetching: Coping with Code Bloat ......... 345
R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, J. Emer
Instruction Cache Fetch Policies for Speculative Execution ......... 357
D. Lee, J.-L. Baer, B. Calder, D. Grunwald
Session 8: Caches
Streamlining Data Cache Access with Fast Address Calculation ......... 369
T. M. Austin, D. N. Pnevmatikatos, G. S. Sohi
CAT — Caching Address Tags: A Technique for Reducing Area Cost of On-Chip Caches ......... 381
H. Wang, T. Sun, Q. Yang

Session 9: Processor Architecture
Simultaneous Multithreading: Maximizing On-Chip Parallelism ......... 392
D. M. Tullsen, S. J. Eggers, H. M. Levy
Architecture Validation for Processors ......... 404
R. C. Ho, C. H. Yang, M. A. Horowitz, D. L. Dill
Multiscalar Processors ......... 414
G. S. Sohi, S. E. Breach, T. N. Vijaykumar
(entry illegible in scan) ......... 426
Destage Algorithms for Disk Arrays with Non-Volatile Caches

Anujan Varma and Quinn Jacobson
Computer Engineering Department
University of California
Santa Cruz, CA 95064
Abstract

In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests from the host, disk utilization, and their ability to tolerate bursts in the workload without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance. An approximate implementation of the linear-threshold scheduling algorithm is also described. The approximate algorithm can be implemented with much lower overhead, yet its performance is virtually identical to that of the ideal algorithm.
I. INTRODUCTION
A disk array, in general, consists of a group of disk drives together with an associated controller function, organized logically as a single I/O device [2], [4]. Disk arrays, also known as RAIDs (Redundant Arrays of Inexpensive Disks), are capable of providing improved levels of reliability, availability, and/or performance over single disks. A disk array usually provides protection against loss of data from a disk failure by maintaining redundant information within the array. Moreover, data availability can be maintained on a disk failure by using the redundant information to reconstruct data stored on the failed disk in real time. In addition to improving reliability and availability, disk arrays also improve the performance of the storage system by distributing data across multiple disk drives — this is the result of either concurrency in servicing multiple I/O requests or parallelism in the data transfer for a single I/O request.
This research is supported by NSF Young Investigator Award No. MIP-9257103.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ISCA '95, Santa Margherita Ligure, Italy
© 1995 ACM 0-89791-698-0/95/0006...$3.50
Several types of disk array configurations are known in the literature [2], [4], [17]. These configurations vary primarily in the redundancy scheme employed and the data distribution (striping) scheme used to map logical blocks among the individual disks. In a seminal paper, Patterson, Gibson, and Katz [11] introduced a taxonomy of disk arrays, consisting of six different types (or "levels"), which is now widely used in the industry. Among these RAID levels, of particular interest are RAID-4 and RAID-5, which are optimized for transaction processing workloads. These arrays employ coarse-grained striping of data so that small requests can benefit from the concurrency in servicing multiple requests, while a large request can achieve high throughput by transferring data from multiple disks in parallel. RAID-4 organizes the disks in the array into parity groups, with one dedicated parity disk in each group. This has the disadvantage of the parity disks becoming a bottleneck while updating data in the system. RAID-5 eliminates this bottleneck by distributing parity blocks uniformly among all the disks in the group.

Both RAID-4 and RAID-5 provide reliability and availability at a fraction of the storage overhead incurred in disk mirroring. This reduction in storage overhead, however, is achieved at the expense of increasing the number of disk accesses necessary to update data in the system. Every write request to the array that does not update an entire stripe must now update parity by reading the old data and the old parity, and exclusive-ORing them with the new data. This involves a total of four disk accesses — reading the old data and parity, and writing the new data and parity. Menon and Mattson [7] showed that this overhead can degrade the performance of a RAID-5 considerably in a transaction processing environment where small requests dominate the workload. In addition, a small increase in the ratio of writes to reads in the workload can lead to a drastic increase in the response time for both reads and writes.
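[Editor's sketch] To make the four-access small write concrete, the following is a minimal Python sketch, not the paper's implementation. The read_block/write_block callables and the argument shapes are assumptions for illustration; the XOR identity (new parity = old parity XOR old data XOR new data) is exactly the update described above.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """Bytewise XOR of two equal-sized blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    def raid5_small_write(read_block, write_block,
                          data_disk, parity_disk, addr, new_data):
        """The four disk accesses of a RAID-5 small write."""
        old_data = read_block(data_disk, addr)      # access 1: read old data
        old_parity = read_block(parity_disk, addr)  # access 2: read old parity
        # Fold the data change into the parity: P_new = P_old ^ D_old ^ D_new.
        new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
        write_block(data_disk, addr, new_data)      # access 3: write new data
        write_block(parity_disk, addr, new_parity)  # access 4: write new parity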
Several solutions have been proposed to reduce the overhead for small writes in a RAID-5 [8], [9], [15]. One approach is parity logging [15], where parity updates are posted into a dedicated log disk instead of updating the parity blocks in the array; updates of parity blocks in the array are performed in the background. This has the advantage of converting small writes of parity into large writes; the scheme also captures some of the temporal locality in parity updates since successive updates of parity can be combined into a single update in the disk array. Another scheme that reduces the overhead in parity updates is floating parity [9], where parity blocks are dynamically remapped within disk cylinders to reduce the rotational latency between reading and writing parity.

Both parity logging and floating parity only attempt to reduce the overhead of parity updates. In both schemes, the old data must be read from the disk and the new data written before signaling completion of the I/O transaction.
A more general scheme is to use a nonvolatile write cache to reduce write latency. Writes can now be deemed complete after writing the new data into the cache, an operation often referred to as fast write [8]. Both the data and the parity blocks on the disks can then be updated in the background. The process of updating data or parity in the disks from the write cache is referred to as destaging. In addition to the write cache, a larger (volatile) read cache may be used to improve read performance.

A nonvolatile write cache has several advantages [8], the most important of which is the substantially lower service time seen by write requests to the array. In addition, the write cache can also exploit any locality in writes — both temporal and spatial — in the workload. Temporal locality is exploited by capturing successive updates of the same block in the cache. Spatial locality allows many small writes to be aggregated into a single large write to the disk. Finally, the write cache can also lower the response time for read requests serviced by the disks because of the reduced contention with writes to the disk.
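[Editor's sketch] As a rough illustration of the fast-write path just described (not the paper's design; the dict-based cache and all names are assumptions), a host write completes as soon as the data is in the nonvolatile cache, and re-writes of the same block coalesce in place:

    class NVWriteCache:
        """Toy nonvolatile write cache: host writes complete at cache speed."""

        def __init__(self, capacity_blocks: int):
            self.capacity = capacity_blocks
            self.dirty = {}  # block address -> data awaiting background destage

        def fast_write(self, addr, data) -> bool:
            """Return True if the write completed in the cache ("fast write")."""
            if addr in self.dirty or len(self.dirty) < self.capacity:
                self.dirty[addr] = data  # re-writes of a block coalesce here
                return True
            return False  # cache full: the write must bypass the cache or wait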
In spite of the above advantages, a nonvolatile write cache introduces several new problems. First, the write cache must be designed to be at least as reliable as the redundancy scheme used for the disks. The IBM Hagar disk array [8], for example, used duplicate caches with independent power boundaries to protect the contents of the write cache. Second, data losses could still occur while updating the disks from the cache because of a failure in the cache memory, disks, or the datapath between them. The destage algorithm must be designed to leave data in a consistent state should a failure occur while updating the disks. Several such algorithms to prevent data losses while destaging are presented in [8].
A third problem in the use of a write cache is that of scheduling the destages, that is, determining when to destage data from the cache to the disk, as well as which of the dirty blocks is to be destaged next. Ideally, in a disk array with a write cache, the disks see only two types of requests — (i) read requests generated by the workload that miss in the read cache, and (ii) read and write requests originated by the array controller for destaging data from the write cache. Requests of the first category must be serviced as soon as possible while destage requests can usually be serviced in the background. Ideally, destages should appear transparent to the user requests received by the array. In practice, however, because of the non-preemptive nature of disk service, some interference between user requests and destage requests is unavoidable. For example, a read request received while a destage request is being serviced is made to wait until the latter is completed. This interference, however, can be minimized with careful scheduling of the destages.
In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests, as well as their ability to tolerate large bursts of write requests without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance.

The rest of this paper is organized as follows: Section II summarizes the tradeoffs involved in the design of the destage algorithm and describes the algorithms studied in the paper. Section III introduces the disk array model and workload model used in our simulations. Section IV presents simulation results for the algorithms and compares the algorithms in terms of read performance, disk utilization, and tolerance to bursts of write requests. Section V concludes the paper with directions for future research.
II. DESTAGE ALGORITHMS
In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously, while read requests from the host system are serviced in the foreground. Thus, the workload seen by an individual disk predominantly consists of (i) reads from the host system that cannot be serviced from the read cache or write cache, and (ii) read and write requests generated by the array controller as part of destage operations. We refer to requests of the first category as host reads and those of the second category as destages. Reads of data or parity information performed as part of a destage are referred to as destage reads and the writes as destage writes. A write request from the host is never directed to the disk except in the event of a write cache overflow or when the size of the block being written is large enough to justify bypassing the write cache.

If host reads are given higher priority over destages, a destage would never be initiated by the array controller when there are requests in the read queue. Still, considerable flexibility exists in scheduling the destages, as the updates to the disk need not be performed in the order the writes were posted to the cache. In addition, the four disk accesses involved in a disk update — reads of old parity and data, and writes of new parity and data — need not be performed as a single indivisible operation.
A. Algorithm Design Tradeoffs

We first examine the tradeoffs involved in the design of the destage algorithm. Assuming that no overflow of the write cache occurs and none of the write requests bypasses the cache, all the write requests from the host system are serviced by the cache. In this case, the scheduling algorithm employed has no direct effect on the response time for a write to the disk array. However, the scheduling algorithm can still affect the response time seen by the host reads serviced by the disks, as well as the maximum sustainable throughput of the array. A scheduling algorithm that optimizes destages reduces the load on the disks due to destages, thus improving their read performance.

There are three ways in which a scheduling algorithm can improve the performance of the disk array: First, the algorithm can reduce the number of destages by capturing most of the re-writes in the write cache, thus exploiting the temporal locality in the workload. This requires maintaining blocks that are likely to be re-written in the near future in the write cache and destaging them only when additional re-writes are unlikely to occur soon.
Similarly, if parity is allowed to be cached, successive updates of blocks in the same parity group generate only a single update of the parity block on disk.

Second, the number of destages can also be reduced by aggregating blocks that lie physically close on a disk and destaging them as a large read and/or write. This exploits any spatial locality present in the workload.

Finally, the scheduling algorithm can reduce the average time for a destage by ordering destage requests to the disks such that the service times in the individual disks are minimized. Since the mechanical positioning delays usually dominate the service time, this calls for ordering the requests to minimize the positioning delays in the disk. Many such algorithms for scheduling requests in individual disks, taking into account the seek time and the rotational latency of requests, have been described in the literature [3], [5], [6], [14], [16]. While similar algorithms could be used to schedule destage requests in a disk array, a basic distinction exists between the problem of scheduling destages in a disk array and that of scheduling host requests in an independent disk. A disk scheduling algorithm for host requests must make a tradeoff between maximizing throughput and providing fairness to user requests; for example, servicing the user requests in FCFS (first-come, first-served) order maximizes fairness, while a scheduling algorithm that reorders requests to minimize the total service time achieves maximum throughput at the expense of increasing the variance of the response time [5]. The destage algorithm in a disk array, on the other hand, does not have to deal with this dilemma: in most cases, the algorithm can schedule destages to maximize throughput without causing starvation of user requests.

All the above objectives in the design of the scheduling algorithm suggest maintaining the occupancy of the write cache close to its capacity. This allows maximum benefit to be obtained from any locality present in the workload and maximizes scheduling flexibility. Maintaining the write cache close to full, however, makes it vulnerable to overflow. Even a short burst of writes in the workload may cause the cache to overflow, forcing subsequent writes to bypass the cache until some destages can be posted to disk. Thus, the scheduling algorithm must strike a compromise between the conflicting goals of being able to exploit maximum locality and scheduling flexibility for destages, and preventing frequent overflows of the write cache.

The scheduling algorithm may be designed to be either work-conserving or non-work-conserving with respect to the disks in the array. A work-conserving algorithm never allows a disk to be idle when one or more destages are pending to the disk. A non-work-conserving algorithm, on the contrary, may refrain from scheduling a destage to the disk even when it is idle, expecting that the same destage could be performed at a lower cost in the future; the goal is to minimize the response time for host reads by reducing the contention between host reads and destages. Both types of algorithms are studied in this paper.

Thus, in summary, the following are the parameters that can be used by the destage algorithm to determine the block to be destaged next:

1. The probability of the block being re-written in the near future. This factor is usually accommodated in the algorithm by ordering blocks in the destage queue based on the time of their last access.
2. The number of blocks to be read/updated on the same track (or cylinder) to take advantage of spatial locality.
3. Service times of the requests in the destage queue. The service time includes the delay in positioning the head at the beginning of the block to be read/written and the data transfer time. Different levels of approximations could be used to estimate the positioning delays in the disk [16].
4. The current level of occupancy of the cache.

In addition, other factors such as the type of destage request (parity or data, read or write) may also be used in the scheduling process.

In the following, we describe four scheduling algorithms for destaging blocks from the write cache in a RAID-5. In all the algorithms, destaging to each disk is handled independently by maintaining a separate queue of pending destages for each disk. In addition, in all the algorithms, a destage to a disk is initiated only when there are no host reads queued for that disk at that time. However, not all of the algorithms always schedule a destage when the read queue is empty. The algorithms differ in the criteria used to determine whether a destage is to be initiated when the disk becomes idle after servicing a request, and in the function used to select the block to be destaged.
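[Editor's sketch] The per-disk structure common to all four algorithms, as just described, can be sketched as follows. The Disk fields and the policy callable are assumptions for illustration, not the authors' simulator code:

    def on_disk_idle(disk, select_destage):
        """Called when a disk finishes a request and has nothing in service.

        disk.read_queue (a collections.deque) holds pending host reads;
        select_destage() embodies one of the policies in Sections B-E and
        may decline to issue anything (non-work-conserving behavior).
        """
        if disk.read_queue:
            return disk.read_queue.popleft()  # host reads always come first
        return select_destage(disk)           # may return None: disk stays idle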
B. Least-Cost Scheduling

This algorithm is modeled after the well-known shortest-seek-time-first disk scheduling algorithm [3]. After every disk operation, the queue of eligible destages is examined and the request that takes the shortest access time is performed. In addition, to exploit spatial locality, if there are pending destage requests to other blocks on the same track, these are also performed as one large read or write.

The access time of a destage read or write operation includes the seek time (or head-switch time), rotational latency, and controller delays. Precise estimation of these delays is difficult due to many factors such as the physical-to-logical mapping of sectors on the disk and the use of caching within the disk itself. Considerable improvement can still be obtained using approximate models for estimating the delays involved. We implemented such a scheme in our RAID simulator for comparison with other schemes. Details of implementation of the scheme can be found in Section III.
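[Editor's sketch] A minimal sketch of the least-cost selection step under stated assumptions: requests carry a track attribute, and estimate_access_time(head, request) stands in for the approximate delay model the text defers to Section III.

    def select_least_cost(destage_queue, head_position, estimate_access_time):
        """Pick the cheapest eligible destage plus its same-track companions."""
        if not destage_queue:
            return None
        best = min(destage_queue,
                   key=lambda r: estimate_access_time(head_position, r))
        # Spatial locality: fold in every pending destage on the same track
        # so they are serviced together as one large read or write.
        batch = [r for r in destage_queue if r.track == best.track]
        for r in batch:
            destage_queue.remove(r)
        return batch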
C. High/Low Mark Algorithm

This algorithm is designed after the cache purging algorithm proposed by Biswas, Ramakrishnan, and Towsley [1] for disks with a nonvolatile write cache. In this algorithm, two cache-occupancy thresholds are used to enable and disable destages. Destages to the disk are disabled when the cache occupancy drops below a "low mark" and turned on when it increases over a "high mark." Some hysteresis between the two thresholds is used to prevent short bursts in the workload from causing a cache overflow. We chose the high threshold as 70% of cache capacity and the low threshold as 30%. Destage requests were selected using the least-cost policy to minimize the service time of individual disk accesses. In addition, when a block is selected for destage, all pending destage requests to the same track on the disk are scheduled together.
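[Editor's sketch] The on/off hysteresis described above, using the 70%/30% marks the authors report choosing; the class shape itself is an assumption:

    class HighLowMark:
        """Enable destages above the high mark, disable below the low mark."""

        def __init__(self, cache_capacity, high_frac=0.70, low_frac=0.30):
            self.high = high_frac * cache_capacity
            self.low = low_frac * cache_capacity
            self.enabled = False

        def destaging_enabled(self, occupancy) -> bool:
            if occupancy >= self.high:
                self.enabled = True    # cache filling up: start destaging
            elif occupancy <= self.low:
                self.enabled = False   # cache drained: stop destaging
            # Between the marks the previous state persists (hysteresis),
            # so short write bursts do not toggle destaging on and off.
            return self.enabled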
D. Linear Threshold Scheduling

The basic principle of this algorithm is to match the rate of destages from the cache to the current level of occupancy of the cache, accelerating destages gradually as the cache occupancy increases and slowing them as the occupancy falls. The primary difference between this algorithm and high/low mark scheduling is that, instead of on/off thresholds, the rate of destages is changed gradually as a function of cache occupancy.

As in other algorithms, each disk in the array is scheduled independently. Both parity and data destages are treated in a similar manner. The destage queue is examined after each disk operation if the host read queue is empty. As in the case of least-cost scheduling, the block to be destaged is chosen as the one with minimum cost among all the eligible destage requests to that disk. However, unlike in least-cost scheduling, the destage is performed only if its cost is within a certain threshold maintained by the algorithm. Should a read request from the host arrive while the destage is in progress, this policy minimizes its waiting time.

The role of the threshold is to reduce the interference between destages and host reads; the threshold sets an upper bound on the waiting time of a read arriving while a destage is in progress. A tradeoff exists in the choice of the threshold: if the threshold is set too low, the destage rate may not be adequate to clear the cache fast enough, causing frequent overflows. A large threshold, on the other hand, may perform destages too soon, degrading both the waiting time for host reads and the hit rate of the write cache, and reducing the level of temporal and spatial localities that may potentially be exploited.

In our implementation, the threshold was selected as a linear function of the instantaneous cache occupancy. The cost of a destage is its service time, computed as in the case of the least-cost scheduling algorithm. As in other algorithms, when a destage request is scheduled, all pending destage requests to the same track are also performed along with it.

The linear threshold scheduling algorithm uses the cache occupancy, service time of destages, and spatial locality as its decision-making criteria, but does not attempt to maximize temporal locality explicitly. Thus, a block that has been written recently may be scheduled for destage if the destage can be performed "cheaply." This may increase the total number of destage accesses to the disk in comparison to servicing the requests in LRU order, but the total service time can still be lower as a result of minimizing the cost of individual destage requests. Note that temporal locality for host reads can still be exploited by keeping the destaged block in the write cache until the block needs to be replaced.
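[Editor's sketch] The linear form of the threshold (proportional to occupancy) is what the text states; the particular scale max_threshold is an assumed parameter, and estimate_access_time is the same stand-in cost model as in the least-cost sketch:

    def linear_threshold_pick(destage_queue, head_position, occupancy,
                              capacity, estimate_access_time,
                              max_threshold=40.0):
        """Return the destage to issue now, or None to leave the disk idle."""
        if not destage_queue:
            return None
        best = min(destage_queue,
                   key=lambda r: estimate_access_time(head_position, r))
        cost = estimate_access_time(head_position, best)
        # Threshold grows linearly with occupancy: a near-empty cache
        # tolerates only very cheap destages, while a near-full cache
        # destages almost anything to avoid overflow.
        threshold = max_threshold * (occupancy / capacity)
        if cost <= threshold:
            destage_queue.remove(best)
            return best
        return None  # non-work-conserving: wait for a cheaper opportunity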
E. Approximation to Linear Threshold Scheduling

A major difficulty in implementing both the least-cost scheduling and the linear threshold scheduling algorithms is the need to scan the entire queue of destage requests to select the minimum-cost request. This operation must be performed at the completion of every disk service. Furthermore, the cost of individual destage requests must be computed afresh if the position of the head has changed.

The need for searching the entire destage queue can be overcome by resorting to an approximate implementation of linear threshold scheduling; the approach works by dividing the disk into regions and maintaining a separate queue of destage requests into each region. When the disk becomes idle, the queues are searched in the order of the closest to the farthest region relative to the current position of the head. The first non-empty queue is found and the request at its head is selected as the minimum-cost destage. As in the ideal linear-threshold scheduling algorithm, the request is scheduled if its estimated service time is within the threshold determined by the current cache occupancy level. The requests within each of the destage queues can be ordered using the LRU policy to allow maximum temporal locality. As in the other scheduling algorithms, when a request is selected for destage, all the pending destage requests to the same track are also scheduled along with it.
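[Editor's sketch] The region-queue mechanism just described, sketched under stated assumptions: the region count and the accept predicate (the linear-threshold test applied by the caller) are illustrative, not from the paper.

    from collections import deque

    class RegionQueues:
        """Approximate least-cost selection via per-region destage queues."""

        def __init__(self, num_regions: int = 16):
            self.num_regions = num_regions
            self.queues = [deque() for _ in range(num_regions)]

        def add(self, request):
            # LRU order within a region: newly written blocks go to the tail,
            # so the head of each queue is the least recently written block.
            self.queues[request.region].append(request)

        def pick(self, head_region: int, accept):
            """Probe regions nearest-first from the head's region; issue the
            head of the first non-empty queue if accept() (the linear
            threshold test) allows it."""
            for distance in range(self.num_regions):
                for region in {(head_region + distance) % self.num_regions,
                               (head_region - distance) % self.num_regions}:
                    queue = self.queues[region]
                    if queue:
                        if accept(queue[0]):
                            return queue.popleft()
                        return None  # nearest candidate too costly: stay idle
            return None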
