`
The 22nd Annual International
Symposium on
COMPUTER ARCHITECTURE

June 22-24, 1995
Santa Margherita Ligure, Italy
`
`Sponsored by
`
`ACM SIGARCH
`IEEE Computer Society, TCCA
With the support of SGS-Thomson Microelectronics and Olivetti
`In cooperation with the University of Genoa
`
`
`HPE, Exh. 1012, p. 1
`
`
`
`
`The Association for Computing Machinery
`1515 Broadway
`New York, N.Y. 10036
`
Copyright © 1995 by the Association for Computing Machinery, Inc. Copying without fee is permitted provided that the copies are not made or distributed for direct commercial advantage, and credit to the source is given. Abstracting with credit is permitted. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For permission to republish, write to: Director of Publications, Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
`
IEEE Catalog Number: 95CB35801
ISBN 0-89791-698-0 (softbound)
ISBN 0-7803-3000-5 (casebound)
ISBN 0-7803-3001-3 (microfiche)
Library of Congress Number: 85-642899
ISSN: 1063-6897
`
`Additional copies may be ordered prepaid from:
`
`ACM Order Department
`P.O. Box 12114
`Church Street Station
`New York, N.Y. 10257
`
`IEEE Computer Society Press
`Customer Service Center
`10662 Los Vaqueros Circle
`Los Alamitos, CA 90720
`
`Phone: 1-800-342-6626
`(U.S.A. and Canada)
`1-212-626-0500
`(All other countries)
`Fax: 1-212-944-1318
`E-mail: acmhelp@acm.org
`acm_europe@acm.org (in Europe)
`
`Printed in the U.S.A.
`
`
`
`
Table of Contents

General Chair's Message .......... vi
Program Chair's Message .......... vii
Organizing Committees .......... ix
Referees

Session 1: Multiprocessors and Applications
The MIT Alewife Machine: Architecture and Performance .......... 2
  A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, D. Yeung
The EM-X Parallel Computer: Architecture and Basic Performance .......... 14
  Y. Kodama, H. Sakane, M. Sato, H. Yamana, S. Sakai, Y. Yamaguchi
The SPLASH-2 Programs: Characterization and Methodological Considerations .......... 24
  S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, A. Gupta

Session 2A: Cache Coherence
Efficient Strategies for Software-Only Directory Protocols in Shared-Memory Multiprocessors .......... 38
  H. Grahn, P. Stenstrom
Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors .......... 48
  A. R. Lebeck, D. A. Wood
Boosting the Performance of Hybrid Snooping Cache Protocols .......... 60
  F. Dahlgren

Session 2B: Interconnect Technology and I/O
S-Connect: from Networks of Workstations to Supercomputer Performance .......... 71
  A. G. Nowatzyk, M. C. Browne, E. J. Kelly, M. Parkin
Destage Algorithms for Disk Arrays with Non-volatile Caches .......... 83
  A. Varma, Q. Jacobson
Evaluating Multi-Port Frame Buffer Designs for a Mesh-Connected Multicomputer .......... 96
  G. Stoll, B. Wei, D. Clark, E. W. Felten, K. Li, P. Hanrahan
Are Crossbars Really Dead? The Case for Optical Multiprocessor Interconnect Systems .......... 106
  A. G. Nowatzyk, P. R. Prucnal

Session 3: Instruction Level Parallelism
Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor .......... 117
  S. Jourdan, P. Sainrat, D. Litaize
`
`
`
`
Unconstrained Speculative Execution with Predicated State Buffering .......... 126
  H. Ando, C. Nakanishi, T. Hara, M. Nakaya
A Comparison of Full and Partial Predicated Execution Support for ILP Processors .......... 138
  S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, W. W. Hwu

Session 4A: New Microarchitectures
Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High Performance RISC Microprocessor .......... 151
  M. Simone, A. Essen, A. Ike, A. Krishnamoorthy, T. Maruyama, N. Patkar, M. Ramaswami, M. Shebanow, V. Thirumalaiswamy, D. Tovey
Performance Evaluation of the PowerPC 620 Microarchitecture .......... 163
  T. A. Diep, C. Nelson, J. P. Shen

Session 4B: Managing Memory Hierarchies
Reducing TLB and Memory Overhead using Online Superpage Promotion .......... 176
  T. H. Romer, W. H. Ohlrich, A. R. Karlin, B. N. Bershad
Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching .......... 188
  Z. Zhang, J. Torrellas

Session 5A: Interconnection Network Routing
An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA .......... 201
  Anjan K. V., T. M. Pinkston
Analysis and Implementation of Hybrid Switching .......... 211
  K. G. Shin, S. W. Daniel
Configurable Flow Control Mechanisms for Fault-Tolerant Routing .......... 220
  B. V. Dao, J. Duato, S. Yalamanchili
NIFDY: A Low Overhead, High Throughput Network Interface .......... 230
  T. Callahan, S. C. Goldstein

Session 5B: Novel Memory Access Mechanisms
Vector Multiprocessors with Arbitrated Memory Access .......... 243
  M. Peiron, M. Valero, E. Ayguade, T. Lang
Design of Cache Memories for Multi-Threaded Dataflow Architecture .......... 253
  K. M. Kavi, A. R. Hurson, P. Patadia, E. Abraham, P. Shanmugam
Skewed Associativity Enhances Performance Predictability .......... 265
  F. Bodin, A. Seznec

Session 6: Branch Prediction
A Comparative Analysis of Schemes for Correlated Branch Prediction .......... 276
  C. Young, N. Gloy, M. D. Smith
`
`
`
Next Cache Line and Set Prediction .......... 287
  B. Calder, D. Grunwald

Session 7A: System Evaluation
A Comparison of Architectural Support for Messaging on the TMC CM-5 and Cray T3D .......... 298
  V. Karamcheti, A. A. Chien
Optimizing Memory System Performance for Communication in Parallel Computers .......... 308
  T. Stricker, T. Gross
Empirical Evaluation of the CRAY-T3D: A Compiler Perspective .......... 320
  R. H. Arpaci, D. E. Culler, A. Krishnamurthy, S. G. Steinberg, K. Yelick

Session 7B: Instruction Fetching
Optimization of Instruction Fetch Mechanisms for High Issue Rates .......... 333
  T. M. Conte, K. N. Menezes, P. M. Mills, B. A. Patel
Instruction Fetching: Coping with Code Bloat .......... 345
  R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, J. Emer
Instruction Cache Fetch Policies for Speculative Execution .......... 357
  D. Lee, J.-L. Baer, B. Calder, D. Grunwald

Session 8: Caches
Streamlining Data Cache Access with Fast Address Calculation .......... 369
  T. M. Austin, D. N. Pnevmatikatos, G. S. Sohi
CAT — Caching Address Tags: A Technique for Reducing Area Cost of On-Chip Caches .......... 381
  H. Wang, T. Sun, Q. Yang

Session 9: Processor Architecture
Simultaneous Multithreading: Maximizing On-Chip Parallelism .......... 392
  D. M. Tullsen, S. J. Eggers, H. M. Levy
Architecture Validation for Processors .......... 404
  R. C. Ho, C. H. Yang, M. A. Horowitz, D. L. Dill
Multiscalar Processors .......... 414
  G. S. Sohi, S. E. Breach, T. N. Vijaykumar

Author Index .......... 426
`
`
`
`
`Destage Algorithms for Disk Arrays with Non-Volatile Caches
`
`Anujan Varma and Quinn Jacobson
`Computer Engineering Department
`University of California
`Santa Cruz, CA 95064
`
Abstract
In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests from the host, disk utilization, and their ability to tolerate bursts in the workload without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance. An approximate implementation of the linear-threshold scheduling algorithm is also described. The approximate algorithm can be implemented with much lower overhead, yet its performance is virtually identical to that of the ideal algorithm.
`
I. INTRODUCTION
A disk array, in general, consists of a group of disk drives together with an associated controller function, organized logically as a single I/O device [2], [4]. Disk arrays, also known as RAIDs (Redundant Arrays of Inexpensive Disks), are capable of providing improved levels of reliability, availability, and/or performance over single disks. A disk array usually provides protection against loss of data from a disk failure by maintaining redundant information within the array. Moreover, data availability can be maintained on a disk failure by using the redundant information to reconstruct data stored on the failed disk in real time. In addition to improving reliability and availability, disk arrays also improve the performance of the storage system by distributing data across multiple disk drives — this is the result of either concurrency in servicing multiple I/O requests or parallelism in the data transfer for a single I/O request.
This research is supported by NSF Young Investigator Award No. MIP-9257103.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ISCA '95, Santa Margherita Ligure, Italy
© 1995 ACM 0-89791-698-0/95/0006...$3.50

Several types of disk array configurations are known in
`
the literature [2], [4], [17]. These configurations vary primarily in the redundancy scheme employed and the data distribution (striping) scheme used to map logical blocks among the individual disks. In a seminal paper, Patterson, Gibson, and Katz [11] introduced a taxonomy of disk arrays, consisting of six different types (or "levels"), which is now widely used in the industry. Among these RAID levels, of particular interest are RAID-4 and RAID-5, which are optimized for transaction processing workloads. These arrays employ coarse-grained striping of data so that small requests can benefit from the concurrency in servicing multiple requests, while a large request can achieve high throughput by transferring data from multiple disks in parallel. RAID-4 organizes the disks in the array into parity groups, with one dedicated parity disk in each group. This has the disadvantage of the parity disks becoming a bottleneck while updating data in the system. RAID-5 eliminates this bottleneck by distributing parity blocks uniformly among all the disks in the group.
Both RAID-4 and RAID-5 provide reliability and availability at a fraction of the storage overhead incurred in disk mirroring. This reduction in storage overhead, however, is achieved at the expense of increasing the number of disk accesses necessary to update data in the system. Every write request to the array that does not update an entire stripe must now update parity by reading the old data and the old parity, and exclusive-ORing them with the new data. This involves a total of four disk accesses — reading the old data and parity, and writing the new data and parity. Menon and Mattson [7] showed that this overhead can degrade the performance of a RAID-5 considerably in a transaction processing environment where small requests dominate the workload. In addition, a small increase in the ratio of writes to reads in the workload can lead to a drastic increase in the response time for both reads and writes.
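The read-modify-write parity update described above can be sketched at the byte level. This is an illustrative model, not code from the paper; the block contents and function name are invented, and only the XOR identity is taken from the text:

```python
def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity for a RAID-5 small write:
    new_parity = old_data XOR old_parity XOR new_data,
    so parity stays the XOR of all data blocks in the stripe
    without reading the untouched blocks."""
    return bytes(d ^ p ^ n for d, p, n in zip(old_data, old_parity, new_data))

# The four disk accesses of a small write, in order:
# 1. read old data, 2. read old parity (both inputs above),
# 3. write new data, 4. write new parity.
old_data   = bytes([0b1010, 0b0001])
other_data = bytes([0b0110, 0b1100])  # rest of the stripe (not touched)
old_parity = bytes(a ^ b for a, b in zip(old_data, other_data))
new_data   = bytes([0b1111, 0b0000])

new_parity = small_write_parity(old_data, old_parity, new_data)
# Parity must still equal the XOR of all (updated) data blocks:
assert new_parity == bytes(a ^ b for a, b in zip(new_data, other_data))
```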
Several solutions have been proposed to reduce the overhead for small writes in a RAID-5 [8], [9], [15]. One approach is parity logging [15], where parity updates are posted into a dedicated log disk instead of updating the parity blocks in the array; updates of parity blocks in the array are performed in the background. This has the advantage of converting small writes of parity into large writes; the scheme also captures some of the temporal locality in parity updates since successive updates of parity can be combined into a single update in the disk array. Another scheme that reduces the overhead in parity updates is floating parity [9], where parity blocks are dynamically remapped within disk cylinders to reduce the rotational latency between reading and writing parity.
Both parity logging and floating parity only attempt to reduce the overhead of parity updates. In both schemes, the old data must be read from the disk and the new data written before signaling completion of the I/O transaction.
`
`83
`
`HPE, Exh. 1012, p. 6
`
`
`
`A more general scheme is to use a nonvolatile write cache to
`reduce write latency. Writes can now be deemed complete
`after writing the new data into the cache, an operation often
`referred to as fast write [8]. Both the data and the parity
`blocks on the disks can then be updated in the background.
`The process of updating data or parity in the disks from the
`write cache is referred to as destaging. In addition to the
`write cache, a larger (volatile) read cache may be used to
`improve read performance.
A nonvolatile write cache has several advantages [8], the most important of which is the substantially lower service time seen by write requests to the array. In addition, the write cache can also exploit any locality in writes — both temporal and spatial — in the workload. Temporal locality is exploited by capturing successive updates of the same block in the cache. Spatial locality allows many small writes to be aggregated into a single large write to the disk. Finally, the write cache can also lower the response time for read requests serviced by the disks because of the reduced contention with writes to the disk.
In spite of the above advantages, a nonvolatile write cache introduces several new problems. First, the write cache must be designed to be at least as reliable as the redundancy scheme used for the disks. The IBM Hagar disk array [8], for example, used duplicate caches with independent power boundaries to protect the contents of the write cache. Second, data losses could still occur while updating the disks from the cache because of a failure in the cache memory, disks, or the datapath between them. The destage algorithm must be designed to leave data in a consistent state should a failure occur while updating the disks. Several such algorithms to prevent data losses while destaging are presented in [8].
A third problem in the use of a write cache is that of scheduling the destages, that is, determining when to destage data from the cache to the disk, as well as which of the dirty blocks is to be destaged next. Ideally, in a disk array with a write cache, the disks see only two types of requests: (i) read requests generated by the workload that miss in the read cache, and (ii) read and write requests originated by the array controller for destaging data from the write cache. Requests of the first category must be serviced as soon as possible, while destage requests can usually be serviced in the background. Ideally, destages should appear transparent to the user requests received by the array. In practice, however, because of the non-preemptive nature of disk service, some interference between user requests and destage requests is unavoidable. For example, a read request received while a destage request is being serviced is made to wait until the latter is completed. This interference, however, can be minimized with careful scheduling of the destages.
In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests, as well as their ability to tolerate large bursts of write requests without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance.
The rest of this paper is organized as follows: Section II summarizes the tradeoffs involved in the design of the destage algorithm and describes the algorithms studied in the paper. Section III introduces the disk array model and workload model used in our simulations. Section IV presents simulation results for the algorithms and compares the algorithms in terms of read performance, disk utilization, and tolerance to bursts of write requests. Section V concludes the paper with directions for future research.
`
`II. DESTAGE ALGORITHMS
`
`In a disk array with a nonvolatile write cache, destages
`from the cache to the disk are performed in the background
`asynchronously, while read requests from the host system
`are serviced in the foreground. Thus, the workload seen by
`an individual disk predominantly consists of (i) reads from
`the host system that cannot be serviced from the read cache
`or write cache, and (ii) read and write requests generated
`by the array controller as part of destage operations. We
`refer to requests of the first category as host reads and those
`of the second category as destages. Reads of data or parity
`information performed as part of a destage are referred to
`as destage reads and the writes as destage writes. A write
`request from the host is never directed to the disk except in
`the event of a write cache overflow or when the size of the
`block being written is large enough to justify bypassing the
`write cache.
If host reads are given higher priority over destages, a destage would never be initiated by the array controller when there are requests in the read queue. Still, considerable flexibility exists in scheduling the destages, as the updates to the disk need not be performed in the order the writes were posted to the cache. In addition, the four disk accesses involved in a disk update — reads of old parity and data, and writes of new parity and data — need not be performed as a single indivisible operation.
`
`A. Algorithm Design Tradeoffs
`
`We first examine the tradeoffs involved in the design of
`the destage algorithm. Assuming that no overflow of the
`write cache occurs and none of the write requests bypasses
`the cache, all the write requests from the host system are
`serviced by the cache. In this case, the scheduling algorithm
`employed has no direct effect on the response time for a write
`to the disk array. However, the scheduling algorithm can still
`affect the response time seen by the host reads serviced by
`the disks, as well as the maximum sustainable throughput
`of the array. A scheduling algorithm that optimizes destages
`reduces the load on the disks due to destages, thus improving
`their read performance.
`There are three ways in which a scheduling algorithm
`can improve the performance of the disk array: First, the
`algorithm can reduce the number of destages by capturing
`most of the re-writes in the write cache, thus exploiting the
`temporal locality in the workload. This requires maintaining
`blocks that are likely to be re-written in the near future in
`the write cache and destaging them only when additional
re-writes are unlikely to occur soon. Similarly, if parity is allowed to be cached, successive updates of blocks in the same parity group generate only a single update of the parity block on disk.
`Second, the number of destages can also be reduced by
`aggregating blocks that lie physically close on a disk and
`destaging them as a large read and/or write. This exploits
`any spatial locality present in the workload.
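Exploiting spatial locality this way amounts to grouping pending destages by track before issuing them. A minimal sketch follows; the (disk, track, block) tuple format and function name are assumptions for illustration, not the paper's data structures:

```python
from collections import defaultdict

def aggregate_by_track(pending):
    """Group pending destage requests by (disk, track) so that all dirty
    blocks on one track can be destaged in a single large read/write."""
    groups = defaultdict(list)
    for disk, track, block in pending:
        groups[(disk, track)].append(block)
    return groups

pending = [(0, 12, 3), (0, 12, 7), (1, 5, 0), (0, 12, 4)]
groups = aggregate_by_track(pending)
assert groups[(0, 12)] == [3, 7, 4]   # one disk access covers three blocks
assert groups[(1, 5)] == [0]
```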
Finally, the scheduling algorithm can reduce the average time for a destage by ordering destage requests to the disks such that the service times in the individual disks are minimized. Since the mechanical positioning delays usually dominate the service time, this calls for ordering the requests to minimize the positioning delays in the disk. Many such algorithms for scheduling requests in individual disks, taking into account the seek time and the rotational latency of requests, have been described in the literature [3], [5], [6], [14], [16]. While similar algorithms could be used to schedule destage requests in a disk array, a basic distinction exists between the problem of scheduling destages in a disk array and that of scheduling host requests in an independent disk. A disk scheduling algorithm for host requests must make a tradeoff between maximizing throughput and providing fairness to user requests; for example, servicing the user requests in FCFS (first-come, first-served) order maximizes fairness, while a scheduling algorithm that reorders requests to minimize the total service time achieves maximum throughput at the expense of increasing the variance of the response time [5]. The destage algorithm in a disk array, on the other hand, does not have to deal with this dilemma: in most cases, the algorithm can schedule destages to maximize throughput without causing starvation of user requests.
All the above objectives in the design of the scheduling algorithm suggest maintaining the occupancy of the write cache close to its capacity. This allows maximum benefit to be obtained from any locality present in the workload and maximizes scheduling flexibility. Maintaining the write cache close to full, however, makes it vulnerable to overflow. Even a short burst of writes in the workload may cause the cache to overflow, forcing subsequent writes to bypass the cache until some destages can be posted to disk. Thus, the scheduling algorithm must strike a compromise between the conflicting goals of being able to exploit maximum locality and scheduling flexibility for destages, and preventing frequent overflows of the write cache.
The scheduling algorithm may be designed to be either work-conserving or non-work-conserving with respect to the disks in the array. A work-conserving algorithm never allows a disk to be idle when one or more destages are pending to the disk. A non-work-conserving algorithm, on the contrary, may refrain from scheduling a destage to the disk even when it is idle, expecting that the same destage could be performed at a lower cost in the future; the goal is to minimize the response time for host reads by reducing the contention between host reads and destages. Both types of algorithms are studied in this paper.
Thus, in summary, the following are the parameters that can be used by the destage algorithm to determine the block to be destaged next:
1. The probability of the block being re-written in the near future. This factor is usually accommodated in the algorithm by ordering blocks in the destage queue based on the time of their last access.
2. The number of blocks to be read/updated on the same track (or cylinder), to take advantage of spatial locality.
3. Service times of the requests in the destage queue. The service time includes the delay in positioning the head at the beginning of the block to be read/written and the data transfer time. Different levels of approximation could be used to estimate the positioning delays in the disk [16].
4. The current level of occupancy of the cache.
In addition, other factors, such as the type of destage request (parity or data, read or write), may also be used in the scheduling process.
In the following, we describe four scheduling algorithms for destaging blocks from the write cache in a RAID-5. In all the algorithms, destaging to each disk is handled independently by maintaining a separate queue of pending destages for each disk. In addition, in all the algorithms, a destage to a disk is initiated only when there are no host reads queued for that disk at that time. However, not all of the algorithms initiate a destage every time the read queue is empty. The algorithms differ in the criteria used to determine whether a destage is to be initiated when the disk becomes idle after servicing a request, and in the function used to select the block to be destaged.
`
B. Least-Cost Scheduling

This algorithm is modeled after the well-known shortest seek-time first disk scheduling algorithm [3]. After every disk operation, the queue of eligible destages is examined and the request that takes the shortest access time is performed. In addition, to exploit spatial locality, if there are pending destage requests to other blocks on the same track, these are also performed as one large read or write.
The access time of a destage read or write operation includes the seek time (or head-switch time), rotational latency, and controller delays. Precise estimation of these delays is difficult due to many factors such as the physical-to-logical mapping of sectors on the disk and the use of caching within the disk itself. Considerable improvement can still be obtained using approximate models for estimating the delays involved. We implemented such a scheme in our RAID simulator for comparison with other schemes. Details of the implementation of the scheme can be found in Section III.
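At its core, least-cost selection is a minimum over the eligible destage queue under some access-time estimate. The sketch below is a stand-in: the dictionary request format and the cylinder-distance cost model are assumptions for illustration, not the paper's actual seek/rotation model:

```python
def pick_least_cost(destage_queue, head_pos, cost_fn):
    """Least-cost scheduling: after every disk operation, scan the
    eligible destages and pick the one with the smallest estimated
    access time from the current head position."""
    if not destage_queue:
        return None
    return min(destage_queue, key=lambda req: cost_fn(head_pos, req))

# Toy cost model: cost = cylinder distance from the head.
cost = lambda head, req: abs(head - req["cyl"])
queue = [{"cyl": 100}, {"cyl": 40}, {"cyl": 55}]
assert pick_least_cost(queue, head_pos=50, cost_fn=cost) == {"cyl": 55}
```

In a full implementation, `cost_fn` would also fold in rotational latency, and requests to the same track as the winner would be batched into one access, as the text describes.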
`
C. High/Low Mark Algorithm

This algorithm is designed after the cache purging algorithm proposed by Biswas, Ramakrishnan, and Towsley [1] for disks with a nonvolatile write cache. In this algorithm, two cache-occupancy thresholds are used to enable and disable destages. Destages to the disk are disabled when the cache occupancy drops below a "low mark" and turned on when it increases over a "high mark." Some hysteresis between the two thresholds is used to prevent short bursts in the workload from causing a cache overflow. We chose the high threshold as 70% of cache capacity and the low threshold as 30%. Destage requests were selected using the least-cost policy to minimize the service time of individual disk accesses. In addition, when a block is selected for destage, all pending destage requests to the same track on the disk are scheduled together.
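The on/off gating with hysteresis can be sketched as follows, using the 70%/30% thresholds from the text (the class and method names are our own):

```python
class HighLowMark:
    """High/low-mark destage gating with hysteresis: destaging turns on
    above the high mark, off below the low mark, and keeps its previous
    state in between."""
    def __init__(self, low=0.30, high=0.70):
        self.low, self.high = low, high
        self.enabled = False

    def update(self, occupancy: float) -> bool:
        if occupancy >= self.high:
            self.enabled = True
        elif occupancy <= self.low:
            self.enabled = False
        return self.enabled

hl = HighLowMark()
assert hl.update(0.50) is False   # between marks, previously off: stays off
assert hl.update(0.75) is True    # crossed the high mark: destaging on
assert hl.update(0.50) is True    # hysteresis: stays on until the low mark
assert hl.update(0.25) is False   # dropped below the low mark: off
```

The gap between the two marks is what absorbs short write bursts without toggling destaging on and off.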
`
`85
`
`HPE, Exh. 1012, p. 8
`
`
`
D. Linear Threshold Scheduling
The basic principle of this algorithm is to match the rate of destages from the cache to the current level of occupancy of the cache, accelerating destages gradually as the cache occupancy increases and slowing them as the occupancy falls. The primary difference between this algorithm and high/low mark scheduling is that, instead of on/off thresholds, the rate of destages is changed gradually as a function of cache occupancy.
`As in other algorithms, each disk in the array is scheduled
`independently. Both parity and data destages are treated
`in a similar manner. The destage queue is examined after
`each disk operation if the host read queue is empty. As in
`the case of least-cost scheduling, the block to be destaged is
`chosen as the one with minimum cost among all the eligible
`destage requests to that disk. However, unlike in least-cost
`scheduling, the destage is performed only if its cost is within
`a certain threshold maintained by the algorithm. Should
`a read request from the host arrive while the destage is in
`progress, this policy minimizes its waiting time.
The role of the threshold is to reduce the interference between destages and host reads; the threshold sets an upper bound on the waiting time of a read arriving while a destage is in progress. A tradeoff exists in the choice of the threshold: if the threshold is set too low, the destage rate may not be adequate to clear the cache fast enough, causing frequent overflows. A large threshold, on the other hand, may perform destages too soon, degrading both the waiting time for host reads and the hit rate of the write cache, and reducing the level of temporal and spatial locality that may potentially be exploited.
`In our implementation, the threshold was selected as a
`linear function of the instantaneous cache occupancy. The
`cost of a destage is its service time, computed as in the case of
`the least-cost scheduling algorithm. As in other algorithms,
`when a destage request is scheduled, all pending destage
`requests to the same track are also performed along with it.
The linear threshold scheduling algorithm uses the cache occupancy, service time of destages, and spatial locality as its decision-making criteria, but does not attempt to maximize temporal locality explicitly. Thus, a block that has been written recently may be scheduled for destage if the destage can be performed "cheaply." This may increase the total number of destage accesses to the disk in comparison to servicing the requests in LRU order, but the total service time can still be lower as a result of minimizing the cost of individual destage requests. Note that temporal locality for host reads can still be exploited by keeping the destaged block in the write cache until the block needs to be replaced.
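A minimal sketch of the policy follows. The slope of the threshold and the cylinder-distance cost model are assumptions; the text specifies only that the threshold is a linear function of cache occupancy and that the least-cost eligible request is considered first:

```python
def destage_cost_threshold(occupancy: float, max_cost: float) -> float:
    """Maximum admissible destage cost grows linearly with write-cache
    occupancy (0..1). A line through the origin is assumed here."""
    return occupancy * max_cost

def maybe_destage(queue, occupancy, head_pos, cost_fn, max_cost):
    """Pick the least-cost eligible destage, but perform it only if its
    cost falls under the occupancy-dependent threshold."""
    if not queue:
        return None
    best = min(queue, key=lambda r: cost_fn(head_pos, r))
    if cost_fn(head_pos, best) <= destage_cost_threshold(occupancy, max_cost):
        return best
    return None  # non-work-conserving: leave the disk idle for now

cost = lambda head, r: abs(head - r)   # toy cost: cylinder distance
queue = [120, 60]
# Nearly empty cache: only very cheap destages are admitted.
assert maybe_destage(queue, occupancy=0.05, head_pos=50, cost_fn=cost, max_cost=100) is None
# Nearly full cache: the same destage now clears the threshold.
assert maybe_destage(queue, occupancy=0.90, head_pos=50, cost_fn=cost, max_cost=100) == 60
```

Returning `None` when the cheapest destage is still too expensive is exactly what makes the policy non-work-conserving.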
`
E. Approximation to Linear Threshold Scheduling
A major difficulty in implementing both the least-cost scheduling and the linear threshold scheduling algorithms is the need to scan the entire queue of destage requests to select the minimum-cost request. This operation must be performed at the completion of every disk service. Furthermore, the cost of individual destage requests must be computed afresh if the position of the head has changed.
The need for searching the entire destage queue can be overcome by resorting to an approximate implementation of linear threshold scheduling; the approach works by dividing the disk into regions and maintaining a separate queue of destage requests for each region. When the disk becomes idle, the queues are searched in the order of the closest to the farthest region relative to the current position of the head. The first non-empty queue is found and the request at its head is selected as the minimum-cost destage. As in the ideal linear-threshold scheduling algorithm, the request is scheduled if its estimated service time is within the threshold determined by the current cache occupancy level. The requests within each of the destage queues can be ordered using the LRU policy to allow maximum temporal locality. As in the other scheduling algorithms, when a request is selected for destage, all the pending destages to the same track are performed in one disk access.
Both seek time and rotational latency must be taken into account in ordering the search of regions. Therefore, we used a partitioning scheme in which the disk cylinders are partitioned into circular bands; each band is then radially subdivided into regions. This is illustrated in Fig. 1(a) with respect to the example disk used in our simulation model. The example drive is an HP 97560 disk with a total of 1935 cylinders. The cylinders are partitioned into 15 circular bands, with 129 cylinders per band; each band is radially subdivided into three regions, to provide a total of 45 regions for the scheduling algorithm. This is similar to the disk partitioning scheme proposed by Jacobson and Wilkes [6], except that the cost of each destage is approximated as the average positioning delay from the current region to the target region. The average positioning delay from region i to region j is taken as the positioning delay from the center of region i to the center of region j.
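The band-and-region partitioning for the HP 97560 example (1935 cylinders, 15 bands, 3 regions per band, 45 regions) and the nearest-first queue search can be sketched as follows; the function names and the 0..1 rotational-position encoding are illustrative assumptions:

```python
def region_of(cylinder: int, rotational_pos: float,
              cylinders=1935, bands=15, sectors=3) -> int:
    """Map a block to one of bands*sectors regions: cylinders are split
    into circular bands, and each band is radially subdivided.
    rotational_pos is the angular position of the block, in [0, 1)."""
    band = min(cylinder * bands // cylinders, bands - 1)
    sector = min(int(rotational_pos * sectors), sectors - 1)
    return band * sectors + sector

def first_nonempty_nearest(queues, order):
    """Search per-region destage queues from closest to farthest region
    ('order' would be precomputed from region-to-region positioning
    delays) and return the head of the first non-empty queue."""
    for region in order:
        if queues[region]:
            return queues[region][0]
    return None

assert region_of(0, 0.0) == 0
assert region_of(1934, 0.99) == 44
queues = {r: [] for r in range(45)}
queues[7] = ["reqA"]; queues[30] = ["reqB"]
assert first_nonempty_nearest(queues, order=[5, 7, 30]) == "reqA"
```

Because only the first non-empty queue in proximity order is examined, the per-decision work is bounded by the number of regions rather than the length of the destage queue.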
The linear threshold scheme is approximated by designing a stepped threshold function that specifies the maximum destage cost for each level of cache occupancy. The destage cost is the average access time to a region, expressed as a multiple of the fractional rotational delay of the disk corresponding to a region. Table I shows the threshold function used in our example. The second column in the table provides the maximum allowable destage cost for each step of cac