Proceedings

The 22nd Annual International
Symposium on
COMPUTER ARCHITECTURE

June 22-24, 1995
Santa Margherita Ligure, Italy

Sponsored by
ACM SIGARCH
IEEE Computer Society, TCCA

With support of SGS-Thomson-Microelectronics and Olivetti
In cooperation with the University of Genoa

ACM Press
The Association for Computing Machinery
1515 Broadway
New York, N.Y. 10036

Copyright © 1995 by the Association for Computing Machinery, Inc. Copying without fee is permitted provided that the copies are not made or distributed for direct commercial advantage, and credit to the source is given. Abstracting with credit is permitted. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For permission to republish, write to: Director of Publications, Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

IEEE Catalog Number: 95CB35801
ISBN 0-89791-698-0 (softbound)
ISBN 0-7803-3000-5 (casebound)
ISBN 0-7803-3001-3 (microfiche)
Library of Congress Number: 85-642899
ISSN: 1063-6897
Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 12114
Church Street Station
New York, N.Y. 10257

IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
Los Alamitos, CA 90720

Phone: 1-800-342-6626 (U.S.A. and Canada)
       1-212-626-0500 (all other countries)
Fax: 1-212-944-1318
E-mail: acmhelp@acm.org
        acm_europe@acm.org (in Europe)

Printed in the U.S.A.
Table of Contents

General Chair's Message ... vi
Program Chair's Message ... vii
Organizing Committees ... ix
Referees

Session 1: Multiprocessors and Applications
The MIT Alewife Machine: Architecture and Performance ... 2
    A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, D. Yeung
The EM-X Parallel Computer: Architecture and Basic Performance ... 14
    Y. Kodama, H. Sakane, M. Sato, H. Yamana, S. Sakai, Y. Yamaguchi
The SPLASH-2 Programs: Characterization and Methodological Considerations ... 24
    S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, A. Gupta

Session 2A: Cache Coherence
Efficient Strategies for Software-Only Directory Protocols in Shared-Memory Multiprocessors ... 38
    H. Grahn, P. Stenstrom
Dynamic Self-Invalidation: Reducing Coherence Overhead in Shared-Memory Multiprocessors ... 48
    A. R. Lebeck, D. A. Wood
Boosting the Performance of Hybrid Snooping Cache Protocols ... 60
    F. Dahlgren

Session 2B: Interconnect Technology and I/O
S-Connect: from Networks of Workstations to Supercomputer Performance ... 71
    A. G. Nowatzyk, M. C. Browne, E. J. Kelly, M. Parkin
Destage Algorithms for Disk Arrays with Non-volatile Caches ... 83
    A. Varma, Q. Jacobson
Evaluating Multi-Port Frame Buffer Designs for a Mesh-Connected Multicomputer ... 96
    G. Stoll, B. Wei, D. Clark, E. W. Felten, K. Li, P. Hanrahan
Are Crossbars Really Dead? The Case for Optical Multiprocessor Interconnect Systems ... 106
    A. G. Nowatzyk, P. R. Prucnal

Session 3: Instruction Level Parallelism
Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor ... 117
    S. Jourdan, P. Sainrat, D. Litaize
Unconstrained Speculative Execution with Predicated State Buffering ... 126
    H. Ando, C. Nakanishi, T. Hara, M. Nakaya
A Comparison of Full and Partial Predicated Execution Support for ILP Processors ... 138
    S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, W. W. Hwu

Session 4A: New Microarchitectures
Implementation Trade-offs in Using a Restricted Data Flow Architecture in a High Performance RISC Microprocessor ... 151
    M. Simone, A. Essen, A. Ike, A. Krishnamoorthy, T. Maruyama, N. Patkar, M. Ramaswami, M. Shebanow, V. Thirumalaiswamy, D. Tovey
Performance Evaluation of the PowerPC 620 Microarchitecture ... 163
    T. A. Diep, C. Nelson, J. P. Shen

Session 4B: Managing Memory Hierarchies
Reducing TLB and Memory Overhead using Online Superpage Promotion ... 176
    T. H. Romer, W. H. Ohlrich, A. R. Karlin, B. N. Bershad
Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching ... 188
    Z. Zhang, J. Torrellas

Session 5A: Interconnection Network Routing
An Efficient, Fully Adaptive Deadlock Recovery Scheme: DISHA ... 201
    Anjan K. V., T. M. Pinkston
Analysis and Implementation of Hybrid Switching ... 211
    K. G. Shin, S. W. Daniel
Configurable Flow Control Mechanisms for Fault-Tolerant Routing ... 220
    B. V. Dao, J. Duato, S. Yalamanchili
NIFDY: A Low Overhead, High Throughput Network Interface ... 230
    T. Callahan, S. C. Goldstein

Session 5B: Novel Memory Access Mechanisms
Vector Multiprocessors with Arbitrated Memory Access ... 243
    M. Peiron, M. Valero, E. Ayguadé, T. Lang
Design of Cache Memories for Multi-Threaded Dataflow Architecture ... 253
    K. M. Kavi, A. R. Hurson, P. Patadia, E. Abraham, P. Shanmugam
Skewed Associativity Enhances Performance Predictability ... 265
    F. Bodin, A. Seznec

Session 6: Branch Prediction
A Comparative Analysis of Schemes for Correlated Branch Prediction ... 276
    C. Young, N. Gloy, M. D. Smith
Next Cache Line and Set Prediction ... 287
    B. Calder, D. Grunwald

Session 7A: System Evaluation
A Comparison of Architectural Support for Messaging on the TMC CM-5 and Cray T3D ... 298
    V. Karamcheti, A. A. Chien
Optimizing Memory System Performance for Communication in Parallel Computers ... 308
    T. Stricker, T. Gross
Empirical Evaluation of the CRAY-T3D: A Compiler Perspective ... 320
    R. H. Arpaci, D. E. Culler, A. Krishnamurthy, S. G. Steinberg, K. Yelick

Session 7B: Instruction Fetching
Optimization of Instruction Fetch Mechanisms for High Issue Rates ... 333
    T. M. Conte, K. N. Menezes, P. M. Mills, B. A. Patel
Instruction Fetching: Coping with Code Bloat ... 345
    R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, J. Emer
Instruction Cache Fetch Policies for Speculative Execution ... 357
    D. Lee, J.-L. Baer, B. Calder, D. Grunwald

Session 8: Caches
Streamlining Data Cache Access with Fast Address Calculation ... 369
    T. M. Austin, D. N. Pnevmatikatos, G. S. Sohi
CAT — Caching Address Tags: A Technique for Reducing Area Cost of On-Chip Caches ... 381
    H. Wang, T. Sun, Q. Yang

Session 9: Processor Architecture
Simultaneous Multithreading: Maximizing On-Chip Parallelism ... 392
    D. M. Tullsen, S. J. Eggers, H. M. Levy
Architecture Validation for Processors ... 404
    R. C. Ho, C. H. Yang, M. A. Horowitz, D. L. Dill
Multiscalar Processors ... 414
    G. S. Sohi, S. E. Breach, T. N. Vijaykumar

Author Index ... 426
Destage Algorithms for Disk Arrays with Non-Volatile Caches

Anujan Varma and Quinn Jacobson
Computer Engineering Department
University of California
Santa Cruz, CA 95064
Abstract

In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests from the host, disk utilization, and their ability to tolerate bursts in the workload without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance. An approximate implementation of the linear-threshold scheduling algorithm is also described. The approximate algorithm can be implemented with much lower overhead, yet its performance is virtually identical to that of the ideal algorithm.
I. INTRODUCTION

A disk array, in general, consists of a group of disk drives together with an associated controller function, organized logically as a single I/O device [2], [4]. Disk arrays, also known as RAIDs (Redundant Arrays of Inexpensive Disks), are capable of providing improved levels of reliability, availability, and/or performance over single disks. A disk array usually provides protection against loss of data from a disk failure by maintaining redundant information within the array. Moreover, data availability can be maintained on a disk failure by using the redundant information to reconstruct data stored on the failed disk in real time. In addition to improving reliability and availability, disk arrays also improve the performance of the storage system by distributing data across multiple disk drives; this is the result of either concurrency in servicing multiple I/O requests or parallelism in the data transfer for a single I/O request.

This research is supported by NSF Young Investigator Award No. MIP-9257103.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ISCA '95, Santa Margherita Ligure, Italy
© 1995 ACM 0-89791-698-0/95/0006...$3.50
Several types of disk array configurations are known in the literature [2], [4], [17]. These configurations vary primarily in the redundancy scheme employed and the data distribution (striping) scheme used to map logical blocks among the individual disks. In a seminal paper, Patterson, Gibson, and Katz [11] introduced a taxonomy of disk arrays, consisting of six different types (or "levels"), which is now widely used in the industry. Among these RAID levels, of particular interest are RAID-4 and RAID-5, which are optimized for transaction processing workloads. These arrays employ coarse-grained striping of data so that small requests can benefit from the concurrency in servicing multiple requests, while a large request can achieve high throughput by transferring data from multiple disks in parallel. RAID-4 organizes the disks in the array into parity groups, with one dedicated parity disk in each group. This has the disadvantage of the parity disks becoming a bottleneck while updating data in the system. RAID-5 eliminates this bottleneck by distributing parity blocks uniformly among all the disks in the group.

Both RAID-4 and RAID-5 provide reliability and availability at a fraction of the storage overhead incurred in disk mirroring. This reduction in storage overhead, however, is achieved at the expense of increasing the number of disk accesses necessary to update data in the system. Every write request to the array that does not update an entire stripe must now update parity by reading the old data and the old parity, and exclusive-ORing them with the new data. This involves a total of four disk accesses: reading the old data and parity, and writing the new data and parity. Menon and Mattson [7] showed that this overhead can degrade the performance of a RAID-5 considerably in a transaction processing environment where small requests dominate the workload. In addition, a small increase in the ratio of writes to reads in the workload can lead to a drastic increase in the response time for both reads and writes.
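The read-modify-write sequence behind the small-write penalty can be made concrete with a short sketch. This is a minimal illustration of the computation described above, not code from the paper; the 8-byte block contents are arbitrary stand-ins.

```python
# Sketch of the RAID-5 small-write (read-modify-write) parity update.
# The four disk accesses named in the text are marked below.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Exclusive-OR two equal-sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

old_data   = bytes([0x0F] * 8)   # access 1: read old data
old_parity = bytes([0xAA] * 8)   # access 2: read old parity
new_data   = bytes([0xF0] * 8)

# The stripe invariant is parity = XOR of all data blocks, so the new
# parity is computable from just the changed block and the old parity:
new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
# access 3: write new data; access 4: write new parity

# Folding the new data back out of the new parity recovers the XOR of
# the unchanged blocks, i.e. the invariant is preserved.
assert xor_blocks(new_parity, new_data) == xor_blocks(old_parity, old_data)
```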
Several solutions have been proposed to reduce the overhead for small writes in a RAID-5 [8], [9], [15]. One approach is parity logging [15], where parity updates are posted to a dedicated log disk instead of updating the parity blocks in the array; updates of parity blocks in the array are performed in the background. This has the advantage of converting small writes of parity into large writes; the scheme also captures some of the temporal locality in parity updates, since successive updates of parity can be combined into a single update in the disk array. Another scheme that reduces the overhead in parity updates is floating parity [9], where parity blocks are dynamically remapped within disk cylinders to reduce the rotational latency between reading and writing parity.

Both parity logging and floating parity attempt only to reduce the overhead of parity updates. In both schemes, the old data must be read from the disk and the new data written before signaling completion of the I/O transaction.
A more general scheme is to use a nonvolatile write cache to reduce write latency. Writes can now be deemed complete after writing the new data into the cache, an operation often referred to as fast write [8]. Both the data and the parity blocks on the disks can then be updated in the background. The process of updating data or parity on the disks from the write cache is referred to as destaging. In addition to the write cache, a larger (volatile) read cache may be used to improve read performance.

A nonvolatile write cache has several advantages [8], the most important of which is the substantially lower service time seen by write requests to the array. In addition, the write cache can also exploit any locality in writes, both temporal and spatial, in the workload. Temporal locality is exploited by capturing successive updates of the same block in the cache. Spatial locality allows many small writes to be aggregated into a single large write to the disk. Finally, the write cache can also lower the response time for read requests serviced by the disks because of the reduced contention with writes to the disk.

In spite of the above advantages, a nonvolatile write cache introduces several new problems. First, the write cache must be designed to be at least as reliable as the redundancy scheme used for the disks. The IBM Hagar disk array [8], for example, used duplicate caches with independent power boundaries to protect the contents of the write cache. Second, data losses could still occur while updating the disks from the cache because of a failure in the cache memory, disks, or the datapath between them. The destage algorithm must be designed to leave data in a consistent state should a failure occur while updating the disks. Several such algorithms to prevent data losses while destaging are presented in [8].

A third problem in the use of a write cache is that of scheduling the destages, that is, determining when to destage data from the cache to the disk, as well as which of the dirty blocks is to be destaged next. Ideally, in a disk array with a write cache, the disks see only two types of requests: (i) read requests generated by the workload that miss in the read cache, and (ii) read and write requests originated by the array controller for destaging data from the write cache. Requests of the first category must be serviced as soon as possible, while destage requests can usually be serviced in the background. Ideally, destages should appear transparent to the user requests received by the array. In practice, however, because of the non-preemptive nature of disk service, some interference between user requests and destage requests is unavoidable. For example, a read request received while a destage request is being serviced is made to wait until the latter is completed. This interference, however, can be minimized with careful scheduling of the destages.

In this paper, we study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests, as well as their ability to tolerate large bursts of write requests without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance.

The rest of this paper is organized as follows: Section II summarizes the tradeoffs involved in the design of the destage algorithm and describes the algorithms studied in the paper. Section III introduces the disk array model and workload model used in our simulations. Section IV presents simulation results for the algorithms and compares the algorithms in terms of read performance, disk utilization, and tolerance to bursts of write requests. Section V concludes the paper with directions for future research.
II. DESTAGE ALGORITHMS

In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously, while read requests from the host system are serviced in the foreground. Thus, the workload seen by an individual disk predominantly consists of (i) reads from the host system that cannot be serviced from the read cache or write cache, and (ii) read and write requests generated by the array controller as part of destage operations. We refer to requests of the first category as host reads and those of the second category as destages. Reads of data or parity information performed as part of a destage are referred to as destage reads, and the writes as destage writes. A write request from the host is never directed to the disk except in the event of a write cache overflow or when the size of the block being written is large enough to justify bypassing the write cache.

If host reads are given priority over destages, a destage would never be initiated by the array controller when there are requests in the read queue. Still, considerable flexibility exists in scheduling the destages, as the updates to the disk need not be performed in the order the writes were posted to the cache. In addition, the four disk accesses involved in a disk update (reads of old parity and data, and writes of new parity and data) need not be performed as a single indivisible operation.
A. Algorithm Design Tradeoffs

We first examine the tradeoffs involved in the design of the destage algorithm. Assuming that no overflow of the write cache occurs and none of the write requests bypasses the cache, all the write requests from the host system are serviced by the cache. In this case, the scheduling algorithm employed has no direct effect on the response time for a write to the disk array. However, the scheduling algorithm can still affect the response time seen by the host reads serviced by the disks, as well as the maximum sustainable throughput of the array. A scheduling algorithm that optimizes destages reduces the load on the disks due to destages, thus improving their read performance.

There are three ways in which a scheduling algorithm can improve the performance of the disk array. First, the algorithm can reduce the number of destages by capturing most of the re-writes in the write cache, thus exploiting the temporal locality in the workload. This requires maintaining blocks that are likely to be re-written in the near future in the write cache and destaging them only when additional re-writes are unlikely to occur soon. Similarly, if parity is allowed to be cached, successive updates of blocks in the same parity group generate only a single update of the parity block on disk.
Second, the number of destages can also be reduced by aggregating blocks that lie physically close on a disk and destaging them as a single large read and/or write. This exploits any spatial locality present in the workload.

Finally, the scheduling algorithm can reduce the average time for a destage by ordering destage requests to the disks such that the service times in the individual disks are minimized. Since the mechanical positioning delays usually dominate the service time, this calls for ordering the requests to minimize the positioning delays in the disk. Many such algorithms for scheduling requests in individual disks, taking into account the seek time and the rotational latency of requests, have been described in the literature [3], [5], [6], [14], [16]. While similar algorithms could be used to schedule destage requests in a disk array, a basic distinction exists between the problem of scheduling destages in a disk array and that of scheduling host requests in an independent disk. A disk scheduling algorithm for host requests must make a tradeoff between maximizing throughput and providing fairness to user requests; for example, servicing the user requests in FCFS (first-come, first-served) order maximizes fairness, while a scheduling algorithm that reorders requests to minimize the total service time achieves maximum throughput at the expense of increasing the variance of the response time [5]. The destage algorithm in a disk array, on the other hand, does not have to deal with this dilemma: in most cases, the algorithm can schedule destages to maximize throughput without causing starvation of user requests.

All the above objectives in the design of the scheduling algorithm suggest maintaining the occupancy of the write cache close to its capacity. This allows maximum benefit to be obtained from any locality present in the workload and maximizes scheduling flexibility. Maintaining the write cache close to full, however, makes it vulnerable to overflow. Even a short burst of writes in the workload may cause the cache to overflow, forcing subsequent writes to bypass the cache until some destages can be posted to disk. Thus, the scheduling algorithm must strike a compromise between the conflicting goals of exploiting maximum locality and scheduling flexibility for destages, and preventing frequent overflows of the write cache.

The scheduling algorithm may be designed to be either work-conserving or non-work-conserving with respect to the disks in the array. A work-conserving algorithm never allows a disk to be idle when one or more destages are pending for the disk. A non-work-conserving algorithm, on the contrary, may refrain from scheduling a destage to the disk even when it is idle, expecting that the same destage could be performed at a lower cost in the future; the goal is to minimize the response time for host reads by reducing the contention between host reads and destages. Both types of algorithms are studied in this paper.

Thus, in summary, the following are the parameters that can be used by the destage algorithm to determine the block to be destaged next:

1. The probability that the block will be re-written in the near future. This factor is usually accommodated in the algorithm by ordering blocks in the destage queue based on the time of their last access.
2. The number of blocks to be read/updated on the same track (or cylinder), to take advantage of spatial locality.
3. The service times of the requests in the destage queue. The service time includes the delay in positioning the head at the beginning of the block to be read/written and the data transfer time. Different levels of approximation could be used to estimate the positioning delays in the disk [16].
4. The current level of occupancy of the cache.

In addition, other factors, such as the type of destage request (parity or data, read or write), may also be used in the scheduling process.

In the following, we describe four scheduling algorithms for destaging blocks from the write cache in a RAID-5. In all the algorithms, destaging to each disk is handled independently by maintaining a separate queue of pending destages for each disk. In addition, in all the algorithms, a destage to a disk is initiated only when there are no host reads queued for that disk at that time. However, not all of the algorithms schedule a destage every time the read queue is empty. The algorithms differ in the criteria used to determine whether a destage is to be initiated when the disk becomes idle after servicing a request, and in the function used to select the block to be destaged. This common framework is sketched below.
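The following sketch captures the per-disk machinery shared by the four algorithms. The names (DiskScheduler, DestageRequest, pop_track_batch) are ours, not the paper's, and two models are deliberately simplified relative to the simulator of Section III: access cost is reduced to seek distance in tracks, and cache occupancy is tracked per disk even though the actual write cache is shared across the array.

```python
# Minimal per-disk scheduling skeleton (our sketch, not the paper's code).
# Host reads always have priority; a destage is considered only when the
# read queue is empty, and select_destage() encodes each policy below.
from dataclasses import dataclass, field

@dataclass
class DestageRequest:
    block: int          # block to read or write as part of the destage
    track: int          # physical track; used to batch same-track destages
    last_access: float  # timestamp, for LRU-style ordering

@dataclass
class DiskScheduler:
    cache_capacity: int = 1024                         # write-cache size, in blocks
    head_track: int = 0                                # current head position
    read_queue: list = field(default_factory=list)     # host reads (priority)
    destage_queue: list = field(default_factory=list)  # pending destages

    def next_request(self):
        """Called whenever the disk becomes idle."""
        if self.read_queue:               # host reads are serviced first
            return self.read_queue.pop(0)
        return self.select_destage()      # may be None: leave the disk idle

    def select_destage(self):
        raise NotImplementedError         # per-algorithm policy, see below

    def estimate_access_time(self, req: DestageRequest) -> float:
        # Stand-in for the seek + rotational-latency + controller-delay
        # estimate of Section III; here, simply seek distance in tracks.
        return abs(req.track - self.head_track)

    def cache_occupancy(self) -> float:
        # Fraction of the write cache that is dirty (per-disk approximation).
        return len(self.destage_queue) / self.cache_capacity

    def pop_track_batch(self, req: DestageRequest) -> list:
        # All pending destages to the same track are serviced in one access.
        batch = [r for r in self.destage_queue if r.track == req.track]
        for r in batch:
            self.destage_queue.remove(r)
        self.head_track = req.track
        return batch
```

Each of the four algorithms then reduces to a different select_destage policy on this skeleton.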
B. Least-Cost Scheduling

This algorithm is modeled after the well-known shortest-seek-time-first disk scheduling algorithm [3]. After every disk operation, the queue of eligible destages is examined and the request that takes the shortest access time is performed. In addition, to exploit spatial locality, if there are pending destage requests to other blocks on the same track, these are also performed as one large read or write.

The access time of a destage read or write operation includes the seek time (or head-switch time), rotational latency, and controller delays. Precise estimation of these delays is difficult due to many factors, such as the physical-to-logical mapping of sectors on the disk and the use of caching within the disk itself. Considerable improvement can still be obtained using approximate models for estimating the delays involved. We implemented such a scheme in our RAID simulator for comparison with other schemes. Details of the implementation of the scheme can be found in Section III.
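On the skeleton above, least-cost selection is a few lines; in this sketch, estimate_access_time stands in for the approximate delay model of Section III.

```python
# Least-cost scheduling: after every disk operation, service the pending
# destage with the smallest estimated access time, folding in all other
# pending destages to the same track as one large read/write. This policy
# is work-conserving: it destages whenever the queue is non-empty.
class LeastCostScheduler(DiskScheduler):
    def select_destage(self):
        if not self.destage_queue:
            return None        # nothing pending: the disk stays idle
        best = min(self.destage_queue, key=self.estimate_access_time)
        return self.pop_track_batch(best)
```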
C. High/Low Mark Algorithm

This algorithm is designed after the cache purging algorithm proposed by Biswas, Ramakrishnan, and Towsley [1] for disks with a nonvolatile write cache. In this algorithm, two cache-occupancy thresholds are used to enable and disable destages. Destages to the disk are disabled when the cache occupancy drops below a "low mark" and turned on when it increases over a "high mark." Some hysteresis between the two thresholds is used to prevent short bursts in the workload from causing a cache overflow. We chose the high threshold as 70% of cache capacity and the low threshold as 30%. Destage requests were selected using the least-cost policy to minimize the service time of individual disk accesses. In addition, when a block is selected for destage, all pending destage requests to the same track on the disk are scheduled together.
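A sketch of the on/off behavior follows, reusing least-cost selection while destaging is enabled; the 70%/30% marks follow the configuration just described, and cache_occupancy comes from the simplified skeleton above.

```python
# High/low mark scheduling: destaging turns on above the high mark and
# off below the low mark; between the marks, the previous state persists,
# which provides the hysteresis described in the text.
class HighLowMarkScheduler(LeastCostScheduler):
    HIGH_MARK, LOW_MARK = 0.70, 0.30
    enabled = False               # becomes a per-instance flag on first write

    def select_destage(self):
        occ = self.cache_occupancy()
        if occ >= self.HIGH_MARK:
            self.enabled = True   # cache filling up: start destaging
        elif occ <= self.LOW_MARK:
            self.enabled = False  # cache drained: stop destaging
        return super().select_destage() if self.enabled else None
```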
D. Linear Threshold Scheduling

The basic principle of this algorithm is to match the rate of destages from the cache to the current level of occupancy of the cache, accelerating destages gradually as the cache occupancy increases and slowing them as the occupancy falls. The primary difference between this algorithm and high/low mark scheduling is that, instead of on/off thresholds, the rate of destages is changed gradually as a function of cache occupancy.

As in the other algorithms, each disk in the array is scheduled independently. Both parity and data destages are treated in a similar manner. The destage queue is examined after each disk operation if the host read queue is empty. As in the case of least-cost scheduling, the block to be destaged is chosen as the one with minimum cost among all the eligible destage requests to that disk. However, unlike in least-cost scheduling, the destage is performed only if its cost is within a certain threshold maintained by the algorithm. Should a read request from the host arrive while the destage is in progress, this policy minimizes its waiting time.

The role of the threshold is to reduce the interference between destages and host reads; the threshold sets an upper bound on the waiting time of a read arriving while a destage is in progress. A tradeoff exists in the choice of the threshold: if the threshold is set too low, the destage rate may not be adequate to clear the cache fast enough, causing frequent overflows. A large threshold, on the other hand, may cause destages to be performed too soon, degrading both the waiting time for host reads and the hit rate of the write cache, and reducing the level of temporal and spatial locality that can potentially be exploited.

In our implementation, the threshold was selected as a linear function of the instantaneous cache occupancy. The cost of a destage is its service time, computed as in the case of the least-cost scheduling algorithm. As in the other algorithms, when a destage request is scheduled, all pending destage requests to the same track are also performed along with it.

The linear threshold scheduling algorithm uses the cache occupancy, service time of destages, and spatial locality as its decision-making criteria, but does not attempt to maximize temporal locality explicitly. Thus, a block that has been written recently may be scheduled for destage if the destage can be performed "cheaply." This may increase the total number of destage accesses to the disk in comparison to servicing the requests in LRU order, but the total service time can still be lower as a result of minimizing the cost of individual destage requests. Note that temporal locality for host reads can still be exploited by keeping the destaged block in the write cache until the block needs to be replaced.
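In sketch form, the only change from least-cost scheduling is the occupancy-dependent admission test. The linear coefficient (MAX_COST) below is illustrative only; the paper's threshold function is calibrated against its service-time model.

```python
# Linear threshold scheduling: find the cheapest pending destage as in
# least-cost scheduling, but perform it only if its cost falls under a
# threshold that grows linearly with cache occupancy; otherwise leave
# the disk idle (non-work-conserving) and wait for a cheaper destage.
class LinearThresholdScheduler(LeastCostScheduler):
    MAX_COST = 200.0   # hypothetical ceiling at a full cache, in units
                       # of the toy seek-distance cost model

    def select_destage(self):
        if not self.destage_queue:
            return None
        best = min(self.destage_queue, key=self.estimate_access_time)
        threshold = self.MAX_COST * self.cache_occupancy()  # linear in occupancy
        if self.estimate_access_time(best) > threshold:
            return None            # too costly now; retry after the next event
        return self.pop_track_batch(best)
```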
E. Approximation to Linear Threshold Scheduling

A major difficulty in implementing both the least-cost scheduling and the linear threshold scheduling algorithms is the need to scan the entire queue of destage requests to select the minimum-cost request. This operation must be performed at the completion of every disk service. Furthermore, the cost of individual destage requests must be computed afresh if the position of the head has changed.

The need for searching the entire destage queue can be overcome by resorting to an approximate implementation of linear threshold scheduling; the approach works by dividing the disk into regions and maintaining a separate queue of destage requests for each region. When the disk becomes idle, the queues are searched in the order of the closest to the farthest region relative to the current position of the head. The first non-empty queue is found, and the request at its head is selected as the minimum-cost destage. As in the ideal linear threshold scheduling algorithm, the request is scheduled if its estimated service time is within the threshold determined by the current cache occupancy level. The requests within each of the destage queues can be ordered using the LRU policy to allow maximum temporal locality. As in the other scheduling algorithms, when a request is selected for destage, all the pending destages to the same track are performed in one disk access.

Both seek time and rotational latency must be taken into account in ordering the search of regions. Therefore, we used a partitioning scheme in which the disk cylinders are partitioned into circular bands; each band is then radially subdivided into regions. This is illustrated in Fig. 1(a) with respect to the example disk used in our simulation model. The example drive is an HP 97560 disk with a total of 1935 cylinders. The cylinders are partitioned into 15 circular bands, with 129 cylinders per band; each band is radially subdivided into three regions, to provide a total of 45 regions for the scheduling algorithm. This is similar to the disk partitioning scheme proposed by Jacobson and Wilkes [6], except that the cost of each destage is approximated as the average positioning delay from the current region to the target region. The average positioning delay from region i to region j is taken as the positioning delay from the center of region i to the center of region j.

The linear threshold scheme is approximated by designing a stepped threshold function that specifies the maximum destage cost for each level of cache occupancy. The destage cost is the average access time to a region, expressed as a multiple of the fractional rotational delay of the disk corresponding to a region. Table I shows the threshold function used in our example. The second column in the table provides the maximum allowable destage cost for each step of cache occupancy.
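A sketch of the approximate algorithm follows. The region count and disk size match the HP 97560 example above (45 regions over 1935 cylinders), but the rest is simplified: the track-to-region map ignores the band/radial geometry of Fig. 1(a), the cost proxy is region distance rather than the average positioning delay, and the STEPS table is a hypothetical stand-in for Table I, whose actual values are not reproduced here.

```python
# Approximate linear threshold scheduling: one LRU-ordered destage queue
# per disk region; regions are searched nearest-first and the head of the
# first non-empty queue is taken as the (approximate) minimum-cost destage,
# but only if that region lies within the occupancy-dependent step limit.
class ApproxLinearThresholdScheduler(DiskScheduler):
    N_REGIONS = 45    # 15 circular bands x 3 radial regions (HP 97560 example)
    TRACKS = 1935     # cylinders on the example drive
    # Hypothetical stepped threshold (stand-in for Table I): while cache
    # occupancy is below the first value, destage only within the given
    # region distance of the head.
    STEPS = ((0.25, 2), (0.50, 8), (0.75, 20), (1.01, N_REGIONS))

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.regions = [[] for _ in range(self.N_REGIONS)]

    def region_of(self, track: int) -> int:
        # Toy linear track-to-region map; the paper's regions follow the
        # band/radial partitioning of Fig. 1(a).
        return min(track * self.N_REGIONS // self.TRACKS, self.N_REGIONS - 1)

    def add_destage(self, req: DestageRequest):
        self.destage_queue.append(req)                        # occupancy count
        self.regions[self.region_of(req.track)].append(req)   # LRU order

    def select_destage(self):
        occ = self.cache_occupancy()
        limit = next((d for level, d in self.STEPS if occ < level), self.N_REGIONS)
        here = self.region_of(self.head_track)
        for dist in range(limit + 1):           # nearest regions first
            for r in {here - dist, here + dist}:
                if 0 <= r < self.N_REGIONS and self.regions[r]:
                    req = self.regions[r][0]    # head of the region's LRU queue
                    batch = self.pop_track_batch(req)
                    for b in batch:             # same track, hence same region
                        self.regions[r].remove(b)
                    return batch
        return None                             # wait for a cheaper destage
```

Because only the occupancy lookup and a bounded nearest-first region scan are needed per decision, this avoids rescanning and re-costing the full destage queue after every disk service, which is the overhead the approximation is designed to eliminate.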
