Gill et al.

(10) Patent No.:     US 7,574,556 B2
(45) Date of Patent: Aug. 11, 2009

(54) WISE ORDERING FOR WRITES COMBINING SPATIAL AND
     TEMPORAL LOCALITY IN WRITE CACHES
`
(75) Inventors: Binny S. Gill, San Jose, CA (US);
                Dharmendra S. Modha, San Jose, CA (US)

(73) Assignee: International Business Machines
               Corporation, Armonk, NY (US)
`
(*) Notice: Subject to any disclaimer, the term of this
    patent is extended or adjusted under 35
    U.S.C. 154(b) by 491 days.

(21) Appl. No.: 11/384,939
`
(22) Filed: Mar. 20, 2006

(65) Prior Publication Data
     US 2007/0220201 A1    Sep. 20, 2007

(51) Int. Cl.
     G06F 12/00    (2006.01)
(52) U.S. Cl. ................. 711/113; 711/130; 711/114
(58) Field of Classification Search ............. 711/4,
     711/113-114, 118, 130, 141-144
     See application file for complete search history.
(56) References Cited

U.S. PATENT DOCUMENTS

5,543,066 A      7/1996   Mattson et al.
5,754,888 A      5/1998   Yang et al.
6,272,662 B1     8/2001   Jadav et al.
6,341,331 B1     1/2002   McNutt
6,493,800 B1    12/2002   Blumrich
6,785,771 B2     8/2004   Ash et al.
6,865,647 B2     3/2005   Olarig et al.
6,865,648 B1     3/2005   Naamad et al.
6,898,672 B2     5/2005   Lambright et al.
2004/0078518 A1  4/2004   Kuwata
2005/0010722 A1  1/2005   Chen
2007/0220200 A1  9/2007   Gill
FOREIGN PATENT DOCUMENTS

JP    2001-249835 A    9/2002

OTHER PUBLICATIONS

Lee, Jung-Hoon et al., "Application-adaptive intelligent cache
memory system," ACM Transactions on Embedded Computing Systems
(TECS), vol. 1, Issue 1 (Nov. 2002), pp. 56-78.

(Continued)

Primary Examiner—Jasmine Song
(74) Attorney, Agent, or Firm—Jeffrey P. Aiello; Anthony V.
S. England
(57) ABSTRACT

A storage system has a storage controller for an array of
storage disks, the array being ordered in a sequence of write
groups. A write cache is shared by the disks. The storage
controller temporarily stores write groups in the write cache,
responsive to write groups being written, and lists the write
groups in order of their sequence in the array and in circular
fashion, so that a lowest one of the write groups is listed
next to a highest one. The storage controller selects the
listed write groups in rotating sequence. Such a write group
is destaged from the write cache to the disk responsive to i)
the selecting of the write group and ii) a state of a recency
indicator for the write group, wherein the recency indicator
shows recency of writing to the write group.
`
27 Claims, 4 Drawing Sheets

[Representative drawing: disks D1R1-D3R3 organized into ranks
R1-R3 with strips 102 and stripes 104; write groups WG1-WG9;
storage controllers CXR1-CXR3; per-rank sorted circular queues
Q1-Q3 in sequence order, each entry carrying a recency bit 310.]
`HPE, Exh. 1011, p. 1
`
`
`
`
OTHER PUBLICATIONS

Panda, P.R. et al., "Data and memory optimization techniques for
embedded systems," ACM Transactions on Design Automation of
Electronic Systems (TODAES), vol. 6, Issue 2 (Apr. 2001), pp.
149-206, http://www.ee.ucla.edu/~ingrid/Courses/Readingp149
pandaACM2001.pdf.
Song, Yonghong et al., "Data Locality Enhancement by Memory
Reduction," Proceedings of the 15th International Conference on
Supercomputing, Jun. 16-21, 2001, Sorrento, Italy, pp. 50-64,
http://www.cs.purdue.edu/homes/li/draftlics01.pdf.
Lee, Jung-Hoon et al., "Application-adaptive intelligent cache
memory system," ACM Transactions on Embedded Computing Systems
(TECS), vol. 1, Issue 1 (Nov. 2002), pp. 56-78.
Sermulins, Janis et al., "Cache Aware Optimization of Stream
Programs," Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference
on Languages, Compilers, and Tools for Embedded Systems, Jun.
15-17, 2005, Chicago, vol. 40, Issue 7, pp. 115-126, http://web.
mit.edu/rabbah/www/docs/sermulins-lctes-2005.pdf.
Marathe, Jaydeep et al., "METRIC: tracking down inefficiencies in
the memory hierarchy via binary rewriting," Performance Monitoring
Session, ACM International Conference Proceeding Series, vol. 37,
Proceedings of the International Symposium on Code Generation and
Optimization: Feedback-Directed and Runtime Optimization, Mar.
23-26, 2003, pp. 289-300, http://www.csl.cornell.edu/~sam/papers/
cgo03.pdf.
`
`
`
`
U.S. Patent    Aug. 11, 2009    Sheet 1 of 4    US 7,574,556 B2

[FIG. 1: block diagram of computer system 100, including
processors 115, volatile memory 127, nonvolatile memory 129, and
display device 137.]
`
`
`
[FIG. 2 (Sheet 2 of 4): storage ranks R1-R3 (100), each with a
storage controller and a shared write cache, and disks arranged
in strips 102 and stripes 104.]
`
`
`
`
[FIG. 3 (Sheet 3 of 4): write groups WG1-WG9 mapped across ranks;
for each rank R1-R3, a sorted circular queue (Q1-Q3) holding write
groups in sequence order, each with a recency bit 310; storage
controllers CXR1-CXR3.]
`
`
`
`
`
`
[FIG. 4 (Sheet 4 of 4): storage controller CXR1, containing
storing logic 410, listing logic 420, selecting logic 430, recency
logic 440, destaging logic 450, issuing logic 460, and queue 465.]
`
`
`
WISE ORDERING FOR WRITES COMBINING SPATIAL AND
TEMPORAL LOCALITY IN WRITE CACHES
`
CROSS-REFERENCE

This application is related to U.S. patent application Ser. No.
11/384,890, Wise Ordering For Writes Combining Spatial and
Temporal Locality in Write Caches For Multi-Rank Storage, filed
on the same date as the present application; the related
application is hereby incorporated herein by reference.
`
`BACKGROUND
`
`1. Field of the Invention
`The present invention concerns caching in computer sys
`tems and, more particularly, concerns write caching for disk
`storage devices.
`2. Related Art
`Processor speeds have grown about sixty percent per year,
`while electromechanical disk storage devices have improved
`their access times at a comparatively meager annual rate of
`about eight percent. At the same time, capacity of disk storage
`devices has grown more than fifty percent per year, which
`tends to reduce the amount of data available by parallel,
`concurrent disk access. By themselves, these trends would
`dictate that a processor must wait longer for disk read and
`write operations. Consequently, huge efforts have gone into
`hiding latency for disk bound computer applications.
`A cache has been described by the Storage Networking
`Industry Association as “A high speed memory or storage
`device used to reduce the effective time required to read data
`from or write data to a lower speed memory or device.”
`Caching is a fundamental technique in hiding I/O latency and
`is widely used in storage controllers, databases, file systems,
`and operating systems. A modern storage controller's cache
typically contains volatile memory used as a read cache and a
non-volatile memory used as a write cache. Non-volatile
storage ("NVS"), which is typically fast, but relatively expen
`sive, random access memory, enables writes to be stored
`quickly and safely in an NVS write cache and destaged, i.e.,
`written to disk storage, later in an asynchronous fashion,
`which hides the latency of writing to the disk storage, which
`is slower, but relatively less expensive than the NVS write
`cache. Read cache management is a well studied discipline
`and there are a large number of cache replacement algorithms
`in this context. In contrast, write caching is a relatively less
`developed subject.
`In destaging, a balance must be obtained. Data in the write
`cache must be destaged quickly enough so that there is always
`space for incoming writes, but not so quickly that the benefit
`of the write cache is under utilized. This balance has been
`addressed, for example, by linear threshold scheduling, in
`which the rate of destaging is varied in response to instanta
`neous occupancy of the write cache. It has also been
addressed by scheduling using least-cost and high/low
"watermarks."
`Another issue presented in destaging concerns the destag
`ing order. As long as the write cache is drained fast enough,
`the precise order in which data is destaged does not affect
`performance from the standpoint of fairness or of write
`requests becoming starved because write requests are consid
`ered complete, in most respects, once they are written to the
`write cache. However, destage ordering can crucially affect
`peak write throughput and performance of concurrent reads.
`Peak write throughput is affected by destage ordering because
the capacity of disks to physically support sequential or
nearly sequential write traffic is significantly higher than their
capacity to support random writes. That is, destaging writes in
an order that exploits this physical fact can significantly
improve peak write throughput. This was a fundamental moti
vation for development of log-structured file systems in a
`different context. Performance of concurrent reads is affected
`by destage ordering because to the extent destage ordering
`improves write performance, it improves read performance.
`That is, with concurrent reads and writes, the less time spent
`writing to a disk, the more time there is available to read from
`the disk.
As may be seen from this brief background, problems are
`presented in balancing tradeoffs involved in write caching.
`Problems are presented even regarding how to frame all the
`issues regarding write caching and in recognizing resultant
`interactions among the issues.
`
`SUMMARY OF THE INVENTION
`
`The present invention addresses the foregoing problem.
`According to one form of the invention, in a storage system
having an array of storage disks, the array is ordered in a
sequence of write groups and a write cache is shared by the
`disks. A method of destaging the write cache includes placing
`ones of the write groups in the write cache temporarily,
`responsive to the ones of the write groups being written to the
`array. The write groups stored in the write cache are listed in
`order of the sequence in the array. The listing is circular, so
`that a lowest one of the write groups is listed next to a highest
`one of the write groups. The listed write groups are selected in
`rotation according to the sequence. Such a listed write group
`is destaged from the write cache to the disk responsive to the
`selecting of the write group and a state of a recency indicator
`for the write group. The recency indicator shows recency of
`writing to the write group.
In another aspect, such a recency indicator includes a bit
`having one state indicating the write group has been written to
`recently and another state indicating the write group has not
`been written to recently.
In another aspect, the method includes setting such a
recency indicator for such a write group responsive to placing
`the write group in the write cache.
`In another aspect, the method includes clearing the state of
such a recency indicator responsive to the selecting of the
`indicator's respective write group, and wherein the destaging
`of a write group responsive to the write group's recency
`indicator includes destaging responsive to the indicator being
`cleared.
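The recency-indicator aspects above can be sketched as a CLOCK-style bit per write group. This is a minimal illustration with hypothetical names (`WriteGroup`, `on_write`, `should_destage`), not the patent's prescribed implementation: the bit is set when the group is written, cleared the first time the rotating selection reaches the group, and the group is destaged only when selected with the bit already clear.

```python
class WriteGroup:
    """A cached write group with a one-bit recency indicator."""
    def __init__(self, group_id):
        self.group_id = group_id
        self.recency = False  # set on write, cleared on first selection

def on_write(group):
    # Placing or updating the group in the write cache marks it
    # recently written.
    group.recency = True

def should_destage(group):
    """Called when the rotating selection reaches this group.

    Returns True if the group should be destaged now; otherwise the
    group's bit is cleared and the group is skipped for one rotation.
    """
    if group.recency:
        group.recency = False
        return False
    return True
```

A group written just before selection therefore survives exactly one extra rotation before becoming eligible for destage.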
In another aspect, the destaging of such write groups
includes issuing numerous, concurrent destaging write
requests to ones of the disks, and the method includes varying
`the concurrent write requests in number responsive to space
`in the shared write cache.
`In another aspect, varying the concurrent write requests
`includes issuing concurrent destaging write requests at a cer
`tain predetermined maximum rate responsive to write cache
space exceeding a certain high threshold.
`In another aspect, varying the concurrent write requests
`includes increasing the number of concurrent destaging write
`requests responsive to decreasing space in the write cache if
`the write cache space is within a range between a certain high
`threshold and a certain low threshold.
`In another aspect, varying the concurrent write requests
`includes issuing concurrent destaging write requests at a cer
`tain predetermined minimum rate responsive to write cache
`space falling below the low threshold for a write group that is
`a member of a sequence of write groups in the write cache, if
`a next listed write group for destaging is also a member of a
`sequence of write groups.
`In another aspect, varying the concurrent destaging write
`requests includes issuing no write requests responsive to the
`write cache space falling below the low threshold for a write
`group that is not a member of a sequence of write groups in the
`write cache.
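The threshold-based variation of concurrent destage requests described in the preceding aspects can be sketched as a mapping from write-cache occupancy to a request count: maximum rate above the high threshold, a ramp between the thresholds, and at most a minimal rate (for groups continuing a sequential run) below the low threshold. The function name, the linear ramp, and the parameter choices are illustrative assumptions, not values from the patent.

```python
def concurrent_destages(occupancy, low, high, max_requests,
                        sequential_next=False):
    """Map write-cache occupancy (fraction 0.0-1.0) to a number of
    concurrent destaging write requests.

    Illustrative policy only: thresholds, the maximum, and the
    linear ramp are assumptions.
    """
    if occupancy >= high:
        return max_requests              # drain at the maximum rate
    if occupancy < low:
        # Nearly empty cache: destage only to keep a sequential run
        # of write groups going; otherwise issue no requests.
        return 1 if sequential_next else 0
    # Between the thresholds: more occupancy, more concurrency.
    frac = (occupancy - low) / (high - low)
    return max(1, round(frac * max_requests))
```

For example, with thresholds of 20% and 90% and a cap of 20 requests, a half-full ramp position yields about half the cap.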
`In another form of the invention, a computer program
`product resides on a computer usable medium having com
`puter readable program code. The program code includes
`instructions for placing ones of the write groups in the write
`cache temporarily, responsive to the ones of the write groups
`being written to the array. The program code further includes
`instructions for listing the write groups stored in the write
`cache in order of the sequence in the array, wherein the listing
`is circular, so that a lowest one of the write groups is listed
`next to a highest one of the write groups. The program code
`further includes instructions for selecting the listed write
`groups in rotation according to the sequence. The program
code further includes instructions for destaging such a listed
`write group from the write cache to the disk responsive to i)
`the selecting of the write group and ii) a state of a recency
`indicator for the write group, wherein the recency indicator
`shows recency of writing to the write group.
`In another aspect, the program code includes instructions
for setting such a recency indicator for such a write group
`responsive to placing the write group in the write cache.
`In another aspect, the program code includes instructions
`for clearing the state of such a recency indicator responsive to
`the selecting of the indicator's respective write group, and the
`instructions for destaging of a write group responsive to the
`write group's recency indicator include instructions for
`destaging responsive to the indicator being cleared.
In another aspect, the instructions for destaging such write
groups include instructions for issuing numerous, concurrent
destaging write requests to ones
`of the disks, and the program code includes instructions for
`varying the concurrent write requests in number responsive to
`space in the shared write cache.
`In another aspect, the instructions for varying the concur
`rent write requests include instructions for issuing concurrent
`destaging write requests at a certain predetermined maximum
rate responsive to write cache space exceeding a certain
high threshold.
`In another aspect, the instructions for varying the concur
`rent write requests include instructions for increasing the
`number of concurrent destaging write requests responsive to
`decreasing space in the write cache if the write cache space is
`within a range between a certain high threshold and a certain
`low threshold.
`In another aspect, the instructions for varying the concur
`rent write requests include instructions for issuing concurrent
`destaging write requests at a certain predetermined minimum
`rate responsive to write cache space falling below the low
`threshold for a write group that is a member of a sequence of
`write groups in the write cache, if a next listed write group for
`destaging is also a member of a sequence of write groups.
`In another aspect, the instructions for varying the concur
`rent destaging write requests include instructions for issuing
`no write requests responsive to the write cache space falling
`below the low threshold for a write group that is not a member
`of a sequence of write groups in the write cache.
`In another form of the invention, a storage system has a
`storage controller and a write cache that is shared by an array
of storage disks. The array is ordered in a sequence of write
`groups. The storage controller includes storing logic for plac
`ing ones of the write groups in the write cache temporarily,
`responsive to the ones of the write groups being written to the
`array. The storage controller also includes listing logic for
`listing the write groups stored in the write cache in order of
`their sequence in the array, wherein the listing is circular, so
`that a lowest one of the write groups is listed next to a highest
`one of the write groups. The storage controller also includes
`selecting logic for selecting the listed write groups in rotation
`according to the sequence. The storage controller also
`includes destaging logic for destaging such a listed write
`group from the write cache to the disk responsive to i) the
`selecting of the write group and ii) a state of a recency indi
`cator for the write group, wherein the recency indicator shows
`recency of writing to the write group.
`In another aspect of the storage system, such a recency
`indicator includes a bit having one state indicating the write
`group has been written to recently and another state indicat
`ing the write group has not been written to recently.
`In another aspect, the storage system includes recency
logic for setting such a recency indicator for such a write
`group responsive to placing the write group in the write cache.
`In another aspect, the recency logic includes logic for clear
ing the state of such a recency indicator responsive to the
`selecting of the indicator's respective write group, and
`wherein the destaging logic includes logic for destaging
`responsive to the indicator being cleared.
`In another aspect, the destaging logic includes logic for
issuing numerous, concurrent destaging write requests to ones
`of the disks, and the system includes logic for varying the
`concurrent write requests in number responsive to space in
`the shared write cache.
`In another aspect, the logic for varying the concurrent write
`requests includes logic for issuing concurrent destaging write
`requests at a certain predetermined maximum rate responsive
to write cache space exceeding a certain high threshold.
`In another aspect, the logic for varying the concurrent write
`requests includes logic for increasing the number of concur
`rent destaging write requests responsive to decreasing space
`in the write cache if the write cache space is within a range
`between a certain high threshold and a certain low threshold.
`In another aspect, the logic for varying the concurrent write
`requests includes logic for issuing concurrent destaging write
`requests at a certain predetermined minimum rate responsive
`to write cache space falling below the low threshold for a
`write group that is a member of a sequence of write groups in
`the write cache, if a next listed write group for destaging is
`also a member of a sequence of write groups.
`In another aspect, the logic for varying the concurrent
`destaging write requests includes logic for issuing no write
`requests responsive to the write cache space falling below the
`low threshold for a write group that is not a member of a
`sequence of write groups in the write cache.
`Other variations, objects, advantages, and forms of the
`invention will become apparent upon reading the following
`detailed description and upon reference to the accompanying
`drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The foregoing and other objects, aspects and advantages
`will be better understood from the following detailed descrip
`tion of a preferred embodiment(s) of the invention with ref
`erence to the drawings, in which:
`FIG. 1 illustrates a storage system, according to an embodi
`ment of the present invention.
`FIG. 2 illustrates certain details of the storage system of
`FIG. 1, according to an embodiment of the present invention.
`
`FIG. 3 illustrates certain additional details of the storage
`system of FIGS. 1-2, according to an embodiment of the
`present invention.
`FIG. 4 illustrates certain additional details of one of the
`storage controllers of FIGS. 1-3, according to an embodiment
`of the present invention.
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS OF THE INVENTION
`
In the following detailed description of the preferred
embodiments, reference is made to the accompanying drawings
illustrating embodiments in which the invention may be
practiced. It should be understood that other embodiments
may be utilized and changes may be made without departing
from the scope of the present invention. The drawings and
detailed description are not intended to limit the invention to
the particular form disclosed. On the contrary, the intention is
to cover all modifications, equivalents and alternatives falling
within the spirit and scope of the present invention as defined
by the appended claims. Headings herein are not intended to
limit the subject matter in any way.

Overview

There are numerous incentives to improving write cache
destaging performance, and, hence, overall write performance,
which also leads to improved read performance, as previously
stated. For one thing, the widely used redundant array of
independent disks ("RAID") technology tends to make writes
slower than reads. Also, write caches are commonly smaller than
read caches. Further, client-side caching does not significantly
benefit write operations as it does read operations. Also, the
proportion of write operations in the widely used storage
benchmark SPC-1 is larger than the proportion of read
operations.

The present invention involves a recognition that write
destaging performance depends upon a combination of factors,
including (i) the total number of destages to disks, namely, the
write miss ratio, and (ii) the average cost of each destage. In
the absence of any read operations, it is important for a write
destage algorithm to minimize the product of these factors.
Even in the presence of concurrent reads, the product of these
factors tends to minimize the amount of time that the disk
heads are occupied in serving writes, which tends to minimize
the average read response time and maximize aggregate
throughput.

To minimize the first factor, the present invention seeks to
exploit temporal locality, which may favor destaging segments
of data least likely to be written to while in the write cache,
and thus most likely to ultimately be destaged. Classically, a
least recently used ("LRU") policy has been used to exploit
temporal locality in read caches. In write caches this
translates to a least recently written ("LRW") policy. Other
policies that have been used in read caches for exploiting
temporal locality include CLOCK, FBR, LRU-2, 2Q, LRFU,
LIRS, MQ, ARC and CAR, all of which attempt to reduce the
miss ratio in a read cache.

To minimize the second factor, the present invention seeks
to exploit spatial locality, which may favor destaging segments
of data located closest together on the disks, and thus subject
to the fastest write operation, according to the heads and the
geometry of the disks in the system. A number of algorithms
are known for exploiting spatial locality, including those
which seek to serve I/O's on a first-come-first-serve ("FCFS")
basis, a shortest-seek-time-first ("SSTF") basis, a shortest
access time first ("SATF") basis, and on the basis that I/O's
are served first in increasing order and then in decreasing
order of their logical addresses ("SCAN"). There are many other
variants known as LOOK, VSCAN, FSCAN, Shortest Positioning
Time First ("SPTF"), GSTF and WSTF, and largest segment per
track ("LST").

These known algorithms that are advantageous regarding
spatial locality may depend upon a detailed knowledge of the
instantaneous position of a disk head and the location of each
data segment relative to the disk head, particularly for read
operations. However, the present invention is in the context of
a storage controller at a level in a memory hierarchy that is
somewhat removed from the actual disks themselves. Further, it
is for write operations instead of read operations. In this
context, most disk parameters are hidden, due to RAID, for
example, so that spatial locality is generally hard to exploit.
For example, it has been found that applying SATF at a high
level in the memory hierarchy such as that of a storage
controller was not practical. See L. Huang et al., "Experience
in Building a Software Based SATF Scheduler," Tech. Rep.
ECSL-TR81, SUNY at Stony Brook (July 2001) ("... modern disks
have too many internal control mechanisms that are too
complicated to properly account for in the disk service time
model. This exercise lead us to conclude that software based
SATF disk schedulers are less and less feasible as the disk
technology evolves."), also ("Even when a reasonably accurate
software based SATF disk scheduler can be successfully built,
the performance gain over a SCAN based disk scheduler that it
can realistically achieve appears to be insignificant...").
This conclusion regarding SATF was for mere single-disk
applications. For applications having an array of redundant
disks, as in the context of the present invention, the
conclusion is all the more certain. Previous work has attempted
to apply principles of both spatial and temporal locality to
the problem of write cache control by partitioning the write
cache into a "hot" zone managed according to an LRW policy and
a "cold" zone managed via an LST policy. However, this work
only deals with interaction between the cache and one disk.
T. R. Haining, "Non-volatile Cache Management for Improving
Write Response Time with Rotating Magnetic Media," p. 126,
Ph.D. Dissertation, University of California, Santa Cruz
(2000). Also, even with regard to a single disk, further work
is needed to develop an adaptive algorithm for tuning the size
of the hot zone. See id. at p. 125 ("One of the most immediate
aspects of this work requiring more research is the method to
determine the size of the hot zone for the stack model based
replacement algorithm. We determined the best size for the hot
zone empirically in our experiments"). The work does not fully
address how to determine the best adaptive partition. In
addition, the hot zone optimizes for temporal locality, whereas
the cold zone optimizes for spatial locality.

In seeking to apply principles of both spatial and temporal
locality for write cache control in the context of a storage
controller, i.e., in upper levels of the memory hierarchy, the
present invention applies processes related to LRW for temporal
locality and CSCAN for spatial locality. In CSCAN, I/O's are
served only in increasing order of their logical addresses. A
difficulty arises because the order of destaging that is
suggested by LRW tends to be different than that of CSCAN.
This difficulty is addressed in the present invention by a
novel combination, referred to herein as Wise Ordering for
Writes ("WOW"), wherein write groups in the write cache are
maintained in a CSCAN-like arrangement, but with recency bits
akin to those of CLOCK. (For details of CLOCK, see F. J.
Corbato, "A paging experiment with the multics system," In
Honor of P. M. Morse, pp. 217-228, MIT Press, 1969.) Destaging
is skipped for write groups to which data has been recently
written, as indicated by the recency bits.
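The WOW arrangement of a CSCAN-like sorted order plus CLOCK-like recency bits can be sketched as follows. This is a hypothetical simplification (class and method names are illustrative; the patent's write groups, queues, and thresholds are richer than this): write groups are kept sorted by logical address in a circular queue, a rotating pointer selects the next group, and a group with a set recency bit is skipped once, with its bit cleared, before it is destaged.

```python
import bisect

class WOWQueue:
    """Sketch of Wise Ordering for Writes: write groups sorted by
    logical address (CSCAN order, wrapped circularly), each with a
    recency bit as in CLOCK. Illustrative only."""

    def __init__(self):
        self.addrs = []    # sorted logical addresses of cached groups
        self.recency = {}  # addr -> recency bit
        self.head = 0      # rotating destage pointer into self.addrs

    def write(self, addr):
        if addr not in self.recency:
            i = bisect.bisect_left(self.addrs, addr)
            self.addrs.insert(i, addr)
            if i < self.head:      # keep the pointer on the same group
                self.head += 1
        self.recency[addr] = True  # mark the group recently written

    def next_destage(self):
        """Rotate through the sorted addresses, skipping (and
        clearing) recently written groups; destage the first group
        whose recency bit is already clear."""
        while self.addrs:
            self.head %= len(self.addrs)
            addr = self.addrs[self.head]
            if self.recency[addr]:
                self.recency[addr] = False
                self.head += 1           # skip for one rotation
            else:
                self.addrs.pop(self.head)
                del self.recency[addr]
                return addr              # destage this group
        return None
```

Because destage order follows the sorted addresses rather than write order, the disks see nearly sequential traffic, while the recency bits hold back groups still likely to absorb further writes.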
`
`
`
`Described another way, the present invention frames the
`destaging problem as interrelated issues concerning what
`write data to destage and when to destage the selected write
`data. According to the present invention, the issues are
`defined and addressed by a novel combination of a particular
`kind of adaptive water mark and a particular kind of linear
threshold scheduling.
`Example of Destaging Context
`Referring now to FIG. 1, a storage system 100 is shown
`according to an embodiment of the present invention. System
`100 takes the form of a computer system. It should be under
stood that the term "computer system" is intended to encom
`pass any device having a processor that executes instructions
`from a memory medium, regardless of whether referred to in
`terms of a microcontroller, personal computer system, main
frame computer system, workstation, server, or in some other
`terminology. Computer system 100 includes processors 115,
`a volatile memory 127, e.g., RAM and a nonvolatile memory
`129. Memories 127 and 129 store program instructions (also
known as a "software program"), which are executable by
`processors 115, to implement various embodiments of a soft
`ware program in accordance with the present invention. Pro
`cessor 115 and memories 127 and 129 are interconnected by
`bus 140. An input/output adapter (not shown) is also con
`nected to bus 140 to enable information exchange between
`processors 115 and other devices or circuitry. System 100 also
`includes a keyboard 133, pointing device 130, e.g., mouse,
`floppy disk, CD-ROM, and DVD, and a display device 137.
In the illustrated embodiment, system 100 is an IBM
xSeries 345 computer equipped with two Intel Xeon 2 GHz
processors 115. In the illustrated embodiment, nonvolatile
memory 129 includes 10K RPM, SCSI, DDR disks D1R1
through DNR1, D1R2 through DNR2, and D1R3 through
DNR3, each disk being 36.4 GB. The number of ranks
`and number of disks N in each rank may vary in different
`embodiments, of course.
`Nonvolatile memory 129 also includes another disk (not
shown) that is used for an operating system, software appli
`cations, and workloads. In other embodiments the operating
system may be on multiple disks or on some other nonvolatile
`store, not necessarily a disk. In another embodiment the oper
`ating system may even be programmed in specialized chip
`hardware. A Linux kernel runs on system 100 for hosting the
`applications and standard workload generators. Memory 129
`also includes ROM, which is not shown, and may include
other devices, which are also not shown, such as floppy disks,
`CD-ROMs, and DVDs.
`Referring now to FIG. 2, certain details of system 100 are
`illustrated according to an embodiment of the invention. SCSI
`disks, D1R1, etc., are configured as three RAID-10 arrays R1,
`R2, and R3, each having a respective set of data disks, each set
`being N in number. For example, array R1 has data disks
`D1R1, D2R1, D3R1, etc. through DNR1. Arrays R1,R2 and
R3 have respective RAID storage controllers CXR1, CXR2
and CXR3, each for controlling its respective set of the
RAID-configured disks. Each disk has its own disk controller.
For example, disk D1R1 has its own disk controller C1R1, as
`shown.
`In a RAID-10 implementation illustrated in FIG. 2, storage
`controllers CXR1, CXR2 and CXR3 have respective write
`caches WXR1, WXR2 and WXR3, which are managed in
`terms of 64 KB strip 102 write groups. Strip 102 size may
`vary, of course. Write cache WXR1 is shared by the disks
`D1R1, etc. of array R1, WXR2 is shared by the disks D1R2,
etc. of array R2 and WXR3 is shared by the disks D1R3, etc.
`of array R3.
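Since the write caches here are managed in terms of 64 KB strip write groups, mapping a logical block address to its write group reduces to integer division. The helper below is a hypothetical illustration that assumes 512-byte sectors and ignores RAID striping details, which the embodiment leaves to the controllers:

```python
STRIP_SIZE = 64 * 1024  # 64 KB write-group (strip) size, per the example
SECTOR = 512            # assumed sector size in bytes

def write_group_of(lba):
    """Return the write-group (strip) number containing a given
    512-byte logical block address."""
    return (lba * SECTOR) // STRIP_SIZE
```

With these assumptions, LBAs 0-127 fall in write group 0, LBA 128 starts write group 1, and so on.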
`
`In an alternative embodiment of the invention, disks D1R1
`DNR1, D1R2-DNR2 and D1R3-DNR3 are configured as
`RAID-5 arrays. For the RAID-5 implemen