`
United States Patent
Cochran

(10) Patent No.: US 7,058,850 B2
(45) Date of Patent: Jun. 6, 2006

(54) METHOD AND SYSTEM FOR PREVENTING DATA LOSS WITHIN DISK-ARRAY PAIRS SUPPORTING MIRRORED LOGICAL UNITS

(75) Inventor: Robert A. Cochran, Rocklin, CA (US)

(73) Assignee: Hewlett-Packard Development Company, L.P., Houston, TX (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by ... days.

(21) Appl. No.: 10/210,368

(22) Filed: Jul. 31, 2002

(65) Prior Publication Data
     US 2004/0078638 A1    Apr. 22, 2004

(51) Int. Cl.
     G06F 11/00 (2006.01)

(52) U.S. Cl. .......................... 714/6; 714/4; 714/43

(58) Field of Classification Search .......... 714/43, 4, 42, 6
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     6,543,001 B1 *  4/2003  LeCrone et al.
     6,587,970 B1 *  7/2003  Wang et al. ............... 714/47
     6,691,245 B1 *  2/2004  DeKoning ................. 714/6
     6,728,898 B1 *  4/2004  Tremblay et al. ........... 714/6
     6,785,678 B1 *  8/2004  Price ..................... 707/8
     6,816,951 B1 * 11/2004  Kimura et al. ............. 711/162
 2002/0099916 A1 *  7/2002  Ohran et al. .............. 711/162

     * cited by examiner

Primary Examiner: Robert Beausoliel
Assistant Examiner: Christopher McCarthy

(57) ABSTRACT

An additional communications link between two mass-storage devices containing LUNs of a mirrored-LUN pair, as well as incorporation of a fail-safe, mass-storage-device-implemented retry protocol to facilitate non-drastic recovery from communications-link failures within the controllers of the two mass-storage devices, prevents build-up of WRITE requests in cache and subsequent data loss due to multiple communications-link and host-computer failures. The combination of the additional link and the retry protocol ameliorates a deficiency in current LUN-mirroring implementations that often leads to data loss and inconsistent and unrecoverable databases.

12 Claims, 22 Drawing Sheets

[Front-page figure: two host computers interconnected by a LAN or WAN (714), each attached to a mass-storage device; the two mass-storage devices are directly interconnected by an ESCON, ATM, T3, or other link.]
`
`
`
[Drawing sheets 1 through 22: FIGS. 1A, 1B, 2, 3, 4, 5, 6, 7, 8A-C, 9A, 9B, 10A-E, 11, 12A-C, 13, 14, 15, and 16A-D, described in the Brief Description of the Drawings below.]
`
METHOD AND SYSTEM FOR PREVENTING DATA LOSS WITHIN DISK-ARRAY PAIRS SUPPORTING MIRRORED LOGICAL UNITS
`
`TECHNICAL FIELD
`
The present invention relates to the mirroring of logical units provided by disk arrays and other multi-logical-unit mass-storage devices and, in particular, to a method and system for preventing data loss resulting from host-computer and communications-link failures that interrupt data flow between a primary, or dominant, logical unit on a first mass-storage device and a secondary, remote-mirror logical unit on a second mass-storage device.
`
`BACKGROUND OF THE INVENTION
`
`10
`
`15
`
`2
`communications medium. For many types of Storage
`devices, including the disk drive 301 illustrated in FIG. 3,
`the vast majority of I/O requests are either READ or WRITE
`requests. A READ request requests that the storage device
`return to the requesting remote computer some requested
`amount of electronic data stored within the storage device.
`A WRITE request requests that the storage device store
`electronic data furnished by the remote computer within the
`storage device. Thus, as a result of a READ operation
`carried out by the storage device, data is returned via
`communications medium 302 to a remote computer, and as
`a result of a WRITE operation, data is received from a
`remote computer by the storage device via communications
`medium 302 and stored within the storage device.
`The disk drive storage device illustrated in FIG. 3
`includes controller hardware and logic 303 including elec
`tronic memory, one or more processors or processing cir
`cuits, and controller firmware, and also includes a number of
`disk platters 304 coated with a magnetic medium for storing
`electronic data. The disk drive contains many other compo
`nents not shown in FIG. 3, including READ/WRITE heads,
`a high-speed electronic motor, a drive shaft, and other
`electronic, mechanical, and electromechanical components.
`The memory within the disk drive includes a request/reply
`buffer 305, which stores I/O requests received from remote
`computers, and an I/O queue 306 that stores internal I/O
`commands corresponding to the I/O requests stored within
`the request/reply buffer 305. Communication between
`remote computers and the disk drive, translation of I/O
`requests into internal I/O commands, and management of
`the I/O queue, among other things, are carried out by the
`disk drive I/O controller as specified by disk drive I/O
`controller firmware 307. Translation of internal I/O com
`mands into electromechanical disk operations in which data
`is stored onto, or retrieved from, the disk platters 304 is
`carried out by the disk drive I/O controller as specified by
`disk media read/write management firmware 308. Thus, the
`disk drive I/O control firmware 307 and the disk media
`read/write management firmware 308, along with the pro
`cessors and memory that enable execution of the firmware,
`compose the disk drive controller.
`Individual disk drives, such as the disk drive illustrated in
`FIG. 3, are normally connected to, and used by, a single
`remote computer, although it has been common to provide
`dual-ported disk drives for concurrent use by two computers
`and multi-host-accessible disk drives that can be accessed by
`numerous remote computers via a communications medium
`such as a fibre channel. However, the amount of electronic
`data that can be stored in a single disk drive is limited. In
`order to provide much larger-capacity electronic data-Stor
`age devices that can be efficiently accessed by numerous
`remote computers, disk manufacturers commonly combine
`many different individual disk drives, such as the disk drive
`illustrated in FIG.3, into a disk array device, increasing both
`the storage capacity as well as increasing the capacity for
`parallel I/O request servicing by concurrent operation of the
`multiple disk drives contained within the disk array.
`FIG. 4 is a simple block diagram of a disk array. The disk
`array 402 includes a number of disk drive devices 403, 404,
`and 405. In FIG. 4, for simplicity of illustration, only three
`individual disk drives are shown within the disk array, but
`disk arrays may contain many tens or hundreds of individual
`disk drives. A disk array contains a disk array controller 406
`and cache memory 407. Generally, data retrieved from disk
`drives in response to READ requests may be stored within
`the cache memory 407 so that subsequent requests for the
`same data can be more quickly satisfied by reading the data
`
`The present invention is related to mirroring of data
`contained in a dominant logical unit of a first mass-storage
`device to a remote-mirror logical unit provided by a second
`mass-storage device. An embodiment of the present inven
`tion, discussed below, involves disk-array mass-storage
`devices. To facilitate that discussion, a general description of
`disk drives and disk arrays is first provided.
`The most commonly used non-volatile mass-storage
`device in the computer industry is the magnetic disk drive.
`In the magnetic disk drive, data is stored in tiny magnetized
`regions within an iron-oxide coating on the Surface of the
`disk platter. A modern disk drive comprises a number of
`platters horizontally stacked within an enclosure. The data
`within a disk drive is hierarchically organized within various
`logical units of data. The surface of a disk platter is logically
`divided into tiny, annular tracks nested one within another.
`FIG. 1A illustrated tracks on the surface of a disk platter.
`Note that, although only a few tracks are shown in FIG. 1A,
`Such as track 101, an actual disk platter may contain many
`thousands of tracks. Each track is divided into radial sectors.
`FIG. 1B illustrates sectors within a single track on the
`Surface of the disk platter. Again, a given disk track on an
`actual magnetic disk platter may contain many tens or
`hundreds of sectors. Each sector generally contains a fixed
`number of bytes. The number of bytes within a sector is
`generally operating-system dependent, and normally ranges
`from 512 bytes per sector to 4096 bytes per sector. The data
`normally retrieved from, and stored to, a hard disk drive is
`in units of sectors.
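
To make the relationship between these structures concrete, the following Python sketch models a request/reply buffer feeding an I/O queue of internal commands, in the spirit of elements 305-308 of FIG. 3. It is a minimal illustrative model only; the class and field names (IORequest, DiskControllerModel, and so on) are assumptions introduced here and do not come from the patent.

from collections import deque
from dataclasses import dataclass

@dataclass
class IORequest:
    op: str                 # "READ" or "WRITE"
    sector: int             # logical sector address
    data: bytes = b""       # payload carried by WRITE requests

class DiskControllerModel:
    """Toy analogue of FIG. 3: a request/reply buffer (cf. 305) holding I/O
    requests received from remote computers, and an I/O queue (cf. 306)
    holding the internal commands derived from them."""

    def __init__(self):
        self.request_reply_buffer = deque()    # received requests and replies
        self.io_queue = deque()                 # internal I/O commands
        self.sectors = {}                       # stand-in for the disk platters

    def receive(self, request: IORequest) -> None:
        # Controller firmware (cf. 307): accept the request and translate it
        # into an internal command queued for the media-management firmware.
        self.request_reply_buffer.append(request)
        self.io_queue.append(request)

    def service_one(self):
        # Media read/write management (cf. 308): execute one internal command.
        if not self.io_queue:
            return None
        req = self.io_queue.popleft()
        if req.op == "WRITE":
            self.sectors[req.sector] = req.data
            return b""                           # empty reply acknowledges the WRITE
        return self.sectors.get(req.sector, b"\x00" * 512)
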
Individual disk drives, such as the disk drive illustrated in FIG. 3, are normally connected to, and used by, a single remote computer, although it has been common to provide dual-ported disk drives for concurrent use by two computers, and multi-host-accessible disk drives that can be accessed by numerous remote computers via a communications medium such as a fibre channel. However, the amount of electronic data that can be stored in a single disk drive is limited. In order to provide much larger-capacity electronic data-storage devices that can be efficiently accessed by numerous remote computers, disk manufacturers commonly combine many different individual disk drives, such as the disk drive illustrated in FIG. 3, into a disk array device, increasing both the storage capacity and the capacity for parallel I/O request servicing by concurrent operation of the multiple disk drives contained within the disk array.

FIG. 4 is a simple block diagram of a disk array. The disk array 402 includes a number of disk drive devices 403, 404, and 405. In FIG. 4, for simplicity of illustration, only three individual disk drives are shown within the disk array, but disk arrays may contain many tens or hundreds of individual disk drives. A disk array contains a disk array controller 406 and cache memory 407. Generally, data retrieved from disk drives in response to READ requests may be stored within the cache memory 407 so that subsequent requests for the same data can be more quickly satisfied by reading the data from the quickly accessible cache memory rather than from the much slower electromechanical disk drives. Various elaborate mechanisms are employed to maintain, within the cache memory 407, data that has the greatest chance of being subsequently re-requested within a reasonable amount of time. The disk array controller may also store WRITE requests in cache memory 407, in the event that the data may be subsequently requested via READ requests, or in order to defer slower writing of the data to the physical storage medium.
Electronic data is stored within a disk array at specific addressable locations. Because a disk array may contain many different individual disk drives, the address space represented by a disk array is immense, generally many thousands of gigabytes. The overall address space is normally partitioned among a number of abstract data-storage resources called logical units ("LUNs"). A LUN includes a defined amount of electronic data-storage space, mapped to the data-storage space of one or more disk drives within the disk array, and may be associated with various logical parameters including access privileges, backup frequencies, and mirror coordination with one or more LUNs. LUNs may also be based on random access memory ("RAM"), mass-storage devices other than hard disks, or combinations of memory, hard disks, and/or other types of mass-storage devices. Remote computers generally access data within a disk array through one of the many abstract LUNs 408-415 provided by the disk array via internal disk drives 403-405 and the disk array controller 406. Thus, a remote computer may specify a particular unit quantity of data, such as a byte, word, or block, using a bus communications media address corresponding to a disk array, a LUN specifier, normally a 64-bit integer, and a 32-bit, 64-bit, or 128-bit data address within the logical data-address partition allocated to the LUN. The disk array controller translates such a data specification into an indication of a particular disk drive within the disk array and a logical data address within the disk drive. A disk drive controller within the disk drive finally translates the logical address to a physical medium address. Normally, electronic data is read and written as one or more blocks of contiguous 32-bit or 64-bit computer words, the exact details of the granularity of access depending on the hardware and firmware capabilities within the disk array and individual disk drives, as well as the operating system of the remote computers generating I/O requests and characteristics of the communication medium interconnecting the disk array with the remote computers.
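
As a concrete illustration of this two-stage translation, the sketch below maps a LUN-relative block address to a member disk drive and a logical block on that drive. The striped LUN-to-disk mapping shown is an assumption chosen for simplicity; the patent does not prescribe a particular mapping policy, and the names (LUN, translate) are illustrative.

from dataclasses import dataclass

@dataclass
class LUN:
    member_disks: list          # indices of the disk drives backing this LUN

def translate(lun: LUN, lun_block: int):
    """Map a LUN-relative block address to (disk drive, logical block on that
    drive), assuming simple striping across the LUN's member disks.  The disk
    drive's own controller would then map the logical block to a physical
    medium address."""
    stripe_width = len(lun.member_disks)
    disk = lun.member_disks[lun_block % stripe_width]
    logical_block = lun_block // stripe_width
    return disk, logical_block

# Example: a LUN backed by disk drives 3, 4, and 5 of the array.
lun0 = LUN(member_disks=[3, 4, 5])
print(translate(lun0, 10))      # block 10 of the LUN -> (disk 4, logical block 3)
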
In many computer applications and systems that need to reliably store and retrieve data from a mass-storage device, such as a disk array, a primary data object, such as a file or database, is normally backed up to backup copies of the primary data object on physically discrete mass-storage devices or media, so that if, during operation of the application or system, the primary data object becomes corrupted, inaccessible, or is overwritten or deleted, the primary data object can be restored by copying a backup copy of the primary data object from the mass-storage device. Many different techniques and methodologies for maintaining backup copies have been developed. In one well-known technique, a primary data object is mirrored.

FIG. 5 illustrates object-level mirroring. In FIG. 5, a primary data object "O" 501 is stored on LUN A 502. The mirror object, or backup copy, "O'" 503 is stored on LUN B 504. The arrows in FIG. 5, such as arrow 505, indicate I/O write operations directed to various objects stored on a LUN. I/O write operations directed to object "O" are represented by arrow 506. When object-level mirroring is enabled, the disk array controller providing LUNs A and B automatically generates a second I/O write operation from each I/O write operation 506 directed to LUN A, and directs the second generated I/O write operation via path 507, switch "S1" 508, and path 509 to the mirror object "O'" 503 stored on LUN B 504. In FIG. 5, enablement of mirroring is logically represented by switch "S1" 508 being on. Thus, when object-level mirroring is enabled, any I/O write operation, or any other type of I/O operation that changes the representation of object "O" 501 on LUN A, is automatically mirrored by the disk array controller to identically change the mirror object "O'" 503. Mirroring can be disabled, represented in FIG. 5 by switch "S1" 508 being in an off position. In that case, changes to the primary data object "O" 501 are no longer automatically reflected in the mirror object "O'" 503. Thus, at the point that mirroring is disabled, the stored representation, or state, of the primary data object "O" 501 may diverge from the stored representation, or state, of the mirror object "O'" 503. Once the primary and mirror copies of an object have diverged, the two copies can be brought back to identical representations, or states, by a resync operation, represented in FIG. 5 by switch "S2" 510 being in an on position. In the normal mirroring operation, switch "S2" 510 is in the off position. During the resync operation, any I/O operations that occurred after mirroring was disabled are logically issued by the disk array controller to the mirror copy of the object via path 511, switch "S2", and path 509. During resync, switch "S1" is in the off position. Once the resync operation is complete, logical switch "S2" is disabled and logical switch "S1" 508 can be turned on in order to re-enable mirroring, so that subsequent I/O write operations or other I/O operations that change the storage state of primary data object "O" are automatically reflected to the mirror object "O'" 503.
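
The following sketch models the behavior of logical switches "S1" and "S2" described above: writes are duplicated to the mirror while S1 is on, accumulate while mirroring is disabled, and are replayed by a resync pass. It is a simplified illustrative model; the class name MirroredObject and the explicit pending list are assumptions made here for exposition.

class MirroredObject:
    """Toy analogue of FIG. 5: writes to the primary object "O" on LUN A are
    duplicated to the mirror "O'" on LUN B while switch S1 is on; writes made
    while mirroring is disabled are replayed by a resync pass (switch S2)."""

    def __init__(self):
        self.primary = {}        # object "O" on LUN A
        self.mirror = {}         # mirror object "O'" on LUN B
        self.s1_on = True        # mirroring enabled
        self.pending = []        # writes made while S1 was off

    def write(self, key, value):
        self.primary[key] = value
        if self.s1_on:
            self.mirror[key] = value            # second, generated I/O write
        else:
            self.pending.append((key, value))   # replayed later during resync

    def disable_mirroring(self):
        self.s1_on = False

    def resync(self):
        # Switch S2: replay the operations that occurred while mirroring was
        # disabled, then turn S1 back on so later writes are mirrored again.
        for key, value in self.pending:
            self.mirror[key] = value
        self.pending.clear()
        self.s1_on = True
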
FIG. 6 illustrates a dominant LUN coupled to a remote-mirror LUN. In FIG. 6, a number of computers and computer servers 601-608 are interconnected by various communications media 610-612 that are themselves interconnected by additional communications media 613-614. In order to provide fault tolerance and high availability for a large data set stored within a dominant LUN on a disk array 616 coupled to server computer 604, the dominant LUN 616 is mirrored to a remote-mirror LUN provided by a remote disk array 618. The two disk arrays are separately interconnected by a dedicated communications medium 620. Note that the disk arrays may be linked to server computers, as with disk arrays 616 and 618, or may be directly linked to communications medium 610. The dominant LUN 616 is the target for READ, WRITE, and other disk requests. All WRITE requests directed to the dominant LUN 616 are transmitted by the dominant LUN 616 to the remote-mirror LUN 618, so that the remote-mirror LUN faithfully mirrors the data stored within the dominant LUN. If the dominant LUN fails, the requests that would have been directed to the dominant LUN can be redirected to the mirror LUN without a perceptible interruption in request servicing. When operation of the dominant LUN 616 is restored, the dominant LUN 616 may become the remote-mirror LUN for the previous remote-mirror LUN 618, which becomes the new dominant LUN, and may be resynchronized to become a faithful copy of the new dominant LUN 618. Alternatively, the restored dominant LUN 616 may be brought up to the same data state as the remote-mirror LUN 618 via data copies from the remote-mirror LUN and then resume operating as the dominant LUN. Various types of dominant-LUN/remote-mirror-LUN pairs have been devised. Some operate entirely synchronously, while others allow for asynchronous operation and reasonably slight discrepancies between the data states of the dominant LUN and the mirror LUN.
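
The difference between the two modes can be sketched as follows: a synchronous mirror completes a WRITE only after both copies are updated, while an asynchronous mirror acknowledges the WRITE once the dominant LUN holds the data and forwards the mirror update later, allowing the two data states to diverge briefly. The function names and the use of a simple queue are illustrative assumptions, not part of the patent.

import queue

def synchronous_write(dominant: dict, remote: dict, block: int, data: bytes) -> str:
    """Synchronous mirroring: the WRITE completes only after both the dominant
    LUN and the remote-mirror LUN hold the data."""
    dominant[block] = data
    remote[block] = data            # remote update completes before the WRITE returns
    return "acknowledged"

def asynchronous_write(dominant: dict, outbound: queue.Queue, block: int, data: bytes) -> str:
    """Asynchronous mirroring: the WRITE is acknowledged once the dominant LUN
    holds the data; the mirror update is queued and applied later, so the two
    data states may briefly differ."""
    dominant[block] = data
    outbound.put((block, data))     # forwarded to the remote mirror afterwards
    return "acknowledged"
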
Unfortunately, interruptions in the direct communications between disk arrays containing a dominant LUN and a remote-mirror LUN of a mirrored-LUN pair occur relatively frequently. Currently, when communications are interrupted or suffer certain types of failures, data may end up languishing in cache-memory buffers and, in the worst cases, may be purged from cache-memory buffers or lost due to system failures. Designers and manufacturers of mass-storage devices, such as disk arrays, and users of mass-storage devices and of high-availability and fault-tolerant systems that employ mass-storage devices, have recognized the need for a more reliable LUN-mirroring technique and system that can weather communications failures and host-computer failures.
`
`SUMMARY OF THE INVENTION
`
One embodiment of the present invention provides an additional communications link between two mass-storage devices containing LUNs of a mirrored-LUN pair, as well as incorporating a fail-safe, mass-storage-device-implemented retry protocol to facilitate non-drastic recovery from communications-link failures. The additional communications link between the two mass-storage devices greatly reduces the likelihood of the loss of buffered data within the mass-storage device containing the dominant LUN of a mirrored-LUN pair, and the retry protocol prevents unnecessary build-up of data within cache-memory buffers of the mass-storage device containing the remote-mirror LUN. The combination of the additional communications link and the retry protocol together ameliorates a deficiency in current LUN-mirroring implementations that leads to data loss and inconsistent and unrecoverable databases. The additional communications link provided by the present invention is physically distinct and differently implemented from the direct communications link between the two mass-storage devices, to provide greater robustness in the event of major hardware failure.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`45
`
`50
`
`FIG. 1A illustrated tracks on the surface of a disk platter.
`FIG. 1B illustrates sectors within a single track on the
`surface of the disk platter.
`FIG. 2 illustrates a number of disk platters aligned within
`a modern magnetic disk drive.
`FIG. 3 is a block diagram of a standard disk drive.
`FIG. 4 is a simple block diagram of a disk array.
`FIG. 5 illustrates object-level mirroring.
`FIG. 6 illustrates a dominant logical unit coupled to a
`remote-mirror logical unit.
`FIG. 7 shows an abstract representation of the commu
`nications-link topography currently employed for intercon
`55
`necting mass-storage devices containing the dominant and
`remote-mirror logical units of a mirrored-logical-unit pair.
`FIGS. 8A-C illustrates a communications-link failure that
`results in purging of the cache memory within the mass
`storage device containing a remote-mirror logical unit.
`FIGS. 9A and 9B illustrate a normal WRITE-request
`buffer, such as the input queue 826 of the second mass
`storage device in FIG. 8C, and a bit-map buffer, such as the
`bit map 846 in FIG. 8C.
`FIGS. 10A-E illustrates an example of a detrimental,
`out-of-order WRITE request applied to a mass-storage
`device.
`
`60
`
`65
`
`US 7,058,850 B2
`
`6
`FIG. 11 shows the final stage in recovery from the missing
`WRITE request problem illustrated in FIG. 8A-C.
`FIGS. 12A–C illustrates an error-recovery technique
`employed to handle communications-link failures.
`FIGS. 13 and 14 illustrate the occurrence of multiple
`failures, leading to data loss within the mass-storage devices
`of FIGS. 8A-C, 11, and 12A-C.
`FIG. 15 illustrates an enhanced communications topology
`that represents a portion of one embodiment of the present
`invention.
`FIGS. 16A-D illustrates operation of the exemplary
`mass-storage devices using the techniques provided by one
`embodiment of the present invention.
`
DETAILED DESCRIPTION OF THE INVENTION
`
One embodiment of the present invention provides a more communications-fault-tolerant mirroring technique that prevents loss of data stored in electronic cache memory for relatively long periods of time due to host-computer failures and communications failures. In the discussion below, the data-loss problems are described in detail, followed by a description of an enhanced mass-storage-device pair and an enhanced high-level communications protocol implemented in the controllers of the mass-storage devices.
FIG. 7 shows an abstract representation of the communications-link topography currently employed for interconnecting mass-storage devices containing the dominant and remote-mirror LUNs of a mirrored-LUN pair. A first mass-storage device 702 is interconnected with a first host computer 704 via a small-computer-systems-interface ("SCSI"), fibre-channel ("FC"), or other type of communications link 706. A second mass-storage device 708 is interconnected with a second host computer 710 via a second SCSI or FC communications link 712. The two host computers are interconnected via a local-area network ("LAN") or wide-area network ("WAN") 714. The two mass-storage devices 702 and 708 are directly interconnected, for purposes of mirroring, by one or more dedicated enterprise-systems-connection ("ESCON"), asynchronous-transfer-mode ("ATM"), FC, T3, or other types of links 716. The first mass-storage device 702 contains a dominant LUN of a mirrored-LUN pair, while the second mass-storage device 708 contains the remote-mirror LUN of the mirrored-LUN pair.
FIGS. 8A-C illustrate a communications-link failure that leads to a purge of cache memory within the mass-storage device containing a remote-mirror LUN. In FIG. 8A, data to be written to physical data-storage devices within a first mass-storage device 802 is transmitted by a host computer 804 through a SCSI, FC, or other type of link 806 to the mass-storage device 802. In FIGS. 8A-C, and in FIGS. 11A-D, 12, 13, and 15A-D, which employ similar illustration conventions to those employed in FIGS. 8A-C, incoming WRITE commands are illustrated as small square objects, such as incoming WRITE command 808, within a communications path such as the SCSI, FC, or other type of link 806. Each WRITE request contains a volume or LUN number, followed by a slash, followed, in turn, by a sequence number. WRITE requests are generally sequenced by high-level protocols so that WRITE requests can be applied, in order, to the database contained within volumes or LUNs stored on one or more physical data-storage devices. For example, both in FIGS. 8A-C and in the subsequent figures identified above, LUN "0" is mirrored to a remote mirror stored within physical data-storage devices of a second mass-storage device 810, interconnected with the first mass-storage device 802 by one or more ESCON, ATM, FC, T3, or other type of communications links 812.

The controller 814 of the first mass-storage device 802 detects WRITE requests directed to dominant LUN "0" and directs copies of the WRITE requests to the second mass-storage device 810 via an output buffer 816 stored within cache memory 818 of the mass-storage device 802. The WRITE requests directed to the dominant LUN, and to other LUNs or volumes provided by the first mass-storage device, are also directed to an input buffer 820, from which the WRITE requests are subsequently extracted and executed to store data on physical data-storage devices 822 within the first mass-storage device. Similarly, the duplicate WRITE requests transmitted by the first mass-storage device through the ESCON, ATM, FC, T3, or other type of link or links 812 are directed by the controller 824 of the second mass-storage device 810 to an input buffer 826 within a cache memory 822 of the second mass-storage device, for eventual execution and storage of data on the physical data-storage devices 830 within the second mass-storage device 810.
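
The buffering arrangement just described, and elaborated in the next paragraph, can be sketched as follows: the dominant-side controller queues duplicated WRITE requests, labeled LUN/sequence-number, for transmission and keeps each transmitted request buffered until the remote device acknowledges it. The sketch uses ordinary queues rather than the circular queues with head and tail pointers used in actual implementations; MirrorOutputBuffer and its methods are names invented here for illustration.

from collections import deque

class MirrorOutputBuffer:
    """Toy analogue of output buffer 816 on the dominant side: WRITE requests
    duplicated toward the remote-mirror device wait here to be transmitted,
    and remain buffered after transmission until the remote device
    acknowledges them.  Requests are labeled LUN/sequence-number, as in
    FIGS. 8A-C."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.to_transmit = deque()      # duplicated WRITEs not yet sent
        self.awaiting_ack = {}          # (lun, seq) -> data, sent but unacknowledged

    def enqueue(self, lun, seq, data):
        if len(self.to_transmit) + len(self.awaiting_ack) >= self.capacity:
            raise BufferError(f"output buffer full; WRITE {lun}/{seq} rejected")
        self.to_transmit.append((lun, seq, data))

    def transmit_next(self):
        """Hand the next WRITE to the inter-device link; keep it until acknowledged."""
        if not self.to_transmit:
            return None
        lun, seq, data = self.to_transmit.popleft()
        self.awaiting_ack[(lun, seq)] = data
        return lun, seq, data

    def acknowledge(self, lun, seq):
        """An acknowledgement from the remote device frees the entry's slot."""
        self.awaiting_ack.pop((lun, seq), None)
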
In general, the output buffer 816 within the first mass-storage device is used both as a transmission queue and as a storage buffer for holding already-transmitted WRITE requests until an acknowledgment for the already-transmitted WRITE requests is received from the second mass-storage device. Thus, for example, in FIG. 8A, the next WRITE request to be transmitted 832 appears in the middle of the output buffer, above already-transmitted WRITE requests 834-839. When an acknowledgement for a transmitted WRITE request is received from the second mass-storage device, the output buffer 818 entry corresponding to the acknowledged, transmitted WRITE request can be overwritten by a new incoming WRITE request. In general, output buffers are implemented as circular queues with dynamic head and tail pointers. Also note that, in FIGS. 8A-C, and in the subsequent, related figures identified above, the cache-memory buffers are shown to be rather small, containing only a handful of messages. In actual mass-storage devices, by contrast, electronic cache memories may provide as much as 64 gigabytes of data storage. Therefore, output a