(12) United States Patent                         (10) Patent No.:     US 6,779,082 B2
     Burger et al.                                (45) Date of Patent: Aug. 17, 2004

(54) NETWORK-BASED DISK REDUNDANCY STORAGE SYSTEM AND METHOD

(75) Inventors: Eric William Burger, McLean, VA (US); Walt… Ore…, … NH (US);
                Andy Spitzer, North Andover, MA (US); Barry David Wessler,
                Potomac, MD (US)

(73) Assignee: Ulysses ESD, Inc., San Jose, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or
              adjusted under 35 U.S.C. 154(b) by 476 days.

(21) Appl. No.: 09/777,776

(22) Filed: Feb. 5, 2001

(65) Prior Publication Data: US 2002/0144058 A1, Oct. 3, 2002

(51) Int. Cl.: G06F 12/00

(52) U.S. Cl.: 711/114; 711/162; 711/4

(58) Field of Search: 711/114, 162, 112, 4

(56) References Cited

     U.S. PATENT DOCUMENTS
     5,511,177 A *  4/1996  Kagimasa et al. ..... 711/114
     5,673,381 A *  9/1997  Huai et al.
     5,751,883 A    5/1998  Ottesen et al.
     5,819,310 A * 10/1998  Vishlitzky et al. ... 711/114
     6,138,139 A * 10/2000  Beck et al. ......... 709/202
     6,167,494 A * 12/2000  Cheston et al. ...... 711/162
     6,298,356 B1  10/2001  Jawahar et al. ...... 707/201
     6,467,034 B1* 10/2002  Yanaka .............. 711/162
     6,493,825 B1* 12/2002  Blumenau et al. ..... 713/168

     FOREIGN PATENT DOCUMENTS
     WO  PCT/US02/03315  5/2002

     OTHER PUBLICATIONS
     Udo Kelter, "Discretionary Access Controls in a High-Performance Object
     Management System", IEEE 1991, pp. 288-299.
     * cited by examiner

Primary Examiner—Donald Sparks
Assistant Examiner—Brian R. Peugh
(74) Attorney, Agent, or Firm—Morgan, Lewis & Bockius LLP

(57) ABSTRACT

An embodiment of the invention described in the specification and drawings is a distributed and highly available data storage system. In one embodiment, the distributed data storage system includes a plurality of data storage units that are controlled by an object management system. The object management system preferentially selects the distributed data storage units for performing the file access requests according to the external inputs/outputs with which the file access requests are associated. In response to a file creation request that is associated with an external input of one distributed data storage unit, the object management system preferentially creates a data file in that distributed data storage unit. In response to a file retrieval request that is associated with a data file and an external output of a distributed data storage unit, the object management system preferentially returns a hostname and pathname of a copy of the data file that is stored within that distributed data storage unit. The object management system also makes redundant copies of the data files in different units to provide high availability of data.

15 Claims, 8 Drawing Sheets
`400
`
`|— 419
`
`|-— 420
`
`Application sends new fila
`request (associated with
`an extemal I/O
`connection) to OMS.
`
`OMSidentifies and
`preferentially selects the
`distributed data storage
`unit thatis assoclated
`with the extemal VO
`
`connection.
`
`OMScalls the name
`service onthe selacted|439
`data storage unit, and
`asks for a uniquefile
`nameto be allocated.
`
` ¥
`The selected data storage
`unit assigns a file name|— 440
`that is unique within the
`
`data storageunit.
`
`Application racords
`informationin the
`selected data storage
`unit, using the assigned
`file narne.
`
`_— 450
`
`
[Sheet 1 of 8, FIG. 1: block diagram of data storage system 100, showing external I/O lines into distributed data storage units 130a-130n, network connections to network switch 105, OMS manager units 110a (primary) and 110b (secondary), and application server 150.]
`
`
[Sheet 2 of 8, FIG. 2: block diagram of distributed data storage unit 130a: CPU 202, network interface 204 (to network switch 105), memory 206 holding operating system 232, networking module 234, external I/O module 236, and object management system 240 components (file naming service 242, file copying service 244), mass storage subsystem 208, external I/O subsystem 210 with external I/O lines, and bus 212.]
`
`
[Sheet 3 of 8, FIG. 3: block diagram of OMS manager unit 110a: CPU 302, network interface 304 (to network switch 105), memory 306 holding operating system 232, networking module 234, external I/O module 236, and object management system 240 components (file naming service 242, file copying service 244, OMS work queue 246, unit selector 248, OMS file mapping table 250, OMS file state table 252, OMS unit state table 254, file creation module 260, file replication module 270, file retrieval module 280), mass storage subsystem 308, and optional external I/O subsystem 310 with external I/O lines.]
`
`
[Sheet 4 of 8, FIG. 4: flow diagram 400 for creating a new file: (410) the application sends a new-file request, associated with an external I/O connection, to the OMS; (420) the OMS identifies and preferentially selects the distributed data storage unit associated with that external I/O connection; (430) the OMS calls the name service on the selected data storage unit and asks for a unique file name to be allocated; (440) the selected data storage unit assigns a file name that is unique within the data storage unit; (450) the application records information in the selected data storage unit using the assigned file name.]
`
`
`
[Sheet 5 of 8, FIG. 5: flow diagram 500 for committing a file to redundant storage: (510) the application sends a replication request to the OMS; (520) the OMS queues the replication request; (530) the OMS selects a target data storage unit; (540) the OMS stores the source file information and notes that the file is not redundant; (550) the OMS contacts the target unit's name service, requesting a new file allocation; (560) the OMS contacts the target unit's file copy service, requesting a copy from source file to target file; (570) upon completion, the OMS stores the target file information and marks the target file as a redundant copy.]
`
`
`
[Sheet 6 of 8, FIG. 6: flow diagram 600 for retrieving a file: (610) the application contacts the OMS with the name of the source file in the application name-space ("handle"); (620) the OMS queues the file retrieval request; (630) the OMS looks up the "handle," preferentially selects the copy stored in the data storage unit with the most idle capacity, and returns the hostname and pathname of the file to the application; (640) the application retrieves the file by passing the hostname and pathname to the appropriate data storage unit.]
`
`
`
[Sheet 7 of 8, FIG. 7: flow diagram 700 for copying a file: (710) the application sends a file copy request to the OMS; (720) the OMS queues the file copy request; the OMS then increases the link count of the file and updates the OMS file mapping table with any new handle.]
`
`
`
[Sheet 8 of 8, FIG. 8: flow diagram 800 for deleting a file: (810) the application sends a file delete request to the OMS; (820) the OMS queues the file delete request; (830) the OMS removes any application-name to OMS-name-space mapping and decrements the link count; (840) the OMS determines if the link count for that file has reached zero; (850) if the link count has reached zero, the OMS calls the name service of all the distributed data storage units that contain copies of the file, requesting the service to remove them.]
`
`
`
`NETWORK-BASED DISK REDUNDANCY
`STORAGE SYSTEM AND METHOD
`
`BRIEF DESCRIPTION OF THE INVENTION
`
The present invention relates generally to computer data storage. More specifically, the present invention relates to a high-availability data storage methodology for a computer network.
`
`BACKGROUND OF THE INVENTION
`
`
RAID (Redundant Array of Inexpensive Disks) technology, which uses multiple disk drives attached to a host computer, is a way of making a data store highly available. The RAID controller or host software makes a redundant copy of the data, either by duplicating the writes (RAID 1), establishing a parity disk (RAID 3), or establishing a parity disk with striped writes (RAID 5). Greater levels of redundancy can be achieved by increasing the number of redundant copies.
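For background only (this is not the claimed invention), the parity idea behind RAID 3 and RAID 5 can be sketched as a simple XOR over the striped data blocks; the block contents below are illustrative:

```python
# Illustrative only: XOR parity as used conceptually by RAID 3/5.
# Any one lost data block can be rebuilt from the surviving blocks
# plus the parity block.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data_blocks = [b"AAAA", b"BBBB", b"CCCC"]      # striped data
parity = xor_blocks(data_blocks)               # parity block

# Simulate losing block 1 and rebuilding it from parity + survivors.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert rebuilt == data_blocks[1]
```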
Although RAID technology provides a highly available disk array, data availability is not guaranteed. For instance, if the host computer fails, data becomes unavailable regardless of how many redundant disk arrays are used. In order to provide an even higher level of data availability, dual-ported arrays, which are accessible by two host computers, are used. The two host computers establish a protocol between them so that only one writes to a given disk segment at a time. If one host computer fails, the surviving host computer can take over the work of the failed computer. This type of configuration is typical in network file servers or database servers.
`
`
A disadvantage of dual-ported disk arrays, however, is that they use a number of rather expensive components. Dual-ported RAID controllers are expensive. Moreover, a complex protocol is used by the hosts to determine which is allowed to write to each disk and when they are allowed to do so. Often, host manufacturers charge a substantial premium for clustering software.

Besides the high cost of system components, another disadvantage of dual-ported disk array systems is that the number of host computers that can simultaneously access the disk array is restricted. In dual-ported disk array systems, data must be accessed via one or the other host computer. Thus, the number of data access requests that can be serviced by a disk array system is limited by the processing capability of the host computers.

Yet another disadvantage with multi-ported disk arrays is that expanding the storage requires upgrading the disk array (for storage) or the host computer (for processing). There are RAID arrays that expand by adding disks on carrier racks. However, once a carrier rack is full, the only way to expand the array is to get a new, larger one. The same situation holds for the host computer. Some host computers, such as the Sun 6500 from Sun Microsystems of Mountain View, Calif., may be expanded by adding more processors and network interfaces. However, once the computer is full of expansion cards, one needs to buy a new computer to expand.
SUMMARY OF THE INVENTION
`
An embodiment of the present invention is a distributed and highly available data storage system. In one embodiment, the distributed data storage system includes a network of data storage units that are controlled by an object management system. Significantly, whenever data is written to one data storage unit, the object management system makes a redundant copy of that data in another data storage unit. The object management system preferentially selects the distributed data storage units for performing the file access requests according to the external inputs/outputs with which the file access requests are associated. In response to a file creation request that is associated with an external input of one distributed data storage unit, the object management system will preferentially create a data file in that distributed data storage unit. In response to a file retrieval request that is associated with a data file and an external output of another distributed data storage unit, the object management system will preferentially return a hostname and pathname of a copy of the data file that is stored within that distributed data storage unit. The object management system also makes redundant copies of the data files in different units to provide high availability of data.
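As a rough, non-authoritative sketch of the application-facing operations this summary describes, the OMS can be pictured as exposing create, replicate, and retrieve calls that deal in handles and hostname/pathname pairs; the class and method names below are hypothetical, not taken from the patent:

```python
# Minimal sketch (not the patent's code) of the application-facing calls the
# summary describes. OmsClient, create_file, etc. are hypothetical names.
from dataclasses import dataclass

@dataclass
class FileLocation:
    hostname: str   # data storage unit holding the copy
    pathname: str   # path of the copy within that unit

class OmsClient:
    def create_file(self, handle: str, external_io: str) -> FileLocation:
        """Create a file, preferring the unit wired to this external I/O."""
        raise NotImplementedError

    def replicate_file(self, handle: str) -> None:
        """Ask the OMS to commit the file to redundant storage."""
        raise NotImplementedError

    def retrieve_file(self, handle: str) -> FileLocation:
        """Return hostname/pathname of a preferred copy of the file."""
        raise NotImplementedError
```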
An aspect of the present invention is that it is not necessary to use expensive RAID servers. Rather, inexpensive, commodity disk servers can be used. The distributed and highly available data storage system is also highly scalable, as additional disk servers can be added according to the storage requirement of the network.

Another aspect of this invention is that dedicated servers for the disk service functions are not required. Disk service functions can be integrated into each data storage unit. The data storage units may be implemented using relatively low-cost, general-purpose computers, such as so-called desktop computers or personal computers (PCs). This aspect is of importance to applications where I/O, CPU, and storage resources follow a proportional relationship.

Yet another aspect of the present invention is that users of the system are not tied to any specific one of the data storage units. Thus, individual users may exceed the storage capacity of any given data storage unit. Yet another aspect of the present invention is that an expensive TDM (Time Division Multiplexed) switching infrastructure is not required. An inexpensive high-speed Ethernet network is sufficient to provide the necessary interconnection. Yet another aspect of the present invention is that the data storage system is scalable to the number of its external I/Os.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
For a better understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram illustrating a data storage system according to an embodiment of the present invention.

FIG. 2 is a block diagram illustrating the components of a distributed data storage unit in accordance with an embodiment of the present invention.

FIG. 3 is a block diagram illustrating the components of an OMS manager unit in accordance with an embodiment of the present invention.

FIG. 4 is a flow diagram illustrating the operations of the data storage system of FIG. 1 when creating a new file.

FIG. 5 is a flow diagram illustrating the operations of the data storage system of FIG. 1 when making a redundant copy of a file.

FIG. 6 is a flow diagram illustrating the operations of the data storage system of FIG. 1 when an application is retrieving a file.

FIG. 7 is a flow diagram illustrating the operations of the data storage system of FIG. 1 when an application copies a file.
`
FIG. 8 is a flow diagram illustrating the operations of the data storage system of FIG. 1 when an application deletes a file.

Like reference numerals refer to corresponding parts throughout the drawings.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
`
Reference will now be made in detail to the preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present invention.

System Components of the Data Storage System of the Present Invention

FIG. 1 is a block diagram illustrating a data storage system 100 according to an embodiment of the present invention. As illustrated, the data storage system 100 includes a network switch 105 coupled to distributed data storage units 130a-130n and OMS (Object Management System) managers 110a-110b. One embodiment of the present invention is implemented using a 100BaseTX Ethernet network, and thus the network switch 105 is a high-speed Ethernet switch, such as the Nortel Networks Accelar 1200. In other embodiments of the invention, other types of networks, such as an ATM network, may be used to interconnect the distributed data storage units 130a-130n and the OMS managers 110a-110b. Also illustrated is an application server 150 that may be coupled to access the data storage system 100 via the network switch 105. Application programs, such as voice message application programs, may reside on the application server 150.

The distributed data storage units 130a-130n are the units of storage and disk redundancy. In the present embodiment, each of the distributed data storage units 130a-130n has a plurality of external input/output (I/O) lines for coupling to an external system, such as a private branch exchange (PBX) system. Each of the distributed data storage units 130a-130n also has its own processing resources. In one embodiment, each distributed data storage unit is implemented using a low-cost general-purpose computer.

The object management system (OMS) of the data storage system 100 resides on the distributed data storage units 130a-130n and two OMS managers 110a-110b. The OMS provides name translation, object location, and redundancy management for the system 100. The OMS uses a closely-coupled redundancy scheme to provide a highly-available object management system service.

In the present embodiment, the OMS manager resides on two computer systems to provide high-availability and fault-tolerance capability. That is, if the primary OMS manager 110a crashes or otherwise becomes unavailable, the secondary OMS manager 110b may be used. In other embodiments, the object management system may run on specialized data processing hardware, or on a single fault-tolerant computer.
FIG. 2 is a block diagram illustrating the components of the distributed data storage unit 130a in accordance with an embodiment of the present invention. Components of the distributed data storage units 130b-130n are similar to those of the illustrated unit. As shown, data storage unit 130a includes a central processing unit (CPU) 202, a network interface 204 for coupling to network switch 105, a memory 206 (which may include random access memory as well as disk storage and other storage media), a mass-storage subsystem 208 (which may include a disk subsystem for storing voice mail messages), an external I/O subsystem 210 (which may include one or more voice cards for communicating with a public switched telephone network), and one or more buses 212 for interconnecting the aforementioned elements of system 130a.

The network interface 204 provides the appropriate hardware and software layers to implement networking of the distributed data storage units. In the preferred embodiment, the network interface 204 is a 100BaseTX Ethernet network interface, running the TCP/IP network stack.

The external I/O subsystem 210 provides the appropriate hardware and software layers to implement the interface to the outside world for the server. It may be another Ethernet interface to serve web pages, for example. It may be a Natural Microsystems AG4000c to interface with the Public Switched Telephone Network. In the preferred embodiment, it is a Natural Microsystems CG6000c to interface with the packet telephony network. It can be a combination of these or other external interfaces. Alternately, the external I/O subsystem 210 may be a virtual interface: one can serve TCP/IP-based services over the network interface 204. It should be noted that the external I/O subsystem is optional. For example, the distributed data storage unit 130a can simply be a file server for the network, using the network interface 204 for service access.

The mass storage subsystem 208 provides file service to the CPU 202. In the present embodiment, the mass storage subsystem 208 runs the VxFS file system from Veritas. Alternate embodiments include the Unix File System (UFS) or the Windows NT File System (NTFS).

Operations of the distributed data storage unit 130a are controlled primarily by control programs that are executed by the unit's central processing unit 202. In a typical implementation, the programs and data structures stored in the system memory 206 will include:

an operating system 232 (such as Solaris, Linux, or Windows NT) that includes procedures for handling various basic system services and for performing hardware-dependent tasks;

networking software 234, which is a component of Solaris, Linux, and Windows 2000;

applications 236 related to the external I/O subsystem (e.g., an inbound voice message storage module for storing voice messages in user voice mailboxes, a voice message playback module, etc.); and

necessary components of the object management system 240.

The components of the object management system 240 that reside in memory 206 of the distributed data storage unit 130a preferably include the following:

a file naming service 242; and

a file copying service 244.
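A minimal sketch, not the patent's code, of what the per-unit file naming service 242 and file copying service 244 might look like; the directory layout, naming scheme, and local-only copy are simplifying assumptions:

```python
# Minimal sketch of the two per-unit OMS services: a file naming service that
# hands out names unique within the unit, and a file copying service that
# copies a source file into a locally allocated file.
import itertools
import os
import shutil

class FileNamingService:
    """Allocates file names that are unique within this data storage unit."""
    def __init__(self, root: str):
        self.root = root
        self._counter = itertools.count(1)

    def allocate(self, suffix: str = ".vox") -> str:
        os.makedirs(self.root, exist_ok=True)
        return os.path.join(self.root, f"f{next(self._counter):08d}{suffix}")

class FileCopyingService:
    """Copies a source file (local here for simplicity) to a target path."""
    def copy(self, source_path: str, target_path: str) -> None:
        shutil.copyfile(source_path, target_path)
```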
FIG. 3 illustrates the components of an OMS manager unit 110a in accordance with an embodiment of the present invention. Components of the secondary OMS manager unit 110b are similar to those of the illustrated unit 110a. As shown, OMS manager unit 110a includes a central processing unit (CPU) 302, a network interface 304 for coupling to network switch 105, a memory 306 (which may include random access memory as well as disk storage and other storage media), a mass-storage subsystem 308 (which may
`
`
include a disk subsystem for storing voice mail messages), and one or more buses 312 for interconnecting the aforementioned elements of system 110a. The OMS manager unit 110a may also include an optional external I/O subsystem 310.

The OMS manager unit 110a may include components similar to those of the distributed data storage unit 130a. Operations of the OMS manager unit 110a are controlled primarily by control programs that are executed by the system's central processing unit 302. The software running on the OMS manager unit 110a, however, may be different. Particularly, as shown in FIG. 3, the programs and data structures stored in the system memory 306 may include:

an operating system 232 (such as Solaris, Linux, or Windows NT) that includes procedures for handling various basic system services and for performing hardware-dependent tasks;

networking software 234, which is a component of Solaris, Linux, and Windows 2000;

applications 236 related to the external I/O subsystem (e.g., an inbound voice message storage module for storing voice messages in user voice mailboxes, a voice message playback module, etc.); and

necessary components of the object management system 240.

The components of the object management system 240 that reside on the OMS manager unit 110a include the following:

a file naming service 242;

a file copying service 244;

an OMS work queue 246;

a unit selector module 248;

an OMS file mapping table 250;

an OMS file state table 252; and

an OMS unit state table 254.

According to the present embodiment, the file naming service 242 is for obtaining a unique file name in the OMS manager unit 110a. The file copying service 244 is for copying files to and from the OMS manager unit 110a. The OMS work queue 246 is for storing file access requests from the applications. The unit selector module 248 is for selecting one of the distributed data storage units 130a-130n for carrying out the file access or duplication requests. The OMS file mapping table 250 stores the correlation between a file's name in the application name-space (or "handle") and the actual location of the file. The OMS file state table 252 stores the status of the files stored by the data storage system 100. The OMS file state table 252 also keeps track of a "link count" for each of the files stored by the data storage system 100. The OMS unit state table 254 stores the status of the distributed data storage units 130a-130n.

The secondary OMS manager unit can take over when the primary OMS manager unit is down.

Tables 1-4 below illustrate an exemplary OMS work queue 246, OMS file mapping table 250, OMS file state table 252, and OMS unit state table 254, and their respective contents.
`
`35
`
`40
`
`55
`
`TABLE 1
`
`OMS Work Qucuc
`
`handle
`
`hostname
`
`pathname
`
`command
`
`MyFileName
`MyOtherName
`DeleteThis
`
`Unit3
`Unit2
`
`/infiles/V00,1/infile.tif
`/infiles/V00,1/voice.vox
`
`new
`copy
`delete
`
`60
`
`65
`
`6
`
`TABLE 2
`
`OMSFile Mapping Table
`
`handle
`MyOtherName
`MyOtherName
`Delete Vhis
`DelctcThis
`
`hostname
`Unit2
`Unit5
`Unit?
`Unit1
`
`pathname
`jinfiles/VUU,1/voice.vox
`fu2/V99,7/£19283.vox
`/ul/V23,44/2308tasd.tif
`/infiles/V21,8/3q49-7n.tit
`
`TABLE 3
`
`OMSFile State Table
`
`handle
`
`MyFilcName
`MyOthcrName
`AnotherFile
`
`slale
`
`New
`OK
`OK
`
`link count
`
`1
`2
`1
`
`TABLE 4
`
`OMS Unit State Table
`
`hostname
`
`Unitl
`Unit2
`Unit3
`Unit4
`Units
`Unit6
`Unit7
`Unit8
`
`state
`
`UP
`MAINT
`uP
`DOWN
`uP
`uP
`UP
`MAINT
`
`Operations of the OMS 240 will be discussed in greater
`detail below.
`Operations of the Object Management System
FIG. 4 is a flow diagram 400 illustrating the operations of the data storage system 100 when creating a new file. As shown, in step 410, when an application (e.g., a voice message application program running on application server 150) needs to create a new data file, the application sends a request to the object management system (OMS) 240 of the data storage system 100. Preferably, the request for a new file has an association with an external I/O connection. The request is preferably sent to the primary OMS manager unit 110a. Then, in step 420, the file creation module 260 of the OMS 240 identifies and preferentially selects the distributed data storage unit that is associated with the external I/O connection. But if the data storage unit that is associated with the external I/O connection is unavailable, the OMS selects an available data storage unit. The physical I/O stream from the external I/O connection is then converted into data packets, which are transmitted across the network and stored at the selected data storage unit.

With reference still to FIG. 4, in step 430, the file creation module 260 then calls the name service of the selected distributed data storage unit, asking for a unique file name to be allocated. In step 440, the name service of the selected data storage unit then assigns a file name that is unique within the particular distributed data storage system. In step 450, after the distributed data storage unit creates the file, the application then records information into the file.
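A minimal sketch, not the patent's code, of the FIG. 4 create-file flow as seen by the OMS; the unit_for_external_io and allocate_name callables stand in for configuration and for the per-unit name service, and the dict-based tables are simplifying assumptions:

```python
# Minimal sketch of the FIG. 4 flow: prefer the unit tied to the request's
# external I/O, fall back to any unit that is UP, then ask that unit's name
# service for a unique name and record the new, not-yet-redundant file.
import random

def create_file(handle, external_io, unit_state, file_mapping, file_state,
                unit_for_external_io, allocate_name):
    # Step 420: prefer the data storage unit wired to this external I/O.
    unit = unit_for_external_io(external_io)
    if unit_state.get(unit) != "UP":
        unit = random.choice([u for u, s in unit_state.items() if s == "UP"])

    # Steps 430/440: the selected unit's name service allocates a unique name.
    pathname = allocate_name(unit)

    # Step 450 happens in the application; here we only record the location
    # and note that the file is new and has a single reference.
    file_mapping[handle] = [(unit, pathname)]
    file_state[handle] = {"state": "New", "link_count": 1}
    return unit, pathname
```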
According to one particular embodiment of the present invention, the data storage system 100 may be implemented as part of a voice message system. In this embodiment, a new file needs to be created for recording a new message when a call comes in on an external I/O connection. A voice message application, detecting that a call is coming in, will preferentially create a new file for recording the voice stream of the call. In the present example, the request for the new file is sent to the distributed data storage unit associated with the incoming call. Thus, the same data storage unit receiving the physical I/O stream will be used for recording the I/O stream.
FIG. 5 is a flow diagram 500 illustrating the operations of the data storage system 100 when committing a file to redundant storage. As shown, in step 510, when the application is ready to commit the file to redundant storage, the application makes a replication request to the OMS 240. The replication request includes the source hostname, the name of the file to be replicated, and the name of the replicated file. In step 520, the OMS queues the replication request in the OMS work queue 246. If the application needs to know immediately when replication is complete, the OMS 240 may perform the replication immediately and may synchronously inform the application through synchronous remote procedure call mechanisms.
With reference still to FIG. 5, in step 530, when the OMS 240 works through the OMS work queue 246 and finds a replication request, the file replication module 270 of the OMS 240 selects a target data storage unit for copying the file. In one embodiment, the replication module 270 uses the selector module 248, which has knowledge of the current state of each distributed data storage unit 130a-130n. The selector module 248 selects a target unit based on current disk, CPU, and I/O utilization. The selector module 248 may also allow a newly installed distributed data storage unit to get the bulk of copies without overwhelming it. Alternately, the selector module 248 may use less sophisticated algorithms. For instance, the selector module 248 may always pick the distributed data storage unit to the "left" of the source data storage unit. The selector module 248 may also randomly pick one of the distributed data storage units 130a-130n for storing the replicated file.
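A minimal sketch of the utilization-based selection rule described for the selector module 248; the load figures and equal weighting are illustrative assumptions, and the text equally allows the simpler "left neighbor" or random rules:

```python
# Minimal sketch of the unit selector idea in step 530: pick the UP unit,
# other than the source, with the lowest combined disk/CPU/IO utilization.
def select_target_unit(units, source_unit):
    """`units` maps hostname -> dict with keys "state", "disk", "cpu", "io"
    (utilization figures in the range 0.0-1.0)."""
    candidates = [
        (info["disk"] + info["cpu"] + info["io"], name)
        for name, info in units.items()
        if name != source_unit and info["state"] == "UP"
    ]
    if not candidates:
        raise RuntimeError("no available target unit")
    return min(candidates)[1]

units = {
    "Unit1": {"state": "UP",    "disk": 0.40, "cpu": 0.20, "io": 0.10},
    "Unit2": {"state": "MAINT", "disk": 0.10, "cpu": 0.05, "io": 0.05},
    "Unit3": {"state": "UP",    "disk": 0.70, "cpu": 0.60, "io": 0.50},
}
print(select_target_unit(units, source_unit="Unit3"))   # -> Unit1
```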
In step 540, the file replication module 270 stores the source file information, noting that the file is not redundant. Prior to replication, the source file is initially denoted as not redundant to protect against a system failure while the file is being replicated. In step 550, the file replication module 270 contacts the target data storage unit's name service, requesting a new file name allocation. In step 560, upon successfully obtaining a new file name from the target data storage unit, the file replication module 270 contacts the target data storage unit's file copy service, requesting a copy from the source file to the target file. In step 570, when the copy is complete, the file replication module 270 stores the destination file information. After successfully replicating the file, the file replication module 270 marks the file as being redundant. At this point, the OMS 240 has a relationship between the file's name in the application name-space and the OMS name-space.
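Steps 540 through 570 can be sketched as follows. This is illustrative only; it treats the "New"/"OK" states of Table 3 as the not-yet-redundant/redundant markers, which is an assumption, and the allocate_name and copy_file callables stand in for the target unit's name service and file copy service:

```python
# Minimal sketch of steps 540-570: record the source as not yet redundant,
# allocate a name on the target unit, copy, then mark the file redundant.
def replicate_file(file_mapping, file_state, handle, target_unit,
                   allocate_name, copy_file):
    source_unit, source_path = file_mapping[handle][0]

    # Step 540: note that the file is not redundant until the copy finishes.
    link_count = file_state.get(handle, {}).get("link_count", 1)
    file_state[handle] = {"state": "New", "link_count": link_count}

    # Step 550: ask the target unit's name service for a new file name.
    target_path = allocate_name(target_unit)

    # Step 560: ask the target unit's file copy service to copy the data.
    copy_file(source_unit, source_path, target_unit, target_path)

    # Step 570: record the destination and mark the file redundant ("OK").
    file_mapping[handle].append((target_unit, target_path))
    file_state[handle]["state"] = "OK"
```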
According to one embodiment of the invention, the OMS 240 also stores a link count for each file in the OMS file state table 252. The link count is the number of unique application references to the given file. When the application creates a file in the OMS 240, the OMS 240 sets the link count to one. When the application copies the file in the OMS 240, the OMS 240 increments the link count. Likewise, when the application deletes the file, the OMS 240 decrements the link count.
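A minimal sketch, not from the patent, of this link-count bookkeeping, combined with the zero-count cleanup shown in FIG. 8; the remove_copy callable stands in for the per-unit name service removal request:

```python
# Minimal sketch of link-count bookkeeping: create sets the count to one, copy
# increments it, delete decrements it, and copies are only removed once the
# count reaches zero (FIG. 8, step 850).
def on_create(file_state, handle):
    file_state[handle] = {"state": "New", "link_count": 1}

def on_copy(file_state, handle):
    file_state[handle]["link_count"] += 1

def on_delete(file_state, file_mapping, handle, remove_copy):
    file_state[handle]["link_count"] -= 1
    if file_state[handle]["link_count"] == 0:
        # Ask every unit holding a copy to remove it, then forget the handle.
        for unit, path in file_mapping.pop(handle, []):
            remove_copy(unit, path)
        del file_state[handle]
```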
`
FIG. 6 is a flow diagram 600 illustrating the operations of the data storage system 100 when an application retrieves a file. As shown, in step 610, the application contacts the OMS 240 with the name of the source file in the application name-space (or "handle"). In step 620, the OMS 240 queues the request in the OMS work queue 246. In step 630, when the OMS 240 works through the OMS work queue 246 and finds the file retrieval request, the file retrieval module 280 of the OMS 240 then looks up the "handle" from the OMS file mapping table 250. Assuming that multiple copies of the file are stored in the data storage system 100, the OMS 240 will preferentially select a copy that is stored within the data storage unit with the most idle capacity. The OMS 240 then returns the hostname and pathname of the file to the application. In the present embodiment, the file retrieval module 280 may use the unit selector module 248 to choose the preferred distributed data storage unit. To provide a highly available service, the file retrieval module 280 will not return a file stored on an unreachable node. Since multiple copies of every file (except the most recently created files that have not yet been replicated) are stored in the system 100, the OMS 240 should be able to find a copy of any specified file on a running unit, even when one of the data storage units has failed. In an alternate embodiment, the file retrieval module 280 returns information on all copies of the file to allow the application to choose the best file copy to use.
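A minimal sketch of the step 630 selection rule; the idle_capacity mapping and the UP/DOWN unit states are illustrative stand-ins for whatever the unit selector module 248 actually consults:

```python
# Minimal sketch of step 630: skip copies on unreachable units and return the
# copy held by the unit with the most idle capacity.
def retrieve_file(file_mapping, unit_state, idle_capacity, handle):
    """Return (hostname, pathname) of the preferred copy of `handle`."""
    reachable = [
        (unit, path) for unit, path in file_mapping.get(handle, [])
        if unit_state.get(unit) == "UP"
    ]
    if not reachable:
        raise FileNotFoundError(f"no reachable copy of {handle!r}")
    # Prefer the copy held by the unit with the most idle capacity.
    return max(reachable, key=lambda copy: idle_capacity.get(copy[0], 0.0))
```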
With reference still to FIG. 6, in step 640, after obtaining the hostname and pathname of the file from the OMS 240, the application retrieves the file by passing the hostname and pathname to the appropriate distributed data storage unit. In the present embodiment, a host-to-host binary copy protocol, such as CacheFS from Sun Microsystems, may be used to send the file to the requesting applica