US009881040B2

(12) United States Patent
     Rawat et al.

(10) Patent No.: US 9,881,040 B2
(45) Date of Patent: Jan. 30, 2018
(54) TRACKING DATA OF VIRTUAL DISK SNAPSHOTS USING TREE DATA STRUCTURES

(71) Applicant: VMware, Inc., Palo Alto, CA (US)

(72) Inventors: Mayank Rawat, Sunnyvale, CA (US); Ritesh Shukla, Saratoga, CA (US); Li Ding, Cupertino, CA (US); Serge Pashenkov, Los Altos, CA (US); Raveesh Ahuja, San Jose, CA (US)

(73) Assignee: VMware, Inc., Palo Alto, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 336 days.

(21) Appl. No.: 14/831,808

(22) Filed: Aug. 20, 2015

(65) Prior Publication Data
     US 2017/0052717 A1    Feb. 23, 2017

(51) Int. Cl.
     G06F 3/06     (2006.01)
     G06F 17/30    (2006.01)

(52) U.S. Cl.
     CPC ...... G06F 17/30327 (2013.01); G06F 3/067 (2013.01); G06F 3/0608 (2013.01); G06F 3/0641 (2013.01); G06F 17/30088 (2013.01)

(58) Field of Classification Search
     CPC ...... G06F 17/30327; G06F 3/0608; G06F 3/0641; G06F 3/067; G06F 17/30088
     See application file for complete search history.

(56)                 References Cited

              U.S. PATENT DOCUMENTS

     8,775,773 B2      7/2014   Acharya et al.
     9,720,947 B2 *    8/2017   Aron .............. G06F 17/30327
     9,740,632 B1 *    8/2017   Love .............. G06F 12/1018
  2015/0058863 A1      2/2015   Karamanolis et al.
  2016/0210302 A1 *    7/2016   Xia ............... G06F 3/0619

* cited by examiner

Primary Examiner — Eric S. Cardwell
(74) Attorney, Agent, or Firm — Patterson & Sheridan, LLP

(57)                     ABSTRACT

User data of different snapshots for the same virtual disk are stored in the same storage object. Similarly, metadata of different snapshots for the same virtual disk are stored in the same storage object, and log data of different snapshots for the same virtual disk are stored in the same storage object. As a result, the number of different storage objects that are managed for snapshots does not increase proportionally with the number of snapshots taken. In addition, any one of the multitude of persistent storage back-ends can be selected as the storage back-end for the storage objects according to user preference, system requirement, snapshot policy, or any other criteria. Another advantage is that the storage location of the read data can be obtained with a single read of the metadata storage object, instead of traversing metadata files of multiple snapshots.

20 Claims, 4 Drawing Sheets

[Front-page drawing (duplicate of FIG. 2): virtual disk 210 with file descriptor 211 (geometry, size, data_region = PTR); Snapshot Management Data Structure 220 with snapshot_data = OID1, snapshot_metadata = OID2, snapshot_log = OID3; OID1 = PTR1, OID2 = PTR2, OID3 = PTR3; SS1 = tag1: OID2, offset x0; SS2 = tag2: OID2, offset x2; SS3 = tag3; RP = OID2, offset xc; Storage Objects 1, 2, 3 on VMFS 230 in storage device 162 and in storage device 161.]

[Sheet 1 of 4, FIG. 1: Block diagram of host computer system 100. Hardware platform 102 comprises CPU(s) 103, memory 104, NIC(s) 105, and HBA(s) 106. Hypervisor 108 supports VMs 112 (running applications 118 on guest OS 116) through VMMs 122; its IO stack includes SCSI virtualization layer 131, filesystem device switch 132, snapshot module 133, HFS/VVOL/VSAN driver 134, and data access layer 136. Storage device 161 is reached over network 151 and storage device 162 over network 152.]

[Sheet 2 of 4, FIG. 2: Virtual disk 210 with file descriptor 211 (geometry, size, data_region = PTR pointing to the "base" data region); Snapshot Management Data Structure 220 listing snapshot_data = OID1, snapshot_metadata = OID2, snapshot_log = OID3; OID1 = PTR1, OID2 = PTR2, OID3 = PTR3; SS1 = tag1: OID2, offset x0; SS2 = tag2: OID2, offset x2; SS3 = tag3; RP = OID2, offset xc. Storage Objects 1, 2, 3 are shown on VMFS 230 within storage device 162 and, alternatively, within storage device 161.]

[Sheet 3 of 4, FIG. 3: Timeline of snapshots SS1, SS2, SS3 and writes WR1, WR2, WR3, WR4. For each event, the figure shows the LBA space of virtual disk 210 (0 through 8000), the contents of the snapshot data storage object (OID1) and the snapshot metadata storage object (OID2), and the B+ tree whose nodes hold (LBA, PTR) entries pointing either into the base data region (e.g., base, 0; base, 3200; base, 3800) or into OID1 (e.g., OID1, 0; OID1, y1; OID1, y3). Annotations include the unit of allocation, SS1 = OID2, offset x0; SS2 = OID2, offset x2; and the running point RP.]

[Sheet 4 of 4, FIGS. 4A-4D (flow diagrams):
FIG. 4A, VM Power On: Read SMDS (402); open storage objects (404); establish running point (RP).
FIG. 4B, VM Snapshot: Set running point as root node of previous snapshot (412); create node for new running point (414); copy contents of root node of previous snapshot into the new running point node and mark all pointers as pointing to shared nodes (416).
FIG. 4C, Read IO: Access snapshot metadata at RP (422); traverse tree beginning at RP to locate data (424); issue read command to read data from the location (426).
FIG. 4D, Write IO: Access snapshot metadata at RP (432); traverse tree beginning at RP to find write location and update tree (434); issue write command to write data at the location (436).]

TRACKING DATA OF VIRTUAL DISK SNAPSHOTS USING TREE DATA STRUCTURES

BACKGROUND

In a virtualized computing environment, virtual disks of virtual machines (VMs) running in a host computer system ("host") are typically represented as files in the host's file system. To back up the VM data and to support linked VM clones, snapshots of the virtual disks are taken to preserve the VM data at a specific point in time. Frequent backup of VM data increases the reliability of the VMs. The cost of frequent backup, i.e., taking frequent snapshots, is high because of the increase in associated storage costs and the adverse impact on performance, in particular read performance, because each read may have to traverse each snapshot level to find the location of the read data.

Solutions have been developed to reduce the amount of storage consumed by snapshots. For example, snapshots can be backed up incrementally by comparing blocks from one version to another so that only the blocks that have changed from the previous version are saved. Deduplication has also been used to identify content duplicates among snapshots to remove redundant storage content.

Although these solutions have reduced the storage requirements of snapshots, further enhancements are needed for effective deployment in cloud computing environments, where the number of VMs and snapshots that are managed is quite large, often several orders of magnitude greater than in conventional data center deployments. In addition, storage technology has advanced to provide a multitude of persistent storage back-ends, but snapshot technology has yet to fully exploit the benefits that are provided by the different persistent storage back-ends.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a virtualized host computer system that implements a snapshot module according to embodiments.

FIG. 2 is a schematic diagram that illustrates data structures for managing virtual disk snapshots according to an embodiment.

FIG. 3 is a schematic diagram that illustrates additional data structures, including B+ trees, for managing virtual disk snapshots according to an embodiment.

FIG. 4A depicts a flow diagram of method steps that are carried out in connection with opening storage objects that are needed to manage snapshots according to an embodiment.

FIG. 4B depicts a flow diagram of method steps that are carried out in connection with taking snapshots according to an embodiment.

FIG. 4C depicts a flow diagram of method steps that are carried out to process a read IO on a virtual disk having one or more snapshots that have been taken according to an embodiment.

FIG. 4D depicts a flow diagram of method steps that are carried out to process a write IO on a virtual disk having one or more snapshots that have been taken according to an embodiment.

DETAILED DESCRIPTION

According to embodiments, user data of different snapshots for the same virtual disk are stored in the same storage object, which may take the form of a file in a host file system, a file in a network file system, an object storage provisioned as a virtual storage area network (VSAN) object, a virtual volume object, or a cloud storage object. Similarly, metadata of different snapshots for the same virtual disk are stored in the same storage object, and log data of different snapshots for the same virtual disk are stored in the same storage object. As a result, the number of different storage objects that are managed for snapshots does not increase proportionally with the number of snapshots taken. In addition, any one of the multitude of persistent storage back-ends can be selected as the storage back-end for the storage objects containing data for the snapshots. As a result, the form of the storage objects containing data for the snapshots may be selected according to user preference, system requirement, snapshot policy, or any other criteria. Another advantage is that the storage location of the read data can be obtained with a single read of the metadata storage object, instead of traversing metadata files of multiple snapshots.

FIG. 1 depicts a computer system, shown as host computer system 100, having a hypervisor 108 installed on top of hardware platform 102 to support the execution of virtual machines (VMs) 112-1 to 112-N through corresponding virtual machine monitors (VMMs) 122-1 to 122-N. Host computer system 100 may be constructed on a conventional, typically server-class, hardware platform 102, and includes one or more central processing units (CPUs) 103, system memory 104, one or more network interface controllers (NICs) 105, and one or more host bus adapters (HBAs) 106. Persistent storage for host computer system 100 may be provided locally, by a storage device 161 (e.g., network-attached storage or cloud storage) connected to NIC 105 over a network 151, or by a storage device 162 connected to HBA 106 over a network 152.

Each VM 112 implements a virtual hardware platform in the corresponding VMM 122 that supports the installation of a guest operating system (OS) which is capable of executing applications. In the example illustrated in FIG. 1, the virtual hardware platform for VM 112-1 supports the installation of a guest OS 116 which is capable of executing applications 118 within VM 112-1. Guest OS 116 may be any of the well-known commodity operating systems, such as Microsoft Windows®, Linux®, and the like, and includes a native file system layer, for example, either an NTFS or an ext3FS type file system layer. Input-output operations (IOs) issued by guest OS 116 through the native file system layer appear to guest OS 116 as being routed to one or more virtual disks provisioned for VM 112-1 for final execution, but such IOs are, in reality, reprocessed by IO stack 130 of hypervisor 108, and the reprocessed IOs are issued through NIC 105 to storage device 161 or through HBA 106 to storage device 162.

At the top of IO stack 130 is a SCSI virtualization layer 131, which receives IOs directed at the issuing VM's virtual disk and translates them into IOs directed at one or more storage objects managed by hypervisor 108, e.g., virtual disk storage objects representing the issuing VM's virtual disk. A file system device switch (FDS) driver 132 examines the translated IOs from SCSI virtualization layer 131, and in situations where one or more snapshots have been taken of the virtual disk storage objects, the IOs are processed by a snapshot module 133, as described below in conjunction with FIGS. 4C and 4D.

The remaining layers of IO stack 130 are additional layers managed by hypervisor 108. HFS/VVOL/VSAN driver 134 represents one of the following, depending on the particular implementation: (1) a host file system (HFS) driver in cases where the virtual disk and/or data structures relied on by snapshot module 133 are represented as a file in a file system, (2) a virtual volume (VVOL) driver in cases where the virtual disk and/or data structures relied on by snapshot module 133 are represented as a virtual volume as described in U.S. Pat. No. 8,775,773, which is incorporated by reference herein in its entirety, and (3) a virtual storage area network (VSAN) driver in cases where the virtual disk and/or data structures relied on by snapshot module 133 are represented as a VSAN object as described in U.S. patent application Ser. No. 14/010,275, which is incorporated by reference herein in its entirety. In each case, driver 134 receives the IOs passed through filter driver 132, translates them to IOs issued to one or more storage objects, and provides them to data access layer 136, which transmits the IOs to either storage device 161 through NIC 105 or storage device 162 through HBA 106.
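
For illustration only, the routing just described can be modeled in a few lines of Python; the class and method names below are assumptions made for this sketch and are not taken from the patent or from any VMware API.

# Illustrative sketch (hypothetical names): how IOs flow through the stack of
# FIG. 1 when a virtual disk has snapshots. Not actual hypervisor code.
from dataclasses import dataclass

@dataclass
class IO:
    lba: int          # logical block address targeted by the guest
    num_blocks: int   # number of blocks in the IO
    is_write: bool

class SnapshotModule:
    """Stands in for snapshot module 133 (see FIGS. 4C and 4D)."""
    def process(self, io: IO, storage_object: str):
        # Resolve the IO against the B+ tree rooted at the running point and
        # redirect it to the snapshot data/metadata storage objects.
        return (storage_object, io)

class BackendDriver:
    """Stands in for driver 134 (HFS, VVOL, or VSAN, depending on how the
    virtual disk is represented) together with data access layer 136."""
    def issue(self, resolved):
        storage_object, io = resolved
        kind = "write" if io.is_write else "read"
        return f"transmit {kind} of {io.num_blocks} blocks for {storage_object}"

class FilesystemDeviceSwitch:
    """Stands in for FDS driver 132: IOs are handed to the snapshot module
    only when snapshots exist for the target virtual disk."""
    def __init__(self, snapshot_module, backend):
        self.snapshot_module = snapshot_module
        self.backend = backend

    def submit(self, io: IO, storage_object: str, has_snapshots: bool):
        resolved = (storage_object, io)
        if has_snapshots:
            resolved = self.snapshot_module.process(io, storage_object)
        return self.backend.issue(resolved)

# Usage: a read of 8 blocks at LBA 3500 on a disk that has at least one snapshot.
switch = FilesystemDeviceSwitch(SnapshotModule(), BackendDriver())
print(switch.submit(IO(lba=3500, num_blocks=8, is_write=False), "virtual disk 210", True))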

It should be recognized that the various terms, layers, and categorizations used to describe the virtualization components in FIG. 1 may be referred to differently without departing from their functionality or the spirit or scope of the invention. For example, VMMs 122 may be considered separate virtualization components between VMs 112 and hypervisor 108 (which, in such a conception, may itself be considered a virtualization "kernel" component) since there exists a separate VMM for each instantiated VM. Alternatively, each VMM may be considered to be a component of its corresponding virtual machine since such VMM includes the hardware emulation components for the virtual machine. It should also be recognized that the techniques described herein are also applicable to hosted virtualized computer systems. Furthermore, although benefits that are achieved may be different, the techniques described herein may be applied to certain non-virtualized computer systems.

FIG. 2 is a schematic diagram that illustrates data structures for managing virtual disk snapshots according to an embodiment. In the embodiment illustrated herein, the virtual disk for a VM (shown in FIG. 2 as virtual disk 210) is assumed to be a file that is described by a file descriptor in the host file system (shown in FIG. 2 as file descriptor 211). Each file descriptor of a virtual disk contains a pointer to a data region of the virtual disk in storage. In the example of FIG. 2, file descriptor 211 contains the pointer PTR, which points to a base data region in storage device 162. In the description that follows, this base data region in storage device 162 is referred to as "base," and locations within this base data region are specified with an offset. In other embodiments, the virtual disk may be represented as a VVOL object, a VSAN object, or other types of object stores known in the art, and described using associated descriptor objects.

In addition to file descriptor 211, the data structures for managing snapshots include a snapshot management data structure (SMDS) 220, storage object 1, which contains actual data written to virtual disk 210 after a snapshot has been taken for virtual disk 210 (hereinafter referred to as "the snapshot data storage object"), storage object 2, which contains metadata about the snapshots taken for virtual disk 210 (hereinafter referred to as "the snapshot metadata storage object"), and storage object 3, which is used to record snapshot metadata operations for crash consistency (hereinafter referred to as "the snapshot log storage object"). Storage objects 1, 2, 3 are depicted herein as object stores within storage device 162, but may be files of HFS 230 or of a network file system in storage device 161. Storage objects 1, 2, 3 may also be object stores in a cloud storage device. Regardless of the type of storage backing storage objects 1, 2, 3, storage objects 1, 2, 3 are identified by their object identifiers (OIDs) in the embodiments. The SMDS provides a mapping of each OID to a location in storage. In SMDS 220, OID1 is mapped to PTR1, OID2 to PTR2, and OID3 to PTR3. Each of PTR1, PTR2, and PTR3 may be a path to a file in HFS 230 or a uniform resource identifier (URI) of a storage object.

An SMDS is created per virtual disk, and snapshot module 133 maintains the entire snapshot hierarchy for a single virtual disk in the SMDS. Whenever a new snapshot of a virtual disk is taken, snapshot module 133 adds an entry in the SMDS of that virtual disk. SMDS 220 shows an entry for each of snapshots SS1, SS2, SS3. Snapshot SS1 is the first snapshot taken for virtual disk 210, and its entry includes a tag (tag1) that contains searchable information about snapshot SS1 and a pointer to a root node of a B+ tree that records locations of the snapshot data for snapshot SS1. Snapshot SS2 is the second snapshot taken for virtual disk 210, and its entry includes a tag (tag2) that contains searchable information about snapshot SS2 and a pointer to a root node of a B+ tree that records locations of the snapshot data for snapshot SS2. Snapshot SS3 is the third snapshot taken for virtual disk 210, and its entry includes a tag (tag3) that contains searchable information about snapshot SS3. The pointer to a root node of a B+ tree that records locations of the snapshot data for snapshot SS3 is added to the entry for snapshot SS3 when the next snapshot is taken and the contents of snapshot SS3 are frozen. The contents of the nodes of all B+ trees are stored in the snapshot metadata storage object. Accordingly, the pointer in the entry for snapshot SS1 indicates OID2 as the storage object containing the B+ tree for snapshot SS1 and offset x0 as the location of the root node. Similarly, the pointer in the entry for snapshot SS2 indicates OID2 as the storage object containing the B+ tree for snapshot SS2 and offset x2 as the location of the root node.

The SMDS also specifies a running point RP, which is a pointer to a root node of a B+ tree that is traversed for reads and writes that occur after the most recent snapshot was taken. Each time snapshot module 133 takes a snapshot, it adds the running point to the entry of the immediately prior snapshot as the pointer to the root node of the B+ tree thereof, and creates a new running point in the manner further described below.

FIG. 3 is a schematic diagram that illustrates additional data structures, including B+ trees, for managing virtual disk snapshots according to an embodiment. FIG. 3 depicts the logical block address (LBA) space of virtual disk 210, the snapshot data storage object (OID1), and the snapshot metadata storage object (OID2) as linear arrays beginning at offset 0. FIG. 3 also schematically illustrates the B+ trees associated with each of SS1 and SS2, the first having root node 0 and the second having root node 8. A timeline is depicted along the left side of FIG. 3, and various events useful for illustrating the embodiments, such as snapshots (e.g., SS1, SS2, SS3) and writes (WR1, WR2, WR3, WR4), are depicted along this timeline. Alongside each of these events, FIG. 3 also illustrates the changes to the contents of the snapshot data storage object (OID1), the snapshot metadata storage object (OID2), and the B+ trees.

The first event is a snapshot of virtual disk 210, SS1. In the example described herein, this snapshot is the very first snapshot of virtual disk 210, and so snapshot module 133 creates SMDS 220, which specifies the storage locations for the snapshot data storage object (OID1), the snapshot metadata storage object (OID2), and the snapshot log storage object (OID3). Snapshot module 133 also sets the running point RP to be at node 0 (whose contents are stored at storage location = OID2, offset x0), and updates node 0 to include a single pointer to the base data region of virtual disk 210. Thus, initially, subsequent to the event SS1, snapshot module 133 directs all read IOs (regardless of the LBA range targeted by the read IO) to the base data region of virtual disk 210.
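
As a concrete illustration of the bookkeeping described above, the following Python sketch models an SMDS with its three object identifiers, one entry per snapshot, and a running point; the snapshot-taking step mirrors FIG. 4B. All names, field layouts, and offsets here are assumptions made for illustration, not the patent's on-disk format.

# Illustrative in-memory model of the SMDS described above (assumed names and
# layout). Offsets are symbolic locations within OID2.
from dataclasses import dataclass, field
from typing import Optional, Tuple, Dict

@dataclass
class SnapshotEntry:
    tag: str                                # searchable information, e.g. "tag1"
    root: Optional[Tuple[str, str]] = None  # (metadata object, offset of B+ tree root)

@dataclass
class SMDS:
    """One SMDS per virtual disk; holds the entire snapshot hierarchy."""
    snapshot_data: str = "OID1"      # snapshot data storage object
    snapshot_metadata: str = "OID2"  # snapshot metadata storage object (B+ tree nodes)
    snapshot_log: str = "OID3"       # snapshot log storage object (crash consistency)
    oid_map: Dict[str, str] = field(default_factory=dict)             # OID -> path or URI
    snapshots: Dict[str, SnapshotEntry] = field(default_factory=dict)
    running_point: Optional[Tuple[str, str]] = None  # root node used for IO after the last snapshot

def take_snapshot(smds: SMDS, name: str, tag: str, new_root_offset: str,
                  prev_snapshot: Optional[str] = None) -> None:
    """Freeze the current running point into the previous snapshot's entry and
    create a new running point (per FIG. 4B, the new root copies the old root's
    contents with all of its pointers marked as shared)."""
    if prev_snapshot is not None and smds.running_point is not None:
        smds.snapshots[prev_snapshot].root = smds.running_point
    smds.snapshots[name] = SnapshotEntry(tag=tag)
    smds.running_point = (smds.snapshot_metadata, new_root_offset)

# Usage mirroring FIG. 2: SS1's root ends up at OID2, offset x0.
smds = SMDS(oid_map={"OID1": "PTR1", "OID2": "PTR2", "OID3": "PTR3"})
take_snapshot(smds, "SS1", "tag1", new_root_offset="x0")
take_snapshot(smds, "SS2", "tag2", new_root_offset="x8", prev_snapshot="SS1")
print(smds.snapshots["SS1"].root)  # ('OID2', 'x0'): frozen when SS2 was taken
print(smds.running_point)          # ('OID2', 'x8'): traversed for IOs after SS2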

The second event is a write IO to virtual disk 210, WR1. In the example of FIG. 3, WR1 is a write IO into the virtual disk at LBA = 3500 and has a size that spans 300 LBAs. According to embodiments, instead of overwriting data in the base data region of virtual disk 210, the write data of WR1 is written into the snapshot data storage object through the following steps (a short sketch of the allocation arithmetic follows these steps).

First, snapshot module 133 allocates an unused region in the snapshot data storage object. The size of this allocation is based on a unit of allocation that has been configured for the snapshot data storage object. The unit of allocation is 4 MB in this example, but may be changed by the snapshot administrator. For example, the snapshot administrator may set the unit of allocation to be larger (>4 MB) if the snapshot data storage object is backed by a rotating disk array, or to be smaller (<4 MB) if the snapshot data storage object is backed by solid state memory such as flash memory. In addition, in order to preserve the spatial locality of the data, snapshot module 133 allocates each region in the snapshot data storage object to span a contiguous range of LBAs (hereinafter referred to as the "LBA chunk") of the virtual disk beginning at one of the alignment boundaries of the virtual disk, for example, at integer multiples of (unit of allocation)/(size of one LBA). In the example of FIG. 3, the size of one LBA is assumed to be 4 KB. Accordingly, the very first allocated region in the snapshot data storage object spans 1000 LBAs and the alignment boundary is at 3000, because WR1 is a write IO into the LBA range beginning at offset 3500.

Second, snapshot module 133 issues a write command to the snapshot data storage object to store the write data of WR1 in the allocated region at an offset equal to an offset from an alignment boundary of the LBA chunk spanned by the allocated region. In the example of FIG. 3, the allocated region spans LBA range 3000-3999, and so snapshot module 133 issues a write command to the snapshot data storage object to store the write data (having a size equal to 1.2 MB = 300 x 4 KB) in the allocated region at an offset equal to 500 from the beginning of the allocated region. The offset from the beginning of the snapshot data storage object is also 500 (shown in FIG. 3 as y1) because the allocated region is the very first allocated region of the snapshot data storage object.

Third, snapshot module 133 updates the snapshot metadata of virtual disk 210 (in particular, the snapshot metadata storage object, OID2) by creating three additional nodes, nodes 1, 2, 3, and overwrites the contents of node 0 to convert node 0 from a leaf node (which points to data) to an index node (which points to one or more other nodes), so that node 0 includes the following information: (i) pointers to nodes 1, 2, 3, (ii) a beginning LBA for each pointer, and (iii) a private/shared flag for each pointer. More specifically, node 0 has three entries, one entry for each pointer. The first entry identifies storage location = OID2 and offset = x1 as the pointer to node 1, a beginning LBA of 0, and a P flag indicating that it points to a private node. The second entry identifies storage location = OID2 and offset = x2 as the pointer to node 2, a beginning LBA of 3500, and a P flag indicating that it points to a private node. The third entry identifies storage location = OID2 and offset = x3 as the pointer to node 3, a beginning LBA of 3800, and a P flag indicating that it points to a private node. Private nodes are those nodes whose contents may be overwritten without preserving the original contents. On the other hand, when a write IO targets an LBA and a shared node is traversed to find the data location corresponding to the targeted LBA, the contents of the shared node need to be preserved and a new node created. The handling of shared nodes is described below in conjunction with the write IO, WR4.
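
The allocation and offset arithmetic of the first two steps above can be captured in a short sketch; it reproduces the WR1 numbers under the example's assumptions (4 MB unit of allocation, 4 KB LBAs, hence regions of 1000 LBAs). The helper name is hypothetical.

# Illustrative sketch of the chunk/offset arithmetic described above.
# The FIG. 3 example treats a 4 MB unit of allocation with 4 KB LBAs as an
# "LBA chunk" of 1000 LBAs aligned at integer multiples of 1000.
LBAS_PER_CHUNK = 1000  # (unit of allocation) / (size of one LBA) in the example

def chunk_for_write(target_lba: int, lbas_per_chunk: int = LBAS_PER_CHUNK):
    """Return (alignment boundary, offset within the allocated region), in LBAs,
    for a write that begins at target_lba."""
    boundary = (target_lba // lbas_per_chunk) * lbas_per_chunk
    return boundary, target_lba - boundary

# WR1: a write of 300 LBAs beginning at LBA 3500.
print(chunk_for_write(3500))        # (3000, 500): chunk spans LBAs 3000-3999, data at offset y1 = 500
print(300 * 4, "KB of write data")  # 1200 KB = 1.2 MB stored in the snapshot data storage object

# WR2 (described further below): LBA 3000 falls on the boundary of the same,
# already-allocated chunk, so its data lands at offset 0 of that region.
print(chunk_for_write(3000))        # (3000, 0)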

The B+ tree on the right side of FIG. 3 schematically illustrates the relationship of the nodes that are maintained in the snapshot metadata storage object after each event depicted in FIG. 3. The B+ tree to the right of WR1 shows that node 0 now points to nodes 1, 2, 3, and nodes 1, 2, 3 point to data regions that together span the entire LBA range spanned by the base data region of virtual disk 210. Node 1 includes a pointer to the base data region of virtual disk 210 at an offset equal to 0. Node 2 includes a pointer to the snapshot data storage object at an offset equal to y1 (=500). Node 3 includes a pointer to the base data region of virtual disk 210 at an offset equal to 3800.
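
The index-node layout just described (per entry: a pointer to a child node, a beginning LBA, and a private/shared flag) can be sketched as follows, using node 0 as it stands after WR1; the type names are illustrative assumptions, not the patent's on-disk layout.

# Illustrative sketch of a B+ tree index-node entry as described above
# (assumed names, not the patent's on-disk format).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Entry:
    begin_lba: int          # beginning LBA covered by the child node
    child: Tuple[str, str]  # (storage object, offset) of the child node in OID2
    private: bool           # True = P flag: may be overwritten in place;
                            # False = shared: must be copied before modification

@dataclass
class IndexNode:
    entries: List[Entry]

# Node 0 after WR1: three private entries pointing to nodes 1, 2, 3.
node0 = IndexNode(entries=[
    Entry(begin_lba=0,    child=("OID2", "x1"), private=True),  # node 1 -> base, offset 0
    Entry(begin_lba=3500, child=("OID2", "x2"), private=True),  # node 2 -> OID1, offset y1 (=500)
    Entry(begin_lba=3800, child=("OID2", "x3"), private=True),  # node 3 -> base, offset 3800
])
print(len(node0.entries))  # 3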

The third event is a write IO to virtual disk 210, WR2. In the example of FIG. 3, WR2 is a write IO into the virtual disk at LBA = 3000 and has a size that spans 200 LBAs. As with WR1, instead of overwriting data in the base data region of virtual disk 210, the write data of WR2 is written into the snapshot data storage object through the following steps.

First, snapshot module 133 detects that the LBA at offset 3000 has been allocated already. Therefore, snapshot module 133 issues a write command to the snapshot data storage object to store the write data of WR2 in the allocated region at an offset equal to 0. The offset is 0 because LBA 3000 falls on an alignment boundary. Then, snapshot module 133 creates two additional nodes, nodes 4, 5, and adds two pointers to these two nodes in node 0. More specifically, a first new entry in node 0 identifies storage location = OID2 and offset = x4 as the pointer to node 4, a beginning LBA of 0, and a P flag indicating that it points to a private node, and a second new entry in node 0 identifies storage location = OID2 and offset = x5 as the pointer to node 5, a beginning LBA of 3000, and a P flag indicating that it points to a private node. Snapshot module 133 also modifies the beginning LBA for the pointer to node 1 from 0 to 3200.

The B+ tree to the right of WR2 shows that node 0 now points to nodes 4, 5, 1, 2, 3, and nodes 4, 5, 1, 2, 3 point to data regions that together span the entire LBA range spanned by the base data region of virtual disk 210. Node 4 includes a pointer to the base data region of virtual disk 210 at an offset equal to 0. Node 5 includes a pointer to the snapshot data storage object at an offset equal to 0. Node 1 includes a pointer to the base data region of virtual disk 210 at an offset equal to 3200. Node 2 includes a pointer to the snapshot data storage object at an offset equal to y1 (=500). Node 3 includes a pointer to the base data region of virtual disk 210 at an offset equal to 3800.

The fourth event is a write IO to virtual disk 210, WR3. In the example of FIG. 3, WR3 is a write IO into the virtual disk at LBA = 7700 and has a size that spans 200 LBAs. As with WR1 and WR2, instead of overwriting data in the base data region of virtual disk 210, the write data of WR3 is written into the snapshot data storage object through the following steps.

First, snapshot module 133 allocates a new unused region in the snapshot data storage object because the previously allocated region does not span the LBA targeted by WR3. In the example of FIG. 3, the size of the newly allocated region is again 4 MB.

Second, snapshot module 133 issues a write command to the snapshot data storage object to store the write data of WR3 in the newly allocated region at an offset equal to an offset from an alignment boundary of the LBA chunk spanned by the newly allocated region. In the example of FIG. 3, the newly allocated region spans LBA range 7000-7999, and so snapshot module 133 issues a write command