`US 6,795,966 B1
`(10) Patent N0.:
`Lim et al.
`
`(45) Date of Patent: Sep. 21, 2004
`
`US006795966B1
`
`(54) MECHANISM FOR RESTORING, PORTING,
`REPLICATING AND CHECKPOINTING
`COMPUTER SYSTEMS USING STATE
`EXTRACTION
`
`(75)
`
`Inventors: Beng-Hong Lim, Menlo Park, CA
`(US); Edouard Bugnion, Mountain
`View, CA (US); Scott W. Devine,
`Redwood City, CA (US)
`
`(73) Assignee: VMWare, Inc., Palo Alto, CA (US)
`
`( * ) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 09/497,978
`
`(22)
`
`Filed:
`
`Feb. 4, 2000
`
`Related US. Application Data
`
`(63)
`
`(60)
`
`Continuation—in—part of application No. 09/151,175, filed on
`Sep. 10, 1998, now Pat. No. 6,496,847.
`Provisional application No. 60/118,862, filed on Feb. 5,
`1999, and provisional application No. 60/085,685, filed on
`Feb. 15, 1998.
`
`Int. Cl.7 ................................................ G06F 9/455
`(51)
`
`.. 718/1; 718/100
`(52) US. Cl.
`.....
`(58) Field of Search .......................... 709/1; 718/1, 100
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`5,386,552 A
`5,452,462 A *
`5,634,096 A
`5,715,464 A
`5,758,174 A
`5,905,855 A *
`
`1/1995 Garney et al.
`9/1995 Matsuura et al.
`5/1997 Baylor et al.
`2/1998 Crump et al.
`5/1998 Crump et al.
`5/1999 Klaiber et al.
`
`.............. 709/1
`
`................ 714/31
`
`OTHER PUBLICATIONS
`
`J. Howard, et al., “Scale and performance in a distributed file
`system,” ACM Transactions
`on Computer Systems,
`6(1):51—81, Feb., 1988.
`
`M. Litzkow, et al., “Supporting Checkpointing and Process
`Migration Outside the UNIX Kernel,” Proceedings of the
`1994 Winter USENIX Technical Conference, San Francisco,
`CA, Jan. 1992.
`C. Landau, “The Checkpoint Mechanism in KeyKOS,”
`Proceedings of the Second International Workshop on
`Object Orientation in Operating Systems, Sep. 1992.
`De Jonge,et al., “The Logical Disk: A New Approach to
`Improving File Systems,” in Proceedings of the 141h ACM
`Symposium on Operating System Principles, pp. 15—28,
`Dec. 1993.
`
`D. Hitz, et al., “File system design for a file server appli-
`ance,” Proceedings of the 1994 Winter USENIX Technical
`Conference, pp. 235—245, San Francisco, CA, Jan. 1994.
`Lee, et al., “Petal: Distributed Virtual Disks,” Proc. 1 “Intl.
`Conf. On Architectural Support
`for Programming Lan-
`guages and Operating Systems,” pp. 84—92, Oct. 1996.
`
`(List continued on next page.)
`
`Primary Examiner—John Follansbee
`Assistant Examiner—Ashok Patel
`
`(74) Attorney, Agent, or Firm—Jeffrey Pearce
`
`(57)
`
`ABSTRACT
`
`A computer system is interrupted, and its entire state infor-
`mation is extracted as one or more checkpoints at one or
`more respective points during operation of the system. The
`checkpoint may be restored into the system at any later time,
`even multiple times, and it may also even be loaded into one
`or more other systems; all systems loaded with the same
`checkpoint will then execute from the same checkpointed
`state. The state extraction mechanism is preferably a virtual
`machine monitor, on which one or more virtual machines are
`installed, each virtual machine constituting an encapsulated,
`virtualized computer system whose states can be check-
`pointed under control of the virtual machine monitor.
`Checkpoints may be stored on a portable memory device or
`transmitted as a batch or dynamically over a network so that
`even virtual machines installed at different sites may execute
`from the same state.
`
`21 Claims, 5 Drawing Sheets
`
`200
`VM1
`2201
`2202
`
`VPROC
`
`NON-
`VOLATILE
`STORAGE
`
`
`
`VMM
`
` DEVICE EMULATORS
`._.
`_____
`252
`___
`
`
`
`
` HARDWARE
`
`<->
`PROC
`
`100/
`
`110
`
`142
`
`VEEAM 1004
`
`IPR of US. Patent No. 7,093,086
`
`1 50
`PERIPH
`
`
`E
`)
`202
`
`VOS
`VPERIPH
`
`_/
`208
`
`200n
`
`
`
`
`
`
`
`
`
`
`
`
`US 6,795,966 B1
`Page 2
`
`OTHER PUBLICATIONS
`
`Litzkow, et al., “Checkpoint and Migration of UNIX Pro-
`cesses in the Condor Distributed Processing System,” M.
`University of Wisconsin—Madison Computer Sciences Tech-
`nical Report #1346, Apr. 1997.
`J. Shapiro, et al., “EROS: a fast capability system,” Pro-
`ceedings of the 17th ACM Symposium on Operating Sys
`
`tems Principles (SOSP ’99), Dec. 1999, Charleston, South
`Carolina.
`
`D. Santry, et al., “Deciding When to forget in the Elephant
`file system,” Proceedings of the 17th ACM Symposium on
`Operating Principles, Dec. 1999, Charleston, Sout Carolina.
`
`* cited by examiner
`
`
`
`US. Patent
`
`Sep.21,2004
`
`SheetlrofS
`
`US 6,795,966 B1
`
`0:
`
`N:
`
`-ZOZ
`
`mu__._.<._O>
`
`m0<m0._.m
`
`m_I__._.<._O>
`
`MOEOHw
`
`omF ..Imam:
`
`
`
`mm<>>0m<I
`
`oow
`
`_‘
`EV
`
`_ON_\
`
`NONF
`
`gr
`
`om;
`
`_..0.”—
`
`
`
`S.U
`
`eton”
`
`m
`
`S
`
`5f0
`
`.com
`
`:1.COON
`
`tmomnomw
`
`Smom
`
`4m_|__._.<n_O>m.202
`
`29ooon:_>_>
`
`wmu<mO._.mVON
`
`omN
`
`
`
`2wN—OFSDSEMOSMDwamma>mo
`
`N.0."—
`
`In=mm_n_
`
`o:
`
`m.__._.<._O>
`
`m0<m0._.w
`
`US 6,795,966 B1
`
`N:o:08
`
`DONE
`
`mm<>>DE<I
`
`
`
`
`US. Patent
`
`Sep.21,2004
`
`Sheet3 0f5
`
`US 6,795,966 B1
`
`
`
`
`
`US. Patent
`
`Sep.21,2004
`
`Sheet4 0f5
`
`US 6,795,966 B1
`
`
`
`m.0."—
`
`
`
`US. Patent
`
`Sep. 21, 2004
`
`Sheet 5 0f 5
`
`US 6,795,966 B1
`
`
`
`SEEMS.ZO_ww_S_wZ<N_._.
`
`a:can.aa20
`
`o.0."—
`
`
`
`
`US 6,795,966 B1
`
`1
`
`MECHANISM FOR RESTORING, PORTING,
`REPLICATING AND CHECKPOINTING
`COMPUTER SYSTEMS USING STATE
`EXTRACTION
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`
`This application is a Continuation-in-Part of commonly
`assigned US. patent application Ser. No. 09/151,175, “Sys-
`tem and Method for Virtualizing Computer Systems, ” filed
`Sep. 10, 1998, which issued as US. Pat. No. 6,496,847 on
`Dec. 17, 2002. This application also claims priority of the
`pending US. Provisional Application No. 60/085,685, filed
`Feb. 15, 1998.
`This application also incorporates by reference the US.
`patent application Ser. No. 09/179,137, “Virtualization Sys-
`tem Including a Virtual Machine Monitor for a Computer
`with a Segmented Architecture,” filed Oct. 26, 1998, which
`issued as US. Pat. No. 6,397,242 on May 28, 2002.
`BACKGROUND OF THE INVENTION
`1. Field of the Invention
`
`This application relates to an arrangement of a computer
`system, in particular, to a system and a method for acquiring,
`storing and using data concerning the state of hardware and
`software components within the computer system.
`2. Description of the Related Art
`Modern computers “crash” with irritating frequency, with
`much work lost or recovered only with time-consuming
`effort. Sometimes, crashes or other errors are expected, for
`example, when designing new software or debugging an
`existing program. In such cases, and even when first turning
`the computer on, time is also lost waiting for computers to
`“boot” or “reboot.” At other times, when problems occur for
`an ordinary user of a commercial application, even more
`time is often lost when the frustrated user must try to explain
`orally what has happened to a technician located far away in
`a customer service department. These are just a few of many
`possible examples of situations when information about the
`state of the computer system is either desirable, for example,
`when debugging a new program, or necessary, for example,
`when the computer is to reboot and automatically load
`previously running applications along with the data they
`were processing when exited.
`One known attempt to ensure the ability to analyze and
`reconstruct the state of a physical memory, disk or data base
`is based on the concept of a “transaction,” which involves
`on-going tracking of updates to at
`least one region of
`storage. In this context, a transaction is a collection of
`updates that are bundled together so that they are atomic that
`is, either all of the updates occur, or none of them occur. The
`idea of transactions is typically applied to databases, where
`a series of updates to different tables need to occur simul-
`taneously.
`A transaction proceeds as follows: A begin command
`from the operating system or an application marks the
`beginning of the series of updates that make up the trans-
`action. After the updates complete, a commit command
`marks the end of the transaction and the updates become
`permanent. If an error occurs during one of the updates that
`are part of the transaction, a rollback command is used to
`undo any updates in the transaction that may have com-
`pleted.
`
`Transactional Disks
`
`In the prior art, this use of the concept of transactions is
`commonly implemented in database systems. Recently,
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`transactions have been extended to apply to logical disks
`(also referred to as virtual disks), which are a software
`construct that emulate physical disks. One example of this
`solution, in the context of a parallel or distributed processing
`arrangement,
`is described in US. Pat. No. 5,634,096
`(Baylor, et al., May 27, 1997, “Using virtual disks for disk
`system checkpointing”), which discloses a scheme for stor-
`ing data on disks in such a way that a “checkpoint” is taken
`across several disks connected to different processors. This
`checkpoint is then used to restore the entire disk system to
`a known state after one or more of the disks or processors
`fails.
`
`Yet another solution involving virtual disks is described in
`“The Logical Disk: A New Approach to Improving File
`Systems,” by de Jonge, Kaashoek, and Hsieh, in Proceed-
`ings of the 141h ACM Symposium on Operating System
`Principles, pp. 15—28, December 1993. In this paper, the
`term “Atomic Recovery Unit” is used to describe transac-
`tions to the logical disk.
`The implementation of a logical disk requires the inter-
`ception of requests to the physical disk, and transforming
`them into operations on a logical disk. Once this has been
`accomplished, it is possible to keep a log of all of the updates
`to the logical disk and defer the update so that the original
`data is not overwritten. When the updates are kept in a log
`in this fashion, then a rollback can be accomplished by
`discarding the updates in the log for a particular transaction.
`Acommit can be accomplished by retaining these updates in
`the log, and eventually applying them to the logical disk.
`A similar concept has been proposed in “Petal: Distrib-
`uted Virtual Disks,” by Lee and Thekkath, in Proc. 1 “Intl.
`Conf. 0n Architectural Support for Programming Lan-
`guages and Operating Systems,” pp. 84—92, October 1996.
`The Petal virtual disk supports the ability to take snapshots
`of the virtual disk, using techniques known as “copy-on-
`write.” Copy-on-write is a common technique that allows
`copies to be created quickly, using a table of pointers to the
`actual data, and only copying the data when it is modified by
`a user program.
`In Petal, the virtual disk itself is implemented as a table
`of pointers, and the snapshot (equivalent to a “checkpoint”)
`is implemented by including an identifier (called an epoch
`number) in this table. When a snapshot is taken, the current
`epoch number is assigned to the snapshot. The epoch
`number is then incremented, and all subsequent updates to
`the virtual disk belong to this new epoch number. When a
`block of the disk is next updated, there will be no copy at the
`current epoch number, so a copy of the block will be created.
`In short, as the term “copy-on-write” implies, a copy is made
`only when a disk block is written to. The original data is still
`available, under the epoch number of the snapshot.
`Both the logging technique and the snapshot technique
`allow the implementation of transactions on a logical disk.
`In both cases, there are two copies of the modified disk
`block:
`the original version and the updated version. By
`restoring the state of the logical disk to point to the original
`version of all the disk blocks that were modified during the
`transaction, the transaction can be rolled back, that is, the
`state of the disk at the beginning of the transaction can be
`restored.
`
`The concepts of transactions on virtual disks and snap-
`shots of virtual disks have a number of limitations. The first
`
`is that they are useful only in the context of restoring the
`state of the disk: These systems provide no way to recover
`from, for example, failures caused by errors in a peripheral
`device.
`
`
`
`US 6,795,966 B1
`
`3
`Another limitation is that, during the operation of a typical
`computer system,
`the state of the disk is not complete:
`Modern operating systems employ disk caches that contain
`copies of data from the disk, as well as data that needs to be
`written to the disk. Applications also buffer data, so that even
`the operating system itself lacks a complete view of all the
`data entered by a user of the computer system. Snapshots of
`the disk state taken at an arbitrary point are only as consis-
`tent as the disk would be if the computer system were to
`crash at that point. On the other hand, any data that is present
`in the cache or in application memory, but that is not yet
`written to disk, is lost.
`If snapshots of the disk state are taken only at points when
`the operating system is shut down, then the disk is in a
`consistent state, and no data is lost. However, this represents
`a significant
`limitation on the concept of transactions:
`Before a transaction can begin or end, all applications must
`be closed and the operating system must be shut down. This
`makes the snapshot technique inadequate to restore the full
`state of the disk when the system or an application
`“crashes,” that is, when an application terminates other than
`as a result of a prescribed shut-down routine and whose
`execution cannot proceed. Alternatively, the application or
`operating system must explicitly issue commands that cause
`the buffered or cached data to be written back to the disk. In
`
`short, the reality of modern systems does not always con-
`form to the “clean” assumptions of the snapshot model, or
`they require the explicit coordination of application or
`operating system software.
`The technique of taking snapshots (also known as
`“checkpointing”) has also been used not only for virtual
`disks, but also for other subsystems such as file systems.
`Moreover, checkpointing has also been proposed for
`applications, and, in certain very restricted senses and cases,
`for systems as a whole. Examples of each will now be given.
`
`File System Checkpointing
`
`One example of checkpointing of file systems is disclosed
`in “Deciding when to forget in the Elephant file system,” D.
`Santry, et al., Proceedings of the 17th ACM Symposium on
`Operating Systems Principles, Charleston, SC. This
`“Elephant File System” uses copy-on-write techniques, as
`well as per-file characteristics to implement checkpointing
`of the file system, albeit only on a file-by-file basis.
`Other checkpointing techniques for
`file systems are
`described in “File system design for a file server appliance,”
`D. Hitz, et al., Proceedings of the 1994 Winter USENIX
`Technical Conference, pages 235—245, San Francisco,
`Calif., January 1994; and “Scale and performance in a
`distributed file system,” J. Howard, et al., ACM Transactions
`on Computer Systems, 6(1):51—81, February, 1988. In both
`of these systems, copy-on-write techniques are used to
`create whole file system checkpoints.
`
`System Checkpointing
`
`Many different proposals have also been put forward for
`checkpointing systems in certain restricted situations. One
`such proposal for the system known as KeyKOS is
`described, for example, in “The Checkpoint Mechanism in
`KeyKOS,” C. Landau, Proceedings of the Second Interna-
`tional Workshop on Object Orientation in Operating
`Systems, September 1992. The KeyKOS system, which
`operates as a microkernel-based operating system (OS),
`treats an entire system (from a software perspective) as a
`collection of objects and periodically takes checkpoints of
`all the objects. After a crash, the objects can be restored and
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`the system resumed. One shortcoming of the KeyKOS
`system is that it requires new system software to be written,
`in particular, new application program interfaces (API’s).
`Yet another disadvantage of KeyKOS is that, after a crash,
`the OS still needs to go through a boot-up process before
`restoring the objects.
`Still another known system-checkpointing technique is
`described in “EROS: a fast capability system,” J. Shapiro, et
`al., Proceedings of the 17th ACM Symposium on Operating
`Systems Principles (SOSP ’99), December 1999,
`Charleston, SC. Like KeyKOS, this EROS system is an
`object-oriented operating system with objects that are made
`persistent by checkpointing them. This checkpointing
`requires that all state resides in special objects called
`“pages” and “nodes,” and that all kernel (OS) operations are
`atomic. Like KeyKOS, the system requires a new API, that
`is, new software, to be written, and requires O/S coordina-
`tion. In EROS, periodic copies (checkpoints) are made of all
`objects, which are saved using copy-on-write techniques.
`Also like KeyKOS, the EROS system requires an O/S reboot
`after a crash.
`
`As its title implies, US. Pat. No. 5,715,464 (Crump, et al.,
`Feb. 3, 1998, “Computer system having suspend once
`resume many sessions”) describes a computer system that
`has suspend once resume many (SORM) sessions. This
`SORM arrangement operates in a manner similar to the way
`in which existing portable computers are able to “suspend”
`their operation, for example, when the lid is closed, and then
`resume operation when reactivated. In the SORM system
`described in the Crump ’464 patent, however, the suspended
`image is preserved after resuming and thus may be restored
`multiple times, although subject
`to the very restrictive
`condition that the suspended image may no longer be valid
`after the next disk access in a resumed system. Moreover, the
`disclosed system-checkpointing solution describes possibil-
`ity of keeping multiple suspended images, each for a dif-
`ferent operating system, so that one can alternate between
`running the suspended operating systems.
`Yet another system with features similar to the suspend-
`to-disk features of a portable computer is disclosed in US.
`Pat. No. 5,758,174 (Crump, et al., May 26, 1998, “Computer
`system having a plurality of stored system capability states
`from which to resume”). In this system, multiple suspended
`images may be kept and the user may resume from any one
`of them.
`
`In both the Crump ’464 and ’174 systems, the operating
`system (OS) and application software must participate in the
`suspension and must go through a shutdown and a wake-up
`phase. In particular, these known systems require software
`executing within the operating-system, such as an Advanced
`Power Management
`(APM) driver, and applications/
`subsystems to register with the APM driver.
`Still another system of this type is described in US. Pat.
`No. 5,386,552 (Garney, et al., Jan. 31, 1995, “Preservation
`of a computer system processing state in a mass storage”).
`In this system, the contents of system registers and system
`memory are saved in a mass storage device upon the
`occurrence of a triggering event, such as during power-off or
`when the system is to enter a low-power mode. The system
`then enters a suspend state. Once processing is resumed, the
`contents of a previously saved processing state are read in
`and control is returned to the previously running application
`program. This system requires two separate modules—a
`special
`interrupt handler and a system management
`module—to handle saving different partitions—isolated and
`non-isolated—of the memory.
`
`
`
`US 6,795,966 B1
`
`5
`As in other suspend-and-resume systems, in the Garney
`system, the evolution of the computer system state is always
`moving forward in a linear trajectory. In other words, once
`the system is resumed, there is no way to go back to the
`previously suspended state. This is in part because the
`contents of the disk, which are not saved when the system
`enters the suspend state, may be freely modified after
`resuming—any post-resume modification prevents resum-
`ing again from the previously saved state. Thus,
`is not
`possible to resume multiple times from a saved image. It is
`also not possible to save the state, continue execution, and
`then resume later from the saved state.
`
`The Garney system also illustrates another common dis-
`advantage of existing arrangements that provide for saving
`at
`least some part of the system state:
`It requires that
`software within the system itself must participate in saving
`the system state. Thus, in order to save the partial state in the
`Garney system,
`the additional system software needs to
`cause the processor to go into a system management inter-
`rupt state so that it can access a system management memory
`area. The processor must also be in the system management
`interrupt state in order to ensure that a critical part of the
`save routine will not be interrupted by a hardware interrupt.
`
`Application/Process-level Checkpointing
`
`One known system for checkpointing applications is the
`“Condor” distributed processing system, which is described
`in “Checkpoint and Migration of UNIX Processes in the
`Condor Distributed Processing System,” M. Litzkow, et al.,
`University of Wisconsin-Madison Computer Sciences Tech-
`nical Report #1346, April 1997; and “Supporting Check-
`pointing and Process Migration Outside the UNIX Kernel,”
`M. Litzkow, et al., Proceedings of the 1994 Winter USENIX
`Technical Conference, San Francisco, Calif., January 1992.
`The Condor system checkpoints the processes of running
`applications, and can migrate them to other machines as long
`as these also are running Condor. Only the application state
`is checkpointed, however, and the applications themselves
`must participate in the checkpointing by making calls to a
`checkpoint library.
`All of the known systems and methods mentioned above
`suffer from one or more of the following disadvantages:
`1) They save only part of the entire system state; as such,
`they cannot ensure complete restoration of the system state
`sufficient to guarantee that all applications will be able to
`continue exactly as they would have when the saved state is
`restored.
`
`2) They are not able to generate checkpoints and save the
`state of the system at arbitrary points, or at multiple points.
`The systems will therefore not correctly save the partial state
`except when processing is interrupted at specific points or
`under specific conditions. This implies, of course, that there
`will be circumstances when the state cannot be saved at all.
`
`This means, in turn that such systems cannot be used for
`such operations as full-state, step-by-step debugging of
`applications. In many cases, this limitation is caused by a
`need for synchronization of the partial state-saving proce-
`dure with applications, or a need to wait for some other
`internal process—such as a shut down of some sub-
`system—to be completed before saving the partial state.
`3) They require specialized system software such as
`special API’s or operating systems. Alternatively,
`they
`assume and work only for particular operating systems and
`hardware architectures. They are therefore not beneficial to
`the most common users—those who need to run off-the-
`
`shelf applications using an off-the-shelf operating system.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`6
`An additional consequence of this is that the checkpoints are
`not portable between different systems.
`4) They need to flush disk caches.
`What is needed is some way to overcome these disad-
`vantages of the prior art, and in particular, to extract and
`restore the entire state of the computer system as a whole,
`not just of some portion of the memory. This then would
`enable complete restoration of the system to any point in its
`processing without requiring any application or operating
`system intervention, or any specialized or particular system
`software (such as API’s and 08’s) or hardware architecture.
`This invention provides a system and method that accom-
`plishes this, and it does so in a way that makes possible even
`other unique features, such as the ability for one or even
`multiple users to run, evaluate, test, restart, and duplicate a
`processing stream not only from the same point, but also
`from different points. The invention accomplishes this,
`moreover, in a manner that allows checkpointing the entire
`state of the system in a way that allows state information to
`be portable between different hardware platforms and sys-
`tem software configurations.
`SUMMARY OF THE INVENTION
`
`The preferred embodiment of the invention provides a
`computer system that has a hardware processor; physical
`system storage; a virtual machine monitor functionally con-
`nected to the hardware processor and physical system stor-
`age; and at least one virtual machine that is functionally
`connected to the virtual machine monitor.
`
`Each virtual machine includes: a) a virtual processor that
`has a plurality of registers and executes a sequence of virtual
`processor instructions, virtual processor instructions being
`conveyed by the virtual machine monitor to the hardware
`processor, physical execution of the virtual processor
`instructions being carried out by the hardware processor; b)
`a virtual memory; c) a virtual operating system; d) at least
`one virtual peripheral device having a plurality of peripheral
`settings and peripheral device data; and e) at
`least one
`application program executing on the virtual machine.
`At the completion of execution of each processor instruc-
`tion by the virtual processor, the virtual machine has a total
`state that includes state information of the virtual processor,
`of the virtual memory, of the virtual operating system, of
`each application program, and of each virtual peripheral
`device. The virtual machine monitor is then used as a
`
`checkpointing mechanism a) for interrupting the virtual
`processor; b) for sensing a checkpoint request for the virtual
`machine; and c) for generating at least one checkpoint, each
`checkpoint comprising a list of the total state of the virtual
`machine corresponding to the respective checkpoint request,
`at the completion of execution of a respective virtual pro-
`cessor instruction.
`
`In the preferred embodiment of the invention, checkpoint-
`ing mechanism is further provided a)
`for storing each
`generated checkpoint; b)
`for sensing a state restoration
`request corresponding to one of the checkpoints; and c) upon
`sensing the state restoration request, for restoring into the
`virtual machine the total state corresponding to the check-
`point that in turn corresponds to the state restoration request,
`the virtual machine thereby resuming execution from the
`restored total state,
`that
`is, of the sequence of virtual
`processor instructions following the virtual processor
`instruction after whose completion the checkpoint was
`generated, identically as it would have if the virtual proces-
`sor had not been interrupted.
`The physical system storage preferably includes a check-
`point storage portion accessible only by the virtual machine
`
`
`
`US 6,795,966 B1
`
`7
`monitor. Each generated checkpoint is then stored in the
`checkpoint storage portion. Stored checkpoints thereby
`remain unchanged by other access to the physical system
`storage. In this case, the total state of the virtual machine
`excludes the contents of the corresponding checkpoint stor-
`age portion.
`The invention allows the use of a plurality of virtual
`machines. In this case, one of the virtual machines may serve
`as a source virtual machine, for which at least one check-
`point is generated, and the checkpointing mechanism is then
`further provided for loading into at least one secondary
`virtual machine, which is different from the source virtual
`machine, the total state of the source virtual machine cor-
`responding to one of the generated checkpoints. The differ-
`ent virtual machines may also be installed on different
`virtual machine monitors, and the different virtual machine
`monitors may even be installed on different hardware plat-
`forms.
`If the source virtual machine is located within a source
`
`computer, the secondary virtual machine may be located
`within a secondary computer that contains a secondary
`virtual machine that is structurally substantially identical to
`the source virtual machine. The source and secondary com-
`puters may thereby have separate hardware processors and
`physical system storage. The system according to the inven-
`tion may then include a transmission medium via which the
`checkpoint is transferred from the source virtual machine to
`the secondary virtual machine. Examples of the transmission
`medium include a portable memory device and a network to
`which the source and secondary computers are connected.
`One advantage of having distinct but structurally identical
`virtual machines running on a virtual machine monitor is
`that they may then be independent of an underlying system
`hardware platform that includes the hardware processor and
`the physical system storage. The total state of any source
`virtual machine is thus exportable to any other secondary
`virtual machine.
`
`The computer system according to the invention may
`include a plurality of peripheral system hardware devices. In
`this case,
`in the preferred embodiment, for each of the
`plurality of peripheral system hardware devices, the virtual
`machine monitor preferably includes a respective device
`emulation module corresponding to a predetermined emu-
`lation of a representative, virtualized hardware device cor-
`responding to the peripheral system hardware device. The
`device emulation modules then form a device interface
`
`between the virtual machines and the peripheral system
`hardware devices. The virtual machine monitor then pref-
`erably further includes a conversion module converting
`device control signals and data from any virtual machine in
`a format of the respective representative virtualized hard-
`ware device into corresponding signals and data into a
`format of the corresponding peripheral system hardware
`device, and vice versa. Each virtual machine is thereby made
`independent of the underlying system hardware platform,
`which includes the hardware processor, the physical system
`storage, and the peripheral system hardware devices. The
`total state of any source virtual machine then is exportable
`to any other secondary virtual machine, regardless of what
`physical peripherals might be included in a particular sys-
`tem.
`
`In an embodiment of the invention that does not rely on
`the use of virtual machines and a virtual machine monitor
`
`there is at least one application program running on system
`software such as an operating system. In this embodiment,
`the mechanism for generating checkpoints is a driver
`installed in the operating system.
`
`8
`According to the invention, checkpoints may be restored
`a plurality of times. In such case, the corresponding virtual
`machine thereby resumes execution multiple times from an
`identical total state.
`
`Whenever multiple checkpoints are generated and stored,
`each subsequently generated checkpoint is generated with-
`out changing any earlier generated checkpoint.
`Also in the preferred embodiment of the invention, any
`arbitrary one of the stored checkpoints may be restored into
`an arbitrary one of the virtual machines. Moreover, the same
`stored checkpoint may be loaded concurrently into a plu-
`rality of virtual machines; each virtual machine is then able
`to proceed from the loaded checkpoint independently of the
`other virtual machines.
`
`The checkpointing procedure does not require modifica-
`tion or long suspension of the operation of a virtual machine.
`Rather, according to the invention, unmodified execution of
`the virtual machine may be resumed immediately after the
`step of generating each checkpoint.
`The invention also provides a way to reduce how much
`storage space is needed for storing the state information
`contained in checkpoints. If this feature is implemented,
`then the steps of sensing each checkpoint request and
`generating a corresponding checkpoint may then include the
`following sub-steps: a) generating an initial checkpoint
`including the total state of the virtual machine; and b) for at
`least one subsequent checkpoint after the initial checkpoint,
`generating the subsequent checkpoint as a list of state
`changes relative to an immediately preceding generated
`checkpoint as well as link information to the immediately
`preceding generated checkpoint.
`According to another aspect of the preferred embodiment
`of the invention, checkpoints may also be deleted. In this
`case, at least one checkpoint is designated for deletion. If it
`is the initial checkpoint is the one designated for deletion, all
`the total state information in the initial checkpoint is incor-
`porated into each checkpoint
`immediately following the
`designated checkpoints. If the initial checkpoint is not one
`designated for deletion then all state change information in
`the checkpoints designated for deletion is incorporated into
`each checkpoint
`immediately following the designated
`checkpoints. (Note that a checkpoint may have more than
`one immediate successor, for example, when a checkpoint