`
Proceedings of the
5th Symposium on Operating Systems
Design and Implementation
`
`Boston, Massachusetts, USA
`
`December 9-11, 2002
`
`
`
THE ADVANCED COMPUTING SYSTEMS ASSOCIATION
`
`© 2002 by The USENIX Association
`
`All Rights Reserved
`
`For more information about the USENIX Association:
`
`Phone: 1 510 528 8649
`
`FAX: 1 510 548 5738
`
`Email: office@usenix.org
`
`WWW: http://www.usenix.org
`
`
`Rights to individual papers remain with the author or the author's employer.
`
`Permission is granted for noncommercial reproduction of the work for educational or research purposes.
`This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
`
`
`
`
`Memory Resource Management in VMware ESX Server
`
`Carl A. Waldspurger
`
`VMware, Inc.
`Palo Alto, CA 94304 USA
`carl@vmware.com
`
`Abstract
`
VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified commodity operating systems. This paper introduces several novel ESX Server mechanisms and policies for managing memory. A ballooning technique reclaims the pages considered least valuable by the operating system running in a virtual machine. An idle memory tax achieves efficient memory utilization while maintaining performance isolation guarantees. Content-based page sharing and hot I/O page remapping exploit transparent page remapping to eliminate redundancy and reduce copying overheads. These techniques are combined to efficiently support virtual machine workloads that overcommit memory.
`
`1
`
`Introduction
`
`Recent industry trends, such as server consolida-
`tion and the proliferation of inexpensive shared-memory
`multiprocessors, have fueled a resurgence of interest in
`server virtualization techniques.
`Virtual machines are
`particularly attractive for server virtualization.
`Each
`virtual machine (VM) is given the illusion of being a ded-
`icated physical machine that is fully protected and iso-
`lated from other virtual machines. Virtual machines are
`also convenient abstractions of server workloads, since
`they cleanly encapsulate the entire state of a running sys-
`tem, including both user-level applications and kernel-
`mode operating system services.
`
`In many computing environments, individual servers
`are underutilized, allowing them to be consolidated as
`virtual machines on a single physical server with little or
`no performance penalty. Similarly, many small servers
`can be consolidated onto fewer larger machines to sim-
`plify management and reduce costs. Ideally, system ad-
`ministrators should be able to flexibly overcommit mem-
`ory, processor, and other resources in order to reap the
`benefits of statistical multiplexing, while still providing
`resource guarantees to VMs of varying importance.
`
`Virtual machines have been used for decades to al-
`low multiple copies of potentially different operating
`systems to run concurrently on a single hardware plat-
`form [8]. A virtual machine monitor (VMM) is a soft-
`ware layer that virtualizes hardware resources, export-
`ing a virtual hardware interface that reflects the under-
`lying machine architecture. For example, the influential
`VM/370 virtual machine system [6] supported multiple
`concurrent virtual machines, each of which believed it
`was running natively on the IBM System/370 hardware
`architecture [10].
`More recent research, exemplified
`by Disco [3, 9], has focused on using virtual machines
`to provide scalability and fault containment for com-
`modity operating systems running on large-scale shared-
`memory multiprocessors.
`
VMware ESX Server is a thin software layer designed
`to multiplex hardware resources efficiently among vir-
`tual machines. The current system virtualizes the Intel
`IA-32 architecture [13]. It is in production use on servers
`running multiple instances of unmodified operating sys-
`tems such as Microsoft Windows 2000 Advanced Server
`and Red Hat Linux 7.2. The design of ESX Server dif-
`fers significantly from VMware Workstation, which uses
`a hosted virtual machine architecture [23] that takes ad-
`vantage of a pre-existing operating system for portable
`I/O device support. For example, a Linux-hosted VMM
`intercepts attempts by a VM to read sectors from its vir-
`tual disk, and issues a read() system call to the under-
`lying Linux host OS to retrieve the corresponding data.
`In contrast, ESX Server manages system hardware di-
`rectly, providing significantly higher I/O performance
`and complete control over resource management.
`
`The need to run existing operating systems without
`modification presented a number of interesting chal-
`lenges. Unlike IBM’s mainframe division, we were un-
`able to influence the design of the guest operating sys-
`tems running within virtual machines. Even the Disco
prototypes [3, 9], designed to run unmodified operat-
`ing systems, resorted to minor modifications in the IRIX
`kernel sources.
`
`
`
`
`This paper introduces several novel mechanisms and
`policies that ESX Server 1.5 [29] uses to manage mem-
`ory. High-level resource management policies compute
`a target memory allocation for each VM based on spec-
`ified parameters and system load. These allocations are
`achieved by invoking lower-level mechanisms to reclaim
`memory from virtual machines.
`In addition, a back-
`ground activity exploits opportunities to share identical
`pages between VMs, reducing overall memory pressure
`on the system.
`
`In the following sections, we present the key aspects
`of memory resource management using a bottom-up
`approach, describing low-level mechanisms before dis-
`cussing the high-level algorithms and policies that co-
`ordinate them.
`Section 2 describes low-level memory
`virtualization.
`Section 3 discusses mechanisms for re-
`claiming memory to support dynamic resizing of virtual
`machines. A general technique for conserving memory
`by sharing identical pages between VMs is presented
`in Section 4.
`Section 5 discusses the integration of
`working-set estimates into a proportional-share alloca-
`tion algorithm.
`Section 6 describes the high-level al-
`location policy that coordinates these techniques. Sec-
`tion 7 presents a remapping optimization that reduces
`I/O copying overheads in large-memory systems. Sec-
`tion 8 examines related work. Finally, we summarize our
`conclusions and highlight opportunities for future work
`in Section 9.
`
`2
`
`Memory Virtualization
`
`A guest operating system that executes within a vir-
`tual machine expects a zero-based physical address
`space, as provided by real hardware. ESX Server gives
`each VM this illusion, virtualizing physical memory by
`adding an extra level of address translation. Borrowing
`terminology from Disco [3], a machine address refers to
`actual hardware memory, while a physical address is a
`software abstraction used to provide the illusion of hard-
`ware memory to a virtual machine. We will often use
`“physical” in quotes to highlight this deviation from its
`usual meaning.
`
`ESX Server maintains a pmap data structure for each
`VM to translate “physical” page numbers (PPNs) to
`machine page numbers (MPNs). VM instructions that
`manipulate guest OS page tables or TLB contents are
`intercepted, preventing updates to actual MMU state.
`Separate shadow page tables, which contain virtual-to-
`machine page mappings, are maintained for use by the
processor and are kept consistent with the physical-to-machine
mappings in the pmap.¹ This approach permits ordinary memory
references to execute without additional overhead, since the
hardware TLB will cache direct virtual-to-machine address
translations read from
`the shadow page table.
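
To make the two-level translation concrete, here is a minimal sketch of a per-VM pmap and a shadow page-table fill, assuming a 4 KB page size; the type and function names (pmap_t, shadow_install, shadow_fill) are illustrative, not ESX Server internals.

#include <stdint.h>

typedef uint32_t ppn_t;   /* guest "physical" page number */
typedef uint32_t mpn_t;   /* machine page number          */

#define INVALID_MPN ((mpn_t)~0u)
#define GUEST_PAGES (1u << 20)    /* illustrative: 4 GB guest / 4 KB pages */

/* Per-VM pmap: "physical"-to-machine translation. */
typedef struct {
    mpn_t ppn_to_mpn[GUEST_PAGES];
} pmap_t;

/* Hypothetical hook that installs one virtual-to-machine mapping in the
 * shadow page table actually walked by the hardware MMU. */
void shadow_install(uint64_t vaddr, mpn_t mpn, unsigned prot);

/* On a hidden page fault, compose the guest's virtual-to-"physical"
 * mapping (read from its own page tables, not shown) with the pmap,
 * so that later references hit the TLB or shadow table directly. */
void shadow_fill(pmap_t *pmap, uint64_t vaddr, ppn_t ppn, unsigned prot)
{
    mpn_t mpn = pmap->ppn_to_mpn[ppn];
    if (mpn != INVALID_MPN)
        shadow_install(vaddr, mpn, prot);
    /* else: PPN not yet backed by machine memory; allocate or fault */
}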
`
`The extra level of indirection in the memory system
`is extremely powerful. The server can remap a “phys-
`ical” page by changing its PPN-to-MPN mapping, in a
`manner that is completely transparent to the VM. The
`server may also monitor or interpose on guest memory
`accesses.
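
A transparent remap then amounts to updating the pmap entry and invalidating any shadow or TLB entries that referenced the old machine page, so the next access refaults through the new translation. The sketch below reuses the illustrative types from the sketch above; shadow_invalidate_ppn is likewise hypothetical.

/* Hypothetical: drop all shadow page-table and TLB entries that map
 * the given guest PPN for this VM. */
void shadow_invalidate_ppn(int vm_id, ppn_t ppn);

/* Remap one guest "physical" page to a different machine page without
 * the guest's knowledge (e.g., after copying its contents elsewhere). */
void remap_ppn(int vm_id, pmap_t *pmap, ppn_t ppn, mpn_t new_mpn)
{
    pmap->ppn_to_mpn[ppn] = new_mpn;
    shadow_invalidate_ppn(vm_id, ppn);
}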
`
`3
`
`Reclamation Mechanisms
`
`ESX Server supports overcommitment of memory to
`facilitate a higher degree of server consolidation than
`would be possible with simple static partitioning. Over-
`commitment means that the total size configured for all
`running virtual machines exceeds the total amount of ac-
`tual machine memory. The system manages the alloca-
`tion of memory to VMs automatically based on config-
`uration parameters and system load.
`
`Each virtual machine is given the illusion of having
`a fixed amount of physical memory. This max size is
`a configuration parameter that represents the maximum
`amount of machine memory it can be allocated. Since
`commodity operating systems do not yet support dy-
`namic changes to physical memory sizes, this size re-
`mains constant after booting a guest OS.
`A VM will be
`allocated its maximum size when memory is not over-
`committed.
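
In these terms, memory is overcommitted exactly when the configured maxima add up to more than machine memory; a trivial sketch of that check (names illustrative, server overhead ignored):

#include <stdbool.h>
#include <stdint.h>

/* Overcommitment: total configured "physical" memory across running VMs
 * exceeds actual machine memory. */
bool memory_overcommitted(const uint64_t *vm_max_bytes, int nvms,
                          uint64_t machine_bytes)
{
    uint64_t total = 0;
    for (int i = 0; i < nvms; i++)
        total += vm_max_bytes[i];
    return total > machine_bytes;
}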
`
`3.1
`
`Page Replacement Issues
`
`When memory is overcommitted, ESX Server must
`employ some mechanism to reclaim space from one or
`more virtual machines. The standard approach used by
`earlier virtual machine systems is to introduce another
`level of paging [9, 20], moving some VM “physical”
`pages to a swap area on disk. Unfortunately, an extra
`level of paging requires a meta-level page replacement
`policy: the virtual machine system must choose not only
`the VM from which to revoke memory, but also which
`of its particular pages to reclaim.
`
`In general, a meta-level page replacement policy must
`make relatively uninformed resource management deci-
`sions. The best information about which pages are least
`
`
`
¹The IA-32 architecture has hardware mechanisms that walk in-
`memory page tables and reload the TLB [13].
`
`
`
`
`valuable is known only by the guest operating system
`within each VM. Although there is no shortage of clever
`page replacement algorithms [26], this is actually the
`crux of the problem. A sophisticated meta-level policy
`is likely to introduce performance anomalies due to un-
`intended interactions with native memory management
`policies in guest operating systems.
`This situation is
`exacerbated by diverse and often undocumented guest
`OS policies [1], which may vary across OS versions and
`may even depend on performance hints from applica-
`tions [4].
`
`The fact that paging is transparent to the guest OS can
`also result in a double paging problem, even when the
`meta-level policy is able to select the same page that the
`native guest OS policy would choose [9, 20]. Suppose
`the meta-level policy selects a page to reclaim and pages
it out. If the guest OS is under memory pressure, it may
`choose the very same page to write to its own virtual
`paging device. This will cause the page contents to be
`faulted in from the system paging device, only to be im-
`mediately written out to the virtual paging device.
`
`3.2
`
`Ballooning
`
Ideally, a VM from which memory has been reclaimed
should perform as if it had been configured with
`less memory. ESX Server uses a ballooning technique
`to achieve such predictable performance by coaxing the
`guest OS into cooperating with it when possible. This
`process is depicted in Figure 1.
`
`A small balloon module is loaded into the guest OS
`as a pseudo-device driver or kernel service.
`It has no
`external interface within the guest, and communicates
`with ESX Server via a private channel. When the server
`wants to reclaim memory, it instructs the driver to “in-
`flate” by allocating pinned physical pages within the
`VM, using appropriate native interfaces. Similarly, the
`server may “deflate” the balloon by instructing it to deal-
`locate previously-allocated pages.
`
`Inflating the balloon increases memory pressure in the
`guest OS, causing it to invoke its own native memory
`management algorithms. When memory is plentiful, the
`guest OS will return memory from its free list. When
`memory is scarce, it must reclaim space to satisfy the
`driver allocation request. The guest OS decides which
`particular pages to reclaim and, if necessary, pages them
`out to its own virtual disk.
`The balloon driver com-
`municates the physical page number for each allocated
`page to ESX Server, which may then reclaim the corre-
sponding machine page. Deflating the balloon frees up
memory for general use within the guest OS.

[Figure 1 diagram: the balloon inflates (guest may page out) and deflates (guest may page in) within guest memory.]

Figure 1: Ballooning. ESX Server controls a balloon module running within the guest, directing it to allocate guest pages and pin them in “physical” memory. The machine pages backing this memory can then be reclaimed by ESX Server. Inflating the balloon increases memory pressure, forcing the guest OS to invoke its own memory management algorithms. The guest OS may page out to its virtual disk when memory is scarce. Deflating the balloon decreases pressure, freeing guest memory.
`
`Although a guest OS should not touch any physical
`memory it allocates to a driver, ESX Server does not
`depend on this property for correctness. When a guest
`PPN is ballooned, the system annotates its pmap entry
`and deallocates the associated MPN. Any subsequent at-
`tempt to access the PPN will generate a fault that is han-
`dled by the server; this situation is rare, and most likely
`the result of complete guest failure, such as a reboot
`or crash. The server effectively “pops” the balloon, so
`that the next interaction with (any instance of) the guest
`driver will first reset its state. The fault is then handled
`by allocating a new MPN to back the PPN, just as if the
page was touched for the first time.²
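
A minimal sketch of how such a fault might be resolved, with hypothetical helpers for MPN allocation and balloon reset; the actual ESX Server code path is certainly more involved.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t mpn_t;

/* Hypothetical per-PPN state kept alongside the pmap entry. */
typedef struct {
    mpn_t mpn;
    bool  ballooned;   /* backing MPN was reclaimed while in the balloon */
} pmap_entry_t;

mpn_t alloc_zeroed_mpn(void);    /* zeroed to avoid leaking data between VMs */
void  balloon_reset(int vm_id);  /* "pop" the balloon: next driver interaction
                                    starts from a clean state */

/* Called when a guest touches a PPN whose backing MPN was reclaimed by
 * ballooning (rare; usually the result of a guest reboot or crash). */
mpn_t handle_ballooned_fault(int vm_id, pmap_entry_t *pe)
{
    balloon_reset(vm_id);            /* forget stale balloon state            */
    pe->mpn = alloc_zeroed_mpn();    /* back the PPN as if touched first time */
    pe->ballooned = false;
    return pe->mpn;
}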
`
`Our balloon drivers for the Linux, FreeBSD, and Win-
`dows operating systems poll the server once per sec-
`ond to obtain a target balloon size, and they limit their
`allocation rates adaptively to avoid stressing the guest
`OS. Standard kernel interfaces are used to allocate phys-
ical pages, such as get_free_page() in Linux, and
MmAllocatePagesForMdl() or MmProbeAndLockPages()
in Windows.
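
The guest-resident driver logic can be pictured as a simple polling loop, sketched below; the channel functions (balloon_query_target, balloon_report_page) and the rate-limiting details are assumptions, with only the allocation interfaces named above corresponding to real kernel calls.

/* Sketch of a guest balloon driver main loop. The ESX Server channel
 * (balloon_query_target, balloon_report_page) is hypothetical; page
 * allocation stands in for interfaces like get_free_page(). */

unsigned long balloon_query_target(void);              /* target size, in pages */
void balloon_report_page(unsigned long pfn, int inflate);
unsigned long guest_alloc_pinned_page(void);            /* returns 0 on failure */
void guest_free_page(unsigned long pfn);
void sleep_seconds(unsigned s);

#define BALLOON_MAX (1u << 16)
static unsigned long held[BALLOON_MAX];   /* pages currently in the balloon */
static unsigned long nheld;

void balloon_loop(void)
{
    for (;;) {
        unsigned long target = balloon_query_target();
        if (target > BALLOON_MAX)
            target = BALLOON_MAX;

        /* Inflate toward the target; a real driver also limits its
         * allocation rate adaptively to avoid stressing the guest OS. */
        while (nheld < target) {
            unsigned long pfn = guest_alloc_pinned_page();
            if (pfn == 0)
                break;                    /* guest cannot give more right now   */
            held[nheld++] = pfn;
            balloon_report_page(pfn, 1);  /* server may reclaim the backing MPN */
        }

        /* Deflate if the target dropped. */
        while (nheld > target) {
            unsigned long pfn = held[--nheld];
            balloon_report_page(pfn, 0);  /* server rebinds a machine page first */
            guest_free_page(pfn);
        }

        sleep_seconds(1);                 /* poll the server once per second */
    }
}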
`
`Future guest OS support for hot-pluggable memory
`cards would enable an additional form of coarse-grained
`ballooning. Virtual memory cards could be inserted into
`
²ESX Server zeroes the contents of newly-allocated machine pages
`to avoid leaking information between VMs. Allocation also respects
`cache coloring by the guest OS; when possible, distinct PPN colors are
`mapped to distinct MPN colors.
`
`
`3.3
`
`Demand Paging
`
`ESX Server preferentially uses ballooning to reclaim
memory, treating it as a common-case optimization.
`When ballooning is not possible or insufficient, the sys-
`tem falls back to a paging mechanism. Memory is re-
`claimed by paging out to an ESX Server swap area on
`disk, without any guest involvement.
`
`The ESX Server swap daemon receives information
`about target swap levels for each VM from a higher-
`level policy module. It manages the selection of candi-
`date pages and coordinates asynchronous page outs to a
`swap area on disk. Conventional optimizations are used
`to maintain free slots and cluster disk writes.
`
`A randomized page replacement policy is used to pre-
`vent the types of pathological interference with native
`guest OS memory management algorithms described in
`Section 3.1.
`This choice was also guided by the ex-
`pectation that paging will be a fairly uncommon oper-
`ation. Nevertheless, we are investigating more sophisti-
cated page replacement algorithms, as well as policies that
`may be customized on a per-VM basis.
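
A randomized victim selection of this kind might look like the following sketch; the helper names and the retry bound are assumptions.

#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

typedef uint32_t ppn_t;

uint32_t vm_num_ppns(int vm_id);
bool     ppn_is_pageable(int vm_id, ppn_t ppn);     /* resident, not pinned/ballooned */
void     queue_async_pageout(int vm_id, ppn_t ppn); /* write to the ESX swap area     */

/* Pick candidate pages uniformly at random until the VM reaches its target
 * swap level; random choice avoids systematic interference with the guest's
 * own replacement policy (Section 3.1). */
void swap_toward_target(int vm_id, uint32_t pages_to_swap)
{
    uint32_t n = vm_num_ppns(vm_id);
    uint32_t attempts = 8 * n;          /* crude bound so the sketch terminates */

    while (pages_to_swap > 0 && attempts-- > 0) {
        ppn_t ppn = (ppn_t)(rand() % n);
        if (!ppn_is_pageable(vm_id, ppn))
            continue;
        queue_async_pageout(vm_id, ppn);
        pages_to_swap--;
    }
}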
`
`4
`
`Sharing Memory
`
`Server consolidation presents numerous opportunities
`for sharing memory between virtual machines. For ex-
`ample, several VMs may be running instances of the
`same guest OS, have the same applications or compo-
`nents loaded, or contain common data. ESX Server ex-
`ploits these sharing opportunities, so that server work-
`loads running in VMs on a single machine often con-
`sume less memory than they would running on separate
`physical machines. As a result, higher levels of over-
`commitment can be supported efficiently.
`
`4.1
`
`Transparent Page Sharing
`
`Disco [3] introduced transparent page sharing as a
`method for eliminating redundant copies of pages, such
`as code or read-only data, across virtual machines. Once
`copies are identified, multiple guest “physical” pages are
`mapped to the same machine page, and marked copy-
`on-write.
`Writing to a shared page causes a fault that
`generates a private copy.
`
`Unfortunately, Disco required several guest OS mod-
`ifications to identify redundant copies as they were cre-
`ated. For example, the bcopy() routine was hooked to
`
[Figure 2 plot: throughput (MB/sec) versus VM size (MB), for VM sizes from 128 MB to 256 MB.]
`
Figure 2: Balloon Performance. Throughput of single Linux VM running dbench with 40 clients. The black bars plot the performance when the VM is configured with main memory sizes ranging from 128 MB to 256 MB. The gray bars plot the performance of the same VM configured with 256 MB, ballooned down to the specified size.
`
`
`
`or removed from a VM in order to rapidly adjust its
`physical memory size.
`
`To demonstrate the effectiveness of ballooning, we
`used the synthetic dbench benchmark [28] to simulate
`fileserver performance under load from 40 clients. This
`workload benefits significantly from additional memory,
`since a larger buffer cache can absorb more disk traffic.
`For this experiment, ESX Server was running on a dual-
`processor Dell Precision 420, configured to execute one
`VM running Red Hat Linux 7.2 on a single 800 MHz
Pentium III CPU.
`
`Figure 2 presents dbench throughput as a function
`of VM size, using the average of three consecutive runs
`for each data point.
`The ballooned VM tracks non-
`ballooned performance closely, with an observed over-
`head ranging from 4.4% at 128 MB (128 MB balloon)
`down to 1.4% at 224 MB (32 MB balloon). This over-
`head is primarily due to guest OS data structures that are
`sized based on the amount of “physical” memory; the
`Linux kernel uses more space in a 256 MB system than
`in a 128 MB system.
`Thus, a 256 MB VM ballooned
`down to 128 MB has slightly less free space than a VM
`configured with exactly 128 MB.
`
`Despite its advantages, ballooning does have limita-
`tions. The balloon driver may be uninstalled, disabled
`explicitly, unavailable while a guest OS is booting, or
`temporarily unable to reclaim memory quickly enough
`to satisfy current system demands. Also, upper bounds
`on reasonable balloon sizes may be imposed by various
`guest OS limitations.
`
`
`
`
enable file buffer cache sharing across virtual machines.
`Some sharing also required the use of non-standard or
`restricted interfaces.
`A special network interface with
`support for large packets facilitated sharing data com-
`municated between VMs on a virtual subnet.
`Interpo-
`sition on disk accesses allowed data from shared, non-
`persistent disks to be shared across multiple guests.
`
`4.2
`
`Content-Based Page Sharing
`
`Because modifications to guest operating system in-
`ternals are not possible in our environment, and changes
`to application programming interfaces are not accept-
`able, ESX Server takes a completely different approach
`to page sharing. The basic idea is to identify page copies
`by their contents. Pages with identical contents can be
`shared regardless of when, where, or how those contents
`were generated. This general-purpose approach has two
`key advantages.
`First, it eliminates the need to mod-
`ify, hook, or even understand guest OS code. Second,
`it can identify more opportunities for sharing; by defini-
`tion, all potentially shareable pages can be identified by
`their contents.
`
`The cost for this unobtrusive generality is that work
`must be performed to scan for sharing opportunities.
`Clearly, comparing the contents of each page with ev-
`ery other page in the system would be prohibitively ex-
pensive; naive matching would require O(n²) page com-
`parisons. Instead, hashing is used to identify pages with
`potentially-identical contents efficiently.
`
`A hash value that summarizes a page’s contents is
`used as a lookup key into a hash table containing entries
`for other pages that have already been marked copy-on-
`write (COW). If the hash value for the new page matches
`an existing entry, it is very likely that the pages are iden-
`tical, although false matches are possible. A successful
`match is followed by a full comparison of the page con-
`tents to verify that the pages are identical.
`
`Once a match has been found with an existing shared
`page, a standard copy-on-write technique can be used
`to share the pages, and the redundant copy can be re-
`claimed. Any subsequent attempt to write to the shared
`page will generate a fault, transparently creating a pri-
`vate copy of the page for the writer.
`
`If no match is found, one option is to mark the page
`COW in anticipation of some future match. However,
`this simplistic approach has the undesirable side-effect
`of marking every scanned page copy-on-write, incurring
`unnecessary overhead on subsequent writes. As an op-
`
[Figure 3 diagram: the contents of a candidate PPN are hashed; the hash indexes a table of hint frames (truncated hash, MPN, VM, PPN) and shared frames (hash, MPN, refs).]

Figure 3: Content-Based Page Sharing. ESX Server scans for sharing opportunities, hashing the contents of candidate PPN 0x2868 in VM 2. The hash is used to index into a table containing other scanned pages, where a match is found with a hint frame associated with PPN 0x43f8 in VM 3. If a full comparison confirms the pages are identical, the PPN-to-MPN mapping for PPN 0x2868 in VM 2 is changed from MPN 0x1096 to MPN 0x123b, both PPNs are marked COW, and the redundant MPN is reclaimed.
`
`
`
`timization, an unshared page is not marked COW, but
`instead tagged as a special hint entry.
`On any future
`match with another page, the contents of the hint page
`are rehashed. If the hash has changed, then the hint page
`has been modified, and the stale hint is removed. If the
`hash is still valid, a full comparison is performed, and
`the pages are shared if it succeeds.
`
`Higher-level page sharing policies control when and
`where to scan for copies. One simple option is to scan
`pages incrementally at some fixed rate. Pages could be
`considered sequentially, randomly, or using heuristics to
`focus on the most promising candidates, such as pages
`marked read-only by the guest OS, or pages from which
`code has been executed. Various policies can be used
`to limit CPU overhead, such as scanning only during
`otherwise-wasted idle cycles.
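
Putting these pieces together, one scan step might look like the sketch below. The frame structure and helper functions are modeled on the description above and in Section 4.3, but their names and exact shapes are assumptions.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef uint32_t ppn_t;
typedef uint32_t mpn_t;

/* One table frame; 'shared' distinguishes shared frames (refcounted MPN)
 * from hint frames (back-reference to a guest page). */
typedef struct frame {
    uint64_t      hash;
    mpn_t         mpn;
    uint16_t      refs;      /* shared frames only */
    bool          shared;
    int           vm;        /* hint frames only   */
    ppn_t         ppn;       /* hint frames only   */
    struct frame *next;      /* hash-chain link    */
} frame_t;

/* Hypothetical helpers standing in for ESX Server internals. */
uint64_t hash_page(mpn_t mpn);                   /* 64-bit content hash */
bool     pages_identical(mpn_t a, mpn_t b);      /* full byte comparison */
mpn_t    ppn_to_mpn(int vm, ppn_t ppn);
void     map_shared_cow(int vm, ppn_t ppn, mpn_t shared_mpn);
void     free_mpn(mpn_t mpn);
frame_t *table_lookup(uint64_t hash);
void     table_insert(frame_t *f);
void     table_remove(frame_t *f);
frame_t *frame_alloc(void);

/* Consider one randomly chosen candidate page for sharing. */
void scan_page(int vm, ppn_t ppn)
{
    mpn_t    mpn = ppn_to_mpn(vm, ppn);
    uint64_t h   = hash_page(mpn);
    frame_t *f   = table_lookup(h);

    if (f != NULL && !f->shared) {
        /* Matched a hint: rehash the hint page to check it is unchanged. */
        mpn_t hint_mpn = ppn_to_mpn(f->vm, f->ppn);
        if (hash_page(hint_mpn) != f->hash) {
            table_remove(f);            /* stale hint: page was modified     */
            f = NULL;
        } else if (pages_identical(mpn, hint_mpn)) {
            /* Promote the hint to a shared frame covering the hint page. */
            f->shared = true;
            f->mpn    = hint_mpn;
            f->refs   = 1;
            map_shared_cow(f->vm, f->ppn, hint_mpn);
        } else {
            f = NULL;                   /* false hash match: leave both alone */
        }
    }

    if (f != NULL && f->shared && pages_identical(mpn, f->mpn)) {
        /* Share with the existing copy and reclaim the redundant MPN. */
        map_shared_cow(vm, ppn, f->mpn);
        f->refs++;
        free_mpn(mpn);
    } else if (f == NULL) {
        /* No usable match: record a hint for some future candidate. */
        frame_t *hint = frame_alloc();
        hint->hash = h;  hint->shared = false;
        hint->vm   = vm; hint->ppn    = ppn;
        table_insert(hint);
    }
}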
`
`4.3
`
`Implementation
`
`The ESX Server implementation of content-based
`page sharing is illustrated in Figure 3. A single global
`hash table contains frames for all scanned pages, and
`chaining is used to handle collisions. Each frame is en-
`coded compactly in 16 bytes. A shared frame consists
`of a hash value, the machine page number (MPN) for
the shared page, a reference count, and a link for chaining.
A hint frame is similar, but encodes a truncated
`
`
`
`
`
`
`
`
[Figure 4 plots: memory (MB) and percent of VM memory versus number of VMs, with series for VM Memory, Shared (COW), Reclaimed, Zero Pages, and Shared - Reclaimed.]
`
Figure 4: Page Sharing Performance. Sharing metrics for a series of experiments consisting of identical Linux VMs running SPEC95 benchmarks. The top graph indicates the absolute amounts of memory shared and saved increase smoothly with the number of concurrent VMs. The bottom graph plots these metrics as a percentage of aggregate VM memory. For large numbers of VMs, sharing approaches 67% and nearly 60% of all VM memory is reclaimed.
`
`
`
`concurrent VMs running SPEC95 benchmarks for thirty
`minutes. For these experiments, ESX Server was run-
`ning on a Dell PowerEdge 1400SC multiprocessor with
two 933 MHz Pentium III CPUs.
`
`Figure 4 presents several sharing metrics plotted as
`a function of the number of concurrent VMs. Surpris-
`ingly, some sharing is achieved with only a single VM.
`Nearly 5 MB of memory was reclaimed from a single
`VM, of which about 55% was due to shared copies of
`the zero page. The top graph shows that after an initial
`jump in sharing between the first and second VMs, the
`total amount of memory shared increases linearly with
`the number of VMs, as expected. Little sharing is at-
`tributed to zero pages, indicating that most sharing is
`due to redundant code and read-only data pages.
`The
`bottom graph plots these metrics as a percentage of ag-
`gregate VM memory. As the number of VMs increases,
`the sharing level approaches 67%, revealing an over-
`lap of approximately two-thirds of all memory between
`the VMs. The amount of memory required to contain
`the single copy of each common shared page (labelled
Shared - Reclaimed) remains nearly constant, decreasing
`as a percentage of overall VM memory.
`
`
`hash value to make room for a reference back to the cor-
`responding guest page, consisting of a VM identifier and
`a physical page number (PPN). The total space overhead
`for page sharing is less than 0.5% of system memory.
`
`Unlike the Disco page sharing implementation, which
`maintained a backmap for each shared page, ESX Server
`uses a simple reference count. A small 16-bit count is
`stored in each frame, and a separate overflow table is
`used to store any extended frames with larger counts.
`This allows highly-shared pages to be represented com-
`pactly.
`For example, the empty zero page filled com-
`pletely with zero bytes is typically shared with a large
`reference count. A similar overflow technique for large
`reference counts was used to save space in the early
`OOZE virtual memory system [15].
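
One way to picture the compact 16-byte frames and the reference-count overflow path; the exact field widths and layout here are assumptions chosen only to fit 16 bytes.

#include <stdint.h>

/* Illustrative 16-byte frame layouts. A shared frame holds the full hash,
 * the shared MPN, a small reference count, and a chain link; a hint frame
 * gives up part of the hash to record which guest page it refers to. */
typedef struct {
    uint64_t hash;           /* 64-bit content hash                     */
    uint32_t mpn   : 24;     /* shared machine page number              */
    uint32_t chain : 8;      /* chain link (truncated for illustration) */
    uint16_t refs;           /* 16-bit count; overflows go to a table   */
    uint16_t pad;
} shared_frame_t;

typedef struct {
    uint32_t hash_hi;        /* truncated hash                          */
    uint32_t vm    : 8;      /* back-reference: VM identifier ...       */
    uint32_t ppn   : 24;     /* ... and guest "physical" page number    */
    uint32_t chain;
    uint32_t pad;
} hint_frame_t;

/* Reference-count update with a separate overflow table for highly shared
 * pages (e.g. the zero page), so the common case stays at 16 bits per frame. */
void overflow_add(uint32_t frame_index, uint32_t delta);

void frame_ref(shared_frame_t *f, uint32_t frame_index)
{
    if (f->refs == UINT16_MAX)
        overflow_add(frame_index, 1);   /* extended count lives elsewhere */
    else
        f->refs++;
}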
`
A fast, high-quality hash function [14] is used to generate
a 64-bit hash value for each scanned page. Since the chance
of encountering a false match due to hash aliasing is
incredibly small,³ the system can make
`the simplifying assumption that all shared pages have
`unique hash values. Any page that happens to yield a
`false match is considered ineligible for sharing.
`
`The current ESX Server page sharing implementation
`scans guest pages randomly. Although more sophisti-
`cated approaches are possible, this policy is simple and
`effective. Configuration options control maximum per-
`VM and system-wide page scanning rates.
`Typically,
`these values are set to ensure that page sharing incurs
`negligible CPU overhead.
`As an additional optimiza-
`tion, the system always attempts to share a page before
`paging it out to disk.
`
`To evaluate the ESX Server page sharing implemen-
`tation, we conducted experiments to quantify its effec-
`tiveness at reclaiming memory and its overhead on sys-
`tem performance. We first analyze a “best case” work-
`load consisting of many homogeneous VMs, in order to
`demonstrate that ESX Server is able to reclaim a large
`fraction of memory when the potential for sharing exists.
`We then present additional data collected from produc-
`tion deployments serving real users.
`
`We performed a series of controlled experiments us-
`ing identically-configured virtual machines, each run-
`ning Red Hat Linux 7.2 with 40 MB of “physical” mem-
`ory. Each experiment consisted of between one and ten
`
`
`
³Assuming page contents are randomly mapped to 64-bit hash values,
the probability of a single collision doesn't exceed 50% until approximately
√(2^64) = 2^32 distinct pages are hashed [14]. For a static snapshot of the
largest possible IA-32 memory configuration with 2^24 pages (64 GB), the
collision probability is less than 0.01%.
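
As a rough check of that bound using the standard birthday approximation (an added calculation, not from the paper):

p \approx \frac{n(n-1)}{2 \cdot 2^{64}} \le \frac{(2^{24})^2}{2^{65}} = 2^{-17} \approx 7.6 \times 10^{-6},

which is well under 0.01%.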
`
`
`
`
`
                     Total     Shared          Reclaimed
     Guest Types      (MB)     MB      %       MB      %
  A  10 WinNT         2048    880    42.9      673    32.9
  B   9 Linux         1846    539    29.2      345    18.7
  C   5 Linux         1658    165    10.0      120     7.2
`
`
`
`
`
`
`
`
`
`
Figure 5: Real-World Page Sharing. Sharing metrics from production deployments of ESX Server. (a) Ten Windows NT VMs serving users at a Fortune 50 company, running a variety of database (Oracle, SQL Server), web (IIS, Websphere), development (Java, VB), and other applications. (b) Nine Linux VMs serving a large user community for a nonprofit organization, executing a mix of web (Apache), mail (Majordomo, Postfix, POP/IMAP, MailArmor), and other servers. (c) Five Linux VMs providing web proxy (Squid), mail (Postfix, RAV), and remote access (ssh) services to VMware employees.
`
`
`
`The CPU overhead due to page sharing was negligi-
`ble. We ran an identical set of experiments with page
`sharing disabled, and measured no significant difference
`in the aggregate throughput reported by the CPU-bound
`benchmarks running in the VMs. Over all runs, the ag-
`gregate throughput was actually 0.5% higher with page
`sharing enabled, and ranged from 1.6% lower to 1.8%
`higher. Although the effect is generally small, page shar-
`ing does improve memory locality, and may therefore
`increase hit rates in physically-indexed caches.
`
`These experiments demonstrate that ESX Server is
`able to exploit sharing opportunities effectively.
`Of
`course, more diverse workloads will typically exhibit
`lower degrees of sharing. Nevertheless, many real-world
`server consolidation workloads do consist of numerous
`VMs running the same guest OS with similar applica-
`tions. Since the amount of memory reclaimed by page
`sharing is very workload-dependent, we collected mem-
`ory sharing statistics from several ESX Server systems
`in production use.
`
`Figure 5 presents page sharing metrics collected from
`three different production deployments of ESX Server.
`Workload A, from a corporate IT department at a For-
`tune 50 company, consists of ten Windows NT 4.0 VMs
`running a wide variety of database, web, and other
`servers. Page sharing reclaimed nearly a third of all VM
`memory, saving 673 MB. Workload B, from a nonprofit
`organization’s Internet server, consists of nine Linux
`VMs ranging in size from 64 MB to 768 MB, running
`a mix of mail, web, and other servers. In this case, page
`sharing was able to reclaim 18.7% of VM memory, sav-
`ing 345 MB, of which 70 MB was attributed to zero
`pages. Finally, workload C is from VMware’s own IT
`department, and provides web proxy, mail, and remote
`access services to our employees using five Linux VMs
`
`ranging in size from 32 MB to 512 MB. Page sharing
`reclaimed about 7% of VM memory, for a savings of
`120 MB, of which 25 MB was due to zero pages.
`
`5
`
`Shares vs. Working Sets
`
`Traditional operating systems adjust memory alloca-
`tions to improve some aggregate, system-wide perfor-
`mance metric.
`While this is usually a desirable goal,
`it often conflicts with the need to provide quality-of-
`service guarantees to clients of varying importance.
`Such guarantees are critical for server consolidation,
`where each VM may be entitled to different amounts
`of resources based on factors such as importance, own-
`ership, administrative domains, or even the amount of
`money paid to a service provider for executing the VM.
`In such cases, it can be preferable to penalize a less im-
`portant VM, even when that VM would derive the largest
`performance benefit from additional memory.
`
`ESX Server employs a new allocation algorithm that
`is able to achieve efficient memory utilization while
`maintaining memory performance isolation guarantees.
`In addition, an explicit parameter is introduced that al-
`lows system administrators to control the relative impor-
`tance of these conflicting goals.
`
`5.1
`
`Share-Based Allocation
`
`In proportional-share frameworks, resource rights are
`encapsulated by shares, which are owned by clients that
consume resources. A client is entitled to consume re-
`sources proportional to its share allocation; it is guaran-
`teed a minimum resource fraction equal to its fraction of
`the total shares in the system. Shares represent relative
`resource rights that depend on the total number of shares
`contending for a resource.
`Client allocations degrade
`gracefully in overload situations, and clients proportion-
`ally benefit from extra resources when some allocations
`are underutilized.
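
Stated as a formula (a restatement of the guarantee above, not a new policy), a client i holding S_i shares out of the total shares contending for a resource is guaranteed at least the fraction

\text{min fraction}_i = \frac{S_i}{\sum_j S_j}

of that resource, e.g. S_i / \sum_j S_j of the machine memory under contention.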
`
`Both randomized and deterministic alg