USENIX Association

Proceedings of the
5th Symposium on Operating Systems Design and Implementation

Boston, Massachusetts, USA
December 9-11, 2002

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2002 by The USENIX Association
All Rights Reserved

For more information about the USENIX Association:
Phone: 1 510 528 8649
FAX: 1 510 548 5738
Email: office@usenix.org
WWW: http://www.usenix.org

Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
Memory Resource Management in VMware ESX Server

Carl A. Waldspurger
VMware, Inc.
Palo Alto, CA 94304 USA
carl@vmware.com
Abstract

VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified commodity operating systems. This paper introduces several novel ESX Server mechanisms and policies for managing memory. A ballooning technique reclaims the pages considered least valuable by the operating system running in a virtual machine. An idle memory tax achieves efficient memory utilization while maintaining performance isolation guarantees. Content-based page sharing and hot I/O page remapping exploit transparent page remapping to eliminate redundancy and reduce copying overheads. These techniques are combined to efficiently support virtual machine workloads that overcommit memory.
1 Introduction
Recent industry trends, such as server consolidation and the proliferation of inexpensive shared-memory multiprocessors, have fueled a resurgence of interest in server virtualization techniques. Virtual machines are particularly attractive for server virtualization. Each virtual machine (VM) is given the illusion of being a dedicated physical machine that is fully protected and isolated from other virtual machines. Virtual machines are also convenient abstractions of server workloads, since they cleanly encapsulate the entire state of a running system, including both user-level applications and kernel-mode operating system services.

In many computing environments, individual servers are underutilized, allowing them to be consolidated as virtual machines on a single physical server with little or no performance penalty. Similarly, many small servers can be consolidated onto fewer larger machines to simplify management and reduce costs. Ideally, system administrators should be able to flexibly overcommit memory, processor, and other resources in order to reap the benefits of statistical multiplexing, while still providing resource guarantees to VMs of varying importance.
Virtual machines have been used for decades to allow multiple copies of potentially different operating systems to run concurrently on a single hardware platform [8]. A virtual machine monitor (VMM) is a software layer that virtualizes hardware resources, exporting a virtual hardware interface that reflects the underlying machine architecture. For example, the influential VM/370 virtual machine system [6] supported multiple concurrent virtual machines, each of which believed it was running natively on the IBM System/370 hardware architecture [10]. More recent research, exemplified by Disco [3, 9], has focused on using virtual machines to provide scalability and fault containment for commodity operating systems running on large-scale shared-memory multiprocessors.
VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines. The current system virtualizes the Intel IA-32 architecture [13]. It is in production use on servers running multiple instances of unmodified operating systems such as Microsoft Windows 2000 Advanced Server and Red Hat Linux 7.2. The design of ESX Server differs significantly from VMware Workstation, which uses a hosted virtual machine architecture [23] that takes advantage of a pre-existing operating system for portable I/O device support. For example, a Linux-hosted VMM intercepts attempts by a VM to read sectors from its virtual disk, and issues a read() system call to the underlying Linux host OS to retrieve the corresponding data. In contrast, ESX Server manages system hardware directly, providing significantly higher I/O performance and complete control over resource management.
The need to run existing operating systems without modification presented a number of interesting challenges. Unlike IBM's mainframe division, we were unable to influence the design of the guest operating systems running within virtual machines. Even the Disco prototypes [3, 9], designed to run unmodified operating systems, resorted to minor modifications in the IRIX kernel sources.
This paper introduces several novel mechanisms and policies that ESX Server 1.5 [29] uses to manage memory. High-level resource management policies compute a target memory allocation for each VM based on specified parameters and system load. These allocations are achieved by invoking lower-level mechanisms to reclaim memory from virtual machines. In addition, a background activity exploits opportunities to share identical pages between VMs, reducing overall memory pressure on the system.
In the following sections, we present the key aspects of memory resource management using a bottom-up approach, describing low-level mechanisms before discussing the high-level algorithms and policies that coordinate them. Section 2 describes low-level memory virtualization. Section 3 discusses mechanisms for reclaiming memory to support dynamic resizing of virtual machines. A general technique for conserving memory by sharing identical pages between VMs is presented in Section 4. Section 5 discusses the integration of working-set estimates into a proportional-share allocation algorithm. Section 6 describes the high-level allocation policy that coordinates these techniques. Section 7 presents a remapping optimization that reduces I/O copying overheads in large-memory systems. Section 8 examines related work. Finally, we summarize our conclusions and highlight opportunities for future work in Section 9.
2 Memory Virtualization
A guest operating system that executes within a virtual machine expects a zero-based physical address space, as provided by real hardware. ESX Server gives each VM this illusion, virtualizing physical memory by adding an extra level of address translation. Borrowing terminology from Disco [3], a machine address refers to actual hardware memory, while a physical address is a software abstraction used to provide the illusion of hardware memory to a virtual machine. We will often use “physical” in quotes to highlight this deviation from its usual meaning.
ESX Server maintains a pmap data structure for each VM to translate “physical” page numbers (PPNs) to machine page numbers (MPNs). VM instructions that manipulate guest OS page tables or TLB contents are intercepted, preventing updates to actual MMU state. Separate shadow page tables, which contain virtual-to-machine page mappings, are maintained for use by the processor and are kept consistent with the physical-to-machine mappings in the pmap.¹ This approach permits ordinary memory references to execute without additional overhead, since the hardware TLB will cache direct virtual-to-machine address translations read from the shadow page table.

¹The IA-32 architecture has hardware mechanisms that walk in-memory page tables and reload the TLB [13].
The extra level of indirection in the memory system is extremely powerful. The server can remap a “physical” page by changing its PPN-to-MPN mapping, in a manner that is completely transparent to the VM. The server may also monitor or interpose on guest memory accesses.
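To make the extra translation level concrete, the sketch below models a per-VM pmap as a simple PPN-to-MPN array and shows how the server could remap a “physical” page transparently. It is an illustrative simplification, not ESX Server code: the names (vm_pmap, pmap_remap) are hypothetical, and shadow page-table and TLB consistency handling are reduced to a comment.

    #include <stdint.h>

    #define INVALID_MPN ((uint32_t)-1)

    /* Hypothetical per-VM "physical"-to-machine translation table.
     * Each guest physical page number (PPN) maps to a machine page
     * number (MPN); shadow page tables derived from this map are
     * what the hardware MMU actually walks. */
    struct vm_pmap {
        uint32_t *ppn_to_mpn;   /* indexed by PPN */
        uint32_t  num_ppns;     /* number of guest "physical" pages */
    };

    /* Translate a guest PPN to the MPN currently backing it. */
    static uint32_t pmap_translate(const struct vm_pmap *pmap, uint32_t ppn)
    {
        return (ppn < pmap->num_ppns) ? pmap->ppn_to_mpn[ppn] : INVALID_MPN;
    }

    /* Remap a "physical" page to a new machine page, returning the old
     * MPN so the caller can reclaim or share it.  A real system would
     * also invalidate the affected shadow page-table entries and any
     * cached TLB translations so the change stays transparent to the VM. */
    static uint32_t pmap_remap(struct vm_pmap *pmap, uint32_t ppn, uint32_t new_mpn)
    {
        uint32_t old_mpn = pmap_translate(pmap, ppn);
        if (old_mpn != INVALID_MPN)
            pmap->ppn_to_mpn[ppn] = new_mpn;
        return old_mpn;
    }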
3 Reclamation Mechanisms
ESX Server supports overcommitment of memory to facilitate a higher degree of server consolidation than would be possible with simple static partitioning. Overcommitment means that the total size configured for all running virtual machines exceeds the total amount of actual machine memory. The system manages the allocation of memory to VMs automatically based on configuration parameters and system load.

Each virtual machine is given the illusion of having a fixed amount of physical memory. This max size is a configuration parameter that represents the maximum amount of machine memory it can be allocated. Since commodity operating systems do not yet support dynamic changes to physical memory sizes, this size remains constant after booting a guest OS. A VM will be allocated its maximum size when memory is not overcommitted.
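As a small illustration of this definition, the hypothetical helper below (not an ESX Server interface) checks whether a set of configured max sizes overcommits the machine:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Memory is overcommitted when the sum of the per-VM "max size"
     * parameters exceeds the machine memory available for VMs. */
    static bool memory_overcommitted(const uint64_t *max_size_mb,
                                     size_t num_vms,
                                     uint64_t machine_mb)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < num_vms; i++)
            total += max_size_mb[i];
        return total > machine_mb;
    }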
3.1 Page Replacement Issues
When memory is overcommitted, ESX Server must employ some mechanism to reclaim space from one or more virtual machines. The standard approach used by earlier virtual machine systems is to introduce another level of paging [9, 20], moving some VM “physical” pages to a swap area on disk. Unfortunately, an extra level of paging requires a meta-level page replacement policy: the virtual machine system must choose not only the VM from which to revoke memory, but also which of its particular pages to reclaim.

In general, a meta-level page replacement policy must make relatively uninformed resource management decisions. The best information about which pages are least valuable is known only by the guest operating system within each VM. Although there is no shortage of clever page replacement algorithms [26], this is actually the crux of the problem. A sophisticated meta-level policy is likely to introduce performance anomalies due to unintended interactions with native memory management policies in guest operating systems. This situation is exacerbated by diverse and often undocumented guest OS policies [1], which may vary across OS versions and may even depend on performance hints from applications [4].
The fact that paging is transparent to the guest OS can also result in a double paging problem, even when the meta-level policy is able to select the same page that the native guest OS policy would choose [9, 20]. Suppose the meta-level policy selects a page to reclaim and pages it out. If the guest OS is under memory pressure, it may choose the very same page to write to its own virtual paging device. This will cause the page contents to be faulted in from the system paging device, only to be immediately written out to the virtual paging device.
3.2 Ballooning
Ideally, a VM from which memory has been reclaimed should perform as if it had been configured with less memory. ESX Server uses a ballooning technique to achieve such predictable performance by coaxing the guest OS into cooperating with it when possible. This process is depicted in Figure 1.
A small balloon module is loaded into the guest OS as a pseudo-device driver or kernel service. It has no external interface within the guest, and communicates with ESX Server via a private channel. When the server wants to reclaim memory, it instructs the driver to “inflate” by allocating pinned physical pages within the VM, using appropriate native interfaces. Similarly, the server may “deflate” the balloon by instructing it to deallocate previously-allocated pages.
Inflating the balloon increases memory pressure in the guest OS, causing it to invoke its own native memory management algorithms. When memory is plentiful, the guest OS will return memory from its free list. When memory is scarce, it must reclaim space to satisfy the driver allocation request. The guest OS decides which particular pages to reclaim and, if necessary, pages them out to its own virtual disk. The balloon driver communicates the physical page number for each allocated page to ESX Server, which may then reclaim the corresponding machine page. Deflating the balloon frees up memory for general use within the guest OS.

Figure 1: Ballooning. ESX Server controls a balloon module running within the guest, directing it to allocate guest pages and pin them in “physical” memory. The machine pages backing this memory can then be reclaimed by ESX Server. Inflating the balloon increases memory pressure, forcing the guest OS to invoke its own memory management algorithms. The guest OS may page out to its virtual disk when memory is scarce. Deflating the balloon decreases pressure, freeing guest memory.
Although a guest OS should not touch any physical memory it allocates to a driver, ESX Server does not depend on this property for correctness. When a guest PPN is ballooned, the system annotates its pmap entry and deallocates the associated MPN. Any subsequent attempt to access the PPN will generate a fault that is handled by the server; this situation is rare, and most likely the result of complete guest failure, such as a reboot or crash. The server effectively “pops” the balloon, so that the next interaction with (any instance of) the guest driver will first reset its state. The fault is then handled by allocating a new MPN to back the PPN, just as if the page was touched for the first time.²
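The sketch below illustrates the fault-handling behavior just described for a ballooned page. It is a hypothetical simplification: the annotation bit and helper names (PMAP_BALLOONED, mpn_alloc_zeroed, balloon_reset) are invented for illustration, and the stubs merely stand in for real server facilities.

    #include <stdint.h>

    #define PMAP_BALLOONED 0x1u            /* hypothetical annotation bit */

    struct pmap_entry {
        uint32_t mpn;                      /* backing machine page */
        uint32_t flags;
    };

    /* Stubs standing in for real server facilities: allocate a zeroed
     * machine page, and mark the guest balloon driver for a reset. */
    static uint32_t mpn_alloc_zeroed(void)   { static uint32_t next = 1; return next++; }
    static void     balloon_reset(int vm_id) { (void)vm_id; }

    /* Handle a guest access to a PPN whose backing MPN was reclaimed
     * via ballooning.  Such accesses are rare (typically a guest reboot
     * or crash): the server "pops" the balloon so the next driver
     * interaction starts clean, then backs the page again as if it were
     * being touched for the first time. */
    static void handle_ballooned_fault(int vm_id, struct pmap_entry *e)
    {
        if (!(e->flags & PMAP_BALLOONED))
            return;                        /* not a ballooned page */

        balloon_reset(vm_id);
        e->mpn    = mpn_alloc_zeroed();
        e->flags &= ~PMAP_BALLOONED;
    }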
Our balloon drivers for the Linux, FreeBSD, and Windows operating systems poll the server once per second to obtain a target balloon size, and they limit their allocation rates adaptively to avoid stressing the guest OS. Standard kernel interfaces are used to allocate physical pages, such as get_free_page() in Linux, and MmAllocatePagesForMdl() or MmProbeAndLockPages() in Windows.
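As a rough sketch of the guest-side behavior described above, the loop below polls for a target balloon size once per second and inflates or deflates toward it, capping the per-iteration rate. All interfaces shown (esx_get_target_pages, guest_alloc_pinned_page, and so on) are hypothetical stand-ins declared but not implemented here; a real driver would use the native kernel allocators named above and the private channel to the server.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical interfaces standing in for the private channel to
     * ESX Server and the native guest allocators (e.g. get_free_page()
     * on Linux).  Declared only; not implemented in this sketch. */
    uint64_t esx_get_target_pages(void);       /* desired balloon size (pages) */
    void     esx_report_ppn(uint64_t ppn);     /* tell server which PPN we hold */
    uint64_t guest_alloc_pinned_page(void);    /* returns a pinned PPN, 0 on failure */
    void     guest_free_pinned_page(uint64_t ppn);
    void     sleep_seconds(unsigned s);

    #define MAX_PAGES_PER_TICK 1024            /* simplified adaptive rate limit */
    #define MAX_BALLOON_PAGES  (1u << 20)

    static uint64_t held[MAX_BALLOON_PAGES];   /* PPNs currently held by the balloon */
    static size_t   held_count;

    void balloon_poll_loop(void)
    {
        for (;;) {
            uint64_t target = esx_get_target_pages();
            size_t   step   = 0;

            /* Inflate: allocate and pin guest pages, reporting each PPN
             * to the server so the backing MPN can be reclaimed. */
            while (held_count < target && held_count < MAX_BALLOON_PAGES &&
                   step < MAX_PAGES_PER_TICK) {
                uint64_t ppn = guest_alloc_pinned_page();
                if (ppn == 0)
                    break;                     /* back off under guest memory pressure */
                esx_report_ppn(ppn);
                held[held_count++] = ppn;
                step++;
            }

            /* Deflate: return previously allocated pages to the guest. */
            while (held_count > target && step < MAX_PAGES_PER_TICK) {
                guest_free_pinned_page(held[--held_count]);
                step++;
            }

            sleep_seconds(1);                  /* poll the server once per second */
        }
    }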
Future guest OS support for hot-pluggable memory cards would enable an additional form of coarse-grained ballooning. Virtual memory cards could be inserted into or removed from a VM in order to rapidly adjust its physical memory size.

²ESX Server zeroes the contents of newly-allocated machine pages to avoid leaking information between VMs. Allocation also respects cache coloring by the guest OS; when possible, distinct PPN colors are mapped to distinct MPN colors.

To demonstrate the effectiveness of ballooning, we used the synthetic dbench benchmark [28] to simulate fileserver performance under load from 40 clients. This workload benefits significantly from additional memory, since a larger buffer cache can absorb more disk traffic. For this experiment, ESX Server was running on a dual-processor Dell Precision 420, configured to execute one VM running Red Hat Linux 7.2 on a single 800 MHz Pentium III CPU.

Figure 2: Balloon Performance. Throughput of a single Linux VM running dbench with 40 clients. The black bars plot the performance when the VM is configured with main memory sizes ranging from 128 MB to 256 MB. The gray bars plot the performance of the same VM configured with 256 MB, ballooned down to the specified size.

Figure 2 presents dbench throughput as a function of VM size, using the average of three consecutive runs for each data point. The ballooned VM tracks non-ballooned performance closely, with an observed overhead ranging from 4.4% at 128 MB (128 MB balloon) down to 1.4% at 224 MB (32 MB balloon). This overhead is primarily due to guest OS data structures that are sized based on the amount of “physical” memory; the Linux kernel uses more space in a 256 MB system than in a 128 MB system. Thus, a 256 MB VM ballooned down to 128 MB has slightly less free space than a VM configured with exactly 128 MB.

Despite its advantages, ballooning does have limitations. The balloon driver may be uninstalled, disabled explicitly, unavailable while a guest OS is booting, or temporarily unable to reclaim memory quickly enough to satisfy current system demands. Also, upper bounds on reasonable balloon sizes may be imposed by various guest OS limitations.

3.3 Demand Paging

ESX Server preferentially uses ballooning to reclaim memory, treating it as a common-case optimization. When ballooning is not possible or insufficient, the system falls back to a paging mechanism. Memory is reclaimed by paging out to an ESX Server swap area on disk, without any guest involvement.

The ESX Server swap daemon receives information about target swap levels for each VM from a higher-level policy module. It manages the selection of candidate pages and coordinates asynchronous page outs to a swap area on disk. Conventional optimizations are used to maintain free slots and cluster disk writes.

A randomized page replacement policy is used to prevent the types of pathological interference with native guest OS memory management algorithms described in Section 3.1. This choice was also guided by the expectation that paging will be a fairly uncommon operation. Nevertheless, we are investigating more sophisticated page replacement algorithms, as well as policies that may be customized on a per-VM basis.

4 Sharing Memory

Server consolidation presents numerous opportunities for sharing memory between virtual machines. For example, several VMs may be running instances of the same guest OS, have the same applications or components loaded, or contain common data. ESX Server exploits these sharing opportunities, so that server workloads running in VMs on a single machine often consume less memory than they would running on separate physical machines. As a result, higher levels of overcommitment can be supported efficiently.

4.1 Transparent Page Sharing

Disco [3] introduced transparent page sharing as a method for eliminating redundant copies of pages, such as code or read-only data, across virtual machines. Once copies are identified, multiple guest “physical” pages are mapped to the same machine page, and marked copy-on-write. Writing to a shared page causes a fault that generates a private copy.

Unfortunately, Disco required several guest OS modifications to identify redundant copies as they were created. For example, the bcopy() routine was hooked to
enable file buffer cache sharing across virtual machines. Some sharing also required the use of non-standard or restricted interfaces. A special network interface with support for large packets facilitated sharing data communicated between VMs on a virtual subnet. Interposition on disk accesses allowed data from shared, non-persistent disks to be shared across multiple guests.
4.2 Content-Based Page Sharing
Because modifications to guest operating system internals are not possible in our environment, and changes to application programming interfaces are not acceptable, ESX Server takes a completely different approach to page sharing. The basic idea is to identify page copies by their contents. Pages with identical contents can be shared regardless of when, where, or how those contents were generated. This general-purpose approach has two key advantages. First, it eliminates the need to modify, hook, or even understand guest OS code. Second, it can identify more opportunities for sharing; by definition, all potentially shareable pages can be identified by their contents.
The cost for this unobtrusive generality is that work must be performed to scan for sharing opportunities. Clearly, comparing the contents of each page with every other page in the system would be prohibitively expensive; naive matching would require O(n²) page comparisons. Instead, hashing is used to identify pages with potentially-identical contents efficiently.
A hash value that summarizes a page's contents is used as a lookup key into a hash table containing entries for other pages that have already been marked copy-on-write (COW). If the hash value for the new page matches an existing entry, it is very likely that the pages are identical, although false matches are possible. A successful match is followed by a full comparison of the page contents to verify that the pages are identical.
Once a match has been found with an existing shared page, a standard copy-on-write technique can be used to share the pages, and the redundant copy can be reclaimed. Any subsequent attempt to write to the shared page will generate a fault, transparently creating a private copy of the page for the writer.
Figure 3: Content-Based Page Sharing. ESX Server scans for sharing opportunities, hashing the contents of candidate PPN 0x2868 in VM 2. The hash is used to index into a table containing other scanned pages, where a match is found with a hint frame associated with PPN 0x43f8 in VM 3. If a full comparison confirms the pages are identical, the PPN-to-MPN mapping for PPN 0x2868 in VM 2 is changed from MPN 0x1096 to MPN 0x123b, both PPNs are marked COW, and the redundant MPN is reclaimed.

If no match is found, one option is to mark the page COW in anticipation of some future match. However, this simplistic approach has the undesirable side-effect of marking every scanned page copy-on-write, incurring unnecessary overhead on subsequent writes. As an optimization, an unshared page is not marked COW, but instead tagged as a special hint entry. On any future match with another page, the contents of the hint page are rehashed. If the hash has changed, then the hint page has been modified, and the stale hint is removed. If the hash is still valid, a full comparison is performed, and the pages are shared if it succeeds.
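To make the scan-and-share flow concrete, here is a simplified sketch of how one candidate page might be handled under the hint scheme just described. It is illustrative only: the frame layout and helpers (hash_page, full_compare, share_and_mark_cow, and the table functions) are hypothetical and are declared without implementations; the actual ESX Server structures are described in Section 4.3.

    #include <stdbool.h>
    #include <stdint.h>

    enum frame_kind { FRAME_HINT, FRAME_SHARED };

    struct frame {
        enum frame_kind kind;
        uint64_t hash;        /* 64-bit content hash (truncated for real hint frames) */
        uint32_t mpn;         /* shared frames: machine page holding the master copy */
        int      vm;          /* hint frames: back-reference to the guest page */
        uint32_t ppn;
    };

    /* Hypothetical helpers assumed to exist elsewhere. */
    uint64_t      hash_page(uint32_t mpn);                 /* hash page contents */
    struct frame *table_lookup(uint64_t hash);             /* NULL if no entry   */
    void          table_insert_hint(uint64_t h, int vm, uint32_t ppn);
    void          table_remove(struct frame *f);
    void          table_upgrade_to_shared(struct frame *f, uint32_t mpn);
    bool          full_compare(uint32_t mpn_a, uint32_t mpn_b);
    uint32_t      ppn_to_mpn(int vm, uint32_t ppn);
    void          share_and_mark_cow(int vm, uint32_t ppn, uint32_t shared_mpn);

    /* Consider one candidate guest page for sharing. */
    void scan_candidate(int vm, uint32_t ppn)
    {
        uint32_t mpn  = ppn_to_mpn(vm, ppn);
        uint64_t hash = hash_page(mpn);
        struct frame *f = table_lookup(hash);

        if (f == NULL) {
            /* No match yet: record a hint rather than marking the page COW. */
            table_insert_hint(hash, vm, ppn);
            return;
        }

        if (f->kind == FRAME_HINT) {
            /* Rehash the hint page; a changed hash means it was modified. */
            uint32_t hint_mpn = ppn_to_mpn(f->vm, f->ppn);
            if (hash_page(hint_mpn) != f->hash) {
                table_remove(f);                 /* drop the stale hint */
                table_insert_hint(hash, vm, ppn);/* one reasonable follow-up */
                return;
            }
            if (full_compare(mpn, hint_mpn)) {
                /* Promote the hint to a shared COW frame, then share both pages. */
                table_upgrade_to_shared(f, hint_mpn);
                share_and_mark_cow(f->vm, f->ppn, hint_mpn);
                share_and_mark_cow(vm, ppn, hint_mpn);   /* old MPN is reclaimed */
            }
        } else if (f->kind == FRAME_SHARED && full_compare(mpn, f->mpn)) {
            /* Map this PPN to the existing shared copy, copy-on-write. */
            share_and_mark_cow(vm, ppn, f->mpn);
        }
    }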
Higher-level page sharing policies control when and where to scan for copies. One simple option is to scan pages incrementally at some fixed rate. Pages could be considered sequentially, randomly, or using heuristics to focus on the most promising candidates, such as pages marked read-only by the guest OS, or pages from which code has been executed. Various policies can be used to limit CPU overhead, such as scanning only during otherwise-wasted idle cycles.
4.3 Implementation
The ESX Server implementation of content-based page sharing is illustrated in Figure 3. A single global hash table contains frames for all scanned pages, and chaining is used to handle collisions. Each frame is encoded compactly in 16 bytes. A shared frame consists of a hash value, the machine page number (MPN) for the shared page, a reference count, and a link for chaining. A hint frame is similar, but encodes a truncated hash value to make room for a reference back to the corresponding guest page, consisting of a VM identifier and a physical page number (PPN). The total space overhead for page sharing is less than 0.5% of system memory.

Unlike the Disco page sharing implementation, which maintained a backmap for each shared page, ESX Server uses a simple reference count. A small 16-bit count is stored in each frame, and a separate overflow table is used to store any extended frames with larger counts. This allows highly-shared pages to be represented compactly. For example, the empty zero page filled completely with zero bytes is typically shared with a large reference count. A similar overflow technique for large reference counts was used to save space in the early OOZE virtual memory system [15].

A fast, high-quality hash function [14] is used to generate a 64-bit hash value for each scanned page. Since the chance of encountering a false match due to hash aliasing is incredibly small,³ the system can make the simplifying assumption that all shared pages have unique hash values. Any page that happens to yield a false match is considered ineligible for sharing.

The current ESX Server page sharing implementation scans guest pages randomly. Although more sophisticated approaches are possible, this policy is simple and effective. Configuration options control maximum per-VM and system-wide page scanning rates. Typically, these values are set to ensure that page sharing incurs negligible CPU overhead. As an additional optimization, the system always attempts to share a page before paging it out to disk.

To evaluate the ESX Server page sharing implementation, we conducted experiments to quantify its effectiveness at reclaiming memory and its overhead on system performance. We first analyze a “best case” workload consisting of many homogeneous VMs, in order to demonstrate that ESX Server is able to reclaim a large fraction of memory when the potential for sharing exists. We then present additional data collected from production deployments serving real users.

We performed a series of controlled experiments using identically-configured virtual machines, each running Red Hat Linux 7.2 with 40 MB of “physical” memory. Each experiment consisted of between one and ten concurrent VMs running SPEC95 benchmarks for thirty minutes. For these experiments, ESX Server was running on a Dell PowerEdge 1400SC multiprocessor with two 933 MHz Pentium III CPUs.

Figure 4: Page Sharing Performance. Sharing metrics for a series of experiments consisting of identical Linux VMs running SPEC95 benchmarks. The top graph indicates that the absolute amounts of memory shared and saved increase smoothly with the number of concurrent VMs. The bottom graph plots these metrics as a percentage of aggregate VM memory. For large numbers of VMs, sharing approaches 67% and nearly 60% of all VM memory is reclaimed.

Figure 4 presents several sharing metrics plotted as a function of the number of concurrent VMs. Surprisingly, some sharing is achieved with only a single VM. Nearly 5 MB of memory was reclaimed from a single VM, of which about 55% was due to shared copies of the zero page. The top graph shows that after an initial jump in sharing between the first and second VMs, the total amount of memory shared increases linearly with the number of VMs, as expected. Little sharing is attributed to zero pages, indicating that most sharing is due to redundant code and read-only data pages. The bottom graph plots these metrics as a percentage of aggregate VM memory. As the number of VMs increases, the sharing level approaches 67%, revealing an overlap of approximately two-thirds of all memory between the VMs. The amount of memory required to contain the single copy of each common shared page (labelled Shared - Reclaimed) remains nearly constant, decreasing as a percentage of overall VM memory.

³Assuming page contents are randomly mapped to 64-bit hash values, the probability of a single collision doesn't exceed 50% until approximately √(2^64) = 2^32 distinct pages are hashed [14]. For a static snapshot of the largest possible IA-32 memory configuration with 2^24 pages (64 GB), the collision probability is less than 0.01%.
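Returning to the frame encoding described above, the sketch below shows one possible way (purely illustrative, not the actual ESX Server layout) to keep a 16-bit reference count in each shared frame and spill larger counts to a small overflow table, so that heavily shared pages such as the zero page remain cheap to represent.

    #include <stddef.h>
    #include <stdint.h>

    #define REF_OVERFLOW   UINT16_MAX   /* sentinel: real count lives in overflow table */
    #define OVERFLOW_SLOTS 256

    struct shared_frame {
        uint64_t hash;
        uint32_t mpn;
        uint16_t refs;        /* small common-case count */
        uint16_t next;        /* chaining link (unused in this sketch) */
    };

    /* Hypothetical overflow table keyed by MPN for the rare frames whose
     * reference count exceeds 16 bits (MPN 0 is treated as a free slot). */
    static struct { uint32_t mpn; uint64_t refs; } overflow[OVERFLOW_SLOTS];

    static uint64_t *overflow_entry(uint32_t mpn)
    {
        for (int i = 0; i < OVERFLOW_SLOTS; i++)
            if (overflow[i].mpn == mpn)
                return &overflow[i].refs;
        for (int i = 0; i < OVERFLOW_SLOTS; i++)
            if (overflow[i].mpn == 0) {          /* claim a free slot */
                overflow[i].mpn = mpn;
                return &overflow[i].refs;
            }
        return NULL;  /* table full; a real system would grow it */
    }

    /* Increment a frame's reference count, migrating to the overflow
     * table once the 16-bit field saturates. */
    void frame_ref(struct shared_frame *f)
    {
        if (f->refs < REF_OVERFLOW - 1) {
            f->refs++;                           /* common case: fits in 16 bits */
            return;
        }
        uint64_t *big = overflow_entry(f->mpn);
        if (big) {
            if (f->refs != REF_OVERFLOW)         /* first spill: carry count over */
                *big = f->refs;
            (*big)++;
            f->refs = REF_OVERFLOW;
        }
    }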
         Guest Types   Total (MB)   Shared (MB / %)   Reclaimed (MB / %)
    A    10 WinNT      2048         880 / 42.9        673 / 32.9
    B    9 Linux       1846         539 / 29.2        345 / 18.7
    C    5 Linux       1658         165 / 10.0        120 / 7.2

Figure 5: Real-World Page Sharing. Sharing metrics from production deployments of ESX Server. (a) Ten Windows NT VMs serving users at a Fortune 50 company, running a variety of database (Oracle, SQL Server), web (IIS, Websphere), development (Java, VB), and other applications. (b) Nine Linux VMs serving a large user community for a nonprofit organization, executing a mix of web (Apache), mail (Majordomo, Postfix, POP/IMAP, MailArmor), and other servers. (c) Five Linux VMs providing web proxy (Squid), mail (Postfix, RAV), and remote access (ssh) services to VMware employees.
The CPU overhead due to page sharing was negligible. We ran an identical set of experiments with page sharing disabled, and measured no significant difference in the aggregate throughput reported by the CPU-bound benchmarks running in the VMs. Over all runs, the aggregate throughput was actually 0.5% higher with page sharing enabled, and ranged from 1.6% lower to 1.8% higher. Although the effect is generally small, page sharing does improve memory locality, and may therefore increase hit rates in physically-indexed caches.

These experiments demonstrate that ESX Server is able to exploit sharing opportunities effectively. Of course, more diverse workloads will typically exhibit lower degrees of sharing. Nevertheless, many real-world server consolidation workloads do consist of numerous VMs running the same guest OS with similar applications. Since the amount of memory reclaimed by page sharing is very workload-dependent, we collected memory sharing statistics from several ESX Server systems in production use.
Figure 5 presents page sharing metrics collected from three different production deployments of ESX Server. Workload A, from a corporate IT department at a Fortune 50 company, consists of ten Windows NT 4.0 VMs running a wide variety of database, web, and other servers. Page sharing reclaimed nearly a third of all VM memory, saving 673 MB. Workload B, from a nonprofit organization's Internet server, consists of nine Linux VMs ranging in size from 64 MB to 768 MB, running a mix of mail, web, and other servers. In this case, page sharing was able to reclaim 18.7% of VM memory, saving 345 MB, of which 70 MB was attributed to zero pages. Finally, workload C is from VMware's own IT department, and provides web proxy, mail, and remote access services to our employees using five Linux VMs ranging in size from 32 MB to 512 MB. Page sharing reclaimed about 7% of VM memory, for a savings of 120 MB, of which 25 MB was due to zero pages.
5 Shares vs. Working Sets
Traditional operating systems adjust memory allocations to improve some aggregate, system-wide performance metric. While this is usually a desirable goal, it often conflicts with the need to provide quality-of-service guarantees to clients of varying importance. Such guarantees are critical for server consolidation, where each VM may be entitled to different amounts of resources based on factors such as importance, ownership, administrative domains, or even the amount of money paid to a service provider for executing the VM. In such cases, it can be preferable to penalize a less important VM, even when that VM would derive the largest performance benefit from additional memory.
ESX Server employs a new allocation algorithm that is able to achieve efficient memory utilization while maintaining memory performance isolation guarantees. In addition, an explicit parameter is introduced that allows system administrators to control the relative importance of these conflicting goals.
5.1 Share-Based Allocation
In proportional-share frameworks, resource rights are encapsulated by shares, which are owned by clients that consume resources. A client is entitled to consume resources proportional to its share allocation; it is guaranteed a minimum resource fraction equal to its fraction of the total shares in the system. Shares represent relative resource rights that depend on the total number of shares contending for a resource. Client allocations degrade gracefully in overload situations, and clients proportionally benefit from extra resources when some allocations are underutilized.
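As a small worked example of this guarantee (hypothetical code, not the ESX Server allocator): a client's minimum entitlement is its fraction of the total shares times the managed memory, so a VM holding 300 of 1000 outstanding shares on a machine managing 2048 MB is guaranteed roughly 614 MB.

    #include <stdint.h>
    #include <stdio.h>

    /* Minimum guaranteed allocation under proportional sharing:
     * the client's fraction of total shares times the managed resource. */
    static uint64_t min_entitlement_mb(uint64_t shares,
                                       uint64_t total_shares,
                                       uint64_t managed_mb)
    {
        return (shares * managed_mb) / total_shares;
    }

    int main(void)
    {
        /* Hypothetical example: 300 of 1000 shares on 2048 MB. */
        printf("guaranteed: %llu MB\n",
               (unsigned long long)min_entitlement_mb(300, 1000, 2048));
        return 0;
    }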
Both randomized and deterministic alg
