Monthly Archives: August 2008

Xen 3.3 Feature: Shadow 3

Shadow 3 is the next step in the evolution of the shadow pagetable code.  By making the shadow pagetables behave more like a TLB, we take advantage of guest operating system TLB behavior to coalesce guest pagetable changes and reduce the number that the hypervisor has to translate into the shadow pagetables.  This can dramatically reduce the virtualization overhead for HVM guests.

Shadow paging overhead is one of the largest sources of CPU virtualization overhead for HVM guests.  Because HVM guest operating systems don’t know the physical frame numbers of the pages assigned to them, they use guest frame numbers instead.  This requires the hypervisor to translate each guest frame number into a machine frame in the shadow pagetables before the pagetables can be used by the guest.

Those who have been around awhile may remember the Shadow-1 code.  Its method of propagating changes from guest pagetables to the shadow pagetables was as follows:

  • Remove write access to any guest pagetable.
  • When a guest attempts to write to the guest pagetable, mark it out-of-sync, add the page to the out-of-sync list and give write permission.
  • On the next page fault or cr3 write, take each page from the out-of-sync list and:
    • resync the page: look for changes to the guest pagetable, propagate those entries into the shadow pagetable
    • remove write permission, and clear the out-of-sync bit.
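The Shadow-1 cycle above can be sketched as a toy model in Python. This is purely illustrative (Xen's actual implementation is in C inside the hypervisor); all class and function names here are invented, and the `g2m` dict stands in for the guest-frame-to-machine-frame translation:

```python
# Toy model of the Shadow-1 out-of-sync mechanism (illustrative only).
# Guest pagetables map virtual page -> guest frame number; shadows map
# virtual page -> machine frame number via the g2m translation table.

class Shadow1:
    def __init__(self, g2m, oos_list):
        self.g2m = g2m                # guest-frame -> machine-frame table
        self.guest = {}               # guest pagetable contents
        self.shadow = {}              # shadow pagetable used by hardware
        self.write_protected = True   # start with write access removed
        self.out_of_sync = False
        self.oos_list = oos_list      # shared out-of-sync list

    def guest_write(self, vpage, gfn):
        # The first write to a protected pagetable faults into the
        # hypervisor: mark out-of-sync, add to the list, grant write access.
        if self.write_protected:
            self.out_of_sync = True
            self.write_protected = False
            self.oos_list.append(self)
        self.guest[vpage] = gfn       # later writes are ordinary stores

    def resync(self):
        # On the next page fault or CR3 write: propagate guest changes
        # into the shadow, re-protect, and clear the out-of-sync state.
        for vpage, gfn in self.guest.items():
            self.shadow[vpage] = self.g2m[gfn]
        self.write_protected = True
        self.out_of_sync = False

def cr3_write(oos_list):
    # Resync every page on the out-of-sync list.
    while oos_list:
        oos_list.pop().resync()
```

Note how two guest writes to the same pagetable cost only one trap: the batching effect that makes this scheme attractive when many entries change between resync points.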

While this method worked so-so for Linux, it was disastrous for Windows.  Windows relies heavily on demand paging.  Resyncing a guest page is an expensive operation, and under Shadow-1, every page that was faulted in triggered an unsync, a write, and a resync.

The next step, Shadow-2, (among many other things) did away with the out-of-sync mechanism and instead emulated every write to guest pagetables.  Emulation avoids the expensive unsync-resync cycle for demand paging.  However, it removes any “batching” effects: every write is immediately reflected in the shadow pagetables, even though the guest operating system may not have been expecting the address change to be available until later.

Furthermore, Windows will frequently write “transition values” into pagetable entries when a page is being mapped in or mapped out.  The cycle for demand-faulting zero pages in 32-bit Windows looks like:

  • Guest process page faults
  • Write transition PTE
  • Write real PTE
  • Guest process accesses page

On bare hardware, this looks like “Page fault / memory write / memory write”.  Memory writes are relatively inexpensive.  But in Shadow-2, this looks like:

  • Page fault
  • Emulated write
  • Emulated write

Each emulated write involves a VMEXIT/VMENTER as well as about 8000 cycles of emulation inside the hypervisor, much more expensive than a mere memory write.
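A rough back-of-the-envelope comparison makes the gap concrete. Only the ~8000-cycle emulation figure comes from the text above; the other costs are illustrative assumptions, not measured numbers:

```python
# Rough cost model for one demand-fault cycle.  Only EMULATION (~8000
# cycles) comes from the text; the rest are illustrative assumptions.
VMEXIT_VMENTER = 2000   # assumed VMEXIT/VMENTER round-trip cost, cycles
EMULATION = 8000        # per-write emulation inside the hypervisor
MEM_WRITE = 10          # assumed cost of an ordinary PTE memory store
PAGE_FAULT = 1000       # assumed cost of handling the initial fault

# Bare hardware: page fault + two plain memory writes.
bare_metal = PAGE_FAULT + 2 * MEM_WRITE

# Shadow-2: the fault itself exits to the hypervisor, and each of the
# two PTE writes costs an exit plus full emulation.
shadow2 = (PAGE_FAULT + VMEXIT_VMENTER) + 2 * (VMEXIT_VMENTER + EMULATION)
```

Under these assumptions the Shadow-2 path is more than twenty times the bare-metal cost, which is why a demand-paging-heavy workload like Windows suffers so badly.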

Shadow-3 brings back the out-of-sync mechanism, but with some key changes.  First, only L1 pagetables are allowed to go out-of-sync; all writes to L2+ pagetables are still emulated.  Secondly, we don’t necessarily resync on the next page fault.  One thing this enables is a “lazy pull-through”: if we get a page fault where the shadow entry is not-present but the guest entry is present, we can simply propagate that entry to the shadows and return to the guest, leaving the rest of the page out-of-sync.  This means that once a page is out-of-sync, demand-faulting looks like this:

  • Page fault
  • Memory write
  • Memory write
  • Propagate guest entry to shadows

Pulling through a single guest value is actually cheaper than emulation.  So for demand-paging under Windows, we have 1/3 fewer trips into the hypervisor.  Furthermore, batch updates, like process destruction or mapping large address spaces, are propagated to the shadows in a batch at the next CR3 switch, rather than going into and out of the hypervisor on each individual write.
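The lazy pull-through and deferred batch resync can be sketched in the same toy-model style as before (illustrative Python, not Xen's C code; all names are invented):

```python
# Toy model of Shadow-3's lazy pull-through (illustrative only).
class Shadow3:
    def __init__(self, g2m):
        self.g2m = g2m          # guest-frame -> machine-frame table
        self.guest = {}         # guest L1 pagetable, allowed out-of-sync
        self.shadow = {}        # shadow L1 pagetable
        self.out_of_sync = False

    def guest_write(self, vpage, gfn):
        # Once out-of-sync, guest PTE writes are plain memory stores:
        # no trap into the hypervisor.
        self.out_of_sync = True
        self.guest[vpage] = gfn

    def shadow_fault(self, vpage):
        # Shadow not-present but guest present: pull through just this
        # one entry and return, leaving the rest of the page out-of-sync.
        if vpage in self.guest and vpage not in self.shadow:
            self.shadow[vpage] = self.g2m[self.guest[vpage]]
            return "pulled-through"
        return "real-fault"     # forward to the guest's fault handler

    def cr3_switch(self):
        # Batched updates (process teardown, large mappings) are
        # propagated in one pass at the next CR3 write.
        for vpage, gfn in self.guest.items():
            self.shadow[vpage] = self.g2m[gfn]
        self.out_of_sync = False
```

Compared with the Shadow-2 path, the two PTE writes here cost nothing beyond ordinary memory stores, and only the final fault enters the hypervisor to pull the single entry through.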

All of this adds up to greatly improved performance for workloads like compilation, compression, databases, and any workload which does a lot of memory management in an HVM guest.

Xen Summit 2009 Proposal

I am currently working on the Community Plans for 2009 Xen Summits and I wanted to share my thoughts with the community to get feedback on my ideas. In the past, Xen Summits have been held every 9 months with the majority of them being in North America. It is my intention, as you can see with the upcoming Xen Summit Tokyo/Asia in November, to ensure that we provide an opportunity for all community members to attend a Xen Summit without having to travel a great distance. To support this concept, I am proposing the following plan for 2009:

  • Xen Summit North America
    • Location: Oracle is scheduled to host this event on February 24 – 25, 2009 in Redwood Shores, CA at Oracle’s Conference Center
    • Focus: The development community, with an agenda highlighting the latest features being developed, status updates on research projects leveraging Xen, and customer demonstrations of Xen solutions
    • Length: 2 Days
  • Xen Summit Europe
    • Location: I am speaking to the LinuxTAG and Linux Kongress organizations about co-locating with one of their events in Germany
    • Focus: Research and customer demonstrations of how Xen is being used; as this event is co-located with a Linux event, it is a good opportunity to promote the Xen solution to a wider audience, so the agenda needs to be more “how to use Xen”
    • Length: 1 Day
  • Xen Summit Asia
    • Location: OPEN (Xen Summit Tokyo/Asia 2008 is being hosted by Fujitsu in Tokyo)
    • Focus: Specific developer topics related to areas critical in Asia (e.g. IA64), how Xen is being used in Asia, and new research occurring in Asia [a cross between Xen Summit North America and Xen Summit Europe in overall agenda focus]
    • Length: 2 Days
  • Xen Summit North America II
    • Location: If we follow previous schedules, this event will be held nine months after Xen Summit North America, which would put it in the fall of 2009
    • Focus: Same as Xen Summit North America
    • Length: 2 Days

Note, I have tried to create different focuses for the events to ensure that community members are not required to attend all the events to stay in touch with the community. The Xen Summit North America event will be the developer focused meeting while the other Xen Summits will take a more customer/researcher focus.

As for the future, I have received requests to hold a Xen Summit in India and possibly South America. As the community grows, I expect to see us offer more events globally to better serve the global community and we will revisit the plan when scheduling for 2010.

Community Questions for Discussion

  1. Do we want to host two Xen Summit North America events next year to continue the nine-month separation between events?
  2. Is there demand for a 1-day Xen Summit event in Germany? Is there another location in Europe that would be better? Is there another event to consider for co-location?

Xen 3.3 Feature : Memory Overcommit

From Dan Magenheimer at Oracle:

Memory overcommit provides the ability for the sum of the physical memory allocated to all active domains to exceed the total physical memory on the system.  For example, if your machine has 4GB of RAM and you want to run as many 1GB domains as possible, you can run at most three — because Xen and domain 0 require some physical memory also.  With the new memory overcommit feature in Xen 3.3, in some environments, you can run six or ten or even more.

To be clear, there is no magic:  Memory overcommit may have some performance impact and may be unusable in some environments.  Memory for new domains is obtained by taking it away from currently running domains, so environments where all domains heavily utilize memory are not candidates for memory overcommit.  And to maximize the benefit, all domains must be properly configured.  But for environments which require a high ratio of virtual domains to physical machines and are willing to make some tradeoffs, memory overcommit can substantially increase “VM density” and save cost.

Memory is taken from one domain and given to another using the existing Xen “ballooning” mechanism, which has recently been improved to be more robust.  For example, a domain that is idle (or nearly so) is probably not using much memory; this memory can be made available to use in another domain, or for a newly created domain.  The tricky part is to determine how MUCH memory can be taken away from domains without causing problems for them; and, even more importantly, how to give the memory back if a domain suddenly needs it again.

This careful memory balancing ideally should be done in a management tool that can monitor memory needs of all domains and add or subtract memory from each domain as needed.  A very simple management tool supplied with Xen 3.3 provides “self-ballooning” and, while more sophisticated tools may be needed in the future, self-ballooning is sufficient for many environments.
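The self-ballooning idea can be sketched as a small policy loop. The shipped tool is a script in the Xen tree, not this code; the function names, the committed-memory heuristic, and all the constants below are illustrative assumptions:

```python
# Schematic self-ballooning policy (illustrative; the actual xenballoond
# tool in the Xen tree is a script, and its heuristics may differ).

def balloon_target(committed_kb, minmem_kb, hysteresis=0.1):
    """Aim slightly above the guest's committed memory so normal
    fluctuation doesn't thrash the balloon, but never go below a floor."""
    target = int(committed_kb * (1 + hysteresis))
    return max(target, minmem_kb)

def selfballoon_step(current_kb, committed_kb, minmem_kb, max_step_kb=4096):
    """One iteration of the balancing loop: give memory back to the guest
    immediately when it needs it, but release idle memory to the
    hypervisor gradually, in bounded steps."""
    target = balloon_target(committed_kb, minmem_kb)
    delta = target - current_kb
    if delta > 0:
        return current_kb + delta               # grow quickly on demand
    return current_kb + max(delta, -max_step_kb)  # shrink gradually
```

The asymmetry (grow fast, shrink slowly) reflects the text's point that the hard part is giving memory back when a domain suddenly needs it again; freeing idle memory can afford to be cautious.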

To best implement memory overcommit, all domains should be configured with a properly sized and configured virtual swap disk, and all HVM domains must have a working balloon driver and runnable Xenstore tools.  Next, self-ballooning scripts are installed in each domain and enabled as a service.

The scripts, along with a comprehensive README, are found in xen.hg/tools/xenballoond in the open source Xen distribution. Once all domains are rebooted, automatic memory balancing will occur and idle memory is freed up to run additional domains, thus resulting in memory overcommit!

For more information, see: XenSummit2008.pdf

Xen 3.3 Feature Details

Community:

As part of the Xen 3.3 release, I have asked the various development authors to supply me with information on their new features. Over the next few weeks, I will be posting their overviews to this blog to give everyone further information on the features in the new release.

Preview the new Website

Community:

As many of you are aware, I have been working the past few months to update the current website to better target various users of the site as well as simplify the organization of the information. I have completed the web development and am now making the site available for feedback and comment. You can reach the site at and I encourage all feedback from broken links to “what were you thinking?”. I plan to allow at least 2 weeks for comments before I make the final changes and transition the new site to

If you are interested in seeing the design document that I based the new site on or the target profile descriptions, please search in the blog for “web development” and you will get those documents.

Please note that links to the Wiki, Bug Tracker, Source Browser, and Mercurial Repository will take you to the existing header structure as I am in the process of preparing those services to link to the new site.