Tag Archives: policy

Xen Project automatic testing on community infrastructure

Currently the Xen Project’s automatic testing setup runs on a small set of hardware in space borrowed from Citrix. Because it’s on the Citrix network, it’s not possible to give access to other community members. The underlying systems are showing their age. And the setup is too small – we already find that testing is slower than we would like.

The Xen Project Test Framework Working Group has agreed to press forward with a plan to provide a new setup (in a public colo, probably). We have a budget for this from the Advisory Board, which we think will be sufficient to provide a bigger and better setup than we have now.

We have decided to separate this immediately pressing concern – the inadequate and inaccessible hosting – from the longer-term questions of how to make more use of Xen community members’ existing test software. In particular, we have deferred the question of whether to stick with the existing osstest system long-term, or move to another system such as Citrix’s XenRT.

We’ll consider whether, when and how to make such a transition after we have sorted out our underlying infrastructure. We will make sure that the hardware and facilities we are organising now will be suitable for whatever software system we might want to run.

So, our immediate task now is to set out a more detailed plan for the amount and kind of hardware to acquire, and to identify a suitable hosting facility.

Ballooning, rebooting, and the feature you’ve never heard of

Today I’d like to talk about a functionality of Xen you may not have heard of, but might have actually used without even knowing it. If you use memory ballooning to resize your guests, you’ve likely used “populate-on-demand” at some point. 

As you may know, ballooning is a technique used to dynamically adjust the physical memory in use by a guest. It involves having a driver in the guest OS, called a balloon driver, allocate pages from the guest OS and then hand those pages back to Xen. From the guest OS perspective, it still has all the memory that it started with; it just has a device driver that’s a real memory hog. But from Xen’s perspective, the memory which the device driver asked for is no longer real memory — it’s just empty space (hence “balloon”). When the administrator wants to give memory back to the VM, the balloon driver will ask Xen to fill the empty space with memory again (shrinking or “deflating” the balloon), and then “free” the resulting pages back to the guest OS (making the memory available for use again).

While this can be used to shrink guest memory and then expand it again, this technique has an important limitation: It can never grow the memory above the starting size of the VM. This is because the only way to grow guest memory is to “deflate” the balloon. Once it gets back to the starting size of the VM, the balloon is entirely deflated and no additional memory can be added by the balloon driver.

To see why this is important, consider the following scenario.

Hosts A and B each have 4GiB of RAM and each runs 2 VMs with 2GiB of RAM. Suppose you want to reboot host B to do some hardware maintenance. You could do the following (a sketch of the corresponding xl commands appears after the list):

  • Balloon all 4 VMs down to 1GiB
  • Migrate the 2 VMs from host B onto host A
  • Shut down host B to do your maintenance
  • Bring up host B
  • Migrate the 2 VMs originally on host B back
  • Balloon all 4 VMs back up to 2GiB
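
For concreteness, this is roughly what that sequence looks like with the xl toolstack. The domain and host names here are made up, and the exact option syntax can vary between Xen versions, so treat it as a sketch rather than a recipe:

# Balloon all four VMs down to 1GiB (sizes are in MiB); repeat for each domain
xl mem-set vm1 1024
xl mem-set vm2 1024

# On host B: migrate its two VMs onto host A
xl migrate vm3 hostA
xl migrate vm4 hostA

# ... shut down host B, do the maintenance, bring it back up ...

# Migrate the two VMs back to host B, then balloon all four back up to 2GiB
xl migrate vm3 hostB
xl migrate vm4 hostB
xl mem-set vm1 2048
xl mem-set vm2 2048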

All well and good. But suppose that while you had those VMs ballooned down to 1GiB, you needed to reboot one of them. Now you have a problem: most operating systems will only check how much memory is available at boot time, and you only have 1GiB of free memory. If the VM boots with 1GiB of memory, you will be able to balloon it *smaller* than 1GiB, but you will not be able to balloon it back up to 2GiB when the maintenance of host B is done.

This is where populate-on-demand comes in. It allows a VM to boot with a maximum memory larger than its current target memory. It enables a guest that thinks it has 2GiB of RAM to boot while only actually using 1GiB of RAM. It can do this because it only needs to allow the guest to run until the balloon driver can start. Once the balloon driver starts, it will “inflate” the balloon to the proper size. At that point, there is nothing special to do; the VM looks like it did when we shut it down (guest thinks it has 2GiB of RAM, but 1GiB is allocated to the balloon driver and not accessed). When host B comes back up and more memory is free, the balloon driver can deflate the balloon, bringing the total memory back up to 2GiB.

Populate-on-demand comes into play in Xen whenever you start an HVM guest with maxmem and memory set to different values. In that case, the guest will be told it has maxmem RAM, but will only have memory allocated to it; the populate-on-demand code will allow the guest to run in this mode until the balloon driver comes up and “frees” maxmem minus memory back to Xen.
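
Concretely, in an xl config file that just means giving the guest different memory and maxmem values. A minimal, hypothetical HVM config (names and paths are invented) might look like this:

# Because maxmem > memory, the domain builder constructs this guest in
# populate-on-demand mode: the guest is told it has 2048 MiB, but only
# 1024 MiB of host memory is allocated for it (the PoD pool).
name    = "guest1"
builder = "hvm"
vcpus   = 2
memory  = 1024
maxmem  = 2048
disk    = [ "phy:/dev/vg0/guest1,xvda,w" ]
vif     = [ "bridge=xenbr0" ]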

Virtualizing memory: A primer

In order to describe how populate-on-demand works, I’ll need to explain a bit more about how Xen virtualizes memory. On real hardware, the actual hardware memory is referred to as physical memory, and it is typically divided into 4KiB chunks called physical frames. These frames are addressed by their physical frame number, or pfn. In the x86 world, pfns typically start at 0 and are mostly contiguous (with the occasional “hole” for IO devices). Historically, on x86 platforms, the description of which pfns are backed by usable RAM is in something called the E820 map, provided by the BIOS to operating systems at boot.

When we virtualize, we need to provide the guest with a virtual “physical address space,” described in the virtual E820 map provided to the guest. These frames are called guest physical frame numbers, or gpfns. But of course there is still real hardware backing this memory; in the virtualization world, it is common to refer to these as machine frames, or mfns. Every usable gpfn must have an mfn behind it.

But the gpfns have to start at 0 and be contiguous, while the mfns which back them may come from anywhere in Xen’s memory. So every VM has a physical-to-machine translation table, or p2m table, which maps the gpfn space onto the mfn space. Each gpfn has an entry in the table, and every usable bit of RAM has an mfn behind it. Normally the p2m table is filled in by the domain builder in domain 0, which asks Xen to populate it appropriately (including any holes for IO devices if necessary).
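
The real p2m is a multi-level structure with per-entry types, but conceptually you can picture it as a simple array indexed by gpfn. The following sketch uses invented names and is only meant to illustrate the mapping:

#include <stdint.h>

typedef uint64_t mfn_t;
#define INVALID_MFN ((mfn_t)~0ULL)    /* gpfn not backed by any machine frame */

/* One (simplified) p2m table per domain: gpfn -> mfn. */
struct p2m_table {
    uint64_t nr_gpfns;    /* size of the guest physical address space   */
    mfn_t   *entries;     /* entries[gpfn] is the mfn backing that gpfn */
};

/* Translate a guest frame number to the machine frame behind it. */
static mfn_t p2m_lookup(const struct p2m_table *p2m, uint64_t gpfn)
{
    if (gpfn >= p2m->nr_gpfns)
        return INVALID_MFN;
    return p2m->entries[gpfn];        /* may be INVALID_MFN, e.g. ballooned out */
}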

Ballooning then works like this. To inflate the balloon, the balloon driver will ask the guest OS for a free page. After allocating the page, it puts it on its list of pages and finds the gpfn for that page. It then tells Xen it can take the memory behind the gpfn back. Xen will replace the mfn in that gpfn space with “invalid entry,” and put the mfn on its own free list (available to be given to another VM). If the guest were to attempt to read or write this memory now, it would crash; but it won’t, because the guest OS thinks the page is in use by the balloon driver. The balloon driver won’t touch it, and the OS won’t use it for anything else.

To deflate the balloon, the balloon driver will choose one of the pages on its list that it has allocated, and then ask Xen to put some memory behind that gpfn. If Xen determines that the guest is allowed to increase its memory, and there is free memory available, then it will allocate an mfn and put it in the p2m table behind that gpfn. Now the gpfn is usable again; the balloon driver then frees the page back to the guest OS, which will put it on its own free list to use for whatever needs memory.
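
Put together, the core of a balloon driver looks roughly like the sketch below. The helper names are invented for illustration; the real guest-to-Xen requests are made with the XENMEM_decrease_reservation and XENMEM_populate_physmap memory operations, and a real driver works in batches rather than one page at a time:

/* Hypothetical guest-OS and hypervisor interfaces, for illustration only. */
extern unsigned long guest_alloc_page(void);          /* returns a gpfn, 0 on failure      */
extern void guest_free_page(unsigned long gpfn);
extern int  xen_drop_page(unsigned long gpfn);        /* free the mfn behind this gpfn     */
extern int  xen_populate_page(unsigned long gpfn);    /* put a fresh mfn behind this gpfn  */
extern void balloon_track(unsigned long gpfn);        /* remember the pages we hold        */
extern unsigned long balloon_untrack(void);           /* take one back, 0 if balloon empty */

/* Inflate by one page: hand a frame back to Xen. */
static int balloon_inflate_one(void)
{
    unsigned long gpfn = guest_alloc_page();
    if (!gpfn)
        return -1;
    balloon_track(gpfn);             /* the guest OS thinks this page is "in use" */
    return xen_drop_page(gpfn);
}

/* Deflate by one page: ask Xen to back a ballooned gpfn again. */
static int balloon_deflate_one(void)
{
    unsigned long gpfn = balloon_untrack();
    if (!gpfn)
        return -1;                   /* balloon fully deflated: cannot grow past the boot size */
    if (xen_populate_page(gpfn) < 0)
        return -1;                   /* Xen refused: no free memory, or over the allowed limit */
    guest_free_page(gpfn);           /* page is usable by the guest OS again */
    return 0;
}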

Populate on Demand: The Basics

The idea behind populate-on-demand is that the guest doesn’t actually need all of its memory to boot up to the point where the balloon driver is active — it only needs a small portion of it. But there is no way for the domain builder to know ahead of time which gpfns the guest OS will actually need to use to get that far, nor which memory the guest OS will hand to the balloon driver once it starts up.

So when building a domain in populate-on-demand mode, the domain builder tells Xen to allocate mfns into a special pool, which I will call here the PoD pool — as many as are specified by the memory parameter. (In the Xen code it’s actually called the PoD cache, but that’s not a good name, because in computer science “cache” has a very specific meaning that doesn’t match what the PoD pool does. It will probably be renamed at some point for clarity.)

It then creates the guest’s p2m table as before, but instead of filling it with mfns, it fills it with a special PoD entry. The PoD entry is an invalid entry; so as the guest boots, whenever it touches a gpfn backed by a PoD entry, it will trap up into Xen. When Xen sees the PoD entry, it will take an mfn from the PoD pool and put it in the p2m for that gpfn. It will then return to the guest, at which point the memory access will succeed and the guest can continue.

Thus, rather than populating the p2m table when building the domain, the p2m table is populated on demand; hence the name.
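
In pseudo-C, the demand-populate path looks something like the sketch below. All the names are invented, and the real code in Xen additionally deals with superpages, locking, and the reclaim techniques described later:

extern unsigned long pod_pool_get(void);               /* take an mfn from the PoD pool, 0 if empty */
extern void p2m_install_mfn(unsigned long gpfn, unsigned long mfn);
extern void crash_guest(void);

/* Called when the guest faults on a gpfn whose p2m entry is a PoD entry. */
static void pod_demand_populate(unsigned long gpfn)
{
    unsigned long mfn = pod_pool_get();

    if (!mfn) {
        /* The pool is exhausted: there is no memory left to put behind this gpfn. */
        crash_guest();
        return;
    }

    p2m_install_mfn(gpfn, mfn);   /* back the gpfn with a real, zeroed machine frame */
    /* On return, the faulting access is retried and now succeeds. */
}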

The key reason for having the PoD pool is that the memory is already allocated to the domain. If you list domains, it shows up as owned by the domain, and it cannot be allocated to a different domain. If this were instead allocate-on-demand, where the memory is actually allocated from Xen when an invalid entry is hit, there would be a risk that the memory the guest needed to boot until the balloon driver could run would already have been allocated to a different domain.

However, the guest can’t run like this for long. There are far more PoD entries in the p2m table than there are mfns in the PoD pool — that was the point. But the guest OS doesn’t know that; as far as it’s concerned, it has maxmem to work with. If the balloon driver doesn’t start, nothing will keep it from trying to use all of its memory. If it uses up all the memory in the PoD pool, the next time Xen hits a PoD entry, there won’t be any mfns in the PoD pool to populate the entry with. At that point, Xen would have no choice but to kill the guest.

Getting back to normal: the balloon driver

The balloon driver, like the guest operating system, knows nothing about populate-on-demand. It just knows that it has maxmem gpfn space, and that it needs to hand maxmem minus memory back to Xen. So it begins allocating pages from the guest operating system and freeing the gpfns back to Xen.

What Xen does next depends on a few things. Xen keeps track of both the number of PoD entries in the p2m table, and the number of mfns in the PoD pool.

  • If the gpfn is a PoD entry, Xen will simply replace the PoD entry with a normal invalid entry and return. This reduces the number of outstanding PoD entries in the p2m table.
  • If the gpfn has a real mfn behind it, and the number of PoD entries left in the p2m table is more than the number of mfns in the PoD pool, Xen will replace the entry with an invalid entry, and put the mfn back into the PoD pool. This increases the size of the pool.
  • If the gpfn has a real mfn behind it, but the number of PoD entries left in the p2m table is equal to the number of mfns in the pool, it will put the mfn back on the free list, ready to be used by another domain.

Eventually, the number of outstanding PoD entries equals the number of mfns in the PoD pool, and the system is in a stable state. There is no more risk that the guest will touch a PoD entry and not find memory in the pool; and for an active OS, eventually all pages will be touched, and the VM will be the same as one booted outside PoD mode.
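
In sketch form, the three cases above look like this (invented names again; the real logic lives in Xen’s PoD handling of the guest’s decrease-reservation requests):

extern unsigned long pod_entry_count;     /* PoD entries remaining in the p2m table */
extern unsigned long pod_pool_count;      /* mfns currently sitting in the PoD pool */

extern int  p2m_is_pod(unsigned long gpfn);
extern unsigned long p2m_mfn_of(unsigned long gpfn);
extern void p2m_set_invalid(unsigned long gpfn);
extern void pod_pool_put(unsigned long mfn);          /* keep the mfn for this domain      */
extern void release_page_to_xen(unsigned long mfn);   /* back onto Xen's global free list  */

/* The guest balloon driver hands gpfn back while the domain is in PoD mode. */
static void pod_balloon_out(unsigned long gpfn)
{
    if (p2m_is_pod(gpfn)) {
        /* Case 1: nothing is behind this gpfn yet; one fewer entry to fill later. */
        p2m_set_invalid(gpfn);
        pod_entry_count--;
    } else {
        unsigned long mfn = p2m_mfn_of(gpfn);

        p2m_set_invalid(gpfn);
        if (pod_entry_count > pod_pool_count)
            pod_pool_put(mfn);           /* Case 2: the pool still needs topping up */
        else
            release_page_to_xen(mfn);    /* Case 3: the pool is full; truly free it */
    }
}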

It’s never that simple: Page scrubbing

At a high level, that’s the idea behind populate-on-demand. Unfortunately, the real world is often a bit more messy than we would like.

On real hardware, if you do a soft reboot (or if you do some special trick, like spraying the RAM with liquid nitrogen), the memory when the operating system starts may still contain information from a previous boot. The freshly booting operating system has no idea what may be in there: it may be security-sensitive information like someone’s tax records or private keys.

To avoid any risk that information from the previous boot might leak into untrusted programs which might run this time, most operating systems will scrub the memory at boot — that is, fill all the memory with zeros. This also means that drivers can assume that freshly allocated memory will already be zeroed, and not bother doing it themselves. Doing this all at once, at the beginning, allows the operating system to use more efficient algorithms, and also localizes the processor cache pollution.

For an operating system running under Xen this is unnecessary, because Xen will scrub any memory before giving it to the guest (for pretty much the same potential security issue). However, many operating systems which run on Xen — in particular, proprietary operating systems like Windows — don’t know this, and will do their own scrub of memory anyway. Typically this happens very early in boot, long before it is possible to load the balloon driver. This pretty much guarantees that every gpfn will be written to before the balloon driver loads. How does populate on demand deal with that?

The key is that the state of a gpfn after it has been scrubbed by the operating system is the same as the default initial state of a gpfn just populated by the PoD code. This means that after a gpfn has been scrubbed by the operating system, Xen can reclaim the page: it can replace the mfn in the p2m table with a PoD entry, and put the mfn in the PoD pool. The next time the VM touches the page, it will be replaced with a different zero page from the PoD pool; but to the VM it will look the same.

So the populate-on-demand system has a number of zero-page reclaim techniques. The primary one is that when populating a new PoD entry, we look at recently populated entries and see if they are zero; if they are, we reclaim them. The effect of this is that each scrubbing thread only has one outstanding PoD page at a time.

If that fails, there is another technique we call the “emergency sweep.” When Xen hits a PoD entry, but the PoD pool is empty, before crashing the guest, it will search through all of guest memory, looking for zeroed pages to reclaim. Because this method is very slow, it is only used as a last resort.
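
The heart of both techniques is the same check: is the page still entirely zero, and if so, can we swap it for a PoD entry? A simplified sketch follows (with invented helper names); the real code also has to worry about page types, existing mappings, and races with the guest:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

extern bool      p2m_has_ram(unsigned long gpfn);             /* is a real mfn behind this gpfn? */
extern uint64_t *map_gpfn(unsigned long gpfn);                /* temporarily map the guest page  */
extern void      unmap_gpfn(uint64_t *va);
extern void      replace_with_pod_entry(unsigned long gpfn);  /* mfn goes back into the PoD pool */

/* Reclaim the page behind gpfn if the guest has left it entirely zeroed. */
static bool pod_try_reclaim_zero(unsigned long gpfn)
{
    bool zero = true;

    if (!p2m_has_ram(gpfn))
        return false;

    uint64_t *va = map_gpfn(gpfn);
    for (size_t i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++) {
        if (va[i] != 0) {
            zero = false;
            break;
        }
    }
    unmap_gpfn(va);

    if (zero)
        replace_with_pod_entry(gpfn);  /* safe: a later touch gets a fresh zero page */
    return zero;
}

/* Emergency sweep: last-resort scan of the whole guest address space. */
static bool pod_emergency_sweep(unsigned long nr_gpfns)
{
    bool reclaimed = false;

    for (unsigned long gpfn = 0; gpfn < nr_gpfns; gpfn++)
        reclaimed |= pod_try_reclaim_zero(gpfn);
    return reclaimed;
}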

Conclusion

So that’s populate-on-demand in a nutshell. There are more complexities under the hood (like trying to keep superpages together), but I’ll leave those for another day.

Linux 3.14 and PVH

Linux v3.14 will sport a new mode in which the Linux kernel can run, thanks to the work of Mukesh Rathor (Oracle).

Called ‘ParaVirtualized Hardware,’ it allows the guest to utilize many hardware features – while at the same time having no emulated devices. It is the next step in PV evolution, and it is pretty fantastic.

There is a great blog post that explains the background and history in detail:
The Paravirtualization Spectrum, Part 2: From poles to a spectrum.

The short description is that Xen guests can run as HVM or PV guests. PV is a mode where the kernel lets the hypervisor program page tables, segments, and so on. With the EPT/NPT capabilities in current processors, the overhead of doing this inside an HVM (Hardware Virtual Machine) container is much lower than having the hypervisor do it for us. In short, we let a PV guest run without routing page-table, segment, syscall, etc. updates through the hypervisor – instead it is all done within the guest container.

It is a hybrid PV – hence the ‘PVH’ name – a PV guest within an HVM container.

Improved Xen support in FreeBSD

As most FreeBSD users already know, FreeBSD 10 has just been released, and we expect it to be a very good release as far as Xen support is concerned. It includes many improvements, among them several performance and stability enhancements. Besides the many bug fixes that have gone in, the following description focuses only on the new features.

New vector callback

Previous releases of FreeBSD used an IRQ interrupt as the callback mechanism for Xen event channels. While easier to set up, using an IRQ interrupt doesn’t allow injecting events to specific CPUs, which basically limits the use of event channels in disk and network drivers. Also, all interrupts were delivered to a single CPU (CPU#0), preventing proper interrupt balancing between CPUs.

With the introduction of the vector callback, events can now be delivered to any CPU, allowing FreeBSD to have specific per-CPU interrupts for PV timers and PV IPIs, and to balance the others across the several CPUs usually available to a domain.

PV timers

Thanks to the introduction of the vector callback, we can now make use of the Xen PV timer, which is implemented as a per-CPU single-shot timer. This alone doesn’t seem like a great benefit, but it allows FreeBSD to avoid using the emulated timers, greatly reducing the emulation overhead and the cost of unnecessary VMEXITs.

PV IPIs

As with PV timers, the introduction of the vector callback allows FreeBSD to get rid of the bare-metal IPI implementation and instead route IPIs through event channels. Again, this removes the emulation overhead and unnecessary VMEXITs, providing better performance.

PV disk devices

FLUSH/BARRIER support has recently been added, together with a couple of fixes that allow FreeBSD to run with a CDROM device under XenServer (which was quite a pain point for XenServer users).

Support for migration

Migration support has been reworked to handle the fact that timers and IPIs are now also paravirtualized, so migration doesn’t break with these new features.

Merge of the XENHVM config into GENERIC

In one of the most interesting improvements from a user/admin point of view (and something similar to what the pvops Linux kernel already does), the GENERIC kernel on i386 and amd64 now includes full Xen PVHVM support, so there’s no need to compile a Xen-specific kernel. When run as a Xen guest, the kernel will detect the available Xen features and automatically make use of them in order to obtain the best possible performance.

This work has been done jointly by Spectra Logic and Citrix.

libvirt support for Xen’s new libxenlight toolstack

Originally posted on my blog, here.

Xen has had a long history in libvirt.  In fact, it was the first hypervisor supported by libvirt.  I’ve witnessed an incredible evolution of libvirt over the years and now not only does it support managing many hypervisors such as Xen, KVM/QEMU, LXC, VirtualBox, Hyper-V, ESX, etc., but it also supports managing a wide range of host subsystems used in a virtualized environment such as storage pools and volumes, networks, network interfaces, etc.  It has really become the Swiss Army knife of virtualization management on Linux, and Xen has been along for the entire ride.

libvirt supports multiple hypervisors via a hypervisor driver interface, which is defined in $LIBVIRT_ROOT/src/driver.h – see struct _virDriver.  libvirt’s virDomain* APIs map to functions in the hypervisor driver interface, which are implemented by the various hypervisor drivers.  The drivers are located under $LIBVIRT_ROOT/src/<hypervisor-name>.  Typically, each driver has a $LIBVIRT_ROOT/src/<hypervisor-name>/<hypervisor-name>_driver.c file which defines a static instance of virDriver and fills in the functions it implements.  As an example, see the definition of libxlDriver in $LIBVIRT_ROOT/src/libxl/libxl_driver.c, the first few lines of which are

static virDriver libxlDriver = {
    .no = VIR_DRV_LIBXL,
    .name = "xenlight",
    .connectOpen = libxlConnectOpen, /* 0.9.0 */
    .connectClose = libxlConnectClose, /* 0.9.0 */
    .connectGetType = libxlConnectGetType, /* 0.9.0 */
    ...
};
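
For completeness, here is a minimal client that reaches that driver through the public libvirt API. It uses only standard libvirt calls (virConnectOpenReadOnly, virConnectListDomains, and friends); the “xen:///” URI is what selects the Xen/libxl driver on a host where libvirt was built with libxl support, but your connection URI may differ:

/* Compile with: gcc list-xen-domains.c $(pkg-config --cflags --libs libvirt) */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("xen:///");
    if (conn == NULL) {
        fprintf(stderr, "failed to connect to the Xen libvirt driver\n");
        return 1;
    }

    int ids[64];
    int n = virConnectListDomains(conn, ids, 64);   /* IDs of running domains */
    for (int i = 0; i < n; i++) {
        virDomainPtr dom = virDomainLookupByID(conn, ids[i]);
        if (dom != NULL) {
            printf("domain %d: %s\n", ids[i], virDomainGetName(dom));
            virDomainFree(dom);
        }
    }

    virConnectClose(conn);
    return 0;
}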


What is the ARINC653 Scheduler?

The Xen ARINC 653 scheduler is a real-time scheduler that has been in Xen since 4.1.0.  It is a cyclic executive scheduler with a specific usage in mind, so unless one has aviation experience, one is unlikely to have ever encountered it.

The scheduler was created and is currently maintained by DornerWorks.

Background

The primary goal of the ARINC 653 specification [1] is the isolation or partitioning of domains.  The specification goes out of its way to prevent one domain from adversely affecting any other domain, and this goal extends to any contended resource, including but not limited to I/O bandwidth, CPU caching, branch prediction buffers, and CPU execution time.

This isolation is important in aviation because it allows applications at different levels of certification (e.g. Autopilot – Level A Criticality, In-Flight Entertainment – Level E Criticality, etc.) to be run in different partitions (domains) on the same platform.  Historically, to maintain this isolation, each application had its own separate computer and operating system, in what was called a federated system.  Integrated Modular Avionics (IMA) systems were created to allow multiple applications to run on the same hardware.  In turn, the ARINC 653 specification was created to standardize an operating system for these platforms.  While it is called an operating system and could be implemented as such, it can also be implemented as a hypervisor running multiple virtual machines as partitions.  Since the transition from federated to IMA systems in avionics closely mirrors the transition to virtualized servers in the IT sector, the latter implementation seems more natural.

Beyond aviation, an ARINC 653 scheduler can be used where temporal isolation of domains is a top priority, or in security environments with indistinguishability requirements, since a malicious domain should be unable to extract information through a timing side-channel.  In other applications, the use of an ARINC 653 scheduler would not be recommended due to the reduced performance.

Scheduling Algorithm

The ARINC 653 scheduler in Xen provides the groundwork for the temporal isolation of domains from each other. The domain scheduling algorithm itself is fairly simple: a fixed, predetermined list of domains is repeatedly scheduled with a fixed periodicity, resulting in a complete and, most importantly, predictable schedule.  The overall period of the scheduler is known as a major frame, while the individual domain execution windows in the schedule are known as minor frames.

(Figure: a major frame divided into minor frames)

As an example, suppose we have 3 domains with periods of 5 ms, 6 ms, and 10 ms and worst-case running times of 1 ms, 2 ms, and 3 ms respectively.  The major frame is set to the least common multiple of these periods (30 ms), and minor frames are selected so that the period, runtime, and deadline constraints are met.  One resulting schedule is shown below, though there are other possibilities.

(Figure: one example schedule meeting these constraints)
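
As a quick sanity check on that example, the arithmetic can be worked through in a few lines. This is not code from the scheduler itself, just the utilization calculation for the numbers above:

#include <stdio.h>

static unsigned gcd(unsigned a, unsigned b) { return b ? gcd(b, a % b) : a; }
static unsigned lcm(unsigned a, unsigned b) { return a / gcd(a, b) * b; }

int main(void)
{
    const unsigned period_ms[]  = { 5, 6, 10 };   /* domain periods           */
    const unsigned runtime_ms[] = { 1, 2, 3 };    /* worst-case running times */
    const unsigned n = sizeof(period_ms) / sizeof(period_ms[0]);

    unsigned major = 1, busy = 0;
    for (unsigned i = 0; i < n; i++)
        major = lcm(major, period_ms[i]);         /* major frame = LCM of the periods */
    for (unsigned i = 0; i < n; i++)
        busy += (major / period_ms[i]) * runtime_ms[i];   /* activations x WCET */

    /* Prints: major frame = 30 ms, busy = 25 ms, idle = 5 ms */
    printf("major frame = %u ms, busy = %u ms, idle = %u ms\n",
           major, busy, major - busy);
    return 0;
}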

The ARINC 653 scheduler is only concerned with the scheduling of domains. The scheduling of real-time processes within a domain is performed by that domain’s process scheduler.  In a compliant ARINC 653 system, these processes are scheduled using a fixed priority scheduling algorithm, but if ARINC 653 compliance is not a concern any other process scheduling method may be used.

Using the Scheduler

Directions for using the scheduler can be found on the Xen wiki at ARINC653 Scheduler. When using the scheduler, the most obvious effect will be that the CPU usage and execution windows for each domain are fixed regardless of whether the domain is performing any work.

Currently multicore operation of the scheduler is not supported.  Extending the scheduling algorithm to multiple cores is trivial, but the isolation of domains in a multicore system requires a number of mitigation techniques not required in single-core systems.[2]

References

[1] ARINC Specification 653P1-3, "Avionics Application Software Standard Interface Part 1 – Required Services," November 15, 2010

[2] EASA.2011/6 MULCORS – Use of Multicore Processors in airborne systems