Author Archives: George Dunlap

About George Dunlap

I received my PhD from the University of Michigan in December 2006. I joined XenSource in May of 2006. I've been working on analyzing performance in the Xen hypervisor. I'm currently working on a new scheduler.

XSA-108: Not the vulnerability you’re looking for

There has been an unusual amount of media attention to XSA-108 during the embargo period (which ended Wednesday) — far more than for any of the previous security issues the Xen Project has reported. It began when a blogger complained that Amazon was telling customers it would be rebooting VMs in certain regions before a specific date. Other media outlets picked it up and noticed that the date happened to coincide with the release of XSA-108, and conjectured that the reboots had something to do with that. Soon others were conjecturing that, because of the major impact on customers of rebooting, it must be something very big and important, similar to the recent Heartbleed and Shell Shock vulnerabilities. Amazon confirmed that the reboots had to do with XSA-108, but could say nothing else because of the security embargo.

Unfortunately, because of the nature of embargoes, nobody with any actual knowledge of the vulnerability was allowed to say anything about it, and so the media was entirely free to speculate without any additional information to ground the discussion in reality.

Now that the embargo has lifted, we can talk in detail about the vulnerability; and I’m afraid that people looking for another Shell Shock or Heartbleed are going to be disappointed. No catchy name for this one.


The Docker exploit and the security of containers

We normally only cover news and information directly related to Xen in this channel, but we thought it might be useful to briefly expand our scope a bit to mention the recent discussion about the Docker security exploit.

What’s the news?

Well, to begin with, a few weeks ago Docker 1.0 was released, just in time for DockerCon.

Then last week, with timing that seems rather spiteful, someone released an exploit that allows a process running as root within a Docker container to break out of the container.

I say it was a bit spiteful, because as the post-mortem on Docker’s site demonstrates, it exploits a vulnerability which was only present until Docker 0.11; it was fixed in Docker 0.12, which eventually became Docker 1.0.

Nonetheless, this kicked off a bit of a discussion about Docker, Linux containers, and security in several places, including Hacker News, the Register, seclists.org, and the OSv blog.

There is a lot of interesting discussion there. I recommend skimming through the Hacker News discussion in particular, if you have the time. But I think the comment on the Hacker News thread by Solomon Hykes of Docker puts it best:

As others already indicated this doesn’t work on 1.0. But it could have. Please remember that at this time, we don’t claim Docker out-of-the-box is suitable for containing untrusted programs with root privileges. So if you’re thinking “pfew, good thing we upgraded to 1.0 or we were toast”, you need to change your underlying configuration now. Add apparmor or selinux containment, map trust groups to separate machines, or ideally don’t grant root access to the application.

I’ve played around with Docker a little bit, and it seems like an excellent tool for packaging and deploying applications. Using containers to separate an application from the rest of the user-space of your distribution, exposing it only to the very forward-compatible Linux API, is a really clever idea.

However, using containers for security isolation is not a good idea. In a blog last August, one of Docker’s engineers expressed optimism that containers would eventually catch up to virtual machines from a security standpoint. But in a presentation given in January, the same engineer said that the only way to have real isolation with Docker was to either run one Docker per host, or one Docker per VM. (Or, as Solomon Hykes says here, to use Dockers that trust each other in the same host or the same VM.)

We would concur with that assessment.

Release management: Risk, intuition, and freeze exceptions

I’ve been release coordinator for Xen’s 4.3 and 4.4 releases. For the 4.5 release, I’ve handed this role off to Konrad Wilk, from Oracle. In this blog, I try to capture some of my thoughts and experience about one aspect of release management: deciding what patches to accept during a freeze.

I have three goals when doing release management:

  1. A bug-free release
  2. An awesome release
  3. An on-time release

One of the most time-consuming periods for a release manager is during any kind of freeze. You can read in detail about our release process elsewhere; I’ll just summarize it here. During normal development, any patch which has the approval of the relevant maintainer can be accepted. As the release approaches, however, we want to start being more and more conservative in what patches we accept.

Obviously, no patch would ever be considered for acceptance which didn’t improve Xen in some way, making it more awesome. However, it’s a fact of software that any change, no matter how simple or obvious, may introduce a bug. If this bug is discovered before the release, it may delay the release, making it not on-time; or it may not be found until after the release, making the release not bug-free. The job of helping decide whether to take a patch or not falls to the release coordinator.

But how do you actually make decisions? There are two general things to say about this.

Risk

The only simple rule to follow that will make sure that there are no new bugs introduced is to do no development at all. Since we must do development, we have to learn to deal with risk.

Making decisions about accepting or rejecting patches as release coordinator is about making calculated risks: look at the benefits, look at the potential costs, look at the probabilities, and try to balance them the best you can.

Part of making calculated risks is accepting that sometimes your gamble won’t pay off. You may approve a patch to go in, and it will then turn out to have a bug in it which delays the release. This is not necessarily a failure: if you can look back at your decision and say with honesty, “That was the best decision I could have made given what I knew at the time”, then your choice was the right one.

The extreme example of this kind of thinking is that of a poker player: a poker player may make a bet that she knows she only has a 1 in 4 chance of winning, if the pay-off is more than 4 to 1; say, 5 to 1. Even though she loses 75% of the time, the 25% that she does win will pay for the losses. And when she makes the bet and loses (as she will 75% of the time), she knows she didn’t make a mistake; taking risks is just a part of the game.
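
To make the arithmetic concrete: risking 1 unit on a 5-to-1 payoff with a 1 in 4 chance of winning has an expected value of 0.25 × 5 − 0.75 × 1 = +0.5 units per bet. The bet loses most of the time, but it is still clearly profitable.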

Obviously as release coordinator, the costs of a bug are generally higher than the benefits of a feature. But the general principle — of taking calculated risks, and accepting occasional failure as the inevitable consequence of doing so — applies.

Intuition

But how do we actually calculate the risks? A poker player frequently deals with known quantities: she can calculate that there is exactly a 25% chance of winning, exactly a 5x monetary pay-off, and do the math; the release coordinator’s decisions are not so quantifiable.

This is where intuition comes in. While there are a handful of metrics that can be applied to patches (e.g., the number of lines in the patch), for the most part the risk and benefit are not very quantifiable at all: expert judgement / intuition is the only thing we have.

Now, research has shown that the intuition of experts can, under the right circumstances, be very good. Intuition can quickly analyze hundreds of independent factors, compare against thousands of instances, and give you a result which is often very close to the mark.

However, research has also shown that in other circumstances, expert intuition is worse than random guessing, and far worse than a simple algorithm. (For a reference, see the books listed at the bottom.)

One of the biggest ways intuition goes wrong is by only looking at part of the picture. It is very natural for programmers, when looking at a patch, to consider only the benefits. The programmer’s intuition then accurately gives them a good sense of the advantage of taking the patch, but doesn’t warn them about the risk, because they haven’t thought about it. Since they have a positive feeling, they may end up taking a patch even when it’s actually too risky.

The key then is to make sure that your intuition considers the risks properly, as well as the benefits. To help myself with this, during the 4.4 code freeze I developed a sort of checklist of things to think about and consider. They are as follows:

  1. What is the benefit of this patch?
  2. What is the probability this patch has a bug?
  3. If this patch had a bug, what kind of bug might it be?
  4. If this patch had a bug, what is the probability we would find it before the release?

When considering the probability of a bug, I look at two things:

  • The complexity of the patch
  • My confidence in my / the other reviewers’ judgement.

Sometimes you’re looking at code you’re very familiar with, or that is straightforward; sometimes you’re looking at code that is very complicated, or that you’re not that familiar with. If the patch looks good and it’s code you’re familiar with, it’s probably fine. If the patch looks good but it’s code you’re not familiar with, there’s a risk that your judgement may be off.

When trying to think of what kind of bug it might be, I look at the code that it’s modifying, and consider things on a spectrum:

  1. Error / tool crash on unexpected trusted input; or normal input to library-only commands
  2. Error / tool crash on normal input, secondary commands / new functionality
  3. Error / tool crash on normal input, core commands
  4. Performance
  5. Guest crash / hang
  6. Host crash / hang
  7. Security vulnerability
  8. Data loss

Usually you can tell right away where in the list a bug might be. Modifying xenpm or xenctx? #3 max. Modifying the scheduler? Probably #6. Modifying hypercalls available to guests? #7. And so on.

When asking whether we’d find the bug before the release, consider the kind of testing the codepath is likely to get. Is it tested in osstest? In XenRT? Or is it in a corner case that few people really use?

After thinking through those four questions, and going over the criteria in detail, then my intuition is probably about as well-formed as it’s going to get. Now I ask the fifth question: given the risks, is it worth it to accept this patch?

After giving it some thought, I’d go with my best guess. Sometimes I was just not sure; in that case I’d go away and do something else for a couple of hours, then come back to it (going over the four questions again to make sure they were fresh in my mind). The first few dozen times this took a very long time; as I gained experience, judgements came faster (although many were still painfully slow).

In some cases, I just didn’t have enough knowledge of the code to make the judgement myself; this happened once or twice with the ARM code in the 4.4 release. In that case, my goal was to try to make sure that those who did have the relevant knowledge were making sound decisions: thinking about both the benefits and the risks and weighing them appropriately.

For those who want to look further into risk and intuition, several books have had a pretty big influence on my thinking in this area. Probably the best one, but also the hardest (most dense) one, is Thinking, Fast and Slow, by Daniel Kahneman. It’s very well written and accessible, but it contains a huge amount of information that runs counter to the way you normally think. Not a light read. Another one I would recommend is The Black Swan, by Nassim Nicholas Taleb. And finally, Blink, by Malcolm Gladwell.

Xen 4.4 Released

Xenproject.org is pleased to announce the release of Xen 4.4.0. The release is available from the download page.

Xen 4.4 is the work of 8 months of development, with 1193 changesets. It’s our first release made with an attempt at a 6-month development cycle. Between Christmas and a few important blockers, we missed that by about 6 weeks; but still not too bad overall.

Additionally, this cycle we’ve had a massive increase in the amount of testing. The Xen Project’s regression testing system, osstest, has received a number of additional tests, and the XenServer team at Citrix have put Xen through their massive testing suite (XenRT). Furthermore, early in this development cycle we got the go-ahead to use the Coverity static analysis engine to comb through the source code for hard-to-spot bugs. The result should be that Xen 4.4 is one of the most secure, reliable releases yet.

Highlights

Although the development part of the release cycle was shorter than the previous one, we still have far more exciting improvements than we can mention in this blog post; I’ll call out just a few.

Probably one of the most important is solid libvirt support for libxl. Jim Fehlig from SuSE and Ian Jackson from Citrix worked together to test and improve the interface between libvirt and libxl, making it fast and reliable. This lays the foundation for solid integration into any tools that can use libvirt, from GUI VM managers to cloud orchestration layers like CloudStack or OpenStack.

Another big one is a new scalable event channel interface, designed and implemented by David Vrabel from Citrix. The original Xen event channel interface was limited to the number of bits in a machine word, squared — 1024 for 32-bit guests and 4096 for 64-bit guests. (The original ABI tracks pending events in a two-level bitmap: one machine word, each bit of which selects a further word of event bits; hence the word size squared.) With each VM typically requiring at least 4 event channels, that means a theoretical maximum of 256 guests on a 32-bit dom0 — more than enough back when a large machine had 8 cores, and every VM was a full OS; but a major limitation on systems with 128 cores, or those using cloud OSes like Mirage or OSv. The new “FIFO” event channel interface by default scales up to over 200,000 event channels, and in the future can be extended even further if necessary in a backwards-compatible manner. This should be enough for many years to come.

The ARM port is maturing quickly. As of 4.4, the hypervisor ABI for ARM has been declared stable, meaning that any guest which uses the 4.4 ARM ABI can rely on being able to boot on all future versions of Xen. There are a number of improvements making Xen on ARM more flexible, easier to set up and use, and easier to extend to new platforms. More details can be found in the Xen 4.4 feature list.

One other feature worth a note is Nested Virtualization on Intel hardware. It’s not ready for production use yet, but it has improved to the point where we feel comfortable moving it from “experimental” to “tech preview”. Please feel free to take it for a spin and report any issues you find.

There are many more improvements and changes under the hood. For a more complete list, see the Xen 4.4 feature list.

Features in related projects

The Xen Project is part of a much larger ecosystem of projects. We are typically very closely tied to Linux and qemu, but a number of other projects have had important developments that are worth a mention.

The first is the pv port of grub2. Rather than having a re-implementation of grub in the Xen tree, grub2 now has native support for running in Xen and using the Xen pv block protocol. This guarantees 100% compatibility with grub2 going forward.

Another project worth a mention is the 3.3 release of Xen Orchestra. Xen Orchestra is a web interface that interfaces with the xapi protocol (and thus can be used for XCP, XenServer, or other xapi-based systems). New in this release are snapshot operations (create, revert, and delete), removing a host from a pool, restarting the toolstack or rebooting/shutting down a host, and a more stable upgrade process for the appliance.

Finally, GlusterFS 3.5 now supports creating iSCSI nodes. One of the benefits of this is that now, by creating iSCSI devices in dom0, Xen guest disks can be stored in GlusterFS.


Ballooning, rebooting, and the feature you’ve never heard of

Today I’d like to talk about a functionality of Xen you may not have heard of, but might have actually used without even knowing it. If you use memory ballooning to resize your guests, you’ve likely used “populate-on-demand” at some point. 

As you may know, ballooning is a technique used to dynamically adjust the physical memory in use by a guest. It involves having a driver in the guest OS, called a balloon driver, allocate pages from the guest OS and then hand those pages back to Xen. From the guest OS perspective, it still has all the memory that it started with; it just has a device driver that’s a real memory hog. But from Xen’s perspective, the memory which the device driver asked for is no longer real memory — it’s just empty space (hence “balloon”). When the administrator wants to give memory back to the VM, the balloon driver will ask Xen to fill the empty space with memory again (shrinking or “deflating” the balloon), and then “free” the resulting pages back to the guest OS (making the memory available for use again).

While this can be used to shrink guest memory and then expand it again, this technique has an important limitation: It can never grow the memory above the starting size of the VM. This is because the only way to grow guest memory is to “deflate” the balloon. Once it gets back to the starting size of the VM, the balloon is entirely deflated and no additional memory can be added by the balloon driver.

To see why this is important, consider the following scenario.

Hosts A and B both have 4GiB of RAM, and each host runs 2 VMs with 2GiB of RAM each. Suppose you want to reboot host B to do some hardware maintenance. You could do the following:

  • Balloon all 4 VMs down to 1GiB
  • Migrate the 2 VMs from host B onto host A
  • Shut down host B to do your maintenance
  • Bring up host B
  • Migrate the 2 VMs originally on host B back
  • Balloon all 4 VMs back up to 2GiB
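
(For simplicity, this ignores the memory that Xen and dom0 themselves need; the point is that after ballooning, 4 × 1GiB of VMs fits in host A’s 4GiB.)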

All well and good. But suppose that while you had one of those VMs ballooned down to 1GiB, you needed to reboot it. Now you have a problem: most operating systems only check how much memory is available at boot time, and you only have 1GiB of free memory. If the VM boots with 1GiB of memory, it will be able to balloon *smaller* than 1GiB, but it will not be able to balloon back up to 2GiB when the maintenance on host B is done.

This is where populate-on-demand comes in. It allows a VM to boot with a maximum memory larger than its current target memory. It enables a guest that thinks it has 2GiB of RAM to boot while only actually using 1GiB of RAM. It can do this because it only needs to allow the guest to run until the balloon driver can start. Once the balloon driver starts, it will “inflate” the balloon to the proper size. At that point, there is nothing special to do; the VM looks like it did when we shut it down (guest thinks it has 2GiB of RAM, but 1GiB is allocated to the balloon driver and not accessed). When host B comes back up and more memory is free, the balloon driver can deflate the balloon, bringing the total memory back up to 2GiB.

Populate-on-demand comes into play in Xen whenever you start an HVM guest with maxmem and memory set to different values. In that case, the guest will be told it has maxmem RAM, but will only have memory actually allocated to it; the populate-on-demand code will allow the guest to run in this mode until the balloon driver comes up and “frees” maxmem - memory back to Xen.
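
For example, an xl guest configuration containing the following two lines (values are in MiB) will boot in populate-on-demand mode; the guest is told it has 2GiB, but only 1GiB is actually allocated up front:

    # Fragment of an xl guest config. PoD kicks in because
    # maxmem (what the guest sees) is larger than memory (what is allocated).
    memory = 1024    # initial allocation target, in MiB
    maxmem = 2048    # what the guest believes it has, in MiB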

Virtualizing memory: A primer

In order to describe how populate-on-demand works, I’ll need to explain a bit more about how Xen virtualizes memory. On real hardware, the actual hardware memory is referred to as physical memory, and it is typically divided into 4KiB chunks called physical frames. These frames are addressed by their physical frame number, or pfn. In the x86 world, pfns typically start at 0 and are mostly contiguous (with the occasional “hole” for IO devices). Historically on x86 platforms, a description of which pfns are usable as memory comes in something called the E820 map, provided by the BIOS to the operating system at boot.

When we virtualize, we need to provide the guest with a virtual “physical address space,” described in the virtual E820 map provided to the guest. Its frames are called guest physical frame numbers, or gpfns. But of course there is still real hardware backing this memory; in the virtualization world, it is common to refer to these real frames as machine frames, or mfns. Every useable gpfn must have an mfn behind it.

But the gpfns have to start at 0 and be contiguous, while the mfns which back them may come from anywhere in Xen’s memory. So every VM has a physical to machine translation table, or p2m table, which maps the gpfn space onto the mfn space. Each gpfn will have an entry in the table, and every useable bit of RAM has an mfn behind it. Normally this is done by the domain builder in domain 0, which will ask Xen to fill the p2m table appropriately (including any holes for IO devices if necessary).
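
As a mental model, you can picture the p2m as a big array indexed by gpfn. This is only an illustrative sketch — the real Xen p2m is a multi-level tree with typed entries — but the lookup it implements is essentially this:

    /* Illustrative model of a p2m table: a flat gpfn -> mfn map.
     * (The real Xen p2m is a multi-level tree with typed entries.) */
    #include <stdint.h>

    typedef uint64_t mfn_t;
    #define INVALID_MFN ((mfn_t)~0ULL)
    #define MAX_GPFN    (1ULL << 20)   /* a hypothetical 4GiB guest, 4KiB frames */

    /* One entry per gpfn; assume entries start out as INVALID_MFN. */
    static mfn_t p2m[MAX_GPFN];

    static mfn_t gpfn_to_mfn(uint64_t gpfn)
    {
        if (gpfn >= MAX_GPFN)
            return INVALID_MFN;        /* outside the guest's address space */
        return p2m[gpfn];              /* INVALID_MFN for holes or ballooned-out pages */
    }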

Ballooning then works like this. To inflate the balloon, the balloon driver will ask the guest OS for a free page. After allocating the page, it puts it on its list of pages and finds the gpfn for that page. It then tells Xen it can take the memory behind the gpfn back. Xen will replace the mfn behind that gpfn with an invalid entry, and put the mfn on its own free list (available to be given to another VM). If the guest were to attempt to read or write this memory now, it would crash; but it won’t, because the guest OS thinks the page is in use by the balloon driver. The balloon driver won’t touch it, and the OS won’t use it for anything else.
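
From the guest kernel’s side, inflating by one page looks roughly like this. It is a simplified sketch loosely modeled on the Linux balloon driver (drivers/xen/balloon.c); the real driver batches pages and does proper error recovery, and the ballooned_pages list here is my own simplification:

    /* Inflate the balloon by one page: allocate a guest page, remember it,
     * and ask Xen to take back the machine frame behind its gpfn. */
    #include <linux/mm.h>             /* alloc_page, page_to_pfn */
    #include <linux/list.h>
    #include <xen/interface/memory.h> /* struct xen_memory_reservation, XENMEM_* */
    #include <asm/xen/hypercall.h>    /* HYPERVISOR_memory_op */

    static LIST_HEAD(ballooned_pages);

    static int balloon_inflate_one(void)
    {
        struct page *page = alloc_page(GFP_KERNEL);
        struct xen_memory_reservation r = {
            .nr_extents   = 1,
            .extent_order = 0,         /* a single 4KiB page */
            .domid        = DOMID_SELF,
        };
        xen_pfn_t gpfn;

        if (!page)
            return -ENOMEM;
        gpfn = page_to_pfn(page);
        list_add(&page->lru, &ballooned_pages);  /* keep it allocated, untouched */
        set_xen_guest_handle(r.extent_start, &gpfn);
        /* Xen replaces the mfn behind this gpfn with an invalid entry. */
        if (HYPERVISOR_memory_op(XENMEM_decrease_reservation, &r) != 1)
            return -EIO;
        return 0;
    }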

To deflate the balloon, the balloon driver will choose one of the pages on its list, and ask Xen to put some memory behind its gpfn. If Xen determines that the guest is allowed to increase its memory, and there is free memory available, then it will allocate an mfn and put it in the p2m table behind that gpfn. Now the gpfn is useable again; the balloon driver then frees the page back to the guest OS, which will put it on its own free list to use for whatever needs memory.
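
Deflating is the mirror image, continuing the sketch above (same caveats apply):

    /* Deflate the balloon by one page: ask Xen to put a machine frame back
     * behind a ballooned gpfn, then free the page to the guest OS. */
    static int balloon_deflate_one(void)
    {
        struct page *page;
        struct xen_memory_reservation r = {
            .nr_extents   = 1,
            .extent_order = 0,
            .domid        = DOMID_SELF,
        };
        xen_pfn_t gpfn;

        if (list_empty(&ballooned_pages))
            return -ENOENT;            /* balloon is already fully deflated */
        page = list_first_entry(&ballooned_pages, struct page, lru);
        gpfn = page_to_pfn(page);
        set_xen_guest_handle(r.extent_start, &gpfn);
        /* Fails if Xen has no free memory, or the domain is at its limit. */
        if (HYPERVISOR_memory_op(XENMEM_populate_physmap, &r) != 1)
            return -ENOMEM;
        list_del(&page->lru);
        __free_page(page);             /* back onto the guest OS's free list */
        return 0;
    }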

Populate on Demand: The Basics

The idea behind populate-on-demand is that the guest doesn’t actually need all of its memory to get from boot to the point where the balloon driver is active — it only needs a small portion of it. But there is no way for the domain builder to know ahead of time which gpfns the guest OS will actually need to use in order to do that; nor which memory will be given to the balloon driver by the guest OS once it starts up.

So when building a domain in populate-on-demand mode, the domain builder tells Xen to allocate the mfns into a special pool, which I will call here the PoD pool, according to how much memory is specified in the memory parameter. (In the Xen code it’s actually called the PoD cache, but that’s not a good name, because in computer science “cache” has a very specific meaning that doesn’t match what the PoD pool does. It will probably be renamed at some point for clarity.)

It then creates the guest’s p2m table as before, but instead of filling it with mfns, it fills it with a special PoD entry. The PoD entry is an invalid entry; so as the guest boots, whenever it touches a gpfn backed by a PoD entry, it will trap up into Xen. When Xen sees the PoD entry, it will take an mfn from the PoD pool and put it in the p2m table for that gpfn. It will then return to the guest, at which point the memory access will succeed and the guest can continue.

Thus, rather than populating the p2m table when building the domain, the p2m table is populated on demand; hence the name.
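
Schematically, the fault path looks something like the following, reusing mfn_t and INVALID_MFN from the sketch above. The helper names are made up for illustration; the real logic lives in Xen’s PoD code and also handles superpages, locking, and more:

    /* Schematic PoD "fault" handler: the guest touched a gpfn whose
     * p2m entry is the special PoD type. Helper names are illustrative. */
    bool pod_demand_populate(struct domain *d, uint64_t gpfn)
    {
        mfn_t mfn = pod_pool_take(d);    /* grab an mfn from the PoD pool */

        if (mfn == INVALID_MFN)
            return false;                /* pool empty: the guest is in trouble
                                            (see the emergency sweep, below) */
        p2m_set_entry(d, gpfn, mfn);     /* back the gpfn with real memory */
        d->pod_entry_count--;            /* one fewer outstanding PoD entry */
        return true;                     /* retry the faulting access */
    }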

The key reason for having the PoD pool is that the memory is already allocated to the domain. If you list domains, the memory shows up as owned by the domain, and it cannot be allocated to a different domain. If this were instead allocate-on-demand (where you actually allocated the memory from Xen when you hit an invalid entry), there would be a risk that the memory the guest needed to boot until the balloon driver could run would already have been allocated to a different domain.

However, the guest can’t run like this for long. There are far more PoD entries in the p2m table than there are mfns in the PoD pool — that was the point. But the guest OS doesn’t know that; as far as it’s concerned, it has maxmem to work with. If the balloon driver doesn’t start, nothing will keep it from trying to use all of its memory. If it uses up all the memory in the PoD pool, the next time Xen hits a PoD entry, there won’t be any mfns in the PoD pool to populate the entry with. At that point, Xen would have no choice but to kill the guest.

Getting back to normal: the balloon driver

The balloon driver, like the guest operating system, knows nothing about populate-on-demand. It just knows that it has maxmem gpfn space, and it needs to hand maxmem - memory back to Xen. So it begins allocating pages from the guest operating system, and freeing the gpfns back to Xen.

What Xen does next depends on a few things. Xen keeps track of both the number of PoD entries in the p2m table, and the number of mfns in the PoD pool.

  • If the gpfn is a PoD entry, Xen will simply replace the PoD entry with a normal invalid entry and return. This reduces the number of outstanding PoD entries in the p2m table.
  • If the gpfn has a real mfn behind it, and the number of PoD entries left in the p2m table is more than the number of mfns in the PoD pool, Xen will replace the entry with an invalid entry, and put the mfn back into the PoD pool. This increases the size of the pool.
  • If the gpfn has a real mfn behind it, but the number of PoD entries left in the p2m table is equal to the number of mfns in the pool, it will put the mfn back on the free list, ready to be used by another domain.
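
A schematic of that decision logic, again with illustrative names (the real code also handles superpages and races with the fault path):

    /* What Xen does when the balloon driver frees a gpfn on a PoD domain. */
    void pod_decrease_reservation(struct domain *d, uint64_t gpfn)
    {
        if (p2m_is_pod(d, gpfn)) {
            p2m_set_invalid(d, gpfn);    /* case 1: drop the PoD entry */
            d->pod_entry_count--;
        } else {
            mfn_t mfn = gpfn_to_mfn(d, gpfn);

            p2m_set_invalid(d, gpfn);
            if (d->pod_entry_count > d->pod_pool_count)
                pod_pool_put(d, mfn);    /* case 2: grow the pool */
            else
                free_to_xen(mfn);        /* case 3: back to Xen's free list */
        }
    }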

Eventually, the number of outstanding PoD entries is equal to the number of mfns in the PoD pool, and the system is now in a stable state. There is no more risk that the guest will touch a PoD entry and not find memory in the pool; and for an active OS, eventually all pages will be touched, and the VM will be the same as one booted not in PoD mode.

It’s never that simple: Page scrubbing

At a high level, that’s the idea behind populate-on-demand. Unfortunately, the real world is often a bit more messy than we would like.

On real hardware, if you do a soft reboot (or if you use some special trick, like spraying the RAM with liquid nitrogen), the memory when the operating system starts may still contain information from the previous boot. The freshly booting operating system has no idea what may be in there: it may be security-sensitive information like someone’s taxes or private keys.

To avoid any risk that information from the previous boot might leak into untrusted programs which might run this time, most operating systems will scrub the memory at boot — that is, fill all the memory with zeros. This also means that drivers can assume that freshly allocated memory will already be zeroed, and not bother doing it themselves. Doing this all at once, at the beginning, allows the operating system to use more efficient algorithms, and also localizes the processor cache pollution.

For an operating system running under Xen this is unnecessary, because Xen will scrub any memory before giving it to the guest (for pretty much the same potential security issue). However, many operating systems which run on Xen — in particular, proprietary operating systems like Windows — don’t know this, and will do their own scrub of memory anyway. Typically this happens very early in boot, long before it is possible to load the balloon driver. This pretty much guarantees that every gpfn will be written to before the balloon driver loads. How does populate on demand deal with that?

The key is that the state of a gpfn after it has been scrubbed by the operating system is the same as the default initial state of a gpfn just populated by the PoD code. This means that after a gpfn has been scrubbed by the operating system, Xen can reclaim the page: it can replace the mfn in the p2m table with a PoD entry, and put the mfn in the PoD pool. The next time the VM touches that gpfn, it will be populated with a different zeroed page from the PoD pool; but to the VM it will look exactly the same.

So the populate-on-demand system has a number of zero-page reclaim techniques. The primary one is that when populating a new PoD entry, we look at recently populated entries, check whether they are zero, and reclaim them if they are. The effect of this is that each scrubbing thread only has one outstanding PoD page at a time.
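
In sketch form (the mapping helpers are made up, and the other names follow the sketches above), the reclaim check is something like:

    /* Sketch of zero-page reclaim: if the page behind a recently populated
     * gpfn still contains only zeros, turn it back into a PoD entry. */
    bool pod_try_reclaim_zero(struct domain *d, uint64_t gpfn)
    {
        mfn_t mfn = gpfn_to_mfn(d, gpfn);
        const uint64_t *p = map_mfn(mfn);  /* hypothetical mapping helper */
        bool all_zero = true;
        size_t i;

        for (i = 0; i < 4096 / sizeof(*p); i++) {
            if (p[i] != 0) {
                all_zero = false;          /* the guest really is using it */
                break;
            }
        }
        unmap_mfn(p);
        if (!all_zero)
            return false;
        p2m_set_pod(d, gpfn);              /* back to being a PoD entry */
        pod_pool_put(d, mfn);              /* the mfn returns to the pool */
        d->pod_entry_count++;
        return true;
    }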

If that fails, there is another technique we call the “emergency sweep.” When Xen hits a PoD entry, but the PoD pool is empty, before crashing the guest, it will search through all of guest memory, looking for zeroed pages to reclaim. Because this method is very slow, it is only used as a last resort.

Conclusion

So that’s populate-on-demand in a nutshell. There are more complexities under the hood (like trying to keep superpages together), but I’ll leave those for another day.

Xen 4.3.0 released!

Xenproject.org is pleased to announce the release of Xen 4.3.0. The release is available from the download page.

Xen 4.3 is the work of just over 9 months of development, with 1362 changesets containing changes to over 136128 lines of code, made by 90 individuals from 27 different organizations and 25 unaffiliated individual developers.

Xen 4.3 is also the first release made using the roadmap to track what people were working on and what we were aiming to get into the release, as well as the first release to have consistent Xen test days. This, combined with the increased number of contributors, should make this one of the best Xen releases so far. Read on for more information.

New Features

Probably the biggest single feature of this release is the experimental support for ARM virtualization, both 32-bit and 64-bit variants. The 32-bit port of Xen boots on ARMv7 Fast Models, the Cortex A15 Versatile Express platform and the Arndale board (equipped with the Exynos5 SoC by Samsung). It can boot dom0, create other virtual machines, and it supports all the basic virtual machine lifecycle operations. Hardware is not yet available for 64-bit ARM processors, but Xen is running well in 64-bit mode on AEMv8 Real-time System Models by ARM.
