The Xen Security team recently disclosed a vulnerability, Xen Security Advisory 7 (CVE-2012-0217), which would allow guest administrators to escalate to hypervisor-level privileges. The impact is much wider than Xen; many other operating systems appear to have the same vulnerability, including NetBSD, FreeBSD, and some versions of Microsoft Windows (among them Windows 7).
So what was the vulnerability? It has to do with a subtle difference in the way in which Intel processors implement error handling in their version of AMD's SYSRET instruction. The SYSRET instruction is part of the x86-64 standard defined by AMD. If an operating system is written according to AMD's spec but run on Intel hardware, the difference in implementation can be exploited by an attacker to write to arbitrary addresses in the operating system's memory. This blog will explore the technical details of the vulnerability.
One of the goals for the 4.2 release is for xl to have feature parity with xm for the most important functions. But along the way, we've also been adding a number of improvements to the interface. One of the ways in which xl has changed and improved the interface is in passing PCI devices directly through to VMs.
A basic device pass-through review
As you may know, Xen has for several years had the ability to “pass through” a pci device to a guest, allowing that guest to control the device directly. This has several applications, including driver domains and increased performance for graphics or networking.
To pass through a device, you need to find out its BDF (Bus, Device, Function). A BDF consists of three or four numbers in the format [DDDD:]bb:dd.f, where:
DDDD is a 4-digit hex for the PCI domain. This is optional (if not included, it will be assumed to be 0000).
bb is a 2-digit hex of the PCI bus number.
dd is a 2-digit hex of the PCI device number.
f is a 1-digit decimal of the PCI function number.
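For example, a full BDF such as 0000:03:00.0 can be split into its components with plain shell parameter expansion. This is just an illustrative sketch (the variable names are mine, and no Xen tooling is required to run it):

```shell
# Split a BDF of the form DDDD:bb:dd.f into its parts.
# The example BDF below is made up for illustration.
bdf="0000:03:00.0"

domain=${bdf%%:*}    # 4-digit hex PCI domain -> "0000"
rest=${bdf#*:}       # "03:00.0"
bus=${rest%%:*}      # 2-digit hex bus        -> "03"
devfn=${rest#*:}     # "00.0"
device=${devfn%.*}   # 2-digit hex device     -> "00"
func=${devfn#*.}     # function number        -> "0"

echo "domain=$domain bus=$bus device=$device function=$func"
```

The resulting BDF string is what you would hand to xl, for example with `xl pci-attach <domain> 0000:03:00.0` (the target domain name here is a placeholder).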
Among the more unique features of Xen 4.2 is a feature called cpupools, designed and implemented by Jürgen Groß at Fujitsu. At its core it’s a simple idea, but one that allows it to be a flexible and powerful solution to a number of different problems.
The core idea behind cpupools is to divide the physical cores on the machine into different pools. Each of these pools has an entirely separate cpu scheduler, and can be set with different scheduling parameters. At any time, a given logical cpu can be assigned to only one of these pools (or none). A VM is assigned to one pool at a time, but can be moved from pool to pool.
There are a number of things one can do with this functionality. Suppose you are a hosting or cloud provider, and you have a number of customers who have multiple VMs with you. Instead of selling based on CPU metering, you want to sell access to a fixed number of cpus for all of their VMs: e.g. a customer with 6 single-vcpu VMs might buy 2 cores worth of computing space which all of the VMs share.
You could solve this problem by using cpu masks to pin all of the customer's vcpus to a single set of cores. However, cpu masks do not work well with the scheduler's weight algorithm: the customer won't be able to specify that VM A should get twice the cpu time of VM B. Solving the weight issue in a general way is very difficult, since VMs can have any combination of overlapping cpu masks. Furthermore, this extra complication would be there for all users of the credit algorithm, regardless of whether they use this particular mode or not.
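With cpupools, the scenario above might be set up with a small pool config file. The fragment below is a sketch of xl's cpupool config syntax; the pool name and cpu numbers are illustrative:

```
# customer1.cfg -- a cpupool reserving two cores for one customer (illustrative)
name = "customer1-pool"
sched = "credit"
cpus = ["4", "5"]
```

The pool could then be created with `xl cpupool-create customer1.cfg`, and each of the customer's VMs assigned to it (for example via a `pool=` line in the VM's config). Because each pool runs its own scheduler instance, the credit weights then apply only among the VMs inside that pool, sidestepping the overlapping-mask problem entirely.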
Xen 4.2 will contain two new scheduling parameters for the credit1 scheduler, tslice_ms and ratelimit_us, which significantly increase its configurability and performance for cloud-based workloads. This blog post describes what they do, and how to configure them for best performance.
The timeslice for the credit1 scheduler has historically been fixed at 30ms. This is actually a fairly long time — it’s great for computationally-intensive workloads, but not so good for latency-sensitive workloads, particularly ones involving network traffic or audio.
Xen 4.2 introduces the tslice_ms parameter, which sets the timeslice of the scheduler in milliseconds. This can be set either using the Xen command-line option, sched_credit_tslice_ms, or at runtime using the new scheduling parameter interface:
# xl sched-credit -t [n]
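At boot time, the command-line option goes on the hypervisor's line in the grub entry. The exact entry layout and file paths vary by distro, so treat the following as a sketch:

```
# grub entry booting Xen with a 10ms credit1 timeslice (paths illustrative)
multiboot /boot/xen.gz sched_credit_tslice_ms=10
module /boot/vmlinuz ro root=/dev/sda1
module /boot/initrd.img
```

The same value can then be changed on a running system with the sched-credit interface shown above, e.g. `xl sched-credit -t 10`.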
One of the fun things about a hackathon is the chance to get everyone together in a room and just talk about crazy ideas you might try at some point in the future.
One of the advantages that a certain competing virtualization technology has over Xen is that you don’t have to reboot to start using it. It’s not that big of a thing, but if you just want to play around with VMs, the additional step of rebooting and probably having to muck about with a grub entry makes it pretty certain that casual users will prefer our competition.
Wouldn’t it be great, someone said, if you could just do “insmod xen” in a running kernel, and have it hoist up the kernel (which is currently running on bare metal), put Xen underneath, and make the currently running kernel into domain 0?
The idea sounds pretty crazy at first, but after some examination, it’s actually quite do-able. In fact, there’s precedent: Windows 2008, apparently, does that when booting into Hyper-V. It may involve a certain amount of switching from bare metal code to PV code; but there’s precedent for that too, in the form of SMP alternatives.
One thing that it would depend upon is another project we’ve been kicking around for a year or so now, that being running dom0 in an HVM container. That would greatly reduce the amount of PVOPS necessary to run Linux as dom0, making the “hoist” a lot cleaner.
We have a lot of work to do before this can become a priority, but it’s a project that’s attractive enough that I’m sure someone will pick it up in due time, at which point there’s no technical reason that Xen can’t be as convenient for casual users to begin using as any other virtualization technology out there.
by Stefano Stabellini
Linux 2.6.37, released just a few days ago, is the first upstream Linux kernel that can boot on Xen as Dom0: Linus pulled my “xen initial domain” patch series on the 28th of October and on the 5th of January the first Linux kernel was released with early Dom0 support!
Dom0 is the first domain started by the Xen hypervisor on boot, and until now adding Dom0 support to the Linux kernel has required out-of-tree patches (note that NetBSD and Solaris have had Dom0 support for a very long time). This means that every Linux distro supporting Xen as a virtualization platform has had to maintain an additional kernel patch series.
Distro maintainers, worry no more: Dom0 support is upstream! It is now very easy to enable and support Xen in the standard kernel distro images and I hope this will lead to an upsurge in distribution support for Xen. Just enabling CONFIG_XEN in the kernel config of a 2.6.37 Linux kernel allows the very same Linux kernel image to boot on native, on Xen as Dom0, on Xen as normal PV guest and on Xen as PV on HVM guest!
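For reference, the relevant part of a 2.6.37 kernel .config might look like the fragment below. CONFIG_XEN is the option named above; the other lines are dependencies I would expect to be enabled alongside it, and the exact set may vary with kernel version and architecture:

```
# Sketch of Xen-related kernel config options (x86-64, 2.6.37-era)
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_XEN_DOM0=y
```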
That said, the kernel backends, in particular netback and blkback, are not yet available in the upstream kernel. Therefore a 2.6.37 vanilla kernel can only be used to start VMs on the very latest xen-unstable. In fact, xen-unstable contains additional functionality that allows qemu-xen to offer a userspace fallback for the missing backends. This support will become part of the Xen 4.1 release, which is due in the next couple of months.
In the short term, the out-of-tree patch set has been massively reduced. It is expected that the xen.git kernel tree will soon contain the proposed upstreamable versions of the backend drivers. I strongly encourage everyone to pull these and start testing upstream Dom0 support!
I want to thank Jeremy Fitzhardinge, Konrad Rzeszutek Wilk, Ian Campbell and everyone else who was involved for the major contributions and general help that made this possible.