Some time ago Konrad Rzeszutek Wilk (the Xen Linux maintainer) came up with a list of possible improvements to the Xen PV block protocol, which is used by Xen guests to reduce the overhead of emulated disks.
This document is quite long, and the list of possible improvements is not small either, so we decided to implement them step by step. One of the first items that seemed like a good candidate for a performance improvement was what is called “indirect descriptors”. This idea is borrowed from the VirtIO protocol and, to put it simply, consists of fitting more data into a single request. I am going to expand on how this is done later in the blog post, but first I would like to talk a little bit about the Xen PV disk protocol, in order to understand it better.
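To make the idea concrete before diving into the protocol itself, here is a rough C sketch (names, fields and constants are simplified illustrations, not the actual Xen blkif headers): a classic request embeds its data segments inline, which caps how much data a single request can carry, while an indirect request carries only grant references to whole pages full of segments.

```c
#include <stdint.h>

typedef uint32_t grant_ref_t;

#define MAX_INLINE_SEGS     11  /* the classic blkif per-request limit  */
#define MAX_INDIRECT_PAGES   8  /* illustrative value for this sketch   */

struct blk_segment {
    grant_ref_t gref;                   /* granted page holding the data */
    uint8_t     first_sect, last_sect;
};

/* Classic request: the segments live inside the request itself, so one
 * request can move at most MAX_INLINE_SEGS pages of data. */
struct blk_request {
    uint64_t id;
    uint64_t sector;
    uint8_t  nr_segments;
    struct blk_segment seg[MAX_INLINE_SEGS];
};

/* Indirect request: it carries only grant references to pages that are
 * themselves arrays of struct blk_segment, so a single slot in the
 * ring can describe a much larger I/O operation. */
struct blk_request_indirect {
    uint64_t id;
    uint64_t sector;
    uint16_t nr_segments;               /* can now be in the hundreds */
    grant_ref_t indirect_grefs[MAX_INDIRECT_PAGES];
};
```

With the segments moved out of the ring slot, one request can describe far more data, which means fewer ring slots and fewer notifications per byte transferred.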
Xen PV disk protocol
The Xen PV disk protocol is very simple: it only requires a shared memory page between the guest and the driver domain in order to queue requests and read responses. This shared memory page has a size of 4KB and is mapped by both the guest and the driver domain. The structure used by the frontend and the backend to exchange data is also very simple, and is usually known as a ring: a circular queue plus some counters pointing to the last produced requests and responses.
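A minimal sketch of what such a shared ring looks like, loosely modeled on xen/include/public/io/ring.h (the real structures are generated by the DEFINE_RING_TYPES() macros; the layout here is simplified):

```c
#include <stdint.h>

typedef uint32_t RING_IDX;

/* One 4KB page shared between frontend and backend. */
struct blk_sring {
    RING_IDX req_prod;   /* frontend: index of last request produced       */
    RING_IDX req_event;  /* backend:  "notify me when req_prod passes this" */
    RING_IDX rsp_prod;   /* backend:  index of last response produced      */
    RING_IDX rsp_event;  /* frontend: "notify me when rsp_prod passes this" */
    uint8_t  pad[48];    /* pad the counters out to their own cache line   */
    /* The remainder of the page is a circular array of fixed-size
     * request/response slots, indexed modulo the ring size. */
};
```

Both sides poll or receive event-channel notifications based on those counters. Since requests and responses share a fixed 4KB page, the size of a single request slot directly limits how much data one request can describe, which is exactly the limitation indirect descriptors work around.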
With the release process for Xen 4.3 in full swing (we intend to release the third release candidate this week) and with the Xen Test Days initiative (the next one is this Wednesday, 5 June; join us on IRC freenode #xentest), I thought it would be useful to offer up some information and guidance on how the Xen project deals with bug reports and how to report bugs against the Xen Hypervisor Project.
Reporting a Bug
Confirm That Your Issue Is a Bug
Experience has shown that what appears to be a bug is often just a misconfiguration or a misunderstanding of how things are supposed to work. This can be down to a lack of documentation (a perennial problem for Open Source projects, and something which we are working to address with our regular Xen Document Days) or the inherent complexity of some of the things which one can achieve (or try to achieve!) using Xen.
With that in mind, it is often useful to try seeking help via other means before resorting to submitting a bug report. Useful resources for asking questions are:
- In a Xen system, the logs can usually be found in /var/log/xen and will sometimes provide useful insight into an issue.
- Search engines. As well as simply searching for key terms relating to your issue, you can also search the User and Developer list archives.
Xen.org recently released a number of (related) security updates, XSA-7 through XSA-9. This was done by the Xen.org Security Team, who are charged with following the Xen.org Security Problem Response Process.
As part of releasing XSA-7..9, several shortcomings (a few of which Ian Jackson has already discussed in Security vulnerabilities – the coordinated disclosure sausage mill) were found in the process.
In order to address these shortcomings, we have started a discussion on the xen-devel mailing list which describes the issues we faced and proposes some potential options for updates. However, this process is supposed to serve you, the Xen user community, and therefore your feedback and input are critical to ensuring that the policy meets the needs of the community.
So whether you are a small or large consumer of Xen, you should feel free to have your say and to help formulate an updated policy which best serves the needs of the community. To take part in the discussion, please send mail to firstname.lastname@example.org.
The Xen Security team recently disclosed a vulnerability, Xen Security Advisory 7 (CVE-2012-0217), which would allow guest administrators to escalate to hypervisor-level privileges. The impact is much wider than Xen; many other operating systems seem to have the same vulnerability, including NetBSD, FreeBSD, and some versions of Microsoft Windows (Windows 7 among them).
So what was the vulnerability? It has to do with a subtle difference in the way Intel processors implement error handling in their version of AMD’s SYSRET instruction. The SYSRET instruction is part of the x86-64 standard defined by AMD. If an operating system is written according to AMD’s spec but run on Intel hardware, the difference in implementation can be exploited by an attacker to write to arbitrary addresses in the operating system’s memory. This blog post will explore the technical details of the vulnerability.
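At the heart of it is the notion of a “canonical” address: on current 48-bit x86-64 implementations, a virtual address is canonical when bits 63..47 are all copies of bit 47. The divergence is in what happens when SYSRET is handed a non-canonical return address in RCX: Intel CPUs raise #GP while still in ring 0, with the user-controlled stack pointer already loaded, whereas AMD CPUs only fault after the switch back to user mode. A small self-contained C sketch of the canonicality check (illustrative, not from the original post):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* An x86-64 virtual address (48 implemented bits) is canonical iff
 * bits 63..47 are all equal to bit 47, i.e. properly sign-extended. */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;           /* bits 63..47: 17 bits       */
    return top == 0 || top == 0x1ffff;   /* all zeros or all ones      */
}

int main(void)
{
    printf("%d\n", is_canonical(0x00007fffffffffffULL)); /* 1: top of user space      */
    printf("%d\n", is_canonical(0xffff800000000000ULL)); /* 1: bottom of kernel space */
    printf("%d\n", is_canonical(0x0000800000000000ULL)); /* 0: non-canonical hole     */
    return 0;
}
```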
Where were we?
So, here is what we have said so far. Basically:
- NUMA is becoming increasingly common;
- properly dealing with NUMA is important for performance;
- one can tweak Xen for NUMA, but it would be nice for that to happen automagically!
So, let’s tackle some automatic NUMA handling mechanisms this time!
NUMA Scheduling, whatsit
Suppose we have a VM with all its memory allocated on NODE#0 and NODE#2 of our NUMA host. As already said, the best thing to do would be to pin the VM’s vCPUs to the pCPUs of those two nodes. However, pinning is quite inflexible: what if those pCPUs get very busy while completely idle pCPUs sit on other nodes? It will depend on the workload, but it is not hard to imagine that having some chance to run (even if on a remote node) would be better than not running at all! It is therefore preferable to give the scheduler some hints about where a VM’s vCPUs should be executed. It can then try its best to honor these requests of ours, but not at the cost of subverting its own algorithm. From now on, we’ll call this hinting mechanism node affinity (don’t confuse it with CPU affinity, which is about static CPU pinning).
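To see the difference between a hard constraint and a hint, here is a deliberately simplified C sketch (a toy model, nothing like the actual Xen credit scheduler code): node affinity steers the search for an idle pCPU toward the preferred nodes first but, unlike pinning, still falls back to any idle pCPU rather than leaving the vCPU waiting.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_PCPUS 8

/* Toy model: choose a pCPU for a vCPU. Node affinity is a *hint*:
 * prefer an idle pCPU on the vCPU's preferred nodes, but fall back
 * to any idle pCPU rather than not running at all. */
static int pick_pcpu(const bool idle[NR_PCPUS],
                     const bool node_affine[NR_PCPUS])
{
    for (int i = 0; i < NR_PCPUS; i++)
        if (idle[i] && node_affine[i])
            return i;               /* best case: idle *and* local    */
    for (int i = 0; i < NR_PCPUS; i++)
        if (idle[i])
            return i;               /* running remotely beats waiting */
    return -1;                      /* nothing idle: stay in runqueue */
}

int main(void)
{
    /* pCPUs 0-3 belong to the VM's nodes, but 0 and 1 are busy. */
    bool idle[NR_PCPUS]        = { false, false, true,  true,
                                   true,  false, false, false };
    bool node_affine[NR_PCPUS] = { true,  true,  true,  true,
                                   false, false, false, false };
    printf("chosen pCPU: %d\n", pick_pcpu(idle, node_affine)); /* 2 */
    return 0;
}
```

Pinning would be the same function without the fallback loop: if no node-affine pCPU were free, the vCPU simply would not run.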
NUMA? What’s NUMA?
Having to deal with a Non-Uniform Memory Access (NUMA) machine is becoming more and more common. This is true whether you are part of an engineering research center with access to one of the first Intel SCC-based machines, a virtual machine hosting provider with a bunch of dual 2376 AMD Opteron pieces of iron in your server farm, or even just a Xen developer using a dual socket E5620 Xeon based test-box (any reference to the recent experiences of the author of this post is purely accidental :-D).

Very quickly: NUMA means that the memory access times of a program running on a CPU depend on the relative distance between that CPU and that memory. In fact, most NUMA systems are built in such a way that each processor has its own local memory, on which it can operate very fast. On the other hand, getting and storing data from and to remote memory (that is, memory local to some other processor) is quite a bit more complex and slow. Therefore, while hardware engineers bump their heads against cache coherency protocols and routing strategies for the system buses of such machines, the most urgent issue for us, OS and hypervisor developers, is the following: how can we couple scheduling and memory management so that most of the accesses for most of our tasks/VMs stay local?
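On Linux (in dom0, for instance) you can see this locality expressed in numbers: the firmware exposes a SLIT-style distance table in which 10 means local and larger values mean farther and slower. A small sketch using libnuma (compile with -lnuma; assumes the libnuma development headers are installed):

```c
/* Print the NUMA node distance matrix via libnuma. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int max = numa_max_node();
    printf("nodes: 0..%d\n", max);
    /* SLIT-style distances: 10 = local, larger = farther/slower. */
    for (int i = 0; i <= max; i++) {
        for (int j = 0; j <= max; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```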