Author Archives: Dario Faggioli

About Dario Faggioli

Dario has been interested in CS, programming and Open Source for as long as he can remember. He is now a very happy Xen developer, employed by Citrix as a Senior Software Engineer, where he works on scheduling-related issues, NUMA support, and other things. On the personal side, he lives in central Italy with his wife (Luana) and daughter (Lara, 2yo). Check out more info about Dario, his personal Webpage and his blog.

Xen 4.3.0-RC1 is out!

We proudly announce that the Xen 4.3 RC-cycle has just started, with the tagging of 4.3.0-rc1 in our repository. Read the official announcement from George on xen-devel here.

A tarball has been made available to make testing easier and faster: Xen 4.3.0 RC1 Tarball (and signature).

For more detailed instructions on how to effectively test this first release candidate, look at this Wiki page: Xen 4.3 RC1 Test Instructions.

And as if this weren’t enough, today (Wednesday, 8th May 2013) is the first Xen Test Day for Xen 4.3, so come to #xentest (on freenode) and help us nail nasty bugs! Further Xen Test Days are scheduled for May 22nd and June 4th.

NUMA Aware Scheduling Development Report

Background and Motivation

This blog already hosted a couple of stories about what is going on in the Xen development community to improve Xen NUMA support. If you are interested in some background and motivation, feel free to check them out.

Long story short, they explain that NUMA is becoming more and more common and that it is therefore very important to: (1) achieve a good initial placement when creating a new VM; (2) have a solution that is both flexible and effective enough to take advantage of that placement during the whole VM lifetime. The former basically means: <<When starting a new Virtual Machine, with which NUMA node should I “associate” it?>>. The latter is more about: <<How strongly should the VM be associated with that NUMA node? Could it, perhaps temporarily, run elsewhere?>>.

NUMA Placement and Scheduling

So, here’s the situation: automatic initial placement has been included in Xen 4.2, inside libxl. This means that, when a VM is created (provided that happens through libxl), a set of heuristics decides on which NUMA node its memory has to be allocated, and the vCPUs of the VM are statically pinned to the pCPUs of that node.
On the other hand, NUMA aware scheduling has been under development during the last months, and is going to be included in Xen 4.3. This means that, instead of being statically pinned, the vCPUs of the VM will strongly prefer to run on the pCPUs of the NUMA node, but they can run somewhere else as well… And this is what this status report is all about.

NUMA Aware Scheduling Development

The development of this new feature started pretty early in the Xen 4.3 development cycle, and has undergone a couple of major reworks along the way. The very first RFC for it dates back to the Xen 4.2 development cycle, and it showed interesting performance already. However, what was decided at the time was to concentrate only on placement, and leave scheduling for the future. After that, v1, v2 and v3 of a patch series entirely focused on NUMA aware scheduling followed. It was discussed during XenSummit NA 2012, in a talk about future NUMA development in Xen in general (slides here). While at it, a couple of existing scheduling anomalies of the stock credit scheduler were found and fixed (for instance, the one described here).

Right now, we can say we are almost done. In fact, v3 received positive feedback and is basically what is going to be merged, and so what Xen 4.3 will ship. Actually, there is going to be a v4 (being released on xen-devel right at the same time as this blog post), but it only accommodates very minor changes, and it is 100% functionally equal to v3.

Any Performance Numbers?

Sure thing! Benchmarks similar to the ones already described in the previous blog posts have been performed. More specifically, directly from the cover letter of the v3 of the patch series, here’s what has been done:

I ran the following benchmarks (again):
* SpecJBB is all about throughput, so pinning is likely the ideal
  solution.
* Sysbench-memory is the time it takes for writing a fixed amount
  of memory (and then it is the throughput that is measured). What
  we expect is locality to be important, but at the same time the
  potential imbalances due to pinning could have a say in it.
* LMBench-proc is the time it takes for a process to fork a fixed
  number of children. This is much more about latency than
  throughput, with locality of memory accesses playing a smaller
  role and, again, imbalances due to pinning being a potential
  issue.

This all happened on a 2-node host, where 2 to 10 VMs (2 vCPUs and 960 RAM each) were executing the various benchmarks concurrently. Here are the results:

 ----------------------------------------------------
 | SpecJBB2005, throughput (the higher the better)  |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  43318.613  | 49715.158 |    49822.545    |
 |    6 |  29587.838  | 33560.944 |    33739.412    |
 |   10 |  19223.962  | 21860.794 |    20089.602    |
 ----------------------------------------------------
 | Sysbench memory, throughput (higher the better)  |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  469.37667  | 534.03167 |    555.09500    |
 |    6 |  411.45056  | 437.02333 |    463.53389    |
 |   10 |  292.79400  | 309.63800 |    305.55167    |
 ----------------------------------------------------
 | LMBench proc, latency (the lower the better)     |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  788.06613  | 753.78508 |    750.07010    |
 |    6 |  986.44955  | 1076.7447 |    900.21504    |
 |   10 |  1211.2434  | 1371.6014 |    1285.5947    |
 ----------------------------------------------------

In terms of percentage performance increase or decrease, NUMA aware scheduling compares as follows to no affinity at all and to static pinning:

     ----------------------------------
     | SpecJBB2005 (throughput)       |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     ----------------------------------
     |    2 |   +13.05%   |  +0.21%   |
     |    6 |   +12.30%   |  +0.53%   |
     |   10 |    +4.31%   |  -8.82%   |
     ----------------------------------
     | Sysbench memory (throughput)   |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     ----------------------------------
     |    2 |   +15.44%   |  +3.79%   |
     |    6 |   +11.24%   |  +5.72%   |
     |   10 |    +4.18%   |  -1.34%   |
     ----------------------------------
     | LMBench proc (latency)         |
     | NOTICE: -x.xx% = GOOD here     |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     ----------------------------------
     |    2 |    -5.66%   |  -0.50%   |
     |    6 |    -9.58%   | -19.61%   |
     |   10 |    +5.78%   |  -6.69%   |
     ----------------------------------
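
For readers who want to redo the arithmetic, here is a minimal sketch. It assumes each percentage is the difference between the NUMA scheduling score and the baseline score, divided by the NUMA scheduling score. That formula is inferred from the numbers (it reproduces most of the rows above) and is not stated in the cover letter; pct_vs is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Relative difference of NUMA scheduling versus a baseline, expressed as a
# percentage of the NUMA scheduling score.
# ASSUMPTION: the formula is inferred from the published figures; the cover
# letter does not state how the percentages were computed.
#   $1: NUMA scheduling score
#   $2: baseline score (no affinity, or pinning)
pct_vs() {
    local numa="$1" baseline="$2"
    awk -v n="$numa" -v b="$baseline" \
        'BEGIN { printf "%+.2f%%\n", (n - b) / n * 100 }'
}

# SpecJBB2005, 6 VMs (scores taken from the results table)
pct_vs 33739.412 29587.838   # versus no affinity: prints +12.30%
pct_vs 33739.412 33560.944   # versus pinning:     prints +0.53%
```

Keep in mind that for LMBench proc the scores are latencies, so a negative result is the good one there.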

The tables show how, when not in overload (where overload=’more vCPUs than pCPUs’), NUMA scheduling is the absolute best. Not only does it do a lot better than no affinity on throughput-biased benchmarks, and a lot better than pinning on latency-biased benchmarks (especially with 6 VMs), it also equals or beats both under circumstances that are adverse to NUMA scheduling: it beats or equals pinning in the throughput benchmarks, and beats or equals no affinity on the latency benchmark.

When the system is overloaded, NUMA scheduling scores in the middle, as could have been expected. It must also be noted that, when it brings benefits, they are not as large as in the non-overloaded case. However, this only means that there is still room for more optimization, right? In more detail, the current way a pCPU is selected for a vCPU that is waking up fits particularly badly with the new concept of NUMA node affinity. Changing this is not trivial, because it involves rearranging some locks inside the scheduler code, but it is already being worked on.
Anyway, even with what we have right now, we are overloading the test box by 20% here (without counting Dom0 vCPUs!) and still seeing improvements, which is definitely not bad!

What Else Is Going On?

Well, a lot… To the point that it is probably pointless to try to make a list here! We have a NUMA roadmap on our Wiki, which we are trying to keep updated and, more importantly, to honor and fulfill. So, if you are interested in knowing what will come next, go check it out!

Using xen-tools on Fedora

The Xen.org blog already hosted a very nice post by Ian Jackson, explaining how useful xen-tools is for automatically installing Debian (and Debian-derived) VMs. If all this happens on a Debian host, it is nice and easy, as getting xen-tools is just a matter of apt-get install-ing it. But what if your host machine runs something else, for instance, a copy of Fedora? As a matter of fact, starting from Fedora 16, Xen is quite easy to install and use on Fedora, which makes this case worth covering too.

There is no xen-tools RPM package, thus we need to go the good old way: download the sources, then compile and install them. Luckily enough, this is not difficult at all, and this blog post will explain in detail how to achieve it.

Installing Fedora and Xen

So, let’s assume that you just finished installing the new and shiny Spherical Cow. Official instructions and advice on that are available here. The first thing to do now is to install Xen there. This has become very simple these days; all that’s needed is the following (where a # prompt means the command must be run as root):

# yum install xen

Followed by a reboot. Note that Xen will not be the default boot option, so you’ll need to make sure to select it from the GRUB2 menu. You can also make Xen the default by setting GRUB_DEFAULT=saved in your /etc/default/grub and running the following:

# grub2-mkconfig -o /boot/grub2/grub.cfg
# XEN=$(grep ^menuentry /boot/grub2/grub.cfg | cut -f2 -d"'" | tail -n1)
# grub2-set-default "$XEN"
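
The commands above take the last menu entry in grub.cfg, which works as long as the Xen entry is the last one. As a hedged alternative, here is a small sketch that looks for the entry by name instead. It assumes the title of the Xen entry contains the word "Xen", and xen_menuentry is a hypothetical helper name, not part of GRUB2.

```bash
#!/usr/bin/env bash
# Print the title of the first top-level GRUB2 menu entry that mentions Xen.
# ASSUMPTIONS: entries look like  menuentry 'Title' ... {  and the Xen entry
# has "Xen" somewhere in its title. xen_menuentry is a hypothetical helper.
#   $1: path to the GRUB2 configuration (defaults to /boot/grub2/grub.cfg)
xen_menuentry() {
    local cfg="${1:-/boot/grub2/grub.cfg}"
    grep '^menuentry' "$cfg" | cut -f2 -d"'" | grep -i -m1 'xen'
}

# Usage (as root). The title contains spaces, hence the quotes:
#   grub2-set-default "$(xen_menuentry)"
```

If no entry matches, the function prints nothing and returns a non-zero status, so it is worth checking its output before passing it to grub2-set-default.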

If libvirt’s services are needed too, some more packages must be installed, but this is out of the scope of this post. For more information on how to install Xen on Fedora, check the Fedora pages on Xen.org’s Wiki, in particular this one: Fedora Host Installation.

Continue reading

Fedora 18 Virtualization Test Day

As usual, Fedora has planned a number of test days for their upcoming Fedora 18 release, including a Virtualization Test Day on November 1st (tomorrow!).

We are therefore calling on all our community members to participate in the test day as much as possible. Specific information regarding testing Xen on the new Fedora can be found in this Wiki page. To attend and participate, be sure you hang out on IRC at #fedora-test-day (Freenode) on Thursday!

Fedora 18 will be one of the first distros shipping Xen 4.2… Join and help us make sure it will work great for all future Fedora 18 users!

Tracing with Xentrace and Xenalyze

Figuring out “what’s going on?” is always very important. For example, knowing which processes were running on which processors can be very useful if you are doing OS development and/or performance evaluation. Applied to virtualization, that turns into figuring out which VMs were running on which processors (or, better, which virtual CPUs were running on which physical CPUs) during the execution of some workload.

When using Xen as the hypervisor, all of that is possible by means of two tools: xentrace and xenalyze.

Xentrace

Xen has a number of trace points at key locations to allow developers to get a picture of what is going on inside of Xen. When these trace points are enabled, Xen will write the tracing information into per-cpu buffers within Xen. A program in dom0, called xentrace, sets up and enables tracing, then periodically reads these buffers and writes them to disk. Xentrace is part of the xen.org tree and will come with every distribution of Xen. A very basic, but already quite useful, invocation of the tool is the following:

# xentrace -D -e EVT_MASK > trace_file.bin &
run_your_workload
# killall xentrace

where EVT_MASK can be one of the following values:

0x0001f000          TRC_GEN
0x0002f000          TRC_SCHED
0x0004f000          TRC_DOM0OP
0x0008f000          TRC_HVM
0x0010f000          TRC_MEM
0xfffff000          TRC_ALL

These are Event Classes. Using one of them tells xentrace to gather information about a group of events. For example, 0x0002f000 can be used to obtain all the events related to vCPU scheduling. There are other options available, and it is also possible to achieve finer-grained control over the events (for the complete list, refer to xentrace -? and/or man xentrace).
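
Since these values are bit masks, two or more classes can be combined by ORing them together. Here is a minimal sketch of that arithmetic; combine_masks is a hypothetical helper name used only for illustration, and the mask values are the ones listed above.

```bash
#!/usr/bin/env bash
# Event class masks, copied from the list above.
TRC_SCHED=0x0002f000
TRC_HVM=0x0008f000

# combine_masks is a hypothetical helper: it ORs any number of masks
# together and prints the result as an 8-digit hexadecimal value.
combine_masks() {
    local mask=0 m
    for m in "$@"; do
        mask=$(( mask | m ))
    done
    printf '0x%08x\n' "$mask"
}

combine_masks "$TRC_SCHED" "$TRC_HVM"   # prints 0x000af000
```

The resulting value can then be used in place of EVT_MASK to collect both scheduling and HVM events in the same trace.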

Xenalyze

Unlike xentrace, xenalyze is an external tool, available from its own source code repository (see below). It was publicly released in 2009 by George Dunlap, and the latest version is always available in this repository: http://xenbits.xen.org/gitweb/?p=xenalyze.git . Getting and installing xenalyze is really worthwhile. There are also slides and videos explaining what it is and what it does, so go check them out! :-)

Continue reading

NUMA and Xen: Part II, Scheduling and Placement

Where were we?

So, here is what we said up to now. Basically:

  1. NUMA is becoming increasingly common;
  2. properly dealing with NUMA is important for performance;
  3. one can tweak Xen for NUMA, but it would be nice for that to happen automagically!

So, let’s tackle some automatic NUMA handling mechanisms this time!

NUMA Scheduling, whatsit

Suppose we have a VM with all its memory allocated on NODE#0 and NODE#2 of our NUMA host. As already said, the best thing to do would be to pin the VM’s vCPUs on the pCPUs related to the two nodes. However, pinning is quite inflexible: what if those pCPUs get very busy while there are completely idle pCPUs on other nodes? It will depend on the workload, but it is not hard to imagine that having some chance to run (even if on a remote node) would be better than not running at all! It is therefore preferable to give the scheduler some hints about where a VM’s vCPUs should be executed. It can then try its best to honor these requests of ours, but not at the cost of subverting its own algorithm. From now on, we’ll call this hinting mechanism node affinity (don’t confuse it with CPU affinity, which is about static CPU pinning).
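
To make the difference between a hint and a constraint more concrete, here is a toy model of the idea. It is only a sketch and not the credit scheduler's actual logic: pick_pcpu is a hypothetical helper, and the pCPU numbers are made up for illustration.

```bash
#!/usr/bin/env bash
# Toy model of node affinity as a preference rather than a constraint.
# pick_pcpu is a hypothetical helper, NOT the scheduler's real algorithm.
#   $1: space-separated list of idle pCPUs
#   $2: space-separated list of pCPUs on the nodes in the VM's node affinity
pick_pcpu() {
    local idle="$1" preferred="$2" p q
    # First pass: an idle pCPU on a preferred node, if there is one.
    for p in $idle; do
        for q in $preferred; do
            if [ "$p" = "$q" ]; then
                echo "$p"
                return 0
            fi
        done
    done
    # Second pass: any idle pCPU, even on a remote node.
    for p in $idle; do
        echo "$p"
        return 0
    done
    return 1   # nothing idle at all
}

pick_pcpu "2 9 12" "0 1 2 3"   # prints 2: idle and on a preferred node
pick_pcpu "9 12" "0 1 2 3"     # prints 9: preferred pCPUs busy, run remotely
```

With static pinning, the second call would leave the vCPU waiting for one of pCPUs 0 to 3; with node affinity it gets to run on a remote node instead.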

Continue reading