Author Archives: Stefano Stabellini

PV Calls: a new paravirtualized protocol for POSIX syscalls

Let’s take a step back and look at the current state of virtualization in the software industry. Hypervisors for x86 were built to run a few different operating systems on the same machine. Nowadays they are mostly used to execute several instances of the same OS (Linux), each running a single server application in isolation. Containers are a better fit for this use case, but they expose a very large attack surface. It is possible to reduce that surface, but it is a difficult task, one that requires intimate knowledge of the app running inside, and at any scale it becomes a formidable challenge. The 15-year-old hypervisor technologies, principally designed for RHEL 5 and Windows XP, are more a workaround than a solution for this use case. We need to bring them to the present and take them into the future by modernizing their design.

The typical workload we need to support is a Linux server application packaged to be self-contained, complying with the OCI Image Format or the Docker Image Specification. The app comes with all required userspace dependencies, including its own libc. It makes syscalls to the Linux kernel to access resources and functionality. This is the only interface we must support.

Many of these syscalls closely correspond to function calls which are part of the POSIX family of standards. They have well-known parameters and return values. POSIX stands for “Portable Operating System Interface”: it defines an API available on all major Unixes today, including Linux. POSIX is large to begin with, and Linux adds its own set of non-standard calls on top of it. As a result, a Linux system exposes a very large number of calls and, inescapably, also a high number of vulnerabilities. It is wise to restrict syscalls by default. Linux containers struggle with this, but hypervisors are very accomplished in this respect; after all, hypervisors don’t need full POSIX compatibility. By paravirtualizing hardware interfaces, Xen provides powerful functionality with a small attack surface. But PV devices are the wrong abstraction layer for Docker apps: they cause duplication of functionality between the guest and the host. For example, the network stack is traversed twice, first in the DomU and then in Dom0. This is unnecessary. It is better to raise the level of the hypervisor’s abstractions by paravirtualizing a small set of syscalls directly.

PV Calls

It is far easier and more efficient to write paravirtualized drivers for syscalls than to emulate hardware, because syscalls are at a higher level and made for software. I wrote a protocol specification called PV Calls to forward POSIX calls from DomU to Dom0, and a couple of prototype Linux drivers for it that work at the syscall level. The initial set of calls covers socket, connect, accept, listen, recvmsg, sendmsg and poll. The frontend driver forwards syscall requests over a ring; the backend implements the syscalls, then returns success or failure to the caller. The protocol creates a new ring for each active socket, and the ring size is configurable on a per-socket basis. Data being received is copied to the ring by the backend, while data being sent is copied to the ring by the frontend. An event channel per ring is used to notify the other end of any activity. This tiny set of PV Calls is enough to provide networking capabilities to guests.

We are still running virtual machines, but mainly to restrict the vast majority of the application’s syscalls to a safe and isolated environment. The guest operating system kernel, which is provided by the infrastructure (it doesn’t come with the app), implements syscalls for the benefit of the server application. Xen gives us the means to exploit hardware virtualization extensions to create strong security boundaries around the application. Xen PV VMs enable this approach to work even when virtualization extensions are not available, such as on top of Amazon EC2 or Google Compute Engine instances.

This solution is as secure as Xen VMs but efficiently tailored to container workloads. Early measurements show excellent performance. It also provides a couple of less obvious advantages. In Docker’s default networking model, containers’ communications appear to come from the host IP address and containers’ listening ports are explicitly bound to the host. PV Calls are a perfect match for it: outgoing communications are made from the host IP address directly, and listening ports are automatically bound to it. No additional configuration is required.

Another benefit is ease of monitoring. One of the key aspects of hardening Linux containers is keeping applications under constant observation with logging and monitoring. We should not ignore it even though Xen provides a safer environment by default. PV Calls forward networking calls made by the application to Dom0. In Dom0 we can trivially log them and detect misbehavior. More powerful (and expensive) monitoring techniques like memory introspection offer further opportunities for malware detection.

PV Calls are unobtrusive. No changes to Xen are required as the existing interfaces are enough. Changes to Linux are very limited as the drivers are self-contained. Moreover, PV Calls perform extremely well! Let’s take a look at a couple of iperf graphs (higher is better):

iperf client

iperf server

The first graph shows network bandwidth measured by running an iperf server in Dom0 and an iperf client inside the VM (or container in the case of Docker). PV Calls reach 75 gbit/sec with 4 threads, far better than netfront/netback.

The second graph shows network bandwidth measured by running an iperf server in the guest (or container in the case of Docker) and an iperf client in Dom0. In this scenario PV Calls reach 55 gbit/sec and outperform not just netfront/netback but even Docker.

The benchmarks have been run on an Intel Xeon D-1540 machine, with 8 cores (16 threads) and 32 GB of RAM. Xen is 4.7.0-rc3 and Linux is 4.6-rc2. Dom0 and DomU have 4 vcpus each, pinned. DomU has 4 GB of RAM.
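Pinning of that sort comes down to a handful of settings. Purely as an illustrative sketch (not the exact configuration used for these runs), the guest side would look like the snippet below, with Dom0 restricted via Xen boot options such as dom0_max_vcpus=4 dom0_vcpus_pin:

vcpus = 4
cpus = "4-7"      # pin the guest vcpus to dedicated physical cores
memory = 4096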

For more information on PV Calls, read the full protocol specification on xen-devel. You are welcome to join us and participate in the review discussions. Contributions to the project are very appreciated!

Xen now available in CentOS 7 for ARM64 servers

A little more than a week ago, at Linaro Connect SFO15 in Burlingame, Jim Perrin of the CentOS project publicly announced the availability of the Xen hypervisor in CentOS 7 for ARM64 (also known as aarch64). Jim and I have been working closely with George Dunlap, maintainer of Xen in CentOS for the x86 architecture, to produce high quality Xen binaries for 64-bit ARM servers. As a result you can set up an ARM64 virtualization host with just a couple of yum commands.

CentOS 7 aarch64 is available here. Installation is trivial: download the live image, try it out, and write it to disk if you like it. You can easily extend the root partition and filesystem to match the size of your disk.
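Writing the image and growing the root filesystem boils down to a few standard commands. This is a sketch only: the image file name, device name, partition number and filesystem type are placeholders, so adjust them to your setup.

dd if=CentOS-7-aarch64-rolling.img of=/dev/sdX bs=4M
growpart /dev/sdX 2        # grow the root partition (growpart comes from cloud-utils)
resize2fs /dev/sdX2        # or xfs_growfs on the mounted root if the filesystem is XFS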

Once you have CentOS 7 up and running on your ARM64 server, you can install Xen and Libvirt with the following commands:

yum install centos-release-xen
yum update
yum install xen libvirt

If you are using an AppliedMicro X-Gene board, you need to add a Xen command line option to specify which serial port to use. This is due to the firmware missing one piece of information; we are working with AppliedMicro to fix the issue as soon as possible. In the meantime, you can edit /etc/default/grub and add the appropriate serial console option to GRUB_CMDLINE_XEN_DEFAULT.
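The exact option depends on your board and firmware; purely as an illustration (the dtuart value below is a placeholder, not the real one), the edit looks something like this:

# in /etc/default/grub -- keep any options already present
GRUB_CMDLINE_XEN_DEFAULT="console=dtuart dtuart=<path-or-alias-of-your-uart>"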


Then recreate the grub config file:

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg

Reboot and you’ll have Xen and Libvirt ready to use! Simple, right? :-)

If you want to try running a CentOS guest, just download the CentOS 7 live image, unpack it, and write a basic VM config file, using the Dom0 kernel and initramfs. For example:

disk=[ "file:/path/to/CentOS-7-aarch64-rolling.img,xvda,w" ]
name = "centos7"
vcpus = 1
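As written, the snippet above still lacks the kernel and memory settings mentioned in the text. A minimal sketch of the missing entries, with placeholder paths for the Dom0 kernel and initramfs and a placeholder root device:

kernel = "/boot/vmlinuz-<version>"           # the Dom0 kernel, path is a placeholder
ramdisk = "/boot/initramfs-<version>.img"    # the Dom0 initramfs, path is a placeholder
memory = 1024
extra = "console=hvc0 root=/dev/xvda"        # adjust the root device to the image layout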

Use xl to create the guest and connect to its console:

xl create -c config

Rinse and repeat as many times as you like, and you’ll have many little CentOS virtual machines keeping you company.

Xen 4.6 will be released soon, and you can count on us updating the Xen rpm in CentOS 7 shortly after. You’ll be able to install the latest and greatest Xen hypervisor release for ARM64 with a simple yum install.

At Linaro Connect I went further by showing ready-to-use OpenStack packages for CentOS 7 aarch64. Thanks to Anthony Perard, who produced those rpms, setting up Nova with Xen on ARM64 is just a matter of installing the packages and starting the Nova services. Jim promised to have the OpenStack rpms online in a couple of weeks. Stay tuned!

Will Docker Replace Virtual Machines?

Docker is certainly the most influential open source project of the moment. Why is Docker so successful? Is it going to replace Virtual Machines? Will there be a big switch? If so, when?

Let’s look at the past to understand the present and predict the future. Before virtual machines, system administrators used to provision physical boxes for their users. The process was cumbersome, not completely automated, and it took hours if not days. When something went wrong, they had to run to the server room to replace the physical box.

With the advent of virtual machines, DevOps teams could install a hypervisor on all their boxes and then simply provision new virtual machines upon request from their users. Provisioning a VM took minutes instead of hours and could be automated. The underlying hardware made less of a difference and was mostly commoditized. If one needed more resources, one would just create a new VM. If a physical machine broke, the admin just migrated or resumed her VMs on a different host.

Finer-grained deployment models became viable and convenient. Users were no longer forced to run all their applications on the same box just to exploit the underlying hardware to the fullest. One could run a VM with the database, another with the middleware and a third with the webserver, without worrying about hardware utilization. The people buying the hardware and the people architecting the software stack could work independently in the same company, without interference. The new interface between the two teams had become the virtual machine. Solution architects could cheaply deploy each application on a different VM, reducing their maintenance costs significantly. Software engineers loved it. This might have been the biggest innovation introduced by hypervisors.

A few years passed and everybody in the business got accustomed to working with virtual machines. Startups don’t even buy server hardware anymore; they just shop on Amazon AWS. One virtual machine per application is the standard way to deploy software stacks.

Application deployment, however, hasn’t changed much since the ’90s. It still involves installing a Linux distro, mostly built for physical hardware, installing the required deb or rpm packages, and finally installing and configuring the application that one actually wants to run.

In 2013 Docker came out with a simple yet effective tool to create, distribute and deploy applications wrapped in a nice format to run in independent Linux containers. It comes with a registry that is like an app store for these applications, which I’ll call “cloud apps” for clarity. Deploying the Nginx webserver had just become one “docker pull nginx” away, which is much quicker and simpler than installing the latest Ubuntu LTS. Docker cloud apps come preconfigured and without the unnecessary packages that Linux distros unavoidably install. In fact, the Nginx Docker cloud app is produced and distributed by the Nginx community directly, rather than by Canonical or Red Hat.
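For example, pulling and running the Nginx cloud app is just a couple of commands:

docker pull nginx
docker run -d -p 80:80 nginx      # run it in the background and publish port 80 on the host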

Docker’s outstanding innovation is the introduction of a standard format for cloud applications, together with the registry to distribute them. Linux containers, rather than VMs, are used to run these cloud apps. Containers had been available for years, but they weren’t particularly popular outside Google and a few other circles. Although they offer very good performance, they have fewer features and weaker isolation than virtual machines. As a rising star, Docker suddenly made Linux containers popular, but containers were not the reason behind Docker’s success; they were incidental to it.

What is the problem with containers? Their live-migration support is still very green, and they cannot run non-native workloads (Windows on Linux or Linux on Windows). Furthermore, the primary challenge with containers is security: the attack surface is far larger than that of virtual machines. In fact, multi-tenant container deployments are strongly discouraged by Docker, CoreOS, and everybody else in the industry. With virtual machines you don’t have to worry about who is going to use them or how they will be used. With containers, on the other hand, only containers that belong to the same user should run on the same host. Amazon and Google offer container hosting, but they both run each container on top of a separate virtual machine for isolation and security. Maybe inefficient, but certainly simple and effective.

People are starting to notice this. At the beginning of the year a few high-profile projects were launched to bring the benefits of virtual machines to Docker, in particular Clear Linux by Intel and Hyper. Both of them use conventional virtual machines to run Docker cloud applications directly (no Linux containers are involved). We did a few tests with Xen: tuning the hypervisor for this use case allowed us to reach the same startup times offered by Linux containers, while retaining all the other features. A similar effort by Intel for Xen is being presented at the Xen Developer Summit, and Hyper is also presenting their work.

This new direction has the potential to deliver the best of both worlds to our users: the convenience of Docker with the security of virtual machines. Soon Docker might not be fighting virtual machines at all; it could be the one deploying them.

A Chinese translation of the article is available here.

Project Raisin – Raise Xen!

It all started with pvgrub2: it was March 2015 and I wanted to add grub2 to the Xen build system. We were already building grub-legacy as part of the Xen build, so that we could produce a pvgrub binary to be used to boot PV guests. After Vladimir ‘phcoder‘ Serbinenko’s good work on grub2, the latest and greatest upstream grub2 could be built with Xen support and used to boot PV guests. It made perfect sense to add grub2 to the Xen build system too, right? Maybe not. Unexpectedly some key Xen Project contributors pushed back: “there doesn’t seem to be a good reason for cloning and building yet another third-party project as part of the Xen build”, wrote David Vrabel.

Conflicting requirements

It was then that I realized that we have two contrasting sets of requirements: on one hand we want to support users who clone xen-unstable, build everything from source, and expect the system to be fully ready after typing ./configure; make; make install. On the other hand, we also want to support distros and product groups that take Xen releases and integrate them into their Linux distros or enterprise build systems. The former want things like grub2 to be part of the xen-unstable build, because the grub2 package provided by their distro doesn’t necessarily come with Xen support enabled. The distro packagers, on the other hand, are already building a grub2 package and certainly don’t want xen-unstable to go and clone grub2 again. They probably abhor the whole idea of xen-unstable git cloning external trees without their explicit assent. In fact, they had been carrying patches to make sure xen-unstable didn’t clone anything “behind their back”, until we provided build options to disable all the third-party builds.

Raisin: Xen’s DevStack

How to find a solution that would make both camps happy? Surely others must have had the same issue. Is there another open source project that has to build several separate components in order to be fully functional? Yes, of course, there are many. One of them is OpenStack and it solves the problem by providing a set of scripts called DevStack, which build and setup the system from scratch.

This is where “Raisin” comes from. I announced the new project on the 31st of March 2015. The idea is that Raisin takes care of building Xen and all the other components that are required for a fully functional Xen system but don’t belong to xen-unstable: for example QEMU, SeaBIOS and, of course, grub2. Users who build everything from source can clone Raisin to find a single place where they can build all the latest and greatest Xen stuff with a single command. Raisin can be very useful for setting up a development environment too. Distro packagers can refer to Raisin as the most common way to build, install and configure Xen and related components, but they are unlikely to actually use it to build their packages. Raisin helps Xen developers improve the boundaries and interfaces between Xen and external components, by making those boundaries clearer and more explicit. Things like QEMU and SeaBIOS, currently cloned and built by xen-unstable, will be moved out to Raisin, making both Xen maintainers and distro packagers happier. Other Xen-related components that are good to have but not strictly required, such as libvirt, will find their place in Raisin too.

Raisin: where we are, what’s next

After a few busy months of development, we now have a set of bash scripts that can install dependencies and build Xen, QEMU, qemu-traditional, SeaBIOS, OVMF, Grub2, Libvirt and Linux with a single command. All you need to do is edit the config file, type raise -y build, go get a coffee, and everything will be ready when you come back. Raisin is not tied to a specific version of Xen: one can choose any git tag or commit id newer than Xen 4.5 (RELEASE-4.5.0 is the git tag for the Xen 4.5 release) and Raisin will build it. Other commands are available to install and configure the system with the most common settings. Have a look at the README for an introduction to the command line tool.
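As a rough sketch of the workflow (the repository URL is the one I remember, so double-check it against the announcement, and the config contents depend on which components and revisions you want):

git clone git://xenbits.xen.org/raisin.git
cd raisin
vi config                 # choose the git tags or commit ids to build, e.g. RELEASE-4.5.0
./raise -y build          # -y answers yes to installing the required build dependencies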

During the last few weeks I have been working on integrating Raisin in OSSTest, the automated testing framework run by Xen Project. I am currently adding a new test to validate Raisin itself, but going forward it makes sense to actually use Raisin to build Xen, QEMU and anything else OSSTest needs, similarly to what DevStack does for the OpenStack gate.

Making testing easier and accessible to everybody

Speaking of tests, this is another area where Raisin can help greatly. I have always liked the idea of providing a set of unit and functional tests, quick and simple to run, that any Xen contributor can execute to validate their changes before sending a patch to xen-devel. However, we didn’t really have a place to put them: OSSTest is too big and too tightly coupled to the Xen Project Test Lab infrastructure for this use case, and the last thing xen-unstable needs is more scripts. Raisin, on the other hand, is at the right abstraction level to run functional tests for the components it already builds. I introduced a few simple tests, which can stack on top of each other, to create busybox-based PV and HVM guests. I plan to continue adding more tests, then expose them to OSSTest via Raisin, so that they will be run continuously by the Xen Project Test Lab. At the same time, anybody can still execute them manually on their test box with a single raise test command. I am hoping to start asking contributors to run the Raisin tests before submitting patches early in the next release cycle. If you use Xen and know bash scripting, you should consider writing a Raisin test to validate your favourite functionality today!
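On a box where Raisin has already built and installed everything, running the suite should be as simple as the following (a sketch; see the README for the exact invocation):

./raise test              # runs the busybox-based PV and HVM guest tests locally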

Raisin, you didn’t know you needed it, you can’t live without it ;-)


The Raisin git repository is available here. The README is up to date and describes the command line interface. We also have a quickstart guide on our wiki. Raisin patches are discussed on xen-devel and follow the regular Xen development process.

How fast is Xen on ARM, really?

With Xen on ARM getting out of the early preview phase and becoming more mature, it is time to run a few benchmarks to check that the design choices paid off, the architecture is sound and the code base is solid. It is time to find out how much overhead Xen on ARM introduces and how it compares with Xen and other hypervisors on x86.

I measured the overhead by running the same benchmark on a virtual machine on Xen on ARM and on native Linux on the same hardware. It takes a bit longer to complete the benchmark inside a VM, but how much longer? The answer to that question is the virtualization overhead.


I chose the AppliedMicro X-Gene as the ARM platform to run the benchmarks on: it is an ARMv8 64-bit SoC with an 8-core CPU and 16GB of RAM. Dom0 ran with 8 vcpus and 1GB of RAM, while the virtual machine that ran the tests had 8 vcpus and 2GB of RAM. To make sure that the results are comparable, I restricted the amount of memory available to the native Linux run, so that Linux had all 8 physical cores at its disposal but only 2GB of RAM.

For the x86 tests, I used a Dell server with an Intel Xeon X5650, a 6-core CPU with HyperThreading. HyperThreading was disabled during the tests for better performance. Similarly to the ARM tests, Dom0 ran with 6 vcpus and 1GB of RAM and the virtual machine with 6 vcpus and 2GB of RAM. The native Linux run had 6 physical cores and 2GB of RAM. For the KVM tests I booted the host with 3GB of RAM, then assigned 2GB of RAM to the KVM virtual machine.
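As an illustration of how these memory limits can be enforced (a sketch, not necessarily the exact commands used for these runs): Dom0 is capped from the Xen command line, the native runs from the Linux command line, and the guest from its config file.

dom0_mem=1024M            # Xen command line: limit Dom0 to 1GB of RAM
mem=2G                    # Linux command line for the native runs: limit usable RAM to 2GB
memory = 2048             # guest config file: give the VM 2GB of RAM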

In terms of software, on both ARMv8 and x86 I used:

  • Linux 3.13 as Dom0, DomU and native kernel
  • Xen 4.4
  • OpenSUSE 13.1
  • QEMU-KVM 1.6.2 (for the KVM tests on x86)

I could not test KVM on ARMv8 because KVM support for X-Gene is not upstream in Linux 3.13.

Benchmarks – lower is better

The y-axis shows the overhead as a percentage of native: “0%” means that it is as fast as native, while “1%” means that it takes 1% longer than native Linux to complete the benchmark inside a virtual machine. Given that we are dealing with overheads, lower is better.
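In other words, for these time-based benchmarks the plotted value is overhead = (T_VM - T_native) / T_native × 100: a benchmark that takes 61.8 seconds inside the virtual machine against 60 seconds natively shows up as 3%.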


Kernbench is a popular benchmark that measures the time that it takes to compile the Linux kernel. It is a cpu and memory intensive benchmark.


PBzip2 is a parallel implementation of bzip2. This benchmark measures the time that it takes to compress a 4GB file.

SPECjbb2005 (non-compliant)

SPECjbb2005 simulates a Java server workload. It is a cpu and memory bound benchmark.
The runs are non-compliant (therefore cannot be compared with compliant runs) and the overhead is calculated on the peak warehouse alone.

Next I ran a couple of disk IO benchmarks, but both the X-Gene and the Dell server came with spinning disks for storage: the following tests showed that both disks were too slow to actually measure the virtualization overhead (which is lower than 1%).


FIO is a popular tool to measure disk performance. This benchmark uses FIO to perform a combination of random reads and writes and measures the overhead in IOPS.


PGBench is the PostgreSQL database benchmarking tool. This benchmark is disk IO bound.


Developing Xen on ARM, we have been focused on correctness and feature completeness rather than performance. Nonetheless, it introduces a very low overhead, already on par with or lower than Xen’s on x86, which in turn is lower than KVM’s on x86. Given the benefits that virtualization brings to the table, including ease of deployment and lower downtime, it really makes sense to deploy Xen on your ARM-based cloud.

SWIOTLB by Morpheus

The following monologue explains how Linux drivers are able to program a device when running in a Xen virtual machine on ARM.

The problem that needs to be solved is that Xen on ARM guests run with second stage translation enabled in hardware. That means that what the Linux kernel sees as a physical address doesn’t actually correspond to a machine address. An additional translation layer is set up by the hypervisor to do the conversion in hardware.

Many devices use DMA to read or write buffers in main memory, and they need to be programmed with the addresses of those buffers. In the absence of an IOMMU, DMA requests don’t go through the physical-to-machine translation that the hypervisor sets up for virtual machines, so devices need to be programmed with machine addresses rather than physical addresses. Hence the problem we are trying to solve.

Definitions of some of the technical terms used in this article are available at the bottom of the page.

Given the complexity of the topic, we decided to ask for help from somebody with hands-on experience in teaching others to recognize the difference between “virtual” and “real”.


At last.
Please. Come. Sit.
Xen Project Matrix

Do you realize that everything running on Xen is a virtual machine — that Dom0, the OS from which you control the rest of the system, is just the first virtual machine created by the hypervisor? Usually Xen assigns all the devices on the platform to Dom0, which runs the drivers for them.

I imagine, right now, you must be feeling a bit like Alice, tumbling down the rabbit hole?
Let me tell you why you are here.

You are here because you want to know how to program a device in Linux on Xen on ARM.
