Tag Archives: xen

PV Calls: a new paravirtualized protocol for POSIX syscalls

Let’s take a step back and look at the current state of virtualization in the software industry. x86 hypervisors were built to run a few different operating systems on the same machine. Nowadays they are mostly used to execute several instances of the same OS (Linux), each running a single server application in isolation. Containers are a better fit for this use case, but they expose a very large attack surface. It is possible to reduce that surface, but doing so is a very difficult task that requires intimate knowledge of the app running inside. At any scale it becomes a formidable challenge. The 15-year-old hypervisor technologies, principally designed for RHEL 5 and Windows XP, are more a workaround than a solution for this use case. We need to bring them to the present and take them into the future by modernizing their design.

The typical workload we need to support is a Linux server application packaged to be self-contained, complying with the OCI Image Format or the Docker Image Specification. The app comes with all required userspace dependencies, including its own libc. It makes syscalls to the Linux kernel to access resources and functionality. This is the only interface we must support.

Many of these syscalls closely correspond to function calls that are part of the POSIX family of standards. They have well-known parameters and return values. POSIX stands for “Portable Operating System Interface”: it defines an API available on all major Unixes today, including Linux. POSIX is large to begin with, and Linux adds its own set of non-standard calls on top of it. As a result a Linux system exposes a very high number of calls and, inescapably, also a high number of vulnerabilities. It is wise to restrict syscalls by default. Linux containers struggle with this, but hypervisors are very accomplished in this respect. After all, hypervisors don’t need full POSIX compatibility. By paravirtualizing hardware interfaces, Xen provides powerful functionality with a small attack surface. But PV devices are the wrong abstraction layer for Docker apps. They cause duplication of functionality between the guest and the host. For example, the network stack is traversed twice, first in the DomU and then in Dom0. This is unnecessary. It is better to raise the hypervisor’s abstractions by paravirtualizing a small set of syscalls directly.
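To make the syscall-restriction point concrete, here is a minimal sketch of the mechanism Linux containers use for it: a seccomp-BPF filter installed by the process itself. The choice of getpriority as the denied call, and the function name, are purely illustrative, not part of PV Calls:

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/syscall.h>

/* Deny one arbitrarily chosen syscall (getpriority) with EPERM and
 * allow everything else. Returns the errno observed by the caller
 * after the filter is installed, or -1 if installation failed. */
static int demo_deny_getpriority(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number from the seccomp_data buffer. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* If it is getpriority, fall through to the EPERM return... */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpriority, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        /* ...otherwise allow the call. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
        return -1;

    errno = 0;
    getpriority(PRIO_PROCESS, 0); /* now rejected by the filter */
    return errno;
}
```

The hard part, which this sketch glosses over, is deciding which calls to allow: that is exactly the per-application knowledge that makes container hardening so difficult.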

PV Calls

It is far easier and more efficient to write paravirtualized drivers for syscalls than to emulate hardware, because syscalls are at a higher level and made for software. I wrote a protocol specification called PV Calls to forward POSIX calls from DomU to Dom0. I also wrote a couple of prototype Linux drivers for it that work at the syscall level. The initial set of calls covers socket, connect, accept, listen, recvmsg, sendmsg and poll. The frontend driver forwards syscall requests over a ring. The backend implements the syscalls, then returns success or failure to the caller. The protocol creates a new ring for each active socket, and the ring size is configurable on a per-socket basis. Received data is copied to the ring by the backend, while data to be sent is copied to the ring by the frontend. An event channel per ring is used to notify the other end of any activity. This tiny set of PV Calls is enough to provide networking capabilities to guests.
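The request/response flow can be sketched with a toy in-memory ring. The struct layouts, field names and ring mechanics below are invented for illustration only; the authoritative wire format, ring setup and event channel handling are defined in the PV Calls specification on xen-devel:

```c
#include <stdint.h>

/* Hypothetical, simplified command numbers and message layouts --
 * NOT the wire format from the PV Calls specification. */
#define PVCALLS_SOCKET  0
#define PVCALLS_CONNECT 1

struct pvcalls_request {
    uint32_t req_id;   /* echoed back in the response */
    uint32_t cmd;      /* which forwarded call: socket, connect, ... */
    uint64_t args[4];  /* call-specific arguments */
};

struct pvcalls_response {
    uint32_t req_id;
    int32_t  ret;      /* return value of the syscall in the backend */
};

/* A toy single-producer ring standing in for the shared ring page. */
#define RING_SLOTS 8
struct ring {
    struct pvcalls_request  req[RING_SLOTS];
    struct pvcalls_response rsp[RING_SLOTS];
    uint32_t req_prod, req_cons;
    uint32_t rsp_prod, rsp_cons;
};

/* Frontend: forward a syscall request over the ring. */
static void frontend_submit(struct ring *r, uint32_t id, uint32_t cmd)
{
    struct pvcalls_request *req = &r->req[r->req_prod++ % RING_SLOTS];
    req->req_id = id;
    req->cmd = cmd;
    /* here the frontend would notify the backend via the event channel */
}

/* Backend: consume pending requests, perform the calls, post results. */
static void backend_service(struct ring *r)
{
    while (r->req_cons != r->req_prod) {
        struct pvcalls_request  *req = &r->req[r->req_cons++ % RING_SLOTS];
        struct pvcalls_response *rsp = &r->rsp[r->rsp_prod++ % RING_SLOTS];
        rsp->req_id = req->req_id;
        rsp->ret = 0; /* pretend the forwarded syscall succeeded */
    }
    /* and notify the frontend back via the event channel */
}
```

In the real drivers the ring lives in shared memory granted by the guest, producer/consumer indices need memory barriers, and notifications travel over Xen event channels rather than direct function calls.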

We are still running virtual machines, but mainly to restrict the vast majority of the application’s syscalls to a safe and isolated environment. The guest operating system kernel, which is provided by the infrastructure (it doesn’t come with the app), implements syscalls for the benefit of the server application. Xen gives us the means to exploit hardware virtualization extensions to create strong security boundaries around the application. Xen PV VMs enable this approach to work even when virtualization extensions are not available, such as on top of Amazon EC2 or Google Compute Engine instances.

This solution is as secure as Xen VMs but efficiently tailored for container workloads. Early measurements show excellent performance. It also provides a couple of less obvious advantages. In Docker’s default networking model, containers’ communications appear to be made from the host IP address, and containers’ listening ports are explicitly bound to the host. PV Calls are a perfect match for it: outgoing communications are made from the host IP address directly, and listening ports are automatically bound to it. No additional configuration is required.

Another benefit is ease of monitoring. One of the key aspects of hardening Linux containers is keeping applications under constant observation with logging and monitoring. We should not ignore it even though Xen provides a safer environment by default. PV Calls forward networking calls made by the application to Dom0. In Dom0 we can trivially log them and detect misbehavior. More powerful (and expensive) monitoring techniques like memory introspection offer further opportunities for malware detection.

PV Calls are unobtrusive. No changes to Xen are required as the existing interfaces are enough. Changes to Linux are very limited as the drivers are self-contained. Moreover, PV Calls perform extremely well! Let’s take a look at a couple of iperf graphs (higher is better):

iperf client

iperf server

The first graph shows network bandwidth measured by running an iperf server in Dom0 and an iperf client inside the VM (or container, in the case of Docker). PV Calls reach 75 Gbit/s with 4 threads, far better than netfront/netback.

The second graph shows network bandwidth measured by running an iperf server in the guest (or container, in the case of Docker) and an iperf client in Dom0. In this scenario PV Calls reach 55 Gbit/s and outperform not just netfront/netback but even Docker.

The benchmarks were run on an Intel Xeon D-1540 machine with 8 cores (16 threads) and 32 GB of RAM. Xen is 4.7.0-rc3 and Linux is 4.6-rc2. Dom0 and DomU have 4 vcpus each, pinned. DomU has 4 GB of RAM.

For more information on PV Calls, read the full protocol specification on xen-devel. You are welcome to join us and participate in the review discussions. Contributions to the project are very appreciated!

Xen Project Hackathon 16 : Event Report

We just wrapped up another successful Xen Project Hackathon, an annual event hosted by Xen Project member companies, typically at their corporate offices. This year’s event was hosted by ARM at its Cambridge HQ. 42 delegates from Aporeto, ARM, Assured Information Security, Automotive Electrical Systems, BAE Systems, Bromium, Citrix, GlobalLogic, OnApp, Onets, Oracle, StarLab, SUSE and Vates descended on Cambridge. A big thank you (!) to ARM, and in particular to Thomas Molgaard, for organising the event and the social activities afterwards.

Here are a few images that helped capture the event:

Taking a breather and photo opp outside of ARM headquarters in Cambridge

Working on solving the mysteries of the world

Continuing to work hard on solving the mysteries of the world

Xen Project Hackathons have evolved into a format of structured problem-solving sessions that scales up to 50 people. We combine this with a more traditional hackathon approach, in which programmers (and others involved in software development) collaborate intensively on software projects.

This year’s event was particularly productive because all our core developers and the project’s leadership were present. We focused on a lot of topics, but two of our main themes this year revolved around security and community development. We’ll cover these topics in more detail later, along with how they fit into our next release, 4.7, and development going forward; below is a little taste of some of the other themes of this year’s Hackathon sessions:

  • Security improvements: a trimmed-down QEMU to reduce the attack surface; de-privileging QEMU and the x86 emulator to reduce the impact of security vulnerabilities in those components; XSplice; KConfig support, which allows parts of Xen to be removed at compile time; run-time disablement of Xen features to reduce the attack surface; disaggregation; and enabling XSM (Xen’s equivalent of Linux Security Modules, the framework behind SELinux) by default.
  • Security features: We had two sessions on the future of XSplice (first version to be released in Xen 4.7), which allows users of Xen to apply security fixes to a running Xen instance, with no need to reboot.
  • Robustness: A session on restartable Dom0 and driver domains, which again will significantly reduce the overhead of applying security patches.
  • Community and code review: A couple of sessions on optimising our working practices: most notably some clarifications to the maintainer role and how we can make code reviews more efficient.
  • Virtualization Modes: The next stage of PVH, which combines the best of HVM and PV. We also discussed some functionality currently being developed in Linux on which PVH depends.
  • Making Development more Scalable: A number of sessions to improve the toolstack and libxl. We covered topics such as making storage support pluggable via a plug-in architecture, making it easier to develop new PV drivers to support automotive and embedded vendors, and improvements to our build system, testing, stub domains and xenstored.
  • ARM support: There were a number of planning sessions for Xen ARM support. We covered the future roadmap, how to implement PCI passthrough, and how we can improve testing for the increasing range of ARM HW with support for virtualization, also applicable outside the server space.

There were many more sessions covering performance, scalability and other topics. Session hosts post meeting notes on xen-devel@ (search for “Hackathon” in the subject line) if you want to explore any topic in more detail. To make it easier for people who do not follow our development lists, we also posted links to Hackathon-related xen-devel@ discussions on our wiki.

Besides providing an opportunity to meet face-to-face, build bridges and solve problems, we always make sure that we have social events. After all Hackathons should be fun and bring people together. This year we had a dinner in Cambridge and of course the obligatory punting trip, which is part of every Cambridge trip.

Embarking on the punting journey

Continuing to explore the mysteries of the universe, while on a boat

Again, a big thanks to ARM for hosting the event! Also, a reminder that we’ll be hosting our Xen Project Developer Summit next August in Toronto, Canada. This event will happen directly after LinuxCon North America and is a great opportunity to learn more about Xen Project development and what’s happening within the Linux Foundation ecosystem at large. CFPs are still open until May 6th!

Xen Project 4.5.2 Maintenance Release Available

I am pleased to announce the release of Xen 4.5.2. Xen Project Maintenance releases are released roughly every 4 months, in line with our Maintenance Release Policy. We recommend that all users of the 4.5 stable series update to this point release.

Xen 4.5.2 is available immediately from its git repository:

    xenbits.xenproject.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.5
    (tag RELEASE-4.5.2)

or from the Xen Project download page at www.xenproject.org/downloads/xen-archives/xen-45-series/xen-452.html.

This release contains many bug fixes and improvements. For a complete list of changes in this release, please check the list of changes on the download page.

Will Docker Replace Virtual Machines?

Docker is certainly the most influential open source project of the moment. Why is Docker so successful? Is it going to replace Virtual Machines? Will there be a big switch? If so, when?

Let’s look at the past to understand the present and predict the future. Before virtual machines, system administrators used to provision physical boxes to their users. The process was cumbersome, not completely automated, and it took hours if not days. When something went wrong, they had to run to the server room to replace the physical box.

With the advent of virtual machines, DevOps could install any hypervisor on all their boxes, then simply provision new virtual machines upon request from their users. Provisioning a VM took minutes instead of hours and could be automated. The underlying hardware made less of a difference and was mostly commoditized. If one needed more resources, she would just create a new VM. If a physical machine broke, the admin just migrated or resumed her VMs onto a different host.

Finer-grained deployment models became viable and convenient. Users were no longer forced to run all their applications on the same box just to exploit the underlying hardware to the fullest. One could run a VM with the database, another with middleware and a third with the webserver, without worrying about hardware utilization. The people buying the hardware and the people architecting the software stack could work independently in the same company, without interference. The new interface between the two teams had become the virtual machine. Solution architects could cheaply deploy each application on a different VM, reducing their maintenance costs significantly. Software engineers loved it. This might have been the biggest innovation introduced by hypervisors.

A few years passed and everybody in the business got accustomed to working with virtual machines. Startups don’t even buy server hardware anymore, they just shop on Amazon AWS. One virtual machine per application is the standard way to deploy software stacks.

Application deployment, however, hadn’t changed much since the ’90s. It still involved installing a Linux distro, mostly built for physical hardware, installing the required deb or rpm packages, and finally installing and configuring the application that one actually wanted to run.

In 2013 Docker came out with a simple yet effective tool to create, distribute and deploy applications wrapped in a nice format to run in independent Linux containers. It comes with a registry that is like an app store for these applications, which I’ll call “cloud apps” for clarity. Deploying the Nginx webserver had just become one “docker pull nginx” away. This is much quicker and simpler than installing the latest Ubuntu LTS. Docker cloud apps come preconfigured and without any of the unnecessary packages that are unavoidably installed by Linux distros. In fact the Nginx Docker cloud app is produced and distributed by the Nginx community directly, rather than by Canonical or Red Hat.

Docker’s outstanding innovation is the introduction of a standard format for cloud applications, together with the registry to distribute them. Linux containers, rather than VMs, are used to run these cloud apps. Containers had been available for years, but they weren’t particularly popular outside Google and a few other circles. Although they offer very good performance, they have fewer features and weaker isolation than virtual machines. As a rising star, Docker suddenly made Linux containers popular, but containers were not the reason behind Docker’s success; their use was incidental.

What is the problem with containers? Their live-migration support is still immature, and they cannot run non-native workloads (Windows on Linux or Linux on Windows). But the primary challenge with containers is security: their attack surface is far larger than that of virtual machines. In fact, multi-tenant container deployments are strongly discouraged by Docker, CoreOS, and everybody else in the industry. With virtual machines you don’t have to worry about who is going to use them or how they will be used. With containers, on the other hand, only containers that belong to the same user should run on the same host. Amazon and Google offer container hosting, but they both run each container on top of a separate virtual machine for isolation and security. Maybe inefficient, but certainly simple and effective.

People are starting to notice this. At the beginning of the year a few high profile projects launched to bring the benefits of virtual machines to Docker, in particular Clear Linux by Intel and Hyper. Both of them use conventional virtual machines to run Docker cloud applications directly (no Linux containers are involved). We did a few tests with Xen: tuning the hypervisor for this use case allowed us to reach the same startup times offered by Linux containers, retaining all the other features. A similar effort by Intel for Xen is being presented at the Xen Developer Summit and Hyper is also presenting their work.

This new direction has the potential to deliver the best of both worlds to our users: the convenience of Docker with the security of virtual machines. Soon Docker might not be fighting virtual machines at all, Docker could be the one deploying them.

A Chinese translation of the article is available here: http://dockone.io/article/598

Xen Project Hypervisor versions 4.3.4 and 4.4.2 are available

I am pleased to announce the release of Xen 4.3.4 and Xen 4.4.2. Both releases are available immediately from their git repositories and download pages (see below). We recommend that all users of the 4.3 and 4.4 stable series update to these latest point releases.

Xen 4.3.4

This release is available immediately from its git repository at
http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.3
(tag RELEASE-4.3.4) or from the XenProject download page.

Note that this is expected to be the last release of the 4.3 stable series. After this release, the tree will be switched to security-only maintenance mode.

This fixes the following critical vulnerabilities:

  • CVE-2014-5146, CVE-2014-5149 / XSA-97: Long latency virtual-mmu operations are not preemptible
  • CVE-2014-7154 / XSA-104: Race condition in HVMOP_track_dirty_vram
  • CVE-2014-7155 / XSA-105: Missing privilege level checks in x86 HLT, LGDT, LIDT, and LMSW emulation
  • CVE-2014-7156 / XSA-106: Missing privilege level checks in x86 emulation of software interrupts
  • CVE-2014-7188 / XSA-108: Improper MSR range used for x2APIC emulation
  • CVE-2014-8594 / XSA-109: Insufficient restrictions on certain MMU update hypercalls
  • CVE-2014-8595 / XSA-110: Missing privilege level checks in x86 emulation of far branches
  • CVE-2014-8866 / XSA-111: Excessive checking in compatibility mode hypercall argument translation
  • CVE-2014-8867 / XSA-112: Insufficient bounding of “REP MOVS” to MMIO emulated inside the hypervisor
  • CVE-2014-9030 / XSA-113: Guest effectable page reference leak in MMU_MACHPHYS_UPDATE handling
  • CVE-2014-9065, CVE-2014-9066 / XSA-114: p2m lock starvation
  • CVE-2015-0361 / XSA-116: xen crash due to use after free on hvm guest teardown
  • CVE-2015-1563 / XSA-118: arm: vgic: incorrect rate limiting of guest triggered logging
  • CVE-2015-2152 / XSA-119: HVM qemu unexpectedly enabling emulated VGA graphics backends
  • CVE-2015-2044 / XSA-121: Information leak via internal x86 system device emulation
  • CVE-2015-2045 / XSA-122: Information leak through version information hypercall
  • CVE-2015-2151 / XSA-123: Hypervisor memory corruption due to x86 emulator flaw

Additionally, a bug in the fix for CVE-2014-3969 / XSA-98 (which got assigned CVE-2015-2290) got addressed.

Sadly, the workaround for CVE-2013-3495 / XSA-59 (Intel VT-d Interrupt Remapping engines can be evaded by native NMI interrupts) still can’t be guaranteed to cover all affected chipsets; Intel is still working on providing us with a complete list.

Apart from those there are many further bug fixes and improvements.

We recommend all users of the 4.3 stable series to update to this last point release.

Xen 4.4.2

This release is available immediately from its git repository at
http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.4
(tag RELEASE-4.4.2) or from the XenProject download page.

This fixes the following critical vulnerabilities:

  • CVE-2014-5146, CVE-2014-5149 / XSA-97: Long latency virtual-mmu operations are not preemptible
  • CVE-2014-7154 / XSA-104: Race condition in HVMOP_track_dirty_vram
  • CVE-2014-7155 / XSA-105: Missing privilege level checks in x86 HLT, LGDT, LIDT, and LMSW emulation
  • CVE-2014-7156 / XSA-106: Missing privilege level checks in x86 emulation of software interrupts
  • CVE-2014-6268 / XSA-107: Mishandling of uninitialised FIFO-based event channel control blocks
  • CVE-2014-7188 / XSA-108: Improper MSR range used for x2APIC emulation
  • CVE-2014-8594 / XSA-109: Insufficient restrictions on certain MMU update hypercalls
  • CVE-2014-8595 / XSA-110: Missing privilege level checks in x86 emulation of far branches
  • CVE-2014-8866 / XSA-111: Excessive checking in compatibility mode hypercall argument translation
  • CVE-2014-8867 / XSA-112: Insufficient bounding of “REP MOVS” to MMIO emulated inside the hypervisor
  • CVE-2014-9030 / XSA-113: Guest effectable page reference leak in MMU_MACHPHYS_UPDATE handling
  • CVE-2014-9065, CVE-2014-9066 / XSA-114: p2m lock starvation
  • CVE-2015-0361 / XSA-116: xen crash due to use after free on hvm guest teardown
  • CVE-2015-1563 / XSA-118: arm: vgic: incorrect rate limiting of guest triggered logging
  • CVE-2015-2152 / XSA-119: HVM qemu unexpectedly enabling emulated VGA graphics backends
  • CVE-2015-2044 / XSA-121: Information leak via internal x86 system device emulation
  • CVE-2015-2045 / XSA-122: Information leak through version information hypercall
  • CVE-2015-2151 / XSA-123: Hypervisor memory corruption due to x86 emulator flaw

Additionally, a bug in the fix for CVE-2014-3969 / XSA-98 (which got assigned CVE-2015-2290) got addressed.

Sadly, the workaround for CVE-2013-3495 / XSA-59 (Intel VT-d Interrupt Remapping engines can be evaded by native NMI interrupts) still can’t be guaranteed to cover all affected chipsets; Intel is still working on providing us with a complete list.

Apart from those there are many further bug fixes and improvements.

We recommend all users of the 4.4 stable series to update to this latest point release.

Xen.org at LinuxWorld (August 5 – 7)

The Xen.org community will have a booth (#210) at the joint LinuxWorld / Next Generation Data Center event held in San Francisco, August 5-7, at the Moscone Center. The booth will be staffed by me and by community members with spare time during the event. If you are planning to attend this event and would like to help, please let me know. I will be creating a schedule for community members as we get closer to August.