Tag Archives: networking

PV Calls: a new paravirtualized protocol for POSIX syscalls

Let’s take a step back and look at the current state of virtualization in the software industry. x86 hypervisors were built to run a few different operating systems on the same machine. Nowadays they are mostly used to execute several instances of the same OS (Linux), each running a single server application in isolation. Containers are a better fit for this use case, but they expose a very large attack surface. It is possible to reduce the attack surface, but it is a very difficult task, one that requires minute knowledge of the app running inside; at any scale it becomes a formidable challenge. The 15-year-old hypervisor technologies, principally designed for RHEL 5 and Windows XP, are more a workaround than a solution for this use case. We need to bring them to the present and take them into the future by modernizing their design.

The typical workload we need to support is a Linux server application packaged to be self-contained, complying with the OCI Image Format or the Docker Image Specification. The app comes with all required userspace dependencies, including its own libc. It makes syscalls to the Linux kernel to access resources and functionality. This is the only interface we must support.

Many of these syscalls closely correspond to function calls which are part of the POSIX family of standards. They have well-known parameters and return values. POSIX stands for “Portable Operating System Interface”: it defines an API available on all major Unixes today, including Linux. POSIX is large to begin with, and Linux adds its own set of non-standard calls on top of it. As a result a Linux system exposes a very high number of calls and, inescapably, also a high number of vulnerabilities. It is wise to restrict syscalls by default. Linux containers struggle to do so, but hypervisors are very good at it. After all, hypervisors don’t need to provide full POSIX compatibility. By paravirtualizing hardware interfaces, Xen provides powerful functionality with a small attack surface. But PV devices are the wrong abstraction layer for Docker apps: they cause duplication of functionality between the guest and the host. For example, the network stack is traversed twice, first in DomU then in Dom0. This is unnecessary. It is better to raise the level of hypervisor abstractions by paravirtualizing a small set of syscalls directly.

PV Calls

It is far easier and more efficient to write paravirtualized drivers for syscalls than to emulate hardware, because syscalls are at a higher level and made for software. I wrote a protocol specification called PV Calls to forward POSIX calls from DomU to Dom0. I also wrote a couple of prototype Linux drivers for it that work at the syscall level. The initial set of calls covers socket, connect, accept, listen, recvmsg, sendmsg and poll. The frontend driver forwards syscall requests over a ring. The backend implements the syscalls, then returns success or failure to the caller. The protocol creates a new ring for each active socket, and the ring size is configurable on a per-socket basis. Data being received is copied to the ring by the backend, while data being sent is copied to the ring by the frontend. An event channel per ring is used to notify the other end of any activity. This tiny set of PV Calls is enough to provide networking capabilities to guests.
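To make the request/response flow more concrete, here is a small and purely illustrative C sketch of what a command-ring entry and its response could look like. The structure layout, field names and command values are invented for this post; the authoritative wire format is the one defined in the PV Calls specification on xen-devel.

    /* Illustrative only: a simplified, hypothetical layout for PV Calls
     * command-ring entries. See the protocol specification for the real
     * wire format. */
    #include <stdint.h>

    enum pvcalls_sketch_cmd {
        PVCALLS_SKETCH_SOCKET,
        PVCALLS_SKETCH_CONNECT,
        PVCALLS_SKETCH_LISTEN,
        PVCALLS_SKETCH_ACCEPT,
        PVCALLS_SKETCH_POLL,
    };

    /* Request placed on the shared command ring by the frontend. */
    struct pvcalls_sketch_request {
        uint32_t req_id;              /* echoed back in the response */
        uint32_t cmd;                 /* which forwarded call this is */
        union {
            struct {
                uint64_t id;          /* frontend-chosen socket identifier */
                uint32_t domain;      /* e.g. AF_INET */
                uint32_t type;        /* e.g. SOCK_STREAM */
                uint32_t protocol;
            } socket;
            struct {
                uint64_t id;
                uint8_t  addr[28];    /* flattened struct sockaddr */
                uint32_t len;
            } connect;
            struct {
                uint64_t id;
                uint32_t backlog;
            } listen;
        } u;
    };

    /* Response written by the backend after executing the syscall. */
    struct pvcalls_sketch_response {
        uint32_t req_id;
        uint32_t cmd;
        int32_t  ret;                 /* 0 on success or -errno on failure */
    };

Data transfer itself (the recvmsg/sendmsg path) would then go over the per-socket data ring described above rather than the command ring.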

We are still running virtual machines, but mainly to restrict the vast majority of the application’s syscalls to a safe and isolated environment. The guest operating system kernel, which is provided by the infrastructure (it doesn’t come with the app), implements syscalls for the benefit of the server application. Xen gives us the means to exploit hardware virtualization extensions to create strong security boundaries around the application. Xen PV VMs allow this approach to work even when virtualization extensions are not available, such as on top of Amazon EC2 or Google Compute Engine instances.

This solution is as secure as Xen VMs but efficiently tailored for container workloads. Early measurements show excellent performance. It also provides a couple of less obvious advantages. In Docker’s default networking model, containers’ communications appear to be made from the host IP address and containers’ listening ports are explicitly bound to the host. PV Calls are a perfect match for it: outgoing communications are made from the host IP address directly and listening ports are automatically bound to it. No additional configuration is required.

Another benefit is ease of monitoring. One of the key aspects of hardening Linux containers is keeping applications under constant observation with logging and monitoring. We should not ignore it even though Xen provides a safer environment by default. PV Calls forward networking calls made by the application to Dom0. In Dom0 we can trivially log them and detect misbehavior. More powerful (and expensive) monitoring techniques like memory introspection offer further opportunities for malware detection.

PV Calls are unobtrusive. No changes to Xen are required as the existing interfaces are enough. Changes to Linux are very limited as the drivers are self-contained. Moreover, PV Calls perform extremely well! Let’s take a look at a couple of iperf graphs (higher is better):

[Figure: iperf client benchmark]

[Figure: iperf server benchmark]

The first graph shows network bandwidth measured by running an iperf server in Dom0 and an iperf client inside the VM (or container in the case of Docker). PV Calls reach 75 Gbit/s with 4 threads, far better than netfront/netback.

The second graph shows network bandwidth measured by running an iperf server in the guest (or container in the case of Docker) and an iperf client in Dom0. In this scenario PV Calls reach 55 Gbit/s and outperform not just netfront/netback but even Docker.

The benchmarks have been run on an Intel Xeon D-1540 machine, with 8 cores (16 threads) and 32 GB of RAM. Xen is 4.7.0-rc3 and Linux is 4.6-rc2. Dom0 and DomU have 4 vCPUs each, pinned. DomU has 4 GB of RAM.

For more information on PV Calls, read the full protocol specification on xen-devel. You are welcome to join us and participate in the review discussions. Contributions to the project are very appreciated!

Xen Related Talks @ FOSDEM 2014

Going to FOSDEM’14? Well, you want to check out the schedule of the Virtualization & IaaS devroom then, and make sure you do not miss the talks about Xen. There are 4 of them, and they will provide some details about new and interesting use cases for virtualization, like in embedded systems of various kinds (from phones and tablets to network middleboxes), and about new features in the upcoming Xen release, such as PVH, and how to put them to good use.

Here are the talks, in a bit more detail:
Dual-Android on Nexus 10 using XEN, on Saturday morning
High Performance Network Function Virtualization with ClickOS, on Saturday afternoon
Virtualization in Android based and embedded systems, on Sunday morning
How we ported FreeBSD to PVH, on Sunday afternoon

There is actually more: a talk called Porting FreeBSD on Xen on ARM, in the BSD devroom, and one about MirageOS in the miscellaneous Main track, but the schedule for these has not been announced yet.

Last but certainly not least, there will be a Xen Project booth, where you can meet the members of the Xen community as well as enjoy some other, soon to be revealed, activities. Some of my colleagues from Citrix and I will be in Brussels, and we will definitely spend some time at the booth, so come and visit us. The booth will be in building K, on level 1.

Read more here: http://xenproject.org/about/events.html

Edit:

The schedules for the FreeBSD and MirageOS talks have been announced. Here they are:
Porting FreeBSD on Xen on ARM will be given on Saturday at 15:00, in the BSD devroom
MirageOS: compiling functional library operating systems will happen on Sunday at 13:00, in the misc Main track

Also, there is another Xen related talk, in the Automotive development devroom: Xen on ARM: Virtualization for the Automotive industry, on Sunday morning (11:45).

RT-Xen: Real-Time Virtualization in Xen

The researchers at Washington University in St. Louis and the University of Pennsylvania are pleased to announce, here on this blog, the release of a new and greatly improved version of the RT-Xen project. Recent years have seen increasing demand for supporting real-time systems in virtualized environments (for example, the Xen-ARM projects and several other real-time enhancements to Xen), as virtualization enables greater flexibility and reduces cost, weight and energy by breaking the correspondence between logical systems and physical systems. As an example of this, check out the video below from the 2013 Xen Project Developer Summit.

The video describes how Xen could be used in an in-vehicle infotainment system.

In order to combine real-time and virtualization, a formally defined real-time scheduler at the hypervisor level is needed to provide timing guarantees to the guest virtual machines. RT-Xen bridges the gap between real-time scheduling theory and virtualization technology by providing a suite of multi-core real-time schedulers that deliver real-time performance to domains running on the Xen hypervisor.

Background: Scheduling in Xen

In Xen, each domain’s core is abstracted as a Virtual CPU (VCPU), and the hypervisor scheduler is responsible for scheduling VCPUs. For example, the default credit scheduler assigns a weight to each domain, which determines the proportional share of CPU cycles that the domain gets. The credit scheduler works great for general purpose computing, but is not suitable for real-time applications for the following reasons:

  1. There is no reservation with the credit scheduler. For example, when two VCPUs run on a 2 GHz physical core, each gets 1 GHz. However, if another VCPU also boots on the same PCPU, the resource share shrinks to 0.66 GHz (see the short sketch after this list). The system manager has to carefully configure the number of VMs/VCPUs to ensure that each domain gets an appropriate amount of CPU resources;
  2. There is little timing predictability or real-time performance provided to the VM. If a VM is running real-time workloads (video decoding, voice processing, feedback control loops) which are periodically triggered and have a timing requirement, for example that the VM must be scheduled every 10 ms to process the data, there is no way for the VM to express this information to the underlying VMM scheduler. The existing SEDF scheduler can help with this, but it has poor support for multi-core.
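The short sketch referred to in point (1) is just the proportional-share arithmetic written out; it is my own toy illustration, not Xen code:

    /* Toy illustration of proportional sharing without reservations:
     * each additional VCPU placed on the core shrinks everyone's share. */
    #include <stdio.h>

    int main(void)
    {
        const double core_ghz = 2.0;   /* capacity of one physical core */

        for (unsigned int nr_vcpus = 2; nr_vcpus <= 3; nr_vcpus++)
            printf("%u VCPUs on the core -> %.2f GHz each\n",
                   nr_vcpus, core_ghz / nr_vcpus);
        /* Prints 1.00 GHz and then 0.67 GHz (the "0.66 GHz" quoted above,
         * rounded differently). */
        return 0;
    }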

RT-Xen: Combining real-time and virtualization

RT-Xen aims to solve this problem by providing a suite of real-time schedulers. Users can specify (budget, period, CPU mask) for each VCPU individually. The budget represents the maximum CPU resource a VCPU will get during a period; the period represents the timing quantum of the CPU resources provided to the VCPU; the CPU mask defines the subset of physical cores a VCPU is allowed to run on. For each VCPU, the budget is reset at the start of each period (all values are in milliseconds), consumed while the VCPU is executing, and deferred when the VCPU has budget but no work to do.
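A minimal sketch of this accounting, assuming invented type and function names (the real RT-Xen code is organized quite differently), could look like this:

    /* Hedged sketch of per-VCPU (budget, period, cpumask) accounting.
     * All times are in milliseconds; names are invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    struct rt_vcpu_params {
        uint64_t budget_ms;        /* max CPU time granted per period */
        uint64_t period_ms;        /* length of the replenishment period */
        uint64_t cpu_mask;         /* physical cores the VCPU may run on */
    };

    struct rt_vcpu_state {
        struct rt_vcpu_params p;
        uint64_t cur_budget_ms;    /* budget left in the current period */
        uint64_t next_release_ms;  /* absolute start of the next period */
    };

    /* Budget is reset to its full value at each period boundary. */
    static void replenish(struct rt_vcpu_state *v, uint64_t now_ms)
    {
        while (now_ms >= v->next_release_ms) {
            v->cur_budget_ms = v->p.budget_ms;
            v->next_release_ms += v->p.period_ms;
        }
    }

    /* Budget is consumed only for time the VCPU actually ran; while the
     * VCPU is idle nothing is consumed, so remaining budget is effectively
     * deferred until work arrives or the period ends. */
    static void account(struct rt_vcpu_state *v, uint64_t ran_ms)
    {
        v->cur_budget_ms = ran_ms >= v->cur_budget_ms
                               ? 0 : v->cur_budget_ms - ran_ms;
    }

    int main(void)
    {
        struct rt_vcpu_state v = {
            .p = { .budget_ms = 4, .period_ms = 10, .cpu_mask = 0xf },
            .cur_budget_ms = 4,
            .next_release_ms = 10,
        };

        account(&v, 3);     /* ran for 3 ms -> 1 ms of budget left  */
        replenish(&v, 10);  /* period boundary -> budget back to 4 ms */
        printf("budget left: %llu ms\n",
               (unsigned long long)v.cur_budget_ms);
        return 0;
    }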

Within each scheduler, users can switch between different priority schemes: earliest deadline first (EDF), where the VCPU with the earlier deadline has higher priority, or rate monotonic (RM), where the VCPU with the shorter period has higher priority. As a result, a VCPU gets not only a resource reservation (budget/period) but also explicit timing information about when that CPU resource is provided (the period). The real-time schedulers in RT-Xen deliver the desired real-time performance to the VMs based on these resource reservations.
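The two priority schemes boil down to very small comparison rules. The sketch below, again with invented names, treats a VCPU's deadline as the end of its current period, which is a common convention but not necessarily how RT-Xen implements it:

    /* Sketch of the two priority schemes; names are illustrative. */
    #include <stdbool.h>
    #include <stdint.h>

    struct rt_prio_info {
        uint64_t period_ms;          /* the VCPU's period */
        uint64_t cur_deadline_ms;    /* absolute end of the current period */
    };

    /* EDF: the VCPU whose current deadline comes sooner runs first. */
    bool edf_higher_prio(const struct rt_prio_info *a,
                         const struct rt_prio_info *b)
    {
        return a->cur_deadline_ms < b->cur_deadline_ms;
    }

    /* RM: the VCPU with the shorter period always has higher priority. */
    bool rm_higher_prio(const struct rt_prio_info *a,
                        const struct rt_prio_info *b)
    {
        return a->period_ms < b->period_ms;
    }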

To be more specific, the two multi-core schedulers in RT-Xen are:

  • RT-global: uses a global run queue holding all runnable VCPUs. It is CPU-mask aware and provides better resource utilization, as VCPUs can migrate freely between physical cores (within their CPU masks).
  • RT-partition: uses a run queue per physical CPU. In this way, each physical CPU only looks at its own run queue to make scheduling decisions, which incurs less overhead and potentially better cache performance. However, load balancing between physical cores is not provided in the current release. A rough data-structure sketch of the two designs follows.
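The sketch below contrasts the two designs; the data structures are invented for this post and are not taken from the RT-Xen sources:

    /* Contrast between a single shared run queue and per-CPU run queues. */
    #include <stdint.h>

    #define MAX_PCPUS 16

    struct rt_vcpu;                   /* a runnable VCPU, defined elsewhere */

    struct rt_runq {
        struct rt_vcpu *head;         /* VCPUs ordered by EDF or RM priority */
    };

    /* RT-global: one queue shared by all cores. Any core (subject to each
     * VCPU's CPU mask) can pick the highest-priority runnable VCPU, so
     * VCPUs migrate freely, at the cost of contention on the shared queue. */
    struct rt_global_sched {
        struct rt_runq runq;
    };

    /* RT-partition: one queue per physical core. Each core only consults
     * its own queue, giving lower overhead and better cache behaviour, but
     * no load balancing across cores. */
    struct rt_partition_sched {
        struct rt_runq runq[MAX_PCPUS];
    };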

Source Code and References

The developers of RT-Xen are looking closely at how to integrate both schedulers into mainline Xen. In the meantime, please check out the publications at [EMSOFT’14], [EMSOFT’11], [RTAS’12] and the source code.

Bringing Open Source Communities closer together

For the last several years, the Xen developer community has been increasing its ability to collaborate well with other projects. We succeeded in finally getting the necessary infrastructure for dom0 support into Linux in 2011. We have upstreamed the most important changes to QEMU, and will be using an upstream-QEMU-based tree in the Xen 4.3 release. Additionally, during the last several months we have improved Xen support in libvirt. We are not just looking to upstream projects; we are also improving our relationship with downstream users, like Linux distros. We worked closely with Debian, Ubuntu and CentOS, for example announcing the availability of Xen packages for CentOS 6 at FOSDEM a few months ago.

Today is the perfect day to announce that in addition to these efforts, we have been working behind the scenes on some other exciting new initiatives designed to increase our collaboration and influence with other projects.


Why CloudStack joining Apache is good news!

Today, Citrix and the Apache Software Foundation (ASF) announced that Citrix will relicense the CloudStack open source project under the Apache License and contribute the CloudStack code to the ASF. Before I explain why this is good for the Xen community and the Open Cloud, I wanted to congratulate CloudStack on becoming the first cloud platform in the industry to join the ASF.

CloudStack has always been open source, with Citrix as the vendor behind the project. Moving from a privately operated open source community to the ASF has a number of implications: Citrix is giving up control over the project and it is moving to a collaborative and meritocratic development process, which values community, diversity and openness. For a community guy like me this is really exciting!

So why is this good news for Xen? In fact, the internal discussions preceding this decision already made a big impact: more staff within Citrix are engaged with open source and are actively supporting and coming to understand projects such as Xen, Linux and of course CloudStack. My experience as an open source guy in various organisations is that open source and community can easily be made the responsibility of a few people and then be forgotten about. However, to be truly successful over the long haul, knowledge of and support for open source in an organization need to be broad. In the last few months the level of understanding and support for Xen across Citrix has increased hugely. You may not yet see the impact of all this: good initiatives and change need planning and take time. Don’t get me wrong: on many counts Xen is a very successful project. We have an active developer community, we have a huge user base, many successful products and businesses were built on Xen, etc. But the project could have done and can do better!

When I was at Scale 10x earlier this year, Greg DeKoenigsberg from Eucalyptus said in his keynote that most cloud projects are open source today, well, sort of! To me that said it all: the more cloud-related projects move from single-vendor-driven projects to independent and community-driven projects, the better for the user and the “Open Cloud”. Why? Simple: independent projects increase the user’s ability to be in control of their infrastructure by influencing the projects they care about. Thus, CloudStack becoming an Apache project is a major milestone for achieving a better and more open cloud. Of course, the same thinking lies behind the creation of the OpenStack Foundation, which we will hopefully see later this year.

Building OpenNebula Clouds on XCP

The Xen.org and OpenNebula.org open source communities are working together to add XCP support to OpenNebula. This collaboration will produce the OpenNebula Toolkit for XCP, which will be hosted as a freely available open source project on OpenNebula.org. The XCP project team and the Xen.org community will provide technical guidance and assistance to the OpenNebula open-source project.

“We are really excited to collaborate with Xen.org in offering Xen Cloud Platform support. This will be a huge step forward towards achieving a complete open-source stack for cloud infrastructure deployment. We are planning to have a first prototype of the integration by November,” said Ignacio M. Llorente, Director of the OpenNebula Project and Chief Executive Advisor at C12G Labs.

“I am really pleased that Xen and OpenNebula are collaborating on a new project, building on a history that goes all the way back to 2008, OpenNebula 1 and Xen 3.1. This renewed collaboration, together with projects such as Kronos, which will deliver XCP with different Linux distributions, will make building rich Xen-based clouds much easier,” said Ian Pratt, Chairman of Xen.org and SVP, Products at Bromium.

OpenNebula is a fully open-source, Apache licensed toolkit for on-premise IaaS cloud computing, offering a comprehensive solution for the management of virtualized data centres to enable private, public and hybrid clouds. OpenNebula interoperability makes cloud an evolution by offering common cloud standards and interfaces, leveraging existing IT infrastructure, protecting existing investments, and avoiding vendor lock-in. OpenNebula is used by many research projects as a powerful tool for innovation and interoperability, and by thousands of organizations to build large-scale production clouds using KVM, Xen and VMware. This new collaboration will extend Xen support in OpenNebula to include XCP, bringing the rich capabilities of XenAPI to OpenNebula.

XCP is an open source, GPLv2-licensed, enterprise-ready server virtualization and cloud computing platform, delivering the Xen Hypervisor with support for a range of guest operating systems (including Windows® and Linux®), network and storage support, and management tools in a single, tested, installable image. XCP addresses the needs of cloud providers, hosting services and data centres by combining the isolation and multi-tenancy capabilities of the Xen hypervisor with enhanced security, storage and network virtualization technologies to offer a rich set of virtual infrastructure cloud services. In addition, XCP addresses user requirements for security, availability, performance and isolation across both private and public clouds. The collaboration will add OpenNebula support to the list of cloud orchestration stacks that build on top of XCP.