Xen network: the future plan

As many of you might have (inevitably) noticed, the Xen frontend / backend network drivers in Linux suffered a regression several months back after the XSA-39 fix (various reports here, here and here). Fortunately that’s now fixed (see the most important patch of that series) and the back-porting process to stable kernels is on-going. Now that we’ve put everything back into a stable-ish state, it’s time to look to the future and prepare the Xen network drivers for the next stage. I mainly work on the Linux drivers, but some of the backend improvement ideas should benefit all frontends.

The goal is to improve network performance and scalability without giving up the advanced security features Xen offers. Just to name a few items:

Split event channels: In the old network drivers there’s only one event channel between frontend and backend. That event channel is used by the frontend to send TX notifications and RX buffer allocation notifications to the backend, and by the backend to send TX completion and RX notifications to the frontend. This is definitely not ideal, as TX and RX interfere with each other. With a small change to the protocol we can split TX and RX notifications into two event channels. This work is now in David Miller’s tree (patch for backend, frontend and document).

1:1 model netback: The current model of netback is M:N. That is, we create nr_vcpus kthreads in Dom0 and attach every DomU’s vif to a specific kthread. The old model has worked well so far, but it certainly has drawbacks. One significant drawback is that fairness among vifs is somewhat poor, as vifs are statically attached to one kthread. It’s easy to run into a situation where several vifs on a kthread compete for CPU time while another worker thread sits idle. The idea behind the 1:1 model is to create one kthread for each vif and trust the backend scheduler to do the right thing. Preliminary tests show that this model indeed improves fairness. What’s more, this model is also a prerequisite for implementing multi-queue in Xen network drivers. This work is undergoing testing (with many nice-looking graphs) and discussion (1, 2).

Multi-page ring: The TX / RX rings are only one page each. According to Konrad’s calculation, we can only have ~898K of in-flight data on the ring. Hardware is becoming faster and faster, which can make the ring a bottleneck. Extending the ring should be generally useful, and bulk transfers like NFS stand to benefit in particular. All other Xen frontend / backend drivers can also benefit from the new multi-page ring Xenbus API.

Multi-queue vif: This should help vifs scale better with the number of vcpus. The XenServer team at Citrix is working on this (see the discussion thread).

The ideas listed above are concrete; we also have many other, vaguer ideas :-) :

Zero-copy TX path: This idea is not likely to be upstreamed in the near future, as there are some prerequisite patches for the core network code, and we are now also reconsidering whether copying is really such a bad idea – modern hardware copies data blazingly fast, while the TLB shoot-down required for mapping is expensive (at least that’s our impression at the moment). The only way to verify whether zero-copy is worthwhile is to hack up a prototype. If TLB shoot-down turns out to be less expensive than we expect and the gain outweighs the cost, we might consider adding zero-copy TX. I implemented a vhost-net-like netback to verify this. An unexpected side effect of this prototype is that it also revealed a problem in the notification scheme – DomU TX sends far more notifications than necessary. We need to solve the notification problem before moving on.

Separate ring indices: The producer and consumer indices live on the same cache line. On present hardware that means the reader and writer compete for that cache line, causing it to ping-pong between sockets. Fixing this involves altering the ring protocol.

Cache alignment for ring request / response: Pretty self-explanatory. This also involves altering the ring protocol.

Affinity of FE / BE on the same NUMA node: The Xen scheduler, with some help from the toolstack, can make sure that the Dom0 vCPU where the backend runs is kept on the same NUMA node as the vCPUs of the DomU (where the frontend runs), for improved locality and, hence, performance. We discussed this during the Xen Hackathon in May and we also have an email thread on Xen-devel.

That’s pretty much it. If you’re interested in any of the items above, don’t hesitate to mail your thoughts to Xen-devel. You can also find our TODO list on the Xen wiki.