Background and Motivation
This blog has already hosted a couple of stories about what is going on in the Xen development community to improve Xen's NUMA support. If you are interested in the background and motivation, feel free to check them out.
Long story short, they explain how NUMA is becoming more and more common and why it is therefore very important to: (1) achieve a good initial placement when creating a new VM; (2) have a solution that is both flexible and effective enough to take advantage of that placement during the whole VM lifetime. The former basically means: <<When starting a new Virtual Machine, which NUMA node should I “associate” it with?>>. The latter is more about: <<How strongly should the VM be associated with that NUMA node? Could it, perhaps temporarily, run elsewhere?>>.
NUMA Placement and Scheduling
So, here’s the situation: automatic initial placement has been included in Xen 4.2, inside libxl. This means that, when a VM is created (provided that happens through libxl), a set of heuristics decides on which NUMA node its memory is allocated, and the vCPUs of the VM are statically pinned to the pCPUs of that node.
On the other hand, NUMA aware scheduling has been under development during the last few months, and is going to be included in Xen 4.3. This means that, instead of being statically pinned, the vCPUs of the VM will strongly prefer to run on the pCPUs of the NUMA node, but they can run somewhere else as well… And this is what this status report is all about.
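To make the difference between static pinning and NUMA aware scheduling more concrete, here is a toy model in Python. It is not the code of the Xen credit scheduler: the function name `pick_pcpu`, the set-based representation of pCPUs and the "lowest-numbered idle pCPU wins" tie-break are all assumptions made purely for illustration. The only idea taken from the text above is that a vCPU first looks at the pCPUs of its preferred node and, only if none is available, considers the other pCPUs it is allowed to run on.

```python
def pick_pcpu(idle_pcpus, node_pcpus, allowed_pcpus):
    """Pick a pCPU for a waking vCPU, or return None if no allowed pCPU is idle.

    idle_pcpus    -- set of pCPU ids that are currently idle
    node_pcpus    -- set of pCPU ids belonging to the vCPU's preferred NUMA node
    allowed_pcpus -- set of pCPU ids the vCPU may run on at all
                     (equal to node_pcpus when the vCPU is statically pinned)
    All names and the tie-break rule are hypothetical, for illustration only.
    """
    # Step 1: strongly prefer an idle pCPU on the preferred NUMA node.
    preferred = idle_pcpus & node_pcpus & allowed_pcpus
    if preferred:
        return min(preferred)

    # Step 2: otherwise fall back to any idle pCPU the vCPU is allowed to use.
    fallback = idle_pcpus & allowed_pcpus
    if fallback:
        return min(fallback)

    # Nothing idle: the vCPU has to wait in a runqueue.
    return None


if __name__ == "__main__":
    node0 = {0, 1, 2, 3}          # pCPUs of the preferred node (assumed layout)
    all_pcpus = set(range(8))     # a hypothetical 2-node, 8-pCPU host

    # NUMA aware scheduling: node 0 is preferred, but every pCPU is allowed.
    print(pick_pcpu({2, 5}, node0, all_pcpus))   # an idle pCPU on node 0 exists
    print(pick_pcpu({5, 6}, node0, all_pcpus))   # node 0 is busy, run elsewhere

    # Static pinning: only the pCPUs of node 0 are allowed.
    print(pick_pcpu({5, 6}, node0, node0))       # idle pCPUs exist but are off limits
```

In this model, pinning is just the special case where the allowed set equals the node's set, which is why a pinned vCPU can be left waiting while pCPUs on the other node sit idle.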
NUMA Aware Scheduling Development
The development of this new feature started pretty early in the Xen 4.3 development cycle, and has undergone a couple of major reworks along the way. The very first RFC for it dates back to the Xen 4.2 development cycle, and it showed interesting performance already. However, what was decided at the time was to concentrate only on placement, and leave scheduling for the future. After that, v1, v2 and v3 of a patch series entirely focused on NUMA aware scheduling followed. It was discussed during XenSummit NA 2012, in a talk about future NUMA development in Xen in general (slides here). While at it, a couple of existing scheduling anomalies of the stock credit scheduler were found and fixed (for instance, the one described here).
Right now, we can say we are almost done. In fact, v3 received positive feedback and is basically what is going to be merged, and so what Xen 4.3 will ship. Actually, there is going to be a v4 (being released on xen-devel at the same time as this blog post), but it only accommodates very minor changes, and it is 100% functionally equal to v3.
Any Performance Numbers?
Sure thing! Benchmarks similar to the ones already described in the previous blog posts have been performed. More specifically, directly from the cover letter of the v3 of the patch series, here’s what has been done:
I ran the following benchmarks (again):
* SpecJBB is all about throughput, so pinning is likely the ideal solution.
* Sysbench-memory is the time it takes for writing a fixed amount of memory (and then it is the throughput that is measured). What we expect is locality to be important, but at the same time the potential imbalances due to pinning could have a say in it.
* LMBench-proc is the time it takes for a process to fork a fixed number of children. This is much more about latency than throughput, with locality of memory accesses playing a smaller role and, again, imbalances due to pinning being a potential issue.
This all happened on a 2-node host, where 2 to 10 VMs (2 vCPUs and 960 RAM each) were executing the various benchmarks concurrently. Here are the results:
----------------------------------------------------
| SpecJBB2005, throughput (the higher the better)  |
----------------------------------------------------
| #VMs | No affinity |  Pinning  | NUMA scheduling |
----------------------------------------------------
|    2 |  43318.613  | 49715.158 |    49822.545    |
|    6 |  29587.838  | 33560.944 |    33739.412    |
|   10 |  19223.962  | 21860.794 |    20089.602    |
----------------------------------------------------
| Sysbench memory, throughput (the higher the better)
----------------------------------------------------
| #VMs | No affinity |  Pinning  | NUMA scheduling |
----------------------------------------------------
|    2 |  469.37667  | 534.03167 |    555.09500    |
|    6 |  411.45056  | 437.02333 |    463.53389    |
|   10 |  292.79400  | 309.63800 |    305.55167    |
----------------------------------------------------
| LMBench proc, latency (the lower the better)     |
----------------------------------------------------
| #VMs | No affinity |  Pinning  | NUMA scheduling |
----------------------------------------------------
|    2 |  788.06613  | 753.78508 |    750.07010    |
|    6 |  986.44955  | 1076.7447 |    900.21504    |
|   10 |  1211.2434  | 1371.6014 |    1285.5947    |
----------------------------------------------------
Reasoning in terms of percentage performance increase/decrease, this means NUMA aware scheduling compares as follows to no affinity at all and to static pinning:
----------------------------------
| SpecJBB2005 (throughput)       |
----------------------------------
| #VMs | No affinity |  Pinning  |
----------------------------------
|    2 |   +13.05%   |   +0.21%  |
|    6 |   +12.30%   |   +0.53%  |
|   10 |    +4.31%   |   -8.82%  |
----------------------------------
| Sysbench memory (throughput)   |
----------------------------------
| #VMs | No affinity |  Pinning  |
----------------------------------
|    2 |   +15.44%   |   +3.79%  |
|    6 |   +11.24%   |   +5.72%  |
|   10 |    +4.18%   |   -1.34%  |
----------------------------------
| LMBench proc (latency)         |
| NOTICE: -x.xx% = GOOD here     |
----------------------------------
| #VMs | No affinity |  Pinning  |
----------------------------------
|    2 |    -5.66%   |   -0.50%  |
|    6 |    -9.58%   |  -19.61%  |
|   10 |    +5.78%   |   -6.69%  |
----------------------------------
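For readers who want to re-derive the percentages from the raw numbers, the following Python sketch may help. The formula is an assumption inferred from the figures themselves, not something stated in the cover letter: the percentages appear to be computed as (NUMA scheduling - other) / NUMA scheduling, i.e. relative to the NUMA scheduling result rather than to the baseline. Under that assumption the sketch reproduces the published values, with the exception of the 2-VM LMBench figure against no affinity, which comes out at about -5.07% rather than -5.66%. The names `RESULTS`, `relative_change` and `percent_table` are made up for this example.

```python
# Raw results copied from the benchmark tables:
# benchmark -> number of VMs -> (no affinity, pinning, NUMA scheduling)
RESULTS = {
    "SpecJBB2005": {
        2: (43318.613, 49715.158, 49822.545),
        6: (29587.838, 33560.944, 33739.412),
        10: (19223.962, 21860.794, 20089.602),
    },
    "Sysbench memory": {
        2: (469.37667, 534.03167, 555.09500),
        6: (411.45056, 437.02333, 463.53389),
        10: (292.79400, 309.63800, 305.55167),
    },
    "LMBench proc": {
        2: (788.06613, 753.78508, 750.07010),
        6: (986.44955, 1076.7447, 900.21504),
        10: (1211.2434, 1371.6014, 1285.5947),
    },
}


def relative_change(numa, other):
    """Percentage difference between NUMA scheduling and `other`.

    Assumed formula (inferred from the published numbers): the difference
    is divided by the NUMA scheduling value, not by the baseline.
    """
    return (numa - other) / numa * 100.0


def percent_table(results):
    """Return benchmark -> #VMs -> (% vs no affinity, % vs pinning)."""
    return {
        bench: {
            vms: (relative_change(numa, no_aff), relative_change(numa, pin))
            for vms, (no_aff, pin, numa) in rows.items()
        }
        for bench, rows in results.items()
    }


if __name__ == "__main__":
    for bench, rows in percent_table(RESULTS).items():
        print(bench)
        for vms, (vs_no_aff, vs_pin) in rows.items():
            print(f"  {vms:>2} VMs: {vs_no_aff:+7.2f}% vs no affinity, "
                  f"{vs_pin:+7.2f}% vs pinning")
```

Keep in mind that for LMBench the sign reads the other way round: the metric is a latency, so a negative percentage means NUMA scheduling was faster.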
The tables show that, when not in overload (where overload = "more vCPUs than pCPUs"), NUMA scheduling is the absolute best. In fact, not only does it do a lot better than no affinity on throughput biased benchmarks, and a lot better than pinning on latency biased benchmarks (especially with 6 VMs), it also equals or beats both under circumstances that are adverse to NUMA scheduling, i.e., it beats or equals pinning in the throughput benchmarks, and beats or equals no affinity in the latency benchmark.
When the system is overloaded, NUMA scheduling scores in the middle, as could be expected. It must also be noted that, when it brings benefits, they are not as large as in the non-overloaded case. However, this only means that there is still room for more optimization, right? In more detail, the current way a pCPU is selected for a vCPU that is waking up couples particularly badly with the new concept of NUMA node affinity. Changing this is not trivial, because it involves rearranging some locks inside the scheduler code, but it is already being worked on.
Anyway, even with what we have right now, we are overloading the test box by 20% here (without counting Dom0 vCPUs!) and still seeing improvements, which is definitely not bad!
What Else Is Going On?
Well, a lot… To the point that it is probably pointless to try to make a list here! We have a NUMA roadmap on our Wiki, which we are trying to keep updated and, more importantly, to honor and fulfill. So, if you are interested in knowing what will come next, go check it out!