The Xen community achieved a major milestone last summer when all the necessary components for Xen dom0 support made it into the upstream kernel for the 3.0 release. However, during that process developers were focused on functionality rather than performance, and as a result a handful of performance regressions were introduced in pv-ops kernels compared to the classic kernels.
Recently I have started looking at the performance of pv-ops Linux, using xentrace and xenalyze (see George Dunlap’s presentation for an introduction) to compare the number and pattern of hypercalls between a classic 2.6.32 kernel and a 3.3 pv-ops one. I found a number of performance regressions which, luckily, are either easily fixed or have minimal impact. The individual fixes are listed below.
xen_version hypercall spam
Xen guests check for pending events (interrupts) raised by the hypervisor when leaving a hypercall. If local event delivery is disabled, the guest must make a dummy hypercall when re-enabling events to check for any it may have missed. The xen_version hypercall is used for this as it has no side effects.
The pv-ops kernel was checking for pending events much more often than necessary.
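For reference, here is a simplified sketch of the re-enable path, loosely modeled on the kernel’s Xen irq flag handling; percpu_vcpu_info() is a hypothetical stand-in for the real per-CPU accessor:

```c
/*
 * Sketch of restoring local event delivery in a Xen guest.
 * struct vcpu_info matches the Xen public headers;
 * percpu_vcpu_info() is a hypothetical per-CPU accessor.
 */
static void xen_restore_events(int enable)
{
	struct vcpu_info *vcpu = percpu_vcpu_info();

	vcpu->evtchn_upcall_mask = !enable;

	if (enable) {
		barrier(); /* unmask before testing for pending events */

		/*
		 * An event raised while delivery was masked is still
		 * pending and must be delivered now.  Any hypercall
		 * triggers the check on exit; xen_version is used as
		 * the dummy because it has no side effects.
		 */
		if (vcpu->evtchn_upcall_pending)
			HYPERVISOR_xen_version(0 /* XENVER_version */, NULL);
	}
}
```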
Fix: xen: correctly check for pending events when restoring irq flags, which is available in 3.4-rc2.
Result: About 10% performance improvement to a wide range of kernel operations.
Unnecessary TLS descriptor updates during task switch
On the x86 architecture, the locations of a thread’s thread-local storage (TLS) segments are stored in three entries in the global descriptor table (GDT). Under Xen, the GDT is managed by the hypervisor and guests use three hypercalls (bundled into a single multicall) to update these descriptors on every context switch. Very often the descriptors don’t need to change, so the kernel can avoid many of these hypercalls by tracking their previous values.
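A minimal sketch of the caching idea, assuming a hypothetical per-CPU shadow copy of the three descriptors (shadow_tls and queue_update_descriptor() are illustrative names, not the kernel’s exact ones):

```c
/*
 * Sketch: skip the descriptor-update hypercall when a TLS entry is
 * unchanged.  shadow_tls holds a hypothetical per-CPU copy of the
 * three descriptors last loaded into the GDT.
 */
struct tls_shadow {
	struct desc_struct desc[3];
};
static DEFINE_PER_CPU(struct tls_shadow, shadow_tls);

static void load_tls_descriptor(struct thread_struct *t,
				unsigned int cpu, unsigned int i)
{
	struct desc_struct *shadow = &per_cpu(shadow_tls, cpu).desc[i];

	if (memcmp(shadow, &t->tls_array[i], sizeof(*shadow)) == 0)
		return; /* unchanged: no hypercall needed */

	*shadow = t->tls_array[i];

	/* Hypothetical helper: queue the GDT update in the multicall. */
	queue_update_descriptor(cpu, GDT_ENTRY_TLS_MIN + i,
				&t->tls_array[i]);
}
```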
Fix: xen/x86: avoid updating TLS descriptors if they haven’t changed, which is available in 3.6-rc1.
Result: About 9% improvement in context switch time.
PTE updates for page faults were emulated (32-bit guests only)
When a userspace process accesses memory that is not currently mapped into its address space (e.g., if a file has been newly mmap()ed, or a page has been swapped out), a page fault exception occurs. The page fault handler then updates the page table entry (PTE) for that page to make it accessible to the process. Under Xen, this update was done by writing directly to the read-only page table; Xen would trap and emulate this memory write. PTEs are 64 bits wide, so with a 32-bit guest the two 32-bit wide writes result in two traps into Xen. By using a single hypercall instead, we halve the number of entries into Xen.
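Conceptually, when batching is unavailable the trapped PTE write can be replaced with a single mmu_update hypercall on the PTE’s machine address. A sketch only, with error handling and the batched path omitted:

```c
/*
 * Sketch: update a PTE with one hypercall instead of two trapped
 * 32-bit writes.  virt_to_machine() converts the PTE's kernel
 * virtual address to a machine address; error handling and the
 * batched (multicall) path are omitted.
 */
static void set_pte_single_trap(pte_t *ptep, pte_t pteval)
{
	struct mmu_update u;

	u.ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE;
	u.val = pte_val_ma(pteval);

	/* One entry into Xen instead of two emulated writes. */
	HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF);
}
```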
Fix: xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable, which is available in 3.6-rc1.
Result: About 25% improvement in page fault speed.
Two traps per page when unmapping pages in munmap() (32-bit guests only)
When unmapping pages from a userspace process (such as with munmap()), the corresponding PTE must be cleared and the current dirty and accessed bits must be saved. The kernel’s ptep_get_and_clear() function does both of these atomically. In a 32-bit pv-ops kernel this is implemented with two 32-bit accesses (an xchg and a store), whereas classic kernels do a single 64-bit access (a cmpxchg8b), so the pv-ops kernel takes two traps per page instead of one.
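For illustration, the PAE version of this helper does roughly the following (simplified from the native 32-bit implementation); because Xen maps the page tables read-only, each of the two writes traps separately:

```c
/*
 * Simplified PAE ptep_get_and_clear(): two 32-bit accesses.
 * Under Xen each write to the read-only page table traps into
 * the hypervisor, giving two traps per unmapped page.
 */
static inline pte_t pae_ptep_get_and_clear(pte_t *ptep)
{
	pte_t res;

	/* The xchg acts as a barrier before clearing the high word. */
	res.pte_low = xchg(&ptep->pte_low, 0);	/* trap 1 */
	res.pte_high = ptep->pte_high;
	ptep->pte_high = 0;			/* trap 2 */

	return res;
}
```

A classic kernel’s single cmpxchg8b touches the PTE once, so it takes only one trap.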
Unfortunately, there isn’t a way to fix this in Xen-specific code. Profiling suggests that very little time (< 1%) is spent doing munmap() in a running system, so improving its performance would have very little real-world benefit.
Fix: None planned.