Shadow 3 is the next step in the evolution of the shadow pagetable code. By making the shadow pagetables behave more like a TLB, we take advantage of guest operating system TLB behavior to reduce the number of guest pagetable changes that the hypervisor has to translate into the shadow pagetables, and to coalesce those that remain. This can dramatically reduce the virtualization overhead for HVM guests.
Shadow paging overhead is one of the largest sources of CPU virtualization overhead for HVM guests. Because HVM guest operating systems don’t know the physical frame numbers of the pages assigned to them, they use guest frame numbers instead. The hypervisor must therefore translate each guest frame number into a machine frame in the shadow pagetables before the pagetables can be used by the guest.
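To make that concrete, here is a minimal sketch of the translation the hypervisor performs when it builds a shadow entry. This is not the actual Xen code; gfn_to_mfn(), the PTE layout, and the 4K-page assumption are purely illustrative.

```c
/* Minimal sketch, not the real Xen code: turning a guest PTE (which holds a
 * guest frame number) into a shadow PTE (which must hold a machine frame
 * number the hardware can actually use). */
typedef unsigned long gfn_t;   /* guest frame number   */
typedef unsigned long mfn_t;   /* machine frame number */
typedef unsigned long pte_t;

#define PTE_FLAGS_MASK 0xfffUL /* low 12 bits: present, rw, etc. (4K pages) */

extern mfn_t gfn_to_mfn(gfn_t gfn);  /* illustrative physical-to-machine lookup */

pte_t shadow_pte_from_guest(pte_t gpte)
{
    gfn_t gfn   = gpte >> 12;             /* frame number the guest wrote */
    mfn_t mfn   = gfn_to_mfn(gfn);        /* real frame backing that gfn  */
    pte_t flags = gpte & PTE_FLAGS_MASK;  /* keep the permission bits     */

    return (mfn << 12) | flags;           /* entry the hardware will walk */
}
```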
Those who have been around a while may remember the Shadow-1 code. Its method of propagating changes from guest pagetables to the shadow pagetables was as follows (a rough sketch in code follows the list):
- Remove write access from all guest pagetables.
- When a guest attempts to write to a guest pagetable, mark the page out-of-sync, add it to the out-of-sync list, and give write permission back.
- On the next page fault or CR3 write, take each page from the out-of-sync list and:
  - resync the page: look for changes to the guest pagetable and propagate those entries into the shadow pagetable
  - remove write permission, and clear the out-of-sync bit.
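In rough pseudocode, with invented helper names (the real code is considerably more involved), that flow looks something like this:

```c
/* Rough sketch of the Shadow-1 flow above; all helper names are invented. */
struct page_info {
    int out_of_sync;
    /* ... */
};

extern void grant_write_access(struct page_info *gpt);
extern void remove_write_access(struct page_info *gpt);
extern void propagate_changes_to_shadow(struct page_info *gpt);
extern void oos_list_add(struct page_info *gpt);

/* A guest write hits a write-protected guest pagetable page. */
void shadow1_pt_write_fault(struct page_info *gpt)
{
    gpt->out_of_sync = 1;
    oos_list_add(gpt);
    grant_write_access(gpt);      /* further writes go straight through */
}

/* On the next page fault or CR3 write, bring everything back in sync. */
void shadow1_resync_all(struct page_info **oos_list, int nr)
{
    for ( int i = 0; i < nr; i++ )
    {
        propagate_changes_to_shadow(oos_list[i]); /* expensive: scan the page */
        remove_write_access(oos_list[i]);
        oos_list[i]->out_of_sync = 0;
    }
}
```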
While this method worked tolerably for Linux, it was disastrous for Windows. Windows makes heavy use of a technique called demand paging. Resyncing a guest page is an expensive operation, and under Shadow-1 every page that was demand-faulted in caused an unsync, a single write, and then a resync.
The next step, Shadow-2, did away (among many other things) with the out-of-sync mechanism and instead emulated every write to guest pagetables. Emulation avoids the expensive unsync-resync cycle for demand paging. However, it removes any “batching” effect: every write is immediately reflected in the shadow pagetables, even though the guest operating system may not expect the change to become visible until later.
Furthermore, Windows will frequently write “transition values” into pagetable entries when a page is being mapped in or mapped out. The cycle for demand-faulting zero pages in 32-bit Windows looks like:
- Guest process page faults
- Write transition PTE
- Write real PTE
- Guest process accesses page
On bare hardware, this looks like “Page fault / memory write / memory write”. Memory writes are relatively inexpensive. But in Shadow-2, this looks like:
- Page fault
- Emulated write
- Emulated write
Each emulated write involves a VMEXIT/VMENTER as well as about 8000 cycles of emulation inside the hypervisor, much more expensive than a mere memory write.
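A sketch of what the emulate-every-write approach amounts to is below; the helper names are invented rather than the real Xen entry points.

```c
/* Sketch of Shadow-2's emulate-every-write approach; helper names invented. */
typedef unsigned long pte_t;
struct vcpu;

extern pte_t emulate_decode_write_value(struct vcpu *v);  /* x86 instruction emulator */
extern void  write_guest_pte(pte_t *guest_slot, pte_t gpte);
extern pte_t shadow_pte_from_guest(pte_t gpte);
extern void  update_shadow_entry_for(pte_t *guest_slot, pte_t spte);

/* Called on every trapped guest write to a pagetable page. */
void shadow2_emulate_pt_write(struct vcpu *v, pte_t *guest_slot)
{
    /* Decoding and emulating the faulting instruction is the expensive part. */
    pte_t gpte = emulate_decode_write_value(v);

    write_guest_pte(guest_slot, gpte);   /* perform the write on the guest's behalf */

    /* The shadow is updated immediately; there is no opportunity to batch. */
    update_shadow_entry_for(guest_slot, shadow_pte_from_guest(gpte));
}
```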
Shadow-3 brings back the out-of-sync mechanism, but with some key changes. First, only L1 pagetables are allowed to go out-of-sync; writes to all L2+ pagetables are still emulated. Second, we don’t necessarily resync on the next page fault. One thing this enables is a “lazy pull-through”: if we get a page fault where the shadow entry is not present but the guest entry is present, we can simply propagate that single entry to the shadows and return to the guest, leaving the rest of the page out-of-sync (see the sketch after the list below). This means that once a page is out-of-sync, demand-faulting looks like this:
- Page fault
- Memory write
- Memory write
- Propagate guest entry to shadows
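Here is a simplified sketch of the pull-through path; again, the helper names are invented rather than the actual Xen fault-handler interfaces.

```c
/* Simplified sketch of the lazy pull-through case; helper names invented. */
typedef unsigned long pte_t;
struct vcpu;

#define _PAGE_PRESENT 0x1UL

extern int   guest_l1_is_out_of_sync(struct vcpu *v, unsigned long va);
extern pte_t read_guest_pte(struct vcpu *v, unsigned long va);
extern pte_t read_shadow_pte(struct vcpu *v, unsigned long va);
extern pte_t shadow_pte_from_guest(pte_t gpte);
extern void  install_shadow_pte(struct vcpu *v, unsigned long va, pte_t spte);
extern int   bounce_fault_to_guest(struct vcpu *v, unsigned long va);

int shadow3_page_fault(struct vcpu *v, unsigned long va)
{
    pte_t gpte = read_guest_pte(v, va);
    pte_t spte = read_shadow_pte(v, va);

    if ( guest_l1_is_out_of_sync(v, va) &&
         (gpte & _PAGE_PRESENT) && !(spte & _PAGE_PRESENT) )
    {
        /* Lazy pull-through: copy just this one entry and return to the
         * guest, leaving the rest of the page out-of-sync. */
        install_shadow_pte(v, va, shadow_pte_from_guest(gpte));
        return 1;
    }

    /* Anything else takes the normal path (e.g. a genuine guest fault). */
    return bounce_fault_to_guest(v, va);
}
```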
Pulling through a single guest value is actually cheaper than emulation. So for demand paging under Windows, we make 1/3 fewer trips into the hypervisor. Furthermore, bulk updates, like process destruction or mapping a large address space, are propagated to the shadows in a single batch at the next CR3 switch, rather than going into and out of the hypervisor on each individual write.
All of this adds up to greatly improved performance for compilation, compression, databases, and any other workload that does a lot of memory management in an HVM guest.