Now and again, I get e-mails asking me about the state of the XenFS project. It’s something I’ve been working on for some time now and I haven’t always been very good at keeping the world up-to-date about my progress. I thought it would be good if I summarised the state of things here, to start with. In this article I give an overview of the arguments for XenFS and the techniques used. I’ve necessarily missed out a lot of detail but I might provide occasional updates on the work in future, depending on people’s interest in this stuff.
What is XenFS for?
XenFS is a remote filesystem protocol for sharing filesystem data within a single Xen host. You use it to share directory trees, much like you would with NFS.
Wait, within a host?
Yep. XenFS is only designed for sharing data between VMs within a Xen host. It uses shared memory, rather than networking, so it is only capable of acting within a host. If you want to access filesystems across multiple hosts you still need to use a network filesystem or a cluster filesystem with some kind of shared storage.
Why not just use NFS / CIFS / GFS / OCFS / other within the host?
Several reasons:
- If you use a network filesystem like NFS or CIFS for sharing data within a Xen host then all the network protocol stuff is pure overhead – there’s no network there but you’re going through the entire network stack and then over the virtual ethernet. They’re fine protocols but it’s easy to see that this isn’t optimal.
- If you use a cluster filesystem like GFS or OCFS then you need to set up each guest to correctly access the shared filesystem. There are vendor-supplied tools to help you do this, but it doesn’t alter the fact that you must trust all the guests to behave themselves when writing to shared storage. That could be a problem.
- With any of those solutions, identical data accessed in multiple VMs gets copied into memory multiple times. That’s a waste of memory and misses a potential performance optimisation.
OK, so why use a filesystem at all?
Because filesystems are pretty fundamental to modern operating systems and their users. The designers of Unix had it right when they recognised the filesystem as the most important namespace. Simply exporting a block device to a guest has a number of advantages (good performance, it encapsulates the guest state, the guest can use the disk blocks as it sees fit), but you also lose useful semantic information. You can’t (easily) see what the guest is doing. That prevents you doing all kinds of cool and useful things, like:
- Running your favourite file indexer / search engine on the contents of your virtual machines. Wouldn’t it be handy to have a friendly Google Desktop-style interface to the files inside your VMs?
- Running an intrusion detection system or virus scanner in dom0 to protect your VMs. Running this software inside the VM makes it vulnerable to compromise, or to the VM’s administrator simply turning it off. Running it within dom0 instead lets the Xen host’s administrator make sure the software is running correctly and manage policy centrally.
- Making consistent backups of your VMs’ filesystems. Sure, you can snapshot the virtual disk but there’s no guarantee that the filesystem on it is mountable at any given time.
- Storing VM data in a cluster filesystem without trusting the VM itself to play nice with the shared disk.
- and so on…
What does XenFS do to address this?
XenFS provides functionality that – to a user or administrator – is somewhat like a traditional network filesystem. The server VM elects to export a portion of its filesystem hierarchy; the client “mounts” this using a remote access protocol. The filesystem tree then appears on the client, much like a local filesystem would. Changes made by the client show up at the server and are stored on the server’s disk.
The major differences from a traditional network filesystem are in the implementation. XenFS is implemented as a XenLinux “split driver”, with kernel modules implementing the client and server portions. Instead of exchanging protocol messages over a network socket, XenFS exchanges requests and responses using shared memory, similar to the “device channels” used by the block and network split drivers. Beyond that, instead of copying data from the server to the client (and back), XenFS also shares the memory containing the actual file data. The result is that VMs end up participating in a common, host-wide shared buffer cache. A combination of page flipping and shared mappings is used to ensure that clients can’t abuse this facility to “steal” memory from the server. This is the most “researchy” part of XenFS and requires the “weirdest” code but does promise a number of potential benefits:
- If multiple guests access the same data, it takes up no more memory than if a single guest were using it. Potentially useful if your VMs are working on common data sets, or if /usr were hosted on XenFS.
- If one guest has already loaded some data into memory, the others can access it without disk I/O being required. Potentially improves performance and – in these energy-conscious days – may avoid spinning up the hard drive.
- If applications in multiple VMs want to do shared memory communications they need only mmap a common file as they would if they were running in the same VM. No need to write apps to Xen-specific memory sharing APIs.
- and so on…
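To give a feel for the request/response channel mentioned above, here is a toy model of a shared-memory ring in the spirit of the “device channels” used by Xen split drivers. This is purely illustrative: real rings live in grant-mapped shared pages with memory barriers and event-channel notifications, and the structure and field names here are invented, not XenFS’s actual data structures.

```python
class RequestRing:
    """Toy single-producer/single-consumer request ring of fixed size."""

    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size  # stands in for a shared page of request slots
        self.req_prod = 0           # advanced by the frontend (client VM)
        self.req_cons = 0           # advanced by the backend (server VM)

    def push_request(self, req):
        # Frontend: write the request into the next free slot, then publish it
        # by advancing the producer index. Real code would issue a write
        # barrier here and notify the backend via an event channel.
        assert self.req_prod - self.req_cons < self.size, "ring full"
        self.slots[self.req_prod % self.size] = req
        self.req_prod += 1

    def pop_request(self):
        # Backend: consume the oldest outstanding request.
        assert self.req_cons < self.req_prod, "ring empty"
        req = self.slots[self.req_cons % self.size]
        self.req_cons += 1
        return req


ring = RequestRing()
ring.push_request({"op": "lookup", "name": "hostname"})  # client side
req = ring.pop_request()                                 # server side
print(req["op"])  # -> lookup
```

The point of the ring layout is that both ends can make progress without locks: each index is written by exactly one party, which is what makes it safe to place in memory shared between two VMs.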
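The shared-memory-via-mmap idiom in the last bullet is just the ordinary POSIX one. The sketch below shows two mappings of a common file observing each other’s writes; under XenFS the same code would work with the “writer” and “reader” in different VMs, because both would map the same cached pages. Here both mappings are simply in one process, and the temporary file is a stand-in for a file on a hypothetical XenFS mount.

```python
import mmap
import os
import tempfile

# Create a one-page file to stand in for a file on a XenFS mount.
path = tempfile.NamedTemporaryFile(delete=False).name
with open(path, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE)

# "Writer" (an application in one VM) maps the file and stores a message.
fd_w = os.open(path, os.O_RDWR)
writer = mmap.mmap(fd_w, mmap.PAGESIZE)
writer[:5] = b"hello"

# "Reader" (an application in another VM) maps the same file and sees the
# update immediately, with no copy through a network stack.
fd_r = os.open(path, os.O_RDWR)
reader = mmap.mmap(fd_r, mmap.PAGESIZE)
seen = bytes(reader[:5])
print(seen)  # b'hello'

for m, fd in ((writer, fd_w), (reader, fd_r)):
    m.close()
    os.close(fd)
os.remove(path)
```

Because the interface is just files and mmap, applications written this way need no knowledge of Xen at all.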
Does it work?
Yes … and no. I have a prototype implementation of some of these ideas which is already almost sufficient for me to do preliminary benchmarking with fairly substantial workloads. It’s definitely a research prototype and it’s emphatically not ready for production use. Many of the features / behaviours described above already work or could be made to work with a reasonably modest effort. I’ve perpetrated a number of slightly grim modifications to the Linux kernel, although I’ve tried to be as non-invasive as possible. Future work might be required to clean this up a bit and / or modify the hooks I’ve added to be more generally useful.
Using a customised initrd I can boot a domain from a XenFS root filesystem, then clone and build the XenFS hg repository in order to compile a new, bootable XenFS-enabled kernel. I wouldn’t trust the domain to stay running indefinitely, but it happily survives that workload. Very large builds that create lots and lots of files still tickle some bugs. iozone tests complete with file sizes of 2G or more (substantially more than the available RAM in the VM). The code doubtless has all sorts of bugs, leaks, hacks, etc. that are yet to be fixed.
The current code is based on a fairly ancient version of xen-unstable and I’m not planning to rebase for the moment, unless I need access to newer features.
What next?
Well, as described, XenFS is primarily a research project at the moment. I’ll be adding some more interesting features and trying to get some of the more advanced use cases described above to work reliably. Production use is not a goal for the time being. My hope is that – at minimum – the insights gained through this work will be useful to similar projects in the future. Once the pure research phase is over I will think about whether to continue to develop the filesystem into something more robust.
If you’re interested in XenFS, you can e-mail me at mark.williamson@cl.cam.ac.uk and / or pester me to write more blog posts.