Saturday, December 14, 2013

Experimenting with zero-copy network IO in FreeBSD-HEAD

Back when I started all of this networking hacking, the "big thing" was the overhead of doing poll() and select(). Various operating systems came up with ways of eliminating that overhead - FreeBSD grew the kqueue infrastructure; Linux received epoll; Solaris received an epoll-like device (/dev/poll) and then ended up with a kqueue-like event mechanism (event ports). Windows has completion ports/overlapped IO, which combine the event mechanism with a zero-copy way of doing network IO.

So the Free/Open operating systems have scalable event notification mechanisms for handling large numbers of concurrent sockets, but they don't all have a nice, efficient way of doing zero-copy network IO.

Linux has splice()/tee()/vmsplice(). So yes, it effectively does have a way of doing zero-copy socket reading and writing.

OpenBSD has a splice-style facility (socket splicing, via the SO_SPLICE socket option) to copy data from a source TCP socket to a destination TCP socket entirely in the kernel.

FreeBSD, however, has mostly focused on the "disk to network" path for content serving and thus has a lot of time invested in its sendfile() implementation. This is great if you're doing a lot of file-to-network sending (which Netflix does), but it has some serious shortcomings. The main one I'll address here is that there's no way to do general zero-copy socket writes from userland: sendfile() can only send data from files to the network. You can't implement a zero-copy intermediary proxy server, nor a memory cache that keeps things in pre-allocated memory regions. You have to use files (whether on a real on-disk filesystem or a memory filesystem) and leverage VM hints to control caching.

Recently there was some new sendfile() work to allow sending from POSIX shared memory segments. This intrigued me - it's not the most efficient way of doing zero-copy network IO from userland, but it's a start. So I set off to write an updated version of my network library from yesteryear and use it to implement some massively parallel network applications.

The idea is simple - you allocate a POSIX shared memory segment, mmap() that region into memory, and treat it as a place to allocate write-side network buffers from. Then you use the shared memory file descriptor and an offset into it to schedule a sendfile() from the shared memory segment to the destination network socket. It's not as elegant as having a write path that wires the memory down and just populates mbufs from it, but that'll come later.
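As a rough sketch of that flow - assuming an anonymous (SHM_ANON) segment, an already-connected TCP socket s, and with the buffer allocator and most error paths trimmed - it looks something like this:

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <err.h>

    #define POOL_SIZE   (16 * 1024 * 1024)  /* example pool size */

    /*
     * Sketch: carve write-side network buffers out of a POSIX shared
     * memory segment and sendfile() them straight to a socket.
     */
    static void
    send_from_shm(int s, const char *payload, size_t len)
    {
        int shm_fd;
        char *base;
        off_t sent;

        /* Anonymous shm object; it lives only as long as the fd does. */
        shm_fd = shm_open(SHM_ANON, O_RDWR, 0600);
        if (shm_fd == -1)
            err(1, "shm_open");
        if (ftruncate(shm_fd, POOL_SIZE) == -1)
            err(1, "ftruncate");

        /* Map it so userland can populate buffers in place. */
        base = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
            MAP_SHARED, shm_fd, 0);
        if (base == MAP_FAILED)
            err(1, "mmap");

        /* A real allocator would hand out (offset, length) slices. */
        memcpy(base, payload, len);

        /* Queue the shm-backed pages to the socket; offset 0 here. */
        if (sendfile(shm_fd, s, 0, len, NULL, &sent, 0) == -1)
            err(1, "sendfile");

        munmap(base, POOL_SIZE);
        close(shm_fd);
    }

The point is that the pages handed to sendfile() are the same pages the application just wrote through the mapping - there's no copy into a socket buffer on the way out.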

Here's what I found.

Firstly, there's no asynchronous "I'm done!" notification for the sendfile() path, so you get no explicit signal that the underlying memory is no longer referenced and can be reused. sendfile() does have the SF_SYNC flag, which causes it to sleep until the transaction is done - primarily so users can be sure they can change the underlying file contents after the syscall completes. This is used by caches such as Varnish that leverage on-disk files as their cache storage.
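For reference, a minimal sketch of that synchronous mode (the file descriptor, socket, and range here are placeholders):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <err.h>

    /*
     * Sketch: SF_SYNC makes sendfile() sleep until the pages backing
     * this transaction are no longer referenced by the socket, so on
     * return it's safe to modify the underlying file (or shm) contents.
     */
    static void
    send_sync(int fd, int s, off_t off, size_t len)
    {
        off_t sent = 0;

        if (sendfile(fd, s, off, len, NULL, &sent, SF_SYNC) == -1)
            err(1, "sendfile");
        /* Safe to overwrite [off, off + len) in the backing object now. */
    }

It's correct, but it serialises the sender - which is exactly why an asynchronous completion notification is more attractive here.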

So I've been adding that. I have a working prototype that is scaling quite well under load, and I'll look to commit it to FreeBSD-HEAD soon. It posts a knote to a kqueue file descriptor once a transaction has completed.
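The exact flag and filter details aren't settled since this isn't committed yet, but purely as an illustration, the consuming side of such a notification might look something like this (struct tx_buf and buf_free() are hypothetical stand-ins for an application buffer pool):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <err.h>

    /* Hypothetical per-transaction buffer bookkeeping. */
    struct tx_buf;
    void    buf_free(struct tx_buf *);

    /*
     * Sketch: each completed sendfile() transaction is assumed to post
     * a knote whose udata points at the buffer it used, so the buffer
     * can be returned to the free pool and reused.
     */
    static void
    reap_completions(int kq)
    {
        struct kevent evs[64];
        int i, n;

        for (;;) {
            n = kevent(kq, NULL, 0, evs, 64, NULL);
            if (n == -1)
                err(1, "kevent");
            for (i = 0; i < n; i++)
                buf_free((struct tx_buf *)evs[i].udata);
        }
    }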

Once that was done, I started benchmarking the performance of this setup.

The first real roadblock I hit was massive VM contention on the shared memory segment. It turns out that a single POSIX shared memory segment is represented as a single vm_object, and that object is protected by a single lock. So when 8 threads are actively doing IO from the same shared memory segment, they hit massive lock contention. I fixed this in my test suite by allocating one shared memory segment per thread. It's not elegant, but it works well enough for benchmarking.

I next hit contention on the VM page queues. Besides the per-object list, there's also a global per-type queue (active, inactive, etc.), with one lock protecting each of these lists. What I found was that the VM was shuffling pages between the active and inactive queues, and at the traffic rates I was pushing (20+ Gbit/sec) that worked out to a few hundred thousand pages a second being shuffled around. The solution? mlock() the whole region into memory. That stopped the pages from changing state so often and eliminated the overhead.
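Rolled together, those two workarounds look something like this per worker thread (the segment size and the thr_pool structure are just illustrative):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    #define SEG_SIZE    (64 * 1024 * 1024)  /* example per-thread pool */

    /* Hypothetical per-thread buffer pool state. */
    struct thr_pool {
        int     shm_fd;
        char    *base;
    };

    /*
     * Sketch: each worker gets its own anonymous shm segment - its own
     * vm_object, hence its own object lock - and the whole segment is
     * wired with mlock() so the VM stops shuffling its pages between
     * the active/inactive queues under load.
     */
    static void
    thr_pool_init(struct thr_pool *tp)
    {
        tp->shm_fd = shm_open(SHM_ANON, O_RDWR, 0600);
        if (tp->shm_fd == -1)
            err(1, "shm_open");
        if (ftruncate(tp->shm_fd, SEG_SIZE) == -1)
            err(1, "ftruncate");
        tp->base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
            MAP_SHARED, tp->shm_fd, 0);
        if (tp->base == MAP_FAILED)
            err(1, "mmap");
        /* Wiring this much memory may need a raised RLIMIT_MEMLOCK. */
        if (mlock(tp->base, SEG_SIZE) != 0)
            err(1, "mlock");
    }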

The code for this sendfile()-from-POSIX-shared-memory work is in my libiapp library - http://github.com/erikarn/libiapp . It's terrible and hacky - I'm just experimenting with things for now. But with some tuning, I can get a good 35 Gbit/sec out of 70,000 active TCP sockets. There's still a long way to go - I shouldn't be saturating an 8-core CPU at this traffic level when I'm doing no socket data copies. I'll write another update or two about that soon.

Now, what would I like to see? I did some experiments with physical disk IO using the FreeBSD AIO paths, doing the same kinds of IO patterns as I'm doing with network socket IO (4KiB to 64KiB random disk reads). It turns out that if you do everything correctly, the FreeBSD AIO code will turn physical disk IO into asynchronous disk buffer transactions by wiring the userland buffer into memory and using it directly as the backing buffer memory. The overhead of the pmap work for this was not too high. So I wonder whether it's worth writing a new transmit path that uses the pmap code (and not the VM!) to wire in a region of memory and then use that for transmit buffers. Combined with an iovec-style array of buffers and the kqueue notification of network IO completion described above, I think we could end up with a much more flexible way of doing network IO from userland, without the shortcomings of the POSIX shared memory plus sendfile() approach.
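To make the disk-side comparison concrete, here's roughly what those AIO experiments boil down to: a random read into a userland buffer with completion delivered via kqueue. This illustrates the existing AIO path, not the proposed transmit path, and the alignment/sizing details that make the zero-copy case kick in are omitted:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <aio.h>
    #include <err.h>
    #include <string.h>

    /*
     * Sketch: one asynchronous disk read with its completion posted to
     * a kqueue (EVFILT_AIO). Under the right conditions the kernel
     * wires the userland buffer and uses it directly as the backing
     * memory for the IO - the behaviour described above.
     */
    static ssize_t
    async_disk_read(int kq, int disk_fd, off_t off, void *buf, size_t len)
    {
        struct aiocb cb;
        struct kevent ev;

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = disk_fd;
        cb.aio_offset = off;
        cb.aio_buf = buf;
        cb.aio_nbytes = len;
        cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
        cb.aio_sigevent.sigev_notify_kqueue = kq;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;

        if (aio_read(&cb) == -1)
            err(1, "aio_read");

        /* Block here until the EVFILT_AIO completion event arrives. */
        if (kevent(kq, NULL, 0, &ev, 1, NULL) == -1)
            err(1, "kevent");

        return (aio_return((struct aiocb *)ev.udata));
    }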