I interpret that as including "find ways to break things."
So, I've written a crappy little multi-threaded network library (http://github.com/erikarn/libiapp) which is absolutely, positively crappy and FreeBSD specific. Right now all it does is TCP and UDP network smashing using read() / write() for TCP, and recvfrom() / sendto() for UDP.
The aim with this is to stress test things and find where they break. So, the first thing I've written is a very simple TCP client/server - the client connects to the server and just write()s a lot of data.
... except that the clients are lightweight, written in C, and multi-threaded.
So, I end up with 'n' threads, with 'm' TCP sockets each, all doing write(). Right now I'm watching 4 threads with 12,288 sockets each sending data.
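To make the shape of that concrete, here's a minimal sketch of the "n threads, m sockets each, all doing write()" loop. It is not libiapp's code: the struct and function names are made up for illustration, the descriptors are assumed to be already connected, and error handling is reduced to "count what write() accepted".

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical per-thread state; these names are not from libiapp. */
struct writer {
    pthread_t tid;
    int *fds;            /* the 'm' connected sockets this thread owns */
    size_t nfds;
    size_t rounds;       /* passes to make over the sockets */
    size_t bytes_out;    /* total bytes write() accepted */
};

static void *writer_main(void *arg)
{
    struct writer *w = arg;
    static const char payload[1024];    /* zero-filled test data */

    assert(w->fds != NULL || w->nfds == 0);
    for (size_t r = 0; r < w->rounds; r++) {
        for (size_t i = 0; i < w->nfds; i++) {
            ssize_t n = write(w->fds[i], payload, sizeof(payload));
            if (n > 0)
                w->bytes_out += (size_t)n;
            /* n <= 0: a real tool would inspect errno here. */
        }
    }
    return NULL;
}

/* Start 'n' writer threads and wait for them; returns 0 on success. */
static int run_writers(struct writer *ws, size_t n)
{
    size_t started = 0;
    int rc = 0;

    for (; started < n; started++) {
        if (pthread_create(&ws[started].tid, NULL, writer_main,
            &ws[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (size_t i = 0; i < started; i++)
        pthread_join(ws[i].tid, NULL);
    return rc;
}
```

Each thread only ever touches its own descriptors and its own counter, so there's no locking between writers. That's the whole point: the contention ends up in the kernel and the NIC driver, which is the bit being stress tested.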
The test hardware is a pair of 1RU Supermicro boxes with Intel E3-1260L CPUs in them, 32GB of RAM and dual-port Intel 82599EB 10GE NICs. The NICs are channel-bonded (using LACP) through a Cisco ASR9k switch.
I initially tested this on FreeBSD-9. I was rudely reminded of how utterly crappy the default mbuf sizing is. I constantly ran out of mbufs. Then, since FreeBSD-10 is on the cards, I just updated everything to the latest development branch and ran with it.
The result? The test ran for about 90 seconds before things got plainly pissed. The client (sender) would immediately hang. I'd get short packet errors, the LACP session would get unstable... everything was just plain screwed. The server (receiver) never saw any issues. I also saw lots of RX stalls, where one ring would seemingly fill up - and the whole RX path just ground to a halt. In addition, I'd also see a whole lot of out of order TCP segments on the server (receiver) side. Grr.
So, cue some driver hacking to see what was going on, reading the Intel 82599EB datasheet (that's freely available, by the way!) as well as discussions with Intel, Verisign and a few other companies that are using Intel 10GE hardware quite heavily, and here's what was discovered.
There's a feature called "RX_COPY" where small received packets are copied into a new, small mbuf - and the existing receive buffer is left in the RX ring. This improves performance - there's less churn of the mbuf allocator for those larger buffers. However, there were some dangling pointers around the management of that, leading to some stuff being DMAed where it shouldn't. And since ACKs and LACP frames are "small", they take exactly this path and trigger the bug. Since the sender (client) is sending lots of segments, it's going to be receiving a lot of ACKs, and this explains why the receiver (server) didn't hit this bug.
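Here's a rough userland sketch of the copy-small-frames idea, with plain malloc()ed buffers standing in for mbufs. The threshold, the buffer size and all of the names are assumptions for illustration, not the driver's code. The thing to watch is who owns which buffer afterwards, because that is where a stale pointer does its damage.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RX_COPY_THRESHOLD 160    /* assumed cutoff for a "small" frame */
#define RX_BUF_SIZE       2048   /* assumed size of a ring buffer */

struct rx_slot {
    unsigned char *buf;   /* large buffer the NIC would DMA into */
    size_t len;           /* bytes received for the current frame */
};

/*
 * Take the received frame out of a ring slot.  The caller owns the
 * returned buffer.  Returns NULL if an allocation fails.
 */
static unsigned char *rx_take_frame(struct rx_slot *slot, size_t *lenp)
{
    unsigned char *pkt;

    *lenp = slot->len;
    if (slot->len <= RX_COPY_THRESHOLD) {
        /* Small frame: copy it out and leave the big buffer in the ring. */
        pkt = malloc(slot->len > 0 ? slot->len : 1);
        if (pkt == NULL)
            return NULL;
        memcpy(pkt, slot->buf, slot->len);
    } else {
        /* Large frame: hand the ring buffer up and put a fresh one in. */
        unsigned char *fresh = malloc(RX_BUF_SIZE);
        if (fresh == NULL)
            return NULL;
        pkt = slot->buf;
        slot->buf = fresh;
    }
    /* The ring must never keep pointing at memory we just gave away. */
    assert(pkt != slot->buf);
    return pkt;
}
```

The assert at the end is the invariant that matters: whatever the ring slot points at after the call must still belong to the ring, because that's where the hardware will write the next frame.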
Next, the RX stalls. By default, if one of the RX rings fills up, the whole RX engine stalls. This behaviour is apparently configurable (read the datasheet!) but the alternative isn't enabled by default in FreeBSD/Linux. One of the Verisign guys found the problem - the general MSIX interrupt handler path was acknowledging all of the interrupts that were currently pending, rather than only the ones that were activated. The TX/RX interrupts are routed to other MSIX messages and thus should be handled by those interrupt threads. So, under sufficient load - and if you had any link status flaps - you may hit a situation where the non-packet MSIX interrupt thread runs, ACKs all the interrupts (including the pending RX ones), and the RX ring immediately fills up because nothing ever services it. The hardware won't generate a subsequent interrupt as the ring has already hit the limit.. so you're stuck. That's been fixed. The annoying bit? It was fixed in the Linux driver but not the FreeBSD driver. Growl.
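The difference between "ACK everything pending" and "ACK only what this handler owns" is easy to show with a toy model. The cause bits and the fake_nic struct below are invented for illustration; the real hardware uses its own registers and bit layout, so treat this as a sketch of the logic only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical interrupt cause bits; not the 82599's real layout. */
#define CAUSE_RXQ0  (1u << 0)
#define CAUSE_LINK  (1u << 20)

struct fake_nic {
    uint32_t pending;    /* causes raised and not yet acknowledged */
};

/* Buggy pattern: the link handler clears every pending cause. */
static uint32_t ack_all(struct fake_nic *nic)
{
    uint32_t seen = nic->pending;

    nic->pending = 0;
    return seen;
}

/* Fixed pattern: clear only the causes this handler is responsible for. */
static uint32_t ack_owned(struct fake_nic *nic, uint32_t owned)
{
    uint32_t seen = nic->pending & owned;

    nic->pending &= ~owned;
    assert((nic->pending & owned) == 0);
    return seen;
}
```

With a link flap and an RX event pending at the same time, ack_all() leaves pending at zero, so the RX queue's own handler never hears about its event. ack_owned(nic, CAUSE_LINK) leaves CAUSE_RXQ0 set for the RX handler to pick up.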
So, now the driver behaves much, much better. I can smash it with 20 gigabits a second of TCP traffic across 50,000-odd sockets with nary a crash/hang. But what bugs me is the out-of-order TCP packets on the receiver side of things.
The reason - it's highly likely due to the driver architecture. The driver will schedule deferred packet processing using the taskqueue if the interrupt handler ends up with too many packets to work with. Now, this taskqueue is totally separate from the interrupt thread - which means you can have both of them running at the same time, and on separate CPUs. If both are pulling packets off the same RX ring and handing them up the stack, nothing guarantees the stack sees them in the order they arrived.
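Here's a tiny deterministic model of how that could reorder things. It's only a sketch of the hypothesis: it assumes each context claims a batch of frames from the ring in order and then delivers that batch to the stack on its own schedule. All names are made up, and sequence numbers stand in for TCP segments.

```c
#include <assert.h>
#include <stddef.h>

struct rx_ring {
    int next_seq;    /* sequence number of the next frame in the ring */
};

/* Claim the next 'n' frames from the ring; claims happen in ring order. */
static void ring_claim(struct rx_ring *r, int *batch, size_t n)
{
    assert(n == 0 || batch != NULL);
    for (size_t i = 0; i < n; i++)
        batch[i] = r->next_seq++;
}

/* Append a claimed batch to the order in which the stack sees frames. */
static size_t deliver(int *seen, size_t pos, const int *batch, size_t n)
{
    for (size_t i = 0; i < n; i++)
        seen[pos++] = batch[i];
    return pos;
}

/* Count frames that arrive after a higher-numbered frame was already seen. */
static size_t count_out_of_order(const int *seen, size_t n)
{
    size_t late = 0;
    int highest = -1;

    for (size_t i = 0; i < n; i++) {
        if (seen[i] < highest)
            late++;
        else
            highest = seen[i];
    }
    return late;
}
```

If the interrupt thread claims frames 0-3, the taskqueue claims 4-7, and the taskqueue happens to deliver first, the stack sees 4,5,6,7,0,1,2,3 and count_out_of_order() reports four late frames. Deliver the batches in claim order and it reports zero.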
So I'm going to hack the driver up to not schedule the taskqueue and instead just poke the hardware to post another interrupt to do further processing. I hope this will resolve the out of order TCP frames being received.