After doing some digging, I discovered that.. I was doing something a little odd in my testing framework and it (surprise!) elicited some very negative behaviour from FreeBSD. Said behaviour is actually valid - it's to avoid denial of service attacks. But it's worth talking about.
My test client was bursting 'n' connections per thread each second. So, I would do a test of say, 128 new connections back to back, each second, from each thread. This is definitely odd (but easy to implement!)
Here's what the server was doing.
Firstly - there's a "syncache". The syncache handles incoming embryonic requests (ie, the SYN from a remote peer.) It's separate from the rest of the TCP stack so a large flood of new connections (valid or otherwise) doesn't need to grab TCP stack locks in order to process these frames, or waste RAM with PCB (protocol control block) entries for these embryonic requests. It also makes it easier to time out half-completed requests - the PCB will only have completed or closing connections.
If the handshake succeeds but there's a failure in allocating a new PCB or socket for the connection, the TCP stack can return an RST to the peer.
If the syncache fills up, it should be sending syncookies. (google "SYN cookies" for more information.) The point of using SYN cookies is that it doesn't fill the syncache up with embryonic connections - there's a cookie that the client will reflect back to the server that validates the connection.
If the syncookie exchange suceeds but the application can't create new sockets fast enough (ie, servicing the accept() socket queue quickly enough), the TCP stack will throw an RST back at the client.
Now, for the fun bits.
- The RST responses back to the server are rate limited - via net.inet.icmp.icmplim. Yes, it's not just for rate limiting ICMP responses.
- So the client would see some connections hit an RST and fail immediately; others just wouldn't get the ACK and would try again, so..
- .. over time, there'd be a burst of new connections every second from the client (causing the issue) as well as the connection retransmits for embryonic-but-not-yet-finished connections
When I staggered the new connections over smaller, quicker bursts (so instead of 128 connections a second per thread, I'd do 12 connections every 100mS) then the problem went away. This is better behaviour (I can connect thousands of new connections a second here!) but I still expect to see this problem in the real world. As I approach my intended TCP connection rate (100,000 connections a second - which isn't specifically a Netflix requirement, but an "Adrian proxy load" requirement! - I'm going to start seeing microbursts of new connections that will temporarily look like back-to-back new connections, thus triggering this bug.
So, to work around this for now, one just has to bump up the accept queue depth (sysctl kern.ip.somaxconn) to something much higher than the default of 128.
Now - why is this happening? My theory is this:
- We're getting this burst of frames coming in the NIC;
- The syncache / cookie code is being run in the NIC RX path;
- The new connection path gets run and quickly overflows the syncache and new connection queue handling in the TCP stack, as the userland code doesn't get a notification in time
- .. so the accept queue overflows before userland gets a chance to run, and we start sending rate limited RSTs.