The main tricky part of this is the difference between incoming, outgoing and listening sockets.
For incoming traffic, the NIC has already calculated the RSS hash value and there's already a mapping between RSS hash and destination CPU. Well, destination queue to be more precise; there's then a CPU assigned to service that queue.
For outgoing traffic, the thread(s) in question can be scheduled on any CPU core, and as you have more cores it's increasingly unlikely to be the right one. In FreeBSD, the default is to direct-dispatch transmit-related socket and protocol work in the thread that started it, save a handful of places like TCP timers. Once the driver if_transmit() method is called to transmit a frame, it can check the mbuf to see what the flowid is and map that to a destination transmit queue. Before RSS, that was typically done to keep packets in some semblance of in-order behaviour - ie, for a given traffic flow between two endpoints (say an IP, TCP, or UDP flow) the packets should be transmitted in-order. It wasn't really done for CPU affinity reasons.
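To make that concrete, here's a tiny standalone illustration of the common pre-RSS convention: pick a transmit queue from the frame's flow hash with a simple modulo so each flow sticks to one queue. The function name and queue count are made up for the example; real drivers do this inside if_transmit() with their own data structures.

```c
/*
 * Illustrative only: the common pre-RSS convention for picking a transmit
 * queue from a frame's flow hash, so a given flow always uses the same
 * queue and stays in-order.  Real drivers do this inside if_transmit();
 * the function name and queue count here are made up for the example.
 */
#include <stdint.h>
#include <stdio.h>

static int
flowid_to_txq(uint32_t flowid, int ntxq)
{
	/* Same flowid -> same queue; different flows spread across queues. */
	return (flowid % ntxq);
}

int
main(void)
{
	uint32_t flowids[] = { 0x1b3c5d7f, 0xdeadbeef, 0x00000004 };
	int ntxq = 4;

	for (size_t i = 0; i < sizeof(flowids) / sizeof(flowids[0]); i++)
		printf("flowid 0x%08x -> txq %d\n", flowids[i],
		    flowid_to_txq(flowids[i], ntxq));
	return (0);
}
```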
Before RSS, there was no real consistency in how drivers hashed traffic upon receive, nor any rules on how they should select an outbound transmit queue for a given buffer. Most multi-queue drivers got it "mostly right". They definitely didn't try to make any CPU affinity choices - it was all done to preserve the in-order behaviour of traffic flows.
For an incoming socket, all the information about the destination CPU can be calculated from the RSS hash provided during frame reception. So, for TCP, the RSS hash from the ACK received during the three-way handshake goes into the inpcb entry. For UDP it's not so simple (and the inpcb doesn't get a hash entry for UDP - I'll explain why below.)
For an outgoing socket, all the information about the eventual destination CPU isn't necessarily available. If the application knows the source/destination IP and source/destination port then it (or the kernel) can calculate the RSS hash that the hardware would calculate upon frame reception and use that to populate the inpcb. However this isn't typically known - frequently the source IP and port won't be explicitly defined and it'll be up to the kernel to choose them for the application. So, during socket creation, the destination CPU can't be known.
So to make it simple (and to make it simple for me to ensure the driver and protocol stack parts are working right) my focus has been on incoming sockets and incoming packets, rather than trying to handle outgoing sockets. I can handle outbound sockets easily enough - I just need to do a software hash calculation once all of the required information is available (ie, the source IP and port is selected) and populate the inpcb with that particular value. But I decided to not have to try and debug that at the same time as I debugged the driver side and the protocol stack side, so it's a "later" task.
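For reference, the software hash in question is the Toeplitz function that RSS-capable NICs implement, so the kernel (or an application) can reproduce the hardware's result once the full 4-tuple is known. Below is a minimal, standalone sketch: the key shown is the well-known Microsoft verification key rather than whatever key the kernel actually programs into the NIC (which, as mentioned later, may be randomised at boot), and the input ordering follows the usual RSS convention of source address, destination address, source port, destination port.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Well-known Microsoft RSS verification key (40 bytes). */
static const uint8_t rss_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/*
 * Toeplitz hash: for every set bit of the input, XOR in the 32-bit
 * window of the key starting at that bit position.
 */
static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0;
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];

	for (size_t i = 0; i < datalen; i++) {
		for (int b = 0; b < 8; b++) {
			if (data[i] & (0x80 >> b))
				hash ^= window;
			/* Slide the 32-bit key window left by one bit. */
			window <<= 1;
			if (i + 4 < keylen && (key[i + 4] & (0x80 >> b)))
				window |= 1;
		}
	}
	return (hash);
}

int
main(void)
{
	/* 4-tuple in network byte order: src IP, dst IP, src port, dst port. */
	uint8_t tuple[12];
	uint32_t src = inet_addr("10.0.0.1"), dst = inet_addr("10.0.0.2");
	uint16_t sport = htons(12345), dport = htons(80);

	memcpy(&tuple[0], &src, 4);
	memcpy(&tuple[4], &dst, 4);
	memcpy(&tuple[8], &sport, 2);
	memcpy(&tuple[10], &dport, 2);

	printf("RSS hash: 0x%08x\n",
	    toeplitz_hash(rss_key, sizeof(rss_key), tuple, sizeof(tuple)));
	return (0);
}
```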
For TCP, traffic for a given connection will use the same source/destination IP and source/destination port values. So for a given socket, it'll always hash to the same value. However, for UDP, it's quite possible to get UDP traffic from a variety of different source IP/ports and respond from a variety of different source/IP ports. This means that the RSS hash value that we can store in the inpcb isn't at all guaranteed to be the same for all subsequent socket writes.
Ok, so given all of that above information, how exactly is this supposed to work?
Well, the slightly more interesting and pressing problem is how to break out incoming requests/packets to multiple receive threads. In traditional UNIX socket setups, there are a couple of common design patterns for farming off incoming requests to multiple worker threads:
- There's one thread that just does accept() (for TCP) or recv() (for UDP) and it then farms off new connections to userland worker threads; or
- There are multiple userland worker threads which all wait on a single socket for accept() or recv() - and hope that the OS will only wake up one thread to hand work to.
I decided this wasn't really acceptable for the RSS work. I needed a way to redirect traffic to a thread that's also pinned to the same CPU as the receive RSS bucket. I decided the cheapest way would be to allow multiple PCB entries for the same socket details (eg, multiple TCP sockets listening on *:80). Since the PCBGROUPS code in this instance has one PCB hash per RSS bucket, all I had to do was to teach the stack that wildcard listen PCB entries (eg, *:80) could also exist in each PCB hash bucket and to use those in preference to the global PCB hash.
The idea behind this decision is pretty simple - Robert Watson already did all the great work setting up and debugging PCBGROUPS and then made the RSS work leverage that. All I'd have to do is have one userland thread per RSS bucket and have the listen socket for that thread live in that RSS bucket. Then any incoming packet would first check the PCBGROUP matching the RSS bucket indicated by the hardware-provided RSS hash - and it'd find the "right" PCB entry in the "right" PCBGROUP PCB hash table for the "right" RSS bucket.
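As a toy illustration of that lookup preference (emphatically not the actual in_pcbgroup code, and with the hash-to-bucket mapping reduced to a plain modulo for brevity): each RSS bucket holds its own wildcard *:80 listener, and the per-bucket entry is consulted before the global wildcard.

```c
/*
 * Toy model of the lookup order: prefer the wildcard listener that lives
 * in the packet's RSS bucket, then fall back to the single global entry.
 * Names, structures and the hash->bucket mapping are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

#define RSS_BUCKETS 4

struct listener {
	int valid;
	uint16_t port;		/* wildcard listeners match on port only */
};

static struct listener bucket_wild[RSS_BUCKETS];
static struct listener global_wild;

static struct listener *
lookup_listener(uint32_t rss_hash, uint16_t dport)
{
	int bucket = rss_hash % RSS_BUCKETS;

	/* The wildcard listener pinned to this RSS bucket wins... */
	if (bucket_wild[bucket].valid && bucket_wild[bucket].port == dport)
		return (&bucket_wild[bucket]);
	/* ... and only then do we fall back to the global wildcard. */
	if (global_wild.valid && global_wild.port == dport)
		return (&global_wild);
	return (NULL);
}

int
main(void)
{
	/* Four worker threads each registered a *:80 listener, one per bucket. */
	for (int i = 0; i < RSS_BUCKETS; i++)
		bucket_wild[i] = (struct listener){ 1, 80 };
	global_wild = (struct listener){ 1, 80 };

	printf("hash 0x1234 handled by bucket %d's *:80 listener (%p)\n",
	    0x1234 % RSS_BUCKETS, (void *)lookup_listener(0x1234, 80));
	return (0);
}
```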
That's what I did for both TCP and UDP.
So the programming model is thus (a rough code sketch follows the list):
- First, query the RSS sysctl (net.inet.rss) for the RSS configuration - this gives the number of RSS buckets and the RSS bucket -> CPU mapping.
- Then create one worker thread per RSS bucket..
- .. and pin each thread to the indicated CPU.
- Next, each worker thread creates one listen socket..
- .. sets the IP_BINDANY or IPV6_BINDANY option to indicate that there'll be multiple RSS entries bound to the given listen details (eg, binding to *:80);
- .. then sets IP_RSS_LISTEN_BUCKET to choose which RSS bucket the socket should live in;
- Then for UDP - call bind()
- Or for TCP - call bind(), then call listen()
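Here's a rough sketch of that model for a TCP listener. It assumes an RSS-enabled FreeBSD where <netinet/in.h> provides IP_BINDANY and IP_RSS_LISTEN_BUCKET, that the bucket count is readable via the net.inet.rss sysctl tree (the exact sysctl name used below is an assumption), and it fakes a 1:1 bucket-to-CPU mapping rather than parsing the real mapping out of the sysctl tree.

```c
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#include <err.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct worker {
	int bucket;	/* RSS bucket this thread services */
	int cpu;	/* CPU that bucket maps to */
};

static void *
rss_worker(void *arg)
{
	struct worker *w = arg;
	struct sockaddr_in sin;
	cpuset_t cs;
	int s, one = 1;

	/* Pin this thread to the CPU serving the RSS bucket. */
	CPU_ZERO(&cs);
	CPU_SET(w->cpu, &cs);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
	    sizeof(cs), &cs) != 0)
		err(1, "cpuset_setaffinity");

	if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0)
		err(1, "socket");

	/* Multiple wildcard listeners will share *:80. */
	if (setsockopt(s, IPPROTO_IP, IP_BINDANY, &one, sizeof(one)) < 0)
		err(1, "IP_BINDANY");

	/* Place this listen socket's PCB in the given RSS bucket's group. */
	if (setsockopt(s, IPPROTO_IP, IP_RSS_LISTEN_BUCKET, &w->bucket,
	    sizeof(w->bucket)) < 0)
		err(1, "IP_RSS_LISTEN_BUCKET");

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(80);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");
	if (listen(s, 128) < 0)
		err(1, "listen");

	/* accept() and service connections; they all arrive on this CPU. */
	for (;;) {
		int c = accept(s, NULL, NULL);
		if (c != -1)
			close(c);
	}
}

int
main(void)
{
	int nbuckets;
	size_t len = sizeof(nbuckets);

	/* Assumed sysctl name for the RSS bucket count. */
	if (sysctlbyname("net.inet.rss.buckets", &nbuckets, &len, NULL, 0) != 0)
		err(1, "sysctlbyname");

	for (int i = 0; i < nbuckets; i++) {
		struct worker *w = malloc(sizeof(*w));
		pthread_t tid;

		if (w == NULL)
			err(1, "malloc");
		w->bucket = i;
		/* Assume a 1:1 bucket -> CPU mapping for brevity. */
		w->cpu = i;
		if (pthread_create(&tid, NULL, rss_worker, w) != 0)
			err(1, "pthread_create");
	}
	pthread_exit(NULL);
}
```

The UDP variant is the same minus the listen() step, with SOCK_DGRAM instead of SOCK_STREAM.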
Each worker thread will then receive TCP connections / UDP frames that are local to that CPU. Writing data out the TCP socket will also stay local to that CPU. Writing UDP frames out doesn't - and I'm about to cover that.
Yes, it's annoying - you're no longer free to just pick an IO model that's convenient for your application / coding style. Oops.
Ok, so what's up with UDP?
The problem with UDP is that outbound responses may be sent to an arbitrary destination and thus may actually be considered "local" to another CPU. Most common services don't do this - they'll send the UDP response to the same remote IP and port that the request came from.
My plan for UDP (and TCP in some instances, see below!) is four-fold:
- When receiving UDP frames, optionally mark them with RSS hash and flowid information.
- When transmitting UDP frames, allow userspace to inform the kernel about a pre-calculated RSS hash / flow information (a sketch of these first two items follows the list).
- For the fully-connected setup (ie, where a single socket is connect() ed to a given UDP remote IP:port and frame exchange only occurs between the fixed IP and port details) - cache the RSS flow information in the inpcb;
- .. and for all other situations (if it's not connected, if there's no hint from userland, if it's going to a destination that isn't in the inpcb) - just do a software hash calculation on the outgoing details.
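As a rough sketch of what the first two items might look like from userland - and to be clear, the IP_RECVFLOWID option and IP_FLOWID control message names here are placeholders for whatever ends up committed, not a settled interface:

```c
/*
 * Sketch only: receive a UDP datagram along with its flow/hash information
 * via a control message.  IP_RECVFLOWID / IP_FLOWID are assumed names for
 * an interface that hadn't been committed at the time of writing.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct sockaddr_in sin, from;
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cmsg;
	char buf[1500], cbuf[256];
	int s, one = 1;
	ssize_t n;

	if ((s = socket(PF_INET, SOCK_DGRAM, 0)) < 0)
		err(1, "socket");

	/* Ask the kernel to attach flow/hash info to received datagrams. */
	if (setsockopt(s, IPPROTO_IP, IP_RECVFLOWID, &one, sizeof(one)) < 0)
		err(1, "IP_RECVFLOWID");

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(8080);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");

	iov.iov_base = buf;
	iov.iov_len = sizeof(buf);
	memset(&msg, 0, sizeof(msg));
	msg.msg_name = &from;
	msg.msg_namelen = sizeof(from);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	if ((n = recvmsg(s, &msg, 0)) < 0)
		err(1, "recvmsg");

	/* Walk the control messages looking for the RSS hash / flowid. */
	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
	    cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == IPPROTO_IP &&
		    cmsg->cmsg_type == IP_FLOWID) {
			uint32_t flowid;

			memcpy(&flowid, CMSG_DATA(cmsg), sizeof(flowid));
			printf("%zd bytes, flowid 0x%08x\n", n, flowid);
			/*
			 * The same flowid could then be handed back on the
			 * reply (e.g. as a control message on sendmsg()) so
			 * the transmit path can reuse it rather than rehash.
			 */
		}
	}
	return (0);
}
```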
I mostly have the first two UDP options implemented (ie, where userland caches the flow information to re-use when transmitting the response) and I'll commit them to FreeBSD soon. The other two options are the "correct" way to do the default methods, but they'll take some time to get right.
Ok, so does it work?
I don't have graphs. Mostly because I'm slack. I'll do up some before I present this - likely at BSDCan 2015.
My testing has been done with Intel 1G and 10G NICs on desktop Ivy Bridge 4-core hardware. So yes, server class hardware will behave better.
For incoming TCP workloads (eg a webserver), yes - there's no longer any lock contention between CPUs in the NIC driver or network stack. The main lock contention between CPUs is now in the VM and allocator paths. If you're doing disk IO then that'll also show up.
For incoming UDP workloads, I've seen it scale linearly on 10G NICs (ixgbe(4)) from one to four cores. This is with 510 byte datagrams, so no fragmentation or defragmentation is involved.
Ie, 1 core reception (all flows to one core) was ~ 250,000 pps into userland with just straight UDP reception and no flow/hash information via recvmsg(), and ~ 135,000 pps into userland with UDP reception plus flow/hash information via recvmsg().
4 core reception was ~ 1.1 million pps into userland, roughly ~ 255,000 pps per core. There's no contention between CPU cores at all.
Unfortunately what I was sending was markedly more than what made it into userland. The driver quite happily received 1.1 million frames on one queue and up to 2.1 million when all four queues were busy. So there's definitely room for improvement.
Now, there is lock contention - it's just not between CPU cores. With the between-core contention out of the way, the within-core contention shows up.
For TCP HTTP request reception and bulk response transmission, most of the contention I'm currently seeing is between the driver transmit paths. So, the following occurs:
- TCP stack writes some data out;
- NIC if_transmit() method is called;
- It tries to grab the queue lock and succeeds;
But then, whilst the transmit lock is still held and the driver is taking frames from the buf_ring to push into the NIC TX DMA queue:
- The NIC queue interrupt fires, scheduling the software interrupt thread;
- This pre-empts the existing running transmit thread;
- The NIC code tries to grab the transmit lock to handle completed transmissions;
- .. and it fails, because the code it preempted holds the transmit lock already.
So there's some context switching and thrashing going on there which needs to be addressed.
Ok, what about UDP? It turns out there's some lock contention with the socket receive buffer.
The soreceive_dgram() routine grabs the socket receive buffer lock (SOCKBUF_LOCK()) to see if there's anything to return. If not, and if it can sleep, it'll call sbwait(), which releases the lock and msleep()s waiting for the protocol stack to indicate that something has been received. However, since we're receiving packets at such a high rate, it seems that the receive protocol path contends with the socket buffer lock held by the userland code trying to receive a datagram. It pre-empts the user thread, tries to grab the lock and fails - and then goes to sleep until the userland code finishes with the lock. soreceive_dgram() doesn't hold the lock for very long - but I do see upwards of a million context switches a second.
To wrap up - I'm pleased with how things are going. I've found and fixed some issues with the igb(4) and ixgbe(4) drivers that were partly my fault and the traffic is now quite happily and correctly being processed in parallel. There are issues with scaling within a core that are now being exposed and I'm glad to say I'm going to ignore them for now and focus on wrapping up what I've started.
There's a bunch more to talk about and I'm going to do it in follow-up posts.
- what I'm going to do about UDP transmit in more detail;
- what about creating outbound connections and how applications can be structured to handle this;
- handling IP fragments and rehashing packets to be mostly in-order - and what happens when we can't guarantee ordering with the hardware hashing UDP frames to a 4-tuple;
- CPU hash rebalancing - what if a specific bucket gets too much CPU load for some reason;
- randomly creating a Toeplitz RSS hash key at bootup and how that should be verified;
- multi-socket CPU and IO domain awareness;
- .. and whatever else I'm going to stumble across whilst I'm slowly fleshing this stuff out.
I hope to get the UDP transmit side of things completed in the next couple of weeks so I can teach memcached about TCP and UDP RSS. After that, who knows!