I started digging deeper into the RSS performance on my home test platform. Four cores and one (desktop) socket isn't all that much, but it's a good starting point for this.
It turns out that there was some lock contention inside netisr. Which made no sense, as RSS should be keeping all the flows local to each CPU.
After a bunch of digging, I discovered that the NIC was occasionally receiving packets into the wrong ring. Have a look at tihs:
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff80047713d00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742e100; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c2e00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c5000; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742ec00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff8004727a700; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f11600; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80047279b00; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f0b700; flowid=0x335a5c03; rxr->me=2
The RX flowid was correct - I hashed the packets in software too and verified the software hash equaled the hardware hash. But they were turning up on the wrong receive queue. "rxr->me" is the queue id; the hardware should be hashing on the last 7 bits. 0x3 -> ring 3, 0x2 -> ring 2.
It also only happened when I was sending traffic to more than one receive ring. Everything was okay if I just transmitted to a single receive ring.
Luckily for me, some developers from Verisign saw some odd behaviour in their TCP stress testing and had dug in a bit further. They were seeing corrupted frames on the receive side that looked a lot like internal NIC configuration state. They figured out that the ixgbe(4) driver wasn't initialising the flow director and receive units correctly - the FreeBSD driver was not correctly setting up the amount of memory each was allocated on the NIC and they were overlapping. They also found a handful of incorrectly handled errors and double-freed mbufs.
So, with that all fixed, their TCP problem went away and my UDP tests started properly behaving themselves. Now all the flows are ending up on the right CPUs.
The flow director code was also dynamically programming flows into the NIC to try and rebalance traffic. Trouble is, I think it's a bit buggy and it's likely not working well with generic receive offload (LRO).
What's it mean for normal people? Well, it's fixed in FreeBSD-HEAD now. I'm hoping I or someone else will backport it to FreeBSD-10 soon. It fixes my UDP tests - now I hit around 1.3 million packets per second transmit and receive on my test rig; the server now has around 10-15% CPU free. It also fixed issues that Verisign were seeing with their high transaction rate TCP tests. I'm hoping that it fixes the odd corner cases that people have seen with Intel 10 gigabit hardware on FreeBSD and makes LRO generally more useful and stable.
Next up - some code refactoring, then finishing off IPv6 RSS!
 
 
Friday, September 19, 2014
Wednesday, September 10, 2014
Receive side scaling: testing UDP throughput
I think it's about time I shared some more details about the RSS stuff going into FreeBSD and how I'm testing it.
For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.
I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.
The test setup in question:
# for now redirect processing just makes the lock overhead suck even more.
# disable it.
net.inet.ip.redirect=0
net.inet.icmp.drop_redirect=1
/boot/loader.conf:
hw.ix.num_queues=8
# experiment with deferred dispatch for RSS
net.isr.numthreads=4
net.isr.maxthreads=4
net.isr.bindthreads=1
 
kernel config:
include GENERIC
ident HACKBOX
device netmap
options RSS
options PCBGROUP
# in-system lock profiling
options LOCK_PROFILING
# Flowtable - the rtentry locking is a bit .. slow.
options FLOWTABLE
# This debugging code has too much overhead to do accurate
# testing with.
nooptions INVARIANTS
nooptions INVARIANT_SUPPORT
nooptions WITNESS
nooptions WITNESS_SKIPSPIN
The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.
$ rss-udp-srv 1 10.11.2.1
The client-side runs four couples of udp-clt, each from different IP addresses. These are run in parallel (i do it in different screens, so I can quickly see what's going on):
$ ./udp-clt -l 10.11.2.3 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.4 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.32 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.33 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
The IP addresses are chosen so that the 2-tuple topelitz hash using the default Microsoft key hash to different RSS buckets that live on individual CPUs.
There are a couple of parts to this.
Firstly - I had left turbo boost on. What this translated to:
Finally, let's repeat the process but only receiving instead also echoing back the packet to the client:
$ rss-udp-srv 0 10.11.2.1
For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.
I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.
The test setup in question:
Testing software:
- http://github.com/erikarn/freebsd-rss
- It requires libevent2 - an updated copy; previous versions of libevent2 didn't handle FreeBSD specific errors gracefully and would early error out of the IO loop.
Server:
- CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz (3292.59-MHz K8-class CPU)
- There's no SMT/HTT, but I've disabled it in the BIOS just to be sure
- 4GB RAM
- FreeBSD-HEAD, amd64
- NIC: '82599EB 10-Gigabit SFI/SFP+ Network Connection
- ix0: 10.11.2.1/24
# for now redirect processing just makes the lock overhead suck even more.
# disable it.
net.inet.ip.redirect=0
net.inet.icmp.drop_redirect=1
/boot/loader.conf:
hw.ix.num_queues=8
# experiment with deferred dispatch for RSS
net.isr.numthreads=4
net.isr.maxthreads=4
net.isr.bindthreads=1
kernel config:
include GENERIC
ident HACKBOX
device netmap
options RSS
options PCBGROUP
# in-system lock profiling
options LOCK_PROFILING
# Flowtable - the rtentry locking is a bit .. slow.
options FLOWTABLE
# This debugging code has too much overhead to do accurate
# testing with.
nooptions INVARIANTS
nooptions INVARIANT_SUPPORT
nooptions WITNESS
nooptions WITNESS_SKIPSPIN
The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.
Client
The client box is slightly more powerful to compensate for (currently) not using completely affinity-aware RSS UDP transmit code.- CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz (3192.68-MHz K8-class CPU)
- SMT/HTT: Disabled in BIOS
- 8GB RAM
- FreeBSD-HEAD amd64
- Same kernel config, loader and sysctl config as the server
- ix0: configured as 10.11.2.2/24, 10.11.2.3/32, 10.11.2.4/32, 10.11.2.32/32, 10.11.2.33/32
Running things
The server-side simply runs the listen server, configured to respond to each frame:$ rss-udp-srv 1 10.11.2.1
The client-side runs four couples of udp-clt, each from different IP addresses. These are run in parallel (i do it in different screens, so I can quickly see what's going on):
$ ./udp-clt -l 10.11.2.3 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.4 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.32 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.33 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
The IP addresses are chosen so that the 2-tuple topelitz hash using the default Microsoft key hash to different RSS buckets that live on individual CPUs.
Results: Round one
When the server is responding to each frame, the following occurs. The numbers are "number of frames generated by the client (netstat)", "number of frames received by the server (netstat)", "number of frames seen by udp-rss-srv", "number of responses transmitted from udp-rss-srv", "number of frames seen by the server (netstat)"- 1 udp-clt process: 710,000; 710,000; 296,000; 283,000; 281,000
- 2 udp-clt processes: 1,300,000; 1,300,000; 592,000; 592,000; 575,000
- 3 udp-clt processes: 1,800,000; 1,800,000; 636,000; 636,000; 600,000
- 4 udp-clt processes: 2,100,000; 2,100,000; 255,000; 255,000; 255,000
There are a couple of parts to this.
Firstly - I had left turbo boost on. What this translated to:
- One core active: ~ 30% increase in clock speed
- Two cores active: ~ 30% increase in clock speed
- Three cores active: ~ 25% increase in clock speed
- Four cores active: ~ 15% increase in clock speed.
Secondly and more importantly - I had left flow control enabled. This made a world of difference.
The revised results are mostly linear - with more active RSS buckets (and thus CPUs) things seem to get slightly more efficient:
- 1 udp-clt process: 710,000; 710,000; 266,000; 266,000; 266,000
- 2 udp-clt processes: 1,300,000; 1,300,000; 512,000; 512,000; 512,000
- 3 udp-clt processes: 1,800,000; 1,800,000; 810,000; 810,000; 810,000
- 4 udp-clt processes: 2,100,000; 2,100,000; 1,120,000; 1,120,000; 1,120,000
Finally, let's repeat the process but only receiving instead also echoing back the packet to the client:
$ rss-udp-srv 0 10.11.2.1
- 1 udp-clt process: 710,000; 710,000; 204,000
- 2 udp-clt processes: 1,300,000; 1,300,000; 378,000
- 3 udp-clt processes: 1,800,000; 1,800,000; 645,000
- 4 udp-clt processes: 2,100,000; 2,100,000; 900,000
The receive-only workload is actually worse off versus the transmit + receive workload!
What's going on here?
Well, a little digging shows that in both instances - even with a single udp-clt thread running which means only one CPU on the server side is actually active! - there's active lock contention.
Here's an example dtrace output for measuring lock contention with only one active process, where one CPU is involved (and the other three are idle):
Receive only, 5 seconds:
root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C
              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
         46729281
Transmit + receive, 5 seconds:
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C
              kernel`knote+0x7e
              kernel`sowakeup+0x65
              kernel`udp_append+0x14a
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
             3793
              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          3823793
              kernel`ixgbe_msix_que+0xd3
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          9918140
Somehow it seems there's less lock contention / blocking going on when both transmit and receive is running!
So then I dug into it using the lock profiling suite. This is for 5 seconds with receive-only traffic on a single RSS bucket / CPU (all other CPUs are idle):
# sysctl debug.lock.prof.enable = 1; sleep 5 ; sysctl debug.lock.prof.enable=0
root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.enable=1 ; sleep 5 ; sysctl debug.lock.prof.enable=0
debug.lock.prof.enable: 1 -> 1
debug.lock.prof.enable: 1 -> 0
root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.stats | head -2 ; sysctl debug.lock.prof.stats | sort -nk4 | tail -10
debug.lock.prof.stats: 
     max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
    1496         0       10900           0          28    389      0  0      0 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2755 (sx:USB config SX lock)
debug.lock.prof.stats: 
       0         0          31           1          67      0      0  0      4 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
       0         0        2715           1       49740      0      0  0      7 /usr/home/adrian/work/freebsd/head/src/sys/dev/random/random_harvestq.c:294 (spin mutex:entropy harvest mutex)
       1         0          51           1         131      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:1179 (spin mutex:sched lock 1)
       0         0          69           2         170      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 2)
       0         0       40389           2      287649      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/kern_intr.c:1359 (spin mutex:sched lock 2)
       0         2           2           4          12      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2762 (sleep mutex:Giant)
      15        20        6556         520        2254      2      0  0    105 /usr/home/adrian/work/freebsd/head/src/sys/dev/acpica/Osd/OsdSynch.c:535 (spin mutex:ACPI lock (0xfffff80002b10f00))
       4         5      195967       65888     3445501      0      0  0  28975 /usr/home/adrian/work/freebsd/head/src/sys/netinet/udp_usrreq.c:369 (sleep mutex:so_rcv)
Notice the lock contention for the so_rcv (socket receive buffer) handling? What's going on here is pretty amusing - it turns out that because there's so much receive traffic going on, the userland process receiving the data is being preempted by the NIC receive thread very often - and when this happens, there's a good chance it's going to be within the small window that the receive socket buffer lock is held. Once this happens, the NIC receive thread processes frames until it gets to one that requires it to grab the same sock buffer lock that is already held by userland - and it fails - so the NIC thread sleeps until the userland thread finishes consuming a packet. Then the CPU flips back to the NIC thread and continues processing a packet.
When the userland code is also transmitting frames it's increasing the amount of time in between socket receives and decreasing the probability of hitting the lock contention condition above.
Note there's no contention between CPUs here - this is entirely contention within a single CPU.
So for now I'm happy that the UDP IPv4 path is scaling well enough with RSS on a single core. The main performance problem here is the socket receive buffer locking (and, yes, copyin() / copyout().)
Next!
Subscribe to:
Comments (Atom)
