For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.
I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.
The test setup in question:
Testing software:
- http://github.com/erikarn/freebsd-rss
- It requires libevent2 - an updated copy; earlier versions of libevent2 didn't handle FreeBSD-specific errors gracefully and would error out of the IO loop early.
Server:
- CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz (3292.59-MHz K8-class CPU)
- There's no SMT/HTT, but I've disabled it in the BIOS just to be sure
- 4GB RAM
- FreeBSD-HEAD, amd64
- NIC: 82599EB 10-Gigabit SFI/SFP+ Network Connection
- ix0: 10.11.2.1/24
/etc/sysctl.conf:
# for now redirect processing just makes the lock overhead suck even more.
# disable it.
net.inet.ip.redirect=0
net.inet.icmp.drop_redirect=1
/boot/loader.conf:
hw.ix.num_queues=8
# experiment with deferred dispatch for RSS
net.isr.numthreads=4
net.isr.maxthreads=4
net.isr.bindthreads=1
kernel config:
include GENERIC
ident HACKBOX
device netmap
options RSS
options PCBGROUP
# in-system lock profiling
options LOCK_PROFILING
# Flowtable - the rtentry locking is a bit .. slow.
options FLOWTABLE
# This debugging code has too much overhead to do accurate
# testing with.
nooptions INVARIANTS
nooptions INVARIANT_SUPPORT
nooptions WITNESS
nooptions WITNESS_SKIPSPIN
The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.
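For a rough idea of what each worker thread is doing, here's a minimal sketch of one per-CPU UDP echo worker. This is not the actual freebsd-rss code (which is libevent2-driven and RSS-aware); the CPU pinning, the SO_REUSEPORT use and the buffer size here are assumptions, purely for illustration:

/*
 * Simplified sketch of one per-CPU UDP echo worker; illustration only.
 * The real rss-udp-srv is libevent2-driven and RSS-aware rather than a
 * blocking loop like this.
 */
#include <sys/param.h>
#include <sys/socket.h>
#include <sys/cpuset.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>

void
echo_worker(const char *addr, int port, int cpu)
{
	struct sockaddr_in sin, from;
	socklen_t fromlen;
	cpuset_t mask;
	char buf[2048];
	ssize_t n;
	int fd, one = 1;

	/* Pin this thread to the CPU that owns the RSS bucket it serves. */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	if ((fd = socket(PF_INET, SOCK_DGRAM, 0)) < 0)
		err(1, "socket");
	/* One socket per worker, all bound to the same address/port. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_len = sizeof(sin);
	sin.sin_port = htons(port);
	inet_pton(AF_INET, addr, &sin.sin_addr);
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) != 0)
		err(1, "bind");

	/* Receive each frame and echo it back to the sender. */
	for (;;) {
		fromlen = sizeof(from);
		n = recvfrom(fd, buf, sizeof(buf), 0,
		    (struct sockaddr *)&from, &fromlen);
		if (n < 0)
			continue;
		(void)sendto(fd, buf, (size_t)n, 0,
		    (struct sockaddr *)&from, fromlen);
	}
}

In the real tool one such worker runs per RSS bucket, each pinned to a distinct CPU, and the 1/0 argument to rss-udp-srv (used in the commands below) selects whether frames are echoed back or just received.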
Client:

The client box is slightly more powerful, to compensate for (currently) not using completely affinity-aware RSS UDP transmit code.

- CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz (3192.68-MHz K8-class CPU)
- SMT/HTT: Disabled in BIOS
- 8GB RAM
- FreeBSD-HEAD amd64
- Same kernel config, loader and sysctl config as the server
- ix0: configured as 10.11.2.2/24, 10.11.2.3/32, 10.11.2.4/32, 10.11.2.32/32, 10.11.2.33/32
Running things
The server side simply runs the listen server, configured to respond to each frame:

$ rss-udp-srv 1 10.11.2.1

The client side runs four copies of udp-clt, each from a different IP address. These are run in parallel (I do it in different screens, so I can quickly see what's going on):
$ ./udp-clt -l 10.11.2.3 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.4 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.32 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.33 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
The IP addresses are chosen so that the 2-tuple Toeplitz hash, using the default Microsoft key, hashes each flow to a different RSS bucket that lives on an individual CPU.
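If you want to sanity-check that mapping, the Toeplitz hash is simple to compute in userland. Below is a minimal sketch (not from freebsd-rss) that hashes the source/destination 2-tuple with the standard Microsoft verification key; the key, the input ordering and the "hash modulo four buckets" mapping are all assumptions for illustration - the kernel's RSS configuration owns the real key and the bucket-to-CPU mapping:

/*
 * Illustrative 2-tuple Toeplitz hash check.  The key, input ordering and
 * modulo-4 bucket mapping are assumptions; the kernel's RSS code is
 * authoritative for the real values.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* The standard Microsoft RSS verification key. */
static const uint8_t rss_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen, const uint8_t *data,
    size_t datalen)
{
	uint32_t hash = 0, window;
	size_t keybit = 32;		/* next key bit to shift in */

	/* Start with the first 32 key bits in the window. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];

	for (size_t i = 0; i < datalen; i++) {
		for (int b = 7; b >= 0; b--) {
			if (data[i] & (1U << b))
				hash ^= window;
			window <<= 1;
			if (keybit < keylen * 8 &&
			    (key[keybit / 8] & (0x80U >> (keybit % 8))))
				window |= 1;
			keybit++;
		}
	}
	return (hash);
}

int
main(void)
{
	const char *srcs[] = { "10.11.2.3", "10.11.2.4", "10.11.2.32",
	    "10.11.2.33" };
	struct in_addr src, dst;
	uint8_t buf[8];
	uint32_t h;

	inet_pton(AF_INET, "10.11.2.1", &dst);
	for (size_t i = 0; i < 4; i++) {
		inet_pton(AF_INET, srcs[i], &src);
		/* 2-tuple input: source address, then destination address. */
		memcpy(buf, &src, 4);
		memcpy(buf + 4, &dst, 4);
		h = toeplitz_hash(rss_key, sizeof(rss_key), buf, sizeof(buf));
		printf("%s -> hash 0x%08x, bucket %u\n", srcs[i], h, h % 4);
	}
	return (0);
}

The point is just to see which bucket each source address would land in; on the actual box the bucket count and CPU mapping come from the kernel's RSS configuration.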
Results: Round one
When the server is responding to each frame, the following occurs. The numbers are "number of frames generated by the client (netstat)", "number of frames received by the server (netstat)", "number of frames seen by rss-udp-srv", "number of responses transmitted by rss-udp-srv", "number of frames transmitted by the server (netstat)":

- 1 udp-clt process: 710,000; 710,000; 296,000; 283,000; 281,000
- 2 udp-clt processes: 1,300,000; 1,300,000; 592,000; 592,000; 575,000
- 3 udp-clt processes: 1,800,000; 1,800,000; 636,000; 636,000; 600,000
- 4 udp-clt processes: 2,100,000; 2,100,000; 255,000; 255,000; 255,000
There are a couple of parts to this.
Firstly - I had left turbo boost on. What this translated to:
- One core active: ~ 30% increase in clock speed
- Two cores active: ~ 30% increase in clock speed
- Three cores active: ~ 25% increase in clock speed
- Four cores active: ~ 15% increase in clock speed.
Secondly, and more importantly - I had left flow control enabled. This made a world of difference.

With both of those disabled, the revised results are mostly linear - with more active RSS buckets (and thus CPUs) things seem to get slightly more efficient:
- 1 udp-clt process: 710,000; 710,000; 266,000; 266,000; 266,000
- 2 udp-clt processes: 1,300,000; 1,300,000; 512,000; 512,000; 512,000
- 3 udp-clt processes: 1,800,000; 1,800,000; 810,000; 810,000; 810,000
- 4 udp-clt processes: 2,100,000; 2,100,000; 1,120,000; 1,120,000; 1,120,000
Finally, let's repeat the process, but only receiving frames instead of also echoing them back to the client:
$ rss-udp-srv 0 10.11.2.1
- 1 udp-clt process: 710,000; 710,000; 204,000
- 2 udp-clt processes: 1,300,000; 1,300,000; 378,000
- 3 udp-clt processes: 1,800,000; 1,800,000; 645,000
- 4 udp-clt processes: 2,100,000; 2,100,000; 900,000
The receive-only workload actually performs worse than the transmit + receive workload!
What's going on here?
Well, a little digging shows that in both instances - even with a single udp-clt process running, which means only one CPU on the server side is actually active! - there's active lock contention.
Here's an example dtrace output for measuring lock contention with only one active process, where one CPU is involved (and the other three are idle):
Receive only, 5 seconds:
root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C
kernel`udp_append+0x11c
kernel`udp_input+0x8cc
kernel`ip_input+0x116
kernel`netisr_dispatch_src+0x1cb
kernel`ether_demux+0x123
kernel`ether_nh_input+0x34d
kernel`netisr_dispatch_src+0x61
kernel`ether_input+0x26
kernel`ixgbe_rxeof+0x2f7
kernel`ixgbe_msix_que+0xb6
kernel`intr_event_execute_handlers+0x83
kernel`ithread_loop+0x96
kernel`fork_exit+0x71
kernel`0xffffffff80cd19de
46729281
Transmit + receive, 5 seconds:
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C
kernel`knote+0x7e
kernel`sowakeup+0x65
kernel`udp_append+0x14a
kernel`udp_input+0x8cc
kernel`ip_input+0x116
kernel`netisr_dispatch_src+0x1cb
kernel`ether_demux+0x123
kernel`ether_nh_input+0x34d
kernel`netisr_dispatch_src+0x61
kernel`ether_input+0x26
kernel`ixgbe_rxeof+0x2f7
kernel`ixgbe_msix_que+0xb6
kernel`intr_event_execute_handlers+0x83
kernel`ithread_loop+0x96
kernel`fork_exit+0x71
kernel`0xffffffff80cd19de
3793
kernel`udp_append+0x11c
kernel`udp_input+0x8cc
kernel`ip_input+0x116
kernel`netisr_dispatch_src+0x1cb
kernel`ether_demux+0x123
kernel`ether_nh_input+0x34d
kernel`netisr_dispatch_src+0x61
kernel`ether_input+0x26
kernel`ixgbe_rxeof+0x2f7
kernel`ixgbe_msix_que+0xb6
kernel`intr_event_execute_handlers+0x83
kernel`ithread_loop+0x96
kernel`fork_exit+0x71
kernel`0xffffffff80cd19de
3823793
kernel`ixgbe_msix_que+0xd3
kernel`intr_event_execute_handlers+0x83
kernel`ithread_loop+0x96
kernel`fork_exit+0x71
kernel`0xffffffff80cd19de
9918140
Somehow it seems there's less lock contention / blocking going on when both transmit and receive are running!
So then I dug into it using the lock profiling suite. This is for 5 seconds with receive-only traffic on a single RSS bucket / CPU (all other CPUs are idle):
root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.enable=1 ; sleep 5 ; sysctl debug.lock.prof.enable=0
debug.lock.prof.enable: 1 -> 1
debug.lock.prof.enable: 1 -> 0
root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.stats | head -2 ; sysctl debug.lock.prof.stats | sort -nk4 | tail -10
debug.lock.prof.stats:
max wait_max total wait_total count avg wait_avg cnt_hold cnt_lock name
1496 0 10900 0 28 389 0 0 0 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2755 (sx:USB config SX lock)
debug.lock.prof.stats:
0 0 31 1 67 0 0 0 4 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
0 0 2715 1 49740 0 0 0 7 /usr/home/adrian/work/freebsd/head/src/sys/dev/random/random_harvestq.c:294 (spin mutex:entropy harvest mutex)
1 0 51 1 131 0 0 0 2 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:1179 (spin mutex:sched lock 1)
0 0 69 2 170 0 0 0 8 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 2)
0 0 40389 2 287649 0 0 0 8 /usr/home/adrian/work/freebsd/head/src/sys/kern/kern_intr.c:1359 (spin mutex:sched lock 2)
0 2 2 4 12 0 0 0 2 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2762 (sleep mutex:Giant)
15 20 6556 520 2254 2 0 0 105 /usr/home/adrian/work/freebsd/head/src/sys/dev/acpica/Osd/OsdSynch.c:535 (spin mutex:ACPI lock (0xfffff80002b10f00))
4 5 195967 65888 3445501 0 0 0 28975 /usr/home/adrian/work/freebsd/head/src/sys/netinet/udp_usrreq.c:369 (sleep mutex:so_rcv)
Notice the lock contention for the so_rcv (socket receive buffer) handling? What's going on here is pretty amusing - it turns out that because there's so much receive traffic going on, the userland process receiving the data is being preempted by the NIC receive thread very often - and when this happens, there's a good chance it's going to be within the small window during which the receive socket buffer lock is held. Once this happens, the NIC receive thread processes frames until it gets to one that requires grabbing the same socket buffer lock that is already held by userland - and it fails - so the NIC thread sleeps until the userland thread finishes consuming a packet. Then the CPU flips back to the NIC thread and it continues processing frames.
When the userland code is also transmitting frames it's increasing the amount of time in between socket receives and decreasing the probability of hitting the lock contention condition above.
Note there's no contention between CPUs here - this is entirely contention within a single CPU.
So for now I'm happy that the UDP IPv4 path is scaling well enough with RSS on a single core. The main performance problem here is the socket receive buffer locking (and, yes, copyin() / copyout()).
Next!