Saturday, October 31, 2015

Fixing up the QCA9558 performance on FreeBSD, or "why attention to detail matters when you're a kernel developer."

When I started with this Atheros MIPS 11n stuff a few years ago, my first test board was a Routerstation Pro with a pair of AR9160 NICs. I could get ~ 150mbit/sec bridging performance out of it, and I thought I was doing pretty good.

Fast forward to now, and I've been bringing FreeBSD up on each of the subsequent boards. But the performance never improved. Now, I never bothered to look into it because I was always too busy with my day job, but finally someone trolled me correctly on the FreeBSD embedded IRC channel and I took a look.

It turns out that.. things could've benefitted from a lot of improvement.

First up - I'm glad George Neville-Neil brought up PMC (performance counters) on the MIPS24k platform. It made it easier for me to bring it up on the MIPS74k platform and it was absolutely instrumental in figuring out performance issues here. No, there's no real ability to get DTrace up on these boards - some have 32MB of RAM. Heck, the packet filter (bpf) consumes most of a megabyte of RAM when you first start it up.

My initial tests are on an AP135 reference design board from Qualcomm Atheros. It's a QCA9558 SoC with an AR8327 switch on board. Both on-chip ethernet ports (arge0, arge1) are available. I set it up as a straight bridge between arge0 and arge1 and then I used iperf between two laptops to measure performance.

The first test - 130mbit bridging performance. That's terrible for this platform.

So I fired up hwpmc, and I found the first problem - packets were being copied in the receive and transmit path. Since I'm more familiar with the transmit path, I decided to look into that.

The AR7161 MAC requires both transmit and receive buffers to be both DWORD (32 bit or 4 bytes) aligned. In addition, all transmit frames save the last frame are required to be a multiple of DWORD in length. Plenty of frames don't meet this requirement and end up being copied.

The AR7240 and later MAC relaxed this - transmit/receive buffers can now be byte-aligned. So that particular workaround can be removed. It still needs to do it for multi-descriptor transmits that aren't DWORD sized (eg if you just prepend a fresh ethernet header) but that doesn't happen in the bridging path in the normal case.

Fixing that got bridging performance from 130mbit to 180mbit. That's not a huge difference, but it's something.

Next up is the receive path.  This was more .. complicated. The receive code copies the whole buffer back two bytes in order to ensure that the IP payload presented to the FreeBSD network stack is aligned. This is a problem in FreeBSD's network stack - it assumes the hardware handles unaligned accesses fine. So if your receive engine is DWORD aligned, the 14 byte ethernet header will result in the start of the IP payload being non-DWORD aligned, and .. the stack blows up. Now, I have vague plans to start fixing that as a general rule, but I did the next worst hack - I grabbed a buffer, set its RX start point to two bytes in, so the ethernet header is unaligned but the IP header is. Now, the ethernet stack in FreeBSD handles unaligned stuff correctly, so that works.

Except it wasn't faster. It turns out that the MIPS busdma code was doing very inefficient things with mbuf handling if everything wasn't completely aligned. Ian Lepore (who does ARM work) recently fixed this for armv6, so he ported it to MIPS and I added it.

The result? bridging performance leaped from 180mbit to 420mbit. Quite nice, but not where Linux was.

I left it for a few days, and someone on the freebsd-mips mailing list pointed out big stability issues with his tests. I started looking at the Linux OpenWRT driver and the MIPS24K/MIPS74K memory coherency operations. I found a couple of interesting things:

  • The busdma sync code never did a "SYNC" operation if things weren't being copied or invalidated; and
  • I was using cache-writethrough instead of cache-writeback for the cached memory attribute for MIPS74K.
The former is a problem with driver memory / driver access sync - you need to ensure that the changes you've made are actually in memory before you tell the hardware to look at it. So I fixed that in the busdma routines.

The latter makes everything slow. It means each write is going through the cache and into memory - the cache hardware doesn't get to batch writes to memory. I changed that, and found more instability in some parts of the arge ethernet driver - the MDIO bus accesses started misbehaving. After looking at the Linux code and the sync operations, I reimplemented the MDIO code correctly and I added explicit read/write barriers as needed. The MDIO code does lots of same-register accesses in loops to look for things, and the hardware may subtly reorder things. I committed this, flipped on the correct cache attribute to support cache-writeback, and things got .. faster. Much faster in fact.

So, that worked - and I hit the hardware instability issue. But, I hit it at a higher traffic rate. The final thing was fixed by looking at the OpenWRT driver (ag71xx_main.c) and going "Aha!" - the transmit side was buggy.

Specifically - the transmit side is a linked list of descriptors, but it's formed into a ring. The TX DMA engine stops when it hits a descriptor that isn't marked "ready" (ie, has ARGE_DESC_EMPTY set.) Now, we didn't see this before when we were copying transmit buffers for a packet into a single correct transmit buffer, but now that I am doing multi-descriptor transmit more frequently, this bug was hit. The bug is that because the TX descriptors are in a big ring, it's possible the hardware will transmit everything and hit the end of the ring before we've completely setup the descriptors for the next packet. If this happens, and it hits ARGE_DESC_EMPTY, then it stops. But if we have say a 3 descriptor packet, and we set the descriptors up in order, the hardware may hit that first descriptor out of three before we've finished setting things up, and start transmitting. It hits a descriptor we've not setup yet, thinks we're done, and transmits what it's seen. Then when I finish the setup and hit "transmit" on the hardware, it stalls, and everything sticks.

The fix was to initialise the first descriptor as EMPTY, then when we're done setting them up, flip that first descriptor to non-empty.

And voila! The bug is fixed and things perform now at a much faster rate - 720mbit. Yup, it bridges at 720mbit and it routes at around 320mbit. I'd like to get routing up from 320mbit to somewhere near bridge performance, but that'll have to wait a while.

120mbit -> 720mbit. Yup, I'm happy with that.

No comments:

Post a Comment