Saturday, October 31, 2015

Fixing up the QCA9558 performance on FreeBSD, or "why attention to detail matters when you're a kernel developer."

When I started with this Atheros MIPS 11n stuff a few years ago, my first test board was a Routerstation Pro with a pair of AR9160 NICs. I could get ~ 150mbit/sec bridging performance out of it, and I thought I was doing pretty good.

Fast forward to now, and I've been bringing FreeBSD up on each of the subsequent boards. But the performance never improved. Now, I never bothered to look into it because I was always too busy with my day job, but finally someone trolled me correctly on the FreeBSD embedded IRC channel and I took a look.

It turns out that.. things could've benefitted from a lot of improvement.

First up - I'm glad George Neville-Neil brought up PMC (performance counters) on the MIPS24k platform. It made it easier for me to bring it up on the MIPS74k platform and it was absolutely instrumental in figuring out performance issues here. No, there's no real ability to get DTrace up on these boards - some have 32MB of RAM. Heck, the packet filter (bpf) consumes most of a megabyte of RAM when you first start it up.

My initial tests are on an AP135 reference design board from Qualcomm Atheros. It's a QCA9558 SoC with an AR8327 switch on board. Both on-chip ethernet ports (arge0, arge1) are available. I set it up as a straight bridge between arge0 and arge1 and then I used iperf between two laptops to measure performance.

The first test - 130mbit bridging performance. That's terrible for this platform.

So I fired up hwpmc, and I found the first problem - packets were being copied in the receive and transmit path. Since I'm more familiar with the transmit path, I decided to look into that.

The AR7161 MAC requires both transmit and receive buffers to be DWORD (32 bit or 4 bytes) aligned. In addition, all transmit frames save the last frame are required to be a multiple of DWORD in length. Plenty of frames don't meet this requirement and end up being copied.

The AR7240 and later MAC relaxed this - transmit/receive buffers can now be byte-aligned. So that particular workaround can be removed. It still needs to do it for multi-descriptor transmits that aren't DWORD sized (eg if you just prepend a fresh ethernet header) but that doesn't happen in the bridging path in the normal case.
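To make the two rules concrete, here's a rough sketch of the kind of check that decides whether a transmit chain can go straight to DMA or has to be copied first. The struct and function names are made up for illustration and are not taken from the arge driver; strict_align stands in for the AR7161-style requirement.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one buffer in a transmit chain. */
struct tx_seg {
	uintptr_t addr;	/* bus address of the buffer */
	size_t len;	/* length in bytes */
};

/*
 * Return true if the chain can be handed to the MAC without copying.
 * strict_align models the older rule that every buffer must start on
 * a 4-byte boundary; the length rule is checked in both cases.
 */
bool
tx_chain_is_dma_safe(const struct tx_seg *segs, size_t nsegs,
    bool strict_align)
{
	for (size_t i = 0; i < nsegs; i++) {
		if (strict_align && (segs[i].addr & 3) != 0)
			return (false);
		/* Every buffer except the last must be a multiple of 4 bytes. */
		if (i != nsegs - 1 && (segs[i].len & 3) != 0)
			return (false);
	}
	return (true);
}
```

A single unaligned buffer passes once strict_align is off, but a 14 byte header buffer followed by a payload buffer still fails the length rule and would need the copy.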

Fixing that got bridging performance from 130mbit to 180mbit. That's not a huge difference, but it's something.

Next up is the receive path. This was more .. complicated. The receive code copies the whole buffer back two bytes in order to ensure that the IP payload presented to the FreeBSD network stack is aligned. This is a problem in FreeBSD's network stack - it assumes the hardware handles unaligned accesses fine. So if your receive engine is DWORD aligned, the 14 byte ethernet header will result in the start of the IP payload being non-DWORD aligned, and .. the stack blows up. Now, I have vague plans to start fixing that as a general rule, but I did the next worst hack - I grabbed a buffer and set its RX start point to two bytes in, so the ethernet header is unaligned but the IP header is aligned. Now, the ethernet stack in FreeBSD handles unaligned stuff correctly, so that works.
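The arithmetic behind the two byte offset is easy to check. This is just a sketch with made-up helper names; RX_PAD has the same value FreeBSD uses for ETHER_ALIGN, and the buffer itself is assumed to start on a 4-byte boundary.

```c
#include <stdbool.h>
#include <stddef.h>

#define ETH_HDR_LEN	14	/* dst MAC (6) + src MAC (6) + ethertype (2) */
#define RX_PAD		2	/* illustrative; same value as ETHER_ALIGN */

/* Offset of the IP header, given where the MAC starts writing the frame. */
size_t
rx_ip_offset(size_t rx_start)
{
	return (rx_start + ETH_HDR_LEN);
}

/* True if an offset into a 4-byte aligned buffer is itself 4-byte aligned. */
bool
is_dword_aligned(size_t off)
{
	return ((off & 3) == 0);
}
```

Starting the frame at offset 0 puts the IP header at offset 14, which is not a multiple of 4. Starting at offset 2 puts it at 16, which is.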

Except it wasn't faster. It turns out that the MIPS busdma code was doing very inefficient things with mbuf handling if everything wasn't completely aligned. Ian Lepore (who does ARM work) recently fixed this for armv6, so he ported it to MIPS and I added it.

The result? Bridging performance leaped from 180mbit to 420mbit. Quite nice, but not where Linux was.

I left it for a few days, and someone on the freebsd-mips mailing list pointed out big stability issues with his tests. I started looking at the Linux OpenWRT driver and the MIPS24K/MIPS74K memory coherency operations. I found a couple of interesting things:

  • The busdma sync code never did a "SYNC" operation if things weren't being copied or invalidated; and
  • I was using cache-writethrough instead of cache-writeback for the cached memory attribute for MIPS74K.

The former is a problem with driver memory / driver access sync - you need to ensure that the changes you've made are actually in memory before you tell the hardware to look at it. So I fixed that in the busdma routines.

The latter makes everything slow. It means each write is going through the cache and into memory - the cache hardware doesn't get to batch writes to memory. I changed that, and found more instability in some parts of the arge ethernet driver - the MDIO bus accesses started misbehaving. After looking at the Linux code and the sync operations, I reimplemented the MDIO code correctly and I added explicit read/write barriers as needed. The MDIO code does lots of same-register accesses in loops to look for things, and the hardware may subtly reorder things. I committed this, flipped on the correct cache attribute to support cache-writeback, and things got .. faster. Much faster in fact.
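For a feel of what "same-register accesses in loops" with explicit barriers looks like, here's a userland sketch. It is not the arge MDIO code: the register layout and the MII_BUSY bit are hypothetical, and the C11 fence is only a stand-in for the kernel's own barrier primitives (a C11 fence on its own doesn't promise anything about real device registers).

```c
#include <stdatomic.h>
#include <stdint.h>

#define MII_BUSY	0x1u	/* hypothetical "operation in progress" bit */

/*
 * Poll a status register until the busy bit clears or we run out of
 * tries. Returns 0 on success and -1 on timeout.
 */
int
mdio_wait_idle(volatile uint32_t *status, int tries)
{
	while (tries-- > 0) {
		/* volatile forces a fresh load on every pass. */
		uint32_t v = *status;

		/* Order this read against any later register accesses. */
		atomic_thread_fence(memory_order_seq_cst);
		if ((v & MII_BUSY) == 0)
			return (0);
	}
	return (-1);
}
```

The point is that each pass re-reads the register and the read is fenced off from whatever comes next, so neither the compiler nor the CPU gets to merge or shuffle the accesses.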

So, that worked - and I hit the hardware instability issue. But, I hit it at a higher traffic rate. The final thing was fixed by looking at the OpenWRT driver (ag71xx_main.c) and going "Aha!" - the transmit side was buggy.

Specifically - the transmit side is a linked list of descriptors, but it's formed into a ring. The TX DMA engine stops when it hits a descriptor that isn't marked "ready" (ie, has ARGE_DESC_EMPTY set.) Now, we didn't see this before when we were copying transmit buffers for a packet into a single correct transmit buffer, but now that I am doing multi-descriptor transmit more frequently, this bug was hit. The bug is that because the TX descriptors are in a big ring, it's possible the hardware will transmit everything and hit the end of the ring before we've completely set up the descriptors for the next packet. If this happens, and it hits ARGE_DESC_EMPTY, then it stops. But if we have say a 3 descriptor packet, and we set the descriptors up in order, the hardware may hit that first descriptor out of three before we've finished setting things up, and start transmitting. It hits a descriptor we've not set up yet, thinks we're done, and transmits what it's seen. Then when I finish the setup and hit "transmit" on the hardware, it stalls, and everything sticks.

The fix was to initialise the first descriptor as EMPTY, then when we're done setting them up, flip that first descriptor to non-empty.
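Here's a minimal sketch of that ordering, written as plain userland C rather than lifted from the driver. DESC_EMPTY stands in for ARGE_DESC_EMPTY, and the descriptor layout, ring size and function name are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DESC_EMPTY	0x80000000u	/* stand-in for ARGE_DESC_EMPTY */
#define RING_SIZE	8u		/* hypothetical ring size */

struct tx_desc {
	uint32_t addr;	/* buffer address */
	uint32_t ctl;	/* buffer length plus the EMPTY flag */
};

/*
 * Queue one packet made of nsegs buffers, starting at ring slot 'first'.
 * Returns the next free slot.
 */
unsigned
tx_enqueue(struct tx_desc *ring, unsigned first, const uint32_t *addrs,
    const uint32_t *lens, unsigned nsegs)
{
	assert(nsegs > 0 && nsegs < RING_SIZE);

	for (unsigned i = 0; i < nsegs; i++) {
		struct tx_desc *d = &ring[(first + i) % RING_SIZE];

		d->addr = addrs[i];
		/* Leave the first descriptor EMPTY so the engine stops on it. */
		d->ctl = lens[i] | (i == 0 ? DESC_EMPTY : 0);
	}

	/* All descriptor writes must be visible before the hand-over. */
	atomic_thread_fence(memory_order_release);

	/* Only now flip the first descriptor to non-empty. */
	ring[first % RING_SIZE].ctl &= ~DESC_EMPTY;
	return ((first + nsegs) % RING_SIZE);
}
```

Until that last store, the hardware sees an EMPTY descriptor at the head of the packet and can't run ahead into half-built descriptors behind it.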

And voila! The bug is fixed and things perform now at a much faster rate - 720mbit. Yup, it bridges at 720mbit and it routes at around 320mbit. I'd like to get routing up from 320mbit to somewhere near bridge performance, but that'll have to wait a while.

130mbit -> 720mbit. Yup, I'm happy with that.

Tuesday, October 13, 2015

Fixing up the RTL8188SU/RTL8192SU 802.11bgn driver (rsu)

I recently figured out most of the missing pieces for 11n support and stability with the rsu driver in FreeBSD. This handles support for the RTL8188SU and RTL8192SU chipsets. I'll cover what I found and fixed in this post.

First off - the driver was in reasonably poor shape. It sometimes panicked when the NIC was removed, it didn't support 802.11n at all and it wouldn't associate reliably. I was pestered enough by one of the original users behind getting the driver ported (Idwer! Hi!) and decided I'd give fixing it up a go.

Importantly - it's a mostly real fullmac device. "fullmac" here means that the firmware on the device does almost everything interesting - it handles association, it can do encryption/decryption for you if you want, it'll handle retransmission and transmit rate control. There are some important things it doesn't do - I'll cover those shortly.

Here's a fun bit of trivia - this firmware outputs text debugging via a magic firmware notification, and it's on by default. This made all of the debugging much, much easier as I didn't have to guess so much about what was going on in the firmware. All firmware developers - please do this. Please!

I first looked at the association issue. The device does full scan offload - you send it a firmware command to start scanning and it'll return scan results as they come in. Plenty of firmware devices do this. Then you send it an association message, then a join_bss message. For those looking at the source - rsu_site_survey() starts the scan, and rsu_join_bss() attempts an association. Now, I noticed that it was sending a join message before the site survey finished. I also noticed that I never really received any management frames, and when I used a sniffer to see what was going on, I saw double-associations sometimes occurring.

I then checked OpenBSD. Their driver just stubbed out the management frame transmit routine. This wasn't done in FreeBSD, so I added it. It turns out the firmware here does all management frame transmit and receive, so I just plainly have to do none of it. This tidied things up a bit but it didn't fix association.

Next up - the whole way scan results were pushed into net80211 was wrong. Sometimes scan results ended up on the wrong channel. The driver was doing dirty things to the current channel state directly and then faking a beacon to the net80211 stack. I replaced that with some code I wrote for the 7260 wifi driver - the stack now accepts a channel (and other things!) as part of the receive frame, so you can do proper off-channel frame reception. This tidied up the scan results so they were now consistent.

Then I thought about an evil hack - how about delaying the call to rsu_join_bss() until after the survey finished? That worked - associations were now very reliable.
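The shape of that hack is a tiny bit of deferred work. The sketch below is not the rsu code: the state struct and function names are invented, and joins_sent just counts how many join commands would have gone to the firmware.

```c
#include <stdbool.h>

/* Hypothetical association state; not the real rsu softc. */
struct assoc_state {
	bool survey_running;	/* a site survey is in flight */
	bool join_pending;	/* a join was requested mid-survey */
	int joins_sent;		/* join commands actually issued */
};

/* Called when the stack wants to associate. */
void
request_join(struct assoc_state *s)
{
	if (s->survey_running)
		s->join_pending = true;	/* hold it until the survey ends */
	else
		s->joins_sent++;
}

/* Called when the firmware reports that the survey finished. */
void
survey_done(struct assoc_state *s)
{
	s->survey_running = false;
	if (s->join_pending) {
		s->join_pending = false;
		s->joins_sent++;
	}
}
```

A join requested mid-survey is remembered and issued exactly once when the survey completes, instead of racing the scan.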


Now the device associated reliably and worked okay. There were some missing bits for the firmware setup path for doing things like power saving, saying how many transmit/receive streams are available, etc, but those were easy to add. Next up - 802.11n.

On the receive side, 802.11n requires you to do A-MPDU reordering as the transmitter is free to retransmit failed frames out of order. But the net80211 stack only handled the case where it saw the management frames and it itself drove the A-MPDU negotiation. Here, the firmware drives the negotiation and just tells us what's just happened. So, I had to extend net80211 to be told what the A-MPDU parameters are. It turned out that yes, the firmware sends a notification about A-MPDU going up, but it doesn't tell you how big the block-ack window is. Sigh. So, I needed to add that.
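As an aside on why the block-ack window size matters: the reorder logic has to decide whether a received sequence number falls inside the current window, and 802.11 sequence numbers are 12 bits, so they wrap at 4096. This is a generic sketch of that test with a made-up function name, not the net80211 implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_MODULO	4096u	/* 802.11 sequence numbers are 12 bits */

/*
 * Return true if 'seq' lies inside the reorder window that starts at
 * 'win_start' and is 'win_size' frames long, allowing for wraparound.
 */
bool
seq_in_window(uint16_t win_start, uint16_t win_size, uint16_t seq)
{
	uint16_t off = (uint16_t)((seq - win_start) & (SEQ_MODULO - 1));

	return (off < win_size);
}
```

Without knowing win_size, there's no way to tell a frame that should be buffered for reordering from one that is outside the window.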

But the access point still wasn't negotiating it. Here was the next fun bit - in rsu_join_bss() the driver lets the stack assemble optional IEs to send to the access point and, the more interesting part, the firmware looks at said IEs for an idea of what its own configuration should be. I added the HTINFO IE and voila! It started negotiating 802.11n.

(Oh, and I had to add M_AMPDU to each RX'ed frame from an 802.11n node before I called net80211, or the receive code would never do A-MPDU reorder processing.)

The final hack - I stubbed out the A-MPDU TX negotiation so we would never attempt to do it. So yes, there's no TX aggregation support, but that's fine for now.

Then Idwer told me it wasn't working for him. After much digging with the Linux driver authors (Thanks Christian and Larry!) we found that the OpenBSD driver tried to program the chip directly for 40MHz mode and that's wrong - instead, I just missed one of the 802.11n IEs. The firmware looked into that to see what the channel setup should've been. Two lines of diff later and I was on at 40MHz wide modes.

Finally - stability. It turns out that the USB drivers do inconsistent things when it comes to the detach path. They're supposed to stop transmit/receive, then flush buffers which flushes the net80211 node references, and then tear down the net80211 interface. Some, eg if_rsu, were doing it the other way. I fixed if_rsu and if_urtwn - they're now both stable.
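To show why the order matters, here's a toy model. Everything in it is hypothetical (the struct, the function names, the idea that each queued frame holds exactly one node reference); it only illustrates that tearing down the interface is safe once I/O has stopped and the queued frames have given their references back.

```c
/* Toy model of a USB wifi driver's state; not a real softc. */
struct fake_softc {
	int running;		/* transmit/receive is active */
	int queued_frames;	/* frames sitting in driver buffers */
	int node_refs;		/* node references held by those frames */
	int iface_attached;	/* the net80211 interface still exists */
};

void
stop_io(struct fake_softc *sc)
{
	sc->running = 0;
}

/* Dropping queued frames releases the node references they hold. */
void
flush_buffers(struct fake_softc *sc)
{
	sc->node_refs -= sc->queued_frames;
	sc->queued_frames = 0;
}

/* Refuse to tear down while I/O runs or references are outstanding. */
int
teardown_iface(struct fake_softc *sc)
{
	if (sc->running || sc->node_refs != 0)
		return (-1);
	sc->iface_attached = 0;
	return (0);
}

/* Detach in the order described above: stop, flush, then tear down. */
int
sketch_detach(struct fake_softc *sc)
{
	stop_io(sc);
	flush_buffers(sc);
	return (teardown_iface(sc));
}
```

Calling teardown_iface() first on a busy device fails in this model; in a real driver there's no such check, which is where the instability comes from.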

Thursday, October 1, 2015

As requested: progress of AR9170

Hi!

The progress of the AR9170 FreeBSD-ification can be found here:

https://github.com/erikarn/otus

Yes, I did actually keep the history of the driver bring-up here.