Saturday, June 18, 2016

Debugging TDMA on the AR9380

So, it turns out that TDMA didn't work on the AR9380. I started digging into it a bit more with AR9380's in 5GHz mode and found that indeed no, it was just transmitting whenever the heck it wanted to.

The first thing I looked at was the transmit packet timing. Yes, they were going out at arbitrary times, rather than after the beacon. So I dug into the AR9380 HAL code and found the TX queue setup code just didn't know how to setup arbitrary TX queues to be beacon-gated. The CABQ does this by default, and the HAL just hard-codes that for the CAB queue, but it wasn't generic for all queues. So, I fixed that and tried again. Now, packets were exchanged, but I couldn't get more than around 1mbit of transmit throughput. The packets were correctly being beacon gated, but they were going out at very long intervals (one every 25ms or so.)

After a whole lot of digging and asking around, I found out what's going on. It turns out that the new TX DMA engine in the AR9380 treats queue gating slightly different versus previous chips. In previous chips you would see it transmit whatever it could, and then be gated until the next time it could transmit. As long as you kept poking the AR_TXE bit to re-start queue DMA it would indeed continue along transmitting whenver it could. But, the AR9380 TX DMA FIFO works differently.

Each queue has 8 TX FIFO descriptors, which can contain a list of frames or a single frame. For the CABQ I just added the whole list of frames in one hit and that works fine. But for the normal data paths it would push one frame into a TX DMA FIFO slot. If it's an A-MPDU aggregate then yes, it'd be a whole list of frames, but still a single PPDU. But for non-aggregate traffic it'd push a single frame in.

With this in mind, the TX DMA gating now works on FIFO slots, not just descriptor lists. That is, if you have the queue setup to gate on something (say a timer firing, like the beacon timer) then that un-gating is for a single FIFO slot only. If that FIFO slot has one PPDU in it then indeed it'll only burst out a single frame and then the rest of the channel burst time is ignored. It won't go to the next FIFO slot until the burst time expires and the queue is re-gated again. This is why I was only seeing one frame every 25ms - that's the beacon interval for two devices in a TDMA setup. It didn't matter that the queue had more data available - it ran out of data servicing a single TX FIFO slot and that was that.

So I did some local hacks to push more data into each TX FIFO slot. When I buffered things and only leaked out 32 frames at a time (which is roughly the whole slot time worth of large frames) then it indeed behaved roughly at the expected throughput. But there are bugs and it broke non-TDMA traffic. I won't commit it all to FreeBSD-HEAD until I figure out what's going on

There's also something else I noticed - there was some situation where it would push in a new frame and that would cause the next frame to go out immediately. I think it's actually just scheduling for the next gated burst (ie, it isn't doing multiple frames in a single burst window, but one every beacon interval) but I need to dig into it a bit more to see what's going on.

In any case, I'm getting closer to working TDMA on the AR9380 and later chips.

Oh, and it turns out that TDMA mode doesn't add some of the IEs to the beacon announcements - notably, no atheros fast-frames announcement. This means A-MSDUs or fast-frames aren't sent. I was hoping to leverage A-MSDU aggregation in its present state to improve things, even if it's just two frames at a time. Hopefully that'd double the throughput - I'm currently seeing 30mbit TX and 30mbit RX without it, so hopefully 60mbit with it.)

Friday, May 27, 2016

Updating the broadcom driver part #2

In Part 1, I described updating the FreeBSD bwn(4) driver and adding some support for the PHY-N driver from b43. It's GPL, but it works, and it gets me over the initial hump of getting support for updated NICs and initial 5GHz operation.

In this part, I'll describe what I did to tidy up RSSI handling and bring up the BCM4322 support.

To recap - I ported over PHY-N support from b43, updated the SPROM handling in the bus glue (siba(4)), and made 11a OFDM transmission work. I was lucky - I chose the first 11n, non-MIMO NIC that Broadcom made which behaved sufficiently similarly to the previous 11abg generation. It was non-MIMO and I could run non-MIMO microcode, which already shipped with the existing firmware FreeBSD builds. But, the BCM4322 is a 2x2 MIMO device, and requires updated firmware, which brought over a whole new firmware API.

Now, bwn(4) handles the earlier two firmware interfaces, but not the newer one that b43 also supports. I chose BCM4321 because it didn't require firmware API changes and anything in the Broadcom siba(4) bus layer, so I could focus on porting the PHY-N code and updating the MAC driver to work. This neatly compartmentalised the problem so I wouldn't be trying to make a completely changed thing work and spending days chasing down obscure bugs.

The BCM4322 is a bit of a different beast. It uses PHY-N, which is good. It requires the transmit path setup the PLCP header bits for OFDM to work (ie, 11a, 11g) which I had to do for BCM4321, so that's good. But, it required firmware API changes, and it required siba(4) changes. I decided to tackle the firmware changes first, so I could at least get the NIC loaded and ready.

So, I first fixed up the RX descriptor handling, and found that we were missing a whole lot of RSSI calculation math. I dutifully wrote it down on paper and reimplemented it from b43. That provided some much better looking RSSI values, which made the NIC behave much better. The existing bwn(4) driver just didn't decode the RSSI values in any sensible way and so some Very Poor Decisions were made about which AP to associate to.

Next up, the firmware API. I finished adding the new structure definitions and updating the descriptor sizes/offsets. There were a couple of new things I had to handle for later chip revision devices, and the transmit/receive descriptor layout changed. That took most of a weekend in Palm Springs (my first non-working holiday in .. well, since Atheros, really) and I had the thing up and doing DMA. But, I wasn't seeing any packets.

So, I next decided to finish implementing the siba(4) bus pieces. The 4322 uses a newer generation power management unit (PMU) with some changes in how clocking is configured. I did that, verified I was mostly doing the right thing, and fired that up - but it didn't show anything in the scan list. Now, I was wondering whether the PMU/clock configuration was wrong and not enabling the PHY, so I found some PHY reset code that bwn(4) was doing wrong, and I fixed that. Nope, still no scan results. I wondered if the thing was set up to clock right (since if we fed the PHY the wrong clock, I bet it wouldn't configure the radio with the right clock, and we'd tune to the wrong frequency) which was complete conjecture on my part - but, I couldn't see anything there I was missing.

Next up, I decided to debug the PHY-N code. It's a different PHY revision and chip revision - and the PHY code does check these to do different things. I first found that some of the PHY table programming was completely wrong, so after some digging I found I had used the wrong SPROM offsets in the siba(4) code I had added. It didn't matter for the BCM4321 because the PHY-N revision was early enough that these SPROM values weren't used. But they were used on the BCM4322. But, it didn't come up.

Then I decided to check the init path in more detail. I added some debug prints to the various radio programming functions to see what's being called in what order, and I found that none of them were being called. That sounded a bit odd, so I went digging to see what was supposed to call them.

The first thing it does when it changes channel is to call the rfkill method with the "on" flag set on, so it should program on the RF side of things. It turns out that, hilariously, the BCM4322 PHY revision has a slightly different code path, which checks the value of 'rfon' in the driver state. And, for reasons I don't yet understand, it's set to '1' in the PHY init path and never set to '0' before we start calling PHY code. So, the PHY-N code thought the radio was already up and didn't need reprogramming.

Oops.

I commented out that check, and just had it program the radio each time. Voila! It came up.

So, next on the list (as I do it) is adding PHY-HT support, and starting the path of supporting the newer bus (bhnd(4)) NICs. Landon Fuller is writing the bhnd(4) support and we're targeting the BCM943225 as the first bcma bus device. I'll write something once that's up and working!

Thursday, May 19, 2016

Updating the broadcom softmac driver (bwn), or "damnit, I said I'd never do this!"

If you're watching the FreeBSD commit lists, you may have noticed that I .. kinda sprayed a lot of changes into the broadcom softmac driver.

Firstly, I swore I'd never touch this stuff. But, we use Broadcom (fullmac!) parts at work, so in order to get a peek under the hood to see how they work, I decided fixing up bwn(4) was vaguely work related. Yes, I did the work outside of work; no, it's not sponsored by my employer.

I found a small cache of broadcom 43xx cards that I have and I plugged one in. Nope, didn't work. Tried another. Nope, didn't work. Oh wait - I need to make sure the right firmware module is loaded for it to continue. That was the first hiccup.

Then I set up the interface and connected it to my home AP. It worked .. for about 30 seconds. Then, 100% packet loss. It only worked when I was right up against my AP. I could receive packets fine, but transmits were failing. So, off I went to read the transmit completion path code.

Here's the first fun bit - there's no TX completion descriptor that's checked. There is in the v3 firmware driver (bwi), but not in the v4 firmware. Instead, it reads a pair shared memory registers to get completion status for each packet. This is where I learnt my first fun bits about the hardware API - it's a mix of PIO/DMA, firmware, descriptors and shared memory mailboxes. Those completion registers? Reading them advances the internal firmware state to read the next descriptor completion. You can't just read them for fun, or you'll miss transmit completions.

So, yes, we were transmitting, and we were failing them. The retry count was 7, and the ACK bit was 0. Ok, so it failed. It's using the net80211 rate control code, so I turned on rate control debugging (wlandebug +rate) and watched the hilarity.

The rate control code was never seeing any failures, so it just thought everything was hunky dory and kept pushing the rate up to 54mbit. Which was the exact wrong thing to do. It turns out the rate control code was only called if ack=1, which meant it was only notified if packets succeeded. I fixed up (through some revisions) the rate control notification path to be called always, error and success, and it began behaving better.

Now, bwn(4) was useful. But, it needs updating to support any of the 11n chipsets, and it certainly didn't do 5GHz operation on anything. So, off I went to investigate that.

There are, thankfully, three major sources of broadcom softmac information:
  • Linux b43
  • Linux brcmsmac
  • http://bcm-v4.sipsolutions.net/
The linux folk did a huge reverse engineering effort on the binary broadcom driver (wl) over many years, and generated a specification document with which they implemented b43 (and bcm-v3 for b43legacy.) It's .. pretty amazing, to be honest. So, armed with that, I went off to attempt to implement support for the first 11n chip, the BCM4321.

Now, there's some architectural things to know about these chips. Firstly, the broadcom hardware is structured (like all chips, really) with a bunch of cores on-die with an interconnect, and then some host bus glue. So, the hardware design can just reuse the same internals but a different host bus (USB, PCI, SDIO, etc) and reuse 90% of the chip design. That's a huge win. But, most of the other chips out there lie to you about the internal layout so you don't care - they map the internal cores into one big register window space so it looks like one device.

The broadcom parts don't. They expose each of the cores internally on a bus, and then you need to switch the cores on/off and also map them into the MMIO register window to access them.

Yes, that's right. There's not one big register window that it maps things to, PCI style. If you want to speak to a core, you have to unmap the existing core, map in the core you want, and do register access.

Secondly, the 802.11 core exposes MAC and PHY registers, but you can't have them both on at once. You switch on/off the MAC register window before you poke at the PHY.

Armed with this, I now understand why you need 'sys/dev/siba' (siba(4)) before you can use bwn(4). The siba driver provides the interface to PCI (and MIPS for an older Broadcom part) to probe/attach a SIBA bus, then enumerate all of the cores, then attach drivers to each. There's typically a PCI/PCIe core, then an 802.11 core, then a chipcommon core for the clock/power management, and then other things as needed (memory, USB, PCMCIA, etc.) bwn(4) doesn't attach to the PCI device, it sits on the siba bus as a child device.

So, to add support for a new chip, I needed to do a few things.

  • The device needs to probe/attach to siba(4);
  • The SPROM parsing is done by siba(4), so new fields have to be added there;
  • The 802.11 core revision is what's probe/attached by bwn(4), so add it there;
  • Then I needed to ensure the right microcode and radio initvals are added in bwn(4);
  • Then, new PHY code is needed. For the BCM4321, it's PHY-N.
There are two open PHY-N implementations - brcmfmac is BSD licenced, and b43's is GPL licenced. I looked at the brcmfmac one, which includes full 11n support, but I decided the interface was too different for me to do a first port with. The b43 PHY-N code is smaller, simpler and the API matched what was in the bcm-4 specification. And, importantly, bwn(4) was written from the same specification, so it's naturally in alignment.

This meant that I would be adding GPLv2'ed code to bwn(4). So, I decided to dump it in sys/gnu/dev/bwn so it's away from the main driver, and make compiling it in non-standard. At some point yes, I'd like to port the brcmfmac PHYs to FreeBSD, but I wanted to get familiar with the chips and make sure the driver worked fine. Debugging /all/ broken and new pieces didn't sound like fun to me.

So after a few days, I got PHY-N compiling and I fired it up. I needed to add SPROM field parsing too, so I did that too. Then, the moment of truth - I fired it up, and it connected. It scanned on both 2G and 5G, and it worked almost first time! But, two things were broken:
  • 5GHz operation just failed entirely for transmit, and
  • 2GHz operation failed transmitting all OFDM frames, but CCK was fine.
Since probing, association and authentication in 2GHz did it at the lowest rate (CCK), this worked fine. Data packets at OFDM rates failed with a PHY error of 0x80 (which isn't documented anywhere, so god knows what that means!) but CCK was fine. So, off I went to b43 and the brcmfmac driver to see what the missing pieces were.

There were two. Well, three, but two that broke everything.

Firstly, there's a "I'm 5GHz!" flag in the tx descriptor. I set that for 5GHz operation - but nothing.

Secondly, the driver tries a fallback rate if the primary rate fails. Those are hardcoded, same as the RTS/CTS rates. It turns out the fallback rate for 6MB OFDM is 11MB CCK, which is invalid for 5GHz. I fixed that, but I haven't yet fixed the 1MB CCK RTS/CTS rates. I'll go do that soon. (I also submitted a patch to Linux b43 to fix that!)

Thirdly, and this was the kicker - the PHY-N and later PHYs require more detailed TX setup. We were completely missing initializing some descriptor fields. It turns out it's also required for PHY-LP (which we support) but somehow the PHY was okay with that. Once I added those fields in, OFDM transmit worked fine.

So, a week after I started, I had a stable driver on 11bg chips, as well as 5GHz operation on the PHY-N BCM4321 NIC. No 11n yet, obviously, that'll have to wait.

In the next post I'll cover fixing up the RX RSSI calculations and then what I needed to do for the BCM94322MC, which is also a PHY-N chip, but is a much later core, and required new microcode with a new descriptor interface.



Monday, February 22, 2016

Why's my laptop running so hot? Or Firefox, pandora, and 1 million syscalls a second.

My FreeBSD-HEAD laptop runs very warm when running firefox, but even warmer when it's doing something simple - like say, streaming from pandora.

So, I decided to take a bit of a look.

Firstly, 'vmstat -a' - a good top level peek.

procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr ad0 cd0   in    sy    cs us sy id
3 0 0  21G  180M 30403   0   2   0 21520  976  27   0  598 462567 10061 25 31 44
1 0 0  21G  174M 10389   0   0   0  3320  980   0   0  913 1203071 11892 15 24 61
3 0 0  21G  192M  4028   0   0   0  4763  983  14   0  563 1246314  8166 15 23 62
1 0 0  21G  192M  2305   0   0   0   334  988   1   0  390 1165396 10784 18 20 62
2 0 0  21G  202M 30493   0   0   0  3154  983   4   0  340 1072100 13287 28 23 49
2 0 0  21G  202M  8440   0   0   0   646  979   1   0  391 1071166  8802 32 20 48
1 0 0  21G  204M  3608   0   0   0  1841 1954  31   0  516 1041635 11319 33 21 46
3 0 0  20G  212M 67782   0   0   0  2895  973   1   0  387 1053575 10995 28 26 46
2 0 0  21G  187M 25368   0   0   0  2483  989   7   0  475 1047031 12056 29 23 48


.. ok, a million syscalls a second. Fine.Let's ask dtrace what's going on:

root@victoria:/home/adrian # dtrace -n 'syscall:::entry { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 1082 probes
^C


  gettimeofday                                                    305
  lstat                                                           336
  kevent                                                          598
  recvfrom                                                       1018
  __sysctl                                                       1384
  getpid                                                         2158
  sigprocmask                                                    5189
  select                                                         5443
  writev                                                         6215
  madvise                                                        6606
  setitimer                                                      6729
  recvmsg                                                       17556
  _umtx_op                                                      40740
  ppoll                                                        853940
  read                                                        1152896
  poll                                                        1158669
  write                                                       2159990
  ioctl                                                       2170830


root@victoria:/home/adrian # dtrace -n 'syscall::read:return /execname == "firefox"/ { @["rval (bytes)"] =
quantize(arg1); }'

dtrace: description 'syscall::read:return ' matched 2 probes

^C

  rval (bytes)                                     

           value  ------------- Distribution ------------- count  
              -2 |                                         0      
              -1 |                                         1      
               0 |                                         0      
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 496294 
               2 |                                         5      
               4 |                                         0      
               8 |                                         0      
              16 |                                         0      
              32 |                                         0      
              64 |                                         0      
             128 |                                         0      
             256 |                                         0      
             512 |                                         0      
            1024 |                                         0      
            2048 |                                         0      
            4096 |                                         6      
            8192 |                                         0      
           16384 |                                         0      
           32768 |                                         36     
           65536 |                                         0       


root@victoria:/home/adrian # dtrace -n 'syscall::write:return /execname == "firefox"/ { @["rval (bytes)"] = quantize(arg1); }'
dtrace: description 'syscall::write:return ' matched 2 probes
^C
  rval (bytes)                                    
           value  ------------- Distribution ------------- count  
              -2 |                                         0      
              -1 |@@@@@@@@@@@@@@@@@@@@                     875025 
               0 |                                         0      
               1 |@@@@@@@@@@@@@@@@@@@@                     876075 
               2 |                                         0      
               4 |                                         0      
               8 |                                         0      
              16 |                                         14     
              32 |                                         1      
              64 |                                         0      
             128 |                                         15     
             256 |                                         8      
             512 |                                         8      
            1024 |                                         29     
            2048 |                                         563    
            4096 |                                         0      
            8192 |                                         0      
           16384 |                                         0      
           32768 |                                         14     
           65536 |                                         0       


... ok, so read and write is doing single byte transactions, and write is actually failing as often as it's succeeding.

so, what's actually going on? I decided to run truss briefly, and I get a lot of this:

_umtx_op(0x82efa2e80,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOPTR,0xce0c5ac0)         = 0 (0x0)
_umtx_op(0x82b16fb70,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8d6a57e58,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
write(158,"x",1)                 ERR#35 'Resource temporarily unavailable'
_umtx_op(0x8006bd4b8,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffbde44c88) = 0 (0x0)

So I'm guessing there's a lot of inefficient single byte read/write syscalls to wake up a remote thread, along with a lot of inefficient use of the sound API.

For sound ioctls:

ioctl(68,SNDCTL_DSP_GETOSPACE,0xd7d54e10)     = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOPTR,0xd7d54e00)         = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOSPACE,0xd7d54df0)     = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOPTR,0xd7d54d50)         = 0 (0x0)


.. so i'm guessing it's doing it every thread wakeup or something stupid, even if it doesn't need to yet.

Wednesday, February 17, 2016

On being able to reflash your own devices, or "wow, millions of devices are potentially vulnerable."

If you work in software, you've likely heard of the latest hilarious bug - Linux glibc getaddrinfo() stack buffer overflow (https://isc.sans.edu/diary/CVE-2015-7547%3A+Critical+Vulnerability+in+glibc+getaddrinfo/20737). It was jointly found by redhat and google (https://googleonlinesecurity.blogspot.ca/2016/02/cve-2015-7547-glibc-getaddrinfo-stack.html), and it's been under investigation for a while. There are also some proof of concepts out there (eg https://github.com/fjserna/CVE-2015-7547).

I'm not sure if Android or OpenWRT devices are vulnerable - they don't use glibc out of the box, but they may use the relevant pieces of the NSS resolver library. But anything based on a linux distribution (centos, debian, ubuntu, redhat, etc) - ie, web services, docker installs, virtual machines, a heck of a lot of firewall/email/web gateway appliances, even some router management planes (hi Cisco?) may be vulnerable to this attack.

This means, well, most of the internet is likely vulnerable. I'm glad it's not an obvious bug in openwrt/android, as that'd also mean tens/hundreds of millions of devices are vulnerable. But it's a good study case - if you own something that has this bug, but there's no longer software updates available, you're short of luck. You may have working software on a perfectly working hardware, but since you (or some third party) can't fix it, it's effectively a paperweight.

But there may be devices which use glibc that I haven't covered. There may be set top boxes, televisions, cable modem / DSL gateways that are affected by this. There's likely a whole bunch of medical kit and control systems out there with this bug. Millions of potential consumer and industrial devices are impacted by this bug and it's likely never going to be patched. And since it's DNS, it's totally unencrypted/unauthorized, so anyone can hijack/spoof DNS to control what's going on.

So this is why I'm a big fan of open source software and being able to reflash your own devices. There's likely millions (or more!) devices this affects that is perfectly fine hardware but will never get software updates. This exposes a lot of people, with no easy fix besides "buy a replacement" and hope that it also isn't impacted. Heck, look at your home, office, workspace, outdoors - look at all those little electronic devices and think that at least some of them run Linux with this vulnerability and will be network connected. Any of them could be vulnerable to this and any of them may be owned by someone now.

This is "Hollywood" level of exploit. This is like, watching an episode of "Person of Interest" and realising all of those drive-by hacks are actually possible. This is like, anyone everywhere can do this - not just governments, but anyone with the minimum technical ability needed to run the exploit. Yes, this includes your internet connected fridge and your Internet-Of-Things lightbulbs.

Oh, and FreeBSD isn't vulnerable. Heh.

Saturday, October 31, 2015

Fixing up the QCA9558 performance on FreeBSD, or "why attention to detail matters when you're a kernel developer."

When I started with this Atheros MIPS 11n stuff a few years ago, my first test board was a Routerstation Pro with a pair of AR9160 NICs. I could get ~ 150mbit/sec bridging performance out of it, and I thought I was doing pretty good.

Fast forward to now, and I've been bringing FreeBSD up on each of the subsequent boards. But the performance never improved. Now, I never bothered to look into it because I was always too busy with my day job, but finally someone trolled me correctly on the FreeBSD embedded IRC channel and I took a look.

It turns out that.. things could've benefitted from a lot of improvement.

First up - I'm glad George Neville-Neil brought up PMC (performance counters) on the MIPS24k platform. It made it easier for me to bring it up on the MIPS74k platform and it was absolutely instrumental in figuring out performance issues here. No, there's no real ability to get DTrace up on these boards - some have 32MB of RAM. Heck, the packet filter (bpf) consumes most of a megabyte of RAM when you first start it up.

My initial tests are on an AP135 reference design board from Qualcomm Atheros. It's a QCA9558 SoC with an AR8327 switch on board. Both on-chip ethernet ports (arge0, arge1) are available. I set it up as a straight bridge between arge0 and arge1 and then I used iperf between two laptops to measure performance.

The first test - 130mbit bridging performance. That's terrible for this platform.

So I fired up hwpmc, and I found the first problem - packets were being copied in the receive and transmit path. Since I'm more familiar with the transmit path, I decided to look into that.

The AR7161 MAC requires both transmit and receive buffers to be both DWORD (32 bit or 4 bytes) aligned. In addition, all transmit frames save the last frame are required to be a multiple of DWORD in length. Plenty of frames don't meet this requirement and end up being copied.

The AR7240 and later MAC relaxed this - transmit/receive buffers can now be byte-aligned. So that particular workaround can be removed. It still needs to do it for multi-descriptor transmits that aren't DWORD sized (eg if you just prepend a fresh ethernet header) but that doesn't happen in the bridging path in the normal case.

Fixing that got bridging performance from 130mbit to 180mbit. That's not a huge difference, but it's something.

Next up is the receive path.  This was more .. complicated. The receive code copies the whole buffer back two bytes in order to ensure that the IP payload presented to the FreeBSD network stack is aligned. This is a problem in FreeBSD's network stack - it assumes the hardware handles unaligned accesses fine. So if your receive engine is DWORD aligned, the 14 byte ethernet header will result in the start of the IP payload being non-DWORD aligned, and .. the stack blows up. Now, I have vague plans to start fixing that as a general rule, but I did the next worst hack - I grabbed a buffer, set its RX start point to two bytes in, so the ethernet header is unaligned but the IP header is. Now, the ethernet stack in FreeBSD handles unaligned stuff correctly, so that works.

Except it wasn't faster. It turns out that the MIPS busdma code was doing very inefficient things with mbuf handling if everything wasn't completely aligned. Ian Lepore (who does ARM work) recently fixed this for armv6, so he ported it to MIPS and I added it.

The result? bridging performance leaped from 180mbit to 420mbit. Quite nice, but not where Linux was.

I left it for a few days, and someone on the freebsd-mips mailing list pointed out big stability issues with his tests. I started looking at the Linux OpenWRT driver and the MIPS24K/MIPS74K memory coherency operations. I found a couple of interesting things:

  • The busdma sync code never did a "SYNC" operation if things weren't being copied or invalidated; and
  • I was using cache-writethrough instead of cache-writeback for the cached memory attribute for MIPS74K.
The former is a problem with driver memory / driver access sync - you need to ensure that the changes you've made are actually in memory before you tell the hardware to look at it. So I fixed that in the busdma routines.

The latter makes everything slow. It means each write is going through the cache and into memory - the cache hardware doesn't get to batch writes to memory. I changed that, and found more instability in some parts of the arge ethernet driver - the MDIO bus accesses started misbehaving. After looking at the Linux code and the sync operations, I reimplemented the MDIO code correctly and I added explicit read/write barriers as needed. The MDIO code does lots of same-register accesses in loops to look for things, and the hardware may subtly reorder things. I committed this, flipped on the correct cache attribute to support cache-writeback, and things got .. faster. Much faster in fact.

So, that worked - and I hit the hardware instability issue. But, I hit it at a higher traffic rate. The final thing was fixed by looking at the OpenWRT driver (ag71xx_main.c) and going "Aha!" - the transmit side was buggy.

Specifically - the transmit side is a linked list of descriptors, but it's formed into a ring. The TX DMA engine stops when it hits a descriptor that isn't marked "ready" (ie, has ARGE_DESC_EMPTY set.) Now, we didn't see this before when we were copying transmit buffers for a packet into a single correct transmit buffer, but now that I am doing multi-descriptor transmit more frequently, this bug was hit. The bug is that because the TX descriptors are in a big ring, it's possible the hardware will transmit everything and hit the end of the ring before we've completely setup the descriptors for the next packet. If this happens, and it hits ARGE_DESC_EMPTY, then it stops. But if we have say a 3 descriptor packet, and we set the descriptors up in order, the hardware may hit that first descriptor out of three before we've finished setting things up, and start transmitting. It hits a descriptor we've not setup yet, thinks we're done, and transmits what it's seen. Then when I finish the setup and hit "transmit" on the hardware, it stalls, and everything sticks.

The fix was to initialise the first descriptor as EMPTY, then when we're done setting them up, flip that first descriptor to non-empty.

And voila! The bug is fixed and things perform now at a much faster rate - 720mbit. Yup, it bridges at 720mbit and it routes at around 320mbit. I'd like to get routing up from 320mbit to somewhere near bridge performance, but that'll have to wait a while.

120mbit -> 720mbit. Yup, I'm happy with that.

Tuesday, October 13, 2015

Fixing up the RTL8188SU/RTL8192SU 802.11bgn driver (rsu)

I recently figured out most of the missing pieces for 11n support and stability with the rsu driver in FreeBSD. This handles support for the RTL8188SU and RTL8192SU chipsets. I'll cover what I found and fixed in this post.

First off - the driver was in reasonably poor shape. It sometimes paniced when the NIC was removed, it didn't support 802.11n at all and it wouldn't associate reliably. I was pestered enough by one of the original users behind getting the driver ported (Idwer! Hi!) and decided I'd give fixing it up a go.

Importantly - it's a mostly real fullmac device. "fullmac" here means that the firmware on the device does almost everything interesting - it handles association, it can do encryption/decryption for you if you want, it'll handle retransmission and transmit rate control. There are some important things it doesn't do - I'll cover those shortly.

Here's a fun bit of trivia - this firmware outputs text debugging via a magic firmware notification, and it's on by default. This made all of the debugging much, much easier as I didn't have to guess so much about what was going on in the firmware. All firmware developers - please do this. Please!

I first looked at the association issue. The device does full scan offload - you send it a firmware command to start scanning and it'll return scan results as they come in. Plenty of firmware devices do this. Then you send it an association message, then a join_bss message. For those looking at the source - rsu_site_survey() starts the scan, and rsu_join_bss() attempts an association. Now, I noticed that it was sending a join message before the site survey finished. I also noticed that I never really received any management frames, and when I used a sniffer to see what was going on, I saw double-associations sometimes occuring.

I then checked OpenBSD. Their driver just stubbed out the management frame transmit routine. This wasn't done in FreeBSD, so I added it. It turns out the firmware here does all management frame transmit and receive, so I just plainly have to do none of it. This tidied things up a bit but it didn't fix association.

Next up - the whole way scan results were pushed into net80211 was wrong. Sometimes scan results ended up on the wrong channel. The driver was doing dirty things to the current channel state directly and then faking a beacon to the net80211 stack. I replaced that with some code I wrote for the 7260 wifi driver - the stack now accepts a channel (and other things!) as part of the receive frame, so you can do proper off-channel frame reception. This tidied up the scan results so they were now consistent.

Then I thought about an evil hack - how about delaying the call to rsu_join_bss() until after the survey finished. That worked - associations were now very reliable.


Now the device associated reliably and worked okay. There were some missing bits for the firmware setup path for doing things like power saving, saying how many transmit/receive streams are available, etc, but those were easy to add. Next up - 802.11n.

On the receive side, 802.11n requires you to do A-MPDU reordering as the transmitter is free to retransmit failed frames out of order. But the net80211 stack only handled the case where it saw the management frames and it itself drove the A-MPDU negotiation. Here, the firmware drives the negotiation and just tells us what's just happened. So, I had to extend net80211 to be told what the A-MPDU parameters are. It turned out that yes, the firmware sends a notification about A-MPDU going up, but it doesn't tell you how big the block-ack window is. Sigh. So, I needed to add that.

But the access point still wasn't negotiating it. Here was the next fun bit - join rsu_join_bss() it lets the stack assemble optional IEs to send to the access point and, the more interesting part, it looks at said IEs for an idea of what its own configuration should be. I added the HTINFO IE and voila! It started negotiating 802.11n.

(Oh, and I had to add M_AMPDU to each RX'ed frame from an 802.11n node before I called net80211, or the receive code would never do A-MPDU reorder processing.)

The final hack - I stubbed out the A-MPDU TX negotiation so we would never attempt to do it. So yes, there's no TX aggregation support, but that's fine for now.

Then Idwer told me it wasn't working for him. After much digging with the Linux driver authors (Thanks Christian and Larry!) we found that the OpenBSD driver tried to program the chip directly for 40MHz mode and that's wrong - instead, I just missed one of the 802.11n IEs. The firmware looked into that to see what the channel setup should've been. Two lines of diff later and I was on at 40MHz wide modes.

Finally - stability. It turns out that the USB drivers do inconsistent things when it comes to the detach path. They're supposed to stop transmit/receive, then flush buffers which flushes the net80211 node references, and then tear down the net80211 interface. Some, eg if_rsu, were doing it the other way. I fixed if_rsu and if_urtwn - they're now both stable.