Tuesday, March 26, 2013

Hey, look, it's lots of atheros NICs in one laptop

So after many months of evenings and a whole lot of work internally to get the AR9380 HAL release vetted by legal, I bring you: a single, unified ath(4) and ath_hal(4) driver which works on all chipsets.

Now, the only chipsets I can fit _in_ this laptop:

[100309] ath0: mem 0xebf00000-0xebf0ffff irq 17 at device 0.0 on pci3
ath0: AR9280 mac 128.2 RF5133 phy 13.0
[100309] ath1: mem 0xedf00000-0xedf1ffff irq 18 at device 0.0 on pci4
ath1: AR9380 mac 448.3 RF5110 phy 0.0
[100309] ath2: mem 0xe4310000-0xe431ffff irq 16 at device 0.0 on cardbus0
ath2: AR5212 mac 5.9 RF5112 phy 4.3

.. that's an AR9280, AR5212 and AR9380 in the same laptop.

And, that's a 3x3 AR9380:

static_rix (-1) ratemask 0xffffffff
[ 250] cur rate 20 MCS since switch: packets 1 ticks 2647581
[ 250] last sample (6  Mb) cur sample (0 ) packets sent 9
[ 250] packets since sample 9 sample tt 0
[1600] cur rate 22 MCS since switch: packets 15 ticks 2647530

[1600] last sample (21 MCS) cur sample (0 ) packets sent 6049
[1600] packets since sample 0 sample tt 532
   TX Rate     TXTOTAL:TXOK       EWMA          T/   F     avg last xmit

[ 6  Mb: 250]        4:4        (100.0%)        4/   0   760uS 2640242
[20 MCS: 250]        9:9        (100.0%)        9/   0   440uS 2647581
[20 MCS:1600]      969:969      (100.0%)       57/   0   572uS 2647445
[21 MCS:1600]     1517:1517     (100.0%)       74/   0   613uS 2647557
[22 MCS:1600]     1990:1990     (100.0%)       92/   0   529uS 2647557
[23 MCS:1600]    73986:73462    ( 99.5%)     5661/   0   755uS 2647538

Now, I'm sure the AR5210 will work with an AR9280 and an AR9380 in the same laptop - it's just that the hardware form factor won't let me fit them all at the same time.

Tuesday, March 19, 2013

AR9380 support on FreeBSD; why it's taken so long..

There's now public, open source support for the AR9380 and later chips for FreeBSD.

It's not yet in the -HEAD tree - I'll get to that.

Let me take you on a bit of a journey.

I started a little side project late last year - I wanted to see if I could make the AR9380 HAL from the Qualcomm Atheros mainline driver (10.x branch) work on FreeBSD. I was hoping that the HAL API hadn't drifted all that much over the years.

Why do this? Two reasons:
  • I wanted to see if I could open source the HAL and have it work with FreeBSD; and
  • I didn't want to take on a similar project to what ath9k had to do - which is to take the existing HAL, convert it into something Linux-upstream-compatible, then push THAT into open source.
There's only one of me, and I don't want to spend all of my evenings trying to figure out which changes to the internal driver HAL need merging into "my" version of the HAL. I want to leverage all of the development and debugging that we do internally for the HAL. The ath9k team (both public and internally) need to do a lot of manual inspection and coding in order to pick up fixes and features from the internal driver. Since there's one of me, I'd rather optimise my time (read: get some sleep at some point.)

Then there's the third point that I didn't mention above:
  • I want to see how feasible it is to do snapshots from our internal codebase and push those out, rather than having to maintain a separate driver tree (sometimes based on the internal driver tree, sometimes re-implemented) and all the associated complication there.
This bit is pretty important. There's plenty of code I didn't want to open up. The bulk of the AR9300 HAL is already open sourced via the ath9k driver in Linux. So for the most part I'm open sourcing what we already have open sourced. However, I want to try and streamline the process for taking internally developed code and push it open.

This involves a few things.

Firstly, how much of the internal driver code is written with the idea that it's going to appear in the public eye? It depends what you think of as public - are your company developers "public" ? Are your customers with source code "public" ? It may not necessarily be "the general community." When you're writing code that's eventually going to be open sourced, you may need to make some decisions about how you structure your code.

For me, it was (mostly) easy. A very large amount of the "stuff that shouldn't be released" was already wrapped up in #ifdef's - stuff like emulation code, for example. So the public HAL snapshot is actually missing a lot of code that our internal version has. All I did (heh!) was pass it through 'unifdef'.

Next is whether the code is nice to look at. Is it formatted well? Is it well designed? Does it compile without warnings? Even on clang? These should be thought about whether or not your target audience is public or not. It's just good design. Companies may be worried about exposing the code, as if it will show badly on them. Well, yes, you should. But hey - we the open community would rather you release the code and take constructive criticism instead of keeping it closed. Who knows, it may actually help you!

The Linux upstream push is actually good here - the Linux system maintainers don't take "bad code". They hold the developers to a higher standard and this is forcing companies to think a bit more about how they develop things. Now, whether companies view this as a cost-centre or a benefit is not something I wish to discuss here. The point is that by working in the Linux upstream community, companies are being forced to tidy up their game a little.

Ok, enough of the back-story. How'd it actually all happen?
The short version - there was API drift, yes. There was a bunch of driver layer stuff that needed to happen. But it wasn't terribly painful. It required me to clean up the driver a bit and implement some nicer tools.

The long version:
  • There was an internal attempt to partly convert the HAL code internally over to a format that is Linux-upstream compatible. This involved a variety of formatting changes - function names and indentation changed. It also involved a variety of variable / method changes - eg halMciSupport became hal_mci_support. The boolean type changed - HAL_BOOL and AH_TRUE/AH_FALSE became bool, true & false. These needed to be renamed back to the HAL style before I could make it compile.
  • FreeBSD stripped out the HAL_CHANNEL stuff from its HAL, replacing it with a direct reference to the net80211 type (struct ieee80211_channel.) This made things slightly tidier but it did put an external dependency on the HAL. I may end up going through the FreeBSD HAL and undoing this at some point; but it's a big job.
  • A variety of APIs changed over time. Although the bulk of the APIs stayed the same, they grew parameters (eg 11n TX and RX antenna and chain configuration); the TX descriptor APIs now take a list of TX buffers rather than a single TX buffer, and other random other things.
So, what was I going to do?

My first cut was to just take a snapshot of the HAL and rename / shuffle things around enough to make it compile.

The first thing I did was to create a set of HAL stub functions. All the stub functions did was print out their method name and return. This way I wasn't surprised by a NULL pointer dereference when the HAL or driver called an unimplemented method - I'd get told which method was being called.

I started with the bare minimum code needed to support probe and attach - which required a surprising amount of code to be converted over. But it was mostly mechanical work. And it worked - enough to get things probing and attaching. I didn't bother with frame transmission and reception just yet - getting probe/attach was enough.

Then I realised that I wanted to this in a git branch, so I could import future versions of the HAL into master and then merge it into my branch. That's what I did. The HAL from 10.x was in master, and my FreeBSD port lived in 'local/freebsd'.

Next was figuring out whether to rename/fix API functions, or to use glue functions in order to deal with API differences. I've fixed some API differences (eg the reset path), but I ended up using a lot of wrapper functions to get the APIs to line up.

The important bits to bring up (in rough order) in order to see whether things are working:
  • Probe/attach/detach;
  • The reset path;
  • The initial calibration path (ADC calibration, IQ calibration, NF/AGC calibration);
  • The radio configuration path (ie, programming the analog section with the right frequencies, channel width, filter setup and such);
  • Interrupt handling;
  • ANI support;
  • RX path.
The RX path was the important bit. Once frame RX was working, I could do things like run the NIC in monitor mode and verify that HT20 and HT40 were working. And yes, that's pretty much what I did.

But at this point, the RX path exposed the first major API change - the whole FIFO setup that the AR9380 and later required. They don't support the list-based TX and RX that previous NICs supported. (Well, they _kind_ of do on the TX side, see below.)

The major change here required in the driver is that the RX descriptor is actually in the same memory area as the RX buffer. Ie, the first 'x' bytes of the passed in buffer is where the NIC DMAs the RX completion information to. Previous NICs have two areas for each RX frame - a RX descriptor area and an RX buffer area. Descriptors are in non-cachable memory, so I had to teach the FIFO RX path to support descriptors in cachable memory. I also had to teach the RX path to "skip" the 'x' bytes in order to hand the start of the data payload up to the net80211 stack. Finally, there's two RX FIFOs - one for high priority frames (beacons, uAPSD frames, PS-POLL frames, etc) and low-priority frames (everything else.) I had to teach the stack about this.

So, you can see the changes to the RX code - there's now a set of methods that implement RX - stop, start, flush, descriptor processing. The legacy routines stayed where they were. The new routines just overrode those methods.

And with that, RX came to life.

Next was TX. TX is a bit more special. There's only 8 TX FIFO entries per hardware queue (QCU 0..9); so I can't just push all the frames I want into the list. I also have four TX data buffer pointers per descriptor, rather than one per descriptor in the past. Finally, the TX status FIFO is completely separate from the TX FIFO itself - legacy chips would put the TX status at the end of the final descriptor in a frame.

This required some pretty significant refactoring of the TX path in order to expose the correct hooks to do this all properly. I won't go into the details here - suffice to say that I'm still working on it.

The next problem with TX was figuring out exactly what TX descriptor flags I was setting incorrectly. I eventually gave in and wrote some ALQ based logging which dumps the TX and RX descriptors into an ALQ log which I can then read from userland. This made it very, very easy to inspect what was going on - I was even finding bugs with the earlier chipset code!

Initially I used this to discover I wasn't correctly filling out all four buffer pointers in each TX descriptor. I can't leave any NULL if there's more descriptors for a given frame.

Then I used it to discover whether I was setting up the general flags right - TX chainmask, TX rate, duration, etc. I (re) discovered a hardware limitation with the AR9380 - I need to pad aggregate frames that use RTS with a little more pad delimiters or the transmission underruns. I was able to take these text dumps and give them to the Qualcomm Atheros MAC/PHY team for assistance and they were very impressed by the sophistication of my debugging tools.

Now I have the TX and RX side working. I pushed all of the driver side code into the public FreeBSD repository. I promised people that I would eventually open up the HAL side of things, but I figured that keeping the driver side of this closed was just plain silly. It also meant that if I did stop working on things (for whatever reason), the driver side was done - all that would need porting was the HAL.

Then I began the internal process to get the HAL opened up. I won't go into this in too much detail - suffice to say it took some engineering and legal review to get approval for this. The approval came in about two weeks ago and I pushed the repository into github shortly afterward.

Shortly after that, people started testing it and filing bugs. This part made me happy - there's a few small bugs that are actually in our 10.x mainline tree. I'll be pushing fixes back into the internal driver tree soon.

So, what's next?

I need to push the repository into a vendor branch in FreeBSD, then merge it into the kernel tree so it can be compiled by default.

I then need to get an updated version of the HAL approved by legal/engineering and push that update into the public git repository. Once that's done, I'll do a git merge into my branch and fix up whatever merge issues there are. This updated HAL includes some fixes for TX power and the AR95xx embedded SoC that we've just released. I hope to try and do a HAL update every month or two based on what bugs and features are introduced into the internal mainline driver.

I still have a bunch of driver work to finish up - notably I need to finish optimising the TX FIFO path in the driver and I need to implement MCI support. But the driver is now usable for me at least and I hope it'll become increasingly usable by others.

This has been a long and interesting trip.

Sunday, March 3, 2013

Why PCI latency timers matter..

My latest "are you serious?" moment recently was trying to figure out the root cause of this performance issue with the AR5416 cardbus NIC on some of my test laptops.

Now, the AR5416 is Atheros' first 802.11n NIC, so it has some rough edges. But I was seeing some ridiculously bad transmission failures and I couldn't pinpoint them.

Not only that, I was seeing great performance (~ 130mbit TCP) on a specific laptop (Lenovo T41p) but the Lenovo T60 and T400 both performed extremely poorly.

To make matters weirder - the NIC performed great when speaking to another NIC in the same laptop. Just not to another physically separate device.

So, after much digging, here's what I discovered.

Firstly - I used my athalq packet descriptor logging and inspection tool (that's in FreeBSD-HEAD - no custom closed source code here!) to investigate the TX frames being sent to the hardware. What I found was troubling - large numbers of frames had TX data and TX delimiter underruns.

I then discovered that my code for counting TX data / delimiter underruns was totally incorrect - it's possible to see both a data/delimiter underrun error _with_ a valid transmitting frame. What was going on was cute - the hardware would start transmitting an aggregate frame but the DMA wouldn't keep up during said transmission and half way through the frame it would underrun. This only happened at higher MCS rates.

So making shorter aggregate frames fixed it, as well as increasing the delimiter count between frames. Both had the effect of reducing the likelihood of the NIC failing to transmit a longer aggregate. But they weren't solutions.

So I went digging. What I found was pretty simple in theory: the PCI latency timer on the NIC was being set to something appropriate (0xa8) but the PCI latency timer on the cardbus PCI bridge itself was not (0x20.) So any other bus activity would cause the NIC to not get the bus and it'd miss its DMA window.

Once I manually fixed the PCI bridge latency timer to be 0xa8, everything returned to normal.

However - there's only one thing on this PCI bridge - the cardbus interface itself. That's why it's so kooky. I would've thought that I'd have to up the value on the rest of the PCI bridges up to the root complex. There's no latency timer for PCIe, so it's not a problem there. So there's likely some very subtle timing involved that's just plain broken by default on how the BIOS initialises this cardbus slot and FreeBSD is not overriding it.

Now, if you see crappy performance on the PCI/cardbus 802.11n NICs in FreeBSD, you can check the output of 'athstats' to see if you do see TX underruns of any sort. If you are, the hardware isn't meeting the DMA deadlines it needs to DMA out frames and you need to do some further digging into your system to see why.