Tuesday, March 26, 2013

Hey, look, it's lots of atheros NICs in one laptop

So after many months of evenings and a whole lot of work internally to get the AR9380 HAL release vetted by legal, I bring you: a single, unified ath(4) and ath_hal(4) driver which works on all chipsets.

Now, the only chipsets I can fit _in_ this laptop:

[100309] ath0: mem 0xebf00000-0xebf0ffff irq 17 at device 0.0 on pci3
ath0: AR9280 mac 128.2 RF5133 phy 13.0
[100309] ath1: mem 0xedf00000-0xedf1ffff irq 18 at device 0.0 on pci4
ath1: AR9380 mac 448.3 RF5110 phy 0.0
[100309] ath2: mem 0xe4310000-0xe431ffff irq 16 at device 0.0 on cardbus0
ath2: AR5212 mac 5.9 RF5112 phy 4.3

.. that's an AR9280, AR5212 and AR9380 in the same laptop.

And, that's a 3x3 AR9380:

static_rix (-1) ratemask 0xffffffff
[ 250] cur rate 20 MCS since switch: packets 1 ticks 2647581
[ 250] last sample (6  Mb) cur sample (0 ) packets sent 9
[ 250] packets since sample 9 sample tt 0
[1600] cur rate 22 MCS since switch: packets 15 ticks 2647530
[1600] last sample (21 MCS) cur sample (0 ) packets sent 6049
[1600] packets since sample 0 sample tt 532
   TX Rate     TXTOTAL:TXOK       EWMA          T/   F     avg last xmit
[ 6  Mb: 250]        4:4        (100.0%)        4/   0   760uS 2640242
[20 MCS: 250]        9:9        (100.0%)        9/   0   440uS 2647581
[20 MCS:1600]      969:969      (100.0%)       57/   0   572uS 2647445
[21 MCS:1600]     1517:1517     (100.0%)       74/   0   613uS 2647557
[22 MCS:1600]     1990:1990     (100.0%)       92/   0   529uS 2647557
[23 MCS:1600]    73986:73462    ( 99.5%)     5661/   0   755uS 2647538

Now, I'm sure the AR5210 will work with an AR9280 and an AR9380 in the same laptop - it's just that the hardware form factor won't let me fit them all at the same time.

Tuesday, March 19, 2013

AR9380 support on FreeBSD; why it's taken so long..

There's now public, open source support for the AR9380 and later chips for FreeBSD.

It's not yet in the -HEAD tree - I'll get to that.

Let me take you on a bit of a journey.

I started a little side project late last year - I wanted to see if I could make the AR9380 HAL from the Qualcomm Atheros mainline driver (10.x branch) work on FreeBSD. I was hoping that the HAL API hadn't drifted all that much over the years.

Why do this? Two reasons:
  • I wanted to see if I could open source the HAL and have it work with FreeBSD; and
  • I didn't want to take on a similar project to what ath9k had to do - which is to take the existing HAL, convert it into something Linux-upstream-compatible, then push THAT into open source.
There's only one of me, and I don't want to spend all of my evenings trying to figure out which changes to the internal driver HAL need merging into "my" version of the HAL. I want to leverage all of the development and debugging that we do internally on the HAL. The ath9k team (both public and internal) needs to do a lot of manual inspection and coding in order to pick up fixes and features from the internal driver. Since there's only one of me, I'd rather optimise my time (read: get some sleep at some point.)

Then there's the third point that I didn't mention above:
  • I want to see how feasible it is to do snapshots from our internal codebase and push those out, rather than having to maintain a separate driver tree (sometimes based on the internal driver tree, sometimes re-implemented) and all the associated complication there.
This bit is pretty important. There's plenty of code I didn't want to open up. The bulk of the AR9300 HAL is already open sourced via the ath9k driver in Linux. So for the most part I'm open sourcing what we already have open sourced. However, I want to try and streamline the process of taking internally developed code and pushing it out into the open.

This involves a few things.

Firstly, how much of the internal driver code is written with the idea that it's going to appear in the public eye? It depends on what you think of as public - are your company developers "public"? Are your customers with source code access "public"? It may not necessarily be "the general community." When you're writing code that's eventually going to be open sourced, you may need to make some decisions about how you structure your code.

For me, it was (mostly) easy. A very large amount of the "stuff that shouldn't be released" was already wrapped up in #ifdef's - stuff like emulation code, for example. So the public HAL snapshot is actually missing a lot of code that our internal version has. All I did (heh!) was pass it through 'unifdef'.
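
As a rough illustration (ATH_EMULATION is a made-up stand-in, not the real internal define), emulation-only code is wrapped like this, and 'unifdef -UATH_EMULATION' strips the guarded block out of the public snapshot while leaving everything else intact:

#include <stdio.h>

/*
 * ATH_EMULATION is a placeholder name for illustration only.
 * "unifdef -UATH_EMULATION file.c" removes the guarded block from the
 * public snapshot and keeps the rest of the file untouched.
 */
static void
chip_reset(void)
{
#ifdef ATH_EMULATION
    printf("emulation-only setup; never shipped publicly\n");
#endif
    printf("normal reset path; survives unifdef\n");
}

int
main(void)
{
    chip_reset();
    return (0);
}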

Next is whether the code is nice to look at. Is it formatted well? Is it well designed? Does it compile without warnings? Even on clang? These questions matter whether or not your target audience is the general public. It's just good design. Companies may be worried about exposing the code, as if it will reflect badly on them. Well, yes, perhaps you should be. But hey - we the open community would rather you release the code and take constructive criticism than keep it closed. Who knows, it may actually help you!

The Linux upstream push is actually good here - the Linux system maintainers don't take "bad code". They hold the developers to a higher standard and this is forcing companies to think a bit more about how they develop things. Now, whether companies view this as a cost-centre or a benefit is not something I wish to discuss here. The point is that by working in the Linux upstream community, companies are being forced to tidy up their game a little.

Ok, enough of the back-story. How'd it actually all happen?
The short version - there was API drift, yes. There was a bunch of driver layer stuff that needed to happen. But it wasn't terribly painful. It required me to clean up the driver a bit and implement some nicer tools.

The long version:
  • There was an internal attempt to partly convert the HAL code over to a format that is Linux-upstream compatible. This involved a variety of formatting changes - function names and indentation changed. It also involved a variety of variable / method changes - eg halMciSupport became hal_mci_support. The boolean type changed - HAL_BOOL and AH_TRUE/AH_FALSE became bool, true & false. These all needed to be renamed back to the HAL style before I could make it compile (see the sketch after this list).
  • FreeBSD stripped out the HAL_CHANNEL stuff from its HAL, replacing it with a direct reference to the net80211 type (struct ieee80211_channel.) This made things slightly tidier but it did put an external dependency on the HAL. I may end up going through the FreeBSD HAL and undoing this at some point; but it's a big job.
  • A variety of APIs changed over time. Although the bulk of the APIs stayed the same, they grew parameters (eg 11n TX and RX antenna and chain configuration); the TX descriptor APIs now take a list of TX buffers rather than a single TX buffer, and various other things changed.
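
To give a feel for the first item above, here's a contrived before/after pair (the function body is made up; only the naming and type conventions come from the real code):

#include <stdbool.h>

/* Simplified stand-ins so the sketch is self-contained. */
struct ath_hal { int ah_dummy; };
typedef int HAL_BOOL;
#define AH_TRUE 1

/* The Linux-friendly internal conversion looked roughly like this ... */
static bool
hal_mci_support(struct ath_hal *ah)
{
    (void) ah;
    return true;
}

/* ... and had to go back to the HAL style before it would build: */
static HAL_BOOL
halMciSupport(struct ath_hal *ah)
{
    (void) ah;
    return (AH_TRUE);
}
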
So, what was I going to do?

My first cut was to just take a snapshot of the HAL and rename / shuffle things around enough to make it compile.

The first thing I did was to create a set of HAL stub functions. All the stub functions did was print out their method name and return. This way I wasn't surprised by a NULL pointer dereference when the HAL or driver called an unimplemented method - I'd get told which method was being called.
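
Here's a minimal sketch of what those stubs looked like in spirit (the names here are invented; the real ones slot into the HAL method table):

#include <stdio.h>

/* Simplified stand-ins so the sketch is self-contained. */
struct ath_hal { int ah_dummy; };
typedef int HAL_BOOL;
#define AH_FALSE 0

/*
 * Every not-yet-ported HAL method initially pointed at a stub like this,
 * so an early call logs its own name instead of dereferencing NULL.
 */
static HAL_BOOL
ar9300_stub_not_yet(struct ath_hal *ah)
{
    (void) ah;
    printf("%s: called, not yet implemented\n", __func__);
    return (AH_FALSE);
}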

I started with the bare minimum code needed to support probe and attach - which required a surprising amount of code to be converted over. But it was mostly mechanical work. And it worked - enough to get things probing and attaching. I didn't bother with frame transmission and reception just yet - getting probe/attach was enough.

Then I realised that I wanted to do this in a git branch, so I could import future versions of the HAL into master and then merge them into my branch. That's what I did. The HAL from 10.x was in master, and my FreeBSD port lived in 'local/freebsd'.

Next was figuring out whether to rename/fix API functions, or to use glue functions in order to deal with API differences. I've fixed some API differences (eg the reset path), but I ended up using a lot of wrapper functions to get the APIs to line up.
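
Most of that glue ends up looking something like this hypothetical wrapper - the FreeBSD-facing function keeps the signature the FreeBSD HAL expects and translates into whatever the imported 10.x HAL routine wants (all names below are invented for illustration):

/* Hypothetical illustration of the wrapper/glue approach. */
struct ath_hal { int ah_dummy; };
struct ieee80211_channel { int ic_freq; };     /* simplified stand-in */

/* What the imported 10.x HAL provides (its signature drifted over time). */
static int
ar9300_set_channel_internal(struct ath_hal *ah, int freq_mhz, int flags)
{
    (void) ah; (void) freq_mhz; (void) flags;
    return (0);
}

/*
 * What the FreeBSD-side HAL API expects: take the net80211 channel type
 * directly, then translate into the internal HAL's arguments.
 */
static int
ar9300_set_channel(struct ath_hal *ah, const struct ieee80211_channel *chan)
{
    return (ar9300_set_channel_internal(ah, chan->ic_freq, 0));
}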

The important bits to bring up (in rough order) in order to see whether things are working:
  • Probe/attach/detach;
  • The reset path;
  • The initial calibration path (ADC calibration, IQ calibration, NF/AGC calibration);
  • The radio configuration path (ie, programming the analog section with the right frequencies, channel width, filter setup and such);
  • Interrupt handling;
  • ANI support;
  • RX path.
The RX path was the important bit. Once frame RX was working, I could do things like run the NIC in monitor mode and verify that HT20 and HT40 were working. And yes, that's pretty much what I did.

But at this point, the RX path exposed the first major API change - the whole FIFO setup that the AR9380 and later required. They don't support the list-based TX and RX that previous NICs supported. (Well, they _kind_ of do on the TX side, see below.)

The major change required here in the driver is that the RX descriptor is actually in the same memory area as the RX buffer. Ie, the first 'x' bytes of the passed-in buffer are where the NIC DMAs the RX completion information to. Previous NICs have two areas for each RX frame - an RX descriptor area and an RX buffer area - and descriptors traditionally live in non-cacheable memory, so I had to teach the FIFO RX path to cope with descriptors living in cacheable memory. I also had to teach the RX path to "skip" those 'x' bytes in order to hand the start of the data payload up to the net80211 stack. Finally, there are two RX FIFOs - one for high-priority frames (beacons, uAPSD frames, PS-POLL frames, etc) and one for low-priority frames (everything else.) I had to teach the stack about this.
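
In rough terms the receive path now has to do something like the following before handing a frame up; the size and structure names here are illustrative, not the real hardware layout:

#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: on the AR9380 the first part of each RX buffer *is*
 * the completion descriptor, so the 802.11 payload starts after it.
 */
#define RX_STATUS_LEN 48    /* made-up size of the in-buffer descriptor */

struct rx_status {          /* simplified stand-in for the HW layout */
    uint8_t words[RX_STATUS_LEN];
};

static const uint8_t *
rx_payload(const uint8_t *buf, size_t buflen, size_t *payload_len)
{
    struct rx_status rs;

    if (buflen < RX_STATUS_LEN)
        return (NULL);

    /*
     * The descriptor lives in the (cacheable) buffer memory, so the
     * buffer must be synced for the CPU before this copy is valid.
     * The fields in 'rs' (status, length, rate, etc) would be decoded here.
     */
    memcpy(&rs, buf, sizeof(rs));

    /* Skip the descriptor; the frame itself starts right after it. */
    *payload_len = buflen - RX_STATUS_LEN;
    return (buf + RX_STATUS_LEN);
}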

So, you can see the changes to the RX code - there's now a set of methods that implement RX - stop, start, flush, descriptor processing. The legacy routines stayed where they were. The new routines just overrode those methods.
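
The shape of that change is roughly this (a sketch with invented names; the real driver hangs these methods off its softc):

/* Sketch of the method-override idea; names are illustrative. */
struct ath_softc;

struct ath_rx_methods {
    void (*rx_start)(struct ath_softc *);
    void (*rx_stop)(struct ath_softc *, int dodelay);
    void (*rx_flush)(struct ath_softc *);
    int  (*rx_proc)(struct ath_softc *, int resched);
};

static void ath_legacy_rx_start(struct ath_softc *sc) { (void) sc; }
static void ath_edma_rx_start(struct ath_softc *sc) { (void) sc; }

static void
ath_rx_methods_attach(struct ath_rx_methods *m, int is_edma)
{
    /* Legacy (pre-AR9380) list-based RX stays the default ... */
    m->rx_start = ath_legacy_rx_start;
    /* ... and the FIFO-capable chips simply override the methods. */
    if (is_edma)
        m->rx_start = ath_edma_rx_start;
    /* rx_stop / rx_flush / rx_proc get overridden the same way. */
}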

And with that, RX came to life.

Next was TX. TX is a bit more special. There's only 8 TX FIFO entries per hardware queue (QCU 0..9); so I can't just push all the frames I want into the list. I also have four TX data buffer pointers per descriptor, rather than one per descriptor in the past. Finally, the TX status FIFO is completely separate from the TX FIFO itself - legacy chips would put the TX status at the end of the final descriptor in a frame.
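
The practical consequence is that the driver presumably has to track FIFO occupancy per queue and hold further frames back in software until a slot frees up. A sketch of that accounting (names invented, loosely modelled on the driver):

/* Illustrative sketch of per-queue TX FIFO accounting. */
#define HW_TXFIFO_DEPTH 8      /* AR9380: 8 FIFO slots per QCU */

struct ath_txq {
    int axq_fifo_depth;        /* FIFO slots currently in use */
    int axq_swq_depth;         /* frames parked in the software queue */
};

/* Returns 1 if a frame can be pushed straight into the hardware FIFO. */
static int
ath_tx_fifo_has_room(const struct ath_txq *txq)
{
    return (txq->axq_fifo_depth < HW_TXFIFO_DEPTH);
}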

This required some pretty significant refactoring of the TX path in order to expose the correct hooks to do this all properly. I won't go into the details here - suffice to say that I'm still working on it.

The next problem with TX was figuring out exactly what TX descriptor flags I was setting incorrectly. I eventually gave in and wrote some ALQ based logging which dumps the TX and RX descriptors into an ALQ log which I can then read from userland. This made it very, very easy to inspect what was going on - I was even finding bugs with the earlier chipset code!
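
I won't reproduce the ALQ glue here, but the idea is just fixed-size records pushed into a kernel log and decoded offline - something like this hypothetical record layout (the real athalq format differs):

#include <stdint.h>

/*
 * Hypothetical record layout for descriptor logging.  Each TX/RX
 * descriptor is copied verbatim into a fixed-size record, stamped with
 * an operation type and a timestamp, and a userland tool decodes the
 * raw descriptor words with per-chip knowledge.
 */
#define DESCLOG_NWORDS 24

struct desclog_rec {
    uint32_t op;                        /* TXDESC, TXSTATUS, RXSTATUS, ... */
    uint32_t tstamp;                    /* TSF or system ticks */
    uint32_t desc[DESCLOG_NWORDS];      /* raw descriptor words */
};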

Initially I used this to discover I wasn't correctly filling out all four buffer pointers in each TX descriptor. I can't leave any of them NULL if there are more descriptors to come for a given frame.

Then I used it to discover whether I was setting up the general flags right - TX chainmask, TX rate, duration, etc. I (re)discovered a hardware limitation with the AR9380 - I need to pad aggregate frames that use RTS with a few more pad delimiters or the transmission underruns. I was able to take these text dumps and give them to the Qualcomm Atheros MAC/PHY team for assistance, and they were very impressed by the sophistication of my debugging tools.

Now I have the TX and RX side working. I pushed all of the driver side code into the public FreeBSD repository. I promised people that I would eventually open up the HAL side of things, but I figured that keeping the driver side of this closed was just plain silly. It also meant that if I did stop working on things (for whatever reason), the driver side was done - all that would need porting was the HAL.

Then I began the internal process to get the HAL opened up. I won't go into this in too much detail - suffice to say it took some engineering and legal review to get approval for this. The approval came in about two weeks ago and I pushed the repository into github shortly afterward.

Shortly after that, people started testing it and filing bugs. This part made me happy - there's a few small bugs that are actually in our 10.x mainline tree. I'll be pushing fixes back into the internal driver tree soon.

So, what's next?

I need to push the repository into a vendor branch in FreeBSD, then merge it into the kernel tree so it can be compiled by default.

I then need to get an updated version of the HAL approved by legal/engineering and push that update into the public git repository. Once that's done, I'll do a git merge into my branch and fix up whatever merge issues there are. This updated HAL includes some fixes for TX power and the AR95xx embedded SoC that we've just released. I hope to try and do a HAL update every month or two based on what bugs and features are introduced into the internal mainline driver.

I still have a bunch of driver work to finish up - notably I need to finish optimising the TX FIFO path in the driver and I need to implement MCI support. But the driver is now usable for me at least and I hope it'll become increasingly usable by others.

This has been a long and interesting trip.

Sunday, March 3, 2013

Why PCI latency timers matter..

My latest "are you serious?" moment recently was trying to figure out the root cause of this performance issue with the AR5416 cardbus NIC on some of my test laptops.

Now, the AR5416 is Atheros' first 802.11n NIC, so it has some rough edges. But I was seeing some ridiculously bad transmission failures and I couldn't pinpoint them.

Not only that, I was seeing great performance (~ 130mbit TCP) on a specific laptop (Lenovo T41p) but the Lenovo T60 and T400 both performed extremely poorly.

To make matters weirder - the NIC performed great when speaking to another NIC in the same laptop. Just not to another physically separate device.

So, after much digging, here's what I discovered.

Firstly - I used my athalq packet descriptor logging and inspection tool (that's in FreeBSD-HEAD - no custom closed source code here!) to investigate the TX frames being sent to the hardware. What I found was troubling - large numbers of frames had TX data and TX delimiter underruns.

I then discovered that my code for counting TX data / delimiter underruns was totally incorrect - it's possible to see a data/delimiter underrun error reported _with_ a frame that still transmits successfully. What was going on was cute - the hardware would start transmitting an aggregate frame, but the DMA wouldn't keep up during said transmission and halfway through the frame it would underrun. This only happened at higher MCS rates.
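
(As an aside, the counting fix boiled down to checking the underrun bits independently of whether the frame itself completed - roughly this, with illustrative field names:)

/* Illustrative: count underruns even when the frame still went out OK. */
struct tx_status {
    int ts_ok;                 /* frame completed successfully */
    int ts_data_underrun;      /* DMA starved the data FIFO mid-frame */
    int ts_delim_underrun;     /* delimiter FIFO starved mid-aggregate */
};

static void
count_tx_underruns(const struct tx_status *ts, long *data_ur, long *delim_ur)
{
    /*
     * The old (wrong) logic only looked at these when the frame failed;
     * the hardware can flag an underrun *and* still complete the frame.
     */
    if (ts->ts_data_underrun)
        (*data_ur)++;
    if (ts->ts_delim_underrun)
        (*delim_ur)++;
}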

So making shorter aggregate frames fixed it, as well as increasing the delimiter count between frames. Both had the effect of reducing the likelihood of the NIC failing to transmit a longer aggregate. But they weren't solutions.

So I went digging. What I found was pretty simple in theory: the PCI latency timer on the NIC was being set to something appropriate (0xa8) but the PCI latency timer on the cardbus PCI bridge itself was not (0x20.) So any other bus activity would cause the NIC to not get the bus and it'd miss its DMA window.

Once I manually fixed the PCI bridge latency timer to be 0xa8, everything returned to normal.
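
For reference, the knob itself is just the standard PCI latency timer register in config space. From a FreeBSD driver the fix-up looks roughly like this (a sketch only - here it's the cardbus bridge, not the NIC, that needs the write, and whether to do this unconditionally is a separate question):

#include <sys/param.h>
#include <sys/bus.h>
#include <dev/pci/pcireg.h>
#include <dev/pci/pcivar.h>

static void
fixup_latency_timer(device_t dev)
{
    uint8_t lat;

    lat = pci_read_config(dev, PCIR_LATTIMER, 1);
    if (lat < 0xa8) {
        device_printf(dev, "bumping PCI latency timer 0x%x -> 0xa8\n", lat);
        pci_write_config(dev, PCIR_LATTIMER, 0xa8, 1);
    }
}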

However - there's only one thing on this PCI bridge - the cardbus interface itself. That's why it's so kooky. I would've thought that I'd have to up the value on the rest of the PCI bridges up to the root complex. There's no latency timer for PCIe, so it's not a problem there. So there's likely some very subtle timing involved that's just plain broken by default in how the BIOS initialises this cardbus slot, and FreeBSD is not overriding it.

Now, if you see crappy performance on the PCI/cardbus 802.11n NICs in FreeBSD, you can check the output of 'athstats' to see if you're seeing TX underruns of any sort. If you are, the hardware isn't meeting the DMA deadlines it needs to DMA out frames, and you need to do some further digging into your system to see why.

Friday, November 30, 2012

Be careful of adding debugging, as microseconds count..

.. after tinkering with the TDMA code a bit more, I discovered why I was seeing larger swings in the TDMA slot timings.

Two words: Debug Code.

Well, to be more specific - I added some debugging code that by default didn't do anything. But it was still there; it checked a debug flag and didn't log anything if it was disabled. But that would take time to execute. Since that debugging code sat _between_ the routines doing math with the RX timestamp and the nexttbtt register, it would calculate a slightly larger TSF offset.

Once I moved the debug code out from where it was and grouped all that register access and math together, the slot timing swings dropped by a few microseconds and everything went back to smooth.
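
The shape of the mistake and the fix, in caricature (all names here are invented):

#include <stdint.h>

extern uint64_t read_rx_tsf(void);
extern void write_nexttbtt(uint32_t tu);
extern int debug_enabled;
extern void debug_log(const char *msg, uint64_t val);

static void
update_slot_bad(void)
{
    uint64_t tsf = read_rx_tsf();

    /*
     * Even a disabled debug check burns cycles *between* sampling the
     * TSF and programming the timer, skewing the computed offset.
     */
    if (debug_enabled)
        debug_log("rx tsf", tsf);

    write_nexttbtt((uint32_t)(tsf >> 10));  /* microseconds -> TU */
}

static void
update_slot_good(void)
{
    uint64_t tsf = read_rx_tsf();

    /* Keep the sample, the math and the register write together ... */
    write_nexttbtt((uint32_t)(tsf >> 10));

    /* ... and log afterwards, where the extra cycles don't matter. */
    if (debug_enabled)
        debug_log("rx tsf", tsf);
}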

Tsk. I should've known better.

At least now the TDMA code is working well on the 802.11n chips. Yes, it's still only 802.11abg rates, but it works. I've also found the PCU MISC_MODE bit to enforce that packets don't transmit outside of the burst window, and that is working quite fine with TDMA.

So, I think I can say "mission accomplished." I'll tidy up a few more things and make sure TX only occurs in one data queue (as mentioned in my previous post, they all burst independently at the moment..) and then patiently wait for someone to implement 802.11n adhoc negotiation so 802.11n MCS rates and aggregation magically begins to work. Once that's done, 802.11n TDMA will become a reality.

Tuesday, November 27, 2012

Getting TDMA working on 802.11n chipsets

A few years ago, a bunch of clever people figured out how to implement TDMA using the Atheros 802.11abg NICs. Sam Leffler has a great write-up here. He finished that particular paper with some comments about the (then) upcoming 802.11n chipsets from Atheros and how they would be better suited to the kinds of tricks he pulled with the Atheros MAC.

But, if you tried bringing up TDMA on the Atheros 802.11n chips, it just plain didn't work. Lots of people gnashed their teeth about it. I was knee deep in TX aggregation work at the time, so I just pushed TDMA to the back of my mind.

How it works is pretty cute in itself. To setup a TX "slot", the beacon timer is used to gate the TX queues to be able to start transmitting. Then a "channel ready time" burst length is configured, which is the period of time the TX queue can transmit. Once that timer expires, no new TX is allowed to begin. Sam then slides the slave TX window along based on when it sees a beacon from the master, as everything is synchronised against that.

Luckily, someone did some initial investigation and discovered that a couple of things were very very wrong.

Firstly, when fetching the next target beacon transmission time ("TBTT"), the AR5212 era NICs returned it in TU, but the AR5416 and later returned it in TSF.

Secondly, the TSF from each RX frame on the AR5212 is only 15 bits; on the AR5416 and later it's 32 bits. The wrong logic was used when extending the AR5416's 32-bit RX frame timestamp to 64 bits, and it was causing the TSF to jump all over the place.
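
Extending a 32-bit per-frame timestamp against the 64-bit TSF has one classic gotcha: the low 32 bits may have wrapped between the frame arriving and the TSF being read. A hedged sketch of the usual approach (not the driver's literal routine):

#include <stdint.h>

static uint64_t
extend_rx_tsf32(uint32_t rstamp, uint64_t tsf64)
{
    uint64_t result;

    result = (tsf64 & ~0xffffffffULL) | rstamp;

    /*
     * If the frame's low 32 bits are "ahead" of the current TSF's low
     * 32 bits, the counter wrapped after the frame was received, so the
     * frame actually belongs to the previous 2^32 microsecond epoch.
     */
    if (rstamp > (uint32_t)(tsf64 & 0xffffffff))
        result -= (1ULL << 32);
    return (result);
}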

So with that in place, he managed to stop the NICs from spewing stuck beacons everywhere (a classic "whoa, who set up the timers wrong!" symptom) and got two 11n NICs configured in a TDMA setup. But he reported the traffic was very unstable, so he had to stop.

Fast-forward about 12 months. I've finished the TX aggregation and BAR handling; I've debugged a bunch of AP power save handling and I'm about to reimplement some things to allow me to finish off AP power save handling (legacy/ps-poll and uapsd) in a sane, correct fashion. I decide, "hey, TDMA shouldn't be that hard to fix. Hopefully there are no chip bugs, right?" So, I plug in a pair of AR5413 (pre-11n NICs) and get it up and running. Easy. Then I plug in an AR5416 as the slave node, and .. it worked. Ok, so why was he reporting such bad results?

Firstly, Sam exposed a bunch of useful TDMA stats from "athstats". Specifically, if you start tinkering with TDMA, do this:

$ athstats -i ath0 -o tdma 1


   input   output  bexmit tdmau   tdmadj crcerr  phyerr  TOR rssi noise  rate
  619817   877907   25152 25152    -4/+6    142     143    1   74   -96   24M
     492      712      20    20    -0/+7      0       0    0   74   -96   24M
     496      720      20    20    -2/+6      0       0    0   74   -96   24M
     500      723      21    21    -6/+4      0       0    0   75   -96   24M

When I was debugging the initial AR5416 TDMA stuff, the tdma adjust figures bounced everywhere between 0 and 1000uS off. That was obviously not stable.

So, I looked at what debugging was in the driver itself. There was some (check if_ath_debug.h for the TDMA and TDMA timer flags), and after a bit of digging I realised that every time the TSF was just about to converge, it would be bumped out 1000uS. Then it would slowly drift back to converge, then it'd fall out 1000uS again. This kept repeating. It made no sense; every time it calculated the delta between the expected and real TSF, it would "bump" the TSF by that much so the TSF would actually be correct. It shouldn't be out by almost as much on the very next RX'ed frame.

I did some initial testing to ensure the TSF was running at the expected 1uS interval (it was) and the master side was also running at the expected 1uS interval (it also was), so it wasn't a case of out-of-sync clocks. The TSF bump must not have been "right".

Enter the next bug - on the AR5416 and later, the TSF writes must be done as a 64 bit write. Ie, you write TSF_L32 first, then TSF_U32. At that point it gets internally updated and everything is consistent. If you don't do that, it doesn't latch.
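
In code the rule is simply "low word first, high word second". A sketch using the HAL-style register names (the write macro and the offsets here are stand-ins):

#include <stdint.h>

#define AR_TSF_L32 0x804c   /* illustrative offsets only */
#define AR_TSF_U32 0x8050

extern void REG_WRITE(uint32_t reg, uint32_t val);

static void
set_tsf64(uint64_t tsf)
{
    /* AR5416+: the TSF only latches when written as L32 then U32. */
    REG_WRITE(AR_TSF_L32, (uint32_t)(tsf & 0xffffffff));
    REG_WRITE(AR_TSF_U32, (uint32_t)(tsf >> 32));
}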

Ok, so that fixed the initial drift. But after about 60 seconds, the TSF adjust parameters started varying ridiculously wildly. Now, 65,535 TU (where a TU is 1.024 milliseconds) is around 67 seconds, so I began to wonder if I was seeing something wrap at around that point.

Enter the next bug. The math involved in calculating the expected slot time was based on the 64 bit TSF and it was converted down to a 16 bit TU value from 0 .. 65535 TU. On the AR5212 era chips, the nexttbtt timer had a 16 bit resolution. When the nexttbtt value was read from that register, it was already 16 bits. So the "TSF delta" between the expected and real slot time was calculated between these two 16 bit values. However, on the AR5416 and later, the nexttbtt value was a 32 bit TSF (microsecond) value. Even when converted to a TU (1.024 millisecond) value, it would wrap at a value much greater than 65,535 TU. So the comparison would soon be between a value from 0..65,535 TU and 0 .. much-bigger-than-65,535 TU. The tsfdelta would become very, very negative.. and things would go nuts.

Ok, so that fixed another behavioural issue. Things were looking good. The slot time sync was stable. So I started passing traffic. Everything looked good.. for about 60 seconds. Then everything went slightly nuts again. But only with traffic. The timing calculations went way, way out.

Here's an example of the beacons coming in. Note that the expected beacon interval here is 49,152uS.

[34759308] [100933] BEACON: RX TSF=67127545 Beacon TSF=3722387514 (49152)
[34759357] [100933] BEACON: RX TSF=67176714 Beacon TSF=3722436670 (49156)
[34759442] [100933] BEACON: RX TSF=67262432 Beacon TSF=3722521354 (84684)
[34759454] [100933] BEACON: RX TSF=67275216 Beacon TSF=3722533850 (12496)
[34759504] [100933] BEACON: RX TSF=67325995 Beacon TSF=3722583802 (49952)
[34759552] [100933] BEACON: RX TSF=67374479 Beacon TSF=3722632108 (48306)
[34759602] [100933] BEACON: RX TSF=67424546 Beacon TSF=3722681282 (49174)
[34759652] [100933] BEACON: RX TSF=67475842 Beacon TSF=3722731578 (50296)
[34759701] [100933] BEACON: RX TSF=67525900 Beacon TSF=3722780730 (49152)

The master beacons were not coming in stable in any way. The main reason this would happen is if the air was busy at the master target beacon transmission time. So it would delay transmitting the beacon until the air was free.

This is where I decided it was about time I inserted some tracing into the TDMA code. I had introduced some ALQ based tracing in the ath(4) driver recently, specifically to trace TX and RX descriptors. I decided to add TDMA trace points. That way I could look at the TDMA recalculation along with the TX and RX from the driver.

What I found was very .. grr-y. After about 60 seconds (surprise), the TX would burst FAR past the 2.5 milliseconds it was supposed to. Why the heck was that happening?

After a bunch of staring-at-documentation and talking with some people well-versed in how the Atheros MAC worked, we realised the only real explanation is that the beacon timer was firing after the burst time, retriggering the timer. But why would it be? I stared at the debugging output a little more, and look at what I saw:

[34759258] [100933] BEACON: RX TSF=67077388 Beacon TSF=3722338362 (49152)
[34759258] [100933] SLOTCALC: NEXTTBTT=67081216 nextslot=67081224 tsfdelta=8 avg (5/8)
[34759258] [100933] TIMERSET: bt_intval=8388616 nexttbtt=65510 nextdba=524078 nextswba=524070 nextatim=65511 flags=0x0 tdmadbaprep=2 tdmaswbaprep=10
[34759259] [100933] TSFADJUST: TSF64 was 67077561, adj=1016, now 67078577

.. everything here is fine. We're programming nexttbtt in TU, not TSF (because the HAL API specifies it in TU for the older, pre-11n chips). Ok. But nexttbtt=65510 is suspiciously close to the 65,535 TU boundary.

Then:

[34759308] [100933] BEACON: RX TSF=67127545 Beacon TSF=3722387514 (49152)
[34759308] [100933] SLOTCALC: NEXTTBTT=22528 nextslot=67131381 tsfdelta=-11 avg (5/7)
[34759308] [100933] TSFADJUST: TSF64 was 67127704, adj=11, now 67127715

Ok, but it's just a TSF adjust, no biggie. But, then this happened:

[34759357] [100933] BEACON: RX TSF=67176714 Beacon TSF=3722436670 (49156)
[34759357] [100933] SLOTCALC: NEXTTBTT=71680 nextslot=67180550 tsfdelta=6 avg (5/7)
[34759357] [100933] TIMERSET: bt_intval=8388616 nexttbtt=71 nextdba=566 nextswba=558 nextatim=72 flags=0x0 tdmadbaprep=2 tdmaswbaprep=10
[34759357] [100933] TSFADJUST: TSF64 was 67176888, adj=1018, now 67177906

At this point, it was clear. nexttbtt was very, very small - 71 TU is well before the current TSF of somewhere around 67,127,545 microseconds. At this point the Next TBTT timer would just keep continuously firing, and that would keep re-gating the TX queue, allowing it to just plain keep bursting. That explains why everything was going crazy during traffic.

This again was another example of the code assuming it was an AR5212 era NIC. The nexttbtt value was being trimmed to be between 0 and 65,535 TU. After I fixed that and fixed up the math a bit, nexttbtt was being correctly programmed and suddenly everything started working. And quite well.

So, now the basics are working. I'll audit the math to ensure everything wraps consistently at the 32-bit TSF boundary (ie, 4 billion microseconds, give or take) as that doesn't take too long to occur. But the 11n chips now behave the same as the 11a chips do when doing TDMA.

So what's next?
  • The "tx time" calculation needs to be aware of the 11n rate configuration, so it can calculate the guard time correctly. Right now it uses the non-11n aware rate -> duration HAL function;
  • The TX path has to be rejiggled a bit to ensure _all_ traffic gets stuffed into one TX queue (well, besides beacons.) Management and higher priority traffic has to do this too. If not, then multiple TX queues can burst and they'll burst separately, blowing out the TX slot timing;
  • Someone needs to get 11n adhoc working, so that 11n rates are negotiated during adhoc peer establishment. Then aggregation can just magically work at that point (the TDMA code reuses a lot of adhoc mode vap behaviour code);
  • 802.11e / 802.11n delayed block-ACK support needs to be implemented;
  • Then when doing TDMA, we can just burst out an aggregate or two inside the given slot time, then wait for a delayed block ACK to come back from the remote peer in the next slot time! Yes, I'd like to try and reuse the standard stuff for doing delayed block-ack rather than implementing something specific for 802.11n aggregation + TDMA.
  • .. and yes, it'd be nice for this to support >2 slave terminals, but that's a bigger project.
Right now I think I'll tackle #1 and then make sure the 11n NICs can be configured in a static MCS rate, without aggregation. The rest will have to be up to someone else in the community. My plate is full.

So, TDMA on the 802.11n NICs is now working. Go forth and hack!

Tuesday, November 20, 2012

Making the AR5210 NIC work in the office..

I'm quite happy that FreeBSD's ath(4) driver supports almost all of the PCI and PCIe devices that Atheros has made. Once I find a way to open source this AR9380 HAL I've constructed, we'll actually support them all. However, there are a few little niggling things that have been bugging me. Today I addressed one of those.

The AR5210. It's their first 11a-only NIC. It does up to 54MBit OFDM 802.11a; it doesn't do QoS/WME (as it only has one data queue); it "may" go up to 72MBit if I hack on some magic extensions. And in open mode, it works great.

But it didn't work in the office or at home. All of which are 802.11n APs with WPA2 authentication and AES-CCMP encryption.

Now, the AR5210 only does open and WEP encryption. It doesn't do TKIP or AES-CCMP. So the encryption has to happen in software. The NIC was associating fine, but when wpa_supplicant went to program in the AES-CCMP encryption keys, the HAL simply refused.

What I discovered was this.

The driver keycache code was also trying to allocate keycache slots for the AR5210, which only supports the 4 WEP key slots. This is a big no-no. So once I mapped them all to slot 0, I made a little progress.

The net80211 layer was trying to program in an AES-CCMP key, which the driver was dutifully passing to the HAL. The AR5210 HAL doesn't support anything but WEP or open, so the encryption key type was "clear". Now, "clear" means "for this MAC address, don't try decrypting anything." But the AR5210 HAL code rejected it - as I said, it doesn't do that.

Ok, so I ignored that entirely. I mapped all of the software encrypted key entries to slot 0 and just didn't program the hardware. So now the HAL didn't reject things. But it wasn't working. The received frames were being corrupted somehow and failed the CCMP MIC integrity check. I took a look at the frames being received (which should've been "clear") versus what was going on in the air - luckily, this laptop has an AR9280 inside, so I could put it into monitor mode and sniff things. The packets just didn't add up. I was confused.

Then after discussing this with my flatmate, I idly wondered if the hardware was decrypting the traffic anyway. And, well, it was. Encrypted frames have the WEP (protected) bit set in the 802.11 header - whether they're WEP, TKIP or AES-CCMP. The AR5210 didn't know it wasn't WEP, so it tried decoding the frames itself - and corrupted them.

So after finding a PCU control register (hi AR_DIAG_SW) that lets me disable encryption/decryption, I was able to pass through the encrypted traffic fine and everything just plain worked. It's odd seeing an 11a, non-QoS station on my 11n AP, but that just goes to show that backwards interoperability is still useful.
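
The change boils down to flipping the relevant disable bits in that diagnostic register - roughly like this (the register offset and bit values below are placeholders, not the real ar5210reg.h definitions):

#include <stdint.h>

/* Placeholder values; see ar5210reg.h for the real AR_DIAG_SW bits. */
#define AR_DIAG_SW       0x8048
#define AR_DIAG_DIS_ENC  0x00000008
#define AR_DIAG_DIS_DEC  0x00000010

extern uint32_t REG_READ(uint32_t reg);
extern void REG_WRITE(uint32_t reg, uint32_t val);

/*
 * With hardware crypto bypassed, the AR5210 no longer tries to "WEP
 * decrypt" CCMP frames (which have the protected bit set in the 802.11
 * header), and net80211 does all the crypto in software.
 */
static void
ar5210_disable_hw_crypto(void)
{
    REG_WRITE(AR_DIAG_SW,
        REG_READ(AR_DIAG_SW) | AR_DIAG_DIS_ENC | AR_DIAG_DIS_DEC);
}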

And yes, I did take the AR5210 into the office and I did sit in a meeting with it and use it to work from. It let me onto the corporate wireless just fine, thankyou.

So now the FreeBSD AR5210 support doesn't do any hardware encryption. You can turn it on again if you'd like. Why? Because I don't want the headache of someone coming to me and asking why a dual-VAP AP with WEP and CCMP is failing. The hardware can only do _either_ WEP/open with hardware encryption, _or_ it can do everything without hardware encryption. So I decided to just disable it for now.

There's also a problem with how encryption is specified to net80211. It's done at startup time, when the driver attaches. Anything that isn't specified as being done in hardware is done in software. There is currently no clean way to dynamically change that configuration. So, if I have WEP encryption in hardware but CCMP/TKIP in software, I have to dynamically flip on/off the hardware encryption _AND_ I have to enforce that WEP and CCMP doesn't get configured at the same time.

The cleaner solution would be to:
  • Create a new driver attribute, which indicates the hardware can do WEP and CCMP at the same time - make sure it's off for the AR5210;
  • Add a HAL call to enable/disable hardware encryption;
  • If a user wants to do WEP or open - enable hardware encryption;
  • If a user wants to do CCMP/TKIP/etc - disable hardware encryption;
  • Complain if the user wants to create a VAP with CCMP/TKIP and WEP.
 If someone wants a mini-project - and they have an AR5210 - I'm all for it. But at the moment, this'll just have to do.

Thursday, October 4, 2012

Power save, CABQ, multicast frames, EAPOL frames and sequence numbers (or why does my Macbook Pro keep disassociating?)

I do lots of traffic tests when I commit changes to the FreeBSD Atheros HAL or driver. And I hadn't noticed any problems until very recently (when I was doing filtered frames work.) I noticed that my macbook pro would keep disassociating after a while. I had no idea why - it would happen with or without any iperf traffic. Very, very odd.

So I went digging into it a bit further (and it took quite a few evenings to narrow down the cause.) Here's the story.

Firstly - hostapd kept kicking off my station. Ok, so I had to figure out why. It turns out that the group rekey would occasionally fail. When it's time to do a group rekey, hostapd will send a unicast EAPOL frame to each associated station with the new key and each station must send back another EAPOL frame, acknowledging the fact. This wasn't happening so hostapd would just disconnect my laptop.

Ok, so then I went digging to see why. After adding lots of debugging code I found that the EAPOL frames were actually making it to my Macbook Pro _AND_ it was ACKing them at the 802.11 layer. Ok, so the frame made it out there. But why the hell was it being dropped?

Now that I knew it was making it to the end node, I could eliminate a bunch of possibilities. What was left:
  • Sequence number is out of order;
  • CCMP IV replay counter is out of order;
  • Invalid/garbled EAPOL frame contents.
I quickly ruled out the EAPOL frame contents. The sequence number and CCMP IV were allocated correctly and in order (and never out of sequence from each other.) Good. So what was going on?

Then I realised - ok, all the traffic is in TID 16 (the non-QoS TID.) That means it isn't a QoS frame but it still has a sequence number; so it is allocated one from TID 16. There's only one CCMP IV number for a transmitter (the receiver tracks a per-TID CCMP IV replay counter, but the transmitter only has one global counter.) So that immediately rings alarm bells - what if the CCMP IV sequence number isn't being allocated in a correctly locked fashion?
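
Whatever the eventual fix looks like, the invariant is simple: the per-key CCMP packet number has to be allocated atomically, so concurrent transmit paths can't hand out duplicate or out-of-order IVs. A hedged sketch (the real net80211 key structure and locking look different):

#include <stdatomic.h>
#include <stdint.h>

struct ccmp_key_state {
    _Atomic uint64_t pn;    /* the 48-bit PN lives in the low bits */
};

static uint64_t
ccmp_next_pn(struct ccmp_key_state *k)
{
    /*
     * atomic_fetch_add returns the previous value; +1 yields a unique,
     * monotonically increasing PN to stamp into this frame's CCMP IV.
     */
    return (atomic_fetch_add(&k->pn, 1) + 1);
}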

Ok. So I should really fix that bug. Actually, let me go and file a bug right now.

There. Bug filed. PR 172338.

Now, why didn't this occur back in Perth? Why is it occurring here? Why doesn't it occur under high throughput iperf (150Mbps+), but it does when the iperf tests are capped at 100Mbps ethernet speeds? Why doesn't it drop my FreeBSD STAs?

Right. So what else is in TID 16? Guess what I found? All the multicast and broadcast traffic - things like ARPs - is in TID 16.

Then I discovered what was really going on. The pieces fell into place.

  • My mac does go in and out of powersave - especially when it does a background scan.
  • When the mac is doing 150Mbps+ of test traffic, it doesn't do background scans.
  • When it's doing 100Mbps of traffic, the stack sneaks in a background scan here and there.
  • Whenever it goes into background scan, it sends a "power save" to the AP..
  • .. and the AP puts all multicast traffic into the CABQ instead of sending it to the destination hardware queue.
  • Now, when this occurred, the EAPOL frames would go into the software queue for TID 16 and the ARP/multicast/etc traffic would go into the CABQ
  • .. but the CABQ has higher priority, so it'll be transmitted just after the beacon frame goes out, before the EAPOL frames in the software queue.
Now, given the above set of conditions, the ARP/multicast traffic (which there's more of in my new place, thanks to a DSL modem that constantly scans the local DHCP range for rogue/disconnected devices) would be assigned sequence numbers AFTER the EAPOL frames that went out but are sitting in the TID 16 software queue. The Mac would receive those CABQ frames with later sequence numbers, THEN my EAPOL frame. Which would be rejected for being out of sequence.

The solution? Complicated.

The temporary solution? TID 16 traffic is now in a higher priority hardware queue, so it goes out first. Yes, I should mark EAPOL frames that way. I'll go through and tidy this up soon. I just needed to fix this problem before others started reporting the instability.
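
The interim hack amounts to roughly this decision at TX queue selection time (a sketch with illustrative constants; the real driver maps frames to hardware queues via WME access categories):

#define NONQOS_TID 16
#define AC_BE      0    /* best-effort data queue */
#define AC_VO      3    /* voice: highest-priority data queue */

static int
tx_select_ac(int tid)
{
    /*
     * Non-QoS traffic (TID 16), which includes the EAPOL frames, goes
     * out via the high-priority queue so it isn't left sitting behind
     * CABQ bursts of multicast traffic carrying later sequence numbers.
     */
    if (tid == NONQOS_TID)
        return (AC_VO);
    return (AC_BE);
}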

The real solution is complicated. It's complicated because in power save mode, there's both unicast and multicast traffic going into the same TID(s) but different hardware queues. Given this, it's quite possible that the traffic in the CABQ will burst out before the unicast packets with the same TID make it out via another hardware queue.

I'm still thinking of the best way to fix this.