Friday, November 30, 2012

Be careful of adding debugging, as microseconds count..

.. after tinkering with the TDMA code a bit more, I discovered why I was seeing larger swings in the TDMA slot timings.

Two words: Debug Code.

Well, to be more specific - I added some debugging code that by default didn't do anything. But it was still there; it checked a debug flag and didn't log anything if it was disabled. But that would take time to execute. Since that debugging code sat _between_ the routines doing math with the RX timestamp and the nexttbtt register, it would calculate a slightly larger TSF offset.

Once I moved the debug code out from where it is and grouped all that register access and math together, the slot timing swings dropped by a few microseconds and everything went back to smooth.

Tsk. I should've known better.

At least now the TDMA code is working well on the 802.11n chips. Yes, it's still only 802.11abg rates, but it works. I've also found the PCU MISC_MODE bit to enforce packets don't transmit outside of the burst window and that is working quite fine with TDMA.

So, I think I can say "mission accomplished." I'll tidy up a few more things and make sure TX only occurs in one data queue (as mentioned in my previous post, they all burst independently at the moment..) and then patiently wait for someone to implement 802.11n adhoc negotiation so 802.11n MCS rates and aggregation magically begins to work. Once that's done, 802.11n TDMA will become a reality.

Tuesday, November 27, 2012

Getting TDMA working on 802.11n chipsets

A few years ago, a bunch of clever people figured out how to implement TDMA using the Atheros 802.11abg NICs. Sam Leffler has a great write-up here. He finished that particular paper with some comments about the (then) upcoming 802.11n chipsets from Atheros and how they would be better suited to the kinds of tricks he pulled with the Atheros MAC.

But, if you tried bringing up TDMA on the Atheros 802.11n chips, it plain just didn't work. Lots of people gnashed teeth about it. I was knee deep in TX aggregation work at the time so I just pushed TDMA to the back of my mind.

How it works is pretty cute in itself. To setup a TX "slot", the beacon timer is used to gate the TX queues to be able to start transmitting. Then a "channel ready time" burst length is configured, which is the period of time the TX queue can transmit. Once that timer expires, no new TX is allowed to begin. Sam then slides the slave TX window along based on when it sees a beacon from the master, as everything is synchronised against that.

Luckily, someone did some initial investigation and discovered that a couple of things were very very wrong.

Firstly, when fetching the next target beacon transmission time ("TBTT"), the AR5212 era NICs returned it in TU, but the AR5416 and later returned it in TSF.

Secondly, the TSF from each RX frame on the AR5212 is only 15 bits; on the AR5416 and later its 32 bits. The wrong logic was used when extending the RX frame timestamp from the AR5416 from 32 bits to 64 bits, and it was causing the TSF to jump all over the place.

So with that in place, he managed to stop the NICs from spewing stuck beacons everywhere (a classic "whoa, who setup the timers wrong!" symptom) and got two 11n NICs configured in a TDMA setup. But he reported the traffic was very unstable, so he had to stop.

Fast-forward about 12 months. I've finished the TX aggregation and BAR handling; I've debugged a bunch of AP power save handling and I'm about to reimplement some things to allow me to finish of AP power save handling (legacy/ps-poll and uapsd) in a sane, correct fashion. I decide, "hey, TDMA shouldn't be that hard to fix. Hopefully there are no chip bugs, right?" So, I plug in a pair of AR5413 (pre-11n NICs) and get it up and running. Easy. Then I plug in an AR5416 as the slave node, and .. it worked. Ok, so why was he reporting such bad results?

Firstly, Sam exposed a bunch of useful TDMA stats from "athstats". Specifically, if you start tinkering with TDMA, do this:

$ athstats -i ath0 -o tdma 1


   input   output  bexmit tdmau   tdmadj crcerr  phyerr  TOR rssi noise  rate
  619817   877907   25152 25152    -4/+6    142     143    1   74   -96   24M
     492      712      20    20    -0/+7      0       0    0   74   -96   24M
     496      720      20    20    -2/+6      0       0    0   74   -96   24M
     500      723      21    21    -6/+4      0       0    0   75   -96   24M

When I was debugging the initial AR5416 TDMA stuff, the tdma adjust figures bounced everywhere between 0 and 1000uS off. That was obviously not stable.

So, I looked at what debugging was in the driver itself. There was some (check if_ath_debug.h for the TDMA and TDMA timer flags), and after a bit of digging I realised that every time the TSF was just about to converge, it would be bumped out 1000uS. Then it would slowly drift back to converge, then it'd fall out 1000uS. This kept repeating. It made no sense; every time it calculated the delta between the expected and real TSF, it would "bump" the TSF by that much. That way the TSF would actually be correct. It shouldn't be out by almost as much the next RX'ed frame.

I did some initial testing to ensure the TSF was running at the expected 1uS interval (it was) and the master side was also running at the expected 1uS interval (it also was), so it wasn't out of sync clocks. The TSF bump must not be "right".

Enter the next bug - on the AR5416 and later, the TSF writes must be done as a 64 bit write. Ie, you write TSF_L32 first, then TSF_U32. At that point it gets internally updated and everything is consistent. If you don't do that, it doesn't latch.

Ok, so that fixed the intial drift. But after about 60 seconds, the TSF adjust parameters started varying ridiculously wildly. Ok, so 60 seconds equaled around 65,535 TU (where a TU is 1.024 milliseconds) so I began to wonder if I was seeing something wrap at that point.

Enter the next bug. The math involved in calculating the expected slot time was based on the 64 bit TSF and it was converted down to a 16 bit TU value from 0 .. 65535 TU. On the AR5212 era chips, the nexttbtt timer had a 16 bit resolution. When the nexttbtt value was read from that register, it was already 16 bits. So the "TSF delta" between the expected and real slot time was calculated between these two 16 bit values. However, on the AR5416 and later, the nexttbtt value was a 32 bit TSF (microsecond) value. Even when converted to a TU (1.024 millisecond) value, it would wrap at a value much greater than 65,535 TU. So the comparison would soon be between a value from 0..65,535 TU and 0 .. much-bigger-than-65,535 TU. The tsfdelta would become very, very negative.. and things would go nuts.

Ok, so that fixed another behavioural issue. Things were looking good. The slot time sync was stable. So I started passing traffic. Everything looked good.. for about 60 seconds. Then everything went slightly nuts again. But only with traffic. The timing calculations went way, way out.

Here's an example of the beacons coming in. Note that the expected beacon interval here is 49,152uS.

[34759308] [100933] BEACON: RX TSF=67127545 Beacon TSF=3722387514 (49152)
[34759357] [100933] BEACON: RX TSF=67176714 Beacon TSF=3722436670 (49156)
[34759442] [100933] BEACON: RX TSF=67262432 Beacon TSF=3722521354 (84684)
[34759454] [100933] BEACON: RX TSF=67275216 Beacon TSF=3722533850 (12496)
[34759504] [100933] BEACON: RX TSF=67325995 Beacon TSF=3722583802 (49952)
[34759552] [100933] BEACON: RX TSF=67374479 Beacon TSF=3722632108 (48306)
[34759602] [100933] BEACON: RX TSF=67424546 Beacon TSF=3722681282 (49174)
[34759652] [100933] BEACON: RX TSF=67475842 Beacon TSF=3722731578 (50296)
[34759701] [100933] BEACON: RX TSF=67525900 Beacon TSF=3722780730 (49152)

The master beacons were not coming in stable in any way. The main reason this would happen is if the air was busy at the master target beacon transmission time. So it would delay transmitting the beacon until the air was free.

This is where I decided it was about time I inserted some tracing into the TDMA code. I had introduced some ALQ based tracing in the ath(4) driver recently, specifically to trace TX and RX descriptors. I decided to add TDMA trace points. That way I could look at the TDMA recalculation along with the TX and RX from the driver.

What I found was very .. grr-y. After about 60 seconds (surprise), the TX would burst FAR past the 2.5 milliseconds it was supposed to. Why the heck was that happening?

After a bunch of staring-at-documentation and talking with some people well-versed in how the Atheros MAC worked, we realised the only real explanation is that the beacon timer was firing after the burst time, retriggering the timer. But why would it be? I stared at the debugging output a little more, and look at what I saw:

[34759258] [100933] BEACON: RX TSF=67077388 Beacon TSF=3722338362 (49152)
[34759258] [100933] SLOTCALC: NEXTTBTT=67081216 nextslot=67081224 tsfdelta=8 avg (5/8)
[34759258] [100933] TIMERSET: bt_intval=8388616 nexttbtt=65510 nextdba=524078 nextswba=524070 nextatim=65511 flags=0x0 tdmadbaprep=2 tdmaswbaprep=10
[34759259] [100933] TSFADJUST: TSF64 was 67077561, adj=1016, now 67078577

.. everything here is fine. We're programming nexttbtt in TU, not TSF (because the HAL API specifies it in TU for the older, pre-11n chips. Ok. Suspiciously close to the 65,535 TU boundary.

Then:

[34759308] [100933] BEACON: RX TSF=67127545 Beacon TSF=3722387514 (49152)
[34759308] [100933] SLOTCALC: NEXTTBTT=22528 nextslot=67131381 tsfdelta=-11 avg (5/7)
[34759308] [100933] TSFADJUST: TSF64 was 67127704, adj=11, now 67127715

Ok, but it's just a TSF adjust, no biggie. But, then this happened:

[34759357] [100933] BEACON: RX TSF=67176714 Beacon TSF=3722436670 (49156)
[34759357] [100933] SLOTCALC: NEXTTBTT=71680 nextslot=67180550 tsfdelta=6 avg (5/7)
[34759357] [100933] TIMERSET: bt_intval=8388616 nexttbtt=71 nextdba=566 nextswba=558 nextatim=72 flags=0x0 tdmadbaprep=2 tdmaswbaprep=10
[34759357] [100933] TSFADJUST: TSF64 was 67176888, adj=1018, now 67177906

At this point, it was clear. nexttbtt was very very small. Somehow it was very very small - 71 TU is very, very much before the current TSF of somewhere around 67,127,545. At this point the Next TBTT timer would just keep continously firing. And this would keep re-gating the TX queue, allowing it to just plain keep bursting. That explains why everything was going crazy during traffic.

This again was another example of the code assuming it was an AR5212 era NIC. The nexttbtt value was being trimmed to be between 0 and 65,535 TU. After I fixed that and fixed up the math a bit, nexttbtt was being correctly programmed and suddenly everything started working. And quite well.

So, now the basics are working. I'll audit the math to ensure everything wraps consistently at the 32-bit TSF boundary (ie, 4 billion microseconds, give or take) as that doesn't take too long to occur. But the 11n chips now behave the same as the 11a chips do when doing TDMA.

So what's next?
  • The "tx time" calculation needs to be aware of the 11n rate configuration, so it can calculate the guard time correctly. Right now it uses the non-11n aware rate -> duration HAL function;
  • The TX path has to be rejiggled a bit to ensure _all_ traffic gets stuffed into one TX queue (well, besides beacons.) Management and higher priority traffic has to do this too. If not, then multiple TX queues can burst and they'll burst separately, blowing out the TX slot timing;
  • Someone needs to get 11n adhoc working, so that 11n rates are negotiated during adhoc peer establishment. Then aggregation can just magically work at that point (the TDMA code reuses a lot of adhoc mode vap behaviour code);
  • 802.11e / 802.11n delayed block-ACK support needs to be implemented;
  • Then when doing TDMA, we can just burst out an aggregate or two inside the given slot time, then wait for a delayed block ACK to come back from the remote peer in the next slot time! Yes, I'd like to try and reuse the standard stuff for doing delayed block-ack rather than implementing something specific for 802.11n aggregation + TDMA.
  • .. and yes, it'd be nice for this to support >2 slave terminals, but that's a bigger project.
Right now I think I'll tackle #1 and then make sure the 11n NICs can be configured in a static MCS rate, without aggregation. The rest will have to be up to someone else in the community. My plate is full.

So, TDMA on the 802.11n NICs is now working. Go forth and hack!

Tuesday, November 20, 2012

Making the AR5210 NIC work in the office..

I'm quite happy that FreeBSD's ath(4) driver supports almost all of the PCI and PCIe devices that Atheros has made. Once I find a way to open source this AR9380 HAL I've constructed, we'll actually support them all. However, there are a few little niggling things that have been bugging me. Today I addressed one of those.

The AR5210. It's their first 11a-only NIC. It does up to 54MBit OFDM 802.11a; it doesn't do QoS/WME (as it only has one data queue); it "may" go up to 72MBit if I hack on some magic extensions. And in open mode, it works great.

But it didn't work in the office or at home. All of which are 802.11n APs with WPA2 authentication and AES-CCMP encryption.

Now, the AR5210 only does open and WEP encryption. It doesn't do TKIP or AES-CCMP. So the encryption has to happen in software. The NIC was associating fine, but when wpa_supplicant went to program in the AES-CCMP encryption keys, the HAL simply refused.

What I discovered was this.

The driver keycache was also trying to allocate keycache slots for the AR5210, where it only supports the 4 WEP keys.  This is a big no-no. So once I mapped them to all be slot 0, I made a little progress.

The net80211 layer was trying to program in an AES-CCMP key, which the driver was dutifully passing to the HAL. The AR5210 HAL doesn't support anything but WEP or open, so the encryption key type was "clear". Now, "clear" means "for this MAC address, don't try decrypting anything." But the AR5210 HAL code rejected it - as I said, it doesn't do that.

Ok, so I ignored that entirely. I mapped all of the software encrypted key entries to slot 0 and just didn't program the hardware. So now the HAL didn't reject things. But it wasn't working. The received frames were being corrupted somehow and failed the CCMP MIC integrity check. I took at look at the frames being received (which should've been "clear" versus what was going on in the air - luckily, this laptop has an AR9280 inside so I could put it into monitor mode and sniff things. The packets just didn't add up. I was confused.

Then after discussing this with my flatmate, I idly wondered if the hardware was decrypting the traffic anyway. And, well, it was. Encrypted frames have the WEP bit set in the 802.11 header - whether they're WEP, TKIP, AES-CCMP. The AR5210 didn't know it wasn't WEP, so it tried decoding the frames itself. And corrupting them.

So after finding a PCU control register (hi AR_DIAG_SW) that lets me disable encryption/decryption, I was able to pass through the encrypted traffic fine and everything just plain worked. It's odd seeing an 11a, non-QoS station on my 11n AP, but that just goes to show that backwards interoperability is still useful.

And yes, I did take the AR5210 into the office and I did sit in a meeting with it and use it to work from. It let me onto the corporate wireless just fine, thankyou.

So now the FreeBSD AR5210 support doesn't do any hardware encryption. You can turn it on again if you'd like. Why? Because I don't want the headache of someone coming to me and asking why a dual-VAP AP with WEP and CCMP is failing. The hardware can only do _either_ WEP/open with hardware encryption, _or_ it can do everything without hardware encryption. So I decided to just disable it for now.

There's also a problem with how encryption is specified to net80211. It's done at startup time, when the driver attaches. Anything that isn't specified as being done in hardware is done in software. There is currently no clean way to dynamically change that configuration. So, if I have WEP encryption in hardware but CCMP/TKIP in software, I have to dynamically flip on/off the hardware encryption _AND_ I have to enforce that WEP and CCMP doesn't get configured at the same time.

The cleaner solution would be to:
  • Create a new driver attribute, which indicates the hardware can do WEP and CCMP at the same time - make sure it's off for the AR5210;
  • Add a HAL call to enable/disable hardware encryption;
  • If a user wants to do WEP or open - enable hardware encryption;
  • If a user wants to do CCMP/TKIP/etc - disable hardware encryption;
  • Complain if the user wants to create a VAP with CCMP/TKIP and WEP.
 If someone wants a mini-project - and they have an AR5210 - I'm all for it. But at the moment, this'll just have to do.

Thursday, October 4, 2012

Power save, CABQ, multicast frames, EAPOL frames and sequence numbers (or why does my Macbook Pro keep disassociating?)

I do lots of traffic tests when I commit changes to the FreeBSD Atheros HAL or driver. And I hadn't noticed any problems until very recently (when I was doing filtered frames work.) I noticed that my macbook pro would keep disassociating after a while. I had no idea why - it would happen with or without any iperf traffic. Very, very odd.

So I went digging into it a bit further (and it took quite a few evenings to narrow down the cause.) Here's the story.

Firstly - hostapd kept kicking off my station. Ok, so I had to figure out why. It turns out that the group rekey would occasionally fail. When it's time to do a group rekey, hostapd will send a unicast EAPOL frame to each associated station with the new key and each station must send back another EAPOL frame, acknowledging the fact. This wasn't happening so hostapd would just disconnect my laptop.

Ok, so then I went digging to see why. After adding lots of debugging code I found that the EAPOL frames were actually making to my Macbook Pro _AND_ it was ACKing them at the 802.11 layer. Ok, so the frame made it out there. But why the hell was it being dropped?

Now that I knew it was making it to the end node, I could eliminate a bunch of possibilities. What was left:


  • Sequence number is out of order;
  • CCMP IV replay counter is out of order;
  • Invalid/garbled EAPOL frame contents.
I quickly ruled out the EAPOL frame contents. The sequence number and CCMP IV were allocated correctly and in order (and never out of sequence from each other.) Good. So what was going on?

Then I realised - ok, all the traffic is in TID 16 (the non-QoS TID.) That means it isn't a QoS frame but it still has a sequence number; so it is allocated one from TID 16. There's only one CCMP IV number for a transmitter (the receiver tracks a per-TID CCMP IV replay counter, but the transmitter only has one global counter.) So that immediately rings alarm bells - what if the CCMP IV sequence number isn't being allocated in a correctly locked fashion?

Ok. So I should really fix that bug. Actually, let me go and file a bug right now. There.

There. Bug filed. PR 172338.

Now, why didn't this occur back in Perth? Why is it occuring here? Why doesn't it occur under high throughput iperf (150Mbps+) but it is when the iperf tests are capped at 100Mbps ethernet speeds? Why doesn't it drop my FreeBSD STAs?

Right. So what else is in TID 16? Guess what I found ? All the multicast and broadcast traffic like ARPs are in TID 16.

Then I discovered what was really going on. The pieces fell into place.

  • My mac does go in and out of powersave - especially when it does a background scan.
  • When the mac is doing 150Mbps+ of test traffic, it doesn't do background scans.
  • When it's doing 100Mbps of traffic, the stack sneaks in a background scan here and there.
  • Whenever it goes into background scan, it sends a "power save" to the AP..
  • .. and the AP puts all multicast traffic into the CABQ instead of sending it to the destination hardware queue.
  • Now, when this occured, the EAPOL frames would go into the software queue for TID 16 and the ARP/multicast/etc traffic would go into the CABQ
  • .. but the CABQ has higher priority, so it'll be transmitted just after the beacon frame goes out, before the EAPOL frames in the software queue.
Now, given the above set of conditions, the ARP/multicast traffic (which there's more of in my new place, thanks to a DSL modem that constantly scans the local DHCP range for rogue/disconnected devices) would be assigned sequence numbers AFTER the EAPOL frames that went out but are sitting in the TID 16 software queue. The Mac would receive those CABQ frames with later sequence numbers, THEN my EAPOL frame. Which would be rejected for being out of sequence.

The solution? Complicated.

The temporarily solution? TID 16 traffic is now in a higher priority hardware queue, so it goes out first. Yes, I should mark EAPOL frames that way. I'll go through and tidy this up soon. I just needed to fix this problem before others started reporting the instability.

The real solution is complicated. It's complicated because in power save mode, there's both unicast and multicast traffic going into the same TID(s) but different hardware queues. Given this, it's quite possible that the traffic in the CABQ will burst out before the unicast packets with the same TID make it out via another hardware queue.

I'm still thinking of the best way to fix this.

Lessons learnt from fiddling with the rate control code..

(Note before I begin: a lot of these ideas about rate control are stuff I came up with before I began working at my current employer.)

Once I had implemented filtered frames and did a little digging, I found that the rate control code was doing some relatively silly things. Lots of rates were failing quite quickly and the rate control was bouncing all over the place.

The first bug I found was that I was checking the TX descriptor completion before I had copied it over - and so I was randomly failing TX when it didn't fail. Oops.

Next, don't call the rate control code on filtered frames. They've been filtered, not transmitted. My code wasn't doing that - I'm just pointing it out to anyone else who is implementing this.

Then I looked at what was going on with rate control. I noticed that whenever the higher transmission rates failed, it took a long time for the rate control code to try and sample them again. I went and did some digging - and found it was due to a coding decision I had made about 18 months ago. I treated higher rate failures with a low EWMA success rate as successive failures. The ath_rate_sample code treats "successive failures" as "don't try to probe this for ten seconds." Now, there's a few things you need to know about 802.11n:


  • The higher rates fail, often;
  • The channel state changes, often;
  • Don't be afraid to occasionally try those higher rates; it may actually work out better for you even under higher error rates.
So given that, I modified the rate control code a bit:

  • Only randomly sample a few rates lower than the current one; don't try sampling all 6, 14 or 22 rates below the high MCS rates;
  • Don't treat low EWMA as "successive failures"; just let the rate control code oscillate a bit;
  • Drop the EWMA decay time a bit to let the oscillation swing a little more.
Now the rate control code behaves much better and recovers much quicker during unstable channel conditions (eg - adrian walking around a house whilst doing iperf tests.)

Given this, what could I do better? I decided to start reading up on what the current state of play with 802.11n aware rate control and rapidly came to the conclusion that - wow, we likely could do it better. The Linux minstrel_ht algorithm is also based on John Bickett's sample rate code, but instead of using a EWMA and minimising packet transmission time, it uses the EWMA to calculate a theoretical throughput and maximises that. So, that sounds good, right?

Except that the research shows that 802.11n channels can vary very frequently and very often, especially at the higher MCS rates. The higher MCS rates can become better and worse within a window of a second or two. So, do you want to try and squeeze the last of throughput out of that, or not?

Secondly, using "throughput" as a metric is fine if your air time is .. well, cheap. But what if you have many, many clients on an AP? Your choice of maximising throughput based on what the error rate predicts your data throughput is doesn't take airtime into account. In fact, if you choose a higher MCS rate with a higher error rate but higher throughput, you may actually be wasting more air with those retransmissions. Great for a single station, but perhaps not so great when you have lots.

So what's the solution? The open source rate control stuff doesn't take the idea of "air utilisation" into account. There's enough data available to create an air time model, but no-one is using it yet. Patches are gratefully accepted. :-)

Finally, the current packet scheduler is pretty simple and stupid (and does break in a lot of scenarios, sigh.) It's just a FIFO, servicing nodes/TIDs with traffic in said FIFO mechanism. But that's not very fair - both from a "who is next" standpoint and "what's the most efficient use of the air" view. In addition, the decision about which node/TID to schedule next is done totally separate to the rate control decision. Rate control occurs rather late in the packet transmission process (ie, once we've committed to queuing it to the hardware.) Wouldn't it be better to have the packet scheduler and rate control code joined at the hip, where the scheduler doesn't aggressively schedule traffic to a slow/lossy end node?

Lots of things to think about..

Wednesday, October 3, 2012

Filtered frames support, or how not to spam the air with useless transmission attempts

I haven't updated this blog in a while - but that doesn't mean I haven't been a busy little beaver in the world of FreeBSD Atheros support.

A small corner part of correct and efficient software retransmission handling is what Atheros hardware calls "filtered frames", or what actually happens - "how to make sure the hardware doesn't just spam the air with transmission attempts to a dead remote node."

So here's the run-down.

You feed the Atheros hardware a set of frames to transmit. There's a linked list (or FIFO for the more recent hardware) of TX descriptors which represent the fragments of each frame you're transmitting. The hardware will attempt each one in turn and then return a TX completion status explaining what happened. For frames without ACKs the TX status is "did I get a chance to squeeze this out into the air" - there's no ACK, so there's no way to know if it made it out there. But for frames with ACKs, there's a response from the remote end which the transmitter uses, and it'll attempt retransmission in hardware before either succeeding or giving up. Either way, it tells you what happened.

The reality is a little more complicated - there's multiple TX queues with varying TX priorities (implementing the 802.11 "WME" QoS mechanism. There's all kinds of stuff going on behind the scenes to figure out which queue wins arbitration, gets access to the air to transmit, etc, etc. For this particular discussion it doesn't matter.

Now, say you then queue 10 frames to a remote node. The hardware will walk the TX queue (or queues) and transmit those frames one at a time. Some may fail, some may not. If you don't care about software retransmission then you're fine.

But say you want to handle software retransmission. Say that retransmission is for legacy, non-aggregation / non-blockack sessions. You transmit one frame at a time. Each traffic identifier (TID) has its own sequence number space, as well as the "Non-QoS" traffic identifier (ie, non-QoS traffic that does have a sequence number.) By design, each frame is transmitted in order, with incrementing sequence numbers. The receiver throws away frames that are out of sequence. That way packets are delivered in order. They may not be reliably received, but the sequence number ordering is enforced.

So, you now want to handle software retransmission of some frames. You get some frames that are transmitted and some frames that weren't. Can you retransmit the failed ones? Well, the answer is "sure", but there are implications. Specifically, those frames may now be out of sequence, so when you retransmit them the receiver will simply drop them. You could choose to reassign them new sequence numbers so the receiver doesn't reject them out of hand, but now the receiver is seeing out-of-order frames. It doesn't see out of sequence frames, but the underlying payloads are all jumbled. This makes various protocols (like TCP) very unhappy. If you're running some older PPTP VPN sessions, the end point may simply drop your now out-of-order frames. So it's very important that you actually maintain the order of frames to a station.

Given that, how can you actually handle software retransmission? The key lies in this idea of "filtered frames." The Atheros MAC has what it calls a "keycache", where it stuffs encryption keys for each destination node (and multicast keys, and WEP keys..) So upon reception of a frame, it'll look in the keycache for that particular station MAC, then attempt to decrypt the data with that key. Transmission is similar - there's keycache entries for each destination station and the TX descriptor has a TX Key ID.

Now, I don't know if the filtered frame bit is stored in the keycache entry or elsewhere (I should check, honest) but I'm pretty sure it is.

So the MAC has a bit for each station in the keycache (I think!) and when a TX fails, it sets that bit. That bit controls whether any further TX attempts to that station will actually occur. So when the first frame TX to that station fails, any subsequent attempts are automatically failed by the MAC, with the TX status "TX destination filtered." Thus, anything already in the hardware TX queue(s) for that destination get automatically filtered for you. The software can then grab those frames in the order you tried them (which is in sequence number order, right?) and since _all_ frames are filtered, you don't have to worry about some intermediary frames being transmitted. You just queue them in a software queue, wait until the node is alive again, and then set the "clear destination mask (CLRDMASK)" bit in the next TX attempt.

This gives you three main benefits:
  • All the frames are filtered at the point the first fails, so you get all the subsequent attempted frames back, in the order you queued them. This makes life much easier when retransmitting them;
  • The MAC now doesn't waste lots of time trying to transmit to destinations that aren't available. So if you have a bunch of UDP traffic being pushed to a dead or far away node, the airtime won't be spent trying to transmit all those frames. The first failure will filter the rest, freeing up the TX queue (and air) to transmit frames to other destinations;
  • When stations go into power save mode, you may have frames already in the hardware queue for said station. You can't cancel them (easily/cleanly/fast), so instead they'll just fail to transmit (as the receiver is asleep.) Now you just get them filtered; you store them away until the station wakes up and then you retransmit them. A little extra latency (which is ok for some things and not others!) but much, much lower packet loss.
This is all nice and easy. However, there are a few gotchas here.

Firstly, it filters all frames to that destination. For all TIDs, QoS or otherwise. That's not a huge deal; if however you're me and you have per-TID queues you need to requeue the frames into the correct queues to retry. No biggie.

Secondly, if a station is just far away or under interference, you'll end up filtering a lot of traffic to it. So a lot of frames will cycle through the filtered frames handling code. Right now in FreeBSD I'm treating them the same as normal software retransmissions and dropping them after 10 attempts. I have a feeling I need to fix that logic a bit as under heavy dropping conditions, the traffic is being prematurely filtered and prematurely dropped (especially when the node is going off-channel to do things like background scans.) So the retransmission and frame expiry is tricky. You can't just keep trying them forever as you'll just end up wasting TX air time and CPU time transmitting lots of frames that will just end up being immediately filtered. Yes, tricky.

Thirdly, there may be traffic that needs to go out to that node that can't be filtered. If that's the case, you may actually end up with some subsequent frames being transmitted. You then can't just naively requeue all of your failed/filtered frames or you'll transmit some frames out of sequence. You then _have_ to drop anything with a sequence number or that was queued _before_ the successfully transmitted frame.

Luckily for 802.11n aggregation sessions the third point isn't a big deal - you already can transmit out of sequence (just within the block-ack window or BAW), so you can just retransmit filtered frames with sequence numbers in any particular sequence. Life is good there.

So what have I done in FreeBSD?

I have filtered frames support working for 802.11n aggregation sessions. I haven't yet implemented software retransmission (and all of the hairy logic to handle point #3 above) for non-aggregate traffic, so I can't do filtered frames there. But I'm going to have to do it in order to nicely support AP power save operation.

Friday, July 20, 2012

Reading rate control information from userland..

I've begun working on exporting the ath(4) driver rate control information. The "current" way to extract this information has been to issue a sysctl to the rate control module, which dumps debugging state to the kernel console. This is quite suboptimal. Instead, I'm now packaging that up via a driver ioctl() call. It's temporary (hah, famous last words in the software industry :) and it's driver specific for now. The thirty second summary:

  • There's a new ioctl to ath(4) to query the rate control module for a single associated MAC address (or the BSS MAC when running as a STA);
  • Since the rate control is currently done at the driver level rather than at the VAP level, the call is to the driver rather than via the VAP (wlanX) interface;
  • There's no easy way to get "all" station details whilst maintaining correct locking.

The last point deserves a little more explanation. I've introduced (well, _using_ now) a per-node lock when doing rate control updates. I acquire this lock when copying the rate control data out, so the snapshot is consistent.

So to fetch the state for a node, the following occurs:


  • Call the net80211 layer to find an ieee80211_node for the given mac address - that involves locking the node table and getting a reference for the node (if found); 
  • Then locking the ath_node associated with it;
  • Copy the data out;
  • Unlock the ath_node;
  • Decrement the ieee80211_node reference counter (which requires the node table lock.)


Now, the node table lock only occurs whilst fetching the node reference. It isn't held whilst doing the actual rate control manipulation. Compare to what I'd do if I wanted to walk the node table. The net80211 API for doing this holds the node lock whilst waking the node list. This means that I'll end up holding the node table lock whilst acquiring the ath_node lock. Now, that's fine - however, if I then decide somewhere else to try and do any ieee80211 operation whilst holding the ath_node lock, I may find myself with a lock ordering problem.

So for now the API will just support doing a single lookup for a given MAC, rather than trying to pull all of the rate control table entries down at once.

Here's an example output from the command:


adrian@marilyn:~/work/freebsd/ath/head/src/tools/tools/ath/athratestats]> ./athratestats -i ath1 -m 06:16:16:03:40:d0
static_rix (-1) ratemask 0xf
[ 250] cur rate 5  Mb since switch: packets 1 ticks 43028655
[ 250] last sample (11  Mb) cur sample (0 ) packets sent 10708
[ 250] packets since sample 16 sample tt 6275
[1600] cur rate 11  Mb since switch: packets 15 ticks 43025720
[1600] last sample (5  Mb) cur sample (0 ) packets sent 2423
[1600] packets since sample 7 sample tt 12713
[ 2  Mb: 250]        9:9        (100%) (EWMA 100.0%) T       11 F    0 avg  2803 last 42176930
[ 5  Mb: 250]     3139:3139     (100%) (EWMA 100.0%) T     3273 F    0 avg  1433 last 43028656
[ 5  Mb:1600]       29:29       (100%) (EWMA 100.0%) T       39 F    0 avg  5303 last 42192044
[11  Mb: 250]     7560:7560     (100%) (EWMA 100.0%) T     7838 F    0 avg  1857 last 43026094
[11  Mb:1600]     2394:2394     (100%) (EWMA 100.0%) T     2581 F    0 avg  2919 last 43026411

Friday, June 15, 2012

Don't let anyone tell you that FreeBSD doesn't "do" 802.11n:

This is from my FreeBSD-HEAD 802.11n access point, currently doing ~ 130MBit/s TCP:


# athstats -i ath0
41838297     data frames received
31028383     data frames transmit
78260        short on-chip tx retries
3672         long on-chip tx retries
197          tx failed 'cuz too many retries
MCS13        current transmit rate
8834         tx failed 'cuz destination filtered
477          tx frames with no ack marked
239517       rx failed 'cuz of bad CRC
10           rx failed 'cuz of PHY err
    10           OFDM restart
42043        beacons transmitted
143          periodic calibrations
-0/+0        TDMA slot adjust (usecs, smoothed)
45           rssi of last ack
51           avg recv rssi
-96          rx noise floor
812          tx frames through raw api
41664029     A-MPDU sub-frames received
42075948     Half-GI frames received
42075981     40MHz frames received
13191        CRC errors for non-last A-MPDU subframes
129          CRC errors for last subframe in an A-MPDU
2645042      Frames transmitted with HT Protection
351457       Number of frames retransmitted in software
23299        Number of frames exceeding software retry
30674735     A-MPDU sub-frame TX attempt success
374408       A-MPDU sub-frame TX attempt failures
8676         A-MPDU TX frame failures
443          listen time
6435         cumulative OFDM phy error count
161          ANI forced listen time to zero
3672         missing ACK's
78260        RTS without CTS
1469003      successful RTS
239605       bad FCS
2            average rssi (beacons only)
Antenna profile:
[0] tx  1466665 rx        1
[1] tx        0 rx 41838296

Monday, June 11, 2012

A tale of two sequence numbers, or "when QoS seqno and CCMP PN don't match up"..

Many moons ago (say, 3 or 4 weeks - so hm, most-of-a-moon-ago actually) I found a rather curious failure condition in the ath(4) TX aggregation path. The colourful history is documented in FreeBSD kern/166190. In short - there are situations where sequence numbers were allocated in a different order to how frames were being added to the block-ack window tracking, and if you got unlucky, you'd cause the stack to think a frame was (far) outside the BAW.

The 30 second explanation:

Imagine you allocated four frames - sequence numbers 1, 2, 3 and 4. They have to be added to the block-ack window in precisely that order. Ie:

  1. Starting condition: Window is at 0:63 (64 frame window, starting at 0, so ending at 63)
  2. Add 1: Window is now at 0:63, starting at 1
  3. Add 2: Window is now at 0:63, starting at 2
  4. Add 3: Window is now at 0:63, starting at 3.
The reason the window pointer isn't moving along is because although you've sent the frames (or you're about to), you can't advance it until the other end has ACKed it (via a block-ack or a normal ACK.) For more information, google how 802.11n aggregation works.

The important bit here is that the window is still 0:63 and the starting point is now '3'. This continues all the way to trying to queue frame 64, where it will be outside of the current BAW and not be allowed to be transmitted. It'll sit in the software queue and wait until frame '0' has been ACKed and the BAW has been advanced to be 1:64 - at which point 63 will fall inside the window and will be transmitted.

So yes, the sender is tracking two things - the BAW and what the starting point is that they've added to the BAW.

Now, imagine instead of (1, 2, 3, 4) on the software queue, I somehow get preempted (or race between two sending threads, when using SMP) between 'allocated seqno' and 'queue to software queue'. In the existing code, a lock was held when:
  • Allocating a sequence number, then it was dropped; then
  • Adding it to the software queue.
Now because there was a period where no lock was being held, it's quite possible that what ends up on the software queue is (2, 1, 3, 4.) So:
  1. Starting condition: Window is at 0:63
  2. Add 2: Window is now 0:63, starting at 2.
  3. Add 1: Window is 0:63, starting at 2; 1 is outside of the BAW (it's treated as a 'wraparound', so imagine it's 4095 seqno's away) so TX stalls.
This was the cause of the TX stalls that I was seeing originally in kern/166190. I "fixed" it by only allocating sequence numbers when the frame was about to be transmitted for the first time, and then adding it to the BAW right there. Since both sequence number allocation and adding to the BAW happened inside the same lock, everything was sweet.

Except, I totally forgot about CCMP PN. So under high enough UDP TX loads (say, > 200MBit), I'd hit the same race, but between 802.11 sequence numbers and CCMP PN sequence numbers.

CCMP PN is assigned during 802.11 encapsulation time, in the driver. In the ath(4) case, it's done during transmit and before being queued to the software queue. And it was being done outside of any locking.  So it's very possible that frames would end up on the software queue with 802.11 and CCMP PN sequence numbers out of lock-step.

What would happen?

Simply - after the 802.11n reordering occured on the receive side, the CCMP PN replay detector would notice sequence numbers out of order, and start tossing said out of order frames. Lots of packet loss ensued.

So, I sat down and started trying to address it. The simplest thing - wrap the whole encapsulation path between ieee80211_crypto_encap(), 802.11 sequence number assignment and software/hardware queueing behind the TID (well, hardware TXQ) lock. It took some time; I had to revert two earlier commits which introduced the delayed sequence number allocation.

This didn't fix things. So I was back to square one.

I started looking at all the places where the frames were being queued to the software queue and .. well, let's just say I spent Sunday swearing _at myself_ for all the weird and wonderful stupid mistakes I had made when writing/porting this code over.

The short version follows (the long version is "read the sys/dev/ath/ commit logs and the PR history"):
  1. When I was queueing frames to the software queue, I'd check how deep the hardware queue was. If the hardware queue was shallow/empty, I'd direct dispatch up to two frames to the hardware to get things 'busy'. That will (hopefully) let further frames come along in the meantime and be aggregated. However, I was queueing the new frame to the hardware rather than queueing the new frame to the tail of the queue, and queueing the head frame of the queue to the hardware. That led to some out of order behaviour.
  2. ath_tx_xmit_aggr() would check if the sequence number was within the block-ack window and if it wasn't, it'd queue the frame to the tail of the queue. This meant that any new frames that came along would be queued to the end of the queue, even if they had been dequeued from the head of the queue. This lead to frames on the software queue being out of order.
    1. Frames on the software queue don't have to be in-sequence (as retries are prepended to the beginning of the list, and new frames are appended to the end) however they have to be in-order. If they end up being out of order, the BAW logic fails.
So, now that I allocate sequence numbers at packet queue time, I have to be triply sure that what ends up on the software queue is correctly in order, or the BAW logic will cause traffic stalls and potentially duplicate sequence number issues. Yes, this means that the old behaviour, whilst it now works right with all the right locking, requires me to correctly handle putting frames on the software queue. (Or, as I like to say, "keeping the bastard (me) honest.")

TL;DR - 802.11n aggregation works again. Now, to fix those pesky "queue full and I want to send a BAR frame so I can unblock the full queue and transmit" problems. At least that one is more tractable and easier to solve. Or is it.


Wednesday, June 6, 2012

FreeBSD, Netflix, CDN

The big news this week is the Netflix Openconnect platform, which was just announced. It uses off the shelf hardware - and FreeBSD + nginx.


The question is how you could spin it.


You could say "Netflix chose FreeBSD because they can keep their changes proprietary." Sure, they could. But they're not making appliances that they're selling - they're owning the infrastructure and servers. It's unclear whether they'd have to contribute back any Linux changes if they ran Linux on their open connect platform. They're making a conscious, public decision to distribute their changes back to FreeBSD - even though they don't have to.


You could say "Netflix chose FreeBSD because the people inside the company knew FreeBSD." Sure, they may have. The same thing could be said about why start-ups and tech companies choose Linux. A lot of the time its because they're chasing enterprise support from Redhat. But technology startups using Ubuntu or Debian tend not to be paying support fees - they hire smart people who know the technology. So, yes - "using what they know."


According to the Netflix Openconnect website:


"This was selected for its balance of stability and features, a strong development community and staff expertise. We will contribute changes we make as part of our project to the community through the FreeBSD committers on our team."



Let's pull this apart a little.

  1. "Balance of stability and features." FreeBSD has long been derided for how slowly it moves in some areas. The FreeBSD developers tend to be a conservative bunch, trying to find the balance between new feature development and maintaining both stability and backwards compatibility.
  2. "Strong community." FreeBSD has a strong technical development community and Netflix finds this very important. They're also willing to join and participate in the community like many other companies do.
  3. "Staff expertise." So yes, their staff are familiar with FreeBSD. They're also familiar with Linux. They chose a platform which they have the expertise to develop, use and improve. They didn't just choose an unfamiliar platform because of marketing brochures or sales promises. I don't see any negatives here. I'm sure that Google engineers chose Linux to begin with because they were familiar with Linux.
  4. "Contribute changes we make as part of our project to the community." Netflix  has committed to push improvements and fixes back to the upstream project They contributed some bug fixes in the 10GE Intel driver and IPv6 stack this week. This is collaborative open source working the way it should.
Why would Netflix push back changes and improvements into a project when they're not required to? That's something you should likely ask them. But the same good practice arguments hold for both Linux AND BSD projects:
  • The project is a constantly moving target. If you don't push your changes back upstream, you risk carrying around increasingly larger changes as your project and your BSD upstream project diverge. This will just make things more difficult in the long run.
  • By pushing your changes upstream, you make it easier to move with the project - including adopting improvements and new features. If you keep large changes to yourself, you will likely find it increasingly difficult to update your software to the newer upstream versions. And that upstream project is likely adding bug fixes, improvements and new features - which at some point you may wish to leverage. By pushing your changes upstream, you make it a lot easier to move to future versions of the upstream project, allowing you to leverage all those fixes and improvements without too much engineering time.
  • By participating, you encourage others to adopt your technology. By pushing your changes and improvements upstream, you decrease the amount of software you have to maintain yourself (and keep patching as the upstream project moves along.) But you also start to foster technology adoption. The FreeBSD jail project started out of the desire by a hosting company to support virtualisation. Since then, the Jail infrastructure has been adopted by many other companies and individuals.
  • When others use your technology, they also find and fix bugs in your technology; they may even improve it. The FreeBSD jail support has been extended to include IPv6 support, shared memory support and integrates into the VIMAGE (virtualised networking) stack (which, by the way, came from Ironport/Cisco.) As a company, you may find that the community will do quite a lot of the work that you would normally have to hire engineers to do yourself. This saves time and saves money.
  • When companies contribute upstream, it encourages other companies to also contribute upstream. A common issue is "reinventing the wheel", where companies end up having to reinvent the same technology privately because no-one has contributed it upstream. They solve the same problems, they implement the same new features .. and they all spend engineering time and resources to do so.
  • And when companies contribute upstream, it encourages (private) developers to contribute. Open source developers love to see their code out there in the wild, in places they never quite thought of. It's encouraging to see companies build products with their code and contribute back bug fixes and improvements. It fosters a sense of community and participation, of "give and take", rather than just "take". This is exactly the kind of thing that keeps developers coming back to contribute more - and it attracts new developers. Honestly, who wouldn't want to say that some popular device is running code that they wrote in their spare time?
So, you could rant and rave about the conspiracy side of Linux versus FreeBSD. You could rant on about GPL versus BSD. Or, you could see the more useful side of things. You could see a large company who didn't have to participate at all, agreeing to contribute back their improvements to an open source operating system. You could see that by doing so, the entire open source ecosystem benefits - not just FreeBSD. There's nothing stopping Linux or other BSD projects from keeping an eye on the improvements made by Netflix and incorporating those improvements into their own project. And it's another case of a company participating and engaging the open source community - and having that community engage them right back.

Good show, Netflix. Good show.

Wednesday, May 23, 2012

Fixing BAR handling and handle corner cases of things..

Part of the general 802.11n requirements is correctly handling TX aggregation and failures - where we need to pause the TX queue, send a BAR to forcibly move the remote end block-ack window to be in sync with the transmitters idea, and then continue sending frames.

This exposed a very annoying problem - what if the driver runs out of ath_buf entries to schedule TX frames? Or, what if the network stack runs out of mbufs? If we need to allocate an ath_buf/mbuf to send a BAR frame, but they're all allocated and unavailable, the driver/wireless stack will come to a grinding halt. Typically these allocated ath_buf's are allocated in the software queue, waiting for the BAR TX (or power-save wakeup) to send a frame.

So, I haven't fixed this. It's on my (very) short term to-do list. But it did expose some issues in how the net80211 BAR send code (ieee80211_send_bar()) works. In short - it didn't handle resource allocation failures at all. It worked fine if the driver send method (ic->ic_raw_xmit()) succeeded and just failed to TX the frame. But if it couldn't allocate an mbuf, or if the driver send method failed.. things just stopped. And when the BAR TX just stopped, the ath(4) software TX queue would just keep buffering frames, right until all the TX ath_buf entries were consumed.

This is obviously .. sub-optimal.

But this raises an interesting point - how much of your kernel and/or userland application handle resource shortages correctly? I've seen plenty of userland software just not check whether malloc() returned NULL and I've seen some that specifically terminate (non-gracefully) if malloc()/calloc() fails - Squid does this. But what about your network stack? How's it handle mbuf shortages? What about the driver stack? What about net80211 (ew) ? What if the kernel malloc() API has to sleep because there's no free memory available?

I don't (currently) have an answer - it's a difficult, cross-discipline problem. What I -can- do though (at least in my corner of the FreeBSD world - net80211 and ath(4)) is to start testing some of these corner cases, where I  force some resource shortages and ensure that the wireless stack and driver(s) recover somewhat gracefully. 802.11n is very unforgiving if you start dropping frames involved in an active aggregation session. So it's best I try and address these sooner rather than later.

FreeBSD/arm -

I've just found out that FreeBSD/armv6 now runs on this:

http://www.embeddedartists.com/products/kits/lpc3250_kit.php


This is excellent news. There's also been some recent work to improve pandaboard support (TI-OMAP) focusing on pmap and SMP fixes.

I'm so very glad that the FreeBSD community is pushing this ARM project along. Yes, the armv6 branch is not very well named (it supports more recent ARM platforms, not just armv6 improvements) and I'm hoping this will all make it into FreeBSD-HEAD soon.

Sunday, April 8, 2012

And the winner of the most committing committer to src/sys over the last 12 months is ..

.. well, it was me. Up until last week. marius@ snuck in to take my place. I guess all of those commits I did about this time last year to start bringing FreeBSD's ath(4) support up to scratch are finally expiring out of the 12 month window.

(Source: http://people.freebsd.org/~peter/commits.html)

.. but I wouldn't call myself the most important committer. Or the most active. What I'd call myself is the "most active fixing a sorely needed corner of the codebase."

What I _could_ have done is simply do all my work in a branch and then merge it back into -HEAD when I was done. And, for about 6 months, this is what I did. The "if_ath_tx" branch is where I did most of the initial TX aggregation work.

But as time goes on, your branch diverges more and more from the master branch (-HEAD in FreeBSD) and you are faced with some uncomfortable decisions.

If you stay on the same branch point and never merge in anything from your master branch, you _may_ have a stable snapshot of code, but who knows how stable (or relevant) your work will be when you merge it back into master.

You have no idea if your work will break anything in master and you have no idea if changes in master have broken your work.

As time goes on, the delta between your branch point and the master branch increases, making it even more difficult to do that final merge back. It also has the side effect of making it increasingly likely that problems will occur with the merge (your code breaking master, master breaking your code, etc.)

So as uncomfortable as it was - and as much as I wanted things to stay stable - I did press through with relatively frequent merging. This means:
  • I would pick specific development targets to work towards, at which point I'd stop developing and go into a code review/tidyup/testing phase;
  • I'd do frequent merges from master back into my branch during active development - I wouldn't leave this until I was ready to merge my work back into master;
  • Once I reached my development target and had done sufficient testing - including integrating changes from master back into my branch - I'd kick off a semi-formal review (read: email freebsd-wireless@) and call for testers/review;
  • Only _then_ would I merge what was suitable back into master.

I wouldn't merge everything from my branch into master. In my instance, there were some debugging extensions that were easy to maintain (read: lots of device_printf() calls) but weren't suitable for FreeBSD-HEAD. But I merged the majority of my work each time.

But that doesn't always work. I managed to merge a bunch of ath(4), ath_hal(4) and net80211 fixes back into -HEAD as appropriate. But the TX aggregation code was .. well, rather large. So I attempted to break up my commit into as many small, self-contained functional changes as possible. Yes, there was a big "here's software TX queue and aggregation" as a big commit at the end but I managed to peel off more than 30% of that in the lead-up commits.

Why bother doing that?

Two words - version bisection. Once I started having users report issues, they would report something like "FreeBSD-HEAD revision X worked, revision Y didn't." (If I were lucky, of course.) Or, they'd note that a certain snapshot from a certain day worked, but the next day had a regression. If I had committed everything as one enormous commit after having spent 6 + months on the branch, I'd be in for a whole lot of annoying line-by-line debugging of issues. Instead, I was able to narrow down most of the regressions by trying all the different commits.

Now that 802.11n ath(4) TX aggregation and general 802.11n support is in the tree, I only use branches for larger scale changes that take a couple of weeks. For example, when fixing up the reset path to not drop any TX/RX frames. I do most of the bugfixing in FreeBSD-HEAD. I could do it in a branch and then do monthly merges, but I then have the same problems I've listed above.

In summary: don't underestimate how helpful it is to break down your commits into little, piecemeal, self-contained functional changes. It has the side effect of making you look really good in the committer statistics.

Thursday, April 5, 2012

The initial introduction into "it's the NIC, stupid!"

I've finally hit that dreaded condition which hardware guys hate - where a NIC is behaving badly.

In my case, an IBM/Lenovo Thinkpad T60 has been modified (not by me) to take an Atheros AR9280 NIC. Unfortunately, the NIC was proving to be very unstable when doing 802.11n throughput. The investigations did show I was doing something slightly incorrect with TX descriptors (and I've since fixed that) but the stability issues remained.

The Atheros NICs can expose some host interface error conditions via the AR_INTR_SYNC_CAUSE register. These include PCI(e) transaction timeouts, illegal chip access (eg whilst the MAC is asleep), parity errors, and other rather nice things. FreeBSD's HAL and Linux ath9k does have the register definition for what the bits do - but unfortunately we don't keep statistics.

In my particular case:
  • I'd see AR_INTR_SYNC_LOCAL_TIMEOUT occur. This is because a PCI(e) transaction didn't complete in time. I can tune these timeouts via a local register but that's not the point - I was seeing these errors when receiving only beacons from the access point. That's a bit silly.
  • I'd also see AR_INTR_SYNC_RADM_CPL_DLLP_ABORT, which is an indication that the PCIe layer isn't behaving well.

I swapped it out with another AR9280 based NIC and suddenly all the instabilities have gone away. No TX hangs, no missed TX interrupts. Everything looks great.

So as an open source developer, I want to try and put some tools into the hands of the community to be able to debug what's going on - or, if that's not possible, at least get some indication that things are going wrong. Right now the only thing people see is "I see TX timeouts, it must be the driver/chip fault." There's too much going on to be able to conclude that.

My game plan is this:

  • Implement statistics keeping for each of the SYNC interrupts and expose those via a diagnostic interface. Ben Grear has done something similar for Linux ath9k after a private email discussion. He's also seeing MAC sleep accesses, so it's quite likely we'll start finding/squishing these.
  • Take the offending laptop/NIC to the office and attach it to a very expensive and fancy looking PCIe analyser. I'm hoping we'll find something really silly occuring - like lots of sleep state transitions, or a high number of parity errors.
  • Try documenting this a lot better so users are able to understand what's going on when their NIC is misbehaving.

Sunday, March 18, 2012

Concurrency in the TX path and when it all falls down..

I'm still (yes, still) hacking on the FreeBSD 802.11n ath(4) TX path. I'm trying to find and fix issues that creep up before I can flip on the 11n code by default.

I've become increasingly aware of the lack of locking in net80211, including some of the A-MPDU TX session management code. I know that I'll eventually have to plan out and implement some locking, but for now I'm just trying to squish whatever issues are showstoppers.

A user approached the freebsd-wireless list about two weeks ago and noticed that his 802.11n session was hanging after a bit of use. If he tried 802.11g, things worked fine. He tried SMP - and things worked fine. The symptoms? A number of frames seemed to be "stuck" and sitting in a software TX queue, not being transmitted. But other frames would be TXed fine, so it wasn't as simple as a totally confused TX BAW (Block-ack Window.)

After I managed to reproduce the issue locally, I discovered what was going on.

It was concurrency.

Specifically, that there's multiple places where ath_start() is being called, thus multiple concurrent TX is occuring.

Now, in the non-aggregate method, net80211 is doing all the sequence number assignment. I'm not so sure that in the normal case, the sequence numbers are being allocated in a consistent, sequential way - if the net80211 TX code is able to be called concurrently from multiple threads, sequence numbers can and will be occasionally "raced" and allocated in the reverse order that they're submitted to the driver. But I'm not here to fix that (however I'll eventually have to.)

In the aggregate method, the driver is doing the aggregation and sequence number assignment. For now, the driver is also doing the TX BAW tracking and frame queuing. So imagine this sequence of events:

* a frame is submitted via ath_start();
* since aggregation is enabled, it's allocated a sequence number;
* it's then thrown into the software queue;
* then at some later stage, the software queue is checked, the frame is popped out of the list and if it's inside the BAW, it's added to the BAW and TX;ed
* adding the frame to the TX BAW slides the left hand edge of the BAW to be at that sequence number. Any frames TXed with a seqno _less_ than this will be treated as outside of the BAW and put back onto the software queue for now.

Now, the locking only occurs at:

* the time the frame has the sequence number allocated;
* then when the queue is checked and the frame is popped off.

If two or more threads are allocating sequence numbers and doing work, it's quite possible that thread A will allocate (say) seqno 5, thread B will allocate seqno 6 and then add it to the BAW before thread A can. Then when thread A tries to do some work, it finds the queue has a frame with seqno 5 in it - and since it's before the left hand edge of the BAW (which is at 6, as it was successfully pushed to the hardware), it won't be transmitted and will stay in the software queue until the BAW sequence numbers wrap around to 0 and catch up.

Now, linux ath9k/mac80211 doesn't have this problem. The TX pathway is totally serialised, which means that even if multiple threads are trying to TX, only one thread will be able to enter the TX code at any time. The other threads get blocked.

So how can I solve this? The easy solution would be to serialise FreeBSD's net80211 and ath driver TX code. That way all of this nonsense will go away. For the net80211 side of things this may work - the legacy TX path, where sequence numbers are allocated by the stack, could benefit from serialisation. The throughput isn't ever going to be that great, so we wouldn't really hurt from it. But the trouble is making absolutely sure that the driver also does the same - even if I push the frames into the queue in order and ensure that they have sequential sequence numbers, there's no guarantee that a driver with concurrent entry paths into XXX_start() will de-queue the frames and push them into the hardware in the same order.

So what I've chosen to do instead is to ignore the legacy part for now, and not serialise anything. Instead, I'm doing the sequence number allocation (for aggregation, remember) -at the time I'm about to add the frame to the BAW and TX it to the hardware-. Ie, until the frame actually is able to be added to the BAW, it won't _have_ a sequence number. Since this action is done behind a lock, it's guaranteed to be sequential. The trick here is to only allocate the sequence numbers once it's known for certain that the frame _will_ be going out to the hardware.

For the legacy path, it's also likely worthwhile delaying the sequence number allocation until it's just about to go to the ifnet queue. That way the frames on the ifnet queue have sequence numbers that are in order. Then I need to fix each driver (ugh) to make sure they're dequeued fine.

I've written up the aggregation change and this so far works quite well. I don't want to tackle the legacy path yet or fix other drivers, not until I've verified this works. What I also should do is write some test cases to verify that indeed sequence numbers are being presented to the driver in order, so I can identify this happening in the wild.

Monday, February 13, 2012

.. and the price of packaging up software? Billions.

A friend of mine from Perth recently wrote an article outlining the "cost" of Debian Wheezy:

http://blog.james.rcpt.to/2012/02/13/debian-wheezy-us19-billion-your-price-free/

To quote:

"In my analysis the projected cost of producing Debian Wheezy in February 2012 is US$19,070,177,727 (AU$17.7B, EUR€14.4B, GBP£12.11B), making each package’s upstream source code wrth an average of US$1,112,547.56 (AU$837K) to produce. Impressively, this is all free (of cost)."

Now this has apparently caused a bit of a stir among Linux and IT news sites. It's a large number, right? It's all free, right?

However - Debian for the most part is a package repository. Sure, a lot of effort goes into building and maintaining that - and I think _that_ should be assigned a cost - but I think counting all of those package upstreams as part of Debian is hiding the true nature of software development.

According to Ohloh, the cost of producing FreeBSD, at $72,000 a year per programmer, would be $ 243,777,135, _before_ all of the packages in the FreeBSD ports repository. FreeBSD has over 23,000 packages too, just like Debian - if those were also counted, it may start to push that figure far up from millions to billions.

Does it mean Debian Wheezy is equivalent of $20B of effort? Maybe. But then a tiny (comparatively) more effort and you end up with FreeBSD. Or, with different effort - Redhat. Or Gentoo. Or Mandrake. Or NetBSD. Or OpenBSD. Or MacPorts/Fink, which packages this similar software for MacOS X.

What I guess I'm trying to say is this. You get cool stuff from programmers for free. Including that in your project "cost" just seems silly.