Friday, June 17, 2011

Implementing 11n TX aggregation, why it's not so easy..

The only real missing bit from the initial 11n support for ath(4) in FreeBSD is now TX aggregation. This is also quite a tricky thing to do right, because (like everything related to these Atheros NICs) it's all done in software.

I'm about halfway through the first part - implementing per-node, per-TID software queues. Since A-MPDU sessions are established per-TID (although in net80211 they're linked to WME access categories (ACs), hm!), traffic for each TID has to be individually managed.

The basic premise is quite simple. Frames are queued based on the destination node and TID. Non-QoS frames to a unicast destination (ie, a known station) go to the "non-QoS TID" (TID 16 in FreeBSD's net80211 stack.) Multicast frames go to the software-driven multicast/content-after-beacon (cabq) queue. Then a separate task handles de-queuing packets from these queues and queuing them to the hardware.
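As a rough sketch, the data structures look something like this. The names and layout here are illustrative only - they're not the actual ath(4) structures - but they show the shape of the thing: one set of queues per node, one queue per TID, plus TID 16 for non-QoS traffic as mentioned above.

```c
#include <stddef.h>

#define WME_NUM_TID     16
#define NONQOS_TID      16              /* net80211's non-QoS TID */
#define NUM_TX_TIDS     (WME_NUM_TID + 1)

struct ath_buf {
        struct ath_buf *next;           /* singly-linked software queue */
        int seqno;
};

struct ath_tid {
        struct ath_buf *head, *tail;    /* pending frames for this TID */
        int paused;                     /* pause count; >0 = don't schedule */
};

struct ath_node {
        struct ath_tid tids[NUM_TX_TIDS];
};

/* Append a frame to the software queue for (node, tid). */
static void
ath_tx_swq(struct ath_node *an, int tid, struct ath_buf *bf)
{
        struct ath_tid *t = &an->tids[tid];

        bf->next = NULL;
        if (t->tail != NULL)
                t->tail->next = bf;
        else
                t->head = bf;
        t->tail = bf;
}
```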

The reality however is a little more difficult than that.

Firstly, frames are queued from multiple contexts. The most obvious is the per-interface queue (ie, wlanX). These may run in parallel and target the same hardware, so now I need to add locking to the per-node structure. Previously this wasn't much of a problem, since the TX code simply dispatched directly to the hardware, and the hardware queues are correctly locked. Now, however, there are software-driven queues (which need new locks) along with some more per-node state. That all needs locking.

Next, the rate control code wasn't locked at all. I've begun wrapping its entry points in per-node locks. Making sure I get this correct isn't easy.

Then there's the question of marking nodes and TIDs that have traffic to send. Right now that's done by adding each node to a linked list of "ready nodes". I then check all the TIDs for that node and dequeue whatever packets I have to. It's ugly code, but it works well enough for testing. I'll finish off implementing what ath9k and the Atheros reference code do (ie, maintaining both a "node" and a "TID" linked list, so I can directly walk only the TIDs that need servicing), but there are implications here for getting "QoS" right. (Ie, in which order do you dequeue packets? How many do you dequeue from each queue in order for things to be "fair"?) I'm going to leave all of these questions unanswered for now (as they're not handled in the current code anyway) and hope someone else steps up to it!
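A minimal sketch of that scheduler idea - a list of active TIDs, walked round-robin, with a TID put back on the tail if it still has work. Names here are hypothetical; the real ath9k code uses the kernel's list primitives and holds the appropriate locks.

```c
#include <stddef.h>

struct sched_tid {
        struct sched_tid *next;
        int tid;
        int pending;            /* frames waiting in the software queue */
};

struct sched_list {
        struct sched_tid *head, *tail;
};

/* Append a TID to the active list. */
static void
sched_add(struct sched_list *l, struct sched_tid *t)
{
        t->next = NULL;
        if (l->tail != NULL)
                l->tail->next = t;
        else
                l->head = t;
        l->tail = t;
}

/* Pop the head TID for servicing; re-queue it at the tail if it
 * still has pending frames, giving simple round-robin behaviour. */
static struct sched_tid *
sched_next(struct sched_list *l)
{
        struct sched_tid *t = l->head;

        if (t == NULL)
                return NULL;
        l->head = t->next;
        if (l->head == NULL)
                l->tail = NULL;
        if (t->pending > 0)
                sched_add(l, t);
        return t;
}
```

How many frames to actually pull from each TID per visit is exactly the unanswered "fairness" question above.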

At this point I now have a mostly-clean (for values of "clean") implementation of per-node, per-TID software queues. I'm still in the process of tidying all of this up in preparation to commit to -HEAD before I begin the aggregation code. I'll cover that in a moment.

Then there's handling the aggregation session stuff.

Firstly, there's handling the ADDBA exchange (ie, establishing an A-MPDU session.) When this is done, any unicast traffic pending to that node/TID needs to be held back until the ADDBA request/response exchange completes. This is known as "pausing" the queue. I've a hack in place to do this and will implement a correct "pause" API soon to clean it up.
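A "pause" API can be as simple as a counted flag, so that nested pause reasons (ADDBA in progress, BAR pending, and later filtered frames) compose without stomping on each other. A hypothetical sketch:

```c
struct tid_q {
        int paused;     /* pause count; >0 means don't schedule this TID */
};

static void
tid_pause(struct tid_q *t)
{
        t->paused++;
}

static void
tid_resume(struct tid_q *t)
{
        if (t->paused > 0)
                t->paused--;
}

static int
tid_can_tx(const struct tid_q *t)
{
        return (t->paused == 0);
}
```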

Next there's handling packets already queued to the hardware. This crept up on me without warning.

Firstly - a little background. The ADDBA exchange establishes the initial sequence number and window size to use as the "window" for subsequent data exchanges. After that, any packets being queued to the destination must fall within this window (typically, say, 64 sequence numbers long.) The window stays a fixed size, but the window itself slides along as TX packets are successfully completed. Since packets may be lost at any time (whether aggregate or not), the window can't move until the frames at the beginning of the window are successfully ACKed.
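One detail worth noting: 802.11 sequence numbers are 12 bits, so the "is this inside the window?" test has to handle wraparound at 4096. Modular arithmetic makes it a one-liner (hypothetical helper name):

```c
#include <stdbool.h>

#define SEQ_MASK        0xfff   /* 802.11 sequence numbers are 12 bits */

/* Is 'seqno' inside the block-ack window [start, start + wnd)?
 * The subtract-and-mask handles wraparound at 4096. */
static bool
baw_within(int start, int wnd, int seqno)
{
        return (((seqno - start) & SEQ_MASK) < wnd);
}
```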

So for example, say my window begins at "0", my window size is "4", and I send four packets with sequence numbers "0", "1", "2", "3". If the other end receives all four packets correctly and sends back a block-ACK indicating this, the new window position will begin at "4". I can then send packets "4", "5", "6" and "7". I can't send "8", even if I have it in my queue.

But say the other end only received "0", "1" and "3". "2" wasn't received, so I have to retransmit it. I can only advance the window to "2". I can't advance it to "3", since the other end didn't receive "2". I could advance it to "1", but that'd be pointless as "1" has already been received. So in this instance, both the station and access point would advance the window position to "2", and the station would then retransmit "2". If it's coded correctly, it could transmit "2", "4", "5". It can't transmit "6", as that's outside the window of 4 packets. It could just retransmit "2" if it wanted to and not the others. In fact, it could transmit "4", "5" and "2" - the receiver is then responsible for re-ordering the packets and only passing them up to the network layer once all the packets are available in the correct order.
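Advancing the window's left edge is then just skipping the leading run of ACKed frames. A sketch, where `acked` is a bitmap indexed from the current window start (hypothetical helper; the real driver tracks this per-TID):

```c
#define SEQ_MASK        0xfff   /* 802.11 sequence numbers are 12 bits */

/* Advance the window start past the leading run of ACKed frames.
 * Bit i of 'acked' set means frame (start + i) was ACKed. The new
 * start is the first un-ACKed frame, modulo 4096. */
static int
baw_advance(int start, unsigned int acked, int wnd)
{
        int i;

        for (i = 0; i < wnd && (acked & (1U << i)) != 0; i++)
                ;
        return ((start + i) & SEQ_MASK);
}
```

Running the example above through it: "0", "1" and "3" ACKed but "2" missing gives a bitmap of 0xb, and the window only advances to "2".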

Finally, if "2" just won't get through (say I can't seem to transmit it no matter how many times I retry), the STA needs to do a few things. It first has to pause the queue so no further packets are queued to the hardware. At this point, packets may still be sitting in the hardware queue - they'll be for sequence numbers greater than "2", but still within the block-ack window. Once the hardware has finished completing those packets (successfully or not), the STA needs to send a "BAR" frame to the access point to establish a new starting point for the sequence number window. In this instance it could set it to "3" - but what if "3" was already in the hardware queue and was successfully transmitted? That would confuse the hell out of the other end. So it looks at what the last successfully transmitted sequence number was, then sends the "BAR" with a starting sequence number one after that.

So, if "2" came back as having failed too many times, but "3" and "4" were in the hardware TX queue, the driver will pause the queue, wait for the pending frames to complete, then look at what happened. If "3" and "4" were successfully transmitted, it would send a BAR with the sequence beginning at "5". If "3" was successfully transmitted but "4" still needs retransmitting, it would send a BAR with the sequence beginning at "4". But if "3" failed and still needs retransmitting, it would set the BAR at "3" and then try retransmitting it. Finally (and this is another strange corner case I have to handle!) if "3" came back as having failed - and it failed too many times! - but "4" TX'ed ok, then I have to set the initial sequence number at "5" and send the BAR.
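Put another way: the BAR's starting point is the first frame that will actually be retransmitted, and everything before it must be finished - either ACKed, or dropped after too many retries. A sketch of that selection, with hypothetical names, where `st[i]` holds the fate of the frame with sequence number `start + i`:

```c
enum frame_state {
        FRAME_ACKED,    /* the peer ACKed it */
        FRAME_DROPPED,  /* gave up after too many retries */
        FRAME_RETRY     /* still needs retransmitting */
};

/* After the hardware queue drains, pick the BAR starting sequence
 * number: skip every leading frame that is finished (ACKed or
 * dropped); the first frame still needing a retry sets the new
 * window start. Sequence numbers wrap at 4096. */
static int
bar_start_seqno(int start, const enum frame_state *st, int nframes)
{
        int i;

        for (i = 0; i < nframes; i++)
                if (st[i] == FRAME_RETRY)
                        break;
        return ((start + i) & 0xfff);
}
```

Each of the corner cases in the paragraph above falls out of this one rule, including the last one: "2" dropped, "3" dropped, "4" ACKed gives a BAR at "5".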

The other end (in this instance, the hostap) would then receive the BAR, set the new expected sequence number to whatever was sent (say "5" in this instance), and flush all pending packets in the reorder queue - even if there are missing packets. (In the above example, "3" may be missing but "4" was received ok, so the receive stack would've been waiting for "3" before sending everything up to the network layer!) It then sends back a block-ACK to the STA confirming the new window position.

Now (phew!) given all of that, you can see what kind of complicated stuff is needed to properly handle all corner cases when doing software (re)transmitting of packets. I haven't even begun to talk about how to handle sending and re-sending aggregate frames! All of the above needs to be implemented before I can do that.

So! The above is what I'm now working on. Once that's done, I'll work on handling what's known as "filtered frames" (and that'll be brain-dumped in a future blog post, but it has to do with handling power save stations correctly, or A-MPDU sessions will be torn down prematurely!) and then when THAT is all done and stable, I'll look at handling aggregates.

And when I handle aggregates correctly, we'll finally have fast 11n TX in FreeBSD. Then I can enable "ATH_ENABLE_11N" by default. :-)

Friday, June 10, 2011

Uninformed programmers, or "stop assuming virtual memory is awesome."

I decided I needed to vent a bit post-exam. So where better to find a clue-void post to answer than slashdot?

The post, and my reply:

http://news.slashdot.org/comments.pl?sid=2229598&cid=36408502

In summary, the original poster wrote:

"In short, don't worry about fine-tuning what's "in memory". Don't change behavior based on total amount of memory in the system. Operating systems (OpenBSD aside) ALREADY DO THAT. Just let the memory manager do its job, and give it enough information (via interactivity information, memory priority, etc.) to do its job properly. Don't try to hack around problems at the wrong layers."

And my response:

"You assume that the OS will make sensible paging decisions. You assume you can hint to the OS that you're going to make sensible paging decisions. You hope the application, which is likely big, multithreaded and such, is doing the sensible thing of not wrapping large accesses to "memory things" (eg big trees of data, as an example, or image caches, or whatever takes up more than a small bit of RAM) in mutexes. You assume that your application is using memory in a sensible fashion, and not simply using a few bytes here and there in each allocated chunk."

I hate to say it, but if you're assuming that virtual memory is the be-all and end-all of your memory management strategy, you're setting yourself up for some really bad worst-case behaviour. Since you have no control over where and when the OS decides to punt pages to disk, you can't possibly begin to design methods around it. You can only hope that the OS doesn't page out the data you've got locked, or that the malloc implementation you're using doesn't handle fragmentation poorly.