Sunday, March 18, 2012

Concurrency in the TX path and when it all falls down..

I'm still (yes, still) hacking on the FreeBSD 802.11n ath(4) TX path. I'm trying to find and fix issues that creep up before I can flip on the 11n code by default.

I've become increasingly aware of the lack of locking in net80211, including some of the A-MPDU TX session management code. I know that I'll eventually have to plan out and implement some locking, but for now I'm just trying to squish whatever issues are showstoppers.

A user approached the freebsd-wireless list about two weeks ago and reported that his 802.11n session was hanging after a bit of use. If he tried 802.11g, things worked fine. He tried SMP - and things worked fine. The symptoms? A number of frames seemed to be "stuck" and sitting in a software TX queue, not being transmitted. But other frames would be TXed fine, so it wasn't as simple as a totally confused TX BAW (Block-ack Window.)

After I managed to reproduce the issue locally, I discovered what was going on.

It was concurrency.

Specifically, there are multiple places where ath_start() is being called, so multiple TX operations are occurring concurrently.

Now, in the non-aggregate method, net80211 is doing all the sequence number assignment. I'm not so sure that in the normal case, the sequence numbers are being allocated in a consistent, sequential way - if the net80211 TX code is able to be called concurrently from multiple threads, sequence numbers can and will be occasionally "raced" and allocated in the reverse order that they're submitted to the driver. But I'm not here to fix that (however I'll eventually have to.)

In the aggregate method, the driver is doing the aggregation and sequence number assignment. For now, the driver is also doing the TX BAW tracking and frame queuing. So imagine this sequence of events:

* a frame is submitted via ath_start();
* since aggregation is enabled, it's allocated a sequence number;
* it's then thrown into the software queue;
* then at some later stage, the software queue is checked, the frame is popped out of the list and if it's inside the BAW, it's added to the BAW and TXed;
* adding the frame to the TX BAW slides the left hand edge of the BAW to be at that sequence number. Any frames TXed with a seqno _less_ than this will be treated as outside of the BAW and put back onto the software queue for now.

Now, the locking only occurs at:

* the time the frame has the sequence number allocated;
* then when the queue is checked and the frame is popped off.

If two or more threads are allocating sequence numbers and doing work, it's quite possible that thread A will allocate (say) seqno 5, thread B will allocate seqno 6 and then add it to the BAW before thread A can. Then when thread A tries to do some work, it finds the queue has a frame with seqno 5 in it - and since it's before the left hand edge of the BAW (which is at 6, as it was successfully pushed to the hardware), it won't be transmitted and will stay in the software queue until the BAW sequence numbers wrap around to 0 and catch up.
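To make that race concrete, here's a minimal single-threaded sketch of it. None of this is the actual ath(4) code: `struct baw`, `baw_within()` and `baw_try_tx()` are hypothetical names, the window size of 64 is an assumption, and the "left edge slides to the transmitted seqno" behaviour is just the simplified model described above.

```c
#include <stdbool.h>

#define SEQ_MOD  4096   /* 802.11 sequence numbers are 12 bits wide */
#define BAW_SIZE 64     /* assumed block-ack window size */

/* Hypothetical, simplified BAW state: only the left edge is tracked. */
struct baw {
    int left;           /* seqno at the left-hand edge of the window */
};

/* True if seqno lies in [left, left + BAW_SIZE), modulo SEQ_MOD. */
static bool baw_within(const struct baw *b, int seqno)
{
    return ((seqno - b->left + SEQ_MOD) % SEQ_MOD) < BAW_SIZE;
}

/*
 * Try to transmit a frame whose seqno was allocated earlier.
 * Returns true if it went to the "hardware", false if it has to
 * stay on the software queue.
 */
static bool baw_try_tx(struct baw *b, int seqno)
{
    if (!baw_within(b, seqno))
        return false;   /* outside the window: stuck for now */
    b->left = seqno;    /* simplified model: left edge slides to seqno */
    return true;
}
```

With the left edge at 5, calling `baw_try_tx()` with 6 (thread B winning) and then with 5 (thread A arriving late) leaves seqno 5 rejected. It only becomes eligible again once the left edge has wrapped around to within 64 of it.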

Now, Linux ath9k/mac80211 doesn't have this problem. The TX pathway is totally serialised, which means that even if multiple threads are trying to TX, only one thread will be able to enter the TX code at any time. The other threads get blocked.

So how can I solve this? The easy solution would be to serialise FreeBSD's net80211 and ath driver TX code. That way all of this nonsense will go away. For the net80211 side of things this may work - the legacy TX path, where sequence numbers are allocated by the stack, could benefit from serialisation. The throughput isn't ever going to be that great, so we wouldn't really hurt from it. But the trouble is making absolutely sure that the driver also does the same - even if I push the frames into the queue in order and ensure that they have sequential sequence numbers, there's no guarantee that a driver with concurrent entry paths into XXX_start() will de-queue the frames and push them into the hardware in the same order.

So what I've chosen to do instead is to ignore the legacy part for now, and not serialise anything. Instead, I'm doing the sequence number allocation (for aggregation, remember) _at the time I'm about to add the frame to the BAW and TX it to the hardware_. I.e., until the frame actually is able to be added to the BAW, it won't _have_ a sequence number. Since this action is done behind a lock, it's guaranteed to be sequential. The trick here is to only allocate the sequence numbers once it's known for certain that the frame _will_ be going out to the hardware.
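A rough sketch of that idea follows. It's an illustration under assumptions rather than the real driver change: `struct tid`, `tid_enqueue()`, `tid_tx_next()` and `tid_ack_oldest()` are made-up names, a pthread mutex stands in for the kernel lock, and the left edge of the window is assumed to move only when a (hypothetical) completion comes back.

```c
#include <pthread.h>
#include <stddef.h>

#define SEQ_MOD    4096
#define BAW_SIZE   64      /* assumed block-ack window size */
#define SEQNO_NONE (-1)

struct frame {
    int seqno;             /* SEQNO_NONE until committed to TX */
    struct frame *next;
};

struct tid {
    pthread_mutex_t lock;  /* stands in for the driver's TX lock */
    struct frame *head, *tail;   /* software queue, FIFO order */
    int next_seqno;        /* next sequence number to hand out */
    int baw_left;          /* oldest seqno still outstanding */
};

static void tid_init(struct tid *t, int start_seqno)
{
    pthread_mutex_init(&t->lock, NULL);
    t->head = t->tail = NULL;
    t->next_seqno = t->baw_left = start_seqno;
}

/* Queue a frame; note that no sequence number is allocated here. */
static void tid_enqueue(struct tid *t, struct frame *f)
{
    f->seqno = SEQNO_NONE;
    f->next = NULL;
    pthread_mutex_lock(&t->lock);
    if (t->tail != NULL)
        t->tail->next = f;
    else
        t->head = f;
    t->tail = f;
    pthread_mutex_unlock(&t->lock);
}

/*
 * Pop the head frame and allocate its seqno in one locked step, but
 * only if the window has room. Returns the frame to hand to the
 * hardware, or NULL if nothing can be sent right now.
 */
static struct frame *tid_tx_next(struct tid *t)
{
    struct frame *f = NULL;

    pthread_mutex_lock(&t->lock);
    if (t->head != NULL &&
        ((t->next_seqno - t->baw_left + SEQ_MOD) % SEQ_MOD) < BAW_SIZE) {
        f = t->head;
        t->head = f->next;
        if (t->head == NULL)
            t->tail = NULL;
        f->next = NULL;
        f->seqno = t->next_seqno;
        t->next_seqno = (t->next_seqno + 1) % SEQ_MOD;
    }
    pthread_mutex_unlock(&t->lock);
    return f;
}

/* Hypothetical completion hook: the oldest outstanding frame was acked. */
static void tid_ack_oldest(struct tid *t)
{
    pthread_mutex_lock(&t->lock);
    if (t->baw_left != t->next_seqno)
        t->baw_left = (t->baw_left + 1) % SEQ_MOD;
    pthread_mutex_unlock(&t->lock);
}
```

Because the dequeue and the seqno allocation happen in the same locked section, whichever thread calls `tid_tx_next()` first gets both the oldest queued frame and the lowest free sequence number. A frame that can't go out yet simply has no sequence number to get stranded with.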

For the legacy path, it's also likely worthwhile delaying the sequence number allocation until it's just about to go to the ifnet queue. That way the frames on the ifnet queue have sequence numbers that are in order. Then I need to fix each driver (ugh) to make sure they're dequeued fine.

I've written up the aggregation change and this so far works quite well. I don't want to tackle the legacy path yet or fix other drivers, not until I've verified this works. What I also should do is write some test cases to verify that indeed sequence numbers are being presented to the driver in order, so I can identify this happening in the wild.
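As a starting point for that kind of test, something like the following check could be run over a log of sequence numbers in the order the driver saw them. It's only a sketch: `seq_first_out_of_order()` is a hypothetical helper, and it assumes the log covers a single stream with no retransmissions, so every seqno should be "ahead" of the previous one under 12-bit modulo comparison.

```c
#include <stddef.h>

#define SEQ_MOD 4096   /* 802.11 sequence numbers are 12 bits wide */

/*
 * Scan seqnos in the order they were presented to the driver.
 * Returns the index of the first entry that is not ahead of its
 * predecessor (modulo SEQ_MOD), or -1 if the whole log is in order.
 */
static int seq_first_out_of_order(const int *seq, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int delta = (seq[i] - seq[i - 1] + SEQ_MOD) % SEQ_MOD;

        /* delta of 0 is a duplicate; delta >= half the space means
         * the seqno is behind the previous one. */
        if (delta == 0 || delta >= SEQ_MOD / 2)
            return (int)i;
    }
    return -1;
}
```

Gaps are tolerated here on purpose; tightening the check to `delta == 1` would also catch skipped sequence numbers.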