Thursday, October 4, 2012

Lessons learnt from fiddling with the rate control code..

(Note before I begin: a lot of these ideas about rate control are stuff I came up with before I began working at my current employer.)

Once I had implemented filtered frames and did a little digging, I found that the rate control code was doing some relatively silly things. Lots of rates were failing quite quickly and the rate control was bouncing all over the place.

The first bug I found was that I was checking the TX descriptor completion status before I had copied the descriptor over - and so I was randomly treating transmissions as failed when they hadn't failed. Oops.
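
The idea, as a minimal sketch - the names and descriptor layout here are invented, and the real ath(4) completion path is more involved: snapshot the descriptor first, then check the completion bits on the copy, not on the live descriptor the hardware may still be updating.

    #include <stdint.h>

    /* Invented descriptor layout, for illustration only. */
    struct tx_desc {
            uint32_t        status;         /* completion bits, hardware-written */
    };
    #define TX_DESC_DONE    0x0001
    #define TX_DESC_FAILED  0x0002

    /*
     * Returns 0 if not yet complete, 1 on success, -1 on failure.
     */
    static int
    tx_complete(const volatile struct tx_desc *hw_ds)
    {
            struct tx_desc ds;

            ds = *hw_ds;                    /* copy the descriptor over first... */
            if ((ds.status & TX_DESC_DONE) == 0)
                    return (0);             /* ...then check the bits on the copy */
            return ((ds.status & TX_DESC_FAILED) ? -1 : 1);
    }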

Next, don't call the rate control code on filtered frames. They've been filtered, not transmitted. My code wasn't making that mistake - I'm just pointing it out for anyone else who is implementing this.
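
To make the rule concrete, the completion path should look roughly like this (invented, minimal names - not the actual driver code):

    #include <stdbool.h>

    /* Invented, minimal types for illustration only. */
    struct node;                    /* opaque per-station state */
    struct tx_status {
            bool    filtered;       /* frame was filtered, not transmitted */
    };

    void    rate_ctrl_tx_complete(struct node *, const struct tx_status *);
    void    requeue_filtered_frame(struct node *, const struct tx_status *);

    /*
     * A filtered frame never hit the air, so it must not feed the
     * rate control statistics; requeue it for a later retry instead.
     */
    static void
    tx_completion(struct node *ni, const struct tx_status *ts)
    {
            if (ts->filtered) {
                    requeue_filtered_frame(ni, ts);
                    return;                 /* no rate control update */
            }
            rate_ctrl_tx_complete(ni, ts);  /* real TX: update statistics */
    }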

Then I looked at what was going on with rate control. I noticed that whenever the higher transmission rates failed, it took a long time for the rate control code to try to sample them again. I went and did some digging - and found it was due to a coding decision I had made about 18 months ago. I treated higher rate failures with a low EWMA success rate as successive failures. The ath_rate_sample code treats "successive failures" as "don't try to probe this for ten seconds." Now, there are a few things you need to know about 802.11n:

  • The higher rates fail, often;
  • The channel state changes, often;
  • Don't be afraid to occasionally try those higher rates; it may actually work out better for you even under higher error rates.
So given that, I modified the rate control code a bit (a rough sketch follows the list):

  • Only randomly sample a few rates lower than the current one; don't try sampling all 6, 14 or 22 rates below the high MCS rates;
  • Don't treat low EWMA as "successive failures"; just let the rate control code oscillate a bit;
  • Drop the EWMA decay time a bit to let the oscillation swing a little more.
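
Here's an illustrative sketch of those tweaks - the names and structure are mine, not the actual ath_rate_sample code:

    #include <stdint.h>
    #include <stdlib.h>

    #define SAMPLE_WINDOW   3       /* only sample a few rates below current */

    /*
     * Pick a rate to sample near the current rate index, rather than
     * walking all 6, 14 or 22 rates below the high MCS rates.
     */
    static int
    pick_sample_rix(int current_rix)
    {
            int lowest = current_rix - SAMPLE_WINDOW;

            if (lowest < 0)
                    lowest = 0;
            /* random pick in [lowest, current_rix] */
            return (lowest + (arc4random() % (current_rix - lowest + 1)));
    }

    /*
     * Update the per-rate EWMA of success (0..100).  A low EWMA no
     * longer counts as "successive failures" (which would blacklist
     * the rate for ten seconds); it just drags the average down and
     * lets the rate selection oscillate.  A shorter decay lets the
     * oscillation swing a little more.
     */
    static void
    update_ewma(uint32_t *ewma, int ok)
    {
            const uint32_t decay = 75;      /* weight given to history, % */

            *ewma = (*ewma * decay + (ok ? 100 : 0) * (100 - decay)) / 100;
    }
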
Now the rate control code behaves much better and recovers much more quickly during unstable channel conditions (e.g. Adrian walking around a house whilst doing iperf tests.)

Given this, what could I do better? I decided to start reading up on the current state of play in 802.11n-aware rate control and rapidly came to the conclusion that - wow, we likely could do it better. The Linux minstrel_ht algorithm is also based on John Bicket's SampleRate code, but instead of using an EWMA and minimising packet transmission time, it uses the EWMA to calculate a theoretical throughput and maximises that. So, that sounds good, right?
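
Roughly, the metric looks like this - my paraphrase of the idea, not minstrel_ht's actual code:

    #include <stdint.h>

    /*
     * Estimate per-rate throughput from the EWMA success probability
     * (0..100) and the estimated airtime of one frame at that rate,
     * then pick the rate that maximises the estimate.  Assumes
     * tx_time_us is non-zero.
     */
    static uint32_t
    est_tput_kbps(uint32_t frame_bytes, uint32_t tx_time_us,
        uint32_t ewma_success)
    {
            uint64_t bits = (uint64_t)frame_bytes * 8;

            /* bits per microsecond is Mbit/s, so scale by 1000 for
             * kbit/s, then by the success probability over 100. */
            return ((uint32_t)((bits * ewma_success * 1000) /
                ((uint64_t)tx_time_us * 100)));
    }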

Except that the research shows 802.11n channel conditions can change frequently, especially at the higher MCS rates. The higher MCS rates can become better or worse within a window of a second or two. So, do you want to try to squeeze the last bit of throughput out of that, or not?

Secondly, using "throughput" as a metric is fine if your airtime is .. well, cheap. But what if you have many, many clients on an AP? Choosing rates purely to maximise the throughput predicted from the error rate doesn't take airtime into account. In fact, if you choose a higher MCS rate with a higher error rate but higher throughput, you may actually waste more air on those retransmissions. Great for a single station, but perhaps not so great when you have lots of them.

So what's the solution? The open source rate control stuff doesn't take the idea of "air utilisation" into account. There's enough data available to create an air time model, but no-one is using it yet. Patches are gratefully accepted. :-)
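
As a starting point, an airtime cost per delivered frame might look like this - an invented sketch, not existing driver code:

    #include <stdint.h>

    /*
     * Expected airtime spent per successfully delivered frame.  A
     * high-throughput rate with a high error rate burns extra air on
     * retransmissions, and this metric charges for that; a pure
     * throughput metric does not.
     */
    static uint32_t
    expected_airtime_us(uint32_t tx_time_us, uint32_t ewma_success /* 0..100 */)
    {
            if (ewma_success == 0)
                    return (UINT32_MAX);    /* effectively unusable */
            /* on average, 100 / ewma_success attempts per delivered frame */
            return ((uint32_t)(((uint64_t)tx_time_us * 100) / ewma_success));
    }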

Finally, the current packet scheduler is pretty simple and stupid (and does break in a lot of scenarios, sigh.) It's just a FIFO, servicing nodes/TIDs in the order their traffic arrived. But that's not very fair - both from a "who is next" standpoint and from a "what's the most efficient use of the air" view. In addition, the decision about which node/TID to schedule next is made totally separately from the rate control decision. Rate control occurs rather late in the packet transmission process (i.e., once we've committed to queuing the frame to the hardware.) Wouldn't it be better to have the packet scheduler and rate control code joined at the hip, where the scheduler doesn't aggressively schedule traffic to a slow/lossy end node?
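
To make that concrete, a rate-control-aware scheduler might pick the next node/TID something like this (everything here is invented for illustration; it isn't how the current scheduler works):

    #include <stdint.h>
    #include <stddef.h>

    /* Invented, minimal per-TID state for illustration only. */
    struct tid_state {
            int             has_traffic;        /* frames queued? */
            uint32_t        expected_airtime;   /* from rate control, us/frame */
    };

    /*
     * Instead of a plain FIFO, ask rate control how expensive each
     * node/TID currently is to serve, and prefer the cheapest.  A real
     * scheduler would also need a fairness mechanism so slow/lossy
     * nodes are deprioritised rather than starved outright.
     */
    static int
    pick_next_tid(const struct tid_state *tids, size_t ntids)
    {
            uint32_t best_cost = UINT32_MAX;
            int best = -1;
            size_t i;

            for (i = 0; i < ntids; i++) {
                    if (!tids[i].has_traffic)
                            continue;
                    if (tids[i].expected_airtime < best_cost) {
                            best_cost = tids[i].expected_airtime;
                            best = (int)i;
                    }
            }
            return (best);          /* -1 if nothing is pending */
    }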

Lots of things to think about..
