So I went digging into it a bit further (and it took quite a few evenings to narrow down the cause.) Here's the story.
Firstly - hostapd kept kicking off my station. Ok, so I had to figure out why. It turns out that the group rekey would occasionally fail. When it's time to do a group rekey, hostapd will send a unicast EAPOL frame to each associated station with the new key and each station must send back another EAPOL frame, acknowledging the fact. This wasn't happening so hostapd would just disconnect my laptop.
Ok, so then I went digging to see why. After adding lots of debugging code I found that the EAPOL frames were actually making to my Macbook Pro _AND_ it was ACKing them at the 802.11 layer. Ok, so the frame made it out there. But why the hell was it being dropped?
Now that I knew it was making it to the end node, I could eliminate a bunch of possibilities. What was left:
- Sequence number is out of order;
- CCMP IV replay counter is out of order;
- Invalid/garbled EAPOL frame contents.
I quickly ruled out the EAPOL frame contents. The sequence number and CCMP IV were allocated correctly and in order (and never out of sequence from each other.) Good. So what was going on?
Then I realised - ok, all the traffic is in TID 16 (the non-QoS TID.) That means it isn't a QoS frame but it still has a sequence number; so it is allocated one from TID 16. There's only one CCMP IV number for a transmitter (the receiver tracks a per-TID CCMP IV replay counter, but the transmitter only has one global counter.) So that immediately rings alarm bells - what if the CCMP IV sequence number isn't being allocated in a correctly locked fashion?
Ok. So I should really fix that bug. Actually, let me go and file a bug right now. There.
There. Bug filed. PR 172338.
Now, why didn't this occur back in Perth? Why is it occuring here? Why doesn't it occur under high throughput iperf (150Mbps+) but it is when the iperf tests are capped at 100Mbps ethernet speeds? Why doesn't it drop my FreeBSD STAs?
Right. So what else is in TID 16? Guess what I found ? All the multicast and broadcast traffic like ARPs are in TID 16.
Then I discovered what was really going on. The pieces fell into place.
- My mac does go in and out of powersave - especially when it does a background scan.
- When the mac is doing 150Mbps+ of test traffic, it doesn't do background scans.
- When it's doing 100Mbps of traffic, the stack sneaks in a background scan here and there.
- Whenever it goes into background scan, it sends a "power save" to the AP..
- .. and the AP puts all multicast traffic into the CABQ instead of sending it to the destination hardware queue.
- Now, when this occured, the EAPOL frames would go into the software queue for TID 16 and the ARP/multicast/etc traffic would go into the CABQ
- .. but the CABQ has higher priority, so it'll be transmitted just after the beacon frame goes out, before the EAPOL frames in the software queue.
Now, given the above set of conditions, the ARP/multicast traffic (which there's more of in my new place, thanks to a DSL modem that constantly scans the local DHCP range for rogue/disconnected devices) would be assigned sequence numbers AFTER the EAPOL frames that went out but are sitting in the TID 16 software queue. The Mac would receive those CABQ frames with later sequence numbers, THEN my EAPOL frame. Which would be rejected for being out of sequence.
The solution? Complicated.
The temporarily solution? TID 16 traffic is now in a higher priority hardware queue, so it goes out first. Yes, I should mark EAPOL frames that way. I'll go through and tidy this up soon. I just needed to fix this problem before others started reporting the instability.
The real solution is complicated. It's complicated because in power save mode, there's both unicast and multicast traffic going into the same TID(s) but different hardware queues. Given this, it's quite possible that the traffic in the CABQ will burst out before the unicast packets with the same TID make it out via another hardware queue.
I'm still thinking of the best way to fix this.