Adrian Chadd's Ramblings: June 2012

Friday, June 15, 2012

Don't let anyone tell you that FreeBSD doesn't "do" 802.11n:

This is from my FreeBSD-HEAD 802.11n access point, currently doing ~ 130MBit/s TCP:


# athstats -i ath0
41838297     data frames received
31028383     data frames transmit
78260        short on-chip tx retries
3672         long on-chip tx retries
197          tx failed 'cuz too many retries
MCS13        current transmit rate
8834         tx failed 'cuz destination filtered
477          tx frames with no ack marked
239517       rx failed 'cuz of bad CRC
10           rx failed 'cuz of PHY err
    10           OFDM restart
42043        beacons transmitted
143          periodic calibrations
-0/+0        TDMA slot adjust (usecs, smoothed)
45           rssi of last ack
51           avg recv rssi
-96          rx noise floor
812          tx frames through raw api
41664029     A-MPDU sub-frames received
42075948     Half-GI frames received
42075981     40MHz frames received
13191        CRC errors for non-last A-MPDU subframes
129          CRC errors for last subframe in an A-MPDU
2645042      Frames transmitted with HT Protection
351457       Number of frames retransmitted in software
23299        Number of frames exceeding software retry
30674735     A-MPDU sub-frame TX attempt success
374408       A-MPDU sub-frame TX attempt failures
8676         A-MPDU TX frame failures
443          listen time
6435         cumulative OFDM phy error count
161          ANI forced listen time to zero
3672         missing ACK's
78260        RTS without CTS
1469003      successful RTS
239605       bad FCS
2            average rssi (beacons only)
Antenna profile:
[0] tx  1466665 rx        1
[1] tx        0 rx 41838296

Monday, June 11, 2012

A tale of two sequence numbers, or "when QoS seqno and CCMP PN don't match up"..

Many moons ago (say, 3 or 4 weeks - so hm, most-of-a-moon-ago actually) I found a rather curious failure condition in the ath(4) TX aggregation path. The colourful history is documented in FreeBSD kern/166190. In short - there were situations where sequence numbers were allocated in a different order to how frames were being added to the block-ack window tracking, and if you got unlucky, you'd cause the stack to think a frame was (far) outside the BAW.

The 30 second explanation:

Imagine you allocated four frames - sequence numbers 1, 2, 3 and 4. They have to be added to the block-ack window in precisely that order. Ie:

Starting condition: Window is at 0:63 (64 frame window, starting at 0, so ending at 63)
Add 1: Window is now at 0:63, starting at 1
Add 2: Window is now at 0:63, starting at 2
Add 3: Window is now at 0:63, starting at 3.

The reason the window pointer isn't moving along is because although you've sent the frames (or you're about to), you can't advance it until the other end has ACKed it (via a block-ack or a normal ACK.) For more information, google how 802.11n aggregation works.

The important bit here is that the window is still 0:63 and the starting point is now '3'. This continues all the way to trying to queue frame 64, which will be outside of the current BAW and so not allowed to be transmitted. It'll sit in the software queue and wait until frame '0' has been ACKed and the BAW has been advanced to be 1:64 - at which point 64 will fall inside the window and will be transmitted.

So yes, the sender is tracking two things - the BAW and what the starting point is that they've added to the BAW.
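
Here's a minimal sketch of that window check. It's Python rather than the driver's C, and the names (baw_offset, in_baw) are made up for illustration - they're not from ath(4). It assumes the usual 12-bit 802.11 sequence number space (4096 values) and the 64 frame window from the example above.

```python
SEQ_SPACE = 4096   # 802.11 sequence numbers are 12 bits wide
BAW_SIZE = 64      # window size used in the example above


def baw_offset(left_edge, seqno):
    """Distance of seqno ahead of the window's left edge, modulo the seqno space."""
    return (seqno - left_edge) % SEQ_SPACE


def in_baw(left_edge, seqno, size=BAW_SIZE):
    """True if seqno falls inside the window [left_edge, left_edge + size - 1]."""
    return baw_offset(left_edge, seqno) < size


if __name__ == "__main__":
    # Window 0:63: frame 63 fits, frame 64 has to wait in the software queue.
    print(in_baw(0, 63), in_baw(0, 64))
    # Once frame 0 is ACKed the window slides to 1:64 and frame 64 fits.
    print(in_baw(1, 64))
```

The modulo is what makes a sequence number just behind the left edge look like it is almost a full wrap ahead of it, which matters for the failure described next.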

Now, imagine instead of (1, 2, 3, 4) on the software queue, I somehow get preempted (or race between two sending threads, when using SMP) between 'allocated seqno' and 'queue to software queue'. In the existing code, a lock was held when:

Allocating a sequence number, then it was dropped; then
Adding it to the software queue.

Now because there was a period where no lock was being held, it's quite possible that what ends up on the software queue is (2, 1, 3, 4.) So:

Starting condition: Window is at 0:63
Add 2: Window is now 0:63, starting at 2.
Add 1: Window is 0:63, starting at 2; 1 is outside of the BAW (it's treated as a 'wraparound', so imagine it's 4095 seqno's away) so TX stalls.
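
The same two cases as a toy model, assuming a sender that tracks only a window left edge and a "starting point". BawTracker and first_stall are hypothetical names for illustration, and this is far simpler than what the driver actually does.

```python
SEQ_SPACE = 4096
BAW_SIZE = 64


class BawTracker:
    """Toy sender-side tracker: window left edge plus the 'starting point'."""

    def __init__(self):
        self.left = 0    # left edge of the block-ack window
        self.start = 0   # newest sequence number added so far

    def add(self, seqno):
        # A seqno behind the starting point wraps around to a huge offset.
        ahead_of_start = (seqno - self.start) % SEQ_SPACE
        inside_window = (seqno - self.left) % SEQ_SPACE < BAW_SIZE
        if ahead_of_start >= BAW_SIZE or not inside_window:
            return False, ahead_of_start
        self.start = seqno
        return True, ahead_of_start


def first_stall(order):
    """Return (seqno, apparent distance) of the first rejected frame, or None."""
    tracker = BawTracker()
    for seqno in order:
        accepted, distance = tracker.add(seqno)
        if not accepted:
            return seqno, distance
    return None


if __name__ == "__main__":
    print(first_stall([1, 2, 3, 4]))   # in order: nothing stalls
    print(first_stall([2, 1, 3, 4]))   # frame 1 looks 4095 seqnos away
```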

This was the cause of the TX stalls that I was seeing originally in kern/166190. I "fixed" it by only allocating sequence numbers when the frame was about to be transmitted for the first time, and then adding it to the BAW right there. Since both sequence number allocation and adding to the BAW happened inside the same lock, everything was sweet.
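
The shape of that fix, sketched with Python threads standing in for the racing senders. TidState and transmit are hypothetical names, and a threading.Lock stands in for the driver lock; the only point is that allocation and the BAW add sit inside one critical section.

```python
import threading

SEQ_SPACE = 4096


class TidState:
    """Toy per-TID state: a seqno allocator and the order frames hit the BAW."""

    def __init__(self):
        self.lock = threading.Lock()
        self.next_seqno = 1
        self.baw_order = []

    def transmit(self):
        # Allocate the sequence number and add it to the BAW under one lock,
        # so no other sender can slip in between the two steps.
        with self.lock:
            seqno = self.next_seqno
            self.next_seqno = (seqno + 1) % SEQ_SPACE
            self.baw_order.append(seqno)
        return seqno


def run(threads=4, frames_per_thread=200):
    tid = TidState()

    def worker():
        for _ in range(frames_per_thread):
            tid.transmit()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return tid.baw_order


if __name__ == "__main__":
    order = run()
    print(order == list(range(1, 801)))
```

Whichever thread wins the lock, the BAW sees the sequence numbers in allocation order.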

Except, I totally forgot about CCMP PN. So under high enough UDP TX loads (say, > 200MBit), I'd hit the same race, but between 802.11 sequence numbers and CCMP PN sequence numbers.

CCMP PN is assigned during 802.11 encapsulation time, in the driver. In the ath(4) case, it's done during transmit and before being queued to the software queue. And it was being done outside of any locking. So it's very possible that frames would end up on the software queue with 802.11 and CCMP PN sequence numbers out of lock-step.

What would happen?

Simply - after the 802.11n reordering occurred on the receive side, the CCMP PN replay detector would notice sequence numbers out of order, and start tossing said out of order frames. Lots of packet loss ensued.
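
A rough sketch of the receive side, assuming a replay detector that only accepts a strictly increasing PN. The receive function and the (seqno, pn) tuples are made up for illustration, and sequence number wraparound is ignored.

```python
def receive(frames):
    """frames: (seqno, pn) tuples as they arrived over the air.

    Returns (delivered, dropped) lists of sequence numbers.
    """
    # 802.11n reordering: hand frames up in sequence-number order.
    # (No wraparound handling; this is only a sketch.)
    reordered = sorted(frames, key=lambda frame: frame[0])

    last_pn = 0
    delivered, dropped = [], []
    for seqno, pn in reordered:
        if pn <= last_pn:
            # Replay detector: the PN went backwards, so toss the frame.
            dropped.append(seqno)
        else:
            last_pn = pn
            delivered.append(seqno)
    return delivered, dropped


if __name__ == "__main__":
    # Sequence numbers and PNs in lock-step: everything is delivered.
    print(receive([(1, 1), (2, 2), (3, 3), (4, 4)]))
    # The first two frames got their PNs assigned in the opposite order.
    print(receive([(1, 2), (2, 1), (3, 3), (4, 4)]))
```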

So, I sat down and started trying to address it. The simplest thing - wrap the whole encapsulation path between ieee80211_crypto_encap(), 802.11 sequence number assignment and software/hardware queueing behind the TID (well, hardware TXQ) lock. It took some time; I had to revert two earlier commits which introduced the delayed sequence number allocation.

This didn't fix things. So I was back to square one.

I started looking at all the places where the frames were being queued to the software queue and .. well, let's just say I spent Sunday swearing _at myself_ for all the weird and wonderful stupid mistakes I had made when writing/porting this code over.

The short version follows (the long version is "read the sys/dev/ath/ commit logs and the PR history"):

When I was queueing frames to the software queue, I'd check how deep the hardware queue was. If the hardware queue was shallow/empty, I'd direct dispatch up to two frames to the hardware to get things 'busy'. That will (hopefully) let further frames come along in the meantime and be aggregated. However, I was queueing the new frame to the hardware rather than queueing the new frame to the tail of the queue, and queueing the head frame of the queue to the hardware. That led to some out of order behaviour.
ath_tx_xmit_aggr() would check if the sequence number was within the block-ack window and if it wasn't, it'd queue the frame to the tail of the queue. This meant that any new frames that came along would be queued to the end of the queue, even if they had been dequeued from the head of the queue. This led to frames on the software queue being out of order.

Frames on the software queue don't have to be in-sequence (as retries are prepended to the beginning of the list, and new frames are appended to the end) however they have to be in-order. If they end up being out of order, the BAW logic fails.
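
The direct dispatch mistake can be sketched like this. It assumes a hypothetical moment where two frames are already waiting on the software queue and the hardware queue has drained; enqueue_buggy, enqueue_fixed and hw_threshold are illustrative names, not the driver's.

```python
from collections import deque


def enqueue_buggy(swq, hwq, frame, hw_threshold=2):
    """Direct-dispatch the *new* frame when the hardware queue is shallow."""
    if len(hwq) < hw_threshold:
        hwq.append(frame)            # jumps ahead of anything waiting in swq
    else:
        swq.append(frame)


def enqueue_fixed(swq, hwq, frame, hw_threshold=2):
    """Append the new frame to the tail, then dispatch from the head."""
    swq.append(frame)
    if len(hwq) < hw_threshold:
        hwq.append(swq.popleft())


def demo(enqueue):
    swq = deque([1, 2])   # frames already waiting in software
    hwq = deque()         # hardware queue has drained
    enqueue(swq, hwq, 3)  # frame 3 arrives
    return list(hwq), list(swq)


if __name__ == "__main__":
    print(demo(enqueue_buggy))   # frame 3 goes out ahead of 1 and 2
    print(demo(enqueue_fixed))   # frame 1 goes out, 2 and 3 wait in order
```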

So, now that I allocate sequence numbers at packet queue time, I have to be triply sure that what ends up on the software queue is correctly in order, or the BAW logic will cause traffic stalls and potentially duplicate sequence number issues. Yes, this means that the old behaviour, whilst it now works right with all the right locking, requires me to correctly handle putting frames on the software queue. (Or, as I like to say, "keeping the bastard (me) honest.")

TL;DR - 802.11n aggregation works again. Now, to fix those pesky "queue full and I want to send a BAR frame so I can unblock the full queue and transmit" problems. At least that one is more tractable and easier to solve. Or is it.

Wednesday, June 6, 2012

FreeBSD, Netflix, CDN

The big news this week is the Netflix Open Connect platform, which was just announced. It uses off-the-shelf hardware - and FreeBSD + nginx.

The question is how you could spin it.

You could say "Netflix chose FreeBSD because they can keep their changes proprietary." Sure, they could. But they're not making appliances that they're selling - they're owning the infrastructure and servers. It's unclear whether they'd have to contribute back any Linux changes if they ran Linux on their Open Connect platform. They're making a conscious, public decision to distribute their changes back to FreeBSD - even though they don't have to.

You could say "Netflix chose FreeBSD because the people inside the company knew FreeBSD." Sure, they may have. The same thing could be said about why start-ups and tech companies choose Linux. A lot of the time it's because they're chasing enterprise support from Red Hat. But technology startups using Ubuntu or Debian tend not to be paying support fees - they hire smart people who know the technology. So, yes - "using what they know."

According to the Netflix Open Connect website:

"This was selected for its balance of stability and features, a strong development community and staff expertise. We will contribute changes we make as part of our project to the community through the FreeBSD committers on our team."

Let's pull this apart a little.

"Balance of stability and features." FreeBSD has long been derided for how slowly it moves in some areas. The FreeBSD developers tend to be a conservative bunch, trying to find the balance between new feature development and maintaining both stability and backwards compatibility.
"Strong community." FreeBSD has a strong technical development community and Netflix finds this very important. They're also willing to join and participate in the community like many other companies do.
"Staff expertise." So yes, their staff are familiar with FreeBSD. They're also familiar with Linux. They chose a platform which they have the expertise to develop, use and improve. They didn't just choose an unfamiliar platform because of marketing brochures or sales promises. I don't see any negatives here. I'm sure that Google engineers chose Linux to begin with because they were familiar with Linux.
"Contribute changes we make as part of our project to the community." Netflix has committed to push improvements and fixes back to the upstream project. They contributed some bug fixes in the 10GE Intel driver and IPv6 stack this week. This is collaborative open source working the way it should.

Why would Netflix push back changes and improvements into a project when they're not required to? That's something you should likely ask them. But the same good practice arguments hold for both Linux AND BSD projects:

The project is a constantly moving target. If you don't push your changes back upstream, you risk carrying around increasingly larger changes as your project and your BSD upstream project diverge. This will just make things more difficult in the long run.
By pushing your changes upstream, you make it easier to move with the project - including adopting improvements and new features. If you keep large changes to yourself, you will likely find it increasingly difficult to update your software to the newer upstream versions. And that upstream project is likely adding bug fixes, improvements and new features - which at some point you may wish to leverage. By pushing your changes upstream, you make it a lot easier to move to future versions of the upstream project, allowing you to leverage all those fixes and improvements without too much engineering time.
By participating, you encourage others to adopt your technology. By pushing your changes and improvements upstream, you decrease the amount of software you have to maintain yourself (and keep patching as the upstream project moves along.) But you also start to foster technology adoption. The FreeBSD jail project started out of the desire by a hosting company to support virtualisation. Since then, the Jail infrastructure has been adopted by many other companies and individuals.
When others use your technology, they also find and fix bugs in your technology; they may even improve it. The FreeBSD jail support has been extended to include IPv6 support and shared memory support, and it integrates into the VIMAGE (virtualised networking) stack (which, by the way, came from Ironport/Cisco.) As a company, you may find that the community will do quite a lot of the work that you would normally have to hire engineers to do yourself. This saves time and saves money.
When companies contribute upstream, it encourages other companies to also contribute upstream. A common issue is "reinventing the wheel", where companies end up having to reinvent the same technology privately because no-one has contributed it upstream. They solve the same problems, they implement the same new features .. and they all spend engineering time and resources to do so.
And when companies contribute upstream, it encourages (private) developers to contribute. Open source developers love to see their code out there in the wild, in places they never quite thought of. It's encouraging to see companies build products with their code and contribute back bug fixes and improvements. It fosters a sense of community and participation, of "give and take", rather than just "take". This is exactly the kind of thing that keeps developers coming back to contribute more - and it attracts new developers. Honestly, who wouldn't want to say that some popular device is running code that they wrote in their spare time?

So, you could rant and rave about the conspiracy side of Linux versus FreeBSD. You could rant on about GPL versus BSD. Or, you could see the more useful side of things. You could see a large company who didn't have to participate at all, agreeing to contribute back their improvements to an open source operating system. You could see that by doing so, the entire open source ecosystem benefits - not just FreeBSD. There's nothing stopping Linux or other BSD projects from keeping an eye on the improvements made by Netflix and incorporating those improvements into their own project. And it's another case of a company participating and engaging the open source community - and having that community engage them right back.

Good show, Netflix. Good show.