Sunday, April 8, 2012

And the winner of the most committing committer to src/sys over the last 12 months is ..

.. well, it was me. Up until last week. marius@ snuck in to take my place. I guess all of those commits I did about this time last year to start bringing FreeBSD's ath(4) support up to scratch are finally expiring out of the 12 month window.

(Source: http://people.freebsd.org/~peter/commits.html)

.. but I wouldn't call myself the most important committer. Or the most active. What I'd call myself is the "most active fixing a sorely needed corner of the codebase."

What I _could_ have done is simply do all my work in a branch and then merge it back into -HEAD when I was done. And, for about 6 months, this is what I did. The "if_ath_tx" branch is where I did most of the initial TX aggregation work.

But as time goes on, your branch diverges more and more from the master branch (-HEAD in FreeBSD) and you are faced with some uncomfortable decisions.

If you stay on the same branch point and never merge in anything from your master branch, you _may_ have a stable snapshot of code, but who knows how stable (or relevant) your work will be when you merge it back into master.

You have no idea if your work will break anything in master and you have no idea if changes in master have broken your work.

As time goes on, the delta between your branch point and the master branch increases, making it even more difficult to do that final merge back. It also has the side effect of making it increasingly likely that problems will occur with the merge (your code breaking master, master breaking your code, etc.)

So as uncomfortable as it was - and as much as I wanted things to stay stable - I did press through with relatively frequent merging. This means:
  • I would pick specific development targets to work towards, at which point I'd stop developing and go into a code review/tidyup/testing phase;
  • I'd do frequent merges from master back into my branch during active development - I wouldn't leave this until I was ready to merge my work back into master;
  • Once I reached my development target and had done sufficient testing - including integrating changes from master back into my branch - I'd kick off a semi-formal review (read: email freebsd-wireless@) and call for testers/review;
  • Only _then_ would I merge what was suitable back into master.

I wouldn't merge everything from my branch into master. In my instance, there were some debugging extensions that were easy to maintain (read: lots of device_printf() calls) but weren't suitable for FreeBSD-HEAD. But I merged the majority of my work each time.

But that doesn't always work. I managed to merge a bunch of ath(4), ath_hal(4) and net80211 fixes back into -HEAD as appropriate. But the TX aggregation code was .. well, rather large. So I attempted to break up my commit into as many small, self-contained functional changes as possible. Yes, there was a big "here's software TX queue and aggregation" as a big commit at the end but I managed to peel off more than 30% of that in the lead-up commits.

Why bother doing that?

Two words - version bisection. Once I started having users report issues, they would report something like "FreeBSD-HEAD revision X worked, revision Y didn't." (If I were lucky, of course.) Or, they'd note that a certain snapshot from a certain day worked, but the next day had a regression. If I had committed everything as one enormous commit after having spent 6 + months on the branch, I'd be in for a whole lot of annoying line-by-line debugging of issues. Instead, I was able to narrow down most of the regressions by trying all the different commits.

Now that 802.11n ath(4) TX aggregation and general 802.11n support is in the tree, I only use branches for larger scale changes that take a couple of weeks. For example, when fixing up the reset path to not drop any TX/RX frames. I do most of the bugfixing in FreeBSD-HEAD. I could do it in a branch and then do monthly merges, but I then have the same problems I've listed above.

In summary: don't underestimate how helpful it is to break down your commits into little, piecemeal, self-contained functional changes. It has the side effect of making you look really good in the committer statistics.

Thursday, April 5, 2012

The initial introduction into "it's the NIC, stupid!"

I've finally hit that dreaded condition which hardware guys hate - where a NIC is behaving badly.

In my case, an IBM/Lenovo Thinkpad T60 has been modified (not by me) to take an Atheros AR9280 NIC. Unfortunately, the NIC was proving to be very unstable when doing 802.11n throughput. The investigations did show I was doing something slightly incorrect with TX descriptors (and I've since fixed that) but the stability issues remained.

The Atheros NICs can expose some host interface error conditions via the AR_INTR_SYNC_CAUSE register. These include PCI(e) transaction timeouts, illegal chip access (eg whilst the MAC is asleep), parity errors, and other rather nice things. FreeBSD's HAL and Linux ath9k does have the register definition for what the bits do - but unfortunately we don't keep statistics.

In my particular case:
  • I'd see AR_INTR_SYNC_LOCAL_TIMEOUT occur. This is because a PCI(e) transaction didn't complete in time. I can tune these timeouts via a local register but that's not the point - I was seeing these errors when receiving only beacons from the access point. That's a bit silly.
  • I'd also see AR_INTR_SYNC_RADM_CPL_DLLP_ABORT, which is an indication that the PCIe layer isn't behaving well.

I swapped it out with another AR9280 based NIC and suddenly all the instabilities have gone away. No TX hangs, no missed TX interrupts. Everything looks great.

So as an open source developer, I want to try and put some tools into the hands of the community to be able to debug what's going on - or, if that's not possible, at least get some indication that things are going wrong. Right now the only thing people see is "I see TX timeouts, it must be the driver/chip fault." There's too much going on to be able to conclude that.

My game plan is this:

  • Implement statistics keeping for each of the SYNC interrupts and expose those via a diagnostic interface. Ben Grear has done something similar for Linux ath9k after a private email discussion. He's also seeing MAC sleep accesses, so it's quite likely we'll start finding/squishing these.
  • Take the offending laptop/NIC to the office and attach it to a very expensive and fancy looking PCIe analyser. I'm hoping we'll find something really silly occuring - like lots of sleep state transitions, or a high number of parity errors.
  • Try documenting this a lot better so users are able to understand what's going on when their NIC is misbehaving.