Saturday, December 14, 2013

Experimenting with zero-copy network IO in FreeBSD-HEAD

Back when I started all of this networking hacking, the "big thing" was the overhead of doing poll() and select(). Various operating systems came up with ways of eliminating these - FreeBSD grew the kqueue infrastructure; Linux received epoll; Solaris received an epoll-like device and then ended up with some form of kqueue-like event mechanism. Windows has completion ports/overlapped IO, which combine the event mechanism with a zero-copy way of doing network IO.

So the Free/Open operating systems have scalable event notification mechanisms for handling large numbers of concurrent sockets but they don't all have some nice, efficient way of doing zero-copy network IO.

Linux has splice()/tee()/vmsplice(). So yes, it effectively does have a way of doing zero-copy socket reading and writing.

OpenBSD does have a splice style syscall to copy data from a source to a destination TCP socket.

FreeBSD, however, has mostly focused on the "disk to network" path for content serving and thus has a lot of time invested in their sendfile() implementation. This is great if you're doing a lot of file to network sending (which Netflix does), but it has some serious shortcomings. The main one I'll address here is the lack of being able to do general zero-copy socket writes from userland. So it can only send data from disk files to the network. You can't implement a zero-copy intermediary proxy server, nor a memory cache that keeps things in pre-allocated memory regions. You have to use disk files (whether that be a real filesystem on disks, or a memory filesystem) and leverage VM hints to control caching.

Recently there was some new sendfile() work to allow sending from POSIX shared memory segments. This intrigued me - it's not the most effective way of doing zero-copy network IO from userland but it's a start. So I set off to write an updated version of my network library from yesteryear to implement some massively parallel network applications with.

The idea is simple - you allocate a POSIX shared memory segment. You then mmap() that region into memory and treat it as a place to allocate write-side network buffers from. Then you use the shared memory file descriptor and offset to schedule a sendfile() from the shared memory segment to the destination network socket. It's not as elegant as having a write path that wires the memory down and just populates mbufs from that, but that'll come later.
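The idea above can be sketched in a few lines. This is only a rough illustration and not the libiapp code: Python's standard library has no public shm_open() wrapper that hands back a file descriptor, so an anonymous temporary file stands in for the POSIX shared memory segment and a socketpair stands in for the TCP connection. The names make_region, send_from_region and demo are made up for this example.

```python
import mmap
import os
import socket
import tempfile

REGION_SIZE = 64 * 1024  # hypothetical size of the write-side buffer pool


def make_region(size=REGION_SIZE):
    """Create the backing object and map it into memory.

    Assumption: the temporary file stands in for a shared memory segment.
    """
    backing = tempfile.TemporaryFile()
    backing.truncate(size)
    view = mmap.mmap(backing.fileno(), size)  # shared mapping by default on Unix
    return backing, view


def send_from_region(sock, backing, offset, length):
    """Send [offset, offset + length) of the region using sendfile()."""
    sent = 0
    while sent < length:
        n = os.sendfile(sock.fileno(), backing.fileno(),
                        offset + sent, length - sent)
        if n == 0:
            break
        sent += n
    return sent


def demo():
    backing, view = make_region()
    payload = b"write-side buffer contents"
    offset = 4096  # pretend a buffer allocator handed out this slot
    view[offset:offset + len(payload)] = payload  # fill the buffer via the mapping

    tx, rx = socket.socketpair()
    sent = send_from_region(tx, backing, offset, len(payload))

    received = b""
    while len(received) < sent:
        chunk = rx.recv(sent - len(received))
        if not chunk:
            break
        received += chunk

    tx.close()
    rx.close()
    view.close()
    backing.close()
    return payload, received
```

Note what is missing here: nothing tells the caller when the kernel is finished with the buffer, so the slot at that offset can't safely be reused straight after send_from_region() returns.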

Here's what I found.

Firstly, there's no asynchronous "I'm done!" notification for the sendfile path. So you have no explicit notification that the underlying memory has been freed so you can reuse it. sendfile() has the SF_SYNC flag which causes it to sleep until the transaction is done - primarily so users can be sure they can change the underlying file contents after the syscall completes. This is used by caches such as Varnish that leverage on-disk files as their cache filesystem space.

So I've been adding that. I have a working prototype that is scaling quite well under load and I'll look to commit it to FreeBSD-HEAD soon. It posts a knote to a kqueue file descriptor once a transaction has completed.

Once that was done, I started benchmarking the performance of this setup.

The first real roadblock I hit was massive VM contention on the shared memory segment. It turns out that a single POSIX shared memory segment is represented as a single vm_object and this is protected by a single lock. So when 8 threads are actively doing IO from the same shared memory segment it hits massive lock contention. I fixed this in my test suite by allocating one shared memory segment per thread. It's not elegant but it works well enough for benchmarking.

I next hit issues with contention on the VM page lists. Besides the per-object list, there's also a global per-type list (active, inactive, etc.). There's one lock protecting each of these lists. What I found was the VM was shuffling pages between active/inactive and at the traffic rates I was doing (20+ Gbit/sec) it was a few hundred thousand pages a second being shuffled around. The solution? mlock() the whole region into memory. This prevented the VM from having the pages change state so often and eliminated that overhead.

The code for doing this sendfile() work with POSIX shared memory is in my libiapp code. It's terrible and hacky - I'm just experimenting with things for now. But with some tuning, I can get a good 35Gbit/sec out of 70,000 active TCP sockets. There's still a long way to go - I shouldn't be saturating an 8-core CPU with this traffic level when I'm doing no socket data copies. I'll write another update or two about that soon.

Now, what would I like to see? I did some experiments with physical disk IO using the FreeBSD AIO paths doing the same kinds of IO patterns as I am doing with network socket IO (4KiB to 64KiB random disk reads.) It turns out if you do everything correctly, the FreeBSD AIO code will turn physical disk IO into asynchronous disk buffer transactions by wiring the userland buffer into memory and then using that as the backing buffer memory. The overhead of doing the pmap work for this was not too high. So, I wonder if it's worth writing a new transmit path that uses the pmap code (and not the VM!) to wire in a region of memory and then use that for transmit buffers. Combined with an iovec style array of buffers and the above kqueue notification of the network IO completion, I think we can end up with a much more flexible method of doing network IO from userland without the shortcomings of using POSIX shared memory with sendfile().

Sunday, November 3, 2013

Doing arduino development on FreeBSD-HEAD

I'm a sucker for punishment.

Or, I noticed that FreeBSD's pkgng binary package repository ships with a port of the Arduino development environment. It's this java thing that wraps around avr-gcc and avrdude. It's very popular, it's open source, and I figured what the hell.

I plugged in my Arduino Leonardo and .. it was detected as a umodem device. Excellent!

.. and then it wasn't. It went away very quickly and came back as a single interface (OK) with three child interfaces (Hm, okay), but only one uhid (human interface) interface active (Not Ok.) The modem port used to program and talk to the thing wasn't there.

I then went on a bit of a journey. I found that quite some work had already been done to correct issues in the FreeBSD USB stack - however, it still wasn't working. It showed up fine - it identified itself as a generic USB serial port device, and yet umodem didn't bind to it.

Next - the umodem source code. Yes, it claimed anything identifying as a USB serial class device - but it only claimed devices that ALSO identified as an AT-class modem. Yes, a serial modem that you speak AT commands to. The Leonardo identifies itself as a USB serial class device but with NO command encoding. umodem didn't like that.

So, to the USB 1.1 standards documentation! After reading the relevant bits, I discovered that the rest of the device handling is the same! Ie, it doesn't matter whether the device says "I speak AT commands" or "I speak no commands", it's still serial. This identifier is just for the upper layer application to decide whether to send AT commands or not.

Thus the fix was simple - also claim devices that say "no commands" as well as "AT commands." That fix is in -HEAD and I hope to try and sneak it into 10.0.
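To make the matching rule concrete, here's a small model of it. This is not the umodem driver code - the constant and function names are made up for this example. The numeric values are the USB CDC ones: communications class 0x02, abstract control model subclass 0x02, protocol 0x00 for "no class specific protocol" and 0x01 for AT commands.

```python
# Hypothetical names; the numeric values come from the USB CDC class codes.
CLASS_CDC = 0x02          # communications device class
SUBCLASS_ACM = 0x02       # abstract control model
PROTO_NONE = 0x00         # "I speak no commands"
PROTO_AT = 0x01           # "I speak AT commands"


def claims(iface_class, iface_subclass, iface_protocol, accept_no_commands=True):
    """Return True if a umodem-like driver would bind to this interface.

    accept_no_commands=False models the old behaviour (AT-only);
    accept_no_commands=True models the behaviour after the fix.
    """
    if iface_class != CLASS_CDC or iface_subclass != SUBCLASS_ACM:
        return False
    accepted = {PROTO_AT}
    if accept_no_commands:
        accepted.add(PROTO_NONE)
    return iface_protocol in accepted
```

With accept_no_commands=False a Leonardo-style interface (class 2, subclass 2, protocol 0) is rejected; with the default it is claimed, and AT-command modems are claimed either way.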

And with that - FreeBSD-HEAD is now a viable development environment for the Arduino Leonardo.

Sunday, October 13, 2013

So, FreeBSD on the AR9344? What happened?

I committed a bunch of code a while ago to FreeBSD-HEAD to at least start booting on the AR934x SoCs. The AR934x SoC is a MIPS74k core - a dual-issue superscalar 11-stage pipeline MIPS32r2 CPU. It's slightly different to the existing MIPS24k stuff (which is a single 8-stage pipeline.)

So - first step - it booted up a little, then hit a machine check. At that point the FreeBSD MIPS peeps believed there was hilarity in the TLB exception handling code, so we put it to sleep for a while and I went back to real work.

Then a few weeks ago I decided to finish it off. I brought my developer board to Eurobsdcon in Malta and sat down with Warner Losh, who also has said developer board. We spent a bunch of time going over the TLB code and realised that FreeBSD's instruction/execution hazards are all.. just wrong. Then, on a whim, I read up some more about MIPS32r2 and superscalar stuff and discovered that the correct hazard instruction isn't NOPs or SSNOPs - it's EHB (execution hazard barrier.) It's 'SLL $0, $0, 3' in MIPS parlance which on older CPUs is just a NOP (since register 0 is always 'zero'.) So, this fixed the TLB management and the boot proceeded quite a bit further.
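For the curious, here's why EHB is harmless on older cores. The sketch below just builds the instruction words from the MIPS32 SLL encoding (opcode and function fields both zero); the helper name encode_sll is made up for this example.

```python
def encode_sll(rd, rt, sa):
    """Encode 'SLL rd, rt, sa' as a 32-bit MIPS32 instruction word.

    SLL lives under the SPECIAL opcode (0) with function code 0, so only
    the rt, rd and shift-amount fields contribute bits.
    """
    return (rt << 16) | (rd << 11) | (sa << 6)


NOP = encode_sll(0, 0, 0)    # sll $0, $0, 0
SSNOP = encode_sll(0, 0, 1)  # sll $0, $0, 1
EHB = encode_sll(0, 0, 3)    # sll $0, $0, 3
```

All three are shifts that write to register 0, so a core that doesn't know about EHB just sees another do-nothing shift; only the shift amount field tells them apart.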

Next - bringing up ethernet and the switch PHY. I was seeing totally crappy and invalid register values when reading/writing the attached switch chips. Even probing didn't work reliably - in fact, I got to the point where I was reading the value I'd expect from the previous register read. So, I wondered if this was another out-of-order behaviour from the MIPS74k superscalar architecture.

After digging into the MIPS bus space code, I found two things:

  1. The MIPS driver(s) don't call bus barrier functions at all - so there's no driver enforced access ordering. It was all assuming that the CPU doesn't re-order things; and
  2. The bus barrier code for MIPS was a no-op. It just plainly wasn't defined.
So, I added read/write memory barriers to the MIPS bus barrier routines and I modified the ethernet driver to use barriers. For good measure, I also added barriers to the SPI driver code as that also has a bunch of register accesses that require ordering.

And with that, the switch PHY probe/attached fine, the SPI driver worked fine and the device started booting userland off of SPI connected NOR flash.

Then, it hung. I dug into that a bit and wondered what the hell was going on. Then after a day of poking, I discovered that the interrupt acknowledgement was not working. It's a quirky thing that I should really fix in the atheros platform support - the AR71xx chips don't require the CPU peripheral interrupts to be ack'ed (eg the uart) but later chips do. I added the AR934x to the list of SoCs that need interrupts to be ack'ed and the system kept booting, all the way to userland.

Next - I haven't yet written the AR8327 support but I started fleshing out the AR934x on-board switch support. I got it probing, attaching.. but not passing any traffic. After more digging, I realised my mistake - I was writing some registers incorrectly. I would mask out the right bits to set, but then I'd always set bit 0. Sigh. So, that came up and things worked.
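That register mistake is easy to make, so here's the shape of it. The field position and width are hypothetical (not taken from the AR934x switch registers), as are the function names.

```python
FIELD_SHIFT = 4                 # hypothetical field position
FIELD_MASK = 0x7 << FIELD_SHIFT  # hypothetical 3-bit field


def write_field_buggy(reg, value):
    """Mask out the right bits, then set the wrong thing."""
    reg &= ~FIELD_MASK
    return reg | 0x1            # bug: always sets bit 0 and ignores 'value'


def write_field(reg, value):
    """Read-modify-write that actually places 'value' into the field."""
    reg &= ~FIELD_MASK
    return reg | ((value << FIELD_SHIFT) & FIELD_MASK)
```

Both versions clear the field correctly, which is why the register looks plausible when dumped; only the fixed version puts the new value where the mask was.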

Then I decided to do the wifi part. This was pretty damned simple. The HAL from Qualcomm Atheros already has support for the AR934x in it and I had already modified it to work for the AR933x SoC (which just required me to 'teach' it the FreeBSD way of exposing the calibration/configuration data from on-board flash.) So, all I had to do was this:

  1. Add the device to the kernel configuration;
  2. Add a hint pointing out where the device is mapped in IO space;
  3. Add a hint pointing out where the calibration data is in the NOR flash;
  4. Reboot.
That's it. No weeks of merging code in from Linux or the internal Qualcomm Atheros driver into the FreeBSD driver. No real debugging required. Just enable it, point it at the right place in memory/flash and .. boot it. I think this again vindicates my efforts to open source the Qualcomm Atheros HAL - I just inherit this working code for free. I don't have to try and merge it into anything.

So, I have a port that's dirty and working. There's a lot of infrastructure changes I need to commit before I can commit this port - lots of new clocking options (there are now variations on the clock rate of the MDIO bus, the MII bus connecting the ethernet port(s) to a PHY or switch), lots of new configuration options for how the on-chip ethernet port(s) map to external ports, and a bunch of other ancillary stuff that's not really worth mentioning. But it's going to show up in FreeBSD-HEAD soon.

Monday, September 16, 2013

The gymnastics required to just do a "HALT" for MIPS..

So, it turns out there's no nice, guaranteed way to implement a HALT style setup for MIPS in the idle thread in the kernel. I'll braindump what happened about two years ago to try and address this.

(And I do hope that the implementation actually works!)

When the kernel isn't running any work, it will schedule the "idle" thread to run. This has the simple(!) task of entering a sleep state, only to wake up when another interrupt fires. This saves power consumption and generates less heat.

However, it's a little trickier than just "enter idle state."

The only way to exit it is to receive an interrupt. Now, this can be device interrupts; this can be timer interrupts (and yes; the timer is a kind of device, tsk..) but without it, the CPU will stay halted. This isn't a big deal - the UNIX kernel tends to be one big event processing "thing" and these include both software and hardware events. If it's a software event that needs to happen "now", the kernel won't go into the idle loop - it'll just run the needed event. If it has to happen in the future, it'll schedule it to occur after some time has elapsed - and this will be driven by a timer interrupt. An interrupt would wake up the system from the HALT state. If the interrupt occurred just before the HALT instruction was executed, there'd be a very small window where the interrupt would occur - the interrupt routine would be called, complete, and then the HALT instruction would next execute.

Now, imagine you're an ethernet driver on FreeBSD-10 where the interrupt handler just scheduled some ithread to run in the future. Here's one example of what may happen:

  • The idle thread is executed;
  • An interrupt occurs, say to signal completion of some ethernet frame transmission;
  • The CPU kicks off the interrupt code;
  • The interrupt code doesn't find a fast interrupt handler but finds an ithread; so it schedules the ithread to run;
  • The scheduler goes to choose another thread to run and finds an ithread scheduled at a higher priority, so it schedules that;
  • The ethernet driver code in the ithread runs;
  • The scheduler then exits and re-enters the idle loop.
Everything is fine, right?

But, with the event timer changes that came in during the 9.x time frame, the halt code is now in a critical section.

So now, the following happens:

  • The idle thread is executed;
  • critical_enter() is called;
  • An interrupt occurs, which calls the fast handler, which schedules the ithread to run;
  • .. but critical_enter() has set a flag that says "no preemption", so the ithread doesn't get to run;
  • Control is returned to the idle function;
  • The idle thread gets to the point where it executes "HALT";
  • The idle thread continues to run, executing HALT.
In this instance, the ithread is scheduled but can't run before HALT runs. The ithread only runs after the next interrupt occurs.

I noticed this was happening when doing traffic tests on FreeBSD-HEAD between 802.11 and ethernet interfaces. The atheros 802.11 hardware implemented interrupt moderation so there wouldn't be tens of thousands of interrupts a second being generated. But what I saw was occasionally the receive queue being filled by packets and not drained fast enough. When digging into it, I found that due to interrupt moderation, if the interrupt came in just before WAIT was executed, the 802.11 receive function was taking a long time (sometimes up to milliseconds) to run after the interrupt came in and the ithread was actually scheduled. If I had an interrupt for each received packet, the amount of time between interrupts would have been very small (20,000 packets a second, so around 1/20000 sec per interrupt) and this problem would've been masked. But with moderated interrupts, it would be 750 microseconds or so before the next receive interrupt was generated.

Now, this is messy. There's some hacks in the idle loop code to try and skip the halt bit if the scheduler detects there's something to run. But there's still a small race window there which needs to be closed.

How can this be solved?

Apparently - the two instructions STI;HLT on x86 are atomic. Ie, there's no race window between them. If an interrupt comes through, HLT doesn't run. This doesn't happen for MWAIT or the ACPI sleep states and I am concerned we're still possibly hitting this race window from time to time. The specific behaviour is that the STI causes interrupts to be enabled following the next instruction. So yes, there's no window.

All that's left now is to make sure that interrupts are disabled before you do the scheduler check so no new interrupt processing on that thread can be scheduled. Ie, when entering cpu_idle():

  • Call critical_enter();
  • Disable interrupts;
  • See if the scheduler has anything to do - if so, enable interrupts and skip calling the idle loop
  • If there's nothing to do - and since interrupts are disabled, nothing new will have happened (like an interrupt scheduling an ithread) then just call the idle function
    • Which may or may not enable interrupts before entering the idle loop (you enter ACPI with interrupts disabled, but you enter HLT with interrupts enabled)
  • .. and then call critical_exit(), which will let the kernel continue preempting.
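Here's a toy model of the race and of the fix described in the steps above. It is not kernel code: it only tracks whether an interrupt that lands between the scheduler check and the halt gets noticed. The function name and the returned strings are made up for this example.

```python
def idle_once(irq_in_window, mask_irqs_before_check):
    """Model one pass through the idle loop, inside a critical section.

    irq_in_window: an interrupt arrives after the scheduler check but
        before the halt instruction.
    mask_irqs_before_check: interrupts are disabled before the check and
        only re-enabled atomically with the halt (STI;HLT style).
    """
    ithread_runnable = False   # set by the interrupt handler
    irq_latched = False        # interrupt held pending while masked

    # Scheduler check happens here: nothing is runnable yet, so the idle
    # thread decides to halt.

    if irq_in_window:
        if mask_irqs_before_check:
            irq_latched = True        # delivered only once unmasked
        else:
            ithread_runnable = True   # handler ran, but it can't preempt

    # Halt. A latched interrupt fires as soon as interrupts are enabled,
    # so the CPU wakes straight away instead of sleeping.
    if irq_latched:
        return "wakes immediately"
    if ithread_runnable:
        return "halts with ithread pending"
    return "halts, nothing to do"
```

The bad case is idle_once(True, False): the ithread is runnable but the CPU halts anyway and the work waits for the next interrupt. With interrupts masked before the check, the same interrupt wakes the CPU immediately.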
For MIPS however, there's a clever hack. (No, not from me.)

Here's how it works:
  • In the idle loop, it calls mips_wait()
  • mips_wait() is a bit of assembly code that will:
    • disable interrupts
    • see if the scheduler sees anything running
    • .. and if so, it doesn't bother running the WAIT instruction! Just enable interrupts and jump over the WAIT;
  • .. but the bit of code that re-enables interrupts and calls WAIT is aligned to a 16 byte boundary and the address is a symbol (MipsWaitStart).
  • Then in the exception handling code (MipsKernIntr), it sees if the instruction pointer where the exception occurred is in the 16 bytes (4 instructions) at MipsWaitStart.
  • .. if it is, it adjusts the return address from the interrupt to be after the WAIT instruction.
It's totally dirty and to be quite honest, I haven't at all tested it. Yes, it should be tested.
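The return address fixup can be modelled in a few lines. This is a sketch of the idea and not the MipsKernIntr assembly; it assumes WAIT is the last instruction in the 16 byte window, so "after the WAIT instruction" is simply the end of the window. The function name and the example addresses are made up.

```python
WAIT_WINDOW_BYTES = 16  # 4 instructions, aligned to a 16 byte boundary


def fixup_return_pc(exception_pc, wait_start):
    """Return the PC the interrupt handler should resume at.

    If the interrupt hit inside the window starting at wait_start
    (the MipsWaitStart symbol), skip past the WAIT instead of going
    back and executing it.
    """
    if wait_start <= exception_pc < wait_start + WAIT_WINDOW_BYTES:
        return wait_start + WAIT_WINDOW_BYTES
    return exception_pc
```

Because the window is aligned and fixed in size, the check is a cheap range comparison in the interrupt path.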

Tuesday, September 3, 2013

Finding low hanging fruit with PMC, or "O(wtf)" ?

I've lately been focusing on performance counter stuff on Sandy Bridge (Xeon and non-Xeon.) Part of this has been fixing some of the counters that were wrong. Part has been digesting the Intel tuning guides and the Intel micro-architecture for Sandy Bridge. It's a little different to the older school pipeline driven architecture that rules the MIPS world.

So, I fired up some of my scripts on a live cache pushing a whole lot of live Netflix video traffic. The scripts use the PMC framework in global counter mode rather than sampling mode, so it's cheap to do and doesn't affect performance.

What I found:

  1. The pipeline slots per cycle metric is around 16% - so there's a lot of stalling going on.
  2. There's a lot of memory traffic going on - around 50% of clock cycles are spent in LLC_MISS - ie, it wasn't in L1, L2 or L3/LLC (last-level cache) and thus has to be fetched from memory.
So, I started digging into why there were so many memory accesses. It turns out the biggest abuser was the cross-CPU IPI involved in synchronising page mapping tables - there are a few places calling pmap_invalidate_range() as part of sendfile() buffer completion and this was causing issues. I pointed this out and someone else has addressed it internally. (Ideally if the IO path uses unmapped buffers on amd64, there shouldn't be any need to map them in and out of KVA.) I think that saved about 4% of total clock cycles spent being stalled.

Then I found a lot of stalling going on in the mwait and ACPI sleep path. It turns out that these paths seem to involve doing ISA space IO port accesses. These are .. very slow. I've just flipped my testing over to use no mwait and use HLT.

Next - flowtable had been turned on during experimentation and I had noticed that the flowtable expire/flush code would periodically spike up. It spiked up more when more clients and more TCP flows were connected. It showed up in both memory accesses and clock cycles busy PMCs - and the reason made me laugh out loud.

The flowtable uses a bitstring_t - effectively an array of bytes treated as a bitmap, like select() FD_SET's - and would walk this to look for flows to expire.

The expiry code would walk the list looking for flows to expire - it would loop over the entire set, calling ffs() over the whole set to look for the next new flow to check.

.. so looping over looping over the whole set. O(n^2). Right there, in the flow cleaning path. Doing byte offset fetches, rather than 32-bit fetches. Everything about it was ridiculous. As we scaled up to serve more flows the flowcleaner CPU cycle count was spiking really, really hard.

I pointed this out in an email to my coworkers and fell asleep. It was fixed when I awoke - a co-worker fixed it to be correctly O(n) whilst I was sleeping. It's now totally disappeared from the CPU cycle and stall analysis.
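To show how bad the restart-from-the-top pattern is, here's a toy comparison. It is my reading of the behaviour described above, not the flowtable code: the bitmap is a plain Python list and a "probe" is one look at one bit. Both function names are made up.

```python
def scan_restarting(bits):
    """For every flow found, rescan the bitmap from the start."""
    probes = 0
    found = []
    last = -1
    while True:
        nxt = None
        for i in range(len(bits)):      # always restarts at bit 0
            probes += 1
            if bits[i] and i > last:
                nxt = i
                break
        if nxt is None:
            break
        found.append(nxt)
        last = nxt
    return found, probes


def scan_once(bits):
    """Walk the bitmap a single time, remembering where we are."""
    probes = 0
    found = []
    for i, bit in enumerate(bits):
        probes += 1
        if bit:
            found.append(i)
    return found, probes
```

Both return the same flows. For a fully populated bitmap of 100 entries the single pass does 100 probes while the restarting version does 5150, and the gap grows quadratically as the number of flows goes up.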

So, I've just been chipping away at things here and there. There are some larger scale issues that I really want to address but I'd like to make sure all the odd, silly and remaining low hanging fruit are addressed. Then comes the fun stuff.

Monday, August 19, 2013

This blog post is mostly so I don't forget this kind of stuff. The referenced Intel document mentions "% execution stalled". This is the Core i7 document rather than the Sandy Bridge document, but bear with me.

The formula is:


However, there's no UOPS_EXECUTED.CORE_STALL_CYCLES in the PMC documentation, nor is it in the Intel SDM chapter on performance counters.

But wait! It kind of is there. There /is/ UOPS_EXECUTED.THREAD, which is "Counts the total number of uops to be executed per thread each cycle." In the same block, it says that to count stall cycles, set CMASK=1, INV=1. Ok, so how does one do that with PMC?

# pmcstat -S UOPS_EXECUTED.THREAD,inv,cmask=1 -T -w 5

Now, it seems to be showing me the ACPI wait and MWAIT functions as high sample events - which is odd, as I didn't think this particular PMC measured C1 and MWAIT states. I'll chase this up.

For Sandy Bridge it's UOPS_DISPATCHED.THREAD - this counts dispatched micro-operations per-thread each cycle. CMASK=1,INV=1 counts the number of stall cycles.

Tuesday, August 13, 2013

Profiling on superscalar architectures, or "no, instruction counts don't necessarily matter nowadays.."

I could spend a lot of time brain dumping various things to do with profiling on anything more recent than Pentium 1 (or, well, anything remotely to do with MIPS of that era, but I digress.) In any case, there's plenty of useful writings about it so I'm not going to - just jump over to .

However, I thought I'd provide a basic example of where "instructions" doesn't actually matter, as well as a shortcoming of the current FreeBSD tools.

My network testing stack does a whole lot of read() and write() syscalls to achieve its goal. For those who know what's going on, I hope you know where I'm taking this. Anyway..

Firstly, the standard. "pmcstat -S instructions -T". This prints a "top" like output counting instructions retired.
Figure 1. # pmcstat -S instructions -T -w 5

This looks like the contention is in the mutexes protecting socket receive and the TCP output path. Sure, but why is it contending?

The problem with doing it based on instructions retired is that it hides any issues to do with stalls. There's a bunch of sources of potential stalls - memory reads, memory writes, stuff not being in cache but being needed for instructions that are running. They're generally side-effects of operations not being able to complete in time (eg if you have a whole lot of completed operations that need to push stuff out to memory to continue, but there's no free bandwidth to queue memory writes) but sometimes they're just from straight bulk memory copies.

If you're interested about the Intel microarchitecture and how all of these pieces fit together to process an instruction stream in parallel, with all of the memory controller paths coming in and out, have a read of this: .

Ok, so let's look at general stalls. There's a bunch of L1, L2, LLC (last level cache, think "L3" here) operations that can be looked at, as well as stuff that FreeBSD's PMC framework doesn't support  - notably some of the stuff on the "uncore" - the shared cache and pipelines between all cores on a socket. It supports the events implemented using MSRs, but not events implemented using the PCIe configuration space.

So, without further ado:

Figure 2. # pmcstat -S RESOURCE_STALLS.ANY -T -w 5
Yup. This looks much more like what I'd expect. The CPU is stalled doing copyout(). This is a mostly-read() workload, so that's what I'd expect. mb_free_ext() is interesting; I'll go look at that.

Now, copyout() is doing a bulk copy. So, yes - I'd expect that to be hurting. mb_free_ext() shouldn't be doing very much work though - I'll do some digging to see what's going on there.

The final output is from the Intel performance tuning overview tools. You can find them here - . There's a nice overview tool (pcm.x) which will output the basic system overview. I like this; it gives a very simple overview of how efficient things are.
Figure 3. "pcm.x 1" running on FreeBSD-10.

Now, this isn't a stock version of pcm.x - I've hacked it up to look slightly saner when doing live reporting - but it still provides exactly the same output in that format. Note the instructions per CPU cycle and the amount of cache misses. It's .. very inefficient. Tsk.

So in summary - don't just do instruction count based profiling. You want to establish whether there are any memory and cache bottlenecks. If you're doing HPC, you want to also check to see if you're hitting SSE, FPU, divider unit and other kinds of math processing stalls.

Now - what would I like to see in FreeBSD?

  • The hwpmc framework needs to grow support for more of the socket and system events - right now it's very "core" focused.
  • Some better top-level tools to provide a system summary like Intel's pcm.x tool would be nice.
  • Some better documentation (read: better than just this wiki page!) looking at how to actually profile and understand system behaviour would be desirable.
  • .. and tying this into dtrace would be great too.
  • Once we get some (more) NUMA awareness, it would be great to have the uncore stuff reporting on things like QPI traffic, cache line and memory accesses from remote sockets/cores, and other stuff to do with NUMA allocation and awareness.
Mostly, however, I'd like to get this stuff into the hands of developers sooner rather than later so they can start running this as a normal day to day thing.

Wednesday, August 7, 2013

Why, oh why am I seeing RST frames from FreeBSD when I have a high connection rate?

I started seeing something odd in my testing. I was only getting around 120-odd new connections a second being accepted by the test server. I know FreeBSD needs some tuning to make it perform at high request rates, but .. hell. The odd thing? The other requests were sometimes getting RST frames (and the client would error out with a "connection reset by peer"; sometimes not.)

After doing some digging, I discovered that.. I was doing something a little odd in my testing framework and it (surprise!) elicited some very negative behaviour from FreeBSD. Said behaviour is actually valid - it's to avoid denial of service attacks. But it's worth talking about.

My test client was bursting 'n' connections per thread each second. So, I would do a test of say, 128 new connections back to back, each second, from each thread. This is definitely odd (but easy to implement!)

Here's what the server was doing.

Firstly - there's a "syncache". The syncache handles incoming embryonic requests (ie, the SYN from a remote peer.) It's separate from the rest of the TCP stack so a large flood of new connections (valid or otherwise) doesn't need to grab TCP stack locks in order to process these frames, or waste RAM with PCB (protocol control block) entries for these embryonic requests. It also makes it easier to time out half-completed requests - the PCB will only have completed or closing connections.

If the handshake succeeds but there's a failure in allocating a new PCB or socket for the connection, the TCP stack can return an RST to the peer.

If the syncache fills up, it should be sending syncookies. (google "SYN cookies" for more information.) The point of using SYN cookies is that it doesn't fill the syncache up with embryonic connections - there's a cookie that the client will reflect back to the server that validates the connection.

If the syncookie exchange succeeds but the application can't create new sockets fast enough (ie, servicing the accept() socket queue quickly enough), the TCP stack will throw an RST back at the client.

Now, for the fun bits.

  • The RST responses sent back by the server are rate limited - via net.inet.icmp.icmplim. Yes, it's not just for rate limiting ICMP responses.
  • So the client would see some connections hit an RST and fail immediately; others just wouldn't get the ACK and would try again, so..
  • .. over time, there'd be a burst of new connections every second from the client (causing the issue) as well as the connection retransmits for embryonic-but-not-yet-finished connections
When I staggered the new connections over smaller, quicker bursts (so instead of 128 connections a second per thread, I'd do 12 connections every 100 ms) then the problem went away. This is better behaviour (I can connect thousands of new connections a second here!) but I still expect to see this problem in the real world. As I approach my intended TCP connection rate (100,000 connections a second - which isn't specifically a Netflix requirement, but an "Adrian proxy load" requirement!) I'm going to start seeing microbursts of new connections that will temporarily look like back-to-back new connections, thus triggering this bug.

So, to work around this for now, one just has to bump up the accept queue depth (sysctl kern.ipc.somaxconn) to something much higher than the default of 128.
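The sysctl is only half of it: the application also has to ask for a deep queue, because the kernel caps whatever backlog is passed to listen() at the system-wide limit. A minimal sketch, with a made-up helper name and an arbitrary backlog value:

```python
import socket


def open_listener(backlog=4096):
    """Open a loopback TCP listener that requests a deep accept queue.

    The backlog here is a request; the kernel silently clamps it to the
    system-wide maximum, so both knobs need raising.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    sock.listen(backlog)
    return sock
```

A server that hardcodes a small backlog in its listen() call will still overflow under bursts, no matter how high the sysctl is set.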

Now - why is this happening? My theory is this:
  • We're getting this burst of frames coming in the NIC;
  • The syncache / cookie code is being run in the NIC RX path;
  • The new connection path gets run and quickly overflows the syncache and new connection queue handling in the TCP stack, as the userland code doesn't get a notification in time
  • .. so the accept queue overflows before userland gets a chance to run, and we start sending rate limited RSTs.
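Here's a toy model of that theory. It is not a TCP stack: time advances in 1 ms ticks, arrivals are queued before userland gets to run in each tick, and anything that doesn't fit in the accept queue is rejected. The thread count (8), the drain rate and the function name are assumptions made up for this example.

```python
def simulate(arrivals_by_tick, queue_depth=128, drain_per_tick=200):
    """Return (accepted, rejected) for a list of per-tick arrival counts."""
    queued = 0
    accepted = 0
    rejected = 0
    for arriving in arrivals_by_tick:
        # The RX path runs first: fill the accept queue, reject the rest.
        taken = min(arriving, queue_depth - queued)
        queued += taken
        rejected += arriving - taken
        # Then userland gets to run and calls accept() a bunch of times.
        drained = min(queued, drain_per_tick)
        queued -= drained
        accepted += drained
    return accepted, rejected


THREADS = 8  # hypothetical client thread count

# 128 connections per thread, all back to back at the start of the second.
burst = [THREADS * 128] + [0] * 999

# 12 connections per thread every 100 ms instead.
staggered = [THREADS * 12 if tick % 100 == 0 else 0 for tick in range(1000)]
```

With a queue depth of 128 the burst case accepts 128 connections and rejects 896, while the staggered case accepts all 960 and rejects none, even though the per-second connection rate is about the same.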

Tuesday, August 6, 2013

Hacking on the Intel 10GE driver (ixgbe) for fun and .. not-crashing

My job at Netflix can be summed up as "make the Open Connect Platform better." This involves, among other things, improving FreeBSD in ways that make it more useful for content delivery.

I interpret that as including "find ways to break things."

So, I've written a crappy little multi-threaded network library which is absolutely, positively crappy and FreeBSD specific. Right now all it does is TCP and UDP network smashing using read() / write() for TCP, and recvfrom() / sendto() for UDP.

The aim with this is to stress test things and find where they break. So, the first thing I've written is a very simple TCP client/server - the client connects to the server and just write()s a lot of data.

.. except, that the clients are light-weight, in C, and multi-threaded.

So, I end up with 'n' threads, with 'm' TCP sockets each, all doing write(). Right now I'm watching 4 threads with 12,288 sockets each sending data.

The test hardware is a pair of 1ru supermicro boxes with Intel E3-1260L CPUs in them, 32GB of RAM and dual-port Intel 82599EB 10GE NICs. The NICs are channel-bonded (using LACP) through a Cisco ASR9k switch.

I initially tested this on FreeBSD-9. I was rudely reminded of how utterly crappy the default mbuf sizing is. I constantly ran out of mbufs. Then, since FreeBSD-10 is on the cards, I just updated everything to the latest development branch and ran with it.

The result? The test ran for about 90 seconds before things got plainly pissed. The client (sender) would immediately hang. I'd get short packet errors, the LACP session would get unstable... everything was just plain screwed. The server (receiver) never saw any issues. I also saw lots of RX stalls, where one ring would seemingly fill up - and the whole RX path just ground to a halt. In addition, I'd also see a whole lot of out of order TCP segments on the server (receiver) side. Grr.

So, cue some driver hacking to see what was going on, reading the Intel 82599EB datasheet (that's freely available, by the way!) as well as discussions with Intel, Verisign and a few other companies that are using Intel 10GE hardware quite heavily, and here's what was discovered.

There's a feature called "RX_COPY" where small packets that are received are copied into a small, new mbuf - and the existing receive buffer is left in the RX ring. This improves performance - there's less churn of the mbuf allocator for those larger buffers. However, there were some dangling pointers around the management of that, leading to some stuff being DMAed where it shouldn't .. which, since ACKs and LACP frames are "small", would be triggered by this. Since the sender (client) is sending lots of segments, it's going to be receiving a lot of ACKs and this explains why the receiver (server) didn't hit this bug.

Next, the RX stalls. By default, if one of the RX rings fills up, the whole RX engine stalls. This is apparently configurable (read the data sheet!) but it's not on by default in FreeBSD/Linux. One of the Verisign guys found the problem - in the general MSIX interrupt handler path, it was acknowledging all of the interrupts that were currently pending, rather than only the ones that were activated. The TX/RX interrupts are routed to other MSIX messages and thus should be handled by those interrupt threads. So, under sufficient load - and if you had any link status flaps - you may hit a situation where the non-packet MSIX interrupt thread runs, ACKs all the interrupts, and you immediately end up filling up the RX ring. You won't generate a subsequent interrupt as you've already hit the limit and the hardware won't generate anything further.. so you're stuck. That's been fixed. The annoying bit? It was fixed in the Linux driver but not the FreeBSD driver. Growl.
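
The difference between the two behaviours is easiest to see as a bitmask operation. This is a sketch in Python, not the driver code: the bit positions and names below are made up for illustration and do not match the real 82599 interrupt cause register layout.

```python
# Hypothetical interrupt cause bits (NOT the real 82599 layout).
QUEUE_BITS = 0x0000FFFF  # causes owned by the per-queue TX/RX MSIX vectors
LINK_BIT = 0x00100000    # cause owned by the non-packet ("link") vector


def ack_buggy(pending):
    """Acknowledge everything that is pending, including queue causes that
    belong to other vectors. Returns the causes still pending afterwards."""
    return 0


def ack_fixed(pending, mine=LINK_BIT):
    """Acknowledge only the causes this vector owns."""
    return pending & ~mine


# A link flap arrives whilst RX queue 2 also has work pending.
pending = LINK_BIT | (1 << 2)

print(hex(ack_buggy(pending)))  # 0x0 - the RX queue's interrupt has been eaten
print(hex(ack_fixed(pending)))  # 0x4 - the RX queue vector will still fire
```

In the buggy case nothing is left to wake the RX queue's own interrupt thread, which is how the ring ends up full with no further interrupt coming.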

So, now the driver behaves much, much better. I can smash it with 20 gigabits a second of TCP traffic across 50,000 odd sockets and nary a crash/hang. But what bugs me is the out-of-order TCP packets on the receiver side of things.

The reason - it's highly likely due to the driver architecture. The driver will schedule deferred packet processing using the taskqueue if the interrupt handler ends up with too many packets to work with. Now, this taskqueue is totally separate to the interrupt thread - which means, you can have both of them running at the same time, and on separate CPUs.

So I'm going to hack the driver up to not schedule the taskqueue and instead just poke the hardware to post another interrupt to do further processing. I hope this will resolve the out of order TCP frames being received.

Saturday, June 29, 2013

Doing traffic with the Carambola 2..

Now that the port is working, I've started doing some traffic with the carambola 2 board on FreeBSD.

So far, so good:

# athstats
546236       data frames received
509242       data frames transmit
155          tx frames with an alternate rate
14818        short on-chip tx retries
13617        long on-chip tx retries
645          tx failed 'cuz too many retries
MCS7         current transmit rate
2            recv eol interrupts
9            tx frames with no ack marked
506786       tx frames with short preamble
1414         rx failed 'cuz of bad CRC
1543         rx failed 'cuz of PHY err
    12           OFDM restart
    1531         CCK restart
20610        beacons transmitted
71           periodic calibrations
-0/+0        TDMA slot adjust (usecs, smoothed)
24           rssi of last ack
25           avg recv rssi
-96          rx noise floor
2447         tx frames through raw api
39730        A-MPDU sub-frames received
494045       Half-GI frames received
5967         40MHz frames received
8037         CRC errors for non-last A-MPDU subframes
2            CRC errors for last subframe in an A-MPDU
498972       Frames transmitted with HT Protection
3            TX Timeout
177          Number of frames retransmitted in software
15717        A-MPDU sub-frame TX attempt success
177          A-MPDU sub-frame TX attempt failures
1            spur immunity level
4            first step level
128          OFDM weak signal detect
9            CCK weak signal threshold
108          ANI increased spur immunity
105          ANI decrease spur immunity
108          ANI increased first step level
105          ANI decreased first step level
943666       cumulative OFDM phy error count
108574       cumulative CCK phy error count
2            ANI parameters zero'd for non-STA operation
44           ANI forced listen time to zero
44           ANI calculated listen time < 0
13603        missing ACK's
14996        RTS without CTS
504970       successful RTS
34928        bad FCS
Antenna profile:
[0] tx   496835 rx        0
[2] tx        0 rx   546236

Wednesday, June 26, 2013

Making the AR9330 SoC wifi, or "how it feels doing things right.."

Well, "doing it right" is subjective. Sure. I'll grant you that.

I brought up the AR9330/AR9331 SoC support a couple of months ago. Unfortunately the Atheros reference board (AP121) comes with 16MB of RAM and 4MB of flash - which is just painful to do FreeBSD-HEAD development in.

Yes, I know. 16MB of RAM is tons of space... for FreeBSD-4. Anyway. That is a rant for another day.

So I managed to bring up the basic SoC support (which took longer than I thought - I had to learn how to write a FreeBSD uart driver!) but I decided to put wifi on hold until I found a board with more RAM and flash.

Along comes the Carambola 2. It's an AR9330, but with 64MB RAM, 16MB flash and a full-featured uboot. This is perfect for .. well, anything. And it's 30 Euros in quantities of one. Wait, it's cheap, it's fully-featured and it's available online? No way. What's the catch?

The catch - it wasn't running FreeBSD.

So I finally decided to bring up wifi support on FreeBSD.

The AR9300 HAL from Qualcomm Atheros includes the AR9330/AR9331 SoC wifi support. So I had to make it compile and make it work. How hard could it be?

Firstly - I wasn't compiling it in by default as it's only really useful for the SoC and not for normal PCIe NIC support. So, I needed to add that in. Luckily, I only had to set AH_SUPPORT_HORNET in the source. Cool.

Next - the bus glue. The SoC internal bus isn't PCIe, it's what they call AHB, or "Atheros Host Bus." It's a derivative of a standard on-chip peripheral interconnect bus. The FreeBSD ath_ahb driver only supported AR9130, so I had to extend it to support non-AR9130 devices. That got it probing and attaching, but it wasn't finding the calibration / configuration space.

Next - gluing in the calibration data. It's on-board in the system flash, rather than on-chip (OTP) or an external EEPROM. The EEPROM space is 16KiB in size, rather than the 4KiB space used by the AR9xxx series SoCs. Also, the AR9300 HAL already seeks into the EEPROM space to grab the data at offset 0x1000, so I don't have to do that like I do with the AR9130 and related chips.

Finally - I had to teach ar9300_attach() that it needed to copy the EEPROM data I was giving it from ath_ahb into the copy it uses when setting things up.

And... that was it. After that, it booted and came up correctly. I was shocked.

You can find the boot log and dmesg at .

I haven't yet tested 802.11s (mesh) on this stuff, nor have I made TDMA work with this series of chips. But it's my eventual goal to make this board one of the "gold standard" boards for people wishing to enable their projects with wifi mesh. I bet it'll work out of the box as it stands, so if you're up for a bit of tinkering, buy a handful and set it up!

Enjoy! It's the best 30 euro you'll spend!

Friday, June 14, 2013

Working on Bluetooth Coexistence

I decided to bite the bullet and start hacking on bluetooth coexistence on these Atheros NICs. It's a bit of a rabbit hole.

I'll write up a bit more documentation on this when I'm not overly tired, but the general overview is pretty simple: "It's all done in software."

The bluetooth and wifi stacks need to speak to each other to know when is an appropriate time to prefer wifi traffic or bluetooth traffic. When pairing, bluetooth should be preferred. When scanning, associating, authenticating and rekeying, wifi should be preferred. When different profiles are active (eg A2DP audio), the bluetooth traffic should be periodically given preference so the A2DP frames can go out reliably. This has to be controlled in software.
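
As a rough sketch of what that software policy could look like, here it is as a tiny Python function. Everything here is an assumption for illustration: the state names, the function name, the tie-break order (pairing wins over wifi-critical states), the "every 4th slot" A2DP cadence and the wifi default are mine, not what the driver does.

```python
# wifi states where losing frames is really painful
WIFI_CRITICAL = {"scanning", "associating", "authenticating", "rekeying"}


def preferred_radio(wifi_state, bt_pairing, a2dp_active, slot, a2dp_period=4):
    """Pick which radio gets priority for this time slot."""
    if bt_pairing:
        return "bt"    # pairing: prefer bluetooth
    if wifi_state in WIFI_CRITICAL:
        return "wifi"  # scan/assoc/auth/rekey: prefer wifi
    if a2dp_active and slot % a2dp_period == 0:
        return "bt"    # periodically let the A2DP frames out
    return "wifi"      # assumed default


print([preferred_radio("running", False, True, s) for s in range(8)])
# ['bt', 'wifi', 'wifi', 'wifi', 'bt', 'wifi', 'wifi', 'wifi']
```

The point is that both stacks need to feed their state into something like this, which is why they have to talk to each other.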

So to make this work well on FreeBSD, I'll have to teach the wifi and bluetooth stacks to interface with each other somehow so this can be synchronised.

I have basic (static) coexistence working with the AR9285+AR3011 combo NIC. That's now in -HEAD.

I'm working on basic (static) coexistence on the AR9485+AR3012 combo NIC, however my NIC has an older BT part which requires quite a bit of dancing to make work. I'll have to teach ath3kfw how to load the config and firmware image for the required NIC. It's going to take some time but it'll be worth it.

I was hoping that FreeBSD would have basic A2DP support but it currently doesn't. I'd love to see that happen as it'd simplify a lot of my development/testing - as I can then do audio stream testing both playing and recording audio, then stream that over wifi.

Oh well. Another day of hacking!

Monday, June 10, 2013

So long, and thanks for all the fish!

After 18 months at Qualcomm Atheros, I decided I needed a bit of a change.

This is what I sent out to the open source community:

Hi all,

This Friday will be my last day at Qualcomm Atheros. I've enjoyed working with the extremely bright and driven engineers and designers that make the wireless chips and SoCs that people everywhere take for granted. I've achieved a bunch of goals both with their internal product development and open source. But now it's time to move onto different things.

I'd especially like to thank Luis Rodriguez for introducing me to the QCA folk and helping me get access to the Atheros open source project, as well as the follow-up discussions that led to me being hired. The open source wireless community has been driving innovation in a lot of areas for a number of years. I'd like to hope that I've had a small, positive effect on that. I wish you all the best of luck in pushing forward and continuing to innovate.

Now, I'm still NDA-enabled and I quite like hacking on this wireless stuff so I won't be quitting hacking on things. I will just have other things on my mind.

Good luck to you all!

Now, this generated a flurry of private emails asking me what happened and where I'm going to.

So, the summary - I accepted a job at Netflix, as part of their OpenConnect CDN team.

They've built a world-wide CDN using FreeBSD and they're looking to continue growing and improving it. They've committed to improving FreeBSD's network, storage and VM layer to facilitate moving tens of gigabits of Netflix video traffic per server. And, they're going to open source the bulk of it. They realise that the best benefit from open source comes from working with open source - and that's exactly what they've done. They've contributed back their improvements and fixes.

I've enjoyed my time at Qualcomm Atheros. The people are brilliant, the hardware is excellent and it was a great learning experience. I got to experience what it was like working at a silicon company during chip design, validation and bring-up - both the good and the bad bits. But when it came down to it, I couldn't contribute to and improve the process in any meaningful way. I was one engineer in a very large, diverse organisation - and like large organisations, things move slowly.

So, I hope to continue to maintain close ties with the hardware and software people at Qualcomm Atheros. I hope to continue hacking on the FreeBSD wireless stack in my spare time, as I have been to date. I wish I could've contributed more positively to their evolving hardware and software strategy. But there's only so much an engineer in an established company can do, and that engineer wasn't going to be me.

Sunday, June 2, 2013

The fitbit, or "making me aware of all the exercise I'm not doing"

A friend of mine (hi Sabrina!) uses a Fitbit to track her daily activities. It's a little device that tracks your movement and gives you a simple overview of how active you are (or aren't.)

Now, I don't really believe that its calorie counting, stair counting and step counting is entirely accurate. It's just doing it based on an accelerometer and I've seen it occasionally double count walks. That's fine.

But what it does do is pretty nifty: it's reminding me of exactly how freaking inactive I am being a salaried computer programmer. I'm not spending an hour or two a day walking. I'm not really doing any kind of strenuous activity outside of occasionally going to the gym.

This thing reminds me with one simple number (or flower, if you like that kind of thing) exactly how inactive you are. And that to me is worth more than millions of lines of cute looking websites to track your daily progress.

So, now I have no excuse.

Friday, May 31, 2013

Trying to implement A-MSDU support, or "what happens when things aren't done in the right order."

I've been looking at implementing A-MSDU support in net80211. This has crept up for a few reasons:

  • It means you can do basic MSDU aggregation without needing the full block-ack window mechanics;
  • It allows you to do aggregation of very small frames into one larger MPDU that you can then stuff into an A-MPDU;
  • I want to leverage it for TDMA.
So, the background.

A-MSDU is where you take a bunch of MSDUs and shovel them into a single, larger MPDU. They're all destined to the same end-node and they all are transmitted/retransmitted together. There's a single sequence number assigned to the A-MSDU, so the hardware will have to retransmit it as a single unit.

Which is great for things like TCP ACKs, which are tiny and waste a lot of airtime. If you can shovel a bunch of them into a larger A-MSDU and then transmit that inside an A-MPDU, you get a double bonus - your A-MPDU sub-frame sizes are large, so you won't hit any "minimum frame size" limits that various chips may have.

And for TDMA it's an instant win - I can just use A-MSDU with no ACK for now and achieve 11n throughput with minimal effort. All I have to do is write the A-MSDU support and I'll get 11n throughput for free with TDMA.

So all I have to do is write A-MSDU support, right? ... right?

It turns out that from an architectural standpoint, it's a pain in the ass to write.

For A-MPDU it's easy. You can just do it in the driver. It looks like a series of individual MPDUs that are already 802.11 encapsulated. So you can just buffer those in the driver and transmit/retransmit them. For net80211 (FreeBSD) and mac80211 (Linux) this is great - both stacks pass an already-encapsulated 802.11 MPDU to the driver.

But for A-MSDU it's a bit hairier. We're not aggregating already encapsulated 802.11 frames - we're encapsulating multiple 802.3 frames together. Which means the stack itself has to glue together a bunch of 802.3 frames into an A-MSDU, then pass that as an MPDU to the driver.
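
For reference, the gluing itself looks roughly like this. It's a Python sketch rather than net80211 code: each subframe is a destination address, a source address and a 2-byte big-endian length, followed by the payload, with every subframe except the last padded out to a 4-byte boundary. The function name is made up, and the payload is treated as an opaque blob (in real life it carries the LLC/SNAP header too).

```python
import struct


def build_amsdu(msdus):
    """msdus: list of (dst_mac, src_mac, payload); MAC addresses are 6 bytes.
    Returns the A-MSDU body as bytes."""
    out = bytearray()
    for i, (dst, src, payload) in enumerate(msdus):
        sub = dst + src + struct.pack("!H", len(payload)) + payload
        if i != len(msdus) - 1:
            # Pad all but the last subframe to a multiple of 4 bytes.
            sub += b"\x00" * (-len(sub) % 4)
        out += sub
    return bytes(out)


dst = bytes.fromhex("020000000001")
src = bytes.fromhex("020000000002")
body = build_amsdu([(dst, src, b"abc"), (dst, src, b"hello")])
print(len(body))  # 39: (14 + 3 + 3 pad) + (14 + 5)
```

The resulting body is what gets handed to the driver as a single MPDU payload, with one sequence number for the lot.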

That bit isn't hard.

The first bit - figuring out what the maximum A-MSDU size is. Now, the naive solution is just to aggregate as much as you can to the 7935 byte maximum A-MSDU size boundary, then transmit that. Great - except that there are regulatory limits and QoS limits on how long an individual frame can take to transmit. So when you create an A-MSDU, you actually want to limit the size.

Now, what do you limit it to? It has to be based on how long it'll take to transmit, so here's the tricky bit - you already need to have made your transmit rate decision first. If you haven't made that decision, you can't calculate how long it'll take to transmit the frame.
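
A back-of-the-envelope version of that calculation, in Python. This is deliberately naive: it ignores the preamble, 802.11 headers, inter-frame spacing and the ACK, and the function name and the 4ms default are assumptions for illustration. It's only here to show why the rate decision has to come first.

```python
def max_amsdu_bytes(rate_mbps, max_airtime_ms=4.0, limit=7935):
    """Largest A-MSDU (in bytes) that fits in max_airtime_ms at rate_mbps,
    capped at the 7935 byte maximum A-MSDU size."""
    bits = rate_mbps * 1000.0 * max_airtime_ms  # 1 Mbit/s for 1 ms = 1000 bits
    return min(limit, int(bits // 8))


print(max_amsdu_bytes(6.5))   # 3250 - a slow rate limits the aggregate
print(max_amsdu_bytes(65.0))  # 7935 - a fast rate hits the size cap instead
```

Same frames, same queue, wildly different limits depending on the rate you picked.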

For the short term I'm going to just ignore this and write the A-MSDU support for FreeBSD in net80211 and aggregate up to the maximum limit. It's good enough to do basic testing of the feature itself. But I do need to add that maximum frame limit for another reason: QoS.

With QoS, I have a specific slot time transmit opportunity limit. I have to do two things:
  • Not exceed the slot time entirely by scheduling a frame so large it will actually exceed the transmit opportunity limit, and
  • Schedule any other subsequent frames in order to try and "fill out" the rest of the transmit opportunity window.
This is really important for TDMA. Say I have an 8ms slot window but can only transmit 4ms at a time due to regulatory concerns. If I can schedule a 4ms long A-MSDU then great, that's what I'll do. But say this happens:
  • I have a 8ms long window;
  • My first frame is a long one, and it takes 3ms;
  • I then have five 0.5ms long frames afterwards.
What I don't want to do is create an A-MSDU with all of those frames. I want to create a 4ms frame by adding in two 0.5ms frames to that 3ms frame, or transmit the 3ms frame followed by four 0.5ms long frames. I don't want to aggregate the second set of smaller frames into a 4ms long frame and have it be "too long" to fit into the rest of the transmit opportunity.
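
Here's that scheduling decision as a greedy Python sketch, using the numbers from the example (a 3ms frame followed by five 0.5ms frames, an 8ms window and a 4ms per-frame limit). The function name is made up, and it ignores inter-frame gaps and ACK time, so treat it as an illustration rather than a real scheduler.

```python
def pack_into_txop(durations_ms, txop_ms=8.0, max_frame_ms=4.0):
    """Greedily group frames (in order) into aggregates that each fit within
    max_frame_ms and together fit within txop_ms.
    Returns (aggregates, frames_left_for_next_window)."""
    aggregates = []
    current = []
    remaining = txop_ms
    pending = list(durations_ms)
    while pending:
        budget = min(max_frame_ms, remaining)
        if sum(current) + pending[0] <= budget:
            current.append(pending.pop(0))
            continue
        if not current:
            break  # next frame doesn't fit in what's left of the window
        aggregates.append(current)
        remaining -= sum(current)
        current = []
    if current:
        aggregates.append(current)
    return aggregates, pending


print(pack_into_txop([3.0, 0.5, 0.5, 0.5, 0.5, 0.5]))
# ([[3.0, 0.5, 0.5], [0.5, 0.5, 0.5]], [])
```

The first aggregate is topped up to exactly 4ms and the remaining small frames go out as a second, 1.5ms aggregate in the same window.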

So, things get hairy.

But wait, there's more.

What about handling frame re-transmission? If you use the Atheros (or others, for that matter) hardware frame re-transmission, you can have the hardware re-transmit the frame for you at various rates, starting with the highest one and then trying slower ones. You now have a similar issue - what if the frame is within your transmit opportunity at the fastest rate, but not at the slowest rate? What the FreeBSD and Linux Atheros drivers do for A-MPDU is pretty stupid - they use hardware re-transmission at slower rates, but limit the A-MPDU size to not exceed the maximum transmit length (4ms) at the slowest rate.

I'd rather have it never use hardware multi-rate retransmission and just step down to a slower rate. It'll re-calculate the maximum length and re-aggregate frames. It's fine, it'll be slightly less efficient but it'll work.

But for A-MSDU, it is done in the stack rather than the driver. So imagine this:
  • You've buffered a bunch of 802.3 frames into a staging area, to put into an A-MSDU;
  • You make a transmit rate selection and that limits how big your A-MSDU is;
  • You assemble the A-MSDU and pass that MPDU down to the driver;
  • The driver tries transmitting it and fails, so you should retransmit it at a lower rate;
  • .. except now you really don't know if it didn't make it to the remote end or not. Did you fail to hear the ACK, or?
Now comes the tricky bit. All you know at this point is that it didn't ACK. You don't know whether it didn't transmit or whether you didn't hear the ACK (and the receiver did actually receive it, ACK it and push it back up to the network layer.)

If you retransmit the MSDU at a lower rate, the receiver can eliminate the duplicate by looking at the sequence number and seeing that it has already received that frame.

But if you retransmit it at a lower rate that exceeds the TXOP window size, you will be breaking QoS requirements. Your hardware (eg Atheros with the right bit enabled!) may even just flat out refuse to transmit the frame, returning it as failed because it automatically failed to fit into the transmit opportunity window. So, what do you do?

If you retransmit it at a lower rate, it's going to automatically fail. What about pulling apart the A-MSDU into two sets of MSDUs, then treating it as two A-MSDUs to transmit? That way both will fit into the maximum transmit duration at the given transmit rate.

The problem here is you don't know that the receiver did actually not hear the A-MSDU. All you do know is you didn't hear the ACK. So if you do this, and assign new sequence numbers to the two new A-MSDUs, it's quite possible that the receiver will hear the old A-MSDU and the two new A-MSDUs, and pass duplicated frames up to the network layer.
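
The failure mode is easy to demonstrate with a toy receiver in Python. The class and names are made up, and a real receiver only remembers recent sequence numbers per station/TID rather than keeping a set forever, but the logic is the same.

```python
class Receiver:
    """Toy receiver that drops frames whose sequence number it has seen."""

    def __init__(self):
        self.seen = set()
        self.delivered = []  # MSDUs passed up to the network layer

    def receive(self, seq, msdus):
        if seq in self.seen:
            return False  # duplicate, dropped
        self.seen.add(seq)
        self.delivered.extend(msdus)
        return True


rx = Receiver()
rx.receive(100, ["a", "b", "c", "d"])  # received fine, but we never hear the ACK

# Option 1: retransmit the same A-MSDU with the same sequence number.
print(rx.receive(100, ["a", "b", "c", "d"]))  # False - dropped as a duplicate

# Option 2: split it into two A-MSDUs with new sequence numbers.
rx.receive(101, ["a", "b"])
rx.receive(102, ["c", "d"])
print(rx.delivered)  # ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'] - duplicates
```

Once the sequence number changes, the receiver has nothing left to de-duplicate with.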

So, the TL;DR version here - we either form A-MSDUs that are small enough to be retransmitted in software at the lower rates (biting some inefficiency but allowing for retransmission), or we just fail the transmit outright and don't retry.

So, it's complicated. Complicated and annoyingly messy.

Tuesday, April 30, 2013

A FAQ about today's FSF release

I've had a few people ask me some questions. There's also been a few questions on slashdot. I'll update this article as more questions come in.

Was it all me?

No, I didn't do the bulk of the work. Luis did the bulk of the legal hoop-jumping and review process at work. I grabbed it near the end of this process (so he could move onto other things) and shepherded the process of getting things ready for open sourcing.

I encouraged some external developers from the community to come on board and help in the initial effort to get it to compile and work correctly under the open source Tensilica toolchain rather than the internal toolchain.

I've fixed a few bugs here and there - eg the RX path TSF bugs that stopped the NICs from working in Mesh mode, along with some other fallout issues from the toolchain migration.

I wanted the bulk of the work to come from the community rather than me. I don't want to be the only person working on this. Thankfully I'm not! There's an active community now!

I'll likely do a bunch more development in the firmware code once I get it working on FreeBSD!

Why is it only one device? Why is it so expensive? Why that device?

You'll have to ask the FSF that.

How different is this to the non-USB stuff?

Like a lot of manufacturers, Atheros reuses its CPU and Wifi cores everywhere they can.

The AR7010 designs have an external AR9280 or AR9287 NIC. This is exactly the same as a mini-PCIe design - the same chip, speaking PCIe, etc.

The AR9271 design is a single chip solution (see below) with an AR9285 NIC internally. I don't know whether internally it speaks PCIe or whether they just glued the NIC onto the AHB like they do for other integrated CPU+SoC designs (eg the AR913x, AR933x, AR934x, etc.)

But once you get past the USB and CPU parts, it looks exactly the same as the PCIe devices Atheros driver developers know and love.

Just keep in mind the main difference - the wifi part doesn't DMA directly to/from your computer memory. It has to go via buffer RAM on the AR7010 core in order to then send or receive it via USB endpoints.

What about the other NICs? The AR7010 based ones?

The AR7010 based ones are precursors to the single-chip solution that the FSF is selling a NIC for (the AR9271.) The AR7010 has USB on one side and PCIe on the other. It runs effectively the same firmware as the AR9271 NIC, save for some different ROM addresses, memory map and some other little differences.

The AR7010 based devices are thus "just as free" as the AR9271 NIC the FSF is selling.

I'm not sure if the FSF is going to certify an AR7010 design. I hope they can find a dual-band AR7010+AR9280 ath9k_htc NIC and sell that as part of their open hardware programme.

What is this AR9271 anyway? Why is it only 1x1 and 2GHz only?

The AR9271 is a single chip solution containing:
  • An AR7010 style core, with minor differences
  • Some RAM and ROM (but less RAM than the AR7010; no I don't know why.)
  • An AR9285 derivative (which is the 2GHz, 1x1 chip.)
Like a lot of things that manufacturers do, it's a "cost savings" design for a specific market. Even now, laptop and tablet manufacturers want to skimp on 5GHz NIC designs in order to save some cash. No, I don't know why. No, I can't quote costs.

How can I help?

Download the firmware, download a linux-next or compat-wireless tarball - or, run OpenBSD + athn for now - compile stuff up and hack away.


Hi to those from!

All I'd like to say is:

"Patches gratefully accepted."

Thursday, April 25, 2013

Today's Journey: Making AP mode power-save work better

I've been working on improving the net80211 and ath driver support for AP mode power save.

There's a few parts to it:

  • A station can tell an access point it's going to sleep by setting the power mgmt bit to 1 in a TXed frame;
  • The AP will then update the TIM entry in the beacon frames it sends out to reflect whether that station has any traffic queued;
  • A station can signal an AP that it's awake by sending a data frame with the power mgmt bit set to 0;
  • .. or it can request a frame at a time by using PS-POLL;
  • There's also the uAPSD stuff which I haven't yet implemented and won't likely do so for a while.
Now, it shouldn't be that difficult. Except, that it is.

If an AP has a bunch of frames queued to a station that has gone to sleep, it will keep trying to transmit those frames. That wastes air-time and results in annoying levels of packet loss.

When you're doing 802.11n, there's a whole lot more traffic going on and a lot more room to cause massive traffic issues if you drop frames. But you don't want to keep failing to transmit those frames or you'll end up spending a lot of time transmitting BAR frames to the station.

If the driver maintains a queue of frames (for say, software retransmit) then it also needs to ensure that the TIM bit is set correctly. Otherwise the AP may set the TIM bit to 0 because the net80211 stack has no queued frames to that node; but the driver itself has some frames. Thus, the station won't wake up and you'll see increased packet latency.
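
In other words, the TIM decision has to look at both queues. A trivial Python sketch (the function names are made up; the real thing is a net80211 driver method hook, not a free function):

```python
def tim_bit_stack_only(net80211_queued, driver_queued):
    """The broken version: only the net80211 power save queue is consulted."""
    return net80211_queued > 0


def tim_bit(net80211_queued, driver_queued):
    """Set the TIM bit if EITHER the stack or the driver holds frames."""
    return net80211_queued > 0 or driver_queued > 0


# The stack's queue is empty but the driver still holds 3 software-retry frames.
print(tim_bit_stack_only(0, 3))  # False - the station is never told to wake up
print(tim_bit(0, 3))             # True
```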

When PS-POLL is received, frames need to first be leaked from the driver queue BEFORE it starts leaking frames from the net80211 power save queue. The last thing you want is the wrong set of frames to go out.

So, I've spent the last few months extending the driver and network stack to make this feasible. There's new net80211 driver methods for tying into the TIM update process, the node power save status and the PS-POLL handling. The filtered frames handling in the ath driver is another precursor to this - it means that frames can be failed out very quickly and retried when appropriate.

(No, I'm not implementing software retransmit for non-11n traffic just yet. I will eventually. Just not yet.)

The final bits that I've been working on have been tricky.

When a node goes to sleep, you want to pause the driver transmission to the node - otherwise it will keep trying to transmit whatever is in the driver queue. For 11n this is terrible; it means that frames will keep failing to be transmitted and with enough failures, the traffic will stop whilst a BAR frame is sent. Grr.

Next was figuring out how to send frames whilst the node is "paused". I introduced a per-node "leak" counter which tells the driver transmit path that even though the node is asleep, a single frame should be scheduled. If one isn't available, the next frame sent will be scheduled. This handles the PS-POLL "null" response - ie, if there's nothing in the queue, the net80211 stack will queue a null data response with the MORE bit clear. That way the station will know there's currently nothing to receive.
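
Here's a small Python model of the leak counter idea, combined with the "driver queue first" ordering. The class and method names are invented for illustration and the MORE bit handling is left out; the real code lives in the ath driver transmit path and is considerably hairier.

```python
from collections import deque


class NodeTx:
    """Toy per-node transmit state with a PS-POLL 'leak' counter."""

    def __init__(self):
        self.asleep = False
        self.leak = 0            # frames allowed out whilst asleep
        self.driver_q = deque()  # frames already held by the driver
        self.ps_q = deque()      # net80211 power save queue

    def ps_poll(self):
        """Station asked for one frame."""
        self.leak += 1
        if not self.driver_q:
            # Nothing in the driver: take a buffered frame from the stack,
            # or queue a null data frame if there is nothing at all.
            self.driver_q.append(self.ps_q.popleft() if self.ps_q else "null-data")

    def schedule(self):
        """Return the frames the driver may transmit right now."""
        out = []
        while self.driver_q and (not self.asleep or self.leak > 0):
            out.append(self.driver_q.popleft())
            if self.asleep:
                self.leak -= 1
        return out


node = NodeTx()
node.asleep = True
node.driver_q.extend(["d1", "d2"])
node.ps_q.append("p1")

print(node.schedule())  # [] - paused, nothing goes out
for _ in range(4):
    node.ps_poll()
    print(node.schedule())
# ['d1'], ['d2'], ['p1'], ['null-data']
```

Each PS-POLL lets exactly one frame out, driver frames drain before the power save queue, and an empty queue produces the null data response.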

But then, something odd started happening. Devices would disassociate and re-associate, but they'd still be marked as "asleep". So no traffic would occur. After digging into it a bit, I discovered that the only time a station transitions back to awake is when the AP receives a DATA frame from it with the power mgmt bit set to 0. Seeing management/control traffic from the station isn't enough. So for now, I just always transmit management/control frames regardless of whether the station is asleep or awake - except BAR frames. Those get software queued if the node is asleep. Now that management/control frames are transmitted directly, a station can re-associate and be marked as 'awake.'

Then I found that once a station re-associates, it should have all of its current association state reset. It may have had a bunch of aggregate frames queued to the hardware and those need to finish transmitting before we can start transmitting new data to the re-associated station. It may even have been in the middle of receiving a BAR frame! So, I have to gently (well, "gently") reset the association state to allow for currently queued frames to be cleaned up, but reset things like filtered frame state and BAR TX. Ew, but it needs to be done.

Also, if there's data queued to an asleep station and a BAR frame needs to go out, the BAR frame needs to go into the head of the software queue, not the tail. Otherwise it will have to wait for the queue to be transmitted - which, if there's a gap in the transmit block-ack window (hence needing the BAR), no further transmission will occur. Oops!
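
As a sketch (Python, with a made-up helper name), the queueing rule is just "BARs jump the queue":

```python
from collections import deque


def swq_enqueue(swq, frame, is_bar=False):
    """Queue a frame on the per-node software queue; BAR frames go to the
    head so they are sent before any data stuck behind the window gap."""
    if is_bar:
        swq.appendleft(frame)
    else:
        swq.append(frame)


swq = deque(["data1", "data2"])
swq_enqueue(swq, "data3")
swq_enqueue(swq, "BAR", is_bar=True)
print(list(swq))  # ['BAR', 'data1', 'data2', 'data3']
```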

I then found that a sufficiently chatty node could end up filling the software queue full of buffers destined to it. This is a general problem in the ath driver which I'll eventually fix, but it became a huge problem with power save enabled. So, I've introduced a per-node maximum queue depth when it's asleep. That should limit the amount of pain that a single sleeping node can cause. I'll eventually introduce a limit for how many buffers an individual node can consume whether it's awake or asleep but that's for another day.
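
The limit itself is about as simple as it sounds. A Python sketch, where the helper name and the depth of 32 are placeholders I picked for illustration, not the values used in the driver:

```python
def node_enqueue(node_q, frame, asleep, max_asleep_depth=32):
    """Queue a frame for a node, refusing it if the node is asleep and is
    already holding max_asleep_depth frames. Returns True if queued."""
    if asleep and len(node_q) >= max_asleep_depth:
        return False  # caller drops the frame / frees the buffer
    node_q.append(frame)
    return True


q = []
results = [node_enqueue(q, n, asleep=True, max_asleep_depth=3) for n in range(5)]
print(results)  # [True, True, True, False, False]
print(len(q))   # 3
```

An awake node is not limited here, which matches the current state of things: the general per-node limit is still on the to-do list.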

There's likely lots more corner cases that need to be addressed before I can merge this into -HEAD. I'm still seeing my macbook pro occasionally disassociate and not automatically re-associate and I'm not sure why. But things are behaving much, much better with sleeping devices.

Tuesday, March 26, 2013

Hey, look, it's lots of atheros NICs in one laptop

So after many months of evenings and a whole lot of work internally to get the AR9380 HAL release vetted by legal, I bring you: a single, unified ath(4) and ath_hal(4) driver which works on all chipsets.

Now, the only chipsets I can fit _in_ this laptop:

[100309] ath0: mem 0xebf00000-0xebf0ffff irq 17 at device 0.0 on pci3
ath0: AR9280 mac 128.2 RF5133 phy 13.0
[100309] ath1: mem 0xedf00000-0xedf1ffff irq 18 at device 0.0 on pci4
ath1: AR9380 mac 448.3 RF5110 phy 0.0
[100309] ath2: mem 0xe4310000-0xe431ffff irq 16 at device 0.0 on cardbus0
ath2: AR5212 mac 5.9 RF5112 phy 4.3

.. that's an AR9280, AR5212 and AR9380 in the same laptop.

And, that's a 3x3 AR9380:

static_rix (-1) ratemask 0xffffffff
[ 250] cur rate 20 MCS since switch: packets 1 ticks 2647581
[ 250] last sample (6  Mb) cur sample (0 ) packets sent 9
[ 250] packets since sample 9 sample tt 0
[1600] cur rate 22 MCS since switch: packets 15 ticks 2647530

[1600] last sample (21 MCS) cur sample (0 ) packets sent 6049
[1600] packets since sample 0 sample tt 532
   TX Rate     TXTOTAL:TXOK       EWMA          T/   F     avg last xmit

[ 6  Mb: 250]        4:4        (100.0%)        4/   0   760uS 2640242
[20 MCS: 250]        9:9        (100.0%)        9/   0   440uS 2647581
[20 MCS:1600]      969:969      (100.0%)       57/   0   572uS 2647445
[21 MCS:1600]     1517:1517     (100.0%)       74/   0   613uS 2647557
[22 MCS:1600]     1990:1990     (100.0%)       92/   0   529uS 2647557
[23 MCS:1600]    73986:73462    ( 99.5%)     5661/   0   755uS 2647538

Now, I'm sure the AR5210 will work with an AR9280 and an AR9380 in the same laptop - it's just that the hardware form factor won't let me fit them all at the same time.

Tuesday, March 19, 2013

AR9380 support on FreeBSD; why it's taken so long..

There's now public, open source support for the AR9380 and later chips for FreeBSD.

It's not yet in the -HEAD tree - I'll get to that.

Let me take you on a bit of a journey.

I started a little side project late last year - I wanted to see if I could make the AR9380 HAL from the Qualcomm Atheros mainline driver (10.x branch) work on FreeBSD. I was hoping that the HAL API hadn't drifted all that much over the years.

Why do this? Two reasons:
  • I wanted to see if I could open source the HAL and have it work with FreeBSD; and
  • I didn't want to take on a similar project to what ath9k had to do - which is to take the existing HAL, convert it into something Linux-upstream-compatible, then push THAT into open source.
There's only one of me, and I don't want to spend all of my evenings trying to figure out which changes to the internal driver HAL need merging into "my" version of the HAL. I want to leverage all of the development and debugging that we do internally for the HAL. The ath9k team (both public and internal) needs to do a lot of manual inspection and coding in order to pick up fixes and features from the internal driver. Since there's one of me, I'd rather optimise my time (read: get some sleep at some point.)

Then there's the third point that I didn't mention above:
  • I want to see how feasible it is to do snapshots from our internal codebase and push those out, rather than having to maintain a separate driver tree (sometimes based on the internal driver tree, sometimes re-implemented) and all the associated complication there.
This bit is pretty important. There's plenty of code I didn't want to open up. The bulk of the AR9300 HAL is already open sourced via the ath9k driver in Linux. So for the most part I'm open sourcing what we already have open sourced. However, I want to try and streamline the process for taking internally developed code and pushing it open.

This involves a few things.

Firstly, how much of the internal driver code is written with the idea that it's going to appear in the public eye? It depends what you think of as public - are your company developers "public" ? Are your customers with source code "public" ? It may not necessarily be "the general community." When you're writing code that's eventually going to be open sourced, you may need to make some decisions about how you structure your code.

For me, it was (mostly) easy. A very large amount of the "stuff that shouldn't be released" was already wrapped up in #ifdef's - stuff like emulation code, for example. So the public HAL snapshot is actually missing a lot of code that our internal version has. All I did (heh!) was pass it through 'unifdef'.

Next is whether the code is nice to look at. Is it formatted well? Is it well designed? Does it compile without warnings? Even on clang? These should be thought about whether or not your target audience is public. It's just good design. Companies may be worried that exposing the code will reflect badly on them. Well, yes, you should be worried about that. But hey - we the open community would rather you release the code and take constructive criticism instead of keeping it closed. Who knows, it may actually help you!

The Linux upstream push is actually good here - the Linux system maintainers don't take "bad code". They hold the developers to a higher standard and this is forcing companies to think a bit more about how they develop things. Now, whether companies view this as a cost-centre or a benefit is not something I wish to discuss here. The point is that by working in the Linux upstream community, companies are being forced to tidy up their game a little.

Ok, enough of the back-story. How'd it actually all happen?
The short version - there was API drift, yes. There was a bunch of driver layer stuff that needed to happen. But it wasn't terribly painful. It required me to clean up the driver a bit and implement some nicer tools.

The long version:
  • There was an internal attempt to partly convert the HAL code over to a format that is Linux-upstream compatible. This involved a variety of formatting changes - function names and indentation changed. It also involved a variety of variable / method changes - eg halMciSupport became hal_mci_support. The boolean type changed - HAL_BOOL and AH_TRUE/AH_FALSE became bool, true & false. These needed to be renamed back to the HAL style before I could make it compile.
  • FreeBSD stripped out the HAL_CHANNEL stuff from its HAL, replacing it with a direct reference to the net80211 type (struct ieee80211_channel.) This made things slightly tidier but it did put an external dependency on the HAL. I may end up going through the FreeBSD HAL and undoing this at some point; but it's a big job.
  • A variety of APIs changed over time. Although the bulk of the APIs stayed the same, they grew parameters (eg 11n TX and RX antenna and chain configuration); the TX descriptor APIs now take a list of TX buffers rather than a single TX buffer; and various other things.
So, what was I going to do?

My first cut was to just take a snapshot of the HAL and rename / shuffle things around enough to make it compile.

The first thing I did was to create a set of HAL stub functions. All the stub functions did was print out their method name and return. This way I wasn't surprised by a NULL pointer dereference when the HAL or driver called an unimplemented method - I'd get told which method was being called.
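A minimal sketch of the stub idea, assuming a made-up two-method interface (struct hal_ops and both method names are hypothetical; the real HAL has far more methods and different signatures):

```c
#include <stdio.h>

/* Hypothetical two-method HAL interface, for illustration only. */
struct hal_ops {
    int (*reset)(void *ah);
    int (*set_tx_power)(void *ah, int limit);
};

static int stub_calls;   /* counts stub invocations */

static int
stub_reset(void *ah)
{
    (void)ah;
    stub_calls++;
    printf("%s: called\n", __func__);   /* says which method was hit */
    return 0;
}

static int
stub_set_tx_power(void *ah, int limit)
{
    (void)ah;
    (void)limit;
    stub_calls++;
    printf("%s: called\n", __func__);
    return 0;
}

/* Point every method at a stub so that no pointer is left NULL. */
static void
hal_ops_init_stubs(struct hal_ops *ops)
{
    ops->reset = stub_reset;
    ops->set_tx_power = stub_set_tx_power;
}
```

Real implementations then replace the stubs one at a time, and anything still unimplemented announces itself by name instead of panicking.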

I started with the bare minimum code needed to support probe and attach - which required a surprising amount of code to be converted over. But it was mostly mechanical work. And it worked - enough to get things probing and attaching. I didn't bother with frame transmission and reception just yet - getting probe/attach was enough.

Then I realised that I wanted to do this in a git branch, so I could import future versions of the HAL into master and then merge them into my branch. That's what I did. The HAL from 10.x was in master, and my FreeBSD port lived in 'local/freebsd'.

Next was figuring out whether to rename/fix API functions, or to use glue functions in order to deal with API differences. I've fixed some API differences (eg the reset path), but I ended up using a lot of wrapper functions to get the APIs to line up.

The important bits to bring up (in rough order) in order to see whether things are working:
  • Probe/attach/detach;
  • The reset path;
  • The initial calibration path (ADC calibration, IQ calibration, NF/AGC calibration);
  • The radio configuration path (ie, programming the analog section with the right frequencies, channel width, filter setup and such);
  • Interrupt handling;
  • ANI support;
  • RX path.
The RX path was the important bit. Once frame RX was working, I could do things like run the NIC in monitor mode and verify that HT20 and HT40 were working. And yes, that's pretty much what I did.

But at this point, the RX path exposed the first major API change - the whole FIFO setup that the AR9380 and later required. They don't support the list-based TX and RX that previous NICs supported. (Well, they _kind_ of do on the TX side, see below.)

The major change required in the driver here is that the RX descriptor is actually in the same memory area as the RX buffer. Ie, the first 'x' bytes of the passed in buffer is where the NIC DMAs the RX completion information to. Previous NICs have two areas for each RX frame - an RX descriptor area and an RX buffer area. Descriptors are in non-cachable memory, so I had to teach the FIFO RX path to support descriptors in cachable memory. I also had to teach the RX path to "skip" the 'x' bytes in order to hand the start of the data payload up to the net80211 stack. Finally, there are two RX FIFOs - one for high priority frames (beacons, uAPSD frames, PS-POLL frames, etc) and one for low-priority frames (everything else.) I had to teach the stack about this.
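The "skip the first 'x' bytes" step can be sketched like this. The names are hypothetical, and status_len stands in for 'x', which in the real driver would come from the HAL:

```c
#include <stddef.h>
#include <stdint.h>

struct rx_frame {
    const uint8_t *payload;   /* first byte after the RX status area */
    size_t         space;     /* bytes left in the buffer after the status;
                                 the real frame length is reported in the
                                 status itself, which this sketch ignores */
};

/* Returns 0 on success, -1 if the buffer cannot even hold the status. */
static int
rx_buf_to_frame(const uint8_t *buf, size_t buf_len, size_t status_len,
    struct rx_frame *out)
{
    if (buf_len < status_len)
        return -1;
    out->payload = buf + status_len;
    out->space = buf_len - status_len;
    return 0;
}
```

The pointer handed up the stack is simply the buffer start plus the status length; no copy is involved.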

So, you can see the changes to the RX code - there's now a set of methods that implement RX - stop, start, flush, descriptor processing. The legacy routines stayed where they were. The new routines just overrode those methods.

And with that, RX came to life.

Next was TX. TX is a bit more special. There's only 8 TX FIFO entries per hardware queue (QCU 0..9); so I can't just push all the frames I want into the list. I also have four TX data buffer pointers per descriptor, rather than one per descriptor in the past. Finally, the TX status FIFO is completely separate from the TX FIFO itself - legacy chips would put the TX status at the end of the final descriptor in a frame.
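One way to live with an 8-deep FIFO is to hold the overflow in a software queue and refill as entries complete. This is only a counting sketch with hypothetical names, not a description of how the driver actually schedules frames:

```c
#define TX_FIFO_DEPTH 8   /* per hardware queue, from the text */

struct txq {
    int fifo_depth;   /* entries currently pushed into the hardware FIFO */
    int sw_depth;     /* frames waiting in the software queue */
};

/* Queue one frame: straight to hardware if a slot is free, else hold it. */
static void
txq_submit(struct txq *q)
{
    if (q->fifo_depth < TX_FIFO_DEPTH)
        q->fifo_depth++;
    else
        q->sw_depth++;
}

/* One FIFO entry completed: free the slot, then refill from software. */
static void
txq_complete(struct txq *q)
{
    if (q->fifo_depth > 0)
        q->fifo_depth--;
    if (q->sw_depth > 0 && q->fifo_depth < TX_FIFO_DEPTH) {
        q->sw_depth--;
        q->fifo_depth++;
    }
}
```

In practice the completion would be driven by the separate TX status FIFO rather than by a direct call like this.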

This required some pretty significant refactoring of the TX path in order to expose the correct hooks to do this all properly. I won't go into the details here - suffice to say that I'm still working on it.

The next problem with TX was figuring out exactly what TX descriptor flags I was setting incorrectly. I eventually gave in and wrote some ALQ based logging which dumps the TX and RX descriptors into an ALQ log which I can then read from userland. This made it very, very easy to inspect what was going on - I was even finding bugs with the earlier chipset code!

Initially I used this to discover I wasn't correctly filling out all four buffer pointers in each TX descriptor. I can't leave any NULL if there's more descriptors for a given frame.
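The packing rule can be sketched as follows, assuming a simplified descriptor that holds nothing but the four buffer slots and a "more" flag (the real descriptor layout is quite different and all names here are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

#define TX_BUFS_PER_DESC 4   /* from the text */

struct seg { uint32_t addr; uint32_t len; };   /* one DMA segment */

struct tx_desc {
    struct seg buf[TX_BUFS_PER_DESC];
    int        more;   /* non-zero if another descriptor follows */
};

/*
 * Pack 'nsegs' segments into descriptors, filling every slot of a
 * descriptor before moving on to the next one. Returns the number of
 * descriptors used, or 0 if there are no segments or too few descriptors.
 */
static size_t
tx_fill_descs(const struct seg *segs, size_t nsegs,
    struct tx_desc *descs, size_t ndescs)
{
    size_t need = (nsegs + TX_BUFS_PER_DESC - 1) / TX_BUFS_PER_DESC;

    if (need == 0 || need > ndescs)
        return 0;
    for (size_t d = 0; d < need; d++) {
        for (size_t i = 0; i < TX_BUFS_PER_DESC; i++) {
            size_t s = d * TX_BUFS_PER_DESC + i;
            /* Empty slots can only ever land in the last descriptor. */
            descs[d].buf[i] = (s < nsegs) ? segs[s] : (struct seg){ 0, 0 };
        }
        descs[d].more = (d + 1 < need);
    }
    return need;
}
```

Because segments are assigned in order, any descriptor with more set has all four slots populated.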

Then I used it to discover whether I was setting up the general flags right - TX chainmask, TX rate, duration, etc. I (re) discovered a hardware limitation with the AR9380 - I need to pad aggregate frames that use RTS with a few more pad delimiters or the transmission underruns. I was able to take these text dumps and give them to the Qualcomm Atheros MAC/PHY team for assistance and they were very impressed by the sophistication of my debugging tools.

Now I have the TX and RX side working. I pushed all of the driver side code into the public FreeBSD repository. I promised people that I would eventually open up the HAL side of things, but I figured that keeping the driver side of this closed was just plain silly. It also meant that if I did stop working on things (for whatever reason), the driver side was done - all that would need porting was the HAL.

Then I began the internal process to get the HAL opened up. I won't go into this in too much detail - suffice to say it took some engineering and legal review to get approval for this. The approval came in about two weeks ago and I pushed the repository into github shortly afterward.

Shortly after that, people started testing it and filing bugs. This part made me happy - there's a few small bugs that are actually in our 10.x mainline tree. I'll be pushing fixes back into the internal driver tree soon.

So, what's next?

I need to push the repository into a vendor branch in FreeBSD, then merge it into the kernel tree so it can be compiled by default.

I then need to get an updated version of the HAL approved by legal/engineering and push that update into the public git repository. Once that's done, I'll do a git merge into my branch and fix up whatever merge issues there are. This updated HAL includes some fixes for TX power and the AR95xx embedded SoC that we've just released. I hope to try and do a HAL update every month or two based on what bugs and features are introduced into the internal mainline driver.

I still have a bunch of driver work to finish up - notably I need to finish optimising the TX FIFO path in the driver and I need to implement MCI support. But the driver is now usable for me at least and I hope it'll become increasingly usable by others.

This has been a long and interesting trip.

Sunday, March 3, 2013

Why PCI latency timers matter..

My latest "are you serious?" moment was trying to figure out the root cause of a performance issue with the AR5416 cardbus NIC on some of my test laptops.

Now, the AR5416 is Atheros' first 802.11n NIC, so it has some rough edges. But I was seeing some ridiculously bad transmission failures and I couldn't pinpoint them.

Not only that, I was seeing great performance (~ 130mbit TCP) on a specific laptop (Lenovo T41p) but the Lenovo T60 and T400 both performed extremely poorly.

To make matters weirder - the NIC performed great when speaking to another NIC in the same laptop. Just not to another physically separate device.

So, after much digging, here's what I discovered.

Firstly - I used my athalq packet descriptor logging and inspection tool (that's in FreeBSD-HEAD - no custom closed source code here!) to investigate the TX frames being sent to the hardware. What I found was troubling - large numbers of frames had TX data and TX delimiter underruns.

I then discovered that my code for counting TX data / delimiter underruns was totally incorrect - it's possible to see a data/delimiter underrun error _with_ a valid transmitting frame. What was going on was cute - the hardware would start transmitting an aggregate frame but the DMA wouldn't keep up during said transmission and half way through the frame it would underrun. This only happened at higher MCS rates.
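The counting lesson, sketched with made-up status bits (the flag names and values are hypothetical; the real TX status layout differs): treat the underrun bits as independent of the frame's success bit instead of only looking at them on failure.

```c
#include <stdint.h>

/* Hypothetical status bits, for illustration only. */
#define TXS_FRAME_OK       0x01u
#define TXS_DATA_UNDERRUN  0x02u
#define TXS_DELIM_UNDERRUN 0x04u

struct tx_stats {
    unsigned ok;
    unsigned failed;
    unsigned data_underrun;
    unsigned delim_underrun;
};

static void
tx_stats_update(struct tx_stats *st, uint32_t status)
{
    if (status & TXS_FRAME_OK)
        st->ok++;
    else
        st->failed++;

    /* Counted whether or not the frame was reported as OK. */
    if (status & TXS_DATA_UNDERRUN)
        st->data_underrun++;
    if (status & TXS_DELIM_UNDERRUN)
        st->delim_underrun++;
}
```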

So making shorter aggregate frames fixed it, as well as increasing the delimiter count between frames. Both had the effect of reducing the likelihood of the NIC failing to transmit a longer aggregate. But they weren't solutions.

So I went digging. What I found was pretty simple in theory: the PCI latency timer on the NIC was being set to something appropriate (0xa8) but the PCI latency timer on the cardbus PCI bridge itself was not (0x20.) So any other bus activity would cause the NIC to not get the bus and it'd miss its DMA window.

Once I manually fixed the PCI bridge latency timer to be 0xa8, everything returned to normal.
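To put those two values in perspective: the latency timer lives at offset 0x0d in standard PCI config space and is counted in PCI bus clocks. Assuming a 33 MHz bus clock (an assumption, not something measured here), 0x20 is a little under a microsecond of bus time while 0xa8 is roughly five. The helper names below are hypothetical:

```c
#include <stdint.h>

#define PCI_LATENCY_TIMER_REG 0x0d   /* offset in standard PCI config space */

/* Read the latency timer out of a copy of a device's config space. */
static uint8_t
cfg_lat_timer(const uint8_t *cfg)
{
    return cfg[PCI_LATENCY_TIMER_REG];
}

/* Convert a latency timer value (in bus clocks) to nanoseconds. */
static uint32_t
lat_timer_ns(uint8_t timer, uint32_t bus_hz)
{
    return (uint32_t)(((uint64_t)timer * 1000000000u) / bus_hz);
}
```

With bus_hz set to 33000000, lat_timer_ns(0x20, ...) comes out at 969 ns and lat_timer_ns(0xa8, ...) at 5090 ns, so the bridge default allowed about a fifth of the bus time the NIC itself was configured for.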

However - there's only one thing on this PCI bridge - the cardbus interface itself. That's why it's so kooky. I would've thought that I'd have to raise the value on the rest of the PCI bridges all the way up to the root complex. There's no latency timer for PCIe, so it's not a problem there. So there's likely some very subtle timing involved that's just plain broken by default in how the BIOS initialises this cardbus slot, and FreeBSD is not overriding it.

Now, if you see crappy performance on the PCI/cardbus 802.11n NICs in FreeBSD, you can check the output of 'athstats' to see if there are TX underruns of any sort. If there are, the hardware isn't meeting the DMA deadlines it needs to DMA out frames and you need to do some further digging into your system to see why.