Monday, July 21, 2014

Application awareness of receive side scaling (RSS) on FreeBSD

Part of testing this receive side scaling work is designing a set of APIs that allow for some kind of affinity awareness. It's not easy - the general case is difficult and highly varying. But something has to be tested! So, where should it begin?

The main tricky part of this is the difference between incoming, outgoing and listening sockets.

For incoming traffic, the NIC has already calculated the RSS hash value and there's already a map between RSS hash and destination CPU. Well, destination queue to be much more precise; then there's a CPU for that queue.

For outgoing traffic, the thread(s) in question can be scheduled on any CPU core and, as you have more cores, it's increasingly unlikely to be the right one. In FreeBSD, the default is to direct-dispatch transmit-related socket and protocol work in the thread that started it, save a handful of places like TCP timers. Once the driver if_transmit() method is called to transmit a frame, it can check the mbuf to see what the flowid is and map that to a destination transmit queue. Before RSS, that was typically done to keep packets in some semblance of in-order behaviour - ie, for a given traffic flow between two endpoints (say, IP, TCP or UDP), the packets should be transmitted in-order. It wasn't really done for CPU affinity reasons.

Before RSS, there was no real consistency with how drivers hashed traffic upon receive, nor any rules on how it should select an outbound transmit queue for a given buffer. Most multi-queue drivers got it "mostly right". They definitely didn't try to make any CPU affinity choices - it was all done to preserve the in-order behaviour of traffic flows.

For an incoming socket, all the information about the destination CPU can be calculated from the RSS hash provided during frame reception. So, for TCP, the RSS hash for the received ACK during the three-way handshake goes into the inpcb entry. For UDP it's not so simple (and the inpcb doesn't get a hash entry for UDP - I'll explain why below.)

For an outgoing socket, all the information about the eventual destination CPU isn't necessarily available. If the application knows the source/destination IP and source/destination port, then it (or the kernel) can calculate the RSS hash that the hardware would calculate upon frame reception and use that to populate the inpcb. However, this isn't typically known - frequently the source IP and port won't be explicitly defined and it'll be up to the kernel to choose them for the application. So, during socket creation, the destination CPU can't be known.
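For reference, here's roughly what that software hash calculation looks like - a minimal sketch of the Toeplitz hash from the Microsoft RSS specification, which is what the hardware computes over the 4-tuple. The function name and layout here are illustrative, not the actual FreeBSD code:

#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of the Toeplitz hash from the RSS spec.  'key' is the 40-byte
 * RSS secret key; 'data' is the tuple in network byte order (for IPv4
 * TCP: src addr, dst addr, src port, dst port - 12 bytes).
 */
static uint32_t
toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t datalen)
{
    uint32_t hash = 0;
    /* The left-most 32 bits of the key form the initial window. */
    uint32_t window = ((uint32_t)key[0] << 24) | (key[1] << 16) |
        (key[2] << 8) | key[3];
    size_t i;
    int bit;

    for (i = 0; i < datalen; i++) {
        for (bit = 7; bit >= 0; bit--) {
            if (data[i] & (1U << bit))
                hash ^= window;
            /* Slide the 32-bit key window left by one bit. */
            window <<= 1;
            if (key[i + 4] & (1U << bit))
                window |= 1;
        }
    }
    return (hash);
}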

So, to make it simple (and to make it easy for me to verify that the driver and protocol stack parts are working right), my focus has been on incoming sockets and incoming packets, rather than trying to handle outgoing sockets. I can handle outbound sockets easily enough - I just need to do a software hash calculation once all of the required information is available (ie, the source IP and port are selected) and populate the inpcb with that value. But I decided not to try debugging that at the same time as the driver side and the protocol stack side, so it's a "later" task.

For TCP, traffic for a given connection will use the same source/destination IP and source/destination port values. So for a given socket, it'll always hash to the same value. However, for UDP, it's quite possible to receive UDP traffic from a variety of different source IPs/ports and respond from a variety of different source IPs/ports. This means that the RSS hash value we can store in the inpcb isn't at all guaranteed to be the same for all subsequent socket writes.

Ok, so given all of that above information, how exactly is this supposed to work?

Well, the slightly more interesting and pressing problem is how to break out incoming requests/packets to multiple receive threads. In traditional UNIX socket setups, there are a couple of common design patterns for farming off incoming requests to multiple worker threads:

  • There's one thread that just does accept() (for TCP) or recv() (for UDP) and it then farms off new connections to userland worker threads; or
  • There are multiple userland worker threads which all wait on a single socket for accept() or recv() - and hope that the OS will only wake up one thread to hand work to.
It turns out that the OS may only wake up one thread at a time for accept() or recv(), but the userland threads then sit in a loop trying to accept connections / packets - and you tend to find they wake up a lot only to discover that another running worker thread already stole the workload. Oops.

I decided this wasn't really acceptable for the RSS work. I needed a way to redirect traffic to a thread that's also pinned to the same CPU as the receive RSS bucket. I decided the cheapest way would be to allow multiple PCB entries for the same socket details (eg, multiple TCP sockets listening on *:80). Since the PCBGROUPS code in this instance has one PCB hash table per RSS bucket, all I had to do was teach the stack that wildcard listen PCB entries (eg, *:80) could also exist in each PCB hash bucket, and to use those in preference to the global PCB hash.

The idea behind this decision is pretty simple - Robert Watson already did all the great work of setting up and debugging PCBGROUPS, and then made the RSS work leverage that. All I'd have to do is have one userland thread per RSS bucket and have that thread's listen socket live in the RSS bucket. Then any incoming packet would first check the PCBGROUP that matched the RSS bucket indicated by the RSS hash from the hardware - and it'd find the "right" PCB entry in the "right" PCBGROUP PCB hash table for the "right" RSS bucket.

That's what I did for both TCP and UDP.

So the programming model is thus:

  • First, query the RSS sysctl (net.inet.rss) for the RSS configuration - this gives the number of RSS buckets and the RSS bucket -> CPU mapping.
  • Then create one worker thread per RSS bucket..
  • .. and pin each thread to the indicated CPU.
  • Next, each worker thread creates one listen socket..
  • .. sets the IP_BINDANY or IPV6_BINDANY option to indicate that there'll be multiple PCB entries bound to the given listen details (eg, binding to *:80);
  • .. then IP_RSS_LISTEN_BUCKET to set which RSS bucket the incoming socket should live in;
  • Then for UDP - call bind()
  • Or for TCP - call bind(), then call listen()
Each worker thread will then receive TCP connections / UDP frames that are local to that CPU. Writing data out the TCP socket will also stay local to that CPU. Writing UDP frames out doesn't - and I'm about to cover that.
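To make that concrete, here's a minimal sketch of the above for TCP. IP_RSS_LISTEN_BUCKET and the net.inet.rss.* sysctls are from the work-in-progress patches and may change before they land; error handling is elided:

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Create a *:port listen socket that lives in the given RSS bucket. */
static int
rss_listen_socket(int rss_bucket, uint16_t port)
{
    struct sockaddr_in sin;
    int s, on = 1;

    s = socket(PF_INET, SOCK_STREAM, 0);

    /* Multiple wildcard listeners will bind to the same *:port. */
    setsockopt(s, IPPROTO_IP, IP_BINDANY, &on, sizeof(on));

    /* Place this listen PCB in the given RSS bucket's PCB hash. */
    setsockopt(s, IPPROTO_IP, IP_RSS_LISTEN_BUCKET,
        &rss_bucket, sizeof(rss_bucket));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_len = sizeof(sin);
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);

    bind(s, (struct sockaddr *)&sin, sizeof(sin));
    listen(s, -1);
    return (s);
}

int
main(void)
{
    int nbuckets, i;
    size_t len = sizeof(nbuckets);

    /* One worker thread (and one listen socket) per RSS bucket. */
    sysctlbyname("net.inet.rss.buckets", &nbuckets, &len, NULL, 0);
    for (i = 0; i < nbuckets; i++) {
        /* Spawn a thread, pin it to the bucket's CPU, then have it
         * call rss_listen_socket(i, 80) and accept() in a loop. */
    }
    return (0);
}

Each worker would pin itself with cpuset_setaffinity(2) (CPU_LEVEL_WHICH / CPU_WHICH_TID) to whichever CPU the RSS bucket -> CPU map indicates before entering its accept() loop.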

Yes, it's annoying, because you're no longer able to just choose whatever IO model is convenient for your application / coding style. Oops.

Ok, so what's up with UDP?

The problem with UDP is that outbound responses may go to an arbitrary destination and thus may actually be "local" to another CPU. Most common services don't do this - they'll send the UDP response to the same remote IP and port the request came from.

My plan for UDP (and TCP in some instances, see below!) is four-fold:

  • When receiving UDP frames, optionally mark them with RSS hash and flowid information.
  • When transmitting UDP frames, allow userspace to inform the kernel about a pre-calculated RSS hash / flow information.
  • For the fully-connected setup (ie, where a single socket is connect()ed to a given UDP remote IP:port and frame exchange only occurs between those fixed IP and port details) - cache the RSS flow information in the inpcb;
  • .. and for all other situations (if it's not connected, if there's no hint from userland, if it's going to a destination that isn't in the inpcb) - just do a software hash calculation on the outgoing details.
I mostly have the first two UDP options implemented (ie, where userland caches the information to re-use when transmitting the response) and I'll commit them to FreeBSD soon. The other two options are the "correct" way to implement the default behaviour, but they'll take some time to get right.
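As a rough sketch of what the first option looks like from userland - the socket option and control message names here (IP_RECVFLOWID, IP_FLOWID) are from my in-progress patches and may well change before they're committed:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/*
 * Receive a UDP datagram and, if present, the flowid the kernel
 * stamped on it.  Assumes IP_RECVFLOWID was enabled on the socket:
 *     int on = 1;
 *     setsockopt(s, IPPROTO_IP, IP_RECVFLOWID, &on, sizeof(on));
 */
static ssize_t
recv_with_flowid(int s, void *buf, size_t len, uint32_t *flowid)
{
    char cbuf[CMSG_SPACE(sizeof(uint32_t))];
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *c;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    n = recvmsg(s, &msg, 0);
    if (n < 0)
        return (n);

    /* Walk the control messages looking for the flowid. */
    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_FLOWID)
            memcpy(flowid, CMSG_DATA(c), sizeof(*flowid));
    }
    return (n);
}

The transmit side would be the mirror image - pass the cached flowid back down in a control message on sendmsg() so the kernel doesn't have to guess.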

Ok, so does it work?

I don't have graphs. Mostly because I'm slack. I'll do up some before I present this - likely at BSDCan 2015.

My testing has been done with Intel 1G and 10G NICs on a desktop Ivy Bridge 4-core box. So yes, server-class hardware will behave better.

For incoming TCP workloads (eg, a webserver), there's no longer any lock contention between CPUs in the NIC driver or network stack. The main remaining between-CPU lock contention is in the VM and allocator paths. If you're doing disk IO then that'll also show up.

For incoming UDP workloads, I've seen it scale linearly on 10G NICs (ixgbe(4)) from one to four cores. This is with no fragmentation and 510-byte datagrams.

Single-core reception (ie, all flows to one core) was ~250,000 pps into userland with just straight UDP reception and no flow/hash information via recvmsg(), and 135,000 pps into userland with UDP reception plus flow/hash information via recvmsg().

4 core reception was ~ 1.1 million pps into userland, roughly ~ 255,000 pps per core. There's no contention between CPU cores at all.

Unfortunately, what I was sending was markedly more than what made it into userland. The driver quite happily received 1.1 million frames per second on one queue, and up to 2.1 million when all four queues were busy. So there's definitely room for improvement.

There is still lock contention - it's just not between CPU cores. Now that I'm getting past the between-core contention, the within-core contention shows up.

For TCP HTTP request reception and bulk response transmission, most of the contention I'm currently seeing is between the driver transmit paths. So, the following occurs:

  • TCP stack writes some data out;
  • NIC if_transmit() method is called;
  • It tries to grab the queue lock and succeeds;
  • It then appends the frame to the buf_ring and schedules a transmit out the NIC. This bit is fine.

But then, whilst the transmit lock is still held (because the driver is taking frames from the buf_ring to push into the NIC TX DMA queue):
  • The NIC queue interrupt fires, scheduling the software interrupt thread;
  • This pre-empts the existing running transmit thread;
  • The NIC code tries to grab the transmit lock to handle completed transmissions;
  • .. and it fails, because the code it preempted holds the transmit lock already.
So there's some context switching and thrashing going on there which needs to be addressed.

Ok, what about UDP? It turns out there's some lock contention with the socket receive buffer.

The soreceive_dgram() routine grabs the socket receive buffer lock (SOCKBUF_LOCK()) to see if there's anything to return. If not, and if it can sleep, it calls sbwait(), which releases the lock and msleep()s waiting for the protocol stack to indicate that something has been received. However, since we're receiving packets at such a high rate, the receive protocol path contends with the socket buffer lock held by the userland code trying to receive a datagram. It pre-empts the user thread, tries to grab the lock, fails - and then goes to sleep until the userland code releases the lock. soreceive_dgram() doesn't hold the lock for very long - but I do see upwards of a million context switches a second.
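In heavily simplified form, the wait path looks like this - a sketch, not the real soreceive_dgram() code:

/* Sketch of the soreceive_dgram() wait path (heavily simplified). */
SOCKBUF_LOCK(&so->so_rcv);
while (so->so_rcv.sb_mb == NULL) {
    /*
     * sbwait() releases the lock and msleep()s until the protocol
     * path queues a datagram - at high packet rates, that protocol
     * path is constantly contending on this same lock.
     */
    error = sbwait(&so->so_rcv);
    if (error != 0) {
        SOCKBUF_UNLOCK(&so->so_rcv);
        return (error);
    }
}
/* ... dequeue the datagram, then ... */
SOCKBUF_UNLOCK(&so->so_rcv);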

To wrap up - I'm pleased with how things are going. I've found and fixed some issues with the igb(4) and ixgbe(4) drivers that were partly my fault and the traffic is now quite happily and correctly being processed in parallel. There are issues with scaling within a core that are now being exposed and I'm glad to say I'm going to ignore them for now and focus on wrapping up what I've started.

There's a bunch more to talk about and I'm going to do it in follow-up posts.
  • what I'm going to do about UDP transmit in more detail;
  • what about creating outbound connections and how applications can be structured to handle this;
  • handling IP fragments and rehashing packets to be mostly in-order - and what happens when we can't guarantee ordering with the hardware hashing UDP frames to a 4-tuple;
  • CPU hash rebalancing - what if a specific bucket gets too much CPU load for some reason;
  • randomly creating a Toeplitz RSS hash key at bootup and how that should be verified;
  • multi-socket CPU and IO domain awareness;
  • .. and whatever else I'm going to stumble across whilst I'm slowly fleshing this stuff out.
I hope to get the UDP transmit side of things completed in the next couple of weeks so I can teach memcached about TCP and UDP RSS. After that, who knows!

Saturday, June 7, 2014

Hacking on Receive Side Scaling (RSS) on FreeBSD

RSS is a Microsoft invention that tries to keep a given TCP or UDP flow (and I think IP, but I haven't yet tried that) on a given CPU core. The idea is to try and keep both flow-local data and flow-local locking on a single CPU core, increasing the chances that data is hot in the CPU core cache and reducing the chance of lock overhead.

You can find the RSS overview and programming details here:

http://msdn.microsoft.com/en-us/library/windows/hardware/ff567236(v=vs.85).aspx

RSS and supporting technology has been making its way into FreeBSD for quite some time but it's not in any real shape that application developers can take advantage of.

Firstly, there's "PCBGROUPS", which looks to group PCB (protocol control block) data for a connection local to a CPU. Instead of there being one global PCB table for the system (well, VIMAGE for FreeBSD - each virtual image instance has its own PCB table) with one lock protecting it, there are now multiple PCB tables, one per "thing". Here, the thing is whatever the kernel developer thinks is worth grouping by.

http://www.ece.rice.edu/~willmann/pubs/paranet_usenix.pdf

Now, until the RSS work went in, this code was in FreeBSD but sat unused. A kernel developer could provide the hooks needed to map TCP (and maybe UDP later) flows to a "thing" and have that map to a PCB group table - but it required some glue to stamp incoming connections and outgoing packets with an identifier (which we call a "flowid" in FreeBSD) that could map to said "thing". Then, whenever a PCB lookup was needed, it would first try the lookup in the table mapped to by the flowid-to-"thing" mapping - and if that succeeded, it wouldn't have to touch the global PCB table at all.
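Conceptually, the lookup preference looks something like this toy model - illustrative only, with stand-in types; it's not the actual in_pcblookup code:

#include <stdint.h>
#include <stddef.h>

struct pcb;                            /* stand-in for struct inpcb */

struct pcb_table {
    struct pcb **buckets;
    size_t       nbuckets;
};

struct pcbinfo {
    struct pcb_table  global;          /* one table, one big lock */
    struct pcb_table *groups;          /* one table per "thing" */
    size_t            ngroups;
};

static struct pcb *
table_lookup(const struct pcb_table *t, uint32_t hash)
{
    /* Stand-in: the real code walks a hash chain comparing tuples. */
    return (t->buckets[hash % t->nbuckets]);
}

static struct pcb *
pcb_lookup(struct pcbinfo *info, uint32_t flowid, uint32_t tuple_hash)
{
    struct pcb *p;

    /* Map the flowid to its "thing" and try that group's table first. */
    p = table_lookup(&info->groups[flowid % info->ngroups], tuple_hash);
    if (p != NULL)
        return (p);

    /* Miss: fall back to the global table (and its single lock). */
    return (table_lookup(&info->global, tuple_hash));
}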

This is only good for established connections - creating and destroying a connection still requires manipulating that global PCB table and the single PCB table lock. I'm going to ignore fixing that for now, as that is a bigger issue.

Then Robert Watson added the RSS work done under contract to Juniper Networks, Inc. RSS provides one kind of mapping between the flowid from the NIC and which CPU to run work on. So that part worked great - but there wasn't any way for the application user to take advantage of it. Additionally, there's no driver awareness of it yet - I'll discuss this shortly.

So I grabbed a bunch of this work whilst at Netflix and tried to make sense of it. It turns out that if you can keep the work local to a CPU, a lot of the lock contention in the networking stack melts away. Here's what's going on:

  • The receive thread(s) in the NIC driver processing packets are typically doing direct dispatch to the network stack - so they're running the receive side of the TCP stack;
  • .. and the receive side of the network stack includes ACKs, which triggers the transmit side of the network stack;
  • There's typically some deferred thread(s) in the NIC driver transmitting packets to each NIC queue;
  • There are also application threads trying to queue data to the TCP socket, which can also dig into the socket and TCP stack state, which involves grabbing locks;
  • And there are also timers firing to update state, which again involves grabbing locks.
Without RSS and without lining everything up on CPU cores, all the above can run on different cores. Whenever any of them try running at the same time, lock contention can occur and that particular task can stop. If the lock contention blocks the transmit or receive NIC threads, then not only is that connection affected - the whole NIC processing is affected.

There's still lock contention in the network stack - especially if you're doing a lot of new, short connections. The good folk at Verisign are working on that particular corner of the problem so I'm happy to defer to them.

So, I ended up doing a bunch of little pieces to get this lined up right:
  • The per-CPU timer callwheels can now be optionally pinned to their CPU cores, so timer events running on CPU X actually do run on CPU X (yes, that was amusing to find..);
  • There's support in the TCP stack for per-CPU timers, but it's not enabled by default;
  • ... and it also didn't query RSS, netisr or anything to figure out how to map a flowid to a given CPU to run a timer on;
  • Then to make matters worse, incoming TCP sessions didn't have a flowid assigned to the PCB until after the first data packet was read - which meant that the initial timer work would all assume CPU 0 and any queries on that particular PCB would return flowid=0 - so it would not find it in the right PCBGROUP.
So those are fixed in FreeBSD-HEAD. The per-CPU TCP timer and pinned-CPU timers aren't enabled by default - I'll only flip that on when I'm confident that the RSS stuff is working.

So that lets all the RSS stuff work correctly. But there wasn't a nice way to query the per-connection flowid or RSS information. So I extended netstat with an 'R' flag - it prints the flowid and the flowid type. I'll add RSS information once I have a nice way to extract it in bulk. It's still a good diagnostic tool to ensure that the IPv4/IPv6 hashing is working correctly.

Then I had to teach a driver about RSS so I could actually test it all out. I have some igb(4) hardware at home, so I did the minimal work required to teach it about the RSS key and assigning things to the correct CPUs. It's still incomplete but it's good enough to get off the ground. I'll go into more details about the driver requirements in a follow-up blogpost.

Finally, how are application developers supposed to use it? I'll cover that particular bit in another follow-up blog post as there's quite a lot to cover there.

Sunday, April 27, 2014

Why oh why must people think backdoors == NSA? Silly people.

I know that there's some collective paranoia here in the United States, and in the security community in general, about government and corporate spying leading to backdoors in equipment. And yes, it's likely very true in a lot of situations. The last two years of public information leaks certainly justify said collective paranoia.

There have been a few writeups lately about backdoors in wireless equipment. Here's something from The Hacker News - http://thehackernews.com/2014/04/router-manufacturers-secretly-added-tcp.html . A helping hand? To the NSA? Perhaps. Does the NSA know about this stuff? Of course. I'd also not be surprised if they were actively using it.

But there aren't any other, less paranoid reasons out there in the articles. So let me put one out there, based on an 18-month stint at a hardware company that makes wireless chips. These manufacturers have a whole bunch of development testing, regression testing, certification testing and factory testing that goes on. Instead of building separate images to test versus ship (which may actually be against the regulatory certification rules!), they instead just leave a bunch of these remote execution backdoors in their product.

I think it's highly likely it's just very sloppy security, sloppy code design, sloppy quality control and sloppy development.

Just to be clear - the Atheros default software (as far as I'm aware) didn't have these hooks in them. All the AP firmware interfaces I played with at Atheros required authentication or manually starting things before you could use them. No one these days ships the default Atheros development firmware on their product. This looks like extra code that vendors have layered on top of things.

These companies should do a better job at their product development. But given the cutthroat pricing, cheap development and ridiculous product lifecycles, are you really surprised that the result has corners missed?

(That's why I run FreeBSD-HEAD on my kit here at home.)

Monday, March 31, 2014

Meraki Sparky boards, and constant resetting

There's a Mesh internet project at Sudo Room and they've been doing some great work getting a platform up and running. However, like a lot of volunteer projects, they're working with whatever time and equipment they've been donated.

A few months ago they were donated a few hundred Meraki Sparky boards. They're an Atheros AR2317 SoC based device with an integrated 2.4GHz 802.11bg radio, 10/100 ethernet and.. well, a hardware watchdog that resets the board after five minutes.

Now, annoyingly, this reset occurs inside of RedBoot too - which precludes the boards from being (fully) flashed before the unit reboots. And even once a unit was flashed with OpenWRT, it still rebooted every five minutes.

So, I started down the path of trying to debug this.

What did I know?

Firstly, the AR2317 watchdog doesn't have a way of resetting the board itself - all it can do is post an interrupt. The AR7161 and later SoCs do have a way to do a full hardware reset when the watchdog expires.

Secondly, RedBoot has a few tricksy ways to manipulate the hardware:

  • 'x' can examine registers. Since we need them via KSEG1 (unmapped, uncached), the reset registers at 0x11000xxx become 0xb1000xxx. Since it's hardware access, we should do the accesses as DWORDs and not bytes.
  • 'mfill' can be used to write to registers.
Thirdly, there's an Atheros-specific command - bdshow - which is surprisingly informative:

RedBoot> bdshow
name:     Meraki Outdoor 1.0
magic:    35333131
cksum:    2a1b
rev:      10
major:    1
minor:    0
pciid:    0013
wlan0:    yes 00:18:0a:50:7b:ae
wlan1:    no  00:00:00:00:00:00
enet0:    yes 00:18:0a:50:7b:ae
enet1:    no  00:00:00:00:00:00
uart0:    yes
sysled:   no, gpio 0
factory:  no, gpio 0
serclk:   internal
cpufreq:  calculated 184000000 Hz
sysfreq:  calculated 92000000 Hz
memcap:   disabled
watchdg:  disabled (WARNING: for debugging only!)

serialNo: Q2AJYS5XMYZ8
Watchdog Gpio pin: 6
secret number: e2f019a200ee517e30ded15cdbd27ba72f9e30c8


.. hm. Watchdog GPIO pin 6? What's that?

Next, I tried manually manipulating the watchdog registers but nothing actually happened.

Then I wondered - what about manipulating the GPIO registers? Maybe there's a hardware reset circuit hooked up to GPIO 6 that needs to be toggled to keep the board from resetting.

Board: ap61
RAM: 0x80000000-0x82000000, [0x8003ddd0-0x80fe1000] available
FLASH: 0xa8000000 - 0xa87e0000, 128 blocks of 0x00010000 bytes each.
== Executing boot script in 2.000 seconds - enter ^C to abort
^C
RedBoot> # set direction of gpio6 to out
RedBoot> mfill -b 0xb1000098 -l 4 -p 0x00000043
RedBoot> x -b 0xb1000098
B1000098: 00 00 00 43 00 00 00 00  00 00 00 00 00 00 00 03  |...C............|
B10000A8: FF EF F7 B9 7D DF 5F FF  00 00 00 00 00 00 00 00  |....}._.........|

RedBoot> # pat gpio6 - set it high, then low.
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002

.. then I manually did this every minute or so.

RedBoot>
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002

.. so, the solution here seems to be to "set gpio6 to be output", then "pat it every 60 seconds."
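For the kernel side, the fix could look something like this - a sketch using the Linux gpiolib and timer APIs of the day, with the pin number taken from the bdshow output above; untested, so treat it as a starting point:

#include <linux/module.h>
#include <linux/gpio.h>
#include <linux/timer.h>
#include <linux/jiffies.h>

#define SPARKY_WDT_GPIO     6
#define SPARKY_WDT_PERIOD   (60 * HZ)

static struct timer_list sparky_wdt_timer;

static void
sparky_wdt_pat(unsigned long unused)
{
    /* Toggle the pin high then low, as done by hand in RedBoot above. */
    gpio_set_value(SPARKY_WDT_GPIO, 1);
    gpio_set_value(SPARKY_WDT_GPIO, 0);
    mod_timer(&sparky_wdt_timer, jiffies + SPARKY_WDT_PERIOD);
}

static int __init
sparky_wdt_init(void)
{
    gpio_request(SPARKY_WDT_GPIO, "sparky-wdt");
    gpio_direction_output(SPARKY_WDT_GPIO, 0);
    setup_timer(&sparky_wdt_timer, sparky_wdt_pat, 0);
    mod_timer(&sparky_wdt_timer, jiffies + SPARKY_WDT_PERIOD);
    return 0;
}
module_init(sparky_wdt_init);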

I hope this helps people bring OpenWRT up on this board finally. There seems to be a few of them out there!

Thursday, March 20, 2014

Adding chipset powersave support to FreeBSD's Atheros driver

I've started adding some basic powersave support to the FreeBSD Atheros ath(4) driver. The NICs support putting parts of the device to sleep to conserve power but.. well, it's tricky.

In order to make things consistent, I need to not do things when the NIC is asleep (for example, doing calibration when the NIC isn't running), and I need to ensure that I force the NIC awake whenever it may be asleep. During normal running, the NIC may have put itself into temporary sleep whilst waiting for some packets from the AP to signal that it needs to wake up. So I also need to force the NIC awake before programming it.

So, before I start down the path of handling the whole dynamic power management stuff, I figured I'd tackle the initial bits - handling powering on the NIC at startup and powering it off when it's not in use. This includes powering it down during device detach and suspend, as well as when all of the VAPs are down.

This is turning out to be slightly more complicated than I'd like it to be.

The first really stupid thing I found was that during the interface down process, the VAP state change from RUN -> INIT would reset the BSS, which included re-programming the slot time. So, I have to wake up the hardware when programming that. It can then go back to sleep when I'm done with it.

Now, there are some issues in the suspend path, with the NIC being marked as asleep while it is being reset - which is confusing, as the NIC should be woken up when ath_reset() is called. So, I'll have to debug those.

The really annoying bit is that if I read a register whilst the silicon is asleep, the reads return 0xDEADBEEF. So if I am storing the register contents anywhere, I'll end up storing and programming a potentially totally invalid value.

There's also some real problems with race conditions. I can put the power state changes behind a lock, but imagine something like this:

* ATH_LOCK; force awake; do something; ATH_UNLOCK .. ATH_LOCK; do some more; put back to sleep; ATH_UNLOCK

Now, if a second thread puts the NIC back to sleep in between those two locked sections, the "do some more" work may occur after the NIC was put to sleep by that second thread. So I have to track whether the NIC is being forced awake by refcounting how many times it's been forced awake, and only put it back to sleep when the refcount hits zero.
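In other words, something like this - the names and the refcount field are illustrative, not the final ath(4) code, and ATH_LOCK is held across both calls:

/* Sketch of the wakeup refcount; illustrative, not the final code. */
static void
ath_power_ref_awake(struct ath_softc *sc)
{
    if (sc->sc_powersave_refcnt++ == 0)
        ath_hal_setpower(sc->sc_ah, HAL_PM_AWAKE);
}

static void
ath_power_unref_awake(struct ath_softc *sc)
{
    /* Only the last unref actually puts the chip back to sleep. */
    if (--sc->sc_powersave_refcnt == 0)
        ath_hal_setpower(sc->sc_ah, HAL_PM_NETWORK_SLEEP);
}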

Once this is all done, I can start down the path of supporting proper network sleep - where the NIC stays asleep, wakes up to listen for beacons and receive frames from the AP, and the driver chooses when to force it awake to do more work. I have to make absolutely sure that I don't queue things like transmitted frames, or add more frames to the receive queue, if the NIC may fall asleep. There's also a mechanism to have a transmitted frame put the NIC to sleep - a bit that says "when this frame is transmitted, transition the NIC back to sleep." I have to go and figure out how that works and implement it.

But for now, let's keep it simple and debug just putting the NIC to sleep when it's not in use.

Monday, March 10, 2014

Porting over the AR8327 support

It's been a while since I posted. I'll post about why that is at some point but for now I figure it's time I wrote up the latest little side project - the Atheros AR8327 switch support.

The AR8327 switch is like the previous generation of Atheros switches, except for a couple of very specific and annoying differences - the register layouts and locations have changed. So it's not just a case of pretending it's an AR8316 with different hardware setup - there's some significant surgery to do. And yes, I did try just ignoring all of that - the switch doesn't come up and pass packets.

So, the first thing was to survey the damage.

The Linux driver (ar8216.c) has a bunch of abstractions that the FreeBSD driver doesn't have. The VLAN operations and VLAN port configuration stuff are all methods in the Linux driver, so that was a good starting point. I stubbed most of the VLAN stuff out (because I really didn't want it to get in the way) - this turned out to be more annoying than I wanted.

Next was the hardware setup path. There's more configurable stuff with the AR8327 - there are two physical ports whose PHY/MAC parameters can be configured for either external or internal connectivity. I just took the code from Linux (which yes, I have permission to relicense under BSD, thanks to the driver authors!) and made it use the defaults from OpenWRT for the DB120. The ports didn't come up properly.

I then realised that I was reading total garbage from the PHY register space, so I went looking at the datasheet and ar8216 driver for some inspiration. Sure enough, the AR8327 has the PHY MDIO bus registers in different locations. So after patching the arswitch PHY routines with this knowledge, the PHYs were probed and attached fine. Great. But it still didn't detect port status changes.

So, back to the ar8216 driver. It turns out a few things weren't methodized - specifically, the bits that read the PHY status from the switch. Neither driver polls the PHYs directly; they read the switch registers, which hold a summary of the port status. So, I taught the driver about this and voila! Port status changes worked.

But, no traffic.

Well, there are a few reasons for this. It's a switch, so I don't have to set up anything terribly difficult. The trick here is to enable port learning and make sure the ports are all in the same VLAN group. Now, here's where I screwed up and found a bug that needed working around.

The port setup code did enable learning and put things into a vlan group.

Firstly, I found this odd behaviour that I got traffic only when I switched the ethernet cable to another port. Then learning worked fine. I then found that the ar8216 driver actually triggers a forwarding table flush upon port status change, so I added that. This fixed that behaviour.

But then it was flooding traffic to all ports. This is kinda stupid. What did I screw up? I had put each port in a separate vlangroup, rather than putting them all in the same vlangroup. Then, I programmed the "which ports can you see?" mask to include all the other ports. What this meant was:
  • The forwarding table (ie, what addresses were learnt) was linked to the vlangroup the port is in;
  • .. and when the switch did a lookup for a given MAC on another port, it wouldn't find it, as the address in the forwarding table showed it was for another vlangroup;
  • .. so it would do what switches do when faced with not knowing about the MAC (well, and how I had configured it) - it flooded traffic.
The solution was thankfully easy - I just had to change the vlangroup (well, "port vlan" here) to be '1', instead of the port id. Once this was done, all the ports came up perfectly and things worked great.

So, this now works great on the Atheros DB120 reference board. It's not working on other boards - there's likely some timing issues that need to be resolved. But we're making progress!

Finally, I spent a bunch of time porting over the port configuration and LED configuration stuff from OpenWRT so I didn't have the driver just hard-coded to the DB120 board. I'll update the configuration and code when I get my hands on other boards that use the AR8327 but for now this is all I have.

Enjoy!

Saturday, January 11, 2014

Hacking on Mindwave for fun and .. fun

Allison (and others, like a game developer named Lat) showed interest in these Neurosky Mindwave headsets. They're little wireless (bluetooth, almost!) headsets that ship with a cheap USB dongle and expose their data via a binary protocol.

The protocol is not consistently or well documented. It's out there, if you can craft the right search queries. For the USB widget, you need to implement the basic handshake commands to connect to a given (or any) headset. Then you also need to implement the data decoding for the raw and processed data.

Now, I don't want to go into the details - you can read the documentation and my very bad, hacked up code.

The USB dongle didn't work with FreeBSD-9.x. It's a cheap chipset (CH341) and it just wouldn't transmit. It works fine on FreeBSD-HEAD though.

So, to explore it, I wrote a simple, hackish library to encapsulate the pairing, parsing and data gathering. It needs a lot of improvement but it's there. Then, I (re-)learnt enough SDL and OpenGL to plot some data points. Finally, I grabbed an FFT library to poke at the returned data to see if it makes sense.

A few points thus far.

I still haven't found any correlation between what I'm doing and the attention / meditation parameters the firmware returns. For the most part, you just have to stop any kind of muscular movement.

The raw values clip very easily with any kind of muscular movement. I can see how to decode say, "blink" as a muscular action though.

I've only started looking at the raw FFT results. Hopefully with a bit of filtering I'll see things that actually look like basic EEG results, or I'll concede these things are expensive muscular reaction devices.

The code:

http://github.com/erikarn/mindwave

And the obligatory screenshot: