I know that there's some collective paranoia here in the United States and in the security community in general about government and corporation spying leading to backdoors in equipment. And yes, it's likely very true in a lot of situations. The last two years of public information leaks testify to said collective paranoia.
There have been a few writeups lately about backdoors in wireless equipment. Here's something from The Hacker News - http://thehackernews.com/2014/04/router-manufacturers-secretly-added-tcp.html . A helping hand? To the NSA? Perhaps. Does the NSA know about this stuff? Of course. I'd also not be surprised if they were actively using it.
But there aren't any other, less paranoid reasons out there in the articles. So let me put one out there, based on an 18 month stint at a hardware company that makes wireless chips. These manufacturers have a whole bunch of development testing, regression testing, certification testing and factory testing that goes on. Instead of building separate images to test versus ship (which may actually be against the regulatory certification rules!) they just leave a bunch of these remote execution backdoors in their product.
I think it's highly likely it's just very sloppy security, sloppy code design, sloppy quality control and sloppy development.
Just to be clear - the Atheros default software (as far as I'm aware) didn't have these hooks in it. All the AP firmware interfaces I played with at Atheros required authentication or manually starting things before you could use them. No one these days ships the default Atheros development firmware on their product. This looks like all extra code that vendors have layered on top of things.
These companies should do a better job at their product development. But given the cutthroat pricing, cheap development and ridiculous product lifecycles, are you really surprised that the result has corners cut?
(That's why I run FreeBSD-HEAD on my kit here at home.)
Sunday, April 27, 2014
Monday, March 31, 2014
Meraki Sparky boards, and constant resetting
There's a Mesh internet project at Sudo Room and they've been doing some great work getting a platform up and running. However, like a lot of volunteer projects, they're working with whatever time and equipment they've been donated.
A few months ago they were donated a few hundred Meraki Sparky boards. They're an Atheros AR2317 SoC based device with an integrated 2.4GHz 802.11bg radio, 10/100 ethernet and.. well, a hardware watchdog that resets the board after five minutes.
Now, annoyingly, this reset occurs inside of RedBoot too - which means the boards can't be (fully) flashed before the unit reboots. Once the unit was flashed with OpenWRT, the unit still rebooted every five minutes.
So, I started down the path of trying to debug this.
What did I know?
Firstly, the AR2317 watchdog doesn't have a way of resetting things itself - instead, all it can do is post an interrupt. The AR7161 and later SoCs do indeed have a way to do a full hardware reset when the watchdog fires.
Secondly, RedBoot has a few tricksy ways to manipulate the hardware:
- 'x' can examine registers. Since we need to access them via KSEG1 (unmapped, uncached), the reset registers at 0x11000xxx become 0xb1000xxx. Since it's hardware access, we should do them as DWORDs and not bytes.
- 'mfill' can be used to write to registers.
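For reference, the KSEG1 translation in the list above is just a fixed window on 32-bit MIPS. Here's a minimal sketch in C; the helper name phys_to_kseg1 is made up for illustration and isn't something RedBoot provides.

```c
#include <assert.h>
#include <stdint.h>

/* Base of the MIPS32 KSEG1 window: unmapped, uncached. */
#define KSEG1_BASE	0xa0000000u

/*
 * Hypothetical helper: turn a physical address in the low 512MB into
 * the KSEG1 virtual address to hand to RedBoot's 'x' and 'mfill'.
 */
static inline uint32_t
phys_to_kseg1(uint32_t paddr)
{
	/* KSEG1 only covers the first 512MB of physical address space. */
	assert(paddr < 0x20000000u);
	return (paddr | KSEG1_BASE);
}
```

So a register at physical 0x11000098 ends up at 0xb1000098.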
Thirdly, there's an Atheros specific command - bdshow - which is surprisingly informative:
RedBoot> bdshow
name: Meraki Outdoor 1.0
magic: 35333131
cksum: 2a1b
rev: 10
major: 1
minor: 0
pciid: 0013
wlan0: yes 00:18:0a:50:7b:ae
wlan1: no 00:00:00:00:00:00
enet0: yes 00:18:0a:50:7b:ae
enet1: no 00:00:00:00:00:00
uart0: yes
sysled: no, gpio 0
factory: no, gpio 0
serclk: internal
cpufreq: calculated 184000000 Hz
sysfreq: calculated 92000000 Hz
memcap: disabled
watchdg: disabled (WARNING: for debugging only!)
serialNo: Q2AJYS5XMYZ8
Watchdog Gpio pin: 6
secret number: e2f019a200ee517e30ded15cdbd27b a72f9e30c8
.. hm. Watchdog GPIO pin 6? What's that?
Next, I tried manually manipulating the watchdog registers but nothing actually happened.
Then I wondered - what about manipulating the GPIO registers? Maybe there's a hardware reset circuit hooked up to GPIO 6 that needs to be toggled to keep the board from resetting.
Board: ap61
RAM: 0x80000000-0x82000000, [0x8003ddd0-0x80fe1000] available
FLASH: 0xa8000000 - 0xa87e0000, 128 blocks of 0x00010000 bytes each.
== Executing boot script in 2.000 seconds - enter ^C to abort
^C
RedBoot> # set direction of gpio6 to out
RedBoot> mfill -b 0xb1000098 -l 4 -p 0x00000043
RedBoot> x -b 0xb1000098
B1000098: 00 00 00 43 00 00 00 00 00 00 00 00 00 00 00 03 |...C............|
B10000A8: FF EF F7 B9 7D DF 5F FF 00 00 00 00 00 00 00 00 |....}._.........|
RedBoot> # pat gpio6 - set it high, then low.
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002
.. then I manually did this every minute or so.
RedBoot>
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002
.. so, the solution here seems to be to "set gpio6 to be output", then "pat it every 60 seconds."
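If you'd rather do this from code than by hand, the session above boils down to something like the following. It's only a sketch: the addresses and bit 6 come from the RedBoot session, but the function names are invented, and "a set bit in the direction register means output" is my reading of the 0x43 write rather than something from a datasheet. It also does a read-modify-write instead of writing the whole register like mfill does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register addresses and GPIO bit taken from the RedBoot session. */
#define GPIO_OUT_REG	0xb1000090u
#define GPIO_DIR_REG	0xb1000098u
#define WDOG_GPIO_BIT	(1u << 6)

/* Assumption: a set bit in the direction register means "output". */
static void
wdog_gpio_init(volatile uint32_t *dir)
{
	assert(dir != NULL);
	*dir |= WDOG_GPIO_BIT;
}

/* Pat the external watchdog: drive GPIO 6 high, then low again. */
static void
wdog_gpio_pat(volatile uint32_t *out)
{
	assert(out != NULL);
	*out |= WDOG_GPIO_BIT;
	*out &= ~WDOG_GPIO_BIT;
}
```

On the board you'd pass (volatile uint32_t *)(uintptr_t)GPIO_DIR_REG and GPIO_OUT_REG; the functions take pointers so the logic can be exercised on a host with plain variables. Then call wdog_gpio_pat() from a timer roughly once a minute.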
I hope this helps people bring OpenWRT up on this board finally. There seem to be a few of them out there!
Thursday, March 20, 2014
Adding chipset powersave support to FreeBSD's Atheros driver
I've started adding some basic powersave support to the FreeBSD Atheros ath(4) driver. The NICs support putting parts of the device to sleep to conserve power but.. well, it's tricky.
In order to make things consistent, I need to either not do things when the NIC is asleep (for example, not doing calibration when the NIC isn't running) or force the NIC awake first whenever it may be asleep. During normal running, the NIC may have put itself into temporary sleep whilst waiting for some packets from the AP to signal that it needs to wake up. So I will also need to force the NIC awake before programming it.
So, before I start down the path of handling the whole dynamic power management stuff, I figured I'd tackle the initial bits - handling powering on the NIC at startup and powering it off when it's not in use. This includes powering it down during device detach and suspend, as well as when all of the VAPs are down.
This is turning out to be slightly more complicated than I'd like it to be.
The first really stupid thing I found was that during the interface down process, the VAP state change from RUN -> INIT would reset the BSS, which included re-programming the slot time. So, I have to wake up the hardware when programming that. It can then go back to sleep when I'm done with it.
Now there are some issues in the suspend path with the NIC being marked as asleep when it is being reset, which is confusing - the NIC should be woken up when ath_reset() is called. So, I'll have to debug these.
The really annoying bit is that if I read a register whilst the silicon is asleep, the reads return 0xDEADBEEF. So if I am storing the register contents anywhere, I'll end up storing and programming a potentially totally invalid value.
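One cheap way to catch this while debugging is to refuse to cache anything that looks like a read from sleeping hardware. This is a sketch only - the helper name is hypothetical, it isn't what ath(4) does, and a register could in principle legitimately hold this value, so treat it as a tripwire rather than a fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value the reads return when the silicon is asleep. */
#define REG_ASLEEP_PATTERN	0xdeadbeefu

/*
 * Hypothetical helper: only update a cached register value if the
 * read doesn't look like it came from sleeping hardware.
 * Returns true if *cache was updated.
 */
static bool
reg_cache_update(uint32_t *cache, uint32_t val)
{
	assert(cache != NULL);
	if (val == REG_ASLEEP_PATTERN)
		return (false);	/* probably asleep; keep the old value */
	*cache = val;
	return (true);
}
```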
There are also some real problems with race conditions. I can put the power state changes behind a lock, but imagine something like this:
* ATH_LOCK; force awake; do something; ATH_UNLOCK .. ATH_LOCK; do some more; put back to sleep; ATH_UNLOCK
Now, if a second thread puts the NIC back to sleep in between those two lock sections, the second "do some more" work may occur after the NIC has been put to sleep by said second thread. So I have to correctly track whether the NIC is being forced awake by refcounting how many times it's being forced awake, and only put it back to sleep when the refcount hits zero.
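The refcounting idea can be sketched in userland C like this. To be clear, it's an illustration under my own assumptions: struct nic_power and the function names are invented, a pthread mutex stands in for ATH_LOCK, and the real driver would program the chip where the comments say so.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for driver state; not the real ath(4) softc. */
struct nic_power {
	pthread_mutex_t	lock;		/* plays the role of ATH_LOCK */
	int		awake_refs;	/* outstanding force-awake requests */
	bool		awake;		/* current chip power state */
};

static void
nic_power_init(struct nic_power *np)
{
	pthread_mutex_init(&np->lock, NULL);
	np->awake_refs = 0;
	np->awake = false;
}

/* Force the NIC awake; it stays awake until every caller releases it. */
static void
nic_force_awake(struct nic_power *np)
{
	pthread_mutex_lock(&np->lock);
	if (np->awake_refs++ == 0)
		np->awake = true;	/* real code would wake the chip here */
	pthread_mutex_unlock(&np->lock);
}

/* Drop one reference; only the last release puts the NIC back to sleep. */
static void
nic_release_awake(struct nic_power *np)
{
	pthread_mutex_lock(&np->lock);
	assert(np->awake_refs > 0);
	if (--np->awake_refs == 0)
		np->awake = false;	/* real code would sleep the chip here */
	pthread_mutex_unlock(&np->lock);
}
```

With this, a second thread calling nic_release_awake() in the gap between the two locked sections only drops its own reference, so the first thread's "do some more" still runs against an awake NIC.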
Once this is all done, I can start down the path of supporting proper network sleep - where the NIC stays asleep and wakes up to listen for beacons and received frames from the AP. I then choose to force the NIC awake and do more work. I have to make absolutely sure that I don't queue things like transmitted frames or add more frames to the receive queue if it may fall asleep. There are also some mechanisms to have a transmit frame put the NIC to sleep - there's a bit that says "when this frame is transmitted, transition the NIC back to sleep." I have to go and figure out how that works and implement that.
But for now, let's keep it simple and debug just putting the NIC to sleep when it's not in use.
Monday, March 10, 2014
Porting over the AR8327 support
It's been a while since I posted. I'll post about why that is at some point but for now I figure it's time I wrote up the latest little side project - the Atheros AR8327 switch support.
The AR8327 switch is like the previous generation Atheros switches except for a couple of very specific and annoying differences - the register layouts and locations have changed. So it's not just a case of pretending it's an AR8316 except for the hardware setup - there's some significant surgery to do. And yes, I did try just ignoring all of that - the switch doesn't come up and pass packets.
So, the first thing was to survey the damage.
The Linux driver (ar8216.c) has a bunch of abstractions that the FreeBSD driver doesn't have. The VLAN operations and VLAN port configuration stuff is all methods in the Linux driver, so that was a good starting point. I stubbed most of the VLAN stuff out (because I really didn't want it to get in the way) - this turned out to be more annoying than I wanted.
Next was the hardware setup path. There's more configurable stuff with the AR8327 - there's two physical ports that I can configure the PHY/MAC parameters on for either external or internal connectivity. I just took the code from Linux (which yes, I have permission to relicence under BSD, thanks to the driver authors!) and I made it use the defaults from OpenWRT for the DB120. The ports didn't properly come up.
I then realised that I was reading total garbage from the PHY register space, so I went looking at the datasheet and ar8216 driver for some inspiration. Sure enough, the AR8327 has the PHY MDIO bus registers in different locations. So after patching the arswitch PHY routines with this knowledge, the PHYs were probed and attached fine. Great. But it still didn't detect port status changes.
So, back to the ar8216 driver. It turns out that there were a few things that weren't methodized - and these were the bits that read the PHY status from the switch. Neither driver just polled the PHYs directly - they read the switch registers which had a summary of the port status. So, I taught the driver about this and voila! Port status changes worked.
But, no traffic.
Well, there's a few reasons for this. It's a switch, so I don't have to set up anything terribly difficult. The trick here is to enable port learning and make sure they're all in the same VLAN group. Now, here's where I screwed up and I found a bug that needed working around.
The port setup code did enable learning and put things into a vlan group.
Firstly, I found this odd behaviour that I got traffic only when I switched the ethernet cable to another port. Then learning worked fine. I then found that the ar8216 driver actually triggers a forwarding table flush upon port status change, so I added that. This fixed that behaviour.
But then it was flooding traffic to all ports. This is kinda stupid. What did I screw up? I put each port in a separate vlangroup, rather than put them in the same vlangroup. Then, I programmed the "which ports can you see?" to include all the other ports. What this meant was:
- The forwarding table entries (ie, what addresses were learnt) were linked to the vlangroup the port is in;
- .. and when the switch did a lookup for a given MAC on another port, it wouldn't find it, as the address in the forwarding table showed it was for another vlangroup;
- .. so it would do what switches do when faced with not knowing about the MAC (well, and how I had configured it) - it flooded traffic.
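To make that concrete, here's a toy model of a forwarding table keyed on (MAC, vlangroup). It's a sketch of the behaviour described above, not of the AR8327's actual table format - all the names and the fixed-size array are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FDB_SIZE	16
#define FDB_FLOOD	(-1)	/* lookup miss: the switch floods the frame */

/* Toy forwarding table entry; real switch hardware will differ. */
struct fdb_entry {
	uint8_t	mac[6];
	int	vlangroup;
	int	port;
};

struct fdb {
	struct fdb_entry	e[FDB_SIZE];
	int			n;
};

/* Learn a source MAC, tagged with the vlangroup of the ingress port. */
static void
fdb_learn(struct fdb *f, const uint8_t mac[6], int vlangroup, int port)
{
	assert(port >= 0);	/* -1 is reserved for "flood" */
	if (f->n >= FDB_SIZE)
		return;		/* table full; it's a toy, just drop it */
	memcpy(f->e[f->n].mac, mac, 6);
	f->e[f->n].vlangroup = vlangroup;
	f->e[f->n].port = port;
	f->n++;
}

/* Look up a destination MAC; the match is on (MAC, vlangroup). */
static int
fdb_lookup(const struct fdb *f, const uint8_t mac[6], int vlangroup)
{
	for (int i = 0; i < f->n; i++) {
		if (f->e[i].vlangroup == vlangroup &&
		    memcmp(f->e[i].mac, mac, 6) == 0)
			return (f->e[i].port);
	}
	return (FDB_FLOOD);
}
```

Learn a MAC on port 1 in vlangroup 1, then look it up from a port in vlangroup 2: the lookup misses and the frame is flooded, even though the switch "knows" the address.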
So, this now works great on the Atheros DB120 reference board. It's not working on other boards - there's likely some timing issues that need to be resolved. But we're making progress!
Finally, I spent a bunch of time porting over the port configuration and LED configuration stuff from OpenWRT so I didn't have the driver just hard-coded to the DB120 board. I'll update the configuration and code when I get my hands on other boards that use the AR8327 but for now this is all I have.
Enjoy!
Saturday, January 11, 2014
Hacking on Mindwave for fun and .. fun
Allison (and others, like a game developer named Lat) showed interest in these Neurosky Mindwave headsets. They're little wireless (bluetooth, almost!) headsets that ship with a cheap USB dongle and expose their data via a binary protocol.
The protocol is not consistently or well documented. It's out there, if you can craft the right search queries. For the USB widget, you need to implement the basic handshake commands to attempt to connect to a given (or any) headset. Then you also need to implement the data decoding for the raw and processed data.
Now, I don't want to go into the details - you can read the documentation and my very bad, hacked up code.
The USB dongle didn't work with FreeBSD-9.x. It's a cheap chipset (CH341) and it just wouldn't transmit. It works fine on FreeBSD-HEAD though.
So, to explore it, I wrote a simple, hackish library to encapsulate pairing, parsing and data gathering. It needs a lot of improvement but it's there. Then, I (re-)learnt enough SDL and OpenGL to plot some data points. Finally, I grabbed an FFT library to poke at the returned data to see if it makes sense.
A few points thus far.
I still haven't found any correlation with the attention / meditation parameters the firmware returns. For the most part, you just have to stop any kind of muscular movements.
The raw values clip very easily with any kind of muscular movement. I can see how to decode say, "blink" as a muscular action though.
I've only started looking at the raw FFT results. Hopefully with a bit of filtering I'll see things that actually look like basic EEG results, or I'll concede these things are expensive muscular reaction devices.
The code:
http://github.com/erikarn/mindwave
And the obligatory screenshot:
Saturday, December 14, 2013
Experimenting with zero-copy network IO in FreeBSD-HEAD
Back when I started all of this networking hacking, the "big thing" was the overhead of doing poll() and select(). Various operating systems came up with ways of eliminating these - FreeBSD grew the kqueue infrastructure; Linux received epoll; Solaris received an epoll-like device and then ended up with some form of kqueue-like event mechanism. Windows has completion ports/overlapped IO which combined the event mechanism with a zero-copy way of doing network IO.
So the Free/Open operating systems have scalable event notification mechanisms for handling large numbers of concurrent sockets but they don't all have some nice, efficient way of doing zero-copy network IO.
Linux has splice()/tee()/vmsplice(). So yes, it effectively does have a way of doing zero-copy socket reading and writing.
OpenBSD does have a splice style syscall to copy data from a source to a destination TCP socket.
FreeBSD, however, has mostly focused on the "disk to network" path for content serving and thus has a lot of time invested in their sendfile() implementation. This is great if you're doing a lot of file to network sending (which Netflix does), but it has some serious shortcomings. The main one I'll address here is the lack of being able to do general zero-copy socket writes from userland. So it can only send data from disk files to the network. You can't implement a zero-copy intermediary proxy server, nor a memory cache that keeps things in pre-allocated memory regions. You have to use disk files (whether that be a real filesystem on disks, or a memory filesystem) and leverage VM hints to control caching.
Recently there was some new sendfile() work to allow sending from POSIX shared memory segments. This intrigued me - it's not the most effective way of doing zero-copy network IO from userland but it's a start. So I set off to write an updated version of my network library from yesteryear to implement some massively parallel network applications with.
The idea is simple - you allocate a POSIX shared memory segment. You then mmap() that region into memory and treat it as a place to allocate write-side network buffers from. Then you use the shared memory filedescriptor and offset to schedule a sendfile() from the shared memory segment to the destination network socket. It's not as elegant as having a write path that wires the memory down and just populates mbufs from that, but that'll come later.
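Roughly, the write side looks like the sketch below. It assumes the segment has already been created with shm_open(), sized with ftruncate() and mapped with mmap(); struct shm_pool and both functions are hypothetical names for illustration, not the libiapp API. The sendfile() call uses the FreeBSD signature, so that part is only compiled on FreeBSD. Note the allocator never reuses buffers - it has no way of knowing when the kernel is done with them.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#ifdef __FreeBSD__
#include <sys/socket.h>
#include <sys/uio.h>
#endif

/* Hypothetical bump allocator over one mmap()ed shared memory segment. */
struct shm_pool {
	int	fd;	/* descriptor returned by shm_open() */
	char	*base;	/* address returned by mmap() on fd */
	size_t	size;	/* segment length */
	size_t	next;	/* offset of the next free byte */
};

/*
 * Reserve len bytes for a write buffer. Returns the offset inside the
 * segment (also usable as the sendfile() offset), or -1 if it is full.
 * The caller fills in p->base + offset before sending.
 */
static off_t
shm_pool_alloc(struct shm_pool *p, size_t len)
{
	off_t off;

	assert(p->next <= p->size);
	if (len > p->size - p->next)
		return (-1);
	off = (off_t)p->next;
	p->next += len;
	return (off);
}

#ifdef __FreeBSD__
/* Queue a previously filled buffer on a socket via FreeBSD's sendfile(2). */
static int
shm_pool_send(const struct shm_pool *p, int sock, off_t off, size_t len,
    off_t *sent)
{
	return (sendfile(p->fd, sock, off, len, NULL, sent, 0));
}
#endif
```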
Here's what I found.
Firstly, there's no asynchronous "I'm done!" notification for the sendfile path. So you have no explicit notification that the underlying memory has been freed so you can reuse it. sendfile() has the SF_SYNC flag which causes it to sleep until the transaction is done - primarily so users can be sure they can change the underlying file contents after the syscall completes. This is used by caches such as Varnish that leverage on-disk files as their cache filesystem space.
So I've been adding that. I have a working prototype that is scaling quite well under load and I'll look to commit it to FreeBSD-HEAD soon. It posts a knote to a kqueue filedescriptor once a transaction has completed.
Once that was done, I started benchmarking the performance of this setup.
The first real roadblock I hit was massive VM contention on the shared memory segment. It turns out that a single POSIX shared memory segment is represented as a single vm_object and this is protected by a single lock. So when 8 threads are actively doing IO from the same shared memory segment it hits massive lock contention. I fixed this in my test suite by allocating one shared memory segment per thread. It's not elegant but it works well enough for benchmarking.
I next hit issues with contention on the VM page lists. Besides the per-object list, there's also a global per-type list (active, inactive, etc.) There's one lock protecting each of these lists. What I found was the VM was shuffling pages between active/inactive and at the traffic rates I was doing (20+Gbit/sec) it was a few hundred thousand pages a second being shuffled around. The solution? mlock() the whole region into memory. This prevented the VM from having the pages change state so often and eliminated that overhead.
The code for doing this sendfile() work with POSIX shared memory is in my libiapp code - http://github.com/erikarn/libiapp . It's terrible and hacky - I'm just experimenting with things for now. But with some tuning, I can get a good 35Gbit/sec out of 70,000 active TCP sockets. There's still a long way to go - I shouldn't be saturating an 8-core CPU with this traffic level when I'm doing no socket data copies. I'll write another update or two about that soon.
Now, what would I like to see? I did some experiments with physical disk IO using the FreeBSD AIO paths doing the same kinds of IO patterns as I am doing with network socket IO (4KiB to 64KiB random disk reads.) It turns out if you do everything correctly, the FreeBSD AIO code will turn physical disk IO into asynchronous disk buffer transactions by wiring the userland buffer into memory and then using that as the backing buffer memory. The overhead of doing the pmap work for this was not too high. So, I wonder if it's worth writing a new transmit path that uses the pmap code (and not the VM!) to wire in a region of memory and then use that for transmit buffers. Combined with an iovec style array of buffers and the above kqueue notification of the network IO completion, I think we can end up with a much more flexible method of doing network IO from userland, without the shortcomings of using POSIX shared memory with sendfile().
So the Free/Open operating systems have scalable event notification mechanisms for handling large numbers of concurrent sockets but they don't all have some nice, efficient way of doing zero-copy network IO.
Linux has splice()/tee()/vmsplice(). So yes, it effectively does have a way of doing zero-copy socket reading and writing.
OpenBSD does have a splice style syscall to copy data from a source to a destination TCP socket.
FreeBSD, however, has mostly focused on the "disk to network" path for content serving and thus has a lot of time invested in their sendfile() implementation. This is great if you're doing a lot of file to network sending (which Netflix does), but it has some serious shortcomings. The main one I'll address here is the lack of being able to do general zero-copy socket writes from userland. So it can only send data from disk files to the network. You can't implement a zero-copy intermediary proxy server, nor a memory cache that keeps things in pre-allocated memory regions. You have to use disk files (whether that be a real filesystem on disks, or a memory filesystem) and leverage VM hints to control caching.
Recently there was some new sendfile() work to allow sending from POSIX shared memory segments. This intrigued me - it's not the most effective way of doing zero-copy network IO from userland but it's a start. So I set off to write an updated version of my network library from yesteryear to implement some massively parallel network applications with.
The idea is simple - you allocate a POSIX shared memory segment. You then mmap() that region into memory and treat it as a place to allocate write-side network buffers from. Then you use the shared memory filedescriptor and offset to schedule a sendfile() from the shared memory segment to the destination network socket. It's not as elegant as having a write path that wires the memory down and just populates mbufs from that, but that'll come later.
Here's what I found.
Firstly, there's no asynchronous "I'm done!" notification for the sendfile path. So you have no explicit notification that the underlying memory has been freed so you can reuse it. sendfile() has the SF_SYNC flag which causes it to sleep until the transaction is done - primarily so users can be sure they can change the underlying file contents after the syscall completes. This is used by caches such as Varnish that leverage on-disk files as their cache filesystem space.
So I've been adding that. I have a working prototype that is scaling quite well under load and I'll look to commit it to FreeBSD-HEAD soon. It posts a knote to a kqueue file descriptor once a transaction has completed.
Once that was done, I started benchmarking the performance of this setup.
The first real roadblock I hit was massive VM contention on the shared memory segment. It turns out that a single POSIX shared memory segment is represented as a single vm_object and this is protected by a single lock. So when 8 threads are actively doing IO from the same shared memory segment it hits massive lock contention. I fixed this in my test suite by allocating one shared memory segment per thread. It's not elegant but it works well enough for benchmarking.
I next hit issues with contention on the VM page lists. Besides the per-object list, there's also a global per-type list (active, inactive, etc.). There's one lock protecting each of these lists. What I found was that the VM was shuffling pages between active and inactive, and at the traffic rates I was doing (20+ Gbit/sec) a few hundred thousand pages a second were being shuffled around. The solution? mlock() the whole region into memory. This prevented the VM from changing the pages' state so often and eliminated that overhead.
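To make the mlock() step concrete, here is a minimal sketch that wires one already-mapped region per thread. The `thread_region` struct and `wire_thread_regions()` are hypothetical names, not anything from libiapp, and the sketch assumes each region has already been mmap()ed from its own shared memory segment.

```c
#if defined(__linux__)
#define _GNU_SOURCE             /* expose MAP_ANON and friends under strict -std */
#endif
#include <sys/types.h>
#include <sys/mman.h>
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical bookkeeping: one mapped buffer region per IO thread. */
struct thread_region {
	void	*base;	/* start of the mmap()ed segment */
	size_t	 len;	/* length of the mapping in bytes */
};

/*
 * Wire every thread's region into physical memory so the VM stops moving
 * its pages between the active and inactive lists.
 * Returns 0 on success, or an errno value after unwiring whatever had
 * already been locked.
 */
static int
wire_thread_regions(struct thread_region *r, int n)
{
	assert(r != NULL && n >= 0);
	for (int i = 0; i < n; i++) {
		if (mlock(r[i].base, r[i].len) != 0) {
			int saved = errno;	/* munlock() may clobber errno */

			while (--i >= 0)
				(void)munlock(r[i].base, r[i].len);
			return (saved);
		}
	}
	return (0);
}
```

Wired memory counts against RLIMIT_MEMLOCK, so a large buffer pool will probably need that limit raised (or the process run with enough privilege) before the mlock() calls succeed.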
The code for doing this sendfile() work with POSIX shared memory is in my libiapp code - http://github.com/erikarn/libiapp . It's terrible and hacky - I'm just experimenting with things for now. But with some tuning, I can get a good 35 Gbit/sec out of 70,000 active TCP sockets. There's still a long way to go - I shouldn't be saturating an 8-core CPU with this traffic level when I'm doing no socket data copies. I'll write another update or two about that soon.
Now, what would I like to see? I did some experiments with physical disk IO using the FreeBSD AIO paths doing the same kinds of IO patterns as I am doing with network socket IO (4KiB to 64KiB random disk reads.) It turns out if you do everything correctly, the FreeBSD AIO code will turn physical disk IO into asynchronous disk buffer transactions by wiring the userland buffer into memory and then using that as the backing buffer memory. The overhead of doing the pmap work for this was not too high. So, I wonder if it's worth writing a new transmit path that uses the pmap code (and not the VM!) to wire in a region of memory and then use that for transmit buffers. Combined with an iovec-style array of buffers and the above kqueue notification of the network IO completion, I think we can end up with a much more flexible method of doing network IO from userland without the shortcomings of using POSIX shared memory with sendfile().
Sunday, November 3, 2013
Doing arduino development on FreeBSD-HEAD
I'm a sucker for punishment.
Or, I noticed that FreeBSD's pkgng binary package repository ships with a port of the Arduino development environment. It's this Java thing that wraps around avr-gcc and avrdude. It's very popular, it's open source, and I figured what the hell.
I plugged in my Arduino Leonardo and .. it was detected as a umodem device. Excellent!
.. and then it wasn't. It went away very quickly and came back as a single interface (OK) with three child interfaces (Hm, okay), but only one uhid (human interface) interface active (Not Ok.) The modem port used to program and talk to the thing wasn't there.
I then went on a bit of a journey. I found that quite some work had already been done to correct issues in the FreeBSD USB stack - however, it still wasn't working. It showed up fine - it identified itself as a generic USB serial port device, and yet umodem didn't bind to it.
Next - the umodem source code. Yes, it claimed anything identifying as a USB serial class device - but it only claimed devices that ALSO identified as an AT-class modem. Yes, a serial modem that you speak AT commands to. The Leonardo identifies itself as a USB serial class device but with NO command encoding. umodem didn't like that.
So, to the USB 1.1 standards documentation! After reading the relevant bits, I discovered that the rest of the device handling is the same! I.e., it doesn't matter whether the device says "I speak AT commands" or "I speak no commands", it's still serial. This identifier is just for the upper layer application to decide whether to send AT commands or not.
Thus the fix was simple - also claim devices that say "no commands" as well as "AT commands." That fix is in -HEAD and I hope to try and sneak it into 10.0.
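The shape of that fix can be sketched as a standalone match function. This is not the actual umodem driver code: `iface_desc` and `serial_iface_matches()` are hypothetical names, and the sketch assumes the device presents a Communications class interface with the Abstract Control Model subclass. The numeric codes are the ones the USB CDC specification assigns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Interface codes from the USB CDC class specification. */
#define CDC_CLASS_COMM		0x02	/* Communications interface class */
#define CDC_SUBCLASS_ACM	0x02	/* Abstract Control Model */
#define CDC_PROTO_NONE		0x00	/* no class-specific protocol */
#define CDC_PROTO_AT_V250	0x01	/* AT commands (ITU-T V.250) */

/* Hypothetical cut-down view of a USB interface descriptor. */
struct iface_desc {
	uint8_t	bInterfaceClass;
	uint8_t	bInterfaceSubClass;
	uint8_t	bInterfaceProtocol;
};

/*
 * Decide whether a serial driver should claim this interface.
 * Matching only CDC_PROTO_AT_V250 would skip devices that advertise
 * "no commands"; accepting both protocols treats them the same,
 * since the serial handling underneath is identical.
 */
static bool
serial_iface_matches(const struct iface_desc *id)
{
	assert(id != NULL);
	if (id->bInterfaceClass != CDC_CLASS_COMM ||
	    id->bInterfaceSubClass != CDC_SUBCLASS_ACM)
		return (false);
	return (id->bInterfaceProtocol == CDC_PROTO_AT_V250 ||
	    id->bInterfaceProtocol == CDC_PROTO_NONE);
}
```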
And with that - FreeBSD-HEAD is now a viable development environment for the Arduino Leonardo.
