Sunday, June 28, 2015

RTL-SDR on FreeBSD, or "hey, cool, I live near an airport, I wonder if ADSB works.."

I bought one of those cheap RTL-SDR units a few months ago. There's no real kernel code required for it - all of the rtl-sdr code just uses the generic USB userland API which is shared between many operating systems.

So, getting it going was pretty easy:

# pkg install rtl-sdr

Then, using it to test ADSB is pretty easy:

# rtl_adsb -V -S 

.. this is verbose and listens to short packets.

Where I live (near San Jose Airport!) I receive a lot of ADSB transmissions. It's quite interesting.

Ok, so next - what about something more GUI-like? Someone's already done it - https://github.com/antirez/dump1090 . There's already a package for it:

# pkg install dump1090
# dump1090 --net --aggressive

Then, point a web browser at http://localhost:8080/ and watch!

Sunday, May 17, 2015

freebsd-wifi-build, or "wait, you can run freebsd on atheros MIPS access points? where do I get that?"

I've been running FreeBSD at home as my primary internet/wifi access for a few years now. It's cheap, it's easy to do, and I've tried very hard to wrap up the whole process into a mostly-simple build system that spits out a useful image to use.

It's pretty simple in concept - I take FreeBSD-HEAD, build it with some cut-down options, create a custom filesystem image with some custom boot scripts and a custom configuration file, and provide an image that you can TFTP (using a serial console and ethernet cable) or upload directly to the AP if it supports it.

The supported hardware list is here:

https://github.com/freebsd/freebsd-wifi-build/wiki/Supported-Boards

Now, it's not a huge list like OpenWRT, but that's mostly because I don't have an infinite supply of Atheros MIPS based routers. I think I'll get some of the TP-Link Archer series stuff next.

Building it is pretty simple:

https://github.com/freebsd/freebsd-wifi-build/wiki

You check out the build repo, check out FreeBSD-HEAD, install a couple of packages, and run the build for your board. Once it's done, the images for your board appear in ../tftpboot/. There's a wiki page for each of the supported boards with a walkthrough of how to get FreeBSD going on it.
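
From memory, it boils down to something like this - treat it as a sketch, as the wiki above is authoritative and the board name here is just an example:

$ git clone https://github.com/freebsd/freebsd-wifi-build.git
$ svn checkout https://svn.freebsd.org/base/head head
$ cd freebsd-wifi-build
$ ./build.sh tl-wdr3600
$ ls ../tftpboot/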

It comes up on 192.168.1.20/24 with 'user' and 'root' users and no passwords set. So, the first thing you should do after installation is telnet in, configure /etc/cfg/rc.conf with your actual LAN IPs, set the user/root passwords, and then run 'cfg_save' to save things. Then, reboot and voila!
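
That first-boot session looks something like this (a sketch - the IP and editor are whatever you use):

$ telnet 192.168.1.20
# vi /etc/cfg/rc.conf      (set your LAN IP configuration here)
# passwd root              (and the same for 'user')
# cfg_save
# reboot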

The configuration file format looks like FreeBSD's rc.conf, but it isn't. I'm keeping it somewhat hierarchical-looking in naming but flat in implementation, so I can migrate it to something like a sqlite or luci backend in the future.

https://github.com/freebsd/freebsd-wifi-build/wiki/Config-Overview
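
To give a flavour of it - and note these key names are entirely made up for illustration; the Config-Overview page above has the real ones - entries are flat, shell-style assignments:

lan_ipaddr=192.168.1.20
lan_netmask=255.255.255.0
bridge_0_members="arge0 arge1 wlan0"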

It's good enough for me to be able to set up an AP to be a bridge with a management IP address and configure the ethernet switch. Others have added ipfw support to do NAT and firewalling - I'm going to add configuration rules for NAT, IPFW and routing soon so it's all integrated.

It's FreeBSD, all the way through:

$ uname -a
FreeBSD tl-wdr3600 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r282406M: Wed May  6 22:27:16 PDT 2015     adrian@lucy-11i386:/usr/home/adrian/work/freebsd/head-embedded/obj/mips/mips.mips/usr/home/adrian/work/freebsd/head-embedded/src/sys/TL-WDR4300  mips
$ ifconfig wlan0 list sta
ADDR               AID CHAN RATE RSSI IDLE  TXSEQ  RXSEQ CAPS FLAG   
18:ee:69:15:f4:12    2    1  26M 37.0   45   2703  51888 EPS  AQEHTRM RSN HTCAP WME
04:e5:36:0d:1b:0d    1    1  19M 23.0   15   1524  47072 EPS  AQEPHTR RSN HTCAP WME
cc:3a:61:0e:33:a0    3    1  19M 32.0   30   2585  43072 EPS  AQEPHTR RSN HTCAP WME
40:0e:85:1a:f1:69    4    1  19M 25.0   30   1138  54800 EPS  AQEPHTR RSN HTCAP WME
00:0f:13:97:14:54    5    1  54M 30.0   45   1808  57424 EPS  AE      RSN
00:22:fa:c2:d1:20    6    1  26M 24.5    0    574  57776 EPS  AQEHTRS RSN HTCAP WME

So if you'd like a FreeBSD based device to act as your home gateway, this is where you can start. It's not pfsense, but it's designed to run on things much smaller than pfsense supports and it's a good introduction into the world of FreeBSD embedded.

Friday, April 10, 2015

Intel DDIO, LLC cache, buffer alignment, prefetching, shared locks and packet rates.

I've been digging into the low level behaviour of high throughput packet classification and pushing for my job. The initial suggestions from everyone were "use netmap!" Which was cool, but it only seems to do fast packet work if you're only ever really flipping packets between receive and transmit rings. Once you start actually looking into the payload, you start having to take memory misses and things can slow down quite a bit. An L3 miss (ie, RAM access) on Sandybridge is ~50ns. (There's also costs involved in walking the TLB, but I won't cover that here.)

For background: http://7-cpu.com/cpu/SandyBridge.html .

But! Intel has this magical thing called DDIO. In theory (and there's a lot of theory here), DMA is done via a small (~10%) fraction of LLC (L3) cache, which is shared between all cores. If the data is already in cache when the CPU accesses it, it will be quick. Also, if you then wish to DMA out data from something in cache, it doesn't have to get flushed to memory first - it's just DMAed straight out of cache.

However! When I was doing packet bridge testing (using netmap + bridge, 64 byte payloads), I noticed that I was consuming a significant amount of memory bandwidth. It wasn't quite at the rate of 10G worth of bridged data, but DDIO should be doing almost all of that work for me at 64 byte payloads.

So, to reproduce: run netmap bridge (eg 'bridge -i netmap:ix0 -i netmap:ix1') and run pkt-gen between two nodes.
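
The pkt-gen side looks something like this (a sketch - the exact flags vary a little between netmap versions):

$ pkt-gen -i netmap:ix0 -f tx -l 64     (sender: blast out 64 byte frames)
$ pkt-gen -i netmap:ix1 -f rx           (receiver: count them)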

This is the output of 'pcm-memory.x 1' from the intel-pcm toolkit (which is available as a binary package on FreeBSD.)

---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):    300.68                  --
--                  System Write Throughput(MB/s):    970.81                  --
--                 System Memory Throughput(MB/s):   1271.48                  --
---------------------------------------||---------------------------------------

The first theory - the bridging isn't occurring fast enough to service what's in LLC before it gets flushed out by other packets. So, assume:

  1. It's 1/10th of the LLC - which, on an 8 core * 2.5MB per core setup (20MB of LLC), is ~ 2MB.
  2. 64 byte payloads are being cached.
  3. Perfect (!) LLC use.
That's 32,768 packets at a time (2MB / 64 bytes). Now, netmap is doing ~ 1000 packets a batch and it's keeping up with line rate bridging on one core (~14 million packets per second), so it's not likely that.

Ok, so what if it's not perfect LLC usage?

Then I thought back to cache line aliasing and other issues that I've previously written about. What if the buffers are perfectly aligned (say, 2048 byte aligned)? The cache line aliasing effects should then also manifest themselves as low LLC utilisation.

Luckily netmap has a twiddle - 'dev.netmap.buf_size' / 'dev.netmap.priv_buf_size'. They're both .. 2048. So yes, the default buffer sizes are aligned, and there's likely some very poor LLC utilisation going on.

So, I tried 1920 - that's 2048 - (2 * 64) - ie, two cache lines less than 2048.
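
Changing them is just a pair of sysctls (I believe they take effect when the netmap interface is next opened):

# sysctl dev.netmap.buf_size=1920
# sysctl dev.netmap.priv_buf_size=1920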


---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):    104.92                  --
--                  System Write Throughput(MB/s):    382.32                  --
--                 System Memory Throughput(MB/s):    487.24                  --
---------------------------------------||---------------------------------------

It's now using significantly less memory bandwidth to do the same thing. I'm guessing this is because I'm now using the LLC much more efficiently.

Ok, so that's nice - but what about when it comes time to actually look at the packet contents to make decisions?

I've modified a copy of bridge to do a few things, mostly inspired by netmap-ipfw:
  • It does batch receive from netmap;
  • but it then looks at the ethernet header to decap it;
  • then it gets the IPv4 src/dst addresses;
  • .. and looks them up in a (very large) traditional hash table.
I also have a modified copy of pkt-gen that will use completely random source and destination IPv4 addresses and ports, so as to elicit some very terrible behaviour.

With an empty hash set, but still dereferencing the ethernet header and IPv4 source/destination, handling a packet at a time, no batching, no prefetching and only using one core/thread to run:

buf_size=2048:
  • Bridges about 6.5 million pps;
  • .. maxes out the CPU core;
  • Memory access: 1000MB/sec read; 423MB/sec write (~1400MB/sec in total).
buf_size=1920:
  • Bridges around 10 million pps;
  • 98% of a CPU core;
  • Memory access: 125MB/sec read, 32MB/sec write, ~ 153MB/sec in total.
So, it's a significant drop in memory throughput and a massive increase in pps for a single core.

Ok, so most of the CPU time is now spent looking at the ethernet header in the demux routine and in the hash table lookup. It's a blank hash table, so it's just the memory access needed to see if the bucket has anything in it. I'm guessing it's because the CPU has to load the ethernet and IP headers into a cache line, as they're not already there from DDIO.

I next added in prefetching of the ethernet header. I don't have that version of the code any more, so I can't report numbers at the moment. But what I did there was loop over everything in the netmap RX ring, dereference the ethernet header, and then do the per-packet processing. This was interesting, but I wanted to try out batching next. So, after some significant refactoring, I arranged the code to look like this:
  1. Pull in up to 1024 entries from the netmap receive ring;
  2. Loop through, up to 16 at a time, and place them in a batch;
  3. For each batch:
    1. For each packet in the batch: optional prefetch on the ethernet header;
    2. For each packet in the batch: decapsulate the ethernet/IP header;
    3. For each packet in the batch: optional prefetch on the hash table bucket head;
    4. For each packet in the batch: do the hash table lookup, decide whether to forward/block;
    5. For each packet in the batch: forward (ie, ignore the forward/block for now.)
I had things be optional so I could turn on/off prefetching and control the batch size.
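
As a rough sketch - this isn't the actual code, and hash_bucket() / hash_lookup() / do_forward() are placeholders for the hash table and the netmap TX glue - the batch pipeline looks like:

#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>

#define ETHER_HDR_LEN  14

struct pkt {
        char *buf;              /* frame, straight out of the netmap RX ring */
        uint32_t src, dst;      /* IPv4 addresses (network order) */
        void *bkt;              /* hash bucket head */
        int drop;               /* forward/block decision */
};

void *hash_bucket(uint32_t addr);       /* placeholder: bucket head for an address */
int hash_lookup(uint32_t addr);         /* placeholder: search the bucket */
void do_forward(struct pkt *p);         /* placeholder: netmap TX glue */

void
process_batch(struct pkt *p, int n, int prefetch_eth, int prefetch_bkt)
{
        int i;

        /* Optional prefetch of the ethernet/IP headers. */
        if (prefetch_eth)
                for (i = 0; i < n; i++)
                        __builtin_prefetch(p[i].buf);

        /* Decapsulate ethernet + IPv4; pull out the src/dst addresses. */
        for (i = 0; i < n; i++) {
                struct ip *ip = (struct ip *)(p[i].buf + ETHER_HDR_LEN);
                p[i].src = ip->ip_src.s_addr;
                p[i].dst = ip->ip_dst.s_addr;
        }

        /* Optional prefetch of the hash bucket heads. */
        if (prefetch_bkt)
                for (i = 0; i < n; i++) {
                        p[i].bkt = hash_bucket(p[i].src);
                        __builtin_prefetch(p[i].bkt);
                }

        /* Hash table lookups; decide whether to forward or block. */
        for (i = 0; i < n; i++)
                p[i].drop = !hash_lookup(p[i].src) || !hash_lookup(p[i].dst);

        /* Forward (ignoring the forward/block decision, for now.) */
        for (i = 0; i < n; i++)
                do_forward(&p[i]);
}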

So, with an empty hash table, no prefetching and only changing the batch size, at buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 11.1 million pps;
  • Batch size of 4: 11.7 million pps.
Hm, that's cute. What about with prefetching of the ethernet header? At buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 10.8 million pps;
  • Batch size of 4: 11.5 million pps.
Ok, so that's not that useful. Prefetching on the bucket head here isn't worthwhile, because the buckets are all empty (and thus NULL pointers.)

But, I want to also be doing hash table lookups. I loaded in a reasonably large hash table set (~ 6 million entries), and I absolutely accept that a traditional hash table is not exactly memory or cache footprint happy. I was specifically after what the performance was like for a traditional hash table. Said hash table has 524,288 buckets, and each points to an array of IPv4 addresses to search. So yes, not very optimal by any measure, but it's the kind of thing you'd expect to find in an existing project.
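
Something along these lines - a sketch of the layout, not the exact code, and the hash function here is just masking the low bits:

#include <stdint.h>

#define NBUCKETS        524288          /* a power of two */

struct bucket {
        int nentries;
        uint32_t *addrs;        /* array of IPv4 addresses to linearly search */
};

static struct bucket buckets[NBUCKETS];

int
hash_lookup(uint32_t addr)
{
        struct bucket *b = &buckets[addr & (NBUCKETS - 1)];
        int i;

        for (i = 0; i < b->nentries; i++)
                if (b->addrs[i] == addr)
                        return (1);
        return (0);
}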

With no prefetching, and a 6 million entry hash table:

At 2048 byte buffers:
  • Batch size of 1: 3.7 million pps;
  • Batch size of 2: 4.5 million pps;
  • Batch size of 4: 4.8 million pps.
At 1920 byte buffers:
  • Batch size of 1: 5 million pps;
  • Batch size of 2: 5.6 million pps;
  • Batch size of 4: 5.6 million pps.
That's a very inefficient hash table - each bucket is going to have around 11 IPv4 entries in it (6,000,000 / 524,288 ≈ 11.4), and checking those means touching almost a full cache line worth of IPv4 addresses (11 * 4 bytes = 44 bytes). Not very nice. But, it's within a cache line worth of data, so in theory it's not too terrible.

What about with prefetching? All at 1920 byte buffers:
  • Batch size of 4, ethernet prefetching: 5.5 million pps
  • Batch size of 4, hash bucket prefetching: 7.7 million pps
  • Batch size of 4, ethernet + hash bucket prefetching: 7.5 million pps
So in this instance, there's no real benefit from doing prefetching on both.

For one last test, let's bump the bucket count from 524,288 to 2,097,152 - dropping the average bucket depth from ~11 entries to ~3. These again are all at buf_size=1920:
  • Batch size of 1, no prefetching: 6.1 million pps;
  • Batch size of 2, no prefetching: 7.1 million pps;
  • Batch size of 4, no prefetching: 7.1 million pps;
  • Batch size of 4, hash bucket prefetching: 8.9 million pps.
Now, I didn't quite predict this. I figured that since I was reading in the full cache line anyway, having up to 11 entries in it to linearly check would be cheap. It turns out that no, that's not exactly true.

The difference between the naive way (no prefetching, no batching) and 4-packet batching with hash bucket prefetching is not trivial - it's ~ 50% faster. Going all the way to the larger bucket count was ~75% faster. Now, this hash implementation is not exactly cache footprint friendly - it's bigger than the LLC, so with random flows and thus no real useful cache behaviour it's going to degrade to quite a few memory accesses.

This has been quite a fun trip down the optimisation peephole. I'm going to spend a bunch of time writing down the hardware performance counters involved in analysing this stuff and I'll look to write a follow-up post with details about that.

One final thing: threads and locking. I wanted to clearly demonstrate the cost of shared read locks on a setup like this. There's been lots of discussion about the right kind of locking and concurrency strategies, so I figured I'd just do a simple test in this setup and explain how terrible it can get.

So, no read-locks between threads on the hash table, batch size of 4, hash bucket prefetching, buf_size=1920:
  • 1 thread: 8.9 million pps;
  • 4 threads: 12 million pps.
But with a read lock on the hash table lookups:
  • 1 thread: 7 million pps;
  • 4 threads: 4.7 million pps.
I'm guessing that as I add more threads, the performance will drop.

Even taking a rwlock as a reader lock in pthreads is expensive - it's purely just an atomic increment/decrement in FreeBSD, but it's still not free. I'm getting the lock once for two hash table lookups - ie, the source and destination IP hash table lookups are done under one lock. I'm sure if I took the lock for the whole batch hash table lookup it'd work out a little better on a small number of CPU cores, but I think this demonstrates my point - read locks aren't going to cut it when you have a frequently accessed thing to protect.
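
The locking shape, as a sketch (equivalent to what the test does - one reader lock per packet, covering both lookups):

#include <pthread.h>
#include <stdint.h>

int hash_lookup(uint32_t addr);         /* placeholder, as before */

static pthread_rwlock_t tbl_lock = PTHREAD_RWLOCK_INITIALIZER;

int
classify(uint32_t src, uint32_t dst)
{
        int permit;

        /* One reader lock amortised over the two lookups; taking it
         * once per batch instead would amortise it a little further. */
        pthread_rwlock_rdlock(&tbl_lock);
        permit = hash_lookup(src) && hash_lookup(dst);
        pthread_rwlock_unlock(&tbl_lock);

        return (permit);
}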

The best bit about this post? The prefetching, terrible (large) hash table performance and general cache abuse is not new. Doing batching on superscalar Intel CPUs is not new. Documenting DDIO effectiveness using non-power-of-two-aligned buffer sizes is new, but it's just a rehash of the existing cache aliasing effect. But, I now have a little test bed to experiment with these things without having to try and involve the rest of a kernel.

Yes, I'll publish code soon.

Saturday, March 28, 2015

Using the arswitch ethernet switch on FreeBSD

I sat down a few weeks ago to make the AR8327 ethernet switch work, and in doing so I wanted to add per-port and 802.1q VLAN support. It turned out that I .. didn't know as much as I thought I did about the etherswitch support. So, after a whole bunch of trial-and-error, I wrapped my head around things. This post is mostly a braindump so that if I do forget, I have something written down about it - at least until I turn it into a FreeBSD manpage.

There are three modes:
  • default - all ports are in the same VLAN;
  • per-port - each port can be in a VLAN 'group';
  • dot1q - each port can be in multiple VLAN groups, with 802.1q tagging going on.
The per-port VLAN group is for switches that don't have an arbitrary VLAN table - you just assign each port an ID from some low set of values (say, 16), and then the VLAN tag can either be added or not added. I think the RTL8366 switch is like this, but I'd have to check.

The dot1q VLAN is for switches that support multiple VLANs, each can have an arbitrary VLAN ID (0..4095) with optional other VLAN options (like tag-in-tag support.)

The etherswitch configuration side has a few options and they're supported by different hardware:
  • Each port has a port VLAN ID - this is the "native VLAN" for dot1q support. I don't think it has any particular meaning in the per-port VLAN code in arswitch, but I could be terribly wrong. I thought it did when I initially did the port, but the documentation is .. lacking.
  • Then there's a set of per-port flags - eg q-in-q, 802.1q tagging, etc.
  • Then there's the vlangroup - each vlangroup has a vlan ID, and then a set of port members. Each port member can be tagged or untagged.
This is where things get odd.

Firstly - the AR934x SoC switch support doesn't include VLANs. I need to add that. I'm not sure which side of the wall this falls on.

The switches previous to the AR8327 support per-port and VLAN configuration, but they don't support per-port-per-VLAN tagging. Ie, you can configure 802.1q VLANs, and you can enable tagging on the port - but it tags all packets that aren't the port 'VLAN ID'.

The per-port VLAN ID seems ignored by the arswitch code - it's only used by the dot1q support.

So I think (and it hasn't yet been tested) that on the earlier switches, I can use per-port VLANs with tagging by:
  • Configuring per port vlans - "etherswitch config vlan_mode port"
  • Adding vlangroups as appropriate with membership - tag/untag doesn't matter
  • Set the CPU port up to have tagging - "etherswitch port0 addtag"
When configuring dot1q VLANs, the mode is "config vlan_mode dot1q" and the 802.1q VLAN IDs are used, but the above still holds - the port is tagged or untagged.

But on the AR8327, the VLAN map hardware actually supports enabling/disabling tagging on a per-port-per-VLAN basis. Ie, when the VLAN table is programmed with the port membership, it takes a list of both the ports and whether the ports are tagged/untagged/open/filtered. So, I don't think per-port VLAN tagging works - only dot1q tagging. Maybe I can make it work, but I haven't really sat down for long enough with the documentation to see what combinations are required.
  • Configure the hardware - "etherswitch config vlan_mode dot1q"
  • Add vlangroups as appropriate, set pvid as appropriate
  • For each vlangroup membership, the port can be tagged or untagged - eg to tag the cpu port 0, you'd use '0t' as the port member. That says "port0 is a member, and it's tagged."
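
Putting that together, an illustrative session - I'm going from memory here, so treat the exact syntax as approximate, and the vlangroup/port numbers and VLAN IDs are just examples:

# etherswitch config vlan_mode dot1q
# etherswitch vlangroup1 vlan 10 members 0t,1,2
# etherswitch port1 pvid 10
# etherswitch port2 pvid 10

That says: VLAN 10 contains port 0 (tagged, the CPU port) and ports 1 and 2 (untagged), and untagged frames arriving on ports 1 and 2 get classified into VLAN 10.
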
I still have a whole lot more to add - the ingress/egress filters aren't configurable, the per-port vlan stuff needs to be made much more sensible and consistent - and the AR934x SoC switch needs to support VLANs. Oh, and much more documentation. But, hey, I can get the thing spitting out VLAN tags, so when it's time to set up my home network with some VLANs, I'll be sure to document what I did and share it with everyone.

Thursday, March 19, 2015

Cache Line Aliasing #2, or "What happens when you page align everything"

After a little more digging into the Intel performance side of things, I discovered one of the big reasons for the performance drop on this particular workload: how Intel CPUs do memory reordering.

The TL;DR is this - there's some hardware inside the Intel CPUs that tracks memory ordering and cache contents - but they don't use all the address bits.

The relevant chapter in the intel optimisation guide is 3.6.8 - Capacity Limits and Aliasing in Caches. The specific thing I was hitting was in 3.6.8.2 - Store Forwarding Aliasing.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

So, given this, what can be done? In this workload, a bunch of large matrices were allocated via jemalloc, which page aligns large allocations. In the default invocation of the benchmark (where the allocation padding size is 0), the memory access patterns showed a very large number of counter events on "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS" - which is the number of 64k address aliases on the Sandy Bridge Xeon processors I've been testing on. (The same occurs on Westmere, Ivy Bridge and Haswell.) As I vary the padding size, the address aliasing value drops, the memory access counters increase, and the general performance increases.

On the test boxes I have, running 'pmcstat -w 120 -C -p LD_BLOCKS_PARTIAL.ADDRESS_ALIAS ./himenobmtxpa M' gives the following (columns: padding in bytes, ADDRESS_ALIAS event count, benchmark score):

padding    ADDRESS_ALIAS        score
      0        217799413   830.995025
     64         18138386  1624.296713
     96          8876469  1662.486298
    128         19281984  1645.370750
    192         18247069  1643.119908
    256         18511952  1661.426341
    320         19636951  1674.154119
    352         19716236  1686.694053
    384         19684863  1681.110499
    448         18189029  1683.163673
    512         19380987  1691.937818

So there's still plenty of aliasing going on at different padding offsets, however it's a very marked drop between 0 and, well, anything.

It turns out that someone's gone and done a bunch more digging into the effects of various CPU magic under the hood. The last paper in the list (Analysing Contextual Bias..) looks at Aliasing and Cache Effects and the effect of memory layout. There's some cute (and sobering!) analysis of the performance changes due to something as simple as the length of your login name in the UNIX environment. It's worth reading.

The summary? Maybe page alignment of all of your memory accesses isn't the way to go.

For further reading:

Friday, March 6, 2015

cache line aliasing effects, or "why is freebsd slower than linux?"

There were some threads on the FreeBSD/DragonflyBSD mailing lists a few years ago (2012?) that talked about some math benchmarks being much slower on FreeBSD/DragonflyBSD versus Linux.

When the same benchmark is run on FreeBSD/DragonflyBSD using the Linux layer (ie, a linux binary compiled for linux, but run on BSD) it gives the same or better behaviour.

Some digging was done, and it turned out it was due to memory allocation patterns and memory layout. The jemalloc library allocates large chunks at page aligned boundaries, whereas the allocator in glibc under Linux does not.

I've put the code online in the hope that others can test and verify this:

https://github.com/erikarn/himenobmtxpa

The branch 'local/freebsd' has my local change to allow the allocator offset to be specified. The offset compounds on each allocation - so with an 'n' byte offset, the first allocation is 0 bytes offset from the page boundary, the next is 'n' bytes offset from the page boundary, the next is '2n' bytes offset, etc.
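
The shape of the change, as a sketch - this is not the actual patch, and the bookkeeping needed to free() the original pointer is ignored here:

#include <stdlib.h>

static size_t offset_step;      /* the 'n' byte offset being tested */
static size_t cur_offset;       /* compounds across allocations */

void *
alloc_offset(size_t len)
{
        /* malloc() (jemalloc) page-aligns large allocations; returning
         * a pointer 'cur_offset' bytes in shifts each successive
         * allocation further from the page boundary. */
        char *p = malloc(len + cur_offset);
        void *ret;

        if (p == NULL)
                return (NULL);
        ret = p + cur_offset;
        cur_offset += offset_step;
        return (ret);
}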

You can experiment with different values and get completely different behavioural results. It's non-trivial: there's a 100% speedup by using a 127 byte offset for each allocation, versus a 0 byte offset.

I'd like to investigate cache line aliasing effects further. There was work done a few years ago to offset mbuf headers in the FreeBSD kernel so they weren't all page-aligned or 256/512/1024 byte aligned - and apparently this gave a significant performance improvement. But it wasn't folded into FreeBSD. What I'd like to do is come up with some better strategies / profiling guides for identifying when this is actually happening so the underlying objects being accessed can be adjusted.

So - if anyone out there has any tips, hints or suggestions on how to do this, please let me know. I'd like to document and automate this testing.

Sunday, February 22, 2015

FreeBSD on the POWER8: it's alive!

A post to freebsd-ppc from a couple of months ago asked if we had support for POWER8 and offered to provide remote access to anyone interested in working on it. I was sufficiently intrigued that I approached the FreeBSD powerpc hackers to ask about it, and was informed that it'd be nice, but we didn't have hardware.

After a bit of wrangling of hardware logistics and with the FreeBSD Foundation purchasing a box, a Tyan POWER8 evaluation server appeared. Nathan Whitehorn started poking at it and managed to get a basic "hello world" going, but stalled on issues with the Linux KVM virtualisation environment.

Fast forward a few weeks - he's figured out the KVM issues, their lack of support for some mandated hypervisor APIs and other bugs - FreeBSD now boots inside of the hypervisor environment and seems stable enough to do development on.

He then found the existing powerpc pmap (physical memory management) code wasn't very SMP friendly - it works fine on one and two CPU powerpc machines, but this POWER8 evaluation board is a 4-core, 32-thread CPU. So a few days of development went by and he rewrote most of the pmap code to be much more fine-grained in its locking, scaling much, much better than the existing code. (He also found the PS3 hypervisor layer isn't thread-safe.)

What's been done thus far?

  • FreeBSD boots inside the hypervisor environment;
  • Virtualised console, networking and storage all work;
  • (in progress) new, scalable pmap implementation;
  • Initial support for the Vector-Scalar Extension (VSX) that's found on POWER7 and POWER8.
So, I'm impressed. Nathan's done a fantastic job bringing the whole thing up. There's some further work on the new powerpc technology that needs doing (things like the new vector processing units, performance counter support and such) and I'm sure Justin and Nathan will poke powerpc dtrace support into further good shape. I'm going to see if we can fit a chelsio 40G NIC into one of these and work with their developers to fix any endian/busdma issues that creep up, and then do some network stack scaling testing with it. There's also the missing hardware/hypervisor support to run FreeBSD on the bare metal, which would be a fantastic achievement.

Now I kind of want some larger POWER8 hardware.