Tuesday, March 31, 2009

Lusca snapshot released

I've just put up a snapshot of the version of Lusca-HEAD which is running on the Cacheboy CDN. Head to http://code.google.com/p/lusca-cache/downloads/list.

lusca release - rev 13894

I've just put the latest Lusca-HEAD release up for download on the downloads page. This is the version which is currently running on the busiest Cacheboy CDN nodes (> 200mbit each) with plenty of resources to spare.

The major changes from Lusca-1.0 (and Squid-2 / Squid-3, before that):

  • The memory pools code has been gutted so it now acts as a statistics-keeping wrapper around malloc() rather than trying to cache memory allocations; this is in preparation for finding and fixing the worst memory users in the codebase! (There's a sketch of the idea just after this list.)
  • Reference counted buffers and some supporting framework have been added!
  • The server-side code has been reorganised somewhat in preparation for copy-free data flow from the server to the store (src/http.c)
  • The asynchronous disk IO code has been extracted from the AUFS codebase and turned into its own (mostly - there's one external variable left..) standalone library - it should be reusable by other parts of Lusca now
  • Some more performance work across the board
  • Code reorganisation and tidying up in preparation for further IPv6 integration (which was mostly completed in another branch, but I decided it moved along too quickly and caused some stability issues I wasn't willing to keep in Lusca for now..)
  • More code has been shuffled into separate libraries (especially libhttp/ - the HTTP code library) in preparation for some widescale performance changes.
  • Plenty more headerdoc-based code documentation!
  • Support for FreeBSD-current full transparent interception and Linux TPROXY-4 based full transparent interception
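
The mempool change in the first item is easier to see with code. Here's a minimal sketch of the idea - purely illustrative, not the actual Lusca source; the MemPoolStat name and fields are mine - of a "pool" that only keeps counters and hands every request straight to the system allocator:

```c
#include <stdlib.h>

/* Illustrative only: a pool that keeps statistics but does no caching
 * of freed objects. */
typedef struct {
    const char *label;      /* e.g. "HttpReply" */
    size_t obj_size;        /* size of one object in this pool */
    long alloc_calls;       /* total allocations */
    long free_calls;        /* total frees */
    long inuse;             /* objects currently outstanding */
} MemPoolStat;

static void *
memPoolAlloc(MemPoolStat *p)
{
    p->alloc_calls++;
    p->inuse++;
    /* no free list to consult - just account for it and allocate */
    return calloc(1, p->obj_size);
}

static void
memPoolFree(MemPoolStat *p, void *obj)
{
    p->free_calls++;
    p->inuse--;
    free(obj);              /* straight back to malloc, no caching */
}
```

The per-pool counters are what make it possible to see which allocation sites are the worst offenders before deciding where caching (or a different data flow) is actually worth the effort.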
The next few weeks should be interesting. I'll post a TODO list once I'm back in Australia.

Documentation pass - disk IO

I've begun documenting the disk IO routines in Lusca-HEAD. I've started with the legacy disk routines in libiapp/disk.c - these are used by some of the early code (errorpages, icon loading, etc) and the UFS based swap log (ufs, aufs, diskd).

This code is relatively straightforward, but I'm also documenting the API shortcomings so others (including myself!) will realise why it isn't quite good enough in its current form for fully asynchronous disk IO.

I'll make a pass through the async IO code on the flight back to Australia (so libasyncio/aiops.c and libasyncio/async_io.c) to document how the existing async IO code works and its shortcomings.

Monday, March 30, 2009

Mirroring a new project - Cyberduck!

I've just started providing mirror download services for Cyberduck - a file manager for a wide variety of platforms including the traditional (SFTP, FTP) and the new (Amazon S3, WebDAV). Cacheboy is listed as the primary download site on the main page.

Woo!

Mozilla 3.0.8 release!

The CDN handled the load with oodles to spare. The aggregate client traffic peak was about 650mbit across 5 major boxes. The boxes themselves peaked at about 160mbit each, depending upon the time of day (ie, whether Europe or the US was active.) None of the nodes were anywhere near maximum CPU utilisation.

About 2.5TB of Mozilla updates a day are being shuffled out.

I'd like to try pushing a couple of the nodes up to 600mbit -each- but I don't have enough CDN nodes to guarantee the bits will keep flowing if said node fails. I'll just have to be patient and wait for a few more sponsors to step up and provide some hardware and bandwidth to the project.

So far so good - the bits are flowing, I'm able to use this to benchmark Lusca development and fix performance bottlenecks before they become serious (in this environment, at least) and things are growing at about the right rate for me to not need to panic. :)

My next major goal will be to finish off the BGP library and lookup daemon; flesh out some BGP related redirection map logic; and start investigating reporting for "services" on the box. Hm, I may have to write some nagios plugins after all..

Sunday, March 29, 2009

Async IO changes to Lusca

I've just cut out all of the mempool'ed buffers from Lusca and converted them to plain xmalloc()/xfree(). Since memory pools in Lusca no longer "cache" memory - ie, they're just for statistics keeping - the only extra thing those pooled buffers were providing was zeroed memory for disk buffers.

So now, for the "high hit rate, large object" workload the mirror nodes are currently doing, the top CPU user is memcpy() - via aioCheckCallbacks(). At least memset() is no longer up there as well.

That memcpy() is taking ~ 17% of the total userland CPU used by Lusca in this particular workload.

I have this nagging feeling that said memcpy() is the one done in storeAufsReadDone(), where the AUFS code copies the result of the async read into the supplied buffer. It does this because it's entirely possible the caller has disappeared between the time the storage manager read was scheduled and the time the filesystem read() was scheduled.

Because the Squid codebase doesn't explicitly cancel or wait for completion of async events - relying instead on the "locking" and "invalidation" semantics provided by the callback data (cbdata) scheme - passing buffers (and structs in general) into threads is pretty much impossible to do correctly.
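
To make that constraint concrete, here's a heavily simplified sketch of the callback-data pattern and why the copy happens - illustrative only, with my own names, not the real cbdata implementation:

```c
#include <stdlib.h>
#include <string.h>

/* Simplified sketch of the callback-data idea: the caller's state is
 * reference-counted and marked invalid rather than freed while a
 * callback is still pending, so completions can tell the caller has
 * gone away. The async read itself is never cancelled - it completes
 * into a thread-owned buffer, and the result is copied (or dropped). */
typedef struct {
    int locks;          /* references held by pending callbacks */
    int valid;          /* cleared when the owner "frees" the object */
    void *data;         /* the caller's real state */
} cbdata_t;

static void cbdata_lock(cbdata_t *c)  { c->locks++; }
static int  cbdata_valid(cbdata_t *c) { return c->valid; }

static void
cbdata_unlock(cbdata_t *c)
{
    if (--c->locks == 0 && !c->valid)
        free(c);        /* last reference gone and the owner already left */
}

/* completion path, back in the main thread */
static void
read_done(cbdata_t *caller, char *callerbuf, const char *threadbuf, int len)
{
    if (cbdata_valid(caller))
        memcpy(callerbuf, threadbuf, len);  /* the memcpy() in question */
    /* else: the caller vanished mid-flight; the data is simply dropped */
    cbdata_unlock(caller);
}
```

Because the only safe signal available is "the caller is no longer valid", the worker thread can never write directly into caller-owned memory - hence the copy.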

In any case, the performance should now be noticeably better.

(obnote: I tried explaining this to the rest of the core Squid developers last year and somehow I don't quite think I convinced them that the current approach, with or without the AsyncCallback scheme in Squid-3, is going to work without significant re-engineering of the source tree. Alas..)

Saturday, March 28, 2009

shortcomings in the async io code

Profiling the busy(ish) Lusca nodes during the Mozilla 3.0.8 release cycle has shown significant CPU wastage in memset() (ie, zeroing memory) - via the aioRead() and aioCheckCallbacks() code paths.

The problem stems from the disk IO interface inherited from Squid. Squid has no explicit cancel-and-wait-for-cancel step in either the network or the disk IO code, so the async disk IO read code allocates its own read buffer, reads into that, and then hands that buffer to the completion callback to copy the data out of. If the request is cancelled while the worker thread is in the middle of read()'ing, it reads into its own buffer rather than a potentially free()'d buffer from the owner. It's a bit inefficient, but in the grand scheme of Squid CPU use it's not that big a waste on modern hardware.

In the short term, I'm going to re-jig the async IO code to not zero buffers that are involved in the aioRead() path. In the longer term, I'm not sure. I prefer cancels which may fail - ie, if an operation is in progress, let it complete; if not, return immediately. I'd like this for the network code too, so I can use async network IO threads for lower-copy network IO (eg FreeBSD and aio_read() / aio_write()); but there are significant amounts of existing code which assume things can be cancelled immediately and that temporary copies of data are made everywhere. Sigh.
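
As an illustration of the short-term fix (a sketch under my own names, not the actual aioRead() internals), the change essentially swaps a zeroed allocation for a plain one on the read path, since read() overwrites the bytes anyway:

```c
#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

/* The worker thread owns this buffer, so a cancelled caller can never
 * have its memory scribbled on. Zeroing it first is the memset() cost
 * showing up in the profiles. */
static char *
aio_read_buffer(size_t len)
{
    /* today: return calloc(1, len);  -- memset() on every read */
    return malloc(len);     /* proposed: read() fills the bytes in anyway */
}

static ssize_t
aio_do_read(int fd, off_t offset, size_t len, char **bufp)
{
    char *buf = aio_read_buffer(len);
    ssize_t n = -1;

    if (buf != NULL)
        n = pread(fd, buf, len, offset);    /* overwrites the buffer contents */
    *bufp = buf;                            /* handed to the completion callback */
    return n;
}
```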

Anyway - grr'ing aside, fixing the pointless zeroing of buffers should drop the CPU use for large file operations reasonably noticeably - by at least 10% to 15%. I'm sure that'll be a benefit to someone.

New Lusca-head features to date

I've been adding in a few new features to Lusca.

The first is the config option "n_aiops_threads", which allows configuration / runtime tuning of the number of IO threads. I got fed up recompiling Lusca every time I wanted to fiddle with the number of threads, so I made it configurable.

Next is a "client_socksize" - which overrides the compiled and system default TCP socket buffer sizes for client-side sockets. This allows the admin to run Lusca with restricted client side socket buffers whilst leaving the server side socket buffers (and the default system buffer sizes) large. I'm using this on my Cacheboy CDN nodes to help scale load by having large socket buffers to grab files from the upstream servers, but smaller buffers to not waste memory on a few fast clients.

The async IO code is now in a separate library rather than in the AUFS disk storage module. This change is part of a general strategy to overhaul the disk handling code and introduce performance improvements to storage and logfile writing. I also hope to include asynchronous logfile rotation. The change breaks the "auto tuning" done via various hacks in the AUFS and COSS storage modules. Just set "n_aiops_threads" to a sensible amount (say, 8 * the number of storage directories you have, up to about 128 or so threads in total) and rely on that instead of the auto-tuning. I found the auto-tuning didn't quite work as intended anyway..

Finally, I've started exporting some of the statistics from the "info" cachemgr page in a more easily computer-parsable format. The new "mgr:curcounters" cachemgr page includes current client/server counts, current hit rates and disk/memory storage sizes. I'll be using these counters as part of my Cacheboy CDN statistics code.

Googletalk: "Getting C++ threads to work right"

I've been watching a few Google dev talks on Youtube. I thought I'd write up a summary of this one:

http://www.youtube.com/watch?v=mrvAqvtWYb4

In summary:
  • Writing "correct" thread code using the pthreads and CPU instructions (fencing, for example) requires the code to know whats going on under the hood;
  • Gluing concurrency onto the "side" of a language which was specified without concurrency has proven to be a bit of a problem - eg, concurrent access to different variables in a structure and how various compilers have implemented this (eg, changing a byte in a struct becoming a 32 bit load, 8 bit modify, 32 bit store - see the example after this list);
  • Most programmers should really use higher level constructs, like what the C++0x and Java specification groups have been working on.
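
Here's the struct example from the second point, written out in C. It's a deliberately racy sketch; whether it actually breaks depends on how the compiler and CPU implement the adjacent byte stores, which is exactly the talk's point about languages specified without a concurrency model:

```c
#include <stddef.h>
#include <pthread.h>

struct shared {
    char flag;      /* written by thread A */
    char count;     /* written by thread B - adjacent byte, likely the same word */
};

static struct shared s;

static void *thread_a(void *arg) { (void) arg; s.flag = 1;  return NULL; }
static void *thread_b(void *arg) { (void) arg; s.count = 7; return NULL; }

int
main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* If a compiler implements either byte store as a word-sized
     * load/modify/store, one of the two updates can silently be lost. */
    return 0;
}
```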
If you write threaded code or you're curious about it, you should watch this talk. It provides a very good overview of the problems and should open your mind up a little to what may go wrong..

Friday, March 27, 2009

Another open cdn project - mirrorbrain

I've been made aware of Mirrorbrain (http://mirrorbrain.org), another project working towards an open CDN framework. Mirrorbrain uses Apache as the web server plus some Apache module smarts to redirect users between mirrors.

I like it - I'm going to read through their released source and papers to see what clues can be cribbed from them - but they still base the CDN on an untrusted, third-party mirror network outside their control. I still think the path forward to an "open CDN" involves complete control right out to the mirror nodes and, in some places, the network the mirror nodes live on.

There are a couple of shortcomings - most notably, their ASN implementation currently uses snapshots of the BGP network topology table rather than a live BGP feed distributed out to each mirror and DNS node. They also keep central indexes of files and attempt to maintain maps of which mirror nodes have which updated versions of files, rather than building on top of perfectly good HTTP/1.1 caching semantics. I wonder why..

Monday, March 23, 2009

Example CDN stats!

Here's a snapshot of the global aggregate traffic level:


.. and top 10 AS stats from last Sunday (UTC):




Sunday, March 22, 2009

wiki.cacheboy.net

I've set up http://wiki.cacheboy.net/, a simple MediaWiki install which will serve as a place for me to braindump stuff into.

Thursday, March 19, 2009

More "Content Delivery" done open

Another network-savvy guy in Europe is doing something content-delivery related: http://www.as250.net/ .

AS250 is building a BGP anycast based platform for various 'open' content delivery and other applications. I plan on doing something similar (or maybe just partnering with him, I'm not sure!) but anycast is only part of my overall solution space.

He's put up some slides from a presentation he did earlier in the year:

http://www.trex.fi/2009/as250-anycast-bgp.pdf


Filesystem Specifications, or EXT4 "Losing Data"

This is a bit off-topic for this blog, but the particular issue at hand bugs the heck out of me.

EXT4 "meets" the POSIX specifications for filesystems. The specification does not make any requirements for data to be written out in any order - and for very good reason. If the application developer -requires- data to be written out in order, they should serialise their operations through use of fsync(). If they do -not- require it, then the operating system should be free to optimise away the physical IO operations.

As a clueful(!) application developer, -I- appreciate being given the opportunity to provide this kind of feedback to the operating system. I don't want one or the other. I'd like to be able to use both where and when I choose.

Application developers - stop being stupid. Fix your applications. Read and understand the specification and what it provides -everyone- rather than just you.

Monday, March 16, 2009

Breaking 200mbit..

The CDN broke 200mbit at peak today - roughly half Mozilla and half VideoLAN.

200mbit is still tiny in the grand scheme of things, but it proves that things are working fine.

The next goal is to handle 500mbit average traffic during the day, and keep a very close eye on the overheads in doing so (specifically - making sure that things don't blow up when the number of concurrent clients grows.)

GeoIP backend, or "reinventing the wheel"

The first incarnation of the Cacheboy CDN uses 100% GeoIP to redirect users. This is roughly how it goes:

  1. Take a GeoIP map to break up IPs into "country" regions (thanks nerd.dk!);
  2. Take the list of "up" CDN nodes;
  3. For each country in my redirection table, find the CDN node that is up with the highest weight;
  4. Generate a "geo-map" file consisting of the highest-weight "up" CDN node for each country in "3";
  5. Feed that to the PowerDNS geoip module (thanks Mark @ Wikipedia!)
This really is a good place to start - it's simple, it's tested and it provides me with some basic abilities for distributing traffic across multiple sites to both speed up transfer times to end-users and better use the bandwidth available. The trouble is that it knows very little about the current state of the "internet" at any point in time. But, as I said, as a first (coarse!) step to get the CDN delivering bits, it worked out.
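
Steps 3 and 4 above boil down to a tiny selection loop per country. Here's a sketch (illustrative data structures only - the real generator is a script, and the node name and per-country weights here are hypothetical):

```c
#include <stddef.h>

/* One candidate CDN node for a given country. */
struct candidate {
    const char *node;   /* e.g. "mirror1.cacheboy.net" - hypothetical name */
    int weight;         /* this node's weight for the country in question */
    int up;             /* from the node health checks */
};

/* Pick the highest-weight node that's currently up; NULL if none are. */
static const char *
pick_node(const struct candidate *c, int n)
{
    const struct candidate *best = NULL;
    int i;

    for (i = 0; i < n; i++) {
        if (!c[i].up)
            continue;
        if (best == NULL || c[i].weight > best->weight)
            best = &c[i];
    }
    return best ? best->node : NULL;
}
```

Run once per country, the results become the "country -> node" lines of the geo-map fed to the PowerDNS geoip module.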

My next step is to build a much more easily "hackable" backend which I can start adding functionality to. I've reimplemented the geoip backend in Perl and glued it to the "pipe-backend" module in PowerDNS. This simply passes DNS requests to an external process which spits back DNS replies. The trouble is that multiple backend processes will be invoked whether you want them or not. This means I can't simply load large databases into the backend process, as they'd take time to load, waste RAM, and generally make things scale (less) well.

So I broke out the first memory hungry bit - the "geoip" lookup - and stuffed it into a small C daemon. All the daemon does is take a client IP and answer with the geoip information for that IP. It will periodically check and reload the GeoIP database file in the background if it's changed - maintaining whatever request rate I'm throwing at it rather than pausing for a few seconds whilst things are loaded in.
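
The reload-if-changed part of the geoipd looks roughly like this (a sketch only; geoip_load() and geoip_swap() are hypothetical stand-ins for the actual database loading and pointer swap, and the wire protocol isn't shown):

```c
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

static time_t db_mtime;                     /* mtime of the loaded database */

extern void *geoip_load(const char *path);  /* hypothetical: parse the DB file */
extern void geoip_swap(void *newdb);        /* hypothetical: swap in the new copy */

/* Called periodically from the main loop; lookups keep being answered
 * from the old in-memory copy until the new one is fully loaded. */
static void
maybe_reload(const char *path)
{
    struct stat st;
    void *newdb;

    if (stat(path, &st) != 0)
        return;                     /* can't stat it - keep the current copy */
    if (st.st_mtime == db_mtime)
        return;                     /* unchanged since the last load */

    newdb = geoip_load(path);       /* load the new copy first ... */
    if (newdb != NULL) {
        geoip_swap(newdb);          /* ... then swap, without pausing requests */
        db_mtime = st.st_mtime;
    }
}
```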

I can then use the "geoip daemon" (lets call it "geoipd") by the PowerDNS pipe-backend process I'm writing. All this process has to do at the moment is load in the geo maps (which are small) and reload them as required. It sends all geoip requests to the geoipd and uses the reply. If there is a problem talking to the geoipd, the backend process will simply use a weighted round robin of well-connected servers as a last resort.

The aim is to build a flexible backend framework for processing redirection requests which can be used by a variety of applications. For example, when it's time for the CDN proxy nodes to also do 302 redirections to "closer" nodes, I can simply reuse a large part of the modular libraries already written. When I integrate BGP information into the DNS infrastructure, I can reuse all of those libraries in the CDN proxy redirection logic, the webserver URL rewriting logic, or anywhere else it's needed.

The next step? Figuring out how to load balance traffic destined to the same AS / GeoIP region across multiple CDN end nodes. This should let me scale the CDN up to a gigabit of aggregate traffic given the kind of sponsored boxes I'm currently receiving. More to come..

Friday, March 13, 2009

Downtime!

The CDN had a bit of downtime tonight. It went like this:

  • The first mirror threw a disk;
  • For some reason, gmirror became unhappy rather than just carrying on with the second disk (I'm guessing the controller went unhappy; there wasn't anything logged to indicate the other disk in the mirror set was failing);
  • The second mirror started taking load;
  • For some weird reason, the second mirror hung hard without any logging to explain why.
I've replaced the disks in mirror1 and it's slowly rebuilding the content. It probably won't be finished resync'ing the (new) mirror set until tomorrow. Hopefully mirror-2 will stay just as stable as it currently is.

The CDN nodes ended up still serving content whilst the masters were down - they just couldn't fetch uncached content. So it wasn't a -total- loss.

This just highlights that I really do require another mirror master or two located elsewhere. :)

Monday, March 9, 2009

minimising traffic to the backends..

Squid/Cacheboy/Lusca has a nifty feature where it'll "piggyback" a client connection on an existing backend connection if the backend response is cachable AND said response is valid for the client (ie, it's the right variant, doesn't require revalidation at that point, etc.)

I've been using this for the VideoLAN and Mozilla downloads. Basically, one client will suck down the whole object, and any other clients which want the same object (say, the 3.0.7 US English win32 update!) will share the same connection.

There's a few problems which have crept up.

Firstly - the "collapsed forwarding" support is not working in this instance. I think the logic is broken with large objects (it was only written for small objects, delaying forwarding the request until the forwarded response was known cachable) where it denies cachability of the response (well, it forces it to be RELEASEd after it finishes transferring) because of all of the concurrent range requests going on.

Secondly - Squid/Cacheboy/Lusca doesn't handle range request caching. It'll -serve- range responses for objects it has the data for, but it won't cache partial responses nor will it reassemble them into one chunk. I've been thinking about how to possibly fix that, but for now I'm hacking around the problems with some scripts.

Finally - the forwarding logic uses the speed of the -slowest- client to determine how quickly to download the file. This needs to be changed to use the speed of the -fastest- client instead.
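
In other words, the shared fetch should be paced off the furthest-ahead client rather than the furthest-behind one. A sketch of the intent (illustrative only - not the real store/forwarding code):

```c
#include <sys/types.h>

struct client_ref {
    off_t copy_offset;      /* how far into the object this client has been sent */
};

/* How far the server-side fetch should be allowed to run. */
static off_t
fetch_target(const struct client_ref *clients, int n, off_t readahead)
{
    off_t fastest = 0;
    int i;

    for (i = 0; i < n; i++)
        if (clients[i].copy_offset > fastest)
            fastest = clients[i].copy_offset;

    /* the current behaviour is effectively min() here, so one slow
     * client throttles everyone sharing the fetch */
    return fastest + readahead;
}
```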

I need to get these fixed before the next mozilla release cycle if I'm to have a chance of increasing the traffic levels to a gigabit and beyond.

More to come..

Cacheboy Outage

There was a brief outage earlier tonight due to some troubles with the transit provider of one of my sponsors. They're sponsoring the (only) pair of mirror master servers at the moment.

This will be (somewhat) mitigated when I bring up another set of mirror master servers elsewhere.

Sunday, March 8, 2009

Cacheboy is pushing bits..

The Cacheboy CDN is currently pushing about 1.2 TB a day (~ 100mbit average) of Mozilla and VideoLAN downloads out of 5 separate CDN locations. A couple of the servers are running LUSCA_HEAD and they seem to handle the traffic just fine.

The server assignment is currently being done through GeoIP mapping via DNS. I've brought up BGP sessions to each of the sites to eventually use in the request forwarding process.

All in all, things are going reasonably well so far. There have been a few hiccups which I'll blog about over the next few days, but the bits are flowing and no one is complaining. :)