Tuesday, April 28, 2009

More brain-damage code..

I queued up another test run of LUSCA_HEAD on the 600mhz celeron box and discovered that performance had dropped slightly.

It turns out that there's an increase in malloc/free calls which I traced back to the ACL code and further back to the HTTP logging code.

The HTTP request logging code is using ACLs to determine what to log. I have the access log disabled AND I'm not even -using- the log_access ACL directive to restrict what is being logged but it still is taking a trip through the ACL checklist code to create an empty check list, check it, and return NULL. This is aside from all the rest of the logging code which will create all the duplicate strings for the logging layer, then not log anything, then free them.

Sigh. The more I look at the code the more I want to stab out my own eyes. Oh well, I'll add "fix the damned logging code" to my TODO list for later.

Wednesday, April 22, 2009

More traffic!

970mbit and rising...!

mirror1.jp.cacheboy.net - mozilla!

The nice folk at mozilla.org have provided me with a .jp CDN node. I'm now serving a good stack of bits from it into Australia, Malaysia, India, Japan, China, Korea and the Phillipines.

Thanks guys!

Mozilla 3.0.9 release..

The mozilla release (3.0.9) is currently going on. The traffic levels are ramping up now to the release peak.

880mbit/sec and counting..

Monday, April 13, 2009

Changing tack - more modularisation?

I've decided that tackling the storage manager codebase in its current shape is a bit too much right now. I may end up going down the path of using refcounted buffers as part of the data store to temporarily work around the shortcomings in the code. I don't like the idea of doing that long-term because, to be quite honest, that area of the codebase needs a whole lot of reorganisation and sanity to make it consistent and understandable. I think it should be done before any larger-scale changes are done and this includes the obvious performance boosts available by avoiding the copying.

In any case, I'm going to move onto my next short-term goal - a very, very basic module framework. I'd like to at least shoehorn in some very simple dynamic module loading and unloading into the core, not unlike what TMF did for Squid-3 as part of their initial eCAP work. My initial plans are to do the bare minimum necessary to start breaking out a very small chunk of code into modules - namely the request rewriters (url, storeurl and location) so they don't have to be compiled in. It will also force a little bit of tidying up around the HTTP and client-side code.

The initial aim is purely code reorganisation. Instead of having a nested list of callbacks forming the request processing chain, I plan on having a simple iterative process finite state machine which will route the request through different modules as required before passing it along to the rest of the client-side code. I hope that I can (slowly!) unwind a large part of the request path hairiness and enumerate it with said state engine.

In any case, I won't be going anywhere near as far with this as I'd like to in the first pass. There are plenty of problems with this (the biggest being parsing compound configuration types like ACLs - for example, if I wanted to modularise the "asn" ACL type, the module will need to be loaded far before the rest of the configuration file is parsed; then it needs to hook itself into the ACL code and register (a la what happened with Squid-3) itself in there; then subsequent ACL line parsing needs to direct things at the ACL module; then the ACL module needs to make sure its not unloaded until everything referencing it is gone!) but I'm going to pleasantly ignore them all the first time around.

By keeping the scope low and the changes minimal, I hope that the amount of re-recoding needed later on down the track (once I've established exactly what is needed for all of this) should be limited.

Oh, and as an aside but related project, I'm slowly fixing the SNMP core startup code to -not- use 15-20 nested deep function calls as part of its MIB tree construction. It is a cute functional programming type construct but it is absolutely horrible to try and add something. The related bit is allowing for SNMP mib leaves to be added -and- removed at runtime - so modules can register themselves with the cachemgr and SNMP core to provide their stats. 

Thursday, April 9, 2009

Fixing the disk code, part 4

My changes are now occasionally randomly crashing in malloc() - which means something is modifying memory it shouldn't and confusing things. It's a shame that I can't run this under valgrind - the required traffic load to generate this issue at the moment makes valgrind impossible to use.

My guess is that something is ripping the read buffer out from underneath the pread() in the worker thread. Figuring out exactly where this is happening is going to be tricky.

The prime candidate at the moment is where the read is cancelled indirectly via a call to aioClose(). aioClose() calls aioCancel() which attempts to cancel all of the pending events but I'm absolutely sure that it is possible for a read event to be in progress. The fact this is occurring was hidden by the read() buffer being a temporary local buffer - the read() would complete (error or not) and then the result would be tossed away.

The solution? I'm not sure. I have a feeling I'm going to be introducing something like AsyncCalls in Squid-3 to delay the IO callbacks from occuring - but then, for those keeping score, the AsyncCalls stuff in squid-3 (in its current inclusion) has made some code unstable and other bits of code difficult to debug. I wonder what repercussions there would be to splitting out the generation of IO callbacks and their execution; it would certainly help with diskd..

Wednesday, April 8, 2009

Fixing the disk code, part 3

The current store client code experiment involves this:

  • store clients are created and destroyed as per normal;
  • creation sets "flag.active", destruction clears "flag.active";
  • destruction doesn't deallocate the store client - it removes it from various lists and only -attempts- to free it via a new function, storeClientComplete()
  • storeClientComplete() (badly named) will only free the store client if it is marked as inactive and the callbacks it is a part of have returned.
  • In the callback functions, storeClientComplete() is called to see whether the store client should be freed, and if so, they terminate the callback
  • In the callback functions, if "flags.active" is clear, then the callback is terminated

Since cbdataFree() is not called until the flags are clear (active, disk_io_pending, event_pending) then it is possible that some callbacks will be called when they shouldn't be. In order to preserve the existing behaviour I need to further audit the store client code, find out which functions are passed in as callbacks, and make sure those functions correctly check the active state of the store client and also try calling storeClientComplete(). The problem here is that what controls the callback being called is the validity of the callback -data- - so I am hoping that there are no non-store client callbacks being scheduled with a store client instance as the callback data. Gah.

The other issue at the moment is that there is currently no guarantee (read: I haven't ensured it happens!) from the store API that the FileClosed() and FileNotify() routines get called before the store client is deregistered. I therefore can't delay free'ing the store client until those are also called. It should not be a big deal because the main motivation to delaying the store client -free- is to allow pending async operations using other data (in this case, read buffers) to complete before free'ing the data. In the two above instances the behaviour should be the same as before - the store client pointer will simply go invalid and the callbacks won't be made.

Monday, April 6, 2009

Lusca and Cacheboy improvements in the pipeline..

After profiling Lusca-HEAD rather extensively on the CDN nodes, I've discovered that the largest CPU "use" on the core 2 duo class boxes is memcpy(). On the ia64-2 node memcpy() shows up much lower down in the list. I'm sure this has to do with the differing FSB and general memory bus bandwidth available on the two architectures.

I'm planning out the changes to the store client needed to support fully copy-free async read and write. This should reduce the CPU overhead on core 2 duo class machines to the point where Lusca should break GigE throughput on this workload without too much CPU use. (I'm sure it could break GigE throughput right now on this workload though.)

I'll code this all up during the week and build a simulated testing rig at home "pretending" to be a whole lot of clients downloading partial bits of mozilla/firefox updates, complete with a random packetloss, latency and abort probability.

I also plan on finally releasing the bulk of the Cacheboy CDN software (hackish as it is!) during the week, right after I finally remove the last few bits of hard-coded configuration locations. :) I still haven't finished merging in the bits of code which do the health check, calculate the current probabilities to assign each host and then write out the geoip map files. I'll try to sort that out over the next few days and get a public subversion repository with the software online.

By the way, I plan on releasing the Cacheboy CDN software under the Affero GPL (AGPL) licence.

Sunday, April 5, 2009

Fixing the disk code, part 2

I've been reviewing and documenting the store client code to get a feel for how the code works and is used. Sure enough, it relies heavily on the callback data invalidation magic to avoid having to wait around until the store disk file operations complete (or an event scheduled via eventAdd() to complete at a later date.) I added a little bit of logging to print out a warning if a pending disk operation was there and sure enough, it happens often enough in production to probably cause all those random crashes that I remember in Squid-1.2.

Anyway. I have an idea for eliminating the read copy in the async IO code. Something I've done in other C programs is this:
  • Track the pending events which the store client code schedules callbacks into itself (ie, the eventAdd() for later processing, and pending disk IO)
  • Split storeClientUnregister() into two parts - one which sets a "shutting down" flag, and another which does the hard work
  • assert() that API methods aren't called (ie, storeClientRef(), etc) if the shutting down flag is set - ie, they shouldn't be at the present time, so any instance of that happening means that some callback is occuring where it shouldn't be and getting further than it should be!
  • On event/disk callback completion, clear the relevant flags and see if the store client is completely ready to be destroyed. Only then destroy it.
This -should- preserve the current behaviour (callbacks will return immediately instead of not being called, so they effectively behave the same) but it won't prevent non-callback code from running if the callback data pointer is "valid". (Which, hopefully isn't actually happening, but god knows in this codebase..) It means that the callback data "is valid" check in the store client code (and hopefully, the disk IO code) really becomes a debugging check rather than being used to control whether the callback is made or not.

I'm just worried that there's going to be weirdness in how the store client code is being called. Grr.

Anyway. If this is successful and proves itself in production, I'll then change the disk APIs to also enforce that the callback data remains valid for the lifetime of the scheduled IO. Which IMHO it should be. As I said earlier, using tricks like reference counted buffers shouldn't be a crutch for a bad API design..

Saturday, April 4, 2009

Fixing the disk APIs (or, reading Squid-1.2 code)

One of the performance limiting things I've found with the mozilla workload is the memcpy() being done from the temporary read buffer into the reader buffer (ie, what was supplied to storeClientRead*()) and, curious to know the history of this, I decided to go digging into the earlier revisions of all of this code. I went back to Squid-1.2.beta25.

I found that file_read() and file_write() (ie, the current "sync only" disk routines) had the hooks into calling the async IO functions (ie, aioRead() and aioWrite()) if they were compiled in. Like now, aio_read() makes a temporary buffer to read into and then copies it to the original request buffer if the request wasn't cancelled between submission and completion time.

In the 1.2 code, the callback involved copying the data into the supplied buffer and calls the callback. This didn't check to make sure the callback data is still valid. Tsk. In more recent code, the AIO read callback actually supplies the buffer to the completion callback (which is only called if valid!) and its up to the callee to do the copy. tsk.

The problem is neither "API" forced the caller to ensure the buffer stays valid for the length of the call. I'm going to investigate doing this, but I'm not sure whether this is easier or more difficult than shoehorning in reference counted buffers at this point. I want to use refcounted buffers throughout the code but not as a crutch for a badly designed API.