Sunday, December 28, 2008

Cacheboy-HEAD updates

I've finished cleaning up the bits of the IPv6 work from CACHEBOY_HEAD - it should be just a slightly better structured Squid-2.HEAD / Cacheboy-1.5.

Right now I'm pulling out as much of the HTTP related code from src/ into libhttp/ before the 1.6 release. I'm hoping to glue together bits and pieces of the HTTP code into a very lightweight (for Squid) HTTP server implementation which can be used to test out various things like thread-safeness. Of course, properly testing thread-safeness in production relies on a lot of the other code being thread-safe, like the comm code, the event registration code, the memory allocation code, the debugging and logging code ... aiee, etc. Oh well, I said I wanted to..

I'm also going through and adding some HeaderDoc comments to various library files. HeaderDoc (from Apple) is actually rather nice. It lacks -one- function - the ability to merge multiple files together (say, libsqinet/sqinet.[ch]) into one "module" for documentation. I may look at doing that in some of my spare time.
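For anyone who hasn't seen it, HeaderDoc markup is just specially tagged comment blocks sitting above the declarations. Something along these lines is what I'm adding (the function and parameter here are made up purely for illustration):

/*!
 * @function sqinet_init
 * @abstract Initialise an sqaddr_t to the "no address" state.
 * @discussion Must be called before any other sqinet_ routine touches
 *     the address.
 * @param s pointer to the sqaddr_t to initialise
 */
void sqinet_init(sqaddr_t *s);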

Saturday, December 27, 2008

Reverting IPv6 for now; moving forward with structural changes

I've been working on the IPv6 support in Cacheboy for a couple months now and I've come to the conclusion that I'm not getting anywhere near as far along the development path as I wanted to be.

So I've taken a rather drastic step - I've branched off CACHEBOY_HEAD from the last point along the main codebase where the non-intrusive IPv6 changes had occurred and I'm going to pursue Cacheboy-1.6 development from that.

The primary short-term goal with Cacheboy was to restructure the codebase in such a way as to make further development much, much simpler. I sort of lost track with the IPv6 development stuff and I rushed it in when the codebase obviously wasn't ready.

So, the IPv6 changes will stay in the CACHEBOY_PRE branch for now; development will continue in CACHEBOY_HEAD. I'll continue the restructuring work and stability work towards a Cacheboy-1.6 release come January 1. I'll then look at merging over the IPv6 infrastructure work into CACHEBOY_HEAD well before I merge in the client and server related code - specifically, completing the DNS updates, the ipcache/fqdncache updates, porting over the IPv6 SNMP changes from Squid-3, and looking at modularising the ACL code in preparation for IPv6'ifying that. The goal is less to IPv6-ify Cacheboy; it's more to tidy up the code to the point where IPv6 becomes trivial.

Saturday, November 29, 2008

Working on NNRP proxies, or writing threaded C code..

So it turns out that I'm working on some closed-source NNRP proxy code. It sits between clients / readers and backend spool servers and directs/balances requests to the correct backend servers as required.

News is a kind of interesting setup. There are servers with just lots of articles, indexed via message ID (or a hash thereof.) There are servers with the overview databases, which keep track of article ids, message ids, group names, and all that junk. The client reader interface has a series of commands which may invoke a combination of access to both the overview databases and the article spools.

I'm working on a bit of code which began life as a reader -> spool load balancer; I'm turning it into a general reader and client facing bit of software which speaks enough NNRP to route connections to the relevant backend servers. The architecture is pretty simplistic - one thread per backend connection, one thread per client connection, "message queues" sit between all of these and use pthread primitives to serialise access. For the number of requests and concurrency, it scales quite well. It won't scale to 100,000 connections by any means but considering the article sizes (megabytes at a time) a 10GE pipe will be filled far, far before that sort of connection and request rate limit is reached.

So, today's post is going to cover a few things I've learnt whilst writing this bit of code. It's in C, so by its very nature it's going to be horrible. The question is whether I can make it less horrible to work with.

Each client thread sits in a loop reading requests, parsing them, figuring out what needs to happen, then queuing messages to the relevant spool or reader server queue to be handled by one of the connection threads. It's relatively straightforward. The trick is to figure out how to keep connections around long enough so the thing you've sent the request to is still there when you reply.

There's a couple of options which are used in the codebase.

The first is what the previous authors did - they would parse an article request (ARTICLE, HEAD, BODY), create a request, push it onto the queue, and wait 1 second for a reply. If the reply didn't occur in that time they would push another request to another server. The idea is to minimise latency on the article fetches - instead of waiting around for a potentially overloaded server, they just queue requests to the other servers which may have the article and then stop queuing requests when one issues a reply. The rest of the replies then have to be dequeued and tossed away.

The second is what I did for the reader/overview side - I would parse a client request, (GROUP, STAT, XOVER, etc), create a request to the backend, push it onto the queue, and wait for the reply. The backend code took care of trying the set of commands required to handle that client request (eg a STAT would require a GROUP, then a STAT; but a STAT would only require a STAT on the backend), with explicit timeouts. If the request didn't happen by then, the backend reader thread would send a "timeout" reply to the client thread, and then attempt to complete the transaction before dequeuing the next.

There are some implications from the above!

The first method is easier to code and easier to understand conceptually - the client handles timeouts and throws away unwanted responses. The backend server code is easy - dequeue, attempt the request until completion or error, return. The problem is that there is no guaranteed time in which the client will be notified of the completion of the request.

The second method is trickier. The backend thread handles timeouts and sends them to the client thread. The backend then needs to track the NNTP transaction state so it can resume it and run the request to completion, tossing away whatever data was being returned. The benefit is that the client -will- get a message from the backend in the specified time period.
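Both approaches boil down to a timed wait on a mutex/condvar protected message queue. A rough sketch of the sort of primitive involved (the names and structures here are illustrative, not the actual proxy code):

#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <errno.h>

typedef struct msg {
    struct msg *next;
    void *payload;
} msg_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    msg_t *head, *tail;
} msgqueue_t;

void
msgqueue_push(msgqueue_t *q, msg_t *m)
{
    m->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Dequeue one message, waiting at most timeout_sec seconds; NULL on timeout. */
msg_t *
msgqueue_pop_timed(msgqueue_t *q, int timeout_sec)
{
    struct timespec ts;
    msg_t *m = NULL;

    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += timeout_sec;

    pthread_mutex_lock(&q->lock);
    while (q->head == NULL) {
        if (pthread_cond_timedwait(&q->cond, &q->lock, &ts) == ETIMEDOUT)
            break;
    }
    if (q->head) {
        m = q->head;
        q->head = m->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return m;
}

The difference between the two methods is really just who does the timed wait: the client thread (method one) or the backend connection thread (method two).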

These approaches aren't mutually exclusive either. The first works better for article fetches where there isn't any code to try and monitor server performance and issue requests to servers that are responding quickly. I'm going to add that code in soon anyway. The second approach works great for the reader commands because they're either nice and quick, or they're extremely long-lived. Article replies generally max out at a few megabytes. Overview commands can serve back tens or hundreds of megabytes of database information and this can take time.

One of the important implications is when the client thread can be freed. In the first method, the client thread MUST stay around until all the pending article requests have been replied to in some fashion. In the second method, the client thread waits for a response to its message immediately after queuing it, so it doesn't have to reap queue events on connection shutdown.

The current crash bugs I've seen seem to be related to message queuing. I'm seeing both junk being dequeued from the client reader queue (when there should be NO messages pending in that queue once a command has been fully processed!) and I'm seeing article responses being sent to queues for clients which have been destroyed for one reason or another. I'm going to spend some time over the next few hours putting in assert()ions to track these conditions down and naff them on the head before the stack gets scrambled and I end up with a 4 gigabyte core which gives me absolutely no useful traceback. :P

Oh look, the application cored again, this time in the GNU malloc code! Time to figure out what is going on again..

Saturday, November 22, 2008

Updates!

A few updates!

I've fixed a few bugs in CACHEBOY_PRE which will be back-ported to CACHEBOY_1.5. This is in line with my current goal of stability before features. CACHEBOY_PRE and CACHEBOY_1.5 have passed all the polygraph runs I've been throwing at them and there aren't any outstanding stability issues in the Issue tracker.

I'll roll CACHEBOY_1.6.PRE3 and CACHEBOY_1.5.2 releases in the next day or two and get those out there.

Thursday, October 16, 2008

Serving IPv6 from Cacheboy-1.6.PRE2

I've done the very minimum amount of work required to get Cacheboy-1.6.PRE2 to the point where it'll handle IPv6 client requests. I've put it in front of http://www.cacheboy.net/ which now has v4 and v6 records.

There's still plenty of work to do to bring it up to par with the Squid-3 IPv6 support but that will have to wait a while. Specifically, (if anyone feels up to handling it), the dns, ipcache and fqdncache code all needs to be massaged to support IPv4 and IPv6 handling. It shouldn't be that much work.

Cacheboy-1.6 is definitely now in the "freeze and fix bugs as they creep up" stage. I'll continue the memory allocator and HTTP parser code reimplementation in their respective branches and get them ready for merge once I'm happy 1.6 is stable. The rest of the IPv6 support will also have to wait.

Friday, October 3, 2008

Cacheboy IPv6 update

I've made some progress in the IPv6 reorganisation in cacheboy. I've converted the ACL, authentication and ident code over to support v4/v6. I'm now going to convert over the client_db, request_t structure and then the related stuff like logging, x-forwarded-for, etc. I'll then revisit what else is required before I enable v6 sockets on the http client-side. It -should- be pretty minimal - persistent connections/connection pinning (for just assembling the hash key) and some SNMP code to just gloss over IPv6 connections for the time being.

Hm, I was hoping to have this all done by the end of September but I've been a bit busy with paid work. I'll hopefully have this done just after NYCBSDCON. I hope. :)

Sunday, September 21, 2008

IPv6 ACL code, sort of!

I'm just doing a spot of testing with my new IPv6 ACL code.

Take a look at this:


(adrian) agnus:~/work/cacheboy/playpen/ipv6_acl/tools% ./squidclient mgr:config@PASSWORD | grep acl
acl all src 0.0.0.0/0.0.0.0
acl all6 src6 ::/::
acl lclnet6 src6 fe80::/fff0::
acl test1 src6 2a01:348:147:5::/ffff:ffff:ffff:ffff::
acl test1 src6 fe80::/fff0::


That there is an IPv6 "src6" ACL (well, three) with somewhat unfriendly netmask display code. I'll tidy that up later. Importantly, the IPv6 code seems to be coming along fine. I'm going to generate up some large random IPv4 and IPv6 ACLs tomorrow to make sure they load in and display out from the splay tree fine, then I'll look at writing some test cases for all of this.

The last bit of code that needs converting before -very basic- client-side IPv6 support can be enabled is to convert the ACL checklist struct "src_addr" and "my_addr" over to sqaddr_t IPv6 types. This will probably require a whole lot of horrible code changes but luckily I can convert most of them to just be "assign that an IPv4 address thx" and everything should just work as before. Although I need to remind myself to make sure aclMatchIp() checks the _type_ of the ACL it's looking up against - doing an IPv4 lookup against an IPv6 splay tree won't really work out.
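Something along these lines at the top of the match routine is what I have in mind - note that the field and accessor names here are made up, and the trailing return is just to keep the fragment self-contained; this isn't the actual code:

static int
aclMatchIp(void *dataptr, const sqaddr_t *addr)
{
    acl_ip_data *d = dataptr;

    /* Wrong address family: an IPv4 lookup can never match an IPv6 ACL
     * entry (and vice versa), so bail out before touching the splay tree. */
    if (d->family != sqinet_get_family(addr))
        return 0;

    /* ... the existing splay tree lookup continues here unchanged ... */
    return 0;
}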

(Amos / Squid-3 have a single IPv6 "type" for this, and the IPv4 addresses are merged into the IPv6 address space. The ACL types for IP src/dst/myip are then -always- IPv6 type lookups. I decided to keep separate IPv4/IPv6 ACL types for now to make testing and development easier. It will double up on the ACL sizes a little - holy crap, I'm doing something less efficient than Squid-3?!? - but that's a small price to pay at the moment for an easier to migrate codebase. Basically, if you compile this up and listen on an IPv6 address, but don't configure an IPv6 ACL, you won't get surprised when IPv6 requests are let through when they shouldn't..)

Friday, September 5, 2008

Cacheboy-1.5: IPv6 DNS servers

I'm just debugging the last couple of issues with the IPv6-aware UDP/TCP DNS code. The internal DNS resolver still only understands IPv4 (and, more importantly, so does the ipcache/fqdncache layer!) but the code itself will communicate with IPv4/IPv6 DNS servers.

I think I'll stop the development here and concentrate on getting the Cacheboy-1.5 release out the door. I'll then work on IPv6 record resolution in a separate branch in preparation for Cacheboy-1.6. I may even break out the ipcache/fqdncache code into external libraries so I can reuse/debug/test that code during development.

Tuesday, September 2, 2008

Upcoming Cacheboy-1.5.PRE3 development release

(Yes, I've been slack in posting about this stuff.)

I'm just about to roll the next Cacheboy-1.5 development pre-release. Cacheboy-1.5 is probably the last "almost but not quite squid-2.HEAD" release. Besides the IPv6 core, Cacheboy-1.5 resembles the Squid code but with a more sensible layout of modules and libraries.

Its main difference is the inclusion of core comm layer changes to support IPv6 in preparation for IPv6 client and server support. This particular pre-release includes some changes to the internal DNS code to decouple it from a few routines in src/ relating to TCP socket connection. It's possible I've busted stuff - just run cacheboy with "debug_options ALL,1 78,2" for a while to see if you're falling back to TCP DNS properly.

I'm about to put Cacheboy-1.5.PRE3 in production for a couple of clients to get some real world feedback.

Sunday, August 24, 2008

Standalone HTTP header parser!

I've finally broken out enough of the HTTP header parsing code from src/ into libhttp/ to run the http header parser standalone.

This allows me to write some test cases to make sure I don't break things whilst changing how the HTTP header parser and HTTP header entry code uses (ie, abuses!) the memory allocator. It's also one step closer to being able to reuse bits of the Squid internals in a "simpler" HTTP proxy core.

I'll commit this code reorganisation to Cacheboy trunk after I've released and tested a few developer previews.

So, without further delay:


test1b: test parsing sample headers
| init-ing hdr: 0x7fffffffe6f0 owner: 2
| parsing hdr: (0x7fffffffe6f0)
Host: www.creative.net.au
Content-type: text/html
Foo: bar


| creating entry 0x60ed40: near 'Host: www.creative.net.au'
| created entry 0x60ed40: 'Host: www.creative.net.au'
| 0x7fffffffe6f0 adding entry: 27 at 0
| creating entry 0x60eda0: near 'Content-type: text/html'
| created entry 0x60eda0: 'Content-Type: text/html'
| 0x7fffffffe6f0 adding entry: 18 at 1
| creating entry 0x60ee00: near 'Foo: bar'
| created entry 0x60ee00: 'Foo: bar'
| 0x7fffffffe6f0 adding entry: 68 at 2
retval from parse: 1
Parsed Header: Host: www.creative.net.au
Parsed Header: Content-Type: text/html
Parsed Header: Foo: bar
| cleaning hdr: 0x7fffffffe6f0 owner: 2
| destroying entry 0x60ed40: 'Host: www.creative.net.au'
| destroying entry 0x60eda0: 'Content-Type: text/html'
| destroying entry 0x60ee00: 'Foo: bar'

Thursday, August 21, 2008

IPv6 core merged into cacheboy trunk

I've just completed merging the IPv6 core into the cacheboy trunk. This doesn't mean it handles IPv6 client/server requests yet - there's a lot more to do before that can happen!

I'll next merge in the IPv6 DNS changes from husni's Squid-2.6 IPv6 patch and do up a basic test suite for all of that. Once done, I'll roll the first Cacheboy-1.5 pre-release.

Wednesday, August 20, 2008

Merging sockaddr_rework into Cacheboy trunk

I'm slowly cherrypicking bits and pieces of the Cacheboy sockaddr_rework into trunk. I've merged in the no_addr/any_addr tidyup which makes those comparisons and sets much clearer. I'll next bring over the sqinet_ routines as just files, ignoring the change history. I'll then bring over the changesets implementing the sqinet_ changes to the comm code and main codebase, retaining the basic change history.

I now need some live testing under decent amounts of real traffic so I can make sure I haven't missed some silly corner condition in the base comm code.

All of this work exposed some of the ugliness that happens in the IPC code with filedescriptor creation that bypasses the comm layer. Basically, the IPC helper code creates file descriptors itself and uses fd_open() to tell Squid about them but then unconditionally uses comm_close() to close them. This is .. stupid.

I may drop in some debugging code to ensure that only sockets created by the comm layer are closed by comm_close(). I wonder how many bad uses of file descriptors will be caught out by that..
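Something as simple as this would do - assuming a flag on the fde entry that only the comm layer ever sets (that flag doesn't exist today, it's just to illustrate the idea):

void
comm_close(int fd)
{
    fde *F = &fd_table[fd];

    assert(F->flags.open);
    /* Trip over anything (like the IPC code) that fd_open()'ed a descriptor
     * itself but is trying to tear it down through the comm layer. */
    assert(F->flags.comm_owned);

    /* ... the usual comm_close() teardown continues here ... */
}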

Tuesday, August 12, 2008

Cacheboy IPv6 (phase 1): More Updates!

The IPv6 code hackery is going along well. I'm just sorting out a few loose ends .. well, making them slightly less loose.

I'll run polygraph PolyMix-4 over this codebase in the next few days to make sure I haven't busted anything and then I'll start preparing to merge it back into CACHEBOY_PRE.

I'm not quite sure how to conditional-compile IPv6; I'm not bothering to do it at the moment (ie, it's always included.) That's a later problem.

The IPv6 TCP proxy is still happily chugging along. FreeBSD's IPv6 stack still seems to be partially Giant-locked but I'm still pushing ~ 350mbit through this Core 2 Duo test server.

This has been too easy. What the hell have I missed!??

Sunday, August 10, 2008

IPv6 tcp proxy success

I feel like an undergraduate computer science student after all of this. I've managed to coax the cacheboy core to support v4/v6 and am using it in the tcpproxy test application.

I've got a modified apachebench speaking IPv6 to tcpproxy, listening on a :8080 IPv6 socket. It then forwards all requests to a thttpd instance running on IPv4.

Tomorrow's job - making sure the squid proxy codebase is still happy with these latest changes, and then preparing for some further testing and the implementation of some unit tests for the comm and inet libraries. Then it's back to commercial projects for a few weeks.


Server Software: thttpd/2.25b
Server Hostname: [2a01:348:XXX:3207]
Server Port: 8080

Document Path: /test8k
Document Length: 8192 bytes

Concurrency Level: 1000
Time taken for tests: 21.690 seconds
Complete requests: 100000
Failed requests: 0
Broken pipe errors: 0
Total transferred: 844171764 bytes
HTML transferred: 819841632 bytes
Requests per second: 4610.42 [#/sec] (mean)
Time per request: 216.90 [ms] (mean)
Time per request: 0.22 [ms] (mean, across all concurrent requests)
Transfer rate: 38919.86 [Kbytes/sec] received

Saturday, August 9, 2008

Cacheboy IPv6 (phase 1): Updates!

I've been working on the IPv6 core support in a cacheboy branch (Changes: http://code.google.com/p/cacheboy/source/list?path=playpen/sockaddr_change) and it seems to be coming along swimmingly.

The current goal is just to get basic IPv6 support into the base libraries and keep the rest of the codebase IPv4-only.

I've converted commBind(), comm_connect_addr() and comm_accept() to my new IPv4/IPv6 address type and nothing seems amiss at the present time. comm_open() and comm_openex() will take a little more time as there are plenty of places which create a new outgoing socket.

My next move is to modify my tcp proxy to listen on both IPv4 and IPv6 incoming ports and proxy to an IPv4 destination. I can then fire off some HTTP clients at it and see what happens.
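The listen side doesn't need anything cacheboy-specific to test the idea - plain getaddrinfo() with AI_PASSIVE is enough to get one v4 and one v6 wildcard listener. This is just a generic sketch, not the tcpproxy code:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
#include <string.h>
#include <unistd.h>

/* Open passive sockets for every address family the resolver returns
 * (typically one IPv4 and one IPv6 socket for the wildcard address). */
int
open_listen_sockets(const char *port, int *fds, int maxfds)
{
    struct addrinfo hints, *res, *ai;
    int n = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;        /* wildcard addresses */

    if (getaddrinfo(NULL, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL && n < maxfds; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
#ifdef IPV6_V6ONLY
        if (ai->ai_family == AF_INET6) {
            int on = 1;
            /* keep the v6 socket v6-only so the separate v4 socket can bind too */
            setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof(on));
        }
#endif
        if (bind(fd, ai->ai_addr, ai->ai_addrlen) < 0 || listen(fd, 128) < 0) {
            close(fd);
            continue;
        }
        fds[n++] = fd;
    }
    freeaddrinfo(res);
    return n;
}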

(I may have to modify apachebench-adrian to support IPv6 though; I'm not sure what other stupidly-high-traffic open source http benchmarking clients exist at the present time.)

I hope to get all of this sorted out in the next week or so and head over to the Sydney Squid developers meet with my "alternate" IPv6 core for Squid-2 and better understand the IPv4/IPv6 requirements before discussing them with Amos.

Tuesday, July 29, 2008

Benchmarking is available!

I've begun benchmarking Cacheboy-1.4. The details are available at http://www.cacheboy.net/benchmarks.html. They aren't spectacular - I'm mainly doing them to keep track on development and make sure I'm not introducing regressions anywhere.

I'm not all that happy with 50% CPU (on one CPU too!) at 500 req/sec. Alas, that's what I have to work with - I can't push these disks nor the polygraph hosts any harder at the present time. Maybe if I spent two weeks fixing polygraph so it used kqueue() instead of poll() ..

Saturday, July 26, 2008

Commercial work updates!

I've just completed the development and local testing of the client-side delay pools. That'll go into Squid-2.HEAD in the next few days. I'll try untangling the client-side delay pools from the class 5 delay pool work (which shouldn't be -that- difficult, just slightly tedious) and commit them as two separate chunks.

I'll post more details on my company blog - http://xenionhosting.blogspot.com/ - as I think the details of my current and future commercial Squid stuff should be detailed over there.

Wednesday, July 23, 2008

Surviving polymix-4..

I'm putting Cacheboy-1.4 through a basic polymix-4 polygraph workload. So far so good - it's just unfortunate that polygraph still uses poll() / select(). Most of the process CPU time is spent in those two system calls and not doing any useful work.

So far, so good at ~ 500 req/sec (with <10% CPU usage..) I'm going to resolve a few strange issues I'm seeing and then begin publishing some actual performance numbers over the next few weeks. I'll also start publishing some microbench numbers comparing Squid-2.6, Squid-2.7, Squid-3.0, Squid-3.1 and Cacheboy. Cacheboy will come out on top, of that I'm quite sure. :)

Tuesday, July 22, 2008

Threading Squid - initial observations

My next task after some IPv6 related reshuffling is to bring in the bare essentials needed to make Squid^WCacheboy SMP-happy.

There are a few potential ideas:


  • Leave Squid single-threaded. Stop it from doing its own disk/memory caching; push that out to a shared external process and abuse sysvshm IPC/anonymous mmap/etc to share large amounts of data efficiently;

  • Thread Squid entirely. Allow multiple concurrent copies of squid running in threads - whichever "model" of thread helpers you choose - and parallelise everything;

  • Provide basic thread services but leave Squid monolithic - push certain things into threads for now, figure out what benefits from being run in parallel;

  • A mix of all of the above.



Some of the problems that are faced!

cbdata



The cbdata type makes threading a pain in the ass. Specifically, anything which wants to be shared between threads needs to be able to be 'locked' into memory until the thread hands it back either completed, or cancelled.

cbdata doesn't give you any guarantees that the pointer is pointing to something even remotely valid - even if you cbdataLock()'ed the item, the owner (or not! That's how horrible the code can get) can cbdataFree() the underlying pointer and suddenly you're pointing at gunk. It might smell mostly right, it might even have somewhat valid data, but it's still freed gunk, and that's not good enough.

Shared Statistics



Squid keeps a lot of statistics and histograms. Something needs to be done to allow these to be kept in multiple threads without lots of fine-grain locks and/or stalling.

I may just get rid of a lot of the complicated statistics and require them to be post-process derived externally.

Memory Pools



The memory pools framework will be a nightmare to thread efficiently. Well, memory allocators in general are. I -could- just fine-grain lock it, but it gets a -lot- of requests and so I'd have to first fix the pool abusers before I consider this. (I'm going to do it anyway, but not so I can then fine-grain thread mempools.) I could figure out the best way to thread it - or run multiple pools per pool, one per thread - but damnit, this is 2008, there are better malloc implementations out there by people who understand concurrency issues better than I. It's a waste of time to try and thread it until I understand the workload and implications better.

So I'll -probably- be turfing mempools as it stands and replacing it with just enough to keep statistics before going direct to malloc(). See the statistics section above. I won't do this until I've modified the heaviest mempool abusers to -not- put such large demands on the allocator system, so it'll be a win/win situation everywhere.
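In other words, keep the per-pool accounting but hand the actual allocation straight to the system allocator. Something as thin as this would do (the names here are illustrative, not an existing API):

#include <stdlib.h>

typedef struct {
    const char *label;
    size_t obj_size;
    unsigned long alloc_calls;
    unsigned long free_calls;
    unsigned long inuse;
} MemStatPool;

void *
memStatAlloc(MemStatPool *pool)
{
    /* keep the statistics mempools currently gives us ... */
    pool->alloc_calls++;
    pool->inuse++;
    /* ... but let the real allocator do the actual work */
    return calloc(1, pool->obj_size);
}

void
memStatFree(MemStatPool *pool, void *obj)
{
    pool->free_calls++;
    pool->inuse--;
    free(obj);
}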

more to come..

Tuesday, July 15, 2008

Commercial projects and such..

I've got a few commercial projects to finish up on Squid over the next few weeks which will be taking my time away from Cacheboy development.

Specifically:


  • I'm adding client-side -write- delay pools, so you can rate limit the replies sent back to clients whether they are a cache hit or miss (specifically for reverse proxies, but I'm sure forward proxies will have a use for them);

  • Buffering POST requests a bit before connecting to the back-end origin server, which matters when your back-end server pays a high price for holding a connection open with no data going over it;

  • Finally - some log reporting tools (hopefully written in Lua! :) for basic WebUI logfile reporting in a fast, sensible manner



I've got a few other possibilities which might creep up over the next couple of months but nothing yet concrete.

Client-side IPv6, HTTP/1.1 and a threaded core will have to wait until I've completed the paid work I'm afraid! OSS coders have to eat too!

Sunday, July 13, 2008

Watching things evolve..

I'm finding it interesting to watch myself "evolve" the Cacheboy roadmap over time. Take the previous two cacheboy-users posts: first I thought Cacheboy-1.4 would get the IPv6 enabled core, but after doing the latest set of changes I've decided the best thing to do is to get Cacheboy-1.4 out with the current code layout, sort out whatever bugs crept in, then build the IPv6 enabled core in Cacheboy-1.5 and IPv6 client-side support in Cacheboy-1.6.

I have a general idea where I'd like to take things and I have a specific set of goals in mind along the way, but everything is still evolving with time. It's an interesting experience - there are dozens of areas in the codebase which I'd like to spend time working on but I have to keep the medium and long-term project goals in mind.

Which isn't to say I won't get distracted from time to time and break out a test branch to play with something, like one of the branches playing around with memory allocation overheads. I just treat that, like the last 10 or so years of experimenting with the codebase, as a way to get more of an idea what work needs to be done.

Saturday, July 12, 2008

Cacheboy: shuffling around the DNS code

I'm shuffling around the DNS code in preparation for some work toward an IPv6 core. Strictly speaking, I could have just left the dns code in src/ and IPv6'ed the raw network/socket layer but I've decided "basic" functional IPv6 support will require DNS support and so be it. It'll let me write test cases to make sure that the new code handles IPv4 and IPv6 DNS "right". I still don't know what "right" entails and I'm sure that journey will be very enlightening!

It's been more tedious than complicated. There's a bunch of config file parsing which needs to stay in src/ and I've split out the "libsqdns" DNS initialisation from the "squid" DNS initialisation. It compiles and runs here, resolving DNS requests happily, so I guess I'm mostly on track. I had to shuffle around some config variables so it's entirely possible I've screwed that up somewhere.

This highlights the requirement for a much more sensible configuration management framework. It doesn't even have to be that complicated - just not the "one great big Config struct" that Squid currently has. I've got some plans in the back of my head to generic-ify that much later on down the track but it'll have to wait a while. It'll probably come in when the ACL code is split out into squid-specific and generic ACL types. (A lot of the ACL types aren't really specific to HTTP and in reality can be reused in a variety of network applications.)

So tomorrow I'll find some time to get the external DNS code working again which I hope will be slightly easier than the internal DNS code. Then I can let this codebase simmer for a bit, push Cacheboy-1.4 out the door and wait for it to stabilise before my next round of changes towards IPv6.

Saturday, July 5, 2008

The Squid Callback Data Type

One of the issues programmers frequently face is knowing whether some piece of data you have is actually valid. Modern languages provide a variety of methods for creating an "invariant condition" about the validity of your data - reference counting, for example, allows you to ensure that data is not freed before all references to it have been removed. This invariant is not always what you first think. The invariant condition for reference counting, for example, is that the data is either referenced by something or by nothing at all. Generally the programmer will treat the transition from "referenced by something" to "referenced by nothing" as the important transition and do something like removing the object from whatever lists it's on, notifying other objects that it's going away, cleaning up allocated memory, etc.

Take traditional "callback" type programming. The programmer decides that some function is to be called after an event has occurred (for example, "the ACL lookup has completed", or "the network write has completed") and this function needs some sort of "state" to know what it's operating on. You could view this state as a sort of object. The trouble in C is that the language itself doesn't give you any tools to know whether the supplied pointer is valid or not. Now, think about this - firstly, what does "valid" mean? The pointer is pointing to some region of memory that hasn't been freed? What about the state of the object? What if the object state changed between the callback being scheduled and the callback being executed? Is this "valid"?

Squid implements "callback data". Initially, this "callback data" (called cbdata in the code) was a registry for callback data pointers. Pointers were reference counted when passed in as part of a scheduled callback; they would be decremented before the callback was about to run, and the callback would only be executed if the callback pointer was "valid". The "owner" (for whatever meanings of "own" you'd like to try and define) could "free" the data pointer - in which case the callback data registry would mark that pointer as invalid; subsequent checks for the validity of said pointer would return invalid, and any callbacks that were going to occur could be ignored. Eventually, the reference count would hit 0 and at 0 the memory at the pointer would be freed.

Expressed as code:

ptr = cbdataAlloc(type);
...
doSomething(someFunc, ptr);

which would:
state->cb = someFunc;
state->cbdata = ptr;
cbdataLock(ptr);

.. then, when the chain of events which doSomething() started would finish, this would occur:

if (state->cb && cbdataValid(state->cbdata)) {
    state->cb(state->cbdata);
}
cbdataUnlock(state->cbdata);

This way, the callback would only occur IFF there was a callback and the callback data was still valid.

  • cbdataAlloc() returns a pointer with refcount = 0 and valid = true
  • cbdataLock() increments refcount
  • cbdataUnlock() decrements refcount and frees the pointer if (valid == false && refcount == 0)
  • cbdataValid() returns (valid == true)
  • cbdataFree() sets valid = false and frees the object if (valid == false && refcount == 0)

This worked out to be quite helpful in preventing callbacks from being run if the data was freed. It however introduces a few assumptions which make certain things difficult to debug and implement.

Firstly, you don't have any guarantee that the callback will be called when you schedule the call. So in the above code, if something calls cbdataFree(ptr) between the callback registration and the completion of the action initiated by doSomething(), the action will complete but the callback won't be made. The programmer needs to make sure that the code can handle not having the callback ever be made. Traditionally, you would instead either cancel the operation explicitly instead of letting it continue to completion and handle the situation where it couldn't be cancelled, or let the operation complete before transitioning to some "dying" state.

Secondly, generally the "object destructor" here is called not by the cbdata reference count hitting 0, but by some explicit destruction call elsewhere in the code. For example, you would have this in the code:

void
fooComplete(foo *ptr)
{
    free(ptr->data);
    cbdataFree(ptr);
}

There still may be references to the callback data but no callbacks will occur on it because cbdataFree() marks that ptr as invalid. So cbdata isn't quite behaving the way traditional reference counted "types" behave.

Here's where this gets ugly: it can behave that way too - you can register a function to be called just before the ptr is finally freed. _SOME_ areas of code do this. _SOME_ areas of code do not. You can't assume that the behaviour for a given cbdata pointer type will be one or the other.

Thirdly, if the action initiated by doSomething() requires some part of ptr to be valid then it will need to wrap every access to the data inside ptr with an if (cbdataValid(ptr)) check. This doesn't always happen :) and has been the cause of all sorts of silly bugs because although the memory pointed to by ptr is still valid, the object may have gone through its "destruction" phase and what's left in memory (which again, hasn't been freed) is actually the last traces of object state. This may be valid, this may be invalid. Who knows. I can't guarantee that accesses to cbdata pointer dereferences are always done conditional to said pointer being valid. That would be a fun thing to hack in as a valgrind module!

This all started rearing its ugly head in Squid-3 as a few things were converted from cbdata type pointers to more traditional reference counted types. The programmers assumed the behaviour was equivalent when it wasn't and all kinds of strange bugs arose some of which took over 12 months to find and fix.

What would I like to see? That's a good question and will probably form the basis of further improvements in Cacheboy..

Wednesday, July 2, 2008

libevent httperf!

I decided to poke httperf a little as a testing suite - and it uses select()! What the hell?

Four hours later, I think I have a libevent enabled httperf.

http://code.google.com/p/httperf-adrian/

Monday, June 30, 2008

Initial Profiling!


Here's the output trace: I'm running it on the Sun X2100 running a flavour of ubuntu; this is doing ~ 300mbit FDX at about 9000 req/sec (tiny transactions!) w/ 1000 concurrent connections; I'm specifically trying to trace the management overhead versus the data copying overhead. This has maxed out both thttpd on the server-side and the tcp proxy itself.



Gah, look at all of those mallocs and stdio calls doing "stuff"..



root@rachelle:/home/adrian/work/cacheboy/branches/CACHEBOY_PRE/app/tcptest# opreport -l ./tcptest | less
CPU: AMD64 processors, speed 2613.43 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
96851 11.3738 libc-2.6.1.so vfprintf
62317 7.3182 libc-2.6.1.so _int_malloc
37556 4.4104 tcptest comm_select
35405 4.1578 tcptest commSetEvents
32901 3.8638 libc-2.6.1.so _int_free
30245 3.5518 tcptest commSetSelect
29890 3.5102 tcptest commUpdateEvents
28812 3.3836 libc-2.6.1.so _IO_default_xsputn
20360 2.3910 tcptest sslSetSelect
17279 2.0292 libc-2.6.1.so malloc_consolidate
16610 1.9506 libc-2.6.1.so epoll_ctl
16307 1.9150 tcptest sslReadServer
16154 1.8971 libc-2.6.1.so fcntl
14601 1.7147 tcptest xstrncpy
12003 1.4096 libc-2.6.1.so memset
11617 1.3643 tcptest memPoolAlloc
10931 1.2837 libc-2.6.1.so calloc

First milestone - code reuse!

I've spent the last couple of evenings committing code to break out the last bits of the core event loop. I then added in a chopped up copy of src/ssl.c (the SSL CONNECT tunneling stuff) and voila! I now have a TCP proxy.

A (comparatively) slow TCP proxy (3000 small obj/sec instead of where it should be: ~10,000 small obj/sec). A slow, single-threaded TCP proxy, but a TCP proxy nonetheless.

I can now instrument just the core libraries to find out where they perform and scale poorly, separate from the rest of the Squid codebase. I count this as a pretty big milestone.

Thursday, June 26, 2008

Why memory allocation is a pain in the ass..

The memory allocator routines in particular are annoying - a lot of work has gone into malloc implementations over the last few years to make them perform -very- well in threaded applications as long as you know what you are doing. This means doing things like allocating/freeing memory in the same thread and limiting memory exchange between threads (mostly an issue with very small allocations).

Unfortunately, the mempools implementation saves a noticeable amount of CPU because it hides all of the repetitive small memory allocations which Squid does for a variety of things. It's hard to profile too - I see that the CPU spends a lot of time in the allocator, but figuring out which functions are causing the CPU usage is difficult. Sure, I can find out the biggest malloc users by call - but they're not the biggest CPU users according to the oprofile callgraphs. I think I'll end up having to spend a month or so rewriting a few areas of code that account for the bulk of the malloc'ing to see what effect it has on CPU before I decide what to do here.

I just don't see the point in trying to thread the mempools codebase for anything other than per-pool statistics when others have been doing a much better job of understanding memory allocation contention on massively parallel machines.

Cacheboy-1.3 (.1) released; short-term future

I've just merged the latest Squid-2.HEAD changes into Cacheboy and released 1.3.1.

1.3 and 1.3.1 fix the Vary related issues which affect hit rates.

1.3.1 fixes the SNMP counter bugs.

This ends the first set of mostly non-intrusive changes which have been made to the codebase. The next area of work will be pulling out the rest of the event/communications/signal code from src/ and into libiapp/ so I can begin treating "Squid" as a client of "libiapp" - ie, the libiapp code handles event, fd, communication and event scheduling (disk stuff is still in src/ for now) making callbacks into the Squid application. I can then begin writing a few test applications to give the core and support libraries a good thrashing.

I'll start planning out threading and ipv6 support in the libraries themselves with the minimum amount of Squid changes required to continue functioning (but still staying in IPv4/non-threaded land.) The plan is to take something like a minimalistic TCP proxy that's been fully debugged and use it as the basis for testing out potential IPv6 and threading related changes, separate from the rest of the application.

My tentative aim is to run the current "Squid" application in just one thread but have the support libraries support threading (either by explicitly supporting concurrency or being labelled as "not locking" and thus callers must guarantee nothing quirky will happen.) The three areas that strike me as being problematic right now are the shared fd/comm state (fd_table[]), the statistics being kept all over the place and the memory allocator routines. (I'll write up the malloc stuff in a different post.)

Tuesday, June 24, 2008

Current CPU usage

Where's the CPU going?

Here's an oprofile output from a very naive custom polygraph workload, ~ 1000 requests a second, ~14kbyte objects. MemPools are disabled; Zero buffers are off so the majority of the allocations aren't zero'ed.


Note that somehow, memPoolAlloc takes 4% of CPU even with memory pools switched off. The allocations still go via the pool code but deallocations aren't "cached". What the hell is taking the 4% of CPU time?



CPU: AMD64 processors, speed 2613.43 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
176014 6.7039 libc-2.6.1.so _int_malloc
160779 6.1236 libc-2.6.1.so memcpy
128371 4.8893 libc-2.6.1.so malloc_consolidate
123734 4.7127 squid memPoolAlloc
101514 3.8664 libc-2.6.1.so free
76772 2.9240 libc-2.6.1.so _int_free
55696 2.1213 libc-2.6.1.so malloc
55681 2.1207 libc-2.6.1.so vfprintf
50245 1.9137 libc-2.6.1.so calloc
48095 1.8318 squid httpHeaderIdByName
41172 1.5681 libm-2.6.1.so floor
37573 1.4310 libc-2.6.1.so re_search_internal
37434 1.4258 libc-2.6.1.so memchr
36536 1.3916 squid xfree
30646 1.1672 libc-2.6.1.so memset
30576 1.1646 squid memPoolFree
30108 1.1467 squid headersEnd
28626 1.0903 squid httpHeaderGetEntry
26668 1.0157 squid storeKeyHashCmp
...

Thursday, June 19, 2008

Updates - comm code, etc

I've finally managed to divorce the comm code from the base system. It's proving to be a pain in the butt for a few reasons:

  • The DNS code is involved in the socket connection path - most users just pass a hostname in to the comm connect call and it gets diverted via the ipcache/dns code. Tsk!
  • There's quite a bit of statistics gathering which goes on - the code is very monolithic and the statistics code keeps 5/60 minute histograms as well as raw counters
  • The event loop needs to be sorted out quite a bit better - right now the event loop is still stuck in src/main.c and this needs to change
The statistics gathering and reporting for the network/disk syscalls and events will have to change - I don't feel like trying to make the histogram code more generic and modular. I don't think that Squid should be maintaining the histograms - that's the job for a reporting suite. Squid should just export raw counters for a reporting suite to record and present as appropriate. I'll add in a new cachemgr option to report the "application core" statistics in a machine-parsable manner and leave it at that for now. (As a side-note, I also don't think that Squid should have SNMP code integrated. It should have an easier, cleaner way of grabbing statistics and modifying the configuration and an external SNMP daemon to do SNMP stuff.)
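The machine-parsable output doesn't need to be fancy - one counter per line in key=value form is enough for an external poller to diff and graph. A rough sketch using a handful of the existing counters (the action name and the exact counter list are just for illustration):

static void
statRawCountersDump(StoreEntry * sentry)
{
    storeAppendPrintf(sentry, "client_http.requests=%d\n", statCounter.client_http.requests);
    storeAppendPrintf(sentry, "client_http.hits=%d\n", statCounter.client_http.hits);
    storeAppendPrintf(sentry, "client_http.errors=%d\n", statCounter.client_http.errors);
    storeAppendPrintf(sentry, "server.http.requests=%d\n", statCounter.server.http.requests);
    storeAppendPrintf(sentry, "server.http.errors=%d\n", statCounter.server.http.errors);
    storeAppendPrintf(sentry, "syscalls.selects=%d\n", statCounter.syscalls.selects);
}

static void
statRawCountersInit(void)
{
    cachemgrRegister("counters_raw", "Raw application core counters",
        statRawCountersDump, 0, 1);
}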

I then need to extract out the main event loop somewhat from src/main.c and turn it into something that can be reused. The main loop handles the following:
  • comm events
  • store dir events
  • timed/immediate registered events
  • signals - which basically just set global variables!
  • checking signal global variables - for rotate, shutdown, etc
I think I'll implement a libevent setup of sorts - I'll implement some methods in libiapp to register callbacks to occur when certain signals are set (sort of like libevent) but the storedir and signal global variable handler will just be functions called in the src/main.c loop. I'd like to implement a Squid-3 like method of registering event dispatchers but I will leave all of that alone until this is all stable and I've done planning into concurrency and SMP.

It's also possible that the reasons for registering dispatchers will go away with a slightly more sensible event abstraction (eg, if I convert the signal handlers to proper events (exactly like libevent!) which get pushed into the head of the event queue and called at the beginning of the next loop iteration - this however assumes the global variables that are set in the current signal handlers are only checked in the main loop and not elsewhere..!)
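The conversion itself would be pretty mechanical - the handler does nothing but set a flag, and the top of the loop turns any set flags into normal events. Roughly this (the event callback names are placeholders, not existing functions):

static volatile sig_atomic_t do_rotate = 0;
static volatile sig_atomic_t do_shutdown = 0;

static void
sigusr1_handler(int sig)
{
    do_rotate = 1;
}

static void
sigterm_handler(int sig)
{
    do_shutdown = 1;
}

/* called at the top of every main loop iteration */
static void
signalEventsDispatch(void)
{
    if (do_rotate) {
        do_rotate = 0;
        eventAdd("logRotate", rotateLogsEvent, NULL, 0.0, 1);
    }
    if (do_shutdown) {
        do_shutdown = 0;
        eventAdd("shutdown", shutdownEvent, NULL, 0.0, 1);
    }
}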
 

Wednesday, June 11, 2008

Async IO related hackery

I've been staring at the Async IO code in preparation to migrate stuff out of the aufs directory and into a separate library.

It requires the fd tracking code (understandable) and the comm code to monitor a notification pipe. This monitor pipe was used by the worker threads to wake up the main process if it's waiting inside a select()/poll()/etc call, so it can immediately work on some disk IO.

Squid checks the aio completion queues each time through the comm loop. For aio, there isn't a per-storedir queue, there's just a global queue for all storedirs and other users, so aioCheckCallbacks() is called for each storedir.

There are two problems - firstly, select()/poll() take a while to run on a busy cache, so aioCheckCallbacks() isn't called that often. But the event notification based mechanisms end up running very often, returning a handful of filedescriptors each pass through the comm loop - and so the storedir checks get called far more often. Secondly, it's called once per storedir, so if you have 10 storedirs (like I have for testing!) aioCheckCallbacks() is called 10 times per IO loop.

This is a bit silly!

Instead, I've modified the async IO code to only call aioCheckCallbacks() when that pipe is written to. This ends up being the "normal" hack that UNIX thread programmers do to wake up a thread stuck waiting for both network and thread events. This cuts back substantially on the number of aioCheckCallbacks() calls without impacting performance (as far as I can see.)
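The shape of the change looks something like this - the real code differs in the details, and the completion-reaping call below is just a stand-in for the per-storedir aioCheckCallbacks() calls:

#include <unistd.h>
#include <fcntl.h>

static int aio_done_pipe[2];

/* main thread: read handler registered on the notification pipe */
static void
aioNotifyRead(int fd, void *data)
{
    char buf[256];

    /* drain the pipe; the bytes are just a doorbell, the count doesn't matter */
    while (read(fd, buf, sizeof(buf)) == sizeof(buf))
        ;
    aioReapCompletions();       /* stand-in for the aioCheckCallbacks() calls */
    commSetSelect(fd, COMM_SELECT_READ, aioNotifyRead, NULL, 0);
}

static void
aioNotifyInit(void)
{
    pipe(aio_done_pipe);
    fcntl(aio_done_pipe[0], F_SETFL, O_NONBLOCK);
    fd_open(aio_done_pipe[0], FD_PIPE, "aio completion notify");
    commSetSelect(aio_done_pipe[0], COMM_SELECT_READ, aioNotifyRead, NULL, 0);
}

/* worker thread: ring the doorbell after queuing a completed operation */
static void
aioNotifyMain(void)
{
    char c = 0;

    write(aio_done_pipe[1], &c, 1);
}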

Next! By default, the aufs store code only does async IO for open() and read() - write() and close() don't run asynchronously. Apparently this is due to testing under Linux - unless you're stressing the buffer cache too hard, write() to a disk FD didn't block, so there wasn't a reason to run write() and close() async. Apparently Solaris close() will block as metadata writes are done synchronously, and it's possible FreeBSD + softupdates may do something similar. This is all "apparently", I haven't sat down and instrumented any of this!

FreeBSD and Solaris users have reported that diskd performs better than aufs - something I don't understand, as diskd only handles one outstanding disk IO at a time with similar issues with write() and close() to aufs (namely, if the calls block, the whole diskd process stops handling disk IO) but the difference here is the main process won't hang whilst these syscalls complete. Perhaps this is a reason for this behaviour. It's difficult for me to test; aufs has always performed fantastically for me.

There's so much to tidy up and reorganise, I still can't sit down and begin implementing any of the new features I want to!

Thursday, June 5, 2008

More reorganisation..

I've moved cbdata, mempools, fd and legacy disk (file_*) routines out of src/. I also shuffled the comm related -definitions- out but not the code. I hit a bit of a snag - the comm code path used for connecting a socket to a remote site actually uses the DNS code. Fixing this will involve divorcing the DNS lookup stuff so sockets can be connected to a remote IP directly - and the DNS lookup path will just be in another module.

This however is more intrusive than "code reorganisation" so it's going to have to wait a while. Unfortunately, this means that my grand plans for 1.1 will have to be put on hold a little until I've thought this out a little more and implemented it in a separate branch.

Thus, things will change a little. 1.1 will be released shortly, with the current set of changes included. I'll then concentrate on planning out the next set of changes required to properly divorce the core event/disk code from src/.

Why do this? Well, the biggest reason is to be able to build "other" bits of code which reuse the Squid core. I can write unit tests for a lot of stuff, sure, but it also means I can write simple network and disk applications which reuse the Squid core and find exactly how hard I can push them. I can also break off a branch and hack up the code to see what impact changes make without worrying that said changes expose strange stuff in the rest of the Squid codebase.

The four main things that I'd like to finally sort out are:

  • IPv6 socket support - support v4/v6 in the base core, and make sure that it works properly
  • Sort out the messy disk related code and reintegrate async IO as a top-level disk producer (like it was in Squid-2.2 and it almost is in Squid-3) so it can be again used for things like logfile writing!
  • Begin looking at scatter/gather disk and network IO - gather disk IO should work out great for writing logfile buffers and objects to disk, for example
  • Design a parallelism model which allows multiple threads to cooperate on tasks - worker threads implementing callback type stuff for some work; entire separate network event threads (look at memcached as an example.) "Squid" as it stands will simply run as one thread, but some CPU intensive stuff can be pushed into worker threads for some cheap parallelism gains (as on the roadmap, ACLs and rewriting/content manipulation are two easy targets.)
So there's a lot of work to do, and not a lot of time to do it in.

Saturday, May 24, 2008

cacheboy 1.0 released

Cacheboy 1.0 has been tagged, tarballed and port'ed. It's been in production at my beta tester's site for a week or so now and hasn't missed a beat.

There's a lot more work to do to shape Cacheboy up into what I believe Squid should've been; a stable 1.0 release is the first step along this path.

I'll let this settle for a couple weeks (ie, Adrian needs to sit his mid-year exams in two weeks!) before I begin some more larger-scale code refactoring and shuffling around.

Cacheboy-1.1 changes will include MemBuf, cbdata, most of the http request/reply/header manipulation code and potentially a little of the filedescriptor, disk, event and network communication code. This stuff forms the "core" of Squid/Cacheboy. I'll then look at some basic infrastructure changes to support IPv6 clients.

Like the Cacheboy-1.0 changes, these will not be terribly difficult or intrusive (the IPv6 client-only changes will be the most intrusive by far!) but a lot of refactoring, rewriting and shuffling about of the core needs to take place before I can begin work on the necessary stuff - HTTP/1.1, SMP, modularity, performance.

Saturday, May 17, 2008

FreeBSD port update

I've updated the port to CACHEBOY_0.PRE6; it also now defaults to the replacement shiny english errors (NewEnglish) rather than the default ones (English).

Friday, May 16, 2008

Revalidating objects in Polygraph

I have a locally hacked up polygraph config based on datacomm-1. Datacomm-1 is a very simple workload which doesn't pretend to be the real world at all; it thus makes it really easy for me to implement custom bits of polygraph to test specific things.

One thing I needed to test was object revalidation. I needed objects to be revalidated in a relatively short period of time so I could trigger a storage revalidation bug in Squid-2.HEAD.

Here's the changes.

include/content.pg; added:

ObjLifeCycle olcRevalid = {
  length = const(2min);   
  variance = 50%;
  with_lmt = 100%;
  expires = [
    lmt + const(2min) : 5%,
    now + const(5min) : 15%
  ];
};

content cntRevalid = {
  kind = "revalid";
  obj_life_cycle = olcRevalid;
  size = logn(32KB, 32KB);
  cachable = 80%;
  checksum = 1%;
};

I then edited my locally modified datacomm-1.pg to set the contents to cntRevalid.

Now, I get stale objects popping up during the test - and I need to figure out why -that- is happening - but note that my life cycles are very quick (a couple of minutes). Squid _should_ be good down to an object lifetime of 1 second so I'm a bit surprised.

In any case, it tripped the bug, which is all that matters..

Cacheboy PRE6 is out

I've just rolled PRE6. This includes the Squid-2.HEAD fix for the signed vs unsigned comparison bug I introduced earlier (which led to a crash.)

This code -should- be stable enough for public consumption.

Wednesday, May 14, 2008

Error page update, phase 2


I've gone and modified all the English error pages in my little playpen project.

Here's an example of a live DNS failure. Same (confusing) text with Squid; slightly nicer layout.

Thursday, May 1, 2008

Errors shouldn't be ugly



Part of my "things i hate about Squid" list includes the god awful error pages which haven't really changed since .. well, since I got involved with the project in 2000.

Here's my take on the "simple" error page. The text is exactly the same as the old error page (note that I haven't included the "Generated by.." footer text here, as thats included by Squid/Cacheboy) but I've reformatted the error page to use CSS for layout and then crafted a very simple example CSS.

Sunday, April 27, 2008

Where is my CPU time going? (or how to divine useful information from oprofile)

OProfile is cool - it lets you dig into where your CPU is being spent. But aggregating statistics can be aggravating. (Yes yes, it was bad, I know..)

Take this example from cacheboy:

CPU: Core 2, speed 2194.48 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name symbol name
216049 6.5469 libc-2.7.so memcpy
115581 3.5024 libc-2.7.so _int_malloc
103345 3.1316 libc-2.7.so vfprintf
85197 2.5817 squid memPoolAlloc
64652 1.9591 libc-2.7.so memchr
60720 1.8400 libc-2.7.so strlen

...

Now, these tell you that CPU is being spent in the function (which is great) but its not the entire picture. The trouble is this: there's 527 functions in the top-level list, and 25 of them account for 1 or more percent of total runtime. Those top 25 account for ~ 45% of the total CPU time - so another 55% is being spent in the 501 functions remaining.

You may now ask yourself what the problem with that is - just optimise those top 25 functions and you'll be fine. Unfortunately, those top 25 functions aren't being called in one place - they're being called all over the shop.

Here's an example. Notice the strlen time:

496 13.7816 squid httpRequestFree
773 21.4782 squid httpHeaderPutStrf
9518 0.3432 libc-2.7.so vsnprintf
85433 55.6846 libc-2.7.so vfprintf
18212 11.8704 libc-2.7.so strchrnul
16037 10.4528 libc-2.7.so _IO_default_xsputn
13351 8.7021 libc-2.7.so _itoa_word
10872 7.0863 libc-2.7.so strlen
9518 6.2038 libc-2.7.so vsnprintf [self]

...

Note that the CPU times above "vsnprintf" are from the functions which call it, and CPU times below "vsnprintf" are the calls which it makes. It's not immediately obvious that I have to optimise "vsnprintf" calls from the top-level trace, as most of the *printf() calls end up being to "vsnprintf" (which shows up at 0.3% of CPU time) rather than "vfprintf" and friends.

It's obvious here that finding those places which call the *printf() functions in performance critical code - and then exorcising them - will probably help quite a bit.

What about the rest of the 500 odd functions? What I'd like to do is build aggregates of CPU time spent in different functions, including their called functions, and figure out which execution stacks are chewing the most CPU. That's something to do after Cacheboy-1 is stable, and then only after my June exams.

The important thing here is that I have the data to figure out where Squid does things poorly and given enough time, I'm going to start fixing them in the next Cacheboy release.

Wednesday, April 23, 2008

Solaris Event Ports for Network IO

What do I do at midnight to try and relax?

Figure out how to make Solaris Event Ports work for Network IO.

It took me a while to realise that "num" needs to be initialised with the minimum number of events you'd like to wait for before port_getn() returns. I haven't any idea whether this will restrict the returned event count to 1 or whether it will grow to MAX - this will need further testing. It is enough to handle single requests though, so it's a start!
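For reference, this is roughly what the wait loop ends up looking like - nget is primed with the minimum to wait for and comes back holding how many events were actually returned (up to the array size). A generic sketch, not the actual comm module:

#include <port.h>
#include <poll.h>

#define EP_MAX_EVENTS 256

static int
commPortGetEvents(int port)
{
    port_event_t events[EP_MAX_EVENTS];
    struct timespec timeout = { 1, 0 };     /* wait at most one second */
    uint_t nget = 1;                        /* wait for at least one event */
    uint_t i;

    if (port_getn(port, events, EP_MAX_EVENTS, &nget, &timeout) < 0)
        return -1;

    for (i = 0; i < nget; i++) {
        int fd = (int) events[i].portev_object;

        /* event ports are one-shot: re-associate if we still want events on fd */
        port_associate(port, PORT_SOURCE_FD, events[i].portev_object,
            events[i].portev_events, events[i].portev_user);

        /* ... dispatch the read/write handlers for fd here ... */
    }
    return (int) nget;
}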

Sunday, April 20, 2008

Knowing what your allocator is doing..

I committed a change a few years ago which collapsed the mem_node struct + buffer into one structure. This relieved quite a high volume of allocator requests, but it made the structure slightly larger than 4k.

Modern malloc implementations (and it's possible earlier ones circa 2001 did too; remember I was only 21 then!) have a separation between "small" and "large" (and "huge"!) objects. Small objects (say, under a page size) will generally go in a pool of just those object sizes. Large objects (from say page size to something larger, like a megabyte) will be allocated a multiple of pages.

This unfortunately means that my 4096 + 12 byte structure may suddenly take 8192 bytes of RAM! Oops.

I decided to test this out. This is what happens when you do that with FreeBSD's allocator. Henrik has tried this under GNUMalloc and has found that the 4108 byte allocation doesn't take two pages.

[adrian@sarah ~]$ ./test1 test1 131072
allocating 12, then 4096 byte structures 131072 times..
RSS: 537708
[adrian@sarah ~]$ ./test1 test2 131072
allocating 4108 byte structure 131072 times..
RSS: 1063840
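The test program is tiny - roughly this (a reconstruction, not the exact test1.c; ru_maxrss on FreeBSD is reported in kilobytes, which lines up with the numbers above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int
main(int argc, char *argv[])
{
    long i, count;
    struct rusage ru;

    if (argc < 3) {
        fprintf(stderr, "usage: %s test1|test2 count\n", argv[0]);
        return 1;
    }
    count = atol(argv[2]);

    if (strcmp(argv[1], "test1") == 0) {
        printf("allocating 12, then 4096 byte structures %ld times..\n", count);
        for (i = 0; i < count; i++) {
            void *a = malloc(12);
            void *b = malloc(4096);
            memset(a, 1, 12);       /* touch the memory so it counts in RSS */
            memset(b, 1, 4096);
        }
    } else {
        printf("allocating 4108 byte structure %ld times..\n", count);
        for (i = 0; i < count; i++) {
            void *a = malloc(4108);
            memset(a, 1, 4108);
        }
    }

    getrusage(RUSAGE_SELF, &ru);
    printf("RSS: %ld\n", ru.ru_maxrss);
    return 0;
}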

Saturday, April 19, 2008

"Dial before you Dig"

This is the sort of stuff lurking in the Squid codebase which really needs to be cleaned out.

The short:

http://code.google.com/p/cacheboy/source/detail?r=12592

The long:

I'm going through the legacy memory allocator uses and pushing the allocator initialisation out to the modules themselves rather than having some of them globally initialised. This will let me push the buffer allocator code (ie, the "rest" of the current legacy memory allocator uses) outside of the Squid/cacheboy src/ directory and allows them to be reused by other modules. I can then begin pushing some more code out of the src/ directory and into libraries to make dependencies saner, code reuse easier and unit testing much easier.

One of these types is MEM_LINK_LIST.

A FIFO type queue implementation was implemented using an single linked list. An SLIST has an O(1) dequeue behaviour but an O(n) queue behaviour - the whole list has to be traversed to find the end before it can append to the end. This requires touching potentially dirty pages which may also stall the bus a little. (I haven't measured that in my testing btw; my benchmarking focused on the memory-hit/miss pathway and left ACLs/disk access out - thus thats all currently conjecture!)

The FIFO implementation allocated a temporary list object (MEM_LINK_LIST) to hold the next and data pointers. This was mempooled and thus "cached", rather than hitting the system malloc each time.

The only user is the threaded aufs storage code - to store the pending disk read and write operations for a given open storage file.

Now, the "n" in O(n) shouldn't be that great as not very many operations are queued on an open file - generally, there's one read pending on a store file and potentially many writes pending on a store file (if the object is large and coming in faster than 4kbytes every few milliseconds.) In any case, I dislike unbounded cases like this, so I created a new function in the double-linked list type which pops the head item off the dlink list and returns it (and returns NULL if the list is empty) and re-worked the aufs code to use it. The above link is the second half of the work - read the previous few commits for the dlink changes.
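The new function is about as small as you'd expect - something like this (a sketch of the idea rather than the committed diff):

void *
dlinkPopHead(dlink_list * list)
{
    dlink_node *m = list->head;
    void *data;

    if (m == NULL)
        return NULL;
    data = m->data;
    dlinkDelete(m, list);
    /* the dlink_node normally lives inside the structure 'data' points at,
     * so there's nothing extra to free here */
    return data;
}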

Note: I really want to move all of this code over to the BSD queue/list types. ARGH! But I digress.

Initial testing shows that I haven't screwed anything up too badly. (~400 req/sec to a pair of 10,000 RPM 18gig SCSI disks, ~50mbit client traffic, 80% idle CPU.)

Thursday, April 17, 2008

Development thus far

The initial release has been done. The cacheboy-0pre1 release is just a vanilla Squid-2.HEAD tree with the first part of the code reorganisation included - they should be 1:1 bug compliant.

There's a developer who has found a bug in Squid-2.HEAD relating to larger-than-requested data replies in the data pipeline. That shows up during Vary processing. It shouldn't show up in Squid-2.HEAD / Cacheboy as I committed a workaround.

The Squid-2.HEAD / Cacheboy stuff should give a ~5% CPU reduction over Squid-2.7 (dataflow changes), and a ~10% CPU reduction over Squid-2.6 (HTTP parsing changes).

Next: sorting out the rest of the code shuffling - the generic parts of the mem and cbdata routines, and then a look at the comm and disk code.

Sunday, April 13, 2008

Experiences with Google Code

I've imported a Squid-2.HEAD CVS repository (with complete history!) into Google Code. This -mostly- worked, although!
  • There were some Subversion gatewaying issues inside Google somewhere which made SVN transactions occasionally fail - they've rolled back these changes and things work again!
  • I'm not getting any commit messages for some reason!
  • SVNSYNC takes -far too long- : building the SVN repo from CVS took about 5 minutes. Syncing my local SVN repo to the Google Code repo? 2 days.
  • The size of my repository hangs browsers that try to run the Google Code "source browse" feature. Heh!
  • $Id$ tag version numbers have been munged into Subversion revision numbers. Argh! I wish it were obvious this was going to happen! (And because of the above times, I really can't be bothered re-building the repository just yet.)
All in all though I've been happy with the service - Google employees pipe up on the code hosting group and are generally helpful. The "wiki contents in SVN" trick is cute. The source browser was nice when it worked! And I like the simple UI for things like revision browsing.