Monday, February 23, 2009

Lusca and BGP, take 2.

I've ironed out the crash kinks (the rest of the "kinks" are in the BGP FSM implementation); thus I'm left with:

1235459412.856 17063 118.92.109.x TCP_REFRESH_HIT/206 33405 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS7657
1235459417.194 1113 202.150.98.x TCP_HIT/200 45637 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS17746

Notice how the Squid logs have AS numbers in them? :)

Lusca and BGP

I've been fleshing out some very, very basic BGP support in a lusca-head branch. I'm only using the BGP information right now for logging but I'll eventually use it as part of the request and reply processing.
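
For the logging case, the lookup itself is just a longest-prefix match from the client address to the origin AS of the covering route. Here's a minimal sketch of the idea in C, with illustrative names only - the real code keeps the full table in a radix tree, not a flat array:

#include <stdint.h>

/* Illustrative only: a RIB entry mapping one prefix to its origin AS. */
struct rib_entry {
    uint32_t prefix;        /* network address, host byte order */
    int masklen;            /* prefix length, 0..32 */
    int origin_as;
};

/* Longest-prefix match: return the origin AS covering addr, or 0. */
static int
rib_lookup_as(const struct rib_entry *rib, int n, uint32_t addr)
{
    int i, best_len = -1, best_as = 0;
    for (i = 0; i < n; i++) {
        uint32_t mask = rib[i].masklen ? ~0U << (32 - rib[i].masklen) : 0;
        if ((addr & mask) == rib[i].prefix && rib[i].masklen > best_len) {
            best_len = rib[i].masklen;
            best_as = rib[i].origin_as;
        }
    }
    return (best_len >= 0) ? best_as : 0;
}

The result of that lookup is what ends up as the trailing AS number in the access log.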

It *cough* mostly works. I need to figure out why there's occasional radix tree corruption (which probably means running it under valgrind to find where the radix code goes off the map..), un-dirty some of the BGP code (i.e., implement a real FSM, with proper separation of the protocol handling, FSM, network and RIB code) and add in the AS path/community/attribute stuff before I commit it to LUSCA_HEAD.
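
For reference, the FSM I have in mind is the standard RFC 4271 session state machine. A bare-bones sketch of the states and transitions, with message sending, timers and most error handling elided:

typedef enum {
    BGP_IDLE, BGP_CONNECT, BGP_ACTIVE,
    BGP_OPENSENT, BGP_OPENCONFIRM, BGP_ESTABLISHED
} bgp_state_t;

typedef enum {
    BGP_EV_START,           /* operator/config starts the session */
    BGP_EV_TCP_UP,          /* TCP connection established */
    BGP_EV_TCP_FAIL,        /* TCP connect failed or closed */
    BGP_EV_OPEN_RCVD,       /* OPEN message received */
    BGP_EV_KEEPALIVE_RCVD,  /* KEEPALIVE received */
    BGP_EV_ERROR            /* NOTIFICATION, hold timer expiry, etc. */
} bgp_event_t;

/* One transition step of the session FSM. */
static bgp_state_t
bgp_fsm_step(bgp_state_t state, bgp_event_t ev)
{
    if (ev == BGP_EV_ERROR)
        return BGP_IDLE;                /* any error drops back to Idle */
    switch (state) {
    case BGP_IDLE:
        if (ev == BGP_EV_START) return BGP_CONNECT;
        break;
    case BGP_CONNECT:
        if (ev == BGP_EV_TCP_UP) return BGP_OPENSENT;    /* send OPEN */
        if (ev == BGP_EV_TCP_FAIL) return BGP_ACTIVE;    /* retry path */
        break;
    case BGP_ACTIVE:
        if (ev == BGP_EV_TCP_UP) return BGP_OPENSENT;
        break;
    case BGP_OPENSENT:
        if (ev == BGP_EV_OPEN_RCVD) return BGP_OPENCONFIRM; /* send KEEPALIVE */
        break;
    case BGP_OPENCONFIRM:
        if (ev == BGP_EV_KEEPALIVE_RCVD) return BGP_ESTABLISHED;
        break;
    case BGP_ESTABLISHED:
        break;                          /* UPDATEs feed the RIB here */
    }
    return state;                       /* unhandled events: stay put */
}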

It is kind of cool though, having a live BGP feed in your application. :) All 280,000-odd routes of it. :)

Sunday, February 1, 2009

Lusca development, and changes to string handling

I've just renamed Cacheboy to "Lusca". I've had a few potential users comment that "Cacheboy" isn't uhm, "management compatible", so the project has been renamed to try and bring some of these users on board. I'm also hoping to make Lusca less Adrian-focused and involve more of the community. We'll see how that goes.

In terms of development, I've shifted the code to http://code.google.com/p/lusca-cache/ and I'm continuing my work in /branches/LUSCA_HEAD.

I've been working on src/http.c (the server-side HTTP code) in preparation for introducing reference counted buffer/string handling. I removed one copy (of the socket read buffer into another memory buffer, to assemble a buffer containing the HTTP reply, in preparation for parsing) and have just migrated that bit of the codebase over to use my reference counted buffer (buf_t; found in libmem/buf.[ch].) It's entirely possible that I've horribly broken the server-side code, so I'm reluctant to do much else until I've finished restructuring and testing the server-side HTTP code.
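
To give a feel for what "reference counted buffer" means here - this is a minimal sketch of the idea only, not the actual buf_t layout or API in libmem/buf.[ch]:

#include <stdlib.h>

typedef struct {
    char *data;     /* underlying storage */
    size_t len;     /* bytes in use */
    int nref;       /* reference count */
} buf_t;

static buf_t *
buf_create(size_t size)
{
    buf_t *b = calloc(1, sizeof(buf_t));
    b->data = malloc(size);
    b->len = size;
    b->nref = 1;
    return b;
}

static buf_t *
buf_ref(buf_t *b)
{
    b->nref++;      /* another holder of the same bytes */
    return b;
}

static void
buf_deref(buf_t *b)
{
    if (--b->nref == 0) {   /* last holder frees the storage */
        free(b->data);
        free(b);
    }
}

The point being that the socket read buffer and the store can then hold references to the same underlying bytes, so handing data between them stops implying a memcpy().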

I've also been tidying up a few more places where the current String API is used "incorrectly" - at least, incorrectly for reference counted strings/buffers. I have ~61 code chunks to rewrite, mostly in the logging code. I've done it twice already in other branches, so this won't be terribly difficult. It's just boring. :)

Oh, and I've also just removed the "caching" bits of the MemPools code. MemPools in LUSCA_HEAD is now just a small wrapper around malloc/calloc/free, mainly to preserve the "block allocator" style API and keep some statistics. At the end of the day, Squid uses memory very, very poorly, and the caching code in MemPools exists purely to work around said poor memory use. I'm going to just fix the memory use (mostly revolving around String buffers, HTTP headers and the TLV code, amazing that!) so the number of calls through the allocator is much, much reduced. I'm guessing once I've finished, the number of calls through the system allocator will be about 2 or 3% of what they are now. That should drop the CPU use quite a bit.
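
Roughly the shape of the slimmed-down MemPools - illustrative fields and bodies only, the real LUSCA_HEAD structs and signatures differ:

#include <stdlib.h>

typedef struct {
    const char *label;          /* pool name, for the stats output */
    size_t obj_size;
    unsigned long alloc_count;  /* statistics kept; no object caching */
    unsigned long free_count;
} MemPool;

static void *
memPoolAlloc(MemPool *pool)
{
    pool->alloc_count++;
    return calloc(1, pool->obj_size);   /* straight through to libc */
}

static void
memPoolFree(MemPool *pool, void *obj)
{
    pool->free_count++;
    free(obj);
}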

Ah, now to find testers..

Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, let's find out where the time is spent in cacheboy-head, per symbol, with cumulative time, showing only symbols taking 1% or more of the CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples cum. samples % cum. % image name symbol name
2100394 2100394 6.9315 6.9315 libc-2.3.6.so memcpy
674036 2774430 2.2244 9.1558 libc-2.3.6.so vfprintf
657729 3432159 2.1706 11.3264 squid memPoolAlloc
463901 3896060 1.5309 12.8573 libc-2.3.6.so _int_malloc
453978 4350038 1.4982 14.3555 libc-2.3.6.so strncasecmp
442439 4792477 1.4601 15.8156 libc-2.3.6.so re_search_internal
438752 5231229 1.4479 17.2635 squid comm_select
423196 5654425 1.3966 18.6601 squid memPoolFree
418949 6073374 1.3826 20.0426 squid stackPop
412394 6485768 1.3609 21.4036 squid httpHeaderIdByName
402709 6888477 1.3290 22.7325 libc-2.3.6.so strtok
364201 7252678 1.2019 23.9344 squid httpHeaderClean
359257 7611935 1.1856 25.1200 squid statHistBin
343628 7955563 1.1340 26.2540 squid SQUID_MD5Transform
330128 8285691 1.0894 27.3434 libc-2.3.6.so memset
323962 8609653 1.0691 28.4125 libc-2.3.6.so memchr

Ok, that's sort of useful. What's unfortunate is that there are, uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so that's a bit annoying. 16 symbols take ~28% of the CPU time, but the remaining ~580 take the other ~72%. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Let's now look at what memcpy() is doing, just to get an idea of what needs to change:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
-------------------------------------------------------------------------------
28133 1.3394 squid storeSwapOut
31515 1.5004 squid stringInit
32619 1.5530 squid httpBuildRequestPrefix
54237 2.5822 squid strListAddStr
54322 2.5863 squid storeSwapMetaBuild
80047 3.8110 squid clientKeepaliveNextRequest
171738 8.1765 squid httpHeaderEntryParseCreate
211091 10.0501 squid httpHeaderEntryPackInto
318793 15.1778 squid stringDup
1022812 48.6962 squid storeAppend
2100394 100.000 libc-2.3.6.so memcpy
2100394 100.000 libc-2.3.6.so memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend, followed by stringDup and httpHeaderEntryPackInto. Ok, those are what I'm going to be working on eliminating next anyway, so it's not a big deal. This means I'll eliminate ~73% of the memcpy() CPU time, which is 73% of ~7% of total CPU, so around 5% of CPU time. Not too shabby. There'll be some overhead introduced by how it's done (reference counted buffer management), but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so those should drop off a bit too.

But this stuff is still just micro-optimisation. What I need is an idea of which code -paths- are taking up precious CPU time, and thus what I should consider reimplementing first. Let's use "-t" on non-top-level symbols. To start with, let's look at the two top-level "read" functions, which generally lead to some other kind of processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
87536 4.7189 clientKeepaliveNextRequest
1758418 94.7925 comm_select
88441 100.000 clientReadRequest
2121926 86.3731 clientTryParseRequest
88441 3.6000 clientReadRequest [self]
52951 2.1554 commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
3962448 99.7463 comm_select
163081 100.000 httpReadReply
2781096 53.2193 httpAppendBody
1857597 35.5471 httpProcessReplyHeader
163081 3.1207 httpReadReply [self]
57084 1.0924 memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since it's just the comm routine :) but in which functions these routines are calling - in the callgraph output above, callers appear above the function's own 100% line and callees (plus the "[self]" time) below it. The next trick, of course, is to try and figure out which of these paths are taking a noticeable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

Monday, January 19, 2009

Eliminating copies, or "god this code is horrible"

I've been (slowlyish!) unwinding some of the evil horridness that exists in the src/http.c code which handles reading data from upstream servers/caches, parsing it, and throwing it into the store.

There are two annoying memory copies, as I've said before - one was a copy of the incoming data into a MemBuf, used -just- to assemble the full response headers for parsing, and the other (well, other two) are for appending the data coming in from the network into the memory store, on its way to the client-side code to be sent back to the client.

Now, as I've said before, the src/http.c code isn't all that long and complicated (by far most of the logic actually happens in the forward and client-side routines; the http.c routines do very little besides pump data back into the memory store), but unfortunately enough layers of logic are mashed together to make things, uhm, "very difficult" to work on separately.

Anyway, back on track. I've mostly pulled apart the code which handles reading the reply and parsing the response headers, and I've eliminated the first copy. The data is now read directly into a MemBuf, which serves as both the incoming buffer (which gets appended to) for the reply status line + headers, _AND_ the incoming buffer for HTTP body data (which never gets appended to - it is written out to the memory store and then reset back to empty.)
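
In rough pseudo-C, the scheme looks like this - the struct fields and the httpTryParseReplyHeaders() helper are made-up names for illustration, not the actual src/http.c symbols:

static void
http_read_sketch(HttpStateData *http, int fd)
{
    MemBuf *mb = &http->reply_buf;
    int len = read(fd, mb->buf + mb->size, mb->capacity - mb->size);
    if (len <= 0)
        return;                     /* EOF and error handling elided */
    mb->size += len;

    if (!http->headers_parsed) {
        /* Hypothetical helper: parses the status line + headers and
         * records where the body begins in http->body_start. */
        if (!httpTryParseReplyHeaders(http, mb))
            return;                 /* incomplete headers; keep appending */
        http->headers_parsed = 1;
    }

    /* Everything past the headers is body: hand it to the store, then
     * reset the buffer so body data is never accumulated. */
    storeAppend(http->entry, mb->buf + http->body_start,
                mb->size - http->body_start);
    memBufReset(mb);
    http->body_start = 0;
}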

So the good news now is that the number one place for L2 loads, L2 stores and unhalted CPU cycles (as measured on my P3 667MHz Celeron test box - nice and slow, to expose all those stupid inefficiencies modern CPUs try to cover up :) is the memcpy() from src/http.c -> { header parsing (12%), http body appending (84%) } -> storeAppend().

This means one main thing: if I can eliminate the copy into the store, and instead read directly into variable-sized pages (which is unfortunately the bloody tricky part) which are then handed in their entirety to the memory store, that last memcpy() will be eliminated, along with hopefully a good 10+% of CPU time on this P3.
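
The shape of the end goal, sketched with the kind of refcounted buffer described in the post further up this page - all hypothetical names, and storeAppendPage() does not exist:

/* read() lands directly in a refcounted page and the store takes a
 * reference to it, rather than memcpy()ing the bytes into its own
 * buffers. */
buf_t *page = buf_create(SM_PAGE_SIZE);
int len = read(fd, page->data, page->len);
if (len > 0) {
    page->len = len;
    storeAppendPage(entry, buf_ref(page));  /* store now holds a ref */
}
buf_deref(page);                            /* drop our own reference */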

After that, it's fixing the various uses of the *printf() functions in the critical path, which absolutely should be avoided. I've got some basic patches to begin replacing some of the really STUPID uses of those. I'll begin committing the really obviously easy ones to Cacheboy HEAD once I've verified they don't break anything (in particular, SNMP indexes of all things..)
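
As an example of the sort of replacement meant here: building a Content-Length header with snprintf() pays for the whole format-string machinery just to render one integer. A hand-rolled append - a hypothetical helper, not an existing Cacheboy function - looks like:

#include <string.h>

/* Returns bytes written into dst, or 0 if dst is too small. */
static size_t
append_content_length(char *dst, size_t dstlen, unsigned long clen)
{
    static const char pfx[] = "Content-Length: ";
    char num[24];
    size_t i = sizeof(num), numlen, need;
    char *p = dst;

    do {                            /* render the digits backwards */
        num[--i] = '0' + (clen % 10);
        clen /= 10;
    } while (clen > 0);
    numlen = sizeof(num) - i;

    need = (sizeof(pfx) - 1) + numlen + 2;      /* + "\r\n" */
    if (need > dstlen)
        return 0;

    memcpy(p, pfx, sizeof(pfx) - 1);
    p += sizeof(pfx) - 1;
    memcpy(p, num + i, numlen);
    p += numlen;
    *p++ = '\r';
    *p = '\n';
    return need;
}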

Once the two above are done - they account for a good 15-20% of the current CPU use in Cacheboy (at least in my small-object, memory-cache-only test load on the above hardware) - I'll absolutely stop adding any and all new changes, features and optimisations, and go -straight- back to "make everything stable" mode.

There's still so much that needs doing (proper refcounted buffers and strings, comm library functions which properly implement readv() and writev() so I can do things like write out the entire request/reply using vector operations and avoid the other bits of copying which go on, lessening the load on the memory allocator by actually efficiently packing structures, rewriting the http request/reply handling in preparation for replacement HTTP client/server modules, oh and IPv6/threading!) but that will come later.
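
For the readv()/writev() item specifically, the payoff is that the reply status line, headers and a body page can go out in one syscall without first being assembled into a contiguous buffer. Plain POSIX writev(), with placeholder variables (the comm-library wrapper mentioned above is the part that doesn't exist yet):

#include <sys/uio.h>

struct iovec iov[3];
iov[0].iov_base = status_line;  iov[0].iov_len = status_len;
iov[1].iov_base = headers;      iov[1].iov_len = headers_len;
iov[2].iov_base = body_page;    iov[2].iov_len = body_len;
ssize_t written = writev(fd, iov, 3);   /* one syscall, no assembly buffer */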

Sunday, January 18, 2009

Tidying up the http reply handling code..

One of the unfortunate parts of the Squid codebase is that the HTTP request and reply handling code is mixed in with the client and server code, and contains both stuff specific to a cache (e.g., looking for headers to control cache behaviour) as well as connection stuff (e.g., Transfer-Encoding, keepalive, etc.)

My long-term goal is to finally separate all of this mess out so there's "generic" routines to be a HTTP client and server, create requests/replies and parse responses. But for now, tidying up some of the messy code to improve performance (and thus give people motivation to migrate their busy sites to Cacheboy) is on my short-term TODO list.

I spent some time ~ 18 months ago tidying up all of the client-side code so the request line and request header parsing didn't require half a dozen copies of various things just to complete. That was quite successful. The code structure is still horrible, but it works, and that for now is absolutely the most important part.

Now I'm doing something similar to the server-side code. The HTTP server code (src/http.c) combines both reply buffer appending, parsing, 100-continue response handling (well, "handling") and the various header checks for caching and connection in one enormous puddle of code. I'm trying to tease these apart so each part is done separately and the reply data isn't double-copied - once into the reply buffer, then once via storeAppend() into the memory store.

The CPU time spent doing this copying isn't all that high on current systems, but it is definitely noticeable (~30% of all CPU time is spent in memcpy()) on slower systems talking to LAN-connected servers. So I'm going to do it - primarily to fix performance on slower hardware, but it also forces me to tidy up the existing code somewhat.

The next step is avoiding the copy into the memory store entirely, removing another 65% or so of memcpy() CPU time.