Monday, February 23, 2009

Lusca and BGP, take 2.

I've ironed out the crash kinks (the rest of the "kinks" are in the BGP FSM implementation); thus I'm left with:

1235459412.856 17063 118.92.109.x TCP_REFRESH_HIT/206 33405 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS7657
1235459417.194 1113 202.150.98.x TCP_HIT/200 45637 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS17746

Notice how the Squid logs have AS numbers in them? :)

Lusca and BGP

I've been fleshing out some very, very basic BGP support in a lusca-head branch. I'm only using the BGP information right now for logging but I'll eventually use it as part of the request and reply processing.
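
For the logging case, the lookup itself is just a longest-prefix match from the client address to the origin AS of the covering route. Here's a minimal sketch of the idea in C, with illustrative names only - the real code keeps the full table in a radix tree, not a flat array:

#include <stdint.h>

/* Illustrative only: a RIB entry mapping one prefix to its origin AS. */
struct rib_entry {
    uint32_t prefix;        /* network address, host byte order */
    int masklen;            /* prefix length, 0..32 */
    int origin_as;
};

/* Longest-prefix match: return the origin AS covering addr, or 0. */
static int
rib_lookup_as(const struct rib_entry *rib, int n, uint32_t addr)
{
    int i, best_len = -1, best_as = 0;
    for (i = 0; i < n; i++) {
        uint32_t mask = rib[i].masklen ? ~0U << (32 - rib[i].masklen) : 0;
        if ((addr & mask) == rib[i].prefix && rib[i].masklen > best_len) {
            best_len = rib[i].masklen;
            best_as = rib[i].origin_as;
        }
    }
    return (best_len >= 0) ? best_as : 0;
}

The result of that lookup is what ends up as the trailing AS number in the access log.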

It *cough* mostly works. I need to figure out why there's occasional radix tree corruption (which probably means running it under valgrind to find where the radix code goes off the map..), un-dirty some of the BGP code (i.e., implement a real FSM, with proper separation of the protocol handling, FSM, network and RIB code) and add in the AS path/community/attribute stuff before I commit it to LUSCA_HEAD.
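
For reference, the FSM I have in mind is the standard RFC 4271 session state machine. A bare-bones sketch of the states and transitions, with message sending, timers and most error handling elided:

typedef enum {
    BGP_IDLE, BGP_CONNECT, BGP_ACTIVE,
    BGP_OPENSENT, BGP_OPENCONFIRM, BGP_ESTABLISHED
} bgp_state_t;

typedef enum {
    BGP_EV_START,           /* operator/config starts the session */
    BGP_EV_TCP_UP,          /* TCP connection established */
    BGP_EV_TCP_FAIL,        /* TCP connect failed or closed */
    BGP_EV_OPEN_RCVD,       /* OPEN message received */
    BGP_EV_KEEPALIVE_RCVD,  /* KEEPALIVE received */
    BGP_EV_ERROR            /* NOTIFICATION, hold timer expiry, etc. */
} bgp_event_t;

/* One transition step of the session FSM. */
static bgp_state_t
bgp_fsm_step(bgp_state_t state, bgp_event_t ev)
{
    if (ev == BGP_EV_ERROR)
        return BGP_IDLE;                /* any error drops back to Idle */
    switch (state) {
    case BGP_IDLE:
        if (ev == BGP_EV_START) return BGP_CONNECT;
        break;
    case BGP_CONNECT:
        if (ev == BGP_EV_TCP_UP) return BGP_OPENSENT;    /* send OPEN */
        if (ev == BGP_EV_TCP_FAIL) return BGP_ACTIVE;    /* retry path */
        break;
    case BGP_ACTIVE:
        if (ev == BGP_EV_TCP_UP) return BGP_OPENSENT;
        break;
    case BGP_OPENSENT:
        if (ev == BGP_EV_OPEN_RCVD) return BGP_OPENCONFIRM; /* send KEEPALIVE */
        break;
    case BGP_OPENCONFIRM:
        if (ev == BGP_EV_KEEPALIVE_RCVD) return BGP_ESTABLISHED;
        break;
    case BGP_ESTABLISHED:
        break;                          /* UPDATEs feed the RIB here */
    }
    return state;                       /* unhandled events: stay put */
}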

It is kind of cool though, having a live BGP feed in your application. :) All 280,000-odd routes of it. :)

Sunday, February 1, 2009

Lusca development, and changes to string handling

I've just renamed Cacheboy to "Lusca". I've had a few potential users comment that "Cacheboy" isn't uhm, "management compatible", so the project has been renamed to try and bring some of these users on board. I'm also hoping to make Lusca less Adrian-focused and involve more of the community. We'll see how that goes.

In terms of development, I've shifted the code to http://code.google.com/p/lusca-cache/ and I'm continuing my work in /branches/LUSCA_HEAD.

I've been working on src/http.c (the server-side HTTP code) in preparation for introducing reference counted buffer/string handling. I removed one copy (of the socket read buffer into another memory buffer, to assemble a buffer containing the HTTP reply, in preparation for parsing) and have just migrated that bit of the codebase over to use my reference counted buffer (buf_t; found in libmem/buf.[ch].) It's entirely possible that I've horribly broken the server-side code, so I'm reluctant to do much else until I've finished restructuring and testing the server-side HTTP code.
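
To give a feel for what "reference counted buffer" means here - this is a minimal sketch of the idea only, not the actual buf_t layout or API in libmem/buf.[ch]:

#include <stdlib.h>

typedef struct {
    char *data;     /* underlying storage */
    size_t len;     /* bytes in use */
    int nref;       /* reference count */
} buf_t;

static buf_t *
buf_create(size_t size)
{
    buf_t *b = calloc(1, sizeof(buf_t));
    b->data = malloc(size);
    b->len = size;
    b->nref = 1;
    return b;
}

static buf_t *
buf_ref(buf_t *b)
{
    b->nref++;      /* another holder of the same bytes */
    return b;
}

static void
buf_deref(buf_t *b)
{
    if (--b->nref == 0) {   /* last holder frees the storage */
        free(b->data);
        free(b);
    }
}

The point being that the socket read buffer and the store can then hold references to the same underlying bytes, so handing data between them stops implying a memcpy().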

I've also been tidying up a few more places where the current String API is used "incorrectly" - at least, incorrectly for reference counted strings/buffers. I have ~61 code chunks to rewrite, mostly in the logging code. I've done it twice already in other branches, so this won't be terribly difficult. It's just boring. :)

Oh, and I've also just removed the "caching" bits of the MemPools code. MemPools in LUSCA_HEAD is now just a small wrapper around malloc/calloc/free, mainly to preserve the "block allocator" style API and keep some statistics. At the end of the day, Squid uses memory very, very poorly, and the caching code in MemPools exists purely to work around said poor memory use. I'm going to just fix the memory use (mostly revolving around String buffers, HTTP headers and the TLV code, amazing that!) so the number of calls through the allocator is much, much reduced. I'm guessing once I've finished, the number of calls through the system allocator will be about 2 or 3% of what they are now. That should drop the CPU use quite a bit.
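
Roughly the shape of the slimmed-down MemPools - illustrative fields and bodies only, the real LUSCA_HEAD structs and signatures differ:

#include <stdlib.h>

typedef struct {
    const char *label;          /* pool name, for the stats output */
    size_t obj_size;
    unsigned long alloc_count;  /* statistics kept; no object caching */
    unsigned long free_count;
} MemPool;

static void *
memPoolAlloc(MemPool *pool)
{
    pool->alloc_count++;
    return calloc(1, pool->obj_size);   /* straight through to libc */
}

static void
memPoolFree(MemPool *pool, void *obj)
{
    pool->free_count++;
    free(obj);
}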

Ah, now to find testers..

Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, let's find out where the time is spent in cacheboy-head, per symbol, with cumulative time, showing only symbols taking 1% or more of the CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples cum. samples % cum. % image name symbol name
2100394 2100394 6.9315 6.9315 libc-2.3.6.so memcpy
674036 2774430 2.2244 9.1558 libc-2.3.6.so vfprintf
657729 3432159 2.1706 11.3264 squid memPoolAlloc
463901 3896060 1.5309 12.8573 libc-2.3.6.so _int_malloc
453978 4350038 1.4982 14.3555 libc-2.3.6.so strncasecmp
442439 4792477 1.4601 15.8156 libc-2.3.6.so re_search_internal
438752 5231229 1.4479 17.2635 squid comm_select
423196 5654425 1.3966 18.6601 squid memPoolFree
418949 6073374 1.3826 20.0426 squid stackPop
412394 6485768 1.3609 21.4036 squid httpHeaderIdByName
402709 6888477 1.3290 22.7325 libc-2.3.6.so strtok
364201 7252678 1.2019 23.9344 squid httpHeaderClean
359257 7611935 1.1856 25.1200 squid statHistBin
343628 7955563 1.1340 26.2540 squid SQUID_MD5Transform
330128 8285691 1.0894 27.3434 libc-2.3.6.so memset
323962 8609653 1.0691 28.4125 libc-2.3.6.so memchr

Ok, that's sort of useful. What's unfortunate is that there are, uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so that's a bit annoying. 16 symbols take ~28% of the CPU time, but the remaining ~580 take the other ~72%. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Let's now look at what memcpy() is doing, just to get an idea of what needs to change:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
-------------------------------------------------------------------------------
28133 1.3394 squid storeSwapOut
31515 1.5004 squid stringInit
32619 1.5530 squid httpBuildRequestPrefix
54237 2.5822 squid strListAddStr
54322 2.5863 squid storeSwapMetaBuild
80047 3.8110 squid clientKeepaliveNextRequest
171738 8.1765 squid httpHeaderEntryParseCreate
211091 10.0501 squid httpHeaderEntryPackInto
318793 15.1778 squid stringDup
1022812 48.6962 squid storeAppend
2100394 100.000 libc-2.3.6.so memcpy
2100394 100.000 libc-2.3.6.so memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend, followed by stringDup and httpHeaderEntryPackInto. Ok, those are what I'm going to be working on eliminating next anyway, so it's not a big deal. This means I'll eliminate ~73% of the memcpy() CPU time, which is 73% of ~7% of total CPU, so around 5% of CPU time. Not too shabby. There'll be some overhead introduced by how it's done (reference counted buffer management), but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so those should drop off a bit too.

But this stuff is still just micro-optimisation. What I need is an idea of which code -paths- are taking up precious CPU time, and thus what I should consider reimplementing first. Let's use "-t" on non-top-level symbols. To start with, let's look at the two top-level "read" functions, which generally lead to some other kind of processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
87536 4.7189 clientKeepaliveNextRequest
1758418 94.7925 comm_select
88441 100.000 clientReadRequest
2121926 86.3731 clientTryParseRequest
88441 3.6000 clientReadRequest [self]
52951 2.1554 commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
3962448 99.7463 comm_select
163081 100.000 httpReadReply
2781096 53.2193 httpAppendBody
1857597 35.5471 httpProcessReplyHeader
163081 3.1207 httpReadReply [self]
57084 1.0924 memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since it's just the comm routine :) but in which functions these routines are calling - in the callgraph output above, callers appear above the function's own 100% line and callees (plus the "[self]" time) below it. The next trick, of course, is to try and figure out which of these paths are taking a noticeable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

Monday, January 19, 2009

Eliminating copies, or "god this code is horrible"

I've been (slowlyish!) unwinding some of the evil horridness that exists in the src/http.c code which handles reading data from upstream servers/caches, parsing it, and throwing it into the store.

There are two annoying memory copies, as I've said before - one was a copy of the incoming data into a MemBuf, used -just- to assemble the full response headers for parsing, and the other (well, other two) are for appending the data coming in from the network into the memory store, on its way to the client-side code to be sent back to the client.

Now, as I've said before, the src/http.c code isn't all that long and complicated (by far most of the logic actually happens in the forward and client-side routines; the http.c routines do very little besides pump data back into the memory store), but unfortunately enough layers of logic are mashed together to make things, uhm, "very difficult" to work on separately.

Anyway, back on track. I've mostly pulled apart the code which handles reading the reply and parsing the response headers, and I've eliminated the first copy. The data is now read directly into a MemBuf, which serves as both the incoming buffer (which gets appended to) for the reply status line + headers, _AND_ the incoming buffer for HTTP body data (which never gets appended to - it is written out to the memory store and then reset back to empty.)
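
In rough pseudo-C, the scheme looks like this - the struct fields and the httpTryParseReplyHeaders() helper are made-up names for illustration, not the actual src/http.c symbols:

static void
http_read_sketch(HttpStateData *http, int fd)
{
    MemBuf *mb = &http->reply_buf;
    int len = read(fd, mb->buf + mb->size, mb->capacity - mb->size);
    if (len <= 0)
        return;                     /* EOF and error handling elided */
    mb->size += len;

    if (!http->headers_parsed) {
        /* Hypothetical helper: parses the status line + headers and
         * records where the body begins in http->body_start. */
        if (!httpTryParseReplyHeaders(http, mb))
            return;                 /* incomplete headers; keep appending */
        http->headers_parsed = 1;
    }

    /* Everything past the headers is body: hand it to the store, then
     * reset the buffer so body data is never accumulated. */
    storeAppend(http->entry, mb->buf + http->body_start,
                mb->size - http->body_start);
    memBufReset(mb);
    http->body_start = 0;
}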

So the good news now is that the number one place for L2 loads, L2 stores and unhalted CPU cycles (as measured on my P3 667MHz Celeron test box - nice and slow, to expose all those stupid inefficiencies modern CPUs try to cover up :) is the memcpy() from src/http.c -> { header parsing (12%), http body appending (84%) } -> storeAppend().

This means one main thing: if I can eliminate the copy into the store, and instead read directly into variable-sized pages (which is unfortunately the bloody tricky part) which are then handed in their entirety to the memory store, that last memcpy() will be eliminated, along with hopefully a good 10+% of CPU time on this P3.
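
The shape of the end goal, sketched with the kind of refcounted buffer described in the post further up this page - all hypothetical names, and storeAppendPage() does not exist:

/* read() lands directly in a refcounted page and the store takes a
 * reference to it, rather than memcpy()ing the bytes into its own
 * buffers. */
buf_t *page = buf_create(SM_PAGE_SIZE);
int len = read(fd, page->data, page->len);
if (len > 0) {
    page->len = len;
    storeAppendPage(entry, buf_ref(page));  /* store now holds a ref */
}
buf_deref(page);                            /* drop our own reference */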

After that, it's fixing the various uses of the *printf() functions in the critical path, which absolutely should be avoided. I've got some basic patches to begin replacing some of the really STUPID uses of those. I'll begin committing the really obviously easy ones to Cacheboy HEAD once I've verified they don't break anything (in particular, SNMP indexes of all things..)
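
As an example of the sort of replacement meant here: building a Content-Length header with snprintf() pays for the whole format-string machinery just to render one integer. A hand-rolled append - a hypothetical helper, not an existing Cacheboy function - looks like:

#include <string.h>

/* Returns bytes written into dst, or 0 if dst is too small. */
static size_t
append_content_length(char *dst, size_t dstlen, unsigned long clen)
{
    static const char pfx[] = "Content-Length: ";
    char num[24];
    size_t i = sizeof(num), numlen, need;
    char *p = dst;

    do {                            /* render the digits backwards */
        num[--i] = '0' + (clen % 10);
        clen /= 10;
    } while (clen > 0);
    numlen = sizeof(num) - i;

    need = (sizeof(pfx) - 1) + numlen + 2;      /* + "\r\n" */
    if (need > dstlen)
        return 0;

    memcpy(p, pfx, sizeof(pfx) - 1);
    p += sizeof(pfx) - 1;
    memcpy(p, num + i, numlen);
    p += numlen;
    *p++ = '\r';
    *p = '\n';
    return need;
}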

Once the two above are done - they account for a good 15-20% of the current CPU use in Cacheboy (at least in my small-object, memory-cache-only test load on the above hardware) - I'll absolutely stop adding any and all new changes, features and optimisations, and go -straight- back to "make everything stable" mode.

There's still so much that needs doing (proper refcounted buffers and strings, comm library functions which properly implement readv() and writev() so I can do things like write out the entire request/reply using vector operations and avoid the other bits of copying which go on, lessening the load on the memory allocator by actually efficiently packing structures, rewriting the http request/reply handling in preparation for replacement HTTP client/server modules, oh and IPv6/threading!) but that will come later.
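
For the readv()/writev() item specifically, the payoff is that the reply status line, headers and a body page can go out in one syscall without first being assembled into a contiguous buffer. Plain POSIX writev(), with placeholder variables (the comm-library wrapper mentioned above is the part that doesn't exist yet):

#include <sys/uio.h>

struct iovec iov[3];
iov[0].iov_base = status_line;  iov[0].iov_len = status_len;
iov[1].iov_base = headers;      iov[1].iov_len = headers_len;
iov[2].iov_base = body_page;    iov[2].iov_len = body_len;
ssize_t written = writev(fd, iov, 3);   /* one syscall, no assembly buffer */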

Sunday, January 18, 2009

Tidying up the http reply handling code..

One of the unfortunate parts of the Squid codebase is that the HTTP request and reply handling code is mixed in with the client and server code, and contains both stuff specific to a cache (e.g., looking for headers to control cache behaviour) as well as connection stuff (e.g., Transfer-Encoding, keepalive, etc.)

My long-term goal is to finally separate all of this mess out so there's "generic" routines to be a HTTP client and server, create requests/replies and parse responses. But for now, tidying up some of the messy code to improve performance (and thus give people motivation to migrate their busy sites to Cacheboy) is on my short-term TODO list.

I spent some time ~ 18 months ago tidying up all of the client-side code so the request line and request header parsing didn't require half a dozen copies of various things just to complete. That was quite successful. The code structure is still horrible, but it works, and that for now is absolutely the most important part.

Now I'm doing something similar to the server-side code. The HTTP server code (src/http.c) combines both reply buffer appending, parsing, 100-continue response handling (well, "handling") and the various header checks for caching and connection in one enormous puddle of code. I'm trying to tease these apart so each part is done separately and the reply data isn't double-copied - once into the reply buffer, then once via storeAppend() into the memory store.

The CPU time spent doing this copying isn't all that high on current systems, but it is definitely noticeable (~30% of all CPU time is spent in memcpy()) on slower systems talking to LAN-connected servers. So I'm going to do it - primarily to fix performance on slower hardware, but it also forces me to tidy up the existing code somewhat.

The next step is avoiding the copy into the memory store entirely, removing another 65% or so of memcpy() CPU time.