
Monday, March 12, 2018

Not merging stuff from FreeBSD-HEAD into production branches, or "hey FreeBSD-HEAD should just be production"

I get asked all the time why I don't backport my patches into stable FreeBSD release branches. It's a good question, so let me explain it here.

I don't get paid to do it.

Ok, so now you ask "but wait, surely the users matter?" Yes, of course they do! But, I also have other things going on in my life, and the stuff I do for fun is .. well, it's the stuff I do for fun. I'm not paid to do FreeBSD work, let alone open source wireless stuff in general.

So then I see posts like this:

https://www.anserinae.net/adventures-in-wifi-freebsd-edition.html

I understand his point of view, I really do. I'm also that user when it comes to a variety of other open source software and I ask why features aren't implemented that seem easy, or why they're not in a stable release. But then I remember that I'm also doing this for fun and it's totally up to me to spend my time however I want.

Now, why am I like this?

Well, the short-hand version is - I used to bend over backwards to try and get stuff into stable releases of the open source software I once worked on. And that work was taken advantage of by a lot of people and companies who turned around and incorporated it into successful commercial software releases without any useful financial contribution to either myself or the project as a whole. After enough of that, you realise that hey, maybe my spare time should just be my spare time.

My hope is that if people wish to backport my FreeBSD work to a stable release then they'll either pay me to do it, pay someone else to do it, or see if a company will sponsor that work for their own benefit. I don't want to get into the game of backporting things to one (and potentially two) stable releases and dealing with all the ABI changes and support fallout that happen when you port things into a mostly ABI-stable release. And yes, my spare time is my own.


Thursday, July 7, 2011

Lusca development update - IPv6 is almost working

Now that I've (hopefully!) completely finished with university, I can get back into using my hard-earned money from Xenion to get more Lusca development done. (And yes, I'll also be doing wireless development too, fear not.)

The IPv6 branch is a bit messy at the moment, but it's almost able to handle IPv6 server requests.

The problem? The existing code which handles connecting to remote hosts (ie, src/comm.c) doesn't "know" about IPv6. It assumes all sockets are IPv4 and that all addresses returned for a hostname are also IPv4.

There's unfortunately a lot of dirty code in there - commReuseFD() is the main culprit and a good example of this. The 30 second version - since the commConnectStart() API assumes the socket is already created before the connect() occurs, any "connect retry" (for multiple hostnames and multiple attempts at the same end-host) requires the socket to be closed and recreated. But the FD has to stay the same. So commReuseFD() manually creates a new FD, makes it look like the old FD, then calls dup2() to get it into the same FD as the old one.
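
In sketch form, the trick looks something like this (hypothetical names; the real commReuseFD() also copies flags, socket options and the bound source address across):

#include <sys/socket.h>
#include <unistd.h>

static int
comm_reuse_fd(int old_fd, int family)
{
    int new_fd = socket(family, SOCK_STREAM, 0);
    if (new_fd < 0)
        return -1;
    /* ... re-apply non-blocking mode, socket options, bind() here ... */
    if (dup2(new_fd, old_fd) < 0) {  /* new socket takes over the old FD slot */
        close(new_fd);
        return -1;
    }
    close(new_fd);   /* drop the temporary descriptor; old_fd stays valid */
    return old_fd;   /* same FD number, brand new socket underneath */
}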

The "Correct" Fix is to modify the API to not take an FD, but to return an FD on successful connect() to the remote destination. There's some problems with this though, most notably in the request forwarding layer where the FD is created and comm close handlers are assigned before the connection is attempted. I need to make sure that there's no code which calls comm_close() on the active connection whilst connect() is going on - as said code expects the FD to be valid and assigned by this point.

The "Dirty" fix is to modify commConnectStart() and commReuseFD() to check the FD address family and destroy/create a "new" socket with the correct address family.

The "Problem" is that the code allows the outgoing address to be specified, both for transparent interception (source address spoofing) and to be set via an ACL match. Since IPv4 and IPv6 addresses are now possible, the API will have to be modified to handle this case.

What I'm likely going to do is something inspired by Squid-3. I'll teach the forwarding layer about "try v4 destinations" and "try v6 destinations". The administrator can then configure whether to try v4 or v6 destinations first. Only one outgoing address has to be provided - either "v4" or "v6"; and commConnectStart() will only try connecting to IP addresses that match the family of the outgoing address. That way if a host resolves to a mix of v4 and v6 addresses, they'll be tried in a "v4" group, then a "v6" group (or vice versa). It's a bit dirty, but it's likely doable in the short-term.
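
A minimal sketch of the family-matching connect logic - a blocking, getaddrinfo()-based simplification, nothing like the real non-blocking comm code:

#include <sys/socket.h>
#include <netdb.h>
#include <string.h>
#include <unistd.h>

static int
connect_matching_family(const char *host, const char *port, int out_family)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = out_family;      /* AF_INET or AF_INET6 - the resolver
                                        * only returns matching addresses */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                     /* connected; stop trying */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;                         /* -1 if every address failed */
}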

In the long term, I'd like to fix the API up to be less messy and return an FD, rather than take an existing FD and abuse that. But that can come later.

Monday, March 8, 2010

Why are some Squid/Lusca ACL types slower than others? And which ones?

This post should likely be part of the documentation!

One thing which hasn't really been documented is the relative speed of each of the Squid/Lusca ACL types. This is important to know if you're administering a large Squid/Lusca install - it's entirely possible that the performance of your site will be massively impacted with the wrong ACL setup.

Firstly - the types themselves:
  1. Splay trees are likely the fastest - src, dst, myip, dstdomain, srcdomain
  2. The wordlist checks are linear, but place hits back at the top of the wordlist to try and speed up the most looked-up items - portname, method, snmp community, urlgroup, hiercode
  3. The regular expression checks are also linear and also reshuffle the list based on the most popular items - url regex, path regex, source/destination domain regex, request/reply mime type
Now the exceptions! Some ACL types require DNS lookups before they can match - eg "dst" (which resolves the destination hostname to an IP), "srcdom_regex" and "dstdom_regex".

A lot of places will simply use URL regular expression ACLs ("url_regex") to filter/forward requests. Unfortunately these scale poorly under high load and are almost always the reason a busy proxy server is pegging at full CPU.
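
To illustrate with a hedged example (the file paths are made up; the ACL types are standard):

# Slow: every request is compared against each regex in the list in turn.
acl badurls url_regex -i "/etc/squid/blocked-url-regex.txt"
http_access deny badurls

# Fast: dstdomain lookups use a splay tree, so prefer this wherever a
# whole-domain match is all you actually need.
acl baddoms dstdomain "/etc/squid/blocked-domains.txt"
http_access deny baddoms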

I'll write up an article explaining how to work around these behaviours if enough people ask me nicely. :)

Wednesday, December 16, 2009

Why would more than 10,000 URLs be a problem?

I'm going to preface this (and all other censorship/filtering related posts) with a disclaimer:

I believe that mandatory censorship and filtering is wrong, inappropriate and risky.

That said, I'd like others to better understand the various technical issues behind implementing a filter. My hope is that people begin to engage with the actual technical issues rather than simply re-stating others' potentially misguided opinions.

The "10,000 URL" limit is an interesting one. Since the report doesn't mention the specifics behind this view, and I can't find anything about it in my simple web searching, I'm going to make a stab in the dark.

Many people who implement filters using open source methods such as Squid will typically implement them as a check against a list of URLs. This searching can be implemented via two main methods:
  1. Building a list of matches (regular expressions, exact-match strings, etc) which is compared against; and
  2. Building a tree/hash/etc to match against in one pass.
Squid implements the former for regular expression matching and the latter for dstdomain/IP address matching.

What this unfortunately means is that full-URL matching with regular expressions depends not only on the complexity of the regular expressions, but also on the number of entries - each entry in the list is checked in turn.

So when Squid (and similar) software is used to filter a large set of URLs, and regular expressions are used to match against, it is quite possible that there will be a limitation on how many URLs can be included before performance degrades.

So, how would one work around it?

It is possible to combine regular expression matches into one larger rule, versus checking against many smaller ones. Technical details - instead of /a/, /b/, /c/; one may use /(a|b|c)/. Unfortunately not all regular expression libraries handle very long regular expressions, so for portability reasons this is not always done.
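
A tiny illustration using the POSIX regex API (placeholder patterns; real code would compile the combined expression once at configuration time and reuse it):

#include <regex.h>

static int
url_is_blocked(const char *url)
{
    regex_t re;
    int hit;

    /* One combined pattern means one pass over the URL, rather than one
     * regexec() call per individual pattern. */
    if (regcomp(&re, "(foo|bar|baz)", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    hit = (regexec(&re, url, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}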

Squid at least doesn't make it easy to match on the full URL without using regular expressions. Exact-match and glob-style matching (eg, http://foo.com/path/to/file/*) would work very nicely. (I also should write that for Squid/Lusca at some point.)

A google "SafeSearch" type methodology may be used to avoid the use of regular expressions. This normalises the URL, breaks it up into parts, creates MD5 hashes for each part and compares them in turn to a large database of MD5 hashes. This provides a method of distributing the filtering list without specifically providing the clear-text list of URLs and it turns all of the lookups into simple MD5 comparisons. The downside is the filtering is a lot less powerful than regular expressions.

To wrap up, I'm specifically not discussing the effectiveness of URL matching and these kinds of rules in building filters. That is a completely different subject - one which will typically end with "it's an arms race; we'll never really win it." The point is that it is possible to filter requests against a list of URLs and regular expressions much, much greater than a low arbitrary limit.

Wednesday, September 30, 2009

Lusca updates - September 2009

Just a few Lusca related updates!

  • All of the Cacheboy CDN nodes are running Lusca-HEAD now and are nice and stable.
  • I've deployed Lusca at a few customer sites and again, it is nice and stable.
  • The rebuild logic changes are, for the most part, nice and stable. There seems to be some weirdness with 32 vs 64 bit compilation options which I need to suss out but everything "just works" if you compile Lusca with large file/large cache file support regardless of the platform you're using. I may make that the default option.
  • I've got a couple of small coding projects to introduce a couple of small new features to Lusca - more on those when they're done!
  • Finally, I'm going to be migrating some more of the internal code over to use the sqinet_t type in preparation for IPv4/IPv6 agnostic support.
Stay Tuned!

Sunday, August 16, 2009

Squid-3 isn't a rewrite!

G'day,

There seems to be this strange misconception that Squid-3 is a "rewrite" of Squid in C++. I am not sure where this particular little tidbit gets copy/pasted from, but just for the record:

Squid-3 is the continuation of Squid-2.5, made to compile using the GNU C++ compiler. It is not a rewrite.

If Squid-3 -were- a rewrite, and the resultant code -still- ended up as a crappy-performing, bastardised C/C++ hybrid, then I'd have suggested the C++ coders in question need to relearn C++. Luckily for them, the codebase is a hybrid of C and C++ because it did just start as a C codebase, with bits and pieces part-migrated to C++.

Tuesday, July 28, 2009

Updates - rebuild logic, peering and COSS work

I've committed the initial modifications to the storage rebuilding code. The changes mostly live in the AUFS and COSS code - the rest of Lusca isn't affected.

The change pushes the rebuild logic itself into external helpers which simply stream swaplog entries to the main process. Lusca doesn't care how the swaplog entries are generated.

The external helper method is a big boost for AUFS. Each storedir creates a single rebuild helper process which can block on disk IO without blocking anything else. The original code in Squid would do a little disk IO work at a time - which almost always involved blocking the whole process until said disk IO completed.
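
The helper itself can be dead simple. A sketch of the idea - the record format here is a placeholder, not the real swaplog entry:

#include <stdio.h>
#include <stdlib.h>

struct swaplog_record {
    char data[64];                 /* placeholder for the real swaplog entry */
};

int
main(int argc, char *argv[])
{
    struct swaplog_record rec;
    FILE *log;

    if (argc < 2 || (log = fopen(argv[1], "rb")) == NULL)
        return EXIT_FAILURE;
    /* Block on disk IO as much as we like - only this helper stalls; the
     * main process just consumes entries from our stdout as they arrive. */
    while (fread(&rec, sizeof(rec), 1, log) == 1) {
        if (fwrite(&rec, sizeof(rec), 1, stdout) != 1)
            break;
    }
    fclose(log);
    return EXIT_SUCCESS;
}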

The main motivation of this work was the removal of a lot of really horrible, twisty code and further modularisation of the codebase. The speedups to the rebuild process are a nice side-effect. The next big improvement will be sorting out how the swap logs are written. Fixing that will be key to allowing enormous caches to properly function without log rotation potentially destroying the proxy service.

Tuesday, March 31, 2009

lusca release - rev 13894

I've just put the latest Lusca-HEAD release up for download on the downloads page. This is the version which is currently running on the busiest Cacheboy CDN nodes (> 200mbit each) with plenty of resources to spare.

The major changes from Lusca-1.0 (and Squid-2 / Squid-3, before that):

  • The memory pools code has been gutted so it now acts as a statistics-keeping wrapper around malloc() rather than trying to cache memory allocations; this is in preparation for finding and fixing the worst memory users in the codebase!
  • The addition of reference counted buffers and some support framework has appeared!
  • The server-side code has been reorganised somewhat in preparation for copy-free data flow from the server to the store (src/http.c)
  • The asynchronous disk IO code has been extracted out from the AUFS codebase and turned into its own (mostly - one external variable left..) standalone library - it should be reusable by other parts of Lusca now
  • Some more performance work across the board
  • Code reorganisation and tidying up in preparation for further IPv6 integration (which was mostly completed in another branch, but I decided it moved along too quickly and caused some stability issues I wasn't willing to keep in Lusca for now..)
  • More code has been shuffled into separate libraries (especially libhttp/ - the HTTP code library) in preparation for some widescale performance changes.
  • Plenty more headerdoc-based code documentation!
  • Support for FreeBSD-current full transparent interception and Linux TPROXY-4 based full transparent interception
The next few weeks should be interesting. I'll post a TODO list once I'm back in Australia.

Monday, February 23, 2009

Lusca and BGP, take 2.

I've ironed out the crash kinks (the rest of the "kinks" are in the BGP FSM implementation); thus I'm left with:

1235459412.856 17063 118.92.109.x TCP_REFRESH_HIT/206 33405 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS7657
1235459417.194 1113 202.150.98.x TCP_HIT/200 45637 GET http://videolan.cdn.cacheboy.net/vlc/0.9.8a/win32/vlc-0.9.8a-win32.exe - NONE/- application/x-msdownload AS17746

Notice how the Squid logs have AS numbers in them? :)

Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, let's find out all of the time spent in cacheboy-head, per symbol, with cumulative time, but only showing symbols taking 1% or more of CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples cum. samples % cum. % image name symbol name
2100394 2100394 6.9315 6.9315 libc-2.3.6.so memcpy
674036 2774430 2.2244 9.1558 libc-2.3.6.so vfprintf
657729 3432159 2.1706 11.3264 squid memPoolAlloc
463901 3896060 1.5309 12.8573 libc-2.3.6.so _int_malloc
453978 4350038 1.4982 14.3555 libc-2.3.6.so strncasecmp
442439 4792477 1.4601 15.8156 libc-2.3.6.so re_search_internal
438752 5231229 1.4479 17.2635 squid comm_select
423196 5654425 1.3966 18.6601 squid memPoolFree
418949 6073374 1.3826 20.0426 squid stackPop
412394 6485768 1.3609 21.4036 squid httpHeaderIdByName
402709 6888477 1.3290 22.7325 libc-2.3.6.so strtok
364201 7252678 1.2019 23.9344 squid httpHeaderClean
359257 7611935 1.1856 25.1200 squid statHistBin
343628 7955563 1.1340 26.2540 squid SQUID_MD5Transform
330128 8285691 1.0894 27.3434 libc-2.3.6.so memset
323962 8609653 1.0691 28.4125 libc-2.3.6.so memchr

Ok, that's sort of useful. What's unfortunate is that there are, uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so that's a bit annoying. 16 symbols take ~28% of the CPU time, but the other 569-odd take the remaining ~72%. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Let's look now at what memcpy() is doing, just to get an idea of what needs to be changed:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
-------------------------------------------------------------------------------
28133 1.3394 squid storeSwapOut
31515 1.5004 squid stringInit
32619 1.5530 squid httpBuildRequestPrefix
54237 2.5822 squid strListAddStr
54322 2.5863 squid storeSwapMetaBuild
80047 3.8110 squid clientKeepaliveNextRequest
171738 8.1765 squid httpHeaderEntryParseCreate
211091 10.0501 squid httpHeaderEntryPackInto
318793 15.1778 squid stringDup
1022812 48.6962 squid storeAppend
2100394 100.000 libc-2.3.6.so memcpy
2100394 100.000 libc-2.3.6.so memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend(), followed by stringDup() and httpHeaderEntryPackInto(). Ok, those are what I'm going to be working on eliminating next anyway, so it's not a big deal. This means I'll eliminate ~73% of the memcpy() CPU time, which is 73% of 7%, so around 5% of total CPU time. Not too shabby. There'll be some overheads introduced by how it's done (referenced buffer management) but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so they should drop off a bit.

But this stuff is still just micro-optimisation. What I need is an idea of which code -paths- are taking up precious CPU time, and thus what I should consider reimplementing first. Let's use the "-t" option on non-top-level symbols. To start with, let's look at the two top-level "read" functions, which generally lead to some kind of other processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
87536 4.7189 clientKeepaliveNextRequest
1758418 94.7925 comm_select
88441 100.000 clientReadRequest
2121926 86.3731 clientTryParseRequest
88441 3.6000 clientReadRequest [self]
52951 2.1554 commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
3962448 99.7463 comm_select
163081 100.000 httpReadReply
2781096 53.2193 httpAppendBody
1857597 35.5471 httpProcessReplyHeader
163081 3.1207 httpReadReply [self]
57084 1.0924 memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since it's just the comm routine :) but which functions this routine is calling. The next trick, of course, is to try and figure out which of these paths are taking a noticeable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

Monday, January 19, 2009

Eliminating copies, or "god this code is horrible"

I've been (slowlyish!) unwinding some of the evil horridness that exists in the src/http.c code which handles reading data from upstream servers/caches, parsing it, and throwing it into the store.

There are two annoying memory copies, as I've said before - one is a copy of the incoming data into a MemBuf, used -just- to assemble the full response headers for parsing, and the other (well, other two) are for appending the data coming in from the network into the memory store, on its way to the client-side code to be sent back to the client.

Now, as I've said before, the src/http.c code isn't all that long and complicated (by far most of the logic actually happens in the forward and client-side routines; the http.c routines do very little besides pump data back into the memory store) but unfortunately enough various layers of logic are mashed together to make things, uhm, "very difficult" to work on separately.

Anyway, back on track. I've mostly pulled apart the code which handles reading the reply and parsing the response headers, and I've eliminated the first copy. The data is now read directly into a MemBuf, which serves as both the incoming buffer (which gets appended to) for the reply status line + headers, _AND_ the incoming buffer for HTTP body data (which never gets appended to - it is written out to the memory store and then reset back to empty.)

So the good news now is that the number one place for L2 loads, L2 stores and CPU cycles spent unhalted (as measured on my P3 667MHz Celeron test box - nice and slow, to expose all those stupid inefficiencies modern CPUs try to cover up :) comes from the memcpy() from src/http.c -> { header parsing (12%), http body appending (84%) } -> storeAppend().

This means one main thing - if I can eliminate the copying into the store, and instead read directly into variable-sized pages (which is unfortunately the bloody tricky part) which are then handed in their entirety to the memory store, that last memcpy() will be eliminated, along with hopefully a good 10+% of CPU time on this P3.

After that, it's fixing the various uses of *printf() functions in the critical path, which absolutely should be avoided. I've got some basic patches to begin replacing some of the really STUPID uses of those. I'll begin committing the really obviously easy ones to Cacheboy HEAD once I've verified they don't break anything (in particular, SNMP indexes of all things..)

Once the two above are done, which account for a good 15-20% of the current CPU use in Cacheboy (at least in my small-objects, memory-cache-only test load on the above hardware), I'll absolutely stop adding any and all new changes, features, optimisations, etc, and go -straight- to "make everything stable" mode again.

There's still so much that needs doing (proper refcounted buffers and strings, comm library functions which properly implement readv() and writev() so I can do things like write out the entire request/reply using vector operations and avoid the other bits of copying which go on, lessening the load on the memory allocator by actually efficiently packing structures, rewriting the http request/reply handling in preparation for replacement HTTP client/server modules, oh and IPv6/threading!) but that will come later.

Sunday, January 18, 2009

Tidying up the http reply handling code..

One of the unfortunate parts of the Squid codebase is that the HTTP request and reply handling code is mixed in with the client and server code, and contains both stuff specific to a cache (eg, looking for headers to control cache behaviour) and connection stuff (eg Transfer-Encoding, keepalive, etc.)

My long-term goal is to finally separate all of this mess out so there's "generic" routines to be a HTTP client and server, create requests/replies and parse responses. But for now, tidying up some of the messy code to improve performance (and thus give people motivation to migrate their busy sites to Cacheboy) is on my short-term TODO list.

I spent some time ~ 18 months ago tidying up all of the client-side code so the request line and request header parsing didn't require half a dozen copies of various things just to complete. That was quite successful. The code structure is still horrible, but it works, and that for now is absolutely the most important part.

Now I'm doing something similar to the server-side code. The HTTP server code (src/http.c) combines reply buffer appending, parsing, 100-continue response handling (well, "handling") and the various header checks for caching and connections in one enormous puddle of code. I'm trying to tease these apart so each part is done separately and the reply data isn't double-copied - once into the reply buffer, then once via storeAppend() into the memory store.

The CPU time spent doing this copying isn't all that high on current systems but it is definitely noticeable (~30% of all CPU time spent in memcpy()) for slower systems talking to LAN-connected servers. So I'm going to do it - primarily to fix performance on slower hardware, but it also forces me to tidy up the existing code somewhat.

The next step is avoiding the copy into the memory store entirely, removing another 65% or so of memcpy() CPU time.

Friday, January 16, 2009

Refcounted string buffers!

Those of you who have been watching may have noticed a few String tidyups going into CACHEBOY_HEAD recently (one of which caused a bug in the first cacheboy-1.6 stable release that made it very non-stable!)

This is all in preparation for more sensible string and buffer handling. Unfortunately the Cacheboy codebase inherited a lot of dirty string handling and it needed some house cleaning before I could look towards the future.

Well, the future is here now (well, in /svn/branches/CACHEBOY_HEAD_strref ...) - I brought in my refcounted buffer routines from my previous attempts at all of this and converted String.[ch] over to use them.

For now, the refcounted string implementation doubles the malloc overhead for new strings (since it has to create a small buf_t and a string buffer) but stringDup() becomes essentially free. Since in a lot of cases, the stringDup() occurs when copying string headers and basically leaving them alone, this saves on a bunch of memory copying.
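
The core of the idea, as a minimal sketch (this is not the actual buf_t implementation - no error handling, no offset/length views):

#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data;
    size_t len;
    int refcount;
} buf_t;

static buf_t *
buf_create(const char *src, size_t len)
{
    buf_t *b = malloc(sizeof(*b));   /* the extra per-string allocation */
    b->data = malloc(len);
    memcpy(b->data, src, len);
    b->len = len;
    b->refcount = 1;
    return b;
}

static buf_t *
buf_ref(buf_t *b)
{
    b->refcount++;                   /* a "dup" is now just this increment */
    return b;
}

static void
buf_deref(buf_t *b)
{
    if (--b->refcount == 0) {
        free(b->data);
        free(b);
    }
}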

Decent performance benefits will only come with a whole lot of work:
  • Remove all of the current assumptions in code which uses String that the actual backing buffer (accessible via strBuf()) is NUL-terminated;
  • Rewrite sections of the code which go between String and C string buffers (with copying, etc) to use String where applicable. Unfortunately a whole lot of the original client_side.c code which handles parsing the request involves a fair bit of crap - so..
  • .. writing replacement request and reply HTTP parsers is probably the next thing to do;
  • Shuffling around the client-side code and the http code to use a buf_t as an incoming socket buffer, instead of how they currently do things (in an ugly way..)
  • Propagate down the incoming socket buffer to the request/reply parsing code, so said code can simply create references to the original socket buffer, bypassing any and all requirement for copying the request/reply data separately.
I'm reasonably excited about the future benefits this code holds, but for now I'm going to remain reasonably conservative and leave the current String improvements where they are. I don't mind if these and the next round of changes to the MemBuf code reduce performance but improve the code; I know that the medium-term goal is going to provide some pretty decent benefits and I want to keep things stable and usable in production whilst I get there.

Next on my list, though: looking at removing the places where *printf() is used in critical sections..

Friday, January 9, 2009

More profiling!

The following info is for 10,000 concurrent keep-alive connections, each just fetching an internal icon object from Squid. This is using my apachebench-adrian package, which can handle such traffic loads.

The below accounts for roughly 60% of total CPU time (ie, 60% of the CPU is spent in userspace) on one core.
With oprofile, it hits around 12,300 transactions a second.

I have much, much hatred for how Squid uses *printf() everywhere. Sigh.




CPU: AMD64 processors, speed 2613.4 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples cum. samples % cum. % image name symbol name
5383709 5383709 4.5316 4.5316 libc-2.6.1.so vfprintf
4025991 9409700 3.3888 7.9203 libc-2.6.1.so memcpy
3673722 13083422 3.0922 11.0126 libc-2.6.1.so _int_malloc
3428362 16511784 2.8857 13.8983 libc-2.6.1.so memset
3306571 19818355 2.7832 16.6815 libc-2.6.1.so malloc_consolidate
2847887 22666242 2.3971 19.0787 squid memPoolFree
2634120 25300362 2.2172 21.2958 libm-2.6.1.so floor
2609922 27910284 2.1968 23.4927 squid memPoolAlloc
2408836 30319120 2.0276 25.5202 libc-2.6.1.so re_search_internal
2296612 32615732 1.9331 27.4534 libc-2.6.1.so strlen
2265816 34881548 1.9072 29.3605 libc-2.6.1.so _int_free
1826493 36708041 1.5374 30.8979 libc-2.6.1.so _IO_default_xsputn
1641986 38350027 1.3821 32.2800 libc-2.6.1.so free
1601997 39952024 1.3484 33.6285 squid httpHeaderGetEntry
1575919 41527943 1.3265 34.9549 libc-2.6.1.so memchr
1466114 42994057 1.2341 36.1890 libc-2.6.1.so re_string_reconstruct
1275377 44269434 1.0735 37.2625 squid clientTryParseRequest
1214714 45484148 1.0225 38.2850 squid httpMsgFindHeadersEnd
1185932 46670080 0.9982 39.2832 squid statHistBin
1170361 47840441 0.9851 40.2683 squid urlCanonicalClean
1169694 49010135 0.9846 41.2529 libc-2.6.1.so strtok
1145933 50156068 0.9646 42.2174 squid comm_select
1128595 51284663 0.9500 43.1674 libc-2.6.1.so __GI_____strtoll_l_internal
1116573 52401236 0.9398 44.1072 squid httpHeaderIdByName
956209 53357445 0.8049 44.9121 squid SQUID_MD5Transform
915844 54273289 0.7709 45.6830 squid memBufAppend
907609 55180898 0.7640 46.4469 squid stringLimitInit
898666 56079564 0.7564 47.2034 libc-2.6.1.so strspn
883282 56962846 0.7435 47.9468 squid urlParse
852875 57815721 0.7179 48.6647 libc-2.6.1.so calloc
819613 58635334 0.6899 49.3546 squid clientWriteComplete
800196 59435530 0.6735 50.0281 squid httpMsgParseRequestLine

Thursday, January 8, 2009

FreeBSD TPROXY works!

The FreeBSD TPROXY support (with a patched FreeBSD kernel for now) works just fine in testing.

I'm going to commit the changes to FreeBSD in the next couple of days. I'll then bring in the TPROXY4 support from Squid-3, and hopefully get functioning TPROXY2, TPROXY4 and FreeBSD TPROXY support into the upcoming Cacheboy-1.6 release.

Wednesday, January 7, 2009

TPROXY support

G'day,

I've done a bit of shuffling in the communication code to include a more modular approach to IP source address spoofing.

There's (currently) untested support for some FreeBSD source IP address spoofing that I'm bringing over courtesy of Julian Elischer; and there's a Linux TPROXY2 module.
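
The platform differences mostly come down to a single socket option, which is what the modules hide. Roughly - and this is from memory, not from the modules themselves (the Linux option shown is the TPROXY4-era one; TPROXY2 worked differently):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

static int
enable_source_spoofing(int fd)
{
    int on = 1;
#if defined(IP_TRANSPARENT)
    /* Linux TPROXY4: permit binding to a non-local (client) address */
    return setsockopt(fd, SOL_IP, IP_TRANSPARENT, &on, sizeof(on));
#elif defined(IP_BINDANY)
    /* FreeBSD: IP_BINDANY allows bind() to a foreign source address */
    return setsockopt(fd, IPPROTO_IP, IP_BINDANY, &on, sizeof(on));
#else
    (void)fd; (void)on;
    return -1;       /* no spoofing support compiled in on this platform */
#endif
}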

I'll look at porting over the TPROXY4 support from Squid-3 in a few days.

I think this release is about as close to "stable" as Cacheboy-1.6 is going to get, so look forward to a "stable" release as soon as the FreeBSD port has been set up.

I already have a list of things to do for Cacheboy-1.7 which should prove to be interesting. Stay tuned..

Sunday, December 28, 2008

Cacheboy-HEAD updates

I've finished cleaning up the bits of the IPv6 work from CACHEBOY_HEAD - it should be just a slightly better structured Squid-2.HEAD / Cacheboy-1.5.

Right now I'm pulling as much of the HTTP related code as I can out of src/ and into libhttp/ before the 1.6 release. I'm hoping to glue together bits and pieces of the HTTP code into a very lightweight (for Squid) HTTP server implementation which can be used to test out various things like thread-safeness. Of course, properly testing thread-safeness in production relies on a lot of the other code being thread-safe - the comm code, the event registration code, the memory allocation code, the debugging and logging code ... aiee, etc. Oh well, I said I wanted to..

I'm also going through and adding some headerDoc comments to various library files. headerDoc (from Apple) is actually rather nice. It lacks -one- feature - the ability to merge multiple files together (say, libsqinet/sqinet.[ch]) into one "module" for documentation. I may look at doing that in some of my spare time.
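
For reference, a headerDoc comment looks something like this (the function here is a made-up example, not the real sqinet API):

/*!
@function sqinet_get_port
@abstract Return the port from an sqinet_t address, in host byte order.
@param s The address to query.
@result The port number, or 0 if the address is unset.
*/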

Saturday, December 27, 2008

Reverting IPv6 for now; moving forward with structural changes

I've been working on the IPv6 support in Cacheboy for a couple months now and I've come to the conclusion that I'm not getting anywhere near as far along the development path as I wanted to be.

So I've taken a rather drastic step - I've branched CACHEBOY_HEAD off from the last point along the main codebase where the non-intrusive IPv6 changes had occurred, and I'm going to pursue Cacheboy-1.6 development from that.

The primary short-term goal with Cacheboy was to restructure the codebase in such a way as to make further development much, much simpler. I sort of lost track with the IPv6 development stuff and I rushed it in when the codebase obviously wasn't ready.

So, the IPv6 changes will stay in the CACHEBOY_PRE branch for now; development will continue in CACHEBOY_HEAD. I'll continue the restructuring work and stability work towards a Cacheboy-1.6 release come January 1. I'll then look at merging over the IPv6 infrastructure work into CACHEBOY_HEAD far before I merge in the client and server related code - specifically, completing the DNS updates, completing the ipcache/fqdncache updates, porting over the IPv6 SNMP changes from Squid-3, and modularising the ACL code in preparation for IPv6'ifying that. The goal is less to IPv6-ify Cacheboy; it's more to tidy up the code to the point where IPv6 becomes trivial.

Saturday, November 22, 2008

Updates!

A few updates!

I've fixed a few bugs in CACHEBOY_PRE which will be back-ported to CACHEBOY_1.5. This is in line with my current goal of stability before features. CACHEBOY_PRE and CACHEBOY_1.5 have passed all the polygraph runs I've been throwing at them and there aren't any outstanding stability issues in the Issue tracker.

I'll roll CACHEBOY_1.6.PRE3 and CACHEBOY_1.5.2 releases in the next day or two and get those out there.