Sunday, March 28, 2010

Today's fun bug: invalid swap metadata

One of the Lusca users has been hitting invalid swap file contents. It's linked to the length of the object URL - and fixing it is unfortunately more difficult than first thought.

In essence - the size of the URL, the size of the swap metadata and the size of the buffer used for reading it back don't match up "right". The TLV encoding code doesn't put an upper limit on the size of the metadata it writes out. The TLV decoding code doesn't enforce a maximum buffer size either - it just keeps parsing until it finds the end of the metadata. All of this unfortunately results in stupid behaviour when overly long URLs are stored.

The current maximum URL length is MAX_URL: 4096 bytes. (Squid-3 may have redefined this to be longer.) The reason I haven't raised it is that the URL is included in the swap metadata - and the metadata is only read back in SM_PAGE_SIZE chunks; again, 4096 bytes. So if the URL is, say, around 4000 bytes long, the total length of the encoded metadata exceeds 4096 bytes. This is happily written out to the swapfile.

Reading the data back in, however, is another story. An SM_PAGE_SIZE sized buffer is created and a read is issued; the incomplete metadata is read in. There's unfortunately no check that the passed-in buffer actually contains all of the metadata - so the code happily tramples through potentially uninitialised memory. The end result is, at the very least, an invalid object which is eventually deleted; it could be worse. I haven't yet investigated.
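
The missing check is easy to sketch. This is illustrative only - hypothetical names and a simplified TLV layout, not Lusca's actual unpacking code - but it shows the shape of a decode loop that refuses to step outside the buffer it was handed:

#include <string.h>

#define STORE_META_END 0    /* hypothetical end-of-metadata marker */

/* Hypothetical per-item handler. */
extern void handle_tlv(char type, const char *value, int len);

/* Walk a (type, length, value) list, validating every step against
 * the buffer bounds. Returns 0 on success, -1 if the metadata is
 * truncated or corrupt - e.g. when only the first SM_PAGE_SIZE bytes
 * of an oversized metadata blob were actually read in. */
static int
swap_meta_unpack(const char *buf, size_t buflen)
{
    const char *p = buf;
    const char *end = buf + buflen;

    while (p < end && *p != STORE_META_END) {
        int len;

        if ((size_t)(end - p) < 1 + sizeof(len))
            return -1;                /* no room for a type + length header */
        char type = *p++;
        memcpy(&len, p, sizeof(len));
        p += sizeof(len);
        if (len < 0 || len > end - p)
            return -1;                /* value runs past the buffer: truncated */
        handle_tlv(type, p, len);
        p += len;
    }
    return p < end ? 0 : -1;          /* -1 if we never saw the end marker */
}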

In any case - I'm now going to have to enforce some sensible behaviour here. I'd much prefer to make the code handle arbitrarily (ie, configurably) long URLs and read in the metadata as needed - but that's a bigger issue which will take some further refactoring and redesigning to solve.

This is all another example of how code works "by magic", rather than "by intent". :)

Thursday, March 25, 2010

Lusca update: important bug fixes, portability work

I've finally found and fixed two annoying bugs in Lusca.

The first is that the rebuild process occasionally failed to properly rebuild the cache contents, so the proxy would start with an empty cache. This ended up being due to the undocumented IPC code in Lusca, inherited from Squid, which "eats" a few bytes at the beginning of the child process's lifetime to establish that the child is active and ready. It tries to read a hello string ("hi there!\n\0") but it reads up to 32 bytes instead of the exact string length. If the child process starts sending data straight away (as my store rebuild helpers do!) the initial part of its conversation could be eaten.

The real fix is to modify the helper IPC handshake so it's two-way and fixed-length, rather than the hacky way it's done now. But the temporary workaround seems to work fine: just read exactly the 11 bytes of the hello string, rather than up to 32 bytes.
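
The workaround looks something like this - a minimal sketch with hypothetical names (the real IPC code is structured differently) - reading exactly the hello string's length so anything the child sends afterwards stays in the pipe:

#include <string.h>
#include <unistd.h>

/* Read exactly the hello string - no more - so a child that starts
 * talking immediately after the hello doesn't have its output eaten. */
static int
ipc_read_hello(int child_fd)
{
    static const char hello[] = "hi there!\n";  /* 10 chars + NUL = 11 bytes */
    char buf[sizeof(hello)];
    size_t got = 0;

    while (got < sizeof(hello)) {
        ssize_t r = read(child_fd, buf + got, sizeof(hello) - got);
        if (r <= 0)
            return -1;                          /* EOF or error: child not ready */
        got += (size_t)r;
    }
    return memcmp(buf, hello, sizeof(hello)) == 0 ? 0 : -1;
}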

The second was a weird situation involving the swap.state files in the UFS store dirs growing to enormous sizes, filling the cache_dir up. This was eventually traced back to the proxy being reconfigured once a minute on some deployments (in this case - pfsense!). The problem was quite simple in the end: a reconfigure would force the swap state logs to be closed and re-opened, but the re-open didn't know whether the logs were currently pointing at the live files or the rebuild-time ones. During a rebuild, the rebuild process reads from swap.state while new entries are appended to swap.state.new; a reconfigure would close the swap.state.new file and start appending to swap.state instead. That meant new entries were being appended to the very swap.state file the rebuild was still reading from - so it never quite reached the end of the file, and the rebuild never finished.
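
The fix is for the re-open path to check the rebuild state. A minimal sketch, with hypothetical names and with the paths simplified (the real code builds them from the cache_dir root):

/* On reconfigure, reopen whichever swap log this store dir is
 * actually using right now - hypothetical names throughout. */
static void
storeSwapLogReopen(SwapDir *sd)
{
    file_close(sd->swaplog_fd);
    if (sd->rebuilding) {
        /* Rebuild in progress: keep appending to swap.state.new so the
         * rebuild's read of swap.state sees a stable, finite file. */
        sd->swaplog_fd = file_open("swap.state.new", O_WRONLY | O_APPEND | O_CREAT);
    } else {
        sd->swaplog_fd = file_open("swap.state", O_WRONLY | O_APPEND | O_CREAT);
    }
}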

Those have been fixed for the *NIX ports and the rebuild process now seems to be moving forward quite swimmingly!

I've also had a user pop up recently who is submitting portability fixes for Cygwin (and under Windows 7, no less!). I've been committing fixes to the way header files are included so the code now mostly compiles under both *NIX and Cygwin. Admittedly, most of those portability problems were my fault - I didn't really bother making my new code "autoconf/automake safe" - but thankfully revisiting that choice isn't difficult. Thank you for the continuing assistance with the Cygwin changes!

Finally, there are still 46 active issues in the issue tracker. Some are just placeholders for me to revisit certain bits of code (eg Issue #85 - figuring out how PURGE works!) but I'm still aiming to address, repair and close the majority of them - and release a few more stable snapshots! - before I begin pulling more development work into the main tree.

Not that the snapshots at the moment are unstable - far from it!

Saturday, March 20, 2010

Lusca and HTTP parsing..

I've broken out most of the client-side request parsing code into a separate source file. There's now only a single entry and exit point (and one sideways jump for a comm timeout handler for now) from the client-side main code into the request handling code.

The main aim here is to eventually isolate and rework the whole process by which an HTTP request is parsed and constructed in memory. The process itself isn't all that CPU intensive these days compared to Squid-2.x and Squid-3.x, but it is quite ugly. I won't go into the gory details - you can check it out for yourself if you like. Just cast your eyes over src/client_side_request_parser.c in LUSCA_HEAD.

I'm going to leave the request parsing code as it is for now. It's ugly but it works and its inefficiencies are dwarfed by the misuses of memory bandwidth/CPU cycles elsewhere.

I think I've left the codebase slightly easier to understand than before, and I'm now at the point where I can document a large part of the client-side request and reply handling pipeline. The caching, Vary and ETag processing is still very messy and too tightly integrated into the client-side code for my liking, but as it also works fine for now, I'll be leaving it well alone. There be dragons and all of that.

Monday, March 15, 2010

Lusca Logging, revisited

So the obvious thing to do is to not run the logging path at all, and to avoid evaluating the access control list(s) if there's no access control set defined for logging.

There are a couple of problems with this!

Firstly, the current behaviour is to not pass request/reply information through some of the statistics counters and the clientdb modules if the request doesn't match the logging ACL. If there's no logging ACL then all requests are kicked to the statistics/clientdb counters.

Secondly, there's also the ACLs used in the access_log directives which allow the administrator to filter which requests/replies are logged to which access log file(s). The current implementation will pass all requests which aren't denied by the top-level logging ACL through to each of the access_log entries, checking those for suitability.

The question is: can I mostly maintain the existing behaviour for the common use cases? The main two are:
  1. logging completely disabled (via "access_log none"), while still maintaining the client counters and the clientdb;
  2. logging enabled to one access_log entry, again maintaining the client counters and the clientdb.
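
Concretely, the shape I'm after is something like this minimal sketch - hypothetical names, not the actual Lusca code:

/* Sketch of the intended control flow (illustrative names only). */
void
clientLogRequest(struct client_request *http)
{
    /* Both use cases: the statistics counters and the clientdb are
     * always updated, whether or not the request gets logged. */
    stat_counter_update(http);
    clientdb_update(http);

    /* Fast path: logging disabled ("access_log none") and no
     * log_access ACL defined - skip the ACL checklist setup and
     * teardown entirely. */
    if (log_access_acl == NULL && num_access_logs == 0)
        return;

    /* Slow path: build the checklist once, evaluate log_access, then
     * walk the access_log entries and write the ones that match. */
    write_matching_access_logs(http);
}
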
Hm, stuff to do, stuff to do..

Sunday, March 14, 2010

Improving the access logging code

The Squid/Lusca access logging logic is .. well, to be honest, it's quite dirty and inefficient. Yes, like most of Squid. But this is a somewhat bite-sized piece of the puzzle which I can tidy up in one corner without too many effects elsewhere.. or is it?

For high request rate, small object loads on slower FSB hardware, one of the largest CPU users is actually the client-side access log path. There are two culprits here:
  1. memcpy() when copying the "hier" struct from the request struct to the access log entry struct (client_side.c:297; revision 14457); and
  2. The ACL list setup in preparation for writing filtered access log entries to multiple files (client_side.c:314; revision 14457).
The memcpy() is of a 588-byte HierarchyLogEntry struct - it's that large because of two SQUIDHOSTNAMELEN (256-byte) strings embedded in the struct itself. Annoying, but somewhat fixable with a little code refactoring and the use of reference-counted strings.
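
The shape of the problem and of the fix look roughly like this - an illustrative layout, not the real struct, which has more members:

#define SQUIDHOSTNAMELEN 256

/* Today: two fixed 256-byte buffers dominate the struct, and all
 * ~588 bytes get memcpy()ed into the access log entry per request. */
struct HierarchyLogEntry_now {
    char host[SQUIDHOSTNAMELEN];
    char cd_host[SQUIDHOSTNAMELEN];
    /* ... the remaining ~76 bytes of codes, counters, timestamps ... */
};

/* After refactoring: reference-counted strings turn the copy into two
 * pointer assignments plus refcount bumps, instead of 512 bytes of
 * hostname buffers. ("refstr_t" is a hypothetical type.) */
struct HierarchyLogEntry_refcounted {
    refstr_t *host;
    refstr_t *cd_host;
    /* ... */
};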

The ACL list setup is a bit more problematic. It sets up an ACL checklist based on the http_access checklist, then checks it against log_access / acl_access. It may then reuse this checklist when evaluating the various access_log lines, to allow filtering certain access log entries to certain files.

Now, the latter bites because the really slow paths (setting up and destroying the ACL access state) are exercised even when the default logging configuration - one log file - is used, and there's currently no easy way around that. Furthermore, if you disable logging entirely (access_log none), the access log entry information is still set up and the ACL checklist still created, and then it's all tossed away without logging anything. A bit stupid, eh?

I'll cover what I'm doing about it in the codebase in a subsequent post (read: when I'm not running off to class.)

Saturday, March 13, 2010

Currently reading: "The Science of Programming" by David Gries

I've been acquiring older computer science textbooks at various second-hand book sales over the last few years, and occasionally I'll pick up what seems like a gem that shouldn't have fallen by the wayside in the name of progress.

Today's gem is "The Science of Programming" by David Gries. It's a book first published in the late 1970s which I'm guessing was one of the earlier attempts at a not-quite-academic publication formalising some concepts of program design, provability and deductive reasoning. It does this for a variety of algorithms in a framework which - and this is a useful point - doesn't involve an object-oriented language, a functional language, or anything else that hides from the programmer most of what's going on under the hood. Not that I think those are bad things, but I do think that being taught how things are done closest to how they actually run is a good idea. Starting with OO seems to produce programmers who, well, don't seem to have much of a clue about reality and rely on their tools a lot more than they should.

I'm currently re-reading it with a notebook at hand. Luckily for me, the first few chapters involve propositional/predicate logic and deduction stuff which overlaps nicely with my Semantics course in Linguistics. So it's "almost" related to my university degree. Sort of.

Tuesday, March 9, 2010

Some more Lusca Profiling..

More Lusca profiling of a simple test workload:

samples % image name symbol name
234266 6.7052 libc-2.3.6.so _int_malloc
138031 3.9507 libc-2.3.6.so vfprintf
111831 3.2008 libc-2.3.6.so calloc
104393 2.9879 libc-2.3.6.so malloc_consolidate
98984 2.8331 libc-2.3.6.so memcpy
91783 2.6270 libc-2.3.6.so _int_free
72578 2.0773 libc-2.3.6.so memset
69068 1.9769 libc-2.3.6.so free
50473 1.4446 squid clientTryParseRequest
50064 1.4329 squid memPoolAlloc
45211 1.2940 libc-2.3.6.so re_search_internal
40469 1.1583 squid httpRequestFree
39227 1.1228 squid statHistBin
38916 1.1139 squid comm_select
36974 1.0583 libc-2.3.6.so _IO_default_xsputn
36260 1.0378 squid memPoolFree

.. CPU is still taken up in the standard areas: memory allocation, stdio, and memcpy.

memcpy() I can't tackle right now. malloc() and friends I can tackle, but that may cause strange bugs and I'm not willing to commit anything which upsets stability too much right now. But vfprintf() is fun. Here's the callgraph for vfprintf() - the lines above the 100% entry are its callers:

51285 22.7881 libc-2.3.6.so vsprintf
168714 74.9667 libc-2.3.6.so vsnprintf
156434 100.000 libc-2.3.6.so vfprintf

.. so most of the CPU attributed to vfprintf() arrives via vsnprintf(). Tracing vsnprintf()'s callers:

2549 1.0775 squid httpHeaderPutStrf
78253 33.0790 squid memBufVPrintf
153322 64.8121 libc-2.3.6.so snprintf
13601 100.000 libc-2.3.6.so vsnprintf

.. and memBufVPrintf()'s callers:

33231 36.8652 httpHeaderPutStrf
55577 61.6549 packerPrintf
7225 100.000 memBufVPrintf

.. and snprintf()'s callers:

5021 2.8271 squid clientAccessCheckDone
31897 17.9597 squid clientSendHeaders
53339 30.0327 squid xitoa
85548 48.1681 squid urlCanonical
7870 100.000 libc-2.3.6.so snprintf

.. so it's likely that eliminating these trips through the printf() machinery to do "stuff" like assembling the URL and request/reply strings will shave about 4% off this workload. But the biggest issue right now is the stupidly large amount of CPU being used in the memory allocation routines.
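
For a concrete idea of what "eliminating the printf() calls" means, here's a hedged sketch - not the actual Lusca change - of assembling a URL by appending the pieces directly, instead of paying for format-string parsing on every request:

#include <string.h>

/* Equivalent of snprintf(buf, len, "%s://%s%s", scheme, host, path)
 * without the format-string interpreter. Returns the length written,
 * or 0 if the caller's buffer is too small. */
static size_t
url_assemble(char *buf, size_t len, const char *scheme,
             const char *host, const char *path)
{
    size_t sl = strlen(scheme), hl = strlen(host), pl = strlen(path);
    size_t n = 0;

    if (sl + 3 + hl + pl + 1 > len)
        return 0;
    memcpy(buf + n, scheme, sl); n += sl;
    memcpy(buf + n, "://", 3);   n += 3;
    memcpy(buf + n, host, hl);   n += hl;
    memcpy(buf + n, path, pl);   n += pl;
    buf[n] = '\0';
    return n;
}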

But the annoying one is memcpy():

10557 7.3665 squid httpAccept
20779 14.4992 squid stringDup
37010 25.8250 squid httpHeaderEntryPackInto
47882 33.4113 squid connStateFree
143177 100.000 libc-2.3.6.so memcpy

.. there's no memcpy() call in connStateFree. Which means I need to go hunting to figure out what's going on. Grr.

Monday, March 8, 2010

Why are some Squid/Lusca ACL types slower than others? And which ones?

This post should likely be part of the documentation!

One thing which hasn't really been documented is the relative speed of each of the Squid/Lusca ACL types. This is important to know if you're administering a large Squid/Lusca install - it's entirely possible for your site's performance to be massively impacted by the wrong ACL setup.

Firstly - the types themselves:
  1. The splay tree based checks are likely the fastest - src, dst, myip, dstdomain, srcdomain.
  2. The wordlist checks are linear, but move hits back to the top of the wordlist to try to speed up lookups of the most popular items - portname, method, snmp community, urlgroup, hiercode.
  3. The regular expression checks are also linear, and also reshuffle the list based on the most popular items - url regex, path regex, source/destination domain regex, request/reply mime type.
Now the exceptions! Some ACL types may require blocking DNS lookups before they can match - eg "dst" (a forward lookup of the destination hostname) and "srcdom_regex"/"dstdom_regex" (reverse lookups to turn an IP address back into a domain name).

A lot of places will simply use URL regular expression ACLs ("url_regex") to filter/forward requests. Unfortunately these scale poorly under high load, and they're almost always the reason a busy proxy server is pegged at 100% CPU.
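
For example, if a regex is only ever matching on the hostname part of the URL, a dstdomain ACL does the same job with a splay tree lookup instead of a linear walk over compiled regexes. A hedged squid.conf sketch (the ACL names are made up):

# Slow: a linear list of regexes, each evaluated per request
acl blocked_re url_regex -i \.example\.com
http_access deny blocked_re

# Fast: a splay tree lookup on the request's hostname
# (".example.com" matches example.com and all its subdomains)
acl blocked_dom dstdomain .example.com
http_access deny blocked_dom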

I'll write up an article explaining how to work around these behaviours if enough people ask me nicely. :)