OProfile is cool - it lets you dig into where your CPU is being spent. But aggregating statistics can be aggravating. (Yes yes, it was a bad pun, I know..)
Take this example from cacheboy:
CPU: Core 2, speed 2194.48 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name symbol name
216049 6.5469 libc-2.7.so memcpy
115581 3.5024 libc-2.7.so _int_malloc
103345 3.1316 libc-2.7.so vfprintf
85197 2.5817 squid memPoolAlloc
64652 1.9591 libc-2.7.so memchr
60720 1.8400 libc-2.7.so strlen
...
Now, these tell you which functions the CPU time is being spent in (which is great) but it's not the entire picture. The trouble is this: there are 527 functions in the top-level list, and 25 of them account for one percent or more of total runtime each. Those top 25 account for ~45% of the total CPU time - so the other 55% is being spent in the 502 functions remaining.
You may now ask yourself what the problem with that is - just optimise those top 25 functions and you'll be fine. Unfortunately, those top 25 functions aren't being called from one place - they're being called all over the shop.
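This is where OProfile's callgraph mode comes in. For reference, a callgraph profile like the listing below is gathered roughly along these lines (the depth and paths here are illustrative, not taken from this post):

opcontrol --no-vmlinux --setup --callgraph=16
opcontrol --start
# ... run the load test against squid ...
opcontrol --stop; opcontrol --dump
opreport -l ./squid     # flat, per-symbol profile (first listing)
opreport -cl ./squid    # callgraph: callers/callees per symbol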
Here's an example. Notice the strlen time:
496 13.7816 squid httpRequestFree
773 21.4782 squid httpHeaderPutStrf
9518 0.3432 libc-2.7.so vsnprintf
85433 55.6846 libc-2.7.so vfprintf
18212 11.8704 libc-2.7.so strchrnul
16037 10.4528 libc-2.7.so _IO_default_xsputn
13351 8.7021 libc-2.7.so _itoa_word
10872 7.0863 libc-2.7.so strlen
9518 6.2038 libc-2.7.so vsnprintf [self]
...
Note that the CPU times above "vsnprintf" are from the functions which call it, and the CPU times below "vsnprintf" are from the calls which it makes. From the top-level trace it's not immediately obvious that I have to optimise "vsnprintf" calls, as most of the *printf() calls end up being to "vsnprintf" (which shows up at only 0.3% of CPU time there) rather than "vfprintf" and friends.
It's obvious here that finding the places which call the *printf() functions in performance-critical code - and then exorcising them - will probably help quite a bit.
What about the rest of the 500-odd functions? What I'd like to do is build aggregates of CPU time spent in each function, including its called functions, and figure out which execution stacks are chewing the most CPU. That's something to do after Cacheboy-1 is stable, and then only after my June exams.
The important thing here is that I have the data to figure out where Squid does things poorly, and given enough time, I'm going to start fixing them in the next Cacheboy release.
Sunday, April 27, 2008
Wednesday, April 23, 2008
Solaris Event Ports for Network IO
What do I do at midnight to try and relax?
Figure out how to make Solaris Event Ports work for Network IO.
It took me a while to realise that "num" needs to be initialised with the minimum number of events you'd like to wait for before port_getn() returns. I haven't any idea yet whether this will restrict the returned event count to 1 or whether it will grow to MAX - that will need further testing. It's enough to handle single requests though, so it's a start!
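Here's a minimal sketch of the pattern (not the actual cacheboy code - just the shape of it, polling stdin purely for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <poll.h>
#include <port.h>

#define MAX_EVENTS 128

int
main(void)
{
    port_event_t events[MAX_EVENTS];
    uint_t num;
    int port, i;
    int fd = 0;    /* stdin, purely for illustration */

    if ((port = port_create()) < 0) {
        perror("port_create");
        exit(1);
    }

    /* Express interest in readability on fd. Associations are
       one-shot: they must be re-registered after each event fires. */
    if (port_associate(port, PORT_SOURCE_FD, (uintptr_t) fd,
            POLLIN, NULL) < 0) {
        perror("port_associate");
        exit(1);
    }

    /* "num" is in/out: on entry, the minimum number of events to
       wait for before port_getn() returns; on return, the number
       of events actually fetched (up to MAX_EVENTS). */
    num = 1;
    if (port_getn(port, events, MAX_EVENTS, &num, NULL) < 0) {
        perror("port_getn");
        exit(1);
    }

    for (i = 0; i < (int) num; i++)
        printf("fd %d ready: events 0x%x\n",
            (int) events[i].portev_object, events[i].portev_events);

    return 0;
}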
Sunday, April 20, 2008
Knowing what your allocator is doing..
I committed a change a few years ago which collapsed the mem_node struct + buffer into one structure. This cut out quite a high volume of allocator requests, but it made the structure slightly larger than 4k.
Modern malloc implementations (and it's possible earlier ones circa 2001 did too; remember, I was only 21 then!) have a separation between "small" and "large" (and "huge"!) objects. Small objects (say, under a page in size) will generally go into a pool of just that object size. Large objects (from, say, page size up to something larger, like a megabyte) will be allocated as a multiple of pages.
This unfortunately means that my 4096 + 12 byte structure may suddenly take 8192 bytes of RAM! Oops.
I decided to test this out. This is what happens when you do that with FreeBSD's allocator. Henrik has tried this under GNUMalloc and has found that the 4108 byte allocation doesn't take two pages.
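The test program itself isn't in this post; a minimal sketch of what it might look like (the test1/test2 modes are inferred from the session output below):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int
main(int argc, char *argv[])
{
    struct rusage ru;
    char *p;
    long i, count;

    if (argc < 3) {
        fprintf(stderr, "usage: %s test1|test2 count\n", argv[0]);
        exit(1);
    }
    count = atol(argv[2]);

    if (strcmp(argv[1], "test1") == 0) {
        /* Two allocations: a 12 byte header plus a page-sized
           buffer. Both fit existing small-object size classes. */
        printf("allocating 12, then 4096 byte structures %ld times..\n", count);
        for (i = 0; i < count; i++) {
            if ((p = malloc(12)) != NULL)
                *p = 1;    /* touch it so it counts towards RSS */
            if ((p = malloc(4096)) != NULL)
                *p = 1;
        }
    } else {
        /* One 4108 byte allocation: just over a page, so a
           page-run allocator may round it up to 8192 bytes. */
        printf("allocating 4108 byte structure %ld times..\n", count);
        for (i = 0; i < count; i++) {
            if ((p = malloc(4108)) != NULL)
                *p = 1;
        }
    }

    /* On FreeBSD, ru_maxrss is reported in kilobytes. */
    (void) getrusage(RUSAGE_SELF, &ru);
    printf("RSS: %ld\n", ru.ru_maxrss);
    return 0;
}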
[adrian@sarah ~]$ ./test1 test1 131072
allocating 12, then 4096 byte structures 131072 times..
RSS: 537708
[adrian@sarah ~]$ ./test1 test2 131072
allocating 4108 byte structure 131072 times..
RSS: 1063840
Saturday, April 19, 2008
"Dial before you Dig"
This is the sort of stuff lurking in the Squid codebase which really needs to be cleaned out.
The short:
http://code.google.com/p/cacheboy/source/detail?r=12592
The long:
I'm going through the legacy memory allocator uses and pushing the allocator initialisation out to the modules themselves, rather than having some of them globally initialised. This will let me push the buffer allocator code (i.e., the "rest" of the current legacy memory allocator uses) outside of the Squid/cacheboy src/ directory and allow it to be reused by other modules. I can then begin pushing more code out of the src/ directory and into libraries, to make dependencies saner, code reuse easier and unit testing much easier.
One of these types is MEM_LINK_LIST.
A FIFO type queue was implemented using a singly-linked list. An SLIST has O(1) dequeue behaviour but O(n) enqueue behaviour - the whole list has to be traversed to find the tail before an item can be appended to it. This means touching potentially dirty pages, which may also stall the bus a little. (I haven't measured that in my testing, by the way; my benchmarking focused on the memory-hit/miss pathway and left ACLs/disk access out - so that's all currently conjecture!)
The FIFO implementation allocated a temporary list object (MEM_LINK_LIST) to hold the next and data pointers. This was mempooled and thus "cached", rather than hitting the system malloc each time.
The only user is the threaded aufs storage code - to store the pending disk read and write operations for a given open storage file.
Now, the "n" in O(n) shouldn't be that great as not very many operations are queued on an open file - generally, there's one read pending on a store file and potentially many writes pending on a store file (if the object is large and coming in faster than 4kbytes every few milliseconds.) In any case, I dislike unbounded cases like this, so I created a new function in the double-linked list type which pops the head item off the dlink list and returns it (and returns NULL if the list is empty) and re-worked the aufs code to use it. The above link is the second half of the work - read the previous few commits for the dlink changes.
Note: I really want to move all of this code over to the BSD queue/list types. ARGH! But I digress.
Initial testing shows that I haven't screwed anything up too badly. (~400 req/sec to a pair of 10,000 RPM 18gig SCSI disks, ~50mbit client traffic, 80% idle CPU.)
Thursday, April 17, 2008
Development thus far
The initial release has been done. The cacheboy-0pre1 release is just a vanilla Squid-2.HEAD tree with the first part of the code reorganisation included - they should be 1:1 bug-compatible.
A developer has found a bug in Squid-2.HEAD relating to larger-than-requested data replies in the data pipeline; it shows up during Vary processing. It shouldn't show up in Squid-2.HEAD / Cacheboy, as I committed a workaround.
The Squid-2.HEAD / Cacheboy stuff should give a ~5% CPU reduction over Squid-2.7 (dataflow changes), and a ~10% CPU reduction over Squid-2.6 (HTTP parsing changes).
Next: sorting out the rest of the code shuffling - the generic parts of the mem and cbdata routines, and then a look at the comm and disk code.
Sunday, April 13, 2008
Experiences with Google Code
I've imported a Squid-2.HEAD CVS repository (with complete history!) into Google Code. This -mostly- worked, although:
- There were some Subversion gatewaying issues inside Google somewhere which made SVN transactions occasionally fail - they've rolled back these changes and things work again!
- I'm not getting any commit messages for some reason!
- svnsync takes -far too long-: building the SVN repo from CVS took about 5 minutes. Syncing my local SVN repo to the Google Code repo? 2 days.
- The size of my repository hangs browsers that try to run the Google Code "source browse" feature. Heh!
- $Id$ tag version numbers have been munged into Subversion revision numbers. Argh! I wish it were obvious this was going to happen! (And because of the above times, I really can't be bothered re-building the repository just yet.)