
Monday, February 22, 2016

Why's my laptop running so hot? Or: Firefox, Pandora, and 1 million syscalls a second.

My FreeBSD-HEAD laptop runs very warm when running Firefox, and even warmer when it's doing something simple - like, say, streaming from Pandora.

So, I decided to take a bit of a look.

First up, 'vmstat -a' - a good top-level peek.

procs  memory       page                    disks     faults         cpu
r b w  avm   fre   flt  re  pi  po    fr   sr ad0 cd0   in    sy    cs us sy id
3 0 0  21G  180M 30403   0   2   0 21520  976  27   0  598 462567 10061 25 31 44
1 0 0  21G  174M 10389   0   0   0  3320  980   0   0  913 1203071 11892 15 24 61
3 0 0  21G  192M  4028   0   0   0  4763  983  14   0  563 1246314  8166 15 23 62
1 0 0  21G  192M  2305   0   0   0   334  988   1   0  390 1165396 10784 18 20 62
2 0 0  21G  202M 30493   0   0   0  3154  983   4   0  340 1072100 13287 28 23 49
2 0 0  21G  202M  8440   0   0   0   646  979   1   0  391 1071166  8802 32 20 48
1 0 0  21G  204M  3608   0   0   0  1841 1954  31   0  516 1041635 11319 33 21 46
3 0 0  20G  212M 67782   0   0   0  2895  973   1   0  387 1053575 10995 28 26 46
2 0 0  21G  187M 25368   0   0   0  2483  989   7   0  475 1047031 12056 29 23 48


.. OK, a million syscalls a second. Fine. Let's ask DTrace what's going on:

root@victoria:/home/adrian # dtrace -n 'syscall:::entry { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 1082 probes
^C


  gettimeofday                                                    305
  lstat                                                           336
  kevent                                                          598
  recvfrom                                                       1018
  __sysctl                                                       1384
  getpid                                                         2158
  sigprocmask                                                    5189
  select                                                         5443
  writev                                                         6215
  madvise                                                        6606
  setitimer                                                      6729
  recvmsg                                                       17556
  _umtx_op                                                      40740
  ppoll                                                        853940
  read                                                        1152896
  poll                                                        1158669
  write                                                       2159990
  ioctl                                                       2170830


root@victoria:/home/adrian # dtrace -n 'syscall::read:return /execname == "firefox"/ { @["rval (bytes)"] = quantize(arg1); }'

dtrace: description 'syscall::read:return ' matched 2 probes

^C

  rval (bytes)                                     

           value  ------------- Distribution ------------- count  
              -2 |                                         0      
              -1 |                                         1      
               0 |                                         0      
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 496294 
               2 |                                         5      
               4 |                                         0      
               8 |                                         0      
              16 |                                         0      
              32 |                                         0      
              64 |                                         0      
             128 |                                         0      
             256 |                                         0      
             512 |                                         0      
            1024 |                                         0      
            2048 |                                         0      
            4096 |                                         6      
            8192 |                                         0      
           16384 |                                         0      
           32768 |                                         36     
           65536 |                                         0       


root@victoria:/home/adrian # dtrace -n 'syscall::write:return /execname == "firefox"/ { @["rval (bytes)"] = quantize(arg1); }'
dtrace: description 'syscall::write:return ' matched 2 probes
^C
  rval (bytes)                                    
           value  ------------- Distribution ------------- count  
              -2 |                                         0      
              -1 |@@@@@@@@@@@@@@@@@@@@                     875025 
               0 |                                         0      
               1 |@@@@@@@@@@@@@@@@@@@@                     876075 
               2 |                                         0      
               4 |                                         0      
               8 |                                         0      
              16 |                                         14     
              32 |                                         1      
              64 |                                         0      
             128 |                                         15     
             256 |                                         8      
             512 |                                         8      
            1024 |                                         29     
            2048 |                                         563    
            4096 |                                         0      
            8192 |                                         0      
           16384 |                                         0      
           32768 |                                         14     
           65536 |                                         0       


... OK, so read and write are mostly doing single-byte transactions, and write is actually failing (returning -1) about as often as it's succeeding.

So, what's actually going on? I ran truss briefly, and I got a lot of this:

_umtx_op(0x82efa2e80,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOPTR,0xce0c5ac0)         = 0 (0x0)
_umtx_op(0x82b16fb70,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8d6a57e58,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
write(158,"x",1)                 ERR#35 'Resource temporarily unavailable'
_umtx_op(0x8006bd4b8,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffbde44c88) = 0 (0x0)

So I'm guessing there are a lot of inefficient single-byte read/write syscalls being used to wake up remote threads, along with a lot of inefficient use of the sound API.
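This smells like the classic self-pipe wakeup trick. Here's a minimal sketch of the pattern (hypothetical - not Firefox's actual code) that produces exactly this profile, including the failing writes: once the pipe fills up, write() returns EAGAIN, which is the ERR#35 above.

#include <poll.h>
#include <unistd.h>
#include <fcntl.h>

static int wake_pipe[2];

void wakeup_init(void) {
    pipe(wake_pipe);
    fcntl(wake_pipe[0], F_SETFL, O_NONBLOCK);
    fcntl(wake_pipe[1], F_SETFL, O_NONBLOCK);
}

/* Any thread can wake the event loop by writing one byte. */
void wakeup_signal(void) {
    /* EAGAIN just means a wakeup is already pending - it's ignored. */
    (void) write(wake_pipe[1], "x", 1);
}

/* Event loop side: block in poll(), then drain the pipe. */
void event_loop_wait(void) {
    struct pollfd pfd = { .fd = wake_pipe[0], .events = POLLIN };
    (void) poll(&pfd, 1, -1);
    char c;
    while (read(wake_pipe[0], &c, 1) == 1)
        ;   /* one syscall per byte - hence all the single-byte reads */
}

Do that once per queued event across a few dozen threads and you get, well, about a million syscalls a second.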

For sound ioctls:

ioctl(68,SNDCTL_DSP_GETOSPACE,0xd7d54e10)     = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOPTR,0xd7d54e00)         = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOSPACE,0xd7d54df0)     = 0 (0x0)
ioctl(68,SNDCTL_DSP_GETOPTR,0xd7d54d50)         = 0 (0x0)


.. so I'm guessing it's doing this on every thread wakeup, or something equally silly, even when it doesn't need to yet.
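For comparison, here's a minimal sketch (hypothetical - standard OSS ioctls, not Firefox's actual audio path) of only asking the driver for buffer state when there's actually data to push, rather than on every wakeup:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/soundcard.h>
#include <unistd.h>

/*
 * Query SNDCTL_DSP_GETOSPACE once per write attempt, instead of
 * issuing GETOPTR/GETOSPACE on every thread wakeup.
 */
ssize_t
audio_write(int fd, const char *buf, size_t len)
{
    audio_buf_info info;

    if (ioctl(fd, SNDCTL_DSP_GETOSPACE, &info) < 0)
        return (-1);
    if (info.bytes == 0)        /* DMA buffer full - sleep until the */
        return (0);             /* next period rather than spinning */
    return (write(fd, buf, len < (size_t)info.bytes ?
        len : (size_t)info.bytes));
}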

Thursday, March 19, 2015

Cache Line Aliasing #2, or "What happens when you page align everything"

After a little more digging into the Intel performance side of things, I discovered one of the big reasons for the performance drop on this particular workload: how Intel CPUs do memory reordering.

The TL;DR is this - there's some hardware inside Intel CPUs that tracks memory ordering and cache contents - but it doesn't use all the address bits.

The relevant chapter in the Intel optimisation guide is 3.6.8 - Capacity Limits and Aliasing in Caches. The specific thing I was hitting is described in 3.6.8.2 - Store Forwarding Aliasing.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

So, given this, what can be done? In this workload, a bunch of large matrices were allocated via jemalloc, which page-aligns large allocations. In the default invocation of the benchmark (where the allocation padding size is 0), the memory access patterns showed a very large number of counter events for LD_BLOCKS_PARTIAL.ADDRESS_ALIAS - the count of 4K address-aliasing events on the Sandy Bridge Xeon processors I've been testing on. (The same occurs on Westmere, Ivy Bridge and Haswell.) As I vary the padding size, the address-aliasing count drops, the memory access counters increase, and the general performance increases.

On the test boxes I have, running:

pmcstat -w 120 -C -p LD_BLOCKS_PARTIAL.ADDRESS_ALIAS ./himenobmtxpa M

padding (bytes)   ADDRESS_ALIAS count   benchmark score (MFLOPS)
0                 217799413             830.995025
64                18138386              1624.296713
96                8876469               1662.486298
128               19281984              1645.370750
192               18247069              1643.119908
256               18511952              1661.426341
320               19636951              1674.154119
352               19716236              1686.694053
384               19684863              1681.110499
448               18189029              1683.163673
512               19380987              1691.937818

So there's still plenty of aliasing going on at the different padding offsets, but there's a very marked drop between 0 and, well, anything.
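To make the mechanism concrete, here's a tiny hypothetical kernel that hits the rule above - a store followed by a non-dependent load whose address matches in the low 12 bits, which is what LD_BLOCKS_PARTIAL.ADDRESS_ALIAS ends up counting:

/* 512 doubles = 4096 bytes, so if the two arrays end up contiguous,
 * &a[i] and &b[i] differ by exactly 4KB for every i. */
double a[512], b[512];

double
alias_kernel(void)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < 512; i++) {
        a[i] = (double) i;  /* store to a + 8*i */
        sum += b[i];        /* load from b + 8*i: the CPU's partial
                             * address compare falsely flags this as
                             * depending on the store, so the load is
                             * blocked and reissued */
    }
    return (sum);
}

Pad one of the arrays by a cache line or two and the false dependency disappears.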

It turns out that someone's gone and done a bunch more digging into the effects of various CPU magic under the hood. The last paper in the list (Analysing Contextual Bias..) looks at Aliasing and Cache Effects and the effect of memory layout. There's some cute (and sobering!) analysis of the performance changes due to something as simple as the length of your login name in the UNIX environment. It's worth reading.

The summary? Maybe page alignment of all of your memory accesses isn't the way to go.

For further reading:

Friday, March 6, 2015

cache line aliasing effects, or "why is freebsd slower than linux?"

There were some threads on the FreeBSD/DragonflyBSD mailing lists a few years ago (2012?) which talked about some math benchmarks being much slower on FreeBSD/DragonflyBSD versus Linux.

When the same benchmark is run on FreeBSD/DragonflyBSD using the Linux layer (ie, a Linux binary, compiled for Linux but run on the BSDs), it gives the same or better behaviour.

Some digging was done, and it turned out to be due to memory allocation patterns and memory layout - the jemalloc library allocates large chunks at page-aligned boundaries, whereas the allocator in glibc under Linux does not.

I've put the code online in the hope that others can test and verify this:

https://github.com/erikarn/himenobmtxpa

The branch 'local/freebsd' has my local change to allow the allocator offset to be specified. The offset compounds on each allocation - so with an 'n' byte offset, the first allocation is 0 bytes offset from the page boundary, the next is 'n' bytes offset from the page boundary, the next is '2n' bytes offset, etc.
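A minimal sketch (hypothetical names, not the actual branch code) of the compounding-offset allocation:

#include <stdlib.h>

static size_t pad;          /* configured offset, eg 127 */
static size_t cur_offset;   /* compounds across allocations */

/*
 * malloc() page-aligns large requests here (as jemalloc does), so
 * bumping the returned pointer by cur_offset shifts each matrix a
 * little further off its page boundary.  Benchmark-only: free()
 * would need the original pointer back.
 */
void *
alloc_padded(size_t len)
{
    char *p;

    p = malloc(len + cur_offset);
    if (p == NULL)
        return (NULL);
    p += cur_offset;
    cur_offset += pad;
    return (p);
}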

You can experiment with different values and get completely different behavioural results. It's non-trivial: there's a 100% speedup by using a 127 byte offset for each allocation, versus a 0 byte offset.

I'd like to investigate cache line aliasing effects further. There was work done a few years ago to offset mbuf headers in the FreeBSD kernel so they weren't all page-aligned or 256/512/1024 byte aligned - and apparently this gave a significant performance improvement. But it wasn't folded into FreeBSD. What I'd like to do is come up with some better strategies / profiling guides for identifying when this is actually happening so the underlying objects being accessed can be adjusted.

So - if anyone out there has any tips, hints or suggestions on how to do this, please let me know. I'd like to document and automate this testing.

Tuesday, September 3, 2013

Finding low hanging fruit with PMC, or "O(wtf)" ?

I've lately been focusing on performance counter stuff on Sandy Bridge (Xeon and non-Xeon.) Part of this has been fixing some of the counters that were wrong. Part has been digesting the Intel tuning guides and the Intel micro-architecture for Sandy Bridge. It's a little different from the old-school pipeline-driven architectures that rule the MIPS world.

So, I fired up some of my scripts (at http://github.com/erikarn/hwpmc) on a live cache pushing a whole lot of live Netflix video traffic. The scripts use the PMC framework in global counter mode rather than sampling mode, so it's cheap to do and doesn't affect performance.

What I found:

  1. The pipeline slots per cycle metric is around 16% - so there's a lot of stalling going on.
  2. There's a lot of memory traffic going on - around 50% of clock cycles are spent in LLC_MISS - ie, the data wasn't in L1, L2 or L3/LLC (last-level cache) and thus had to be fetched from memory.

So, I started digging into why there were so many memory accesses. It turns out the biggest abuser was the cross-CPU IPI involved in synchronising page mapping tables - there are a few places calling pmap_invalidate_range() as part of sendfile() buffer completion, and this was causing issues. I pointed this out; someone else has since addressed it internally. (Ideally, if the IO path uses unmapped buffers on amd64, there shouldn't be any need to map them in and out of KVA at all.) I think that saved about 4% of total clock cycles spent stalled.

Then I found a lot of stalling going on in the mwait and ACPI sleep path. It turns out that these paths seem to involve doing ISA space IO port accesses. These are .. very slow. I've just flipped my testing over to use no mwait and use HLT.

Next - flowtable had been turned on during experimentation and I had noticed that the flowtable expire/flush code would periodically spike up. It spiked up more when more clients and more TCP flows were connected. It showed up in both memory accesses and clock cycles busy PMCs - and the reason made me laugh out loud.

The flowtable uses a bitstring_t - effectively an array of bytes treated as a bitmap, like select()'s FD_SET - and the expiry code would walk this to find flows to expire. But for each flow it checked, it would call ffs() over the whole set again, from the beginning, to find the next flow to check.

.. so, a loop over a loop over the whole set: O(n^2). Right there, in the flow cleaning path. Doing byte-offset fetches rather than 32-bit fetches, too. Everything about it was ridiculous. As we scaled up to serve more flows, the flowcleaner's CPU cycle count spiked really, really hard.

I pointed this out in an email to my coworkers and fell asleep. It was fixed when I awoke - a co-worker fixed it to be correctly O(n) whilst I was sleeping. It's now totally disappeared from the CPU cycle and stall analysis.
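For the curious, here's a hypothetical sketch of the before/after shape of that scan (not the actual flowtable code):

#include <strings.h>    /* ffs() */

/* Before - O(n^2) overall: every call re-scans from bit 0. */
int
next_flow_quadratic(const unsigned char *map, int nbits, int prev)
{
    int bit;

    for (bit = 0; bit < nbits; bit++)   /* always restarts at zero */
        if (bit > prev && (map[bit / 8] & (1 << (bit % 8))))
            return (bit);
    return (-1);
}

/* After - O(n) overall: resume from prev + 1, a word at a time. */
int
next_flow_linear(const unsigned int *map, int nbits, int prev)
{
    int i = (prev + 1) / 32;
    unsigned int w;

    for (; i * 32 < nbits; i++) {
        w = map[i];
        if (i == (prev + 1) / 32)
            w &= ~0u << ((prev + 1) % 32);  /* mask bits already seen */
        if (w != 0)
            return (i * 32 + ffs((int) w) - 1);
    }
    return (-1);
}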

So, I've just been chipping away at things here and there. There are some larger scale issues that I really want to address but I'd like to make sure all the odd, silly and remaining low hanging fruit are addressed. Then comes the fun stuff.

Friday, June 26, 2009

Lusca in Production, #2

Here's some 24 hour statistics from a Lusca-HEAD forward proxy install:

[Graphs: request rate; file descriptor count (clients, servers and disk FDs); byte and request hit ratio; traffic to clients (blue) and from the internet (red); CPU utilisation (1.0 is 100% of one core); Lusca-HEAD pushing >100mbit in production.]


Here's the Lusca HEAD install under FreeBSD-7.2 + TPROXY patch. This is a basic configuration with minimal customisation. There's ~ 600gig of data in the cache and there's around 5TB of total disk storage.


You can see when I turned on half of the users, then all of the users. I think there's now around 10,000 active users sitting behind this single Lusca server.

Tuesday, May 19, 2009

More profiling results

So whilst I wait for some of the base restructuring in LUSCA_HEAD to "bake" (ie, settle down, be stable, etc) I've been doing some more profiling.

Top CPU users on my P3 testing box:

root@jennifer:/home/adrian/work/lusca/branches/LUSCA_HEAD/src# oplist ./squid
CPU: PIII, speed 634.464 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
1832194   8.7351  libc-2.3.6.so            _int_malloc
969838    4.6238  libc-2.3.6.so            memcpy
778086    3.7096  libc-2.3.6.so            malloc_consolidate
647097    3.0851  libc-2.3.6.so            vfprintf
479264    2.2849  libc-2.3.6.so            _int_free
468865    2.2354  libc-2.3.6.so            free
382189    1.8221  libc-2.3.6.so            calloc
326540    1.5568  squid                    memPoolAlloc
277866    1.3247  libc-2.3.6.so            re_search_internal
256727    1.2240  libc-2.3.6.so            strncasecmp
249860    1.1912  squid                    httpHeaderIdByName
248918    1.1867  squid                    comm_select
238686    1.1380  libc-2.3.6.so            strtok
215302    1.0265  squid                    statHistBin

Sigh. snprintf() leads to this:

root@jennifer:/home/adrian/work/lusca/branches/LUSCA_HEAD/src# opsymbol ./squid snprintf
CPU: PIII, speed 634.464 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
-------------------------------------------------------------------------------
  15840     1.7869  squid                    safe_inet_addr
  29606     3.3398  squid                    pconnPush
  54021     6.0940  squid                    httpBuildRequestHeader
  64916     7.3230  libc-2.3.6.so            inet_ntoa
  105692   11.9229  squid                    clientSendHeaders
  127929   14.4314  squid                    urlCanonicalClean
  196481   22.1646  squid                    pconnKey
  265474   29.9476  squid                    urlCanonical
36879    100.000  libc-2.3.6.so            snprintf
  813284   91.7449  libc-2.3.6.so            vsnprintf
  36879     4.1602  libc-2.3.6.so            snprintf [self]
  12516     1.4119  libc-2.3.6.so            vfprintf
  9209      1.0388  libc-2.3.6.so            _IO_no_init

Double sigh. Hi, the 90's called, they'd like their printf()-in-performance-critical code back.
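For illustration, here's a hypothetical sketch of the kind of replacement involved - assembling a pconn-style "host:port" key by hand rather than dragging in all of vfprintf()'s machinery for every lookup:

#include <stdio.h>
#include <string.h>

/* What the profile above is complaining about. */
size_t
pconn_key_slow(char *buf, size_t len, const char *host, int port)
{
    return (snprintf(buf, len, "%s:%d", host, port));
}

/* The boring way: one strlen(), one memcpy(), a hand-rolled itoa. */
size_t
pconn_key_fast(char *buf, size_t len, const char *host, int port)
{
    size_t hl = strlen(host);
    char tmp[8];
    int n = 0;

    if (hl + 8 > len)       /* ':' + up to 5 digits + NUL */
        return (0);
    memcpy(buf, host, hl);
    buf[hl++] = ':';
    do {
        tmp[n++] = '0' + (port % 10);
        port /= 10;
    } while (port > 0);
    while (n > 0)
        buf[hl++] = tmp[--n];
    buf[hl] = '\0';
    return (hl);
}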

Monday, April 6, 2009

Lusca and Cacheboy improvements in the pipeline..

After profiling Lusca-HEAD rather extensively on the CDN nodes, I've discovered that the largest CPU "user" on the Core 2 Duo class boxes is memcpy(). On the ia64-2 node, memcpy() shows up much lower down the list. I'm sure this has to do with the differing FSB and general memory bus bandwidth available on the two architectures.

I'm planning out the changes to the store client needed to support fully copy-free async read and write. This should reduce the CPU overhead on core 2 duo class machines to the point where Lusca should break GigE throughput on this workload without too much CPU use. (I'm sure it could break GigE throughput right now on this workload though.)

I'll code this all up during the week and build a simulated testing rig at home "pretending" to be a whole lot of clients downloading partial bits of mozilla/firefox updates, complete with random packet loss, latency and abort probability.

I also plan on finally releasing the bulk of the Cacheboy CDN software (hackish as it is!) during the week, right after I finally remove the last few bits of hard-coded configuration locations. :) I still haven't finished merging in the bits of code which do the health check, calculate the current probabilities to assign each host and then write out the geoip map files. I'll try to sort that out over the next few days and get a public subversion repository with the software online.

By the way, I plan on releasing the Cacheboy CDN software under the Affero GPL (AGPL) licence.

Tuesday, January 20, 2009

Where the CPU is going

Oprofile is fun.

So, let's find out all of the time spent in cacheboy-head, per symbol, with cumulative time, but only showing symbols taking 1% or more of the CPU:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la -t 1 ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples cum. samples % cum. % image name symbol name
2100394 2100394 6.9315 6.9315 libc-2.3.6.so memcpy
674036 2774430 2.2244 9.1558 libc-2.3.6.so vfprintf
657729 3432159 2.1706 11.3264 squid memPoolAlloc
463901 3896060 1.5309 12.8573 libc-2.3.6.so _int_malloc
453978 4350038 1.4982 14.3555 libc-2.3.6.so strncasecmp
442439 4792477 1.4601 15.8156 libc-2.3.6.so re_search_internal
438752 5231229 1.4479 17.2635 squid comm_select
423196 5654425 1.3966 18.6601 squid memPoolFree
418949 6073374 1.3826 20.0426 squid stackPop
412394 6485768 1.3609 21.4036 squid httpHeaderIdByName
402709 6888477 1.3290 22.7325 libc-2.3.6.so strtok
364201 7252678 1.2019 23.9344 squid httpHeaderClean
359257 7611935 1.1856 25.1200 squid statHistBin
343628 7955563 1.1340 26.2540 squid SQUID_MD5Transform
330128 8285691 1.0894 27.3434 libc-2.3.6.so memset
323962 8609653 1.0691 28.4125 libc-2.3.6.so memchr

Ok, that's sort of useful. What's unfortunate is that there are, uhm, a lot more symbols than that:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -la ./squid | wc -l
595

Ok, so that's a bit annoying. 16 symbols take ~ 28% of the CPU time, but the other 569-odd take the remaining ~ 72%. This sort of makes traditional optimisation techniques a bit pointless now. I've optimised away almost all of the "stupid" bits - double/triple copying of data, over-allocating and freeing pointlessly, multiple parsing attempts, etc.

How many samples in total?


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -l ./squid | cut -f1 -d' ' | awk '{ s+= $1; } END { print s }'
30302294

Let's now look at what memcpy() is doing, just to get an idea of what needs to be changed:


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i memcpy ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
-------------------------------------------------------------------------------
28133 1.3394 squid storeSwapOut
31515 1.5004 squid stringInit
32619 1.5530 squid httpBuildRequestPrefix
54237 2.5822 squid strListAddStr
54322 2.5863 squid storeSwapMetaBuild
80047 3.8110 squid clientKeepaliveNextRequest
171738 8.1765 squid httpHeaderEntryParseCreate
211091 10.0501 squid httpHeaderEntryPackInto
318793 15.1778 squid stringDup
1022812 48.6962 squid storeAppend
2100394 100.000 libc-2.3.6.so memcpy
2100394 100.000 libc-2.3.6.so memcpy [self]
------------------------------------------------------------------------------

So hm, half the memcpy() CPU time is spent in storeAppend(), followed by stringDup() and httpHeaderEntryPackInto(). Ok, those are what I'm going to be working on eliminating next anyway, so it's not a big deal. This means I'll eliminate ~ 73% of the memcpy() CPU time, which is 73% of 7%, so around 5% of total CPU time. Not too shabby. There'll be some overheads introduced by how it's done (referenced buffer management), but one of the side-effects of that should be a drop in the number of calls to the memory allocator functions, so they should drop off a bit too.

But this stuff is still just micro-optimisation. What I need is an idea of which code -paths- are taking up precious CPU time, and thus what I should consider reimplementing first. Let's use "-t" on non-top-level symbols. To start with, let's look at the two top-level "read" functions, which generally lead to some kind of other processing.

root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i clientReadRequest ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
87536 4.7189 clientKeepaliveNextRequest
1758418 94.7925 comm_select
88441 100.000 clientReadRequest
2121926 86.3731 clientTryParseRequest
88441 3.6000 clientReadRequest [self]
52951 2.1554 commSetSelect
-------------------------------------------------------------------------------


root@jennifer:/home/adrian/work/cacheboy/branches/CACHEBOY_HEAD/src# opreport -lc -t 1 -i httpReadReply ./squid
CPU: PIII, speed 634.485 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples % symbol name
-------------------------------------------------------------------------------
3962448 99.7463 comm_select
163081 100.000 httpReadReply
2781096 53.2193 httpAppendBody
1857597 35.5471 httpProcessReplyHeader
163081 3.1207 httpReadReply [self]
57084 1.0924 memBufGrow
------------------------------------------------------------------------------

Here we're not interested in who is -calling- these functions (since it's just the comm routine :) but in which functions this routine is calling. The next trick, of course, is to try and figure out which of these paths are taking a noticeable amount of CPU time. Obviously httpAppendBody() and httpProcessReplyHeader() are; they're doing both a lot of copying and a lot of parsing.

I'll look into things a little more in-depth in a few days; I need to get back to paid work. :)

Monday, January 19, 2009

Eliminating copies, or "god this code is horrible"

I've been (slowlyish!) unwinding some of the evil horridness that exists in the src/http.c code which handles reading data from upstream servers/caches, parsing it, and throwing it into the store.

There are two annoying memory copies, as I've said before - one is a copy of the incoming data into a MemBuf, used -just- to assemble the full response headers for parsing; the other (well, other two) are for appending the data coming in from the network into the memory store, on its way to the client-side code to be sent back to the client.

Now, as I've said before, the src/http.c code isn't all that long and complicated (by far most of the logic actually happens in the forward and client-side routines; the http.c routines do very little besides pump data back into the memory store) but unfortunately enough various layers of logic are mashed together to make things uhm, "very difficult" to work on separately.

Anyway, back on track. I've mostly pulled apart the code which handles reading the reply and parsing the response headers, and I've eliminated the first copy. The data is now read directly into a MemBuf, which serves as both the incoming buffer (which gets appended to) for the reply status line + headers, _AND_ the incoming buffer for HTTP body data (which never gets appended to - it is written out to the memory store and then reset back to empty.)

So the good news now is that the number one place for L2 loads, L2 stores and CPU cycles spent unhalted (as measured on my P3 667MHz Celeron test box - nice and slow, to expose all those stupid inefficiencies modern CPUs try to cover up :) comes from the memcpy() in src/http.c -> { header parsing (12%), http body appending (84%) } -> storeAppend().

This means one main thing - if I can eliminate the copying into the store, and instead read directly into variable-sized pages (which is unfortunately the bloody tricky part) that are then handed in their entirety to the memory store, that last memcpy() will be eliminated, along with hopefully a good 10+% of CPU time on this P3.
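A rough sketch of where I want to end up (hypothetical names, and glossing over the variable-sized page allocation that makes this tricky):

#include <stdlib.h>
#include <unistd.h>

typedef struct page {
    int refcount;
    size_t len;
    char data[];        /* variable-sized payload */
} page_t;

/*
 * Read directly into a fresh refcounted page and hand the whole page
 * to the memory store - no intermediate buffer, no storeAppend() copy.
 */
page_t *
http_read_body(int fd, size_t size)
{
    page_t *p;
    ssize_t n;

    p = malloc(sizeof(*p) + size);
    if (p == NULL)
        return (NULL);
    n = read(fd, p->data, size);    /* data lands in its final home */
    if (n <= 0) {
        free(p);
        return (NULL);
    }
    p->refcount = 1;
    p->len = (size_t) n;
    return (p);     /* the store takes the reference: zero memcpy() */
}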

After that, it's fixing the various uses of *printf() functions in the critical path, which absolutely should be avoided. I've got some basic patches to begin replacing some of the really STUPID uses of those. I'll begin committing the really obviously easy ones to Cacheboy HEAD once I've verified they don't break anything (in particular, SNMP indexes, of all things..)

Once the two above are done, which accounts for a good 15 - 20% of the current CPU use in Cacheboy (at least in my small objects, memory-cache-only test load on the above hardware), I'll absolutely stop adding any and all new changes, features, optimisations, etc, and go -straight- to "make everything stable" mode again.

There's still so much that needs doing (proper refcounted buffers and strings, comm library functions which properly implement readv() and writev() so I can do things like write out the entire request/reply using vector operations and avoid the other bits of copying which go on, lessening the load on the memory allocator by actually efficiently packing structures, rewriting the http request/reply handling in preparation for replacement HTTP client/server modules, oh and IPv6/threading!) but that will come later.

Sunday, January 18, 2009

Tidying up the http reply handling code..

One of the unfortunate parts of the Squid codebase is that the HTTP request and reply handling code is messed up with the client and server code, and contains both stuff specific to a Cache (eg, looking for headers to control cache behaviour) as well as connection stuff (eg Transfer Encoding stuff, Keepalive, etc.)

My long-term goal is to finally separate all of this mess out so there's "generic" routines to be a HTTP client and server, create requests/replies and parse responses. But for now, tidying up some of the messy code to improve performance (and thus give people motivation to migrate their busy sites to Cacheboy) is on my short-term TODO list.

I spent some time ~ 18 months ago tidying up all of the client-side code so the request line and request header parsing didn't require half a dozen copies of various things just to complete. That was quite successful. The code structure is still horrible, but it works, and that for now is absolutely the most important part.

Now I'm doing something similar to the server-side code. The HTTP server code (src/http.c) combines both reply buffer appending, parsing, 100-continue response handling (well, "handling") and the various header checks for caching and connection in one enormous puddle of code. I'm trying to tease these apart so each part is done separately and the reply data isn't double-copied - once into the reply buffer, then once via storeAppend() into the memory store.

The CPU time spent doing this copying isn't all that high on current systems, but it is definitely noticeable (~30% of all CPU time spent in memcpy()) for slower systems talking to LAN-connected servers. So I'm going to do it - primarily to fix performance on slower hardware, but also because it forces me to tidy up the existing code somewhat.

The next step is avoiding the copy into the memory store entirely, removing another 65% or so of memcpy() CPU time.

Friday, January 16, 2009

Refcounted string buffers!

Those of you who have been watching may have noticed a few String tidyups going into CACHEBOY_HEAD recently (one of which caused a bug in the first cacheboy-1.6 stable release that made it very non-stable!)

This is all in preparation for more sensible string and buffer handling. Unfortunately the Cacheboy codebase inherited a lot of dirty string handling and it needed some house cleaning before I could look towards the future.

Well, the future is here now (well, in /svn/branches/CACHEBOY_HEAD_strref ...) - I brought in my refcounted buffer routines from my previous attempts at all of this and converted String.[ch] over to use it.

For now, the refcounted string implementation doubles the malloc overhead for new strings (since it has to create a small buf_t and a string buffer) but stringDup() becomes essentially free. Since in a lot of cases, the stringDup() occurs when copying string headers and basically leaving them alone, this saves on a bunch of memory copying.
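A minimal sketch of the scheme (using names like buf_t per the branch; the details here are illustrative only, not the actual CACHEBOY_HEAD_strref code):

#include <stdlib.h>

typedef struct {
    int refcount;
    size_t size;
    char *data;
} buf_t;

typedef struct {
    buf_t *buf;     /* shared, refcounted backing buffer */
    size_t offset;  /* this String's view into it */
    size_t len;
} String;

/* The win: duplicating a String is now a refcount bump, not a
 * malloc() plus a memcpy(). */
String
stringDup(const String *s)
{
    s->buf->refcount++;
    return (*s);
}

void
stringClean(String *s)
{
    if (--s->buf->refcount == 0) {
        free(s->buf->data);
        free(s->buf);   /* the extra allocation paid at creation time */
    }
    s->buf = NULL;
}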

Decent performance benefits will only come with a whole lot of work:
  • Remove all of the current assumptions in code which uses String that the actual backing buffer (accessible via strBuf()) is NUL-terminated;
  • Rewrite sections of the code which go between String and C string buffers (with copying, etc) to use String where applicable. Unfortunately a whole lot of the original client_side.c code which handles parsing the request involves a fair bit of crap - so..
  • .. writing replacement request and reply HTTP parsers is probably the next thing to do;
  • Shuffling around the client-side code and the http code to use a buf_t as an incoming socket buffer, instead of how they currently do things (in an ugly way..)
  • Propagate down the incoming socket buffer to the request/reply parsing code, so said code can simply create references to the original socket buffer, bypassing any and all requirement for copying the request/reply data separately.
I'm reasonably excited about the future benefits this code holds, but for now I'm going to remain reasonably conservative and leave the current String improvements where they are. I don't mind if these and the next round of changes to the MemBuf code reduce performance but improve the code; I know that the medium-term goal is going to provide some pretty decent benefits and I want to keep things stable and usable in production whilst I get there.

Next on my list though; looking at removing the places where *printf() is used in critical sections..

Friday, January 9, 2009

More profiling!

The following info is for 10,000 concurrent, keep-alive connections, each just fetching an internal icon object from Squid. This is using my apachebench-adrian package, which can handle such traffic loads.

The below accounts for roughly 60% of total CPU time (ie, 60% of the CPU is spent in userspace) on one core.
With oprofile running, it hits around 12,300 transactions a second.

I have much, much hatred for how Squid uses *printf() everywhere. Sigh.




CPU: AMD64 processors, speed 2613.4 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples cum. samples % cum. % image name symbol name
5383709 5383709 4.5316 4.5316 libc-2.6.1.so vfprintf
4025991 9409700 3.3888 7.9203 libc-2.6.1.so memcpy
3673722 13083422 3.0922 11.0126 libc-2.6.1.so _int_malloc
3428362 16511784 2.8857 13.8983 libc-2.6.1.so memset
3306571 19818355 2.7832 16.6815 libc-2.6.1.so malloc_consolidate
2847887 22666242 2.3971 19.0787 squid memPoolFree
2634120 25300362 2.2172 21.2958 libm-2.6.1.so floor
2609922 27910284 2.1968 23.4927 squid memPoolAlloc
2408836 30319120 2.0276 25.5202 libc-2.6.1.so re_search_internal
2296612 32615732 1.9331 27.4534 libc-2.6.1.so strlen
2265816 34881548 1.9072 29.3605 libc-2.6.1.so _int_free
1826493 36708041 1.5374 30.8979 libc-2.6.1.so _IO_default_xsputn
1641986 38350027 1.3821 32.2800 libc-2.6.1.so free
1601997 39952024 1.3484 33.6285 squid httpHeaderGetEntry
1575919 41527943 1.3265 34.9549 libc-2.6.1.so memchr
1466114 42994057 1.2341 36.1890 libc-2.6.1.so re_string_reconstruct
1275377 44269434 1.0735 37.2625 squid clientTryParseRequest
1214714 45484148 1.0225 38.2850 squid httpMsgFindHeadersEnd
1185932 46670080 0.9982 39.2832 squid statHistBin
1170361 47840441 0.9851 40.2683 squid urlCanonicalClean
1169694 49010135 0.9846 41.2529 libc-2.6.1.so strtok
1145933 50156068 0.9646 42.2174 squid comm_select
1128595 51284663 0.9500 43.1674 libc-2.6.1.so __GI_____strtoll_l_internal
1116573 52401236 0.9398 44.1072 squid httpHeaderIdByName
956209 53357445 0.8049 44.9121 squid SQUID_MD5Transform
915844 54273289 0.7709 45.6830 squid memBufAppend
907609 55180898 0.7640 46.4469 squid stringLimitInit
898666 56079564 0.7564 47.2034 libc-2.6.1.so strspn
883282 56962846 0.7435 47.9468 squid urlParse
852875 57815721 0.7179 48.6647 libc-2.6.1.so calloc
819613 58635334 0.6899 49.3546 squid clientWriteComplete
800196 59435530 0.6735 50.0281 squid httpMsgParseRequestLine