I migrated away from Apache 1.3 to lighttpd 1.4.19 to handle the load better. Apache 1.3 handles lots of concurrent disk IO on large files fine, but it bites when there are lots of concurrent network connections.
In theory, once all of the caching stuff is fixed, the backends will spend most of their time revalidating objects.
But for some weird reason I'm seeing TCP_REFRESH_MISS on my Lusca edge nodes and generally poor performance during this release. I looked at the logs and found this:
[Host: mozilla.cdn.cacheboy.net\r\n
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
Accept-Language: en-us,en;q=0.5\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
If-Modified-Since: Wed, 03 Jun 2009 15:09:39 GMT\r\n
If-None-Match: "1721454571"\r\n
Cache-Control: max-stale=0\r\n
Connection: Keep-Alive\r\n
Pragma: no-cache\r\n
X-BlueCoat-Via: 24C3C50D45B23509\r\n]
[HTTP/1.0 200 OK\r\n
Content-Type: application/octet-stream\r\n
Accept-Ranges: bytes\r\n
ETag: "1687308715"\r\n
Last-Modified: Wed, 03 Jun 2009 15:09:39 GMT\r\n
Content-Length: 2178196\r\n
Date: Fri, 12 Jun 2009 04:25:40 GMT\r\n
Server: lighttpd/1.4.19\r\n
X-Cache: MISS from mirror1.jp.cacheboy.net\r\n
Via: 1.0 mirror1.jp.cacheboy.net:80 (Lusca/LUSCA_HEAD)\r\n
Connection: keep-alive\r\n\r]
Notice the different ETags? Hm! I wonder what's going on. On a hunch I checked the ETags from both backends. For that object, master1 gives "1721454571"; master2 gives "1687308715". Both copies have the same size and the same timestamp. I wonder what is different?
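To make it concrete why mismatched ETags turn every revalidation into a full fetch, here is a rough sketch of the decision a backend makes for a conditional GET. The function name revalidate is made up for illustration and the logic is deliberately simplified (no weak validators, no If-Modified-Since handling); it is not lighttpd's or Lusca's actual code. The two ETag values are the ones from the log above.

```python
def revalidate(if_none_match, current_etag):
    """Return the status a backend would plausibly send for a conditional GET.

    Simplified sketch: only strong ETag comparison is modelled.
    """
    # If-None-Match may carry several comma-separated validators.
    candidates = [tag.strip() for tag in if_none_match.split(",")]
    if "*" in candidates or current_etag in candidates:
        return 304  # cached copy is still good, no body sent
    return 200      # validator mismatch, the full object is sent again


MASTER1_ETAG = '"1721454571"'  # what master1 reports for the object
MASTER2_ETAG = '"1687308715"'  # what master2 reports for the same object

# Object cached via master1, revalidated against master1: cheap 304.
print(revalidate(MASTER1_ETAG, MASTER1_ETAG))
# Object cached via master1, revalidated against master2: full 200 again.
print(revalidate(MASTER1_ETAG, MASTER2_ETAG))
```

So whenever an edge node cached the object from one backend and revalidates against the other, it gets the whole 2 MB file back instead of a 304, which would explain the TCP_REFRESH_MISS entries.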
Time to go digging into the depths of the lighttpd code.
EDIT: the ETag generation is configurable. By default it uses the mtime, inode and file size. Disabling inode, and then both inode and mtime, didn't help. I then found that earlier lighttpd versions have different ETag generation behaviour on 32-bit and 64-bit platforms. I'll build a local lighttpd package and see if I can replicate the behaviour on my 32/64-bit systems. Grr.
Meanwhile, Cacheboy isn't really serving any of the mozilla updates. :(
EDIT: so it turns out the bug is in the ETag generation code. They create an unsigned 32-bit integer hash value from the etag contents, then shovel it into a signed long for the ETag header. Unfortunately for FreeBSD-i386, "long" is a signed 32-bit type, and thus things go awry from time to time. Grrrrrr.
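A small sketch of that failure mode, modelled in Python rather than C. The helper names and the sample hash value are hypothetical and this is not lighttpd's code; it only assumes what is described above, namely that an unsigned 32-bit hash is stored in a signed "long" that is 32 bits wide on one platform and 64 bits wide on the other.

```python
def as_signed_long(value, long_bits):
    """Model storing an unsigned 32-bit value into a signed C long."""
    value &= 0xFFFFFFFF  # the hash is an unsigned 32-bit quantity
    if long_bits == 32 and value >= 2**31:
        # Top bit set: a 32-bit signed long wraps around to a negative number.
        return value - 2**32
    # A 64-bit long has room for every unsigned 32-bit value as-is.
    return value


def etag_for(hash_value, long_bits):
    """Format the hash the way an ETag header value would be quoted."""
    return '"%d"' % as_signed_long(hash_value, long_bits)


h = 3000000000  # hypothetical hash with the top bit set
print(etag_for(h, 32))  # negative number on the 32-bit box
print(etag_for(h, 64))  # positive number on the 64-bit box
```

Hashes below 2**31 come out the same on both platforms, which fits with things only going wrong from time to time.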
EDIT: fixed in a newly-built local lighttpd package; both backend servers are now doing the right thing. I'm going back to serving content.