Thursday, June 11, 2009

Migrating to Lighttpd on the backend, and why aren't my files being cached?

I migrated from Apache-1.3 to Lighttpd-1.4.19 to handle the load better. Apache-1.3 handles lots of concurrent disk IO on large files fine, but it bites for lots of concurrent network connections.

In theory, once all of the caching stuff is fixed, the backends will spend most of their time revalidating objects.

But for some weird reason I'm seeing TCP_REFRESH_MISS on my Lusca edge nodes and generally poor performance during this release. I look at the logs and find this:



[Host: mozilla.cdn.cacheboy.net\r\n
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
Accept-Language: en-us,en;q=0.5\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
If-Modified-Since: Wed, 03 Jun 2009 15:09:39 GMT\r\n
If-None-Match: "1721454571"\r\n
Cache-Control: max-stale=0\r\n
Connection: Keep-Alive\r\n
Pragma: no-cache\r\n
X-BlueCoat-Via: 24C3C50D45B23509\r\n]

[HTTP/1.0 200 OK\r\n
Content-Type: application/octet-stream\r\n
Accept-Ranges: bytes\r\n
ETag: "1687308715"\r\n
Last-Modified: Wed, 03 Jun 2009 15:09:39 GMT\r\n
Content-Length: 2178196\r\n
Date: Fri, 12 Jun 2009 04:25:40 GMT\r\n
Server: lighttpd/1.4.19\r\n
X-Cache: MISS from mirror1.jp.cacheboy.net\r\n
Via: 1.0 mirror1.jp.cacheboy.net:80 (Lusca/LUSCA_HEAD)\r\n
Connection: keep-alive\r\n\r]


Notice the different ETags? Hm! I wonder what's going on. On a hunch I checked the ETags from both backends. master1 for that object gives "1721454571"; master2 gives "1687308715". The object has the same size and the same timestamp on both. I wonder what is different?
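To make the failure mode concrete, here is a minimal sketch of why two backends with different ETags defeat revalidation. It is a simplification, not what Lusca or lighttpd actually run: the `revalidate` function and the `BACKEND_ETAGS` table are made up for illustration, and it only models strong ETag comparison. The one HTTP rule it leans on is that a non-matching If-None-Match wins over If-Modified-Since, so an identical Last-Modified doesn't save you.

```python
# Hypothetical model of a conditional GET; names are illustrative only.
BACKEND_ETAGS = {
    "master1": '"1721454571"',
    "master2": '"1687308715"',
}


def revalidate(if_none_match, backend_etag):
    """Return the status a backend would send for a conditional GET.

    Simplified: a matching ETag yields 304 Not Modified, anything else
    yields a full 200 response, regardless of If-Modified-Since.
    """
    return 304 if if_none_match == backend_etag else 200


# The cache fetched the object from master1 and stored its ETag.
cached_etag = BACKEND_ETAGS["master1"]

# Revalidating against each backend in turn.
results = {name: revalidate(cached_etag, etag)
           for name, etag in BACKEND_ETAGS.items()}

for name, status in results.items():
    print(name, status)
```

Whenever the revalidation lands on the backend that didn't supply the cached copy, the whole 2 MB object comes back again, which is what a refresh miss looks like from the edge.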

Time to go digging into the depths of the lighttpd code.

EDIT: the etag generation is configurable. By default it uses the mtime, inode and filesize. Disabling inode and inode/mtime didn't help. I then found that earlier lighttpd versions have different etag generation behaviour based on 32 or 64 bit platforms. I'll build a local lighttpd package and see if I can replicate the behaviour on my 32/64 bit systems. Grr.

Meanwhile, Cacheboy isn't really serving any of the mozilla updates. :(

EDIT: so it turns out the bug is in the ETag generation code. They create an unsigned 32-bit integer hash value from the etag contents, then shovel it into a signed long for the ETag header. Unfortunately for FreeBSD-i386, "long" is a signed 32-bit type, so any hash value above 2^31-1 doesn't fit, and thus things go awry from time to time. Grrrrrr.
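A rough sketch of the unsigned-to-signed problem, simulated in Python rather than taken from the lighttpd source. The two function names are invented for illustration, and the hash value is an arbitrary number above 2^31-1; this doesn't attempt to reproduce the two ETags above or lighttpd's actual hash function. It only shows how the same 32-bit unsigned value prints differently depending on the width of the signed type it is pushed through.

```python
import struct


def etag_with_32bit_long(h):
    """Format an unsigned 32-bit hash as if it passed through a 32-bit signed long.

    Hypothetical helper: reinterprets the same 32 bits as a signed integer,
    which is what a two's-complement i386 platform would end up printing.
    """
    (signed,) = struct.unpack("<i", struct.pack("<I", h))
    return '"%d"' % signed


def etag_with_64bit_long(h):
    """Format the same hash as if "long" were 64 bits wide (value is preserved)."""
    return '"%d"' % h


small_hash = 1721454571   # fits in 31 bits: both platforms agree
large_hash = 3000000000   # arbitrary value above 2**31 - 1: they disagree

for h in (small_hash, large_hash):
    print(h, etag_with_32bit_long(h), etag_with_64bit_long(h))
```

Hashes that happen to fit in 31 bits come out identical either way, which would explain why the problem only shows up from time to time.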

EDIT: fixed in a newly-built local lighttpd package; both backend servers are now doing the right thing. I'm going back to serving content.

5 comments:

  1. Is this also in the latest 1.4 snapshot?

    http://blog.lighttpd.net/articles/2009/06/11/pre-release-lighttpd-1-4-23rc2-r2534-new-1-5-snapshot

    Would appreciate if you could file a bug report and upload your patch to it.

  2. Gea-Sun Lin: nope. Not inode number. It's an unsigned -> signed overflow.

    ymg: it's in 1.4-svn; I didn't check -head. I've emailed the lighttpd list to see what others think about it; I'd appreciate discussion before I code a patch. My current patch is just to turn the uint32_t into an int32_t and force the hash to be calculated and passed in as a signed int; that will fit inside a signed long.

  3. Apache-1.3? Did you try 2.2 with the event based model?

  4. Apache-2.2 event may have worked just as well as the lighttpd model.

    As far as I can tell, both suck under high levels of blocking disk IO though. I need to investigate this a little further. I hear lighttpd has async disk IO in trunk and there's something similar with apache that doesn't involve dedicated worker threads per connection.

    In any case it's still early days and the whole point here is to start generating some data and provide feedback to people. :)
