So now, for the "high hit rate, large object" workload which the mirror nodes are currently doing, the top CPU user is memcpy() - via aioCheckCallbacks(). At least it wasn't -also- memset() as well.
That memcpy() is taking ~ 17% of the total userland CPU used by Lusca in this particular workload.
I have this nagging feeling that said memcpy() is the one done in storeAufsReadDone(), where the AUFS code copies the result from the async read into the supplied buffer. It does this because its entirely possible the caller has disappeared between the time the storage manager read was scheduled and the time the filesystem read() was scheduled.
Because the Squid codebase doesn't explicitly cancel or wait for completion of async events - and instead relies on this "locking" and "invalidation" semantics provided by the callback data scheme - trying to pass buffers (and structs in general) into threads is pretty much plainly impossible to do correctly.
In any case, the performance should now be noticably better.
(obnote: I tried explaining this to the rest of the core Squid developers last year and somehow I don't quite think I convinced them that the current approach, with or without the AsyncCallback scheme in Squid-3, is going to work without significant re-engineering of the source tree. Alas..)