The memory allocator routines in particular are annoying - a lot of work has gone into malloc implementations over the last few years to make them perform -very- well in threaded applications as long as you know what you are doing. This means doing things like allocating/freeing memory in the same thread and limiting memory exchange between threads (mostly a deal with very small allocations).
Unfortunately, the mempools implementation saves a noticable amount of CPU because it hides all of the repetitive small memory allocations which Squid does for a variety of things. Its hard to profile too - I see that the CPU spends a lot of time in the allocator, but figuring out which functions are causing the CPU usage is difficult. Sure, I can find out the biggest malloc users by call - but they're not the biggest CPU users according to the oprofile callgraphs. I think I'll end up having to spend a month or so rewriting a few areas of code that account for the bulk of the malloc'ing to see what affect it has on CPU before I decide what to do here.
I just don't see the point in trying to thread the mempools codebase for anything other than per-pool statistics when others have been doing a much better job of understanding memory allocation contention on massively parallel machines.