Tuesday, August 13, 2013

Profiling on superscalar architectures, or "no, instruction counts don't necessarily matter nowdays.."

I could spend a lot of time brain dumping various things to do with profiling on anything more recent than Pentium 1 (or, well, anything remotely to do with MIPS of that era, but I digress.) In any case, there's plenty of useful writings about it so I'm not going to - just jump over to https://wiki.freebsd.org/AdrianChadd/ProfilingNotes .

However, I thought I'd provide a basic example of where "instructions" doesn't actually matter, as well as a shortcoming of the current FreeBSD tools.

My network testing stack does a whole lot of read() and write() syscalls to achieve its goal. For those who know what's going on, I hope you know where I'm taking this. Anyway..

Firstly, the standard. "pmcstat -S instructions -T". This prints a "top" like output counting instructions retired.
Figure 1. # pmcstat -S instructions -T -w 5

This looks like the contention is in the mutexes protecting socket receive and the TCP output path. Sure, but why is it contending?

The problem with doing it based on instructions retired is that it hides any issues to do with stalls. There's a bunch of sources of potential stalls - memory reads, memory writes, stuff not being in cache but being needed for instructions that are running. They're generally either side-effects of operations not being able to complete in time (eg if you have a whole lot of completed operations that need to push stuff out to memory to continue, but there's no free bandwidth to queue memory writes) but sometimes it's just from straight bulk memory copies.

If you're interested about the Intel microarchitecture and how all of these pieces fit together to process an instruction stream in parallel, with all of the memory controller paths coming in and out, have a read of this: http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf .

Ok, so let's look at general stalls. There's a bunch of L1, L2, LLC (last level cache, think "L3" here) operations that can be looked at, as well as stuff that FreeBSD's PMC framework doesn't support  - notably some of the stuff on the "uncore" - the shared cache and pipelines between all cores on a socket. It supports the events implemented using MSRs, but not events implemented using the PCIe configuration space.

So, without further ado:

Figure 2. # pmcstat -S RESOURCE_STALLS.ANY -T -w 5
Yup. This looks much more like what I'd expect. The CPU is stalled doing copyout(). This is a mostly-read() workload, so that's what I'd expect. mb_free_ext() is interesting; I'll go look at that.

Now, copyout() is doing a bulk copy. So, yes - I'd expect that to be hurting. mb_free_ext() shouldn't be doing very much work though - I'll do some digging to see what's going on there.

The final output is from the Intel performance tuning overview tools. You can find them here - http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization . There's a nice overview tool (pcm.x) which will output the basic system overview. I like this; it gives a very simple overview of how efficient things are.
Figure 3. "pmc.x 1" running on FreeBSD-10.

Now, this isn't a stock version of pcm.x - I've hacked it up to look slightly saner when doing live reporting - but it still provides exactly the same output in that format. Note the instructions per CPU cycle and the amount of cache misses. It's .. very inefficient. Tsk.

So in summary - don't just do instruction count based profiling. You want to establish whether there are any memory and cache bottlenecks. If you're doing HPC, you want to also check to see if you're hitting SSE, FPU, divider unit and other kinds of math processing stalls.

Now - what would I like to see in FreeBSD?

  • The hwpmc framework needs to grow support for more of the socket and system events - right now it's very "core" focused.
  • Some better top-level tools to provide a system summary like Intel's pcm.x tool would be nice.
  • Some better documentation (read: better than just this wiki page!) looking at how to actually profile and understand system behaviour would be desirable.
  • .. and tying this into dtrace would be great too.
  • Once we get some (more) NUMA awareness, it would be great to have the uncore stuff reporting on things like QPI traffic, cache line and memory accesses from remote sockets/cores, and other stuff to do with NUMA allocation and awareness.
Mostly, however, I'd like to get this stuff into the hands of developers sooner rather than later so they can start running this as a normal day to day thing.


  1. Hi Adrian,
    I think it's not correct to say the "uncore stuffs" are not supported. Take a look at sys/dev/hwpmc/hwpmc_uncore.c where the relevant bits are. Also, I'm the guy who added support for SandyBrigde (Model 0x2A) and I'm pretty sure uncore support is there (at least for that platform). So, I think it's fair to say "some microarchs lack uncore support" rather than "FreeBSD lacks uncore support".



  2. Aha! Let me amend the blog post! Thanks Davide!