Tuesday, July 14, 2015

FreeBSD now has NUMA? Why'd it take so long?

I just committed "NUMA" to FreeBSD. Well, no, I didn't. I did almost no actual NUMA-y work in FreeBSD. I just exposed the existing NUMA stuff in FreeBSD out and re-enabled it.

FreeBSD-9 introduced basic NUMA awareness in the physical allocator (sys/vm/vm_phys.c.) It implemented first-touch page allocation, and then fell back to searching through the domains, round-robin style. It wasn't perfect, for some workloads it was apparently okay. But it had some shortcomings - it wasn't configurable, UMA and other subsystems didn't know about NUMA domains, and the scheduler really didn't know about NUMA domains. So I'm sure there are plenty of workloads which it didn't work for.

That was all ripped out before FreeBSD-10. FreeBSD-10 NUMA just implements round-robin physical page allocation. It still tracks the per-domain physical memory regions, but it doesn't do any kind of NUMA aware allocation. From what I can gather, it was removed until something 'better' would land.

However, nothing (yet) has landed. So I decided I'd take a look into it. I found that for a lot of simple workloads (ie, where you're doing lots of anonymous memory allocation - eg, you're doing math crunching) the FreeBSD-9 model works fine. It's also a perfectly good starting point for experimenting.

So all my NUMA work in -HEAD does is provide an API to exactly the above. It doesn't teach the kernel APIs about domain aware allocations - there's currently no way to ask for memory from a specific domain when calling UMA, or contigmalloc, etc. The scheduler doesn't know about NUMA, so threads/processes will migrate off-socket very quickly unless you explicitly limit things. Devices don't yet do NUMA local work - the ACPI code is in there to enumerate which NUMA domain they're in, but it's not used anywhere just yet.

Then what is it good for?

If you're doing math workloads where you read in data into memory, do a bunch of work, and spit it out - it works fine. If you're running bhyve instances, you can run them using numactl and have them pinned to a local NUMA domain. Those coarse-grained things work fine. You can also change the system default back to round-robin and use first-touch or fixed-domain for specific processes. It's useful for exactly the same subset of tasks as it was in FreeBSD-9, but now it's at least configurable.

So what's next?

Well, my main aim is to get the minimum done so kernel side work is NUMA aware. This includes UMA, contigmalloc, malloc, mbuf allocation and such. It'd be nice to tag VM objects with a domain allocation policy, but that's currently out of scope. I'd also like to plumb in domain configuration into devices and allow devices to allocate memory for different driver threads with different policies.

But the first thing that showed up is that KVA allocation and superpages get in the way of malloc/contigmalloc working. Allocating memory in FreeBSD first allocates KVA space, then back-fills it with pages. As far as malloc/contigmalloc is concerned, KVA is KVA and it finds the first available space in a time-fast way. It then backfills it with physical pages. The superpage reservation bits (sys/vm/vm_reserv.[ch]) join together regions that are contiguous and in the same superpage and turn it into an allocation from the same superpage. These have no idea about NUMA domains. So, if you allocate a 4KiB page via malloc() from domain 0 and then try to allocate a 4KiB page from domain 1, it will likely mess it up:

  • First page gets allocated - first KVA, then the underlying 2mb superpage is allocated and a 4k page is returned - from physical memory domain 0;
  • Second page gets allocated - first KVA, and if it's adjacent or within the same 2mb superpage as the above allocation, it'll "fake" the page allocation via refcounting and it'll really be that same underlying superpage - but it's from physical memory domain 0.
I have to teach both vm_reserv and the KVA allocator about NUMA domains, enough so domain specific allocations don't use KVA that's adjacent. It was suggested that I create a second layer of KVA allocators that allocate KVA from the main resource allocator in superpage chunks (here it's 2mb) and then I do domain-specific allocations from them. It'll change how things get fragmented a bit, but it does mean that I won't fall afoul of things.

So, I'll do the above as an experiment and I'll push the VM policy evaluation up a little into malloc/contigmalloc. I'll see how that experiment goes and I'll post diffs for testing/evaluation.

No comments:

Post a Comment