Tuesday, July 14, 2015

FreeBSD now has NUMA? Why'd it take so long?

I just committed "NUMA" to FreeBSD. Well, no, I didn't. I did almost no actual NUMA-y work in FreeBSD. I just exposed the existing NUMA stuff in FreeBSD out and re-enabled it.

FreeBSD-9 introduced basic NUMA awareness in the physical allocator (sys/vm/vm_phys.c.) It implemented first-touch page allocation, and then fell back to searching through the domains, round-robin style. It wasn't perfect, for some workloads it was apparently okay. But it had some shortcomings - it wasn't configurable, UMA and other subsystems didn't know about NUMA domains, and the scheduler really didn't know about NUMA domains. So I'm sure there are plenty of workloads which it didn't work for.

That was all ripped out before FreeBSD-10. FreeBSD-10 NUMA just implements round-robin physical page allocation. It still tracks the per-domain physical memory regions, but it doesn't do any kind of NUMA aware allocation. From what I can gather, it was removed until something 'better' would land.

However, nothing (yet) has landed. So I decided I'd take a look into it. I found that for a lot of simple workloads (ie, where you're doing lots of anonymous memory allocation - eg, you're doing math crunching) the FreeBSD-9 model works fine. It's also a perfectly good starting point for experimenting.

So all my NUMA work in -HEAD does is provide an API to exactly the above. It doesn't teach the kernel APIs about domain aware allocations - there's currently no way to ask for memory from a specific domain when calling UMA, or contigmalloc, etc. The scheduler doesn't know about NUMA, so threads/processes will migrate off-socket very quickly unless you explicitly limit things. Devices don't yet do NUMA local work - the ACPI code is in there to enumerate which NUMA domain they're in, but it's not used anywhere just yet.

Then what is it good for?

If you're doing math workloads where you read in data into memory, do a bunch of work, and spit it out - it works fine. If you're running bhyve instances, you can run them using numactl and have them pinned to a local NUMA domain. Those coarse-grained things work fine. You can also change the system default back to round-robin and use first-touch or fixed-domain for specific processes. It's useful for exactly the same subset of tasks as it was in FreeBSD-9, but now it's at least configurable.

So what's next?

Well, my main aim is to get the minimum done so kernel side work is NUMA aware. This includes UMA, contigmalloc, malloc, mbuf allocation and such. It'd be nice to tag VM objects with a domain allocation policy, but that's currently out of scope. I'd also like to plumb in domain configuration into devices and allow devices to allocate memory for different driver threads with different policies.

But the first thing that showed up is that KVA allocation and superpages get in the way of malloc/contigmalloc working. Allocating memory in FreeBSD first allocates KVA space, then back-fills it with pages. As far as malloc/contigmalloc is concerned, KVA is KVA and it finds the first available space in a time-fast way. It then backfills it with physical pages. The superpage reservation bits (sys/vm/vm_reserv.[ch]) join together regions that are contiguous and in the same superpage and turn it into an allocation from the same superpage. These have no idea about NUMA domains. So, if you allocate a 4KiB page via malloc() from domain 0 and then try to allocate a 4KiB page from domain 1, it will likely mess it up:

  • First page gets allocated - first KVA, then the underlying 2mb superpage is allocated and a 4k page is returned - from physical memory domain 0;
  • Second page gets allocated - first KVA, and if it's adjacent or within the same 2mb superpage as the above allocation, it'll "fake" the page allocation via refcounting and it'll really be that same underlying superpage - but it's from physical memory domain 0.
I have to teach both vm_reserv and the KVA allocator about NUMA domains, enough so domain specific allocations don't use KVA that's adjacent. It was suggested that I create a second layer of KVA allocators that allocate KVA from the main resource allocator in superpage chunks (here it's 2mb) and then I do domain-specific allocations from them. It'll change how things get fragmented a bit, but it does mean that I won't fall afoul of things.

So, I'll do the above as an experiment and I'll push the VM policy evaluation up a little into malloc/contigmalloc. I'll see how that experiment goes and I'll post diffs for testing/evaluation.

Saturday, July 11, 2015

The importance of mentoring, or "how I got involved in FreeBSD"..

Here's how I was introduced into this UNIX world, or "wait, WHO was your WHAT?"

So, here's 11ish or so year old Adrian. It's the early 90s. I was hiding in my bedroom, trying to make another crystal set out of random parts and scraping away the paint at my windowsill. In walks my Aunty, who introduces her new boyfriend.

"Hi, I'm Julian." he said. That wasn't all that interesting.

"Oh, are you making a crystal set?" .. ok, so that was interesting.

And, that was that. Suddenly, someone role-model-y shows up in my life out of the blue. There I was, an 11 year old who felt very mostly alone most of the time, and someone shows up who I can look up to and think I can relate to. So, I'm a sponge for everything he shows me. Whenever he comes over, he has some new story to tell, some new thing to show me. He would show me better ways of building transistor switch circuits when I was in the "make large arcs with car alternator" phase of my early teens. And, when I saved up and bought a PC, he started to show me programming.

Now, I was already programming. My parents had saved up and bought me an Amstrad CPC464. We had a second-hand commodore 64 for a short while, but that eventually somehow stopped working and I didn't have the clue to fix it. But I was programming Locomotive BASIC and dabbling in Z80 assembly when I was 12, and had "upgraded" to Turbo Pascal 6 when I hit high school. (Yes, school taught Turbo Pascal at Grade 10 level, and I decided to learn it a bit earlier. That's .. wow, that dates me.) I hadn't yet really stumbled into C yet. I had heard about it, but I didn't have anything that could write it.


Julian explained task switching to me one day during a walk along the beach. He explained that computers can just appear to be doing multiple things at once - but the CPU only does one thing at a time, and you can just switch things really quickly to give the appearance that it's multitasking. With that bright spark planted in my head, I went home and started dreaming up ways to make my Z80 based CPC do something like this.

My mother dragged me to McDonalds to apply for a job the moment I was legally able to (14 years, 9 months) and I saw a computer at a second hand shop - it was a $500 IBM PC/AT, with EGA monitor, two floppy disks and a printer. We put down a down-payment and I paid it off myself with my minimum wage money. Once I had that home I quickly erm, "acquired" a copy of Turbo Pascal for home and was off drawing funny little fractals.

So yes - it's Julian's fault I discovered FreeBSD. Yes, this is Julian Elischer. One day he showed me his computer, running something called BSD. He was trying to explain bourne shell scripting and the installer. I nodded, very confused, and eventually went back to the VGA programming book he lent me. He also showed me fractint running in X on his monochome 486 DX2-50 laptop. I had no idea what was going on under the scenes, only that the fractals were much more interesting than the ones I was drawing. So I took the VGA book home and started learning how to use the higher resolutions available. One thing stuck in my mind: so much bit-plane work. Ugh. One other thing stuck in my mind - reading from VGA memory is one of the slowest things you can do. Don't do it. Ever. (Do you hear that console driver authors? Don't do it. It's bad.)

One day he explained pointers to me. I had erm, "acquired" a copy of Turbo C 2.0 from a friend after failing to make much traction with the less friendly versions (Tiny C, for example.) I had coded up a few things, but I didn't really "get" it. So he sat me down with a pen and paper, and drew diagrams to explain what was going on. I remember that lightbulb going off in the back of my mind, as I dimly connected the whole idea of types and sizes together - and that was it. I was off and doing bad things to C code.

I eventually saved up enough for an updated 286 motherboard, then an updated graphics card (full VGA!), then a sound blaster card, and finally a 486-DX33 motherboard. He introduced me to his friend Peter (who had, and I believe still has, a rather extensive electronics collection) and handed me a FreeBSD-1.1 CDROM. I took it home, put it in, and .. it didn't do anything. My 486 had a soundblaster pro + CD-ROM, and .. well, FreeBSD-1.1 didn't speak to that hardware. So, I eventually put Slackware Linux 3.0 on the thing, and became a Linux nerd for a bit.

I did eventually try FreeBSD-1.1 on it - after putting a lot of FreeBSD bits on a lot of floppies - but I couldn't figure out what to do when it booted. This is going to sound silly - but the lack of colorls turned me off. I know, it seems silly now, but that's honestly why I went back to Slackware.

I eventually went back to FreeBSD in the 2.x era once I had an IDE CDROM and I was working part time at an ISP after (high) school finished. Yes, I figured out how to get colorls to work, I got in trouble disagreeing with a Michael (O, not M) at iiNet about Squid on Linux versus FreeBSD, and well.. stuff. Here was this 17yo kid disagreeing with things and acting like he knew everything. I'm sure it was endearing.

Fast-forward a couple years, and I had been hacking on FreeBSD here and there. I got in a little erm, "trouble" before I finished high school, which phk reminded me of - when they granted me a commit bit. I forget when this was, but I wouldn't have been much older than 20.

So - this is why mentoring kids is important. It may seem like a waste of time; it may seem like they don't understand, but we were all there once. We wanted someone to relate to, someone to look up to, and something interesting to do. Julian was that person for me, and I owe both him and my mother (of course) pretty much everything about my existence in this silly little computer industry.

(This is also why you don't skimp on hardware support for popular, if cheaper platforms and "shiny" looking features if you want people to adopt your stuff -  but that's a different rant.)

Ok, that's done. I'm going back to hacking on VGA/VESA boot loader support for FreeBSD-HEAD. That's long overdue, and I want my pretty splash screen.