(And I do hope that the implementation actually works!)
When the kernel isn't running any work, it will schedule the "idle" thread to run. This has the simple(!) task of entering a sleep state, only to wake up when another interrupt fires. This saves power consumption and generates less heat.
However, it's a little trickier than just "enter idle state."
The only way to exit it is to receive an interrupt. Now, this can be device interrupts; this can be timer interrupts (and yes; the timer is a kind of device, tsk..) but without it, the CPU will stay halted. This isn't a big deal - the UNIX kernel tends to be one big event processing "thing" and these include both software and hardware events. If it's a software event that needs to happen "now", the kernel won't go into the idle loop - it'll just run the needed event. If it has to happen in the future, it'll schedule it to occur after some time has elapsed - and this will be driven by a timer interrupt. An interrupt would wake up the system from the HALT state. If the interrupt occured just before the HALT instruction was executed, there'd be a very small window where the interrupt would occur - the interrupt routine would be called, complete, and then the HALT instruction would next execute.
Now, imagine you're an ethernet driver on FreeBSD-10 where the interrupt handler just scheduled some ithread to run in the future. Here's one example of what may happen:
- The idle thread is executed;
- An interrupt occurs, say to signal completion of some ethernet frame transmission;
- The CPU kicks off the interrupt code;
- The interrupt code doesn't find a fast interrupt handler but finds an ithread; so it schedules the ithread to run;
- The scheduler goes to choose another thread to run and finds an ithread scheduled at a higher priority, so it schedules that;
- The ethernet driver code in the ithread runs;
- The scheduler then exits and re-enters the idle loop.
Everything is fine, right?
But, with the event timer changes that came in during the 9.x time frame, the halt code is now in a critical section.
So now, the following happens:
- The idle thread is executed;
- critical_enter() is called;
- An interrupt occurs, which calls the fast handler, which schedules the ithread to run;
- .. but critical_enter() has set a flag that says "no preemption", so the ithread doesn't get to run;
- Control is returned to the idle function;
- The idle thread gets to the point where it executes "HALT";
- The idle thread continues to run, executing HALT.
In this instance, the ithread is scheduled but can't run before HALT runs. The ithread only runs after the next interrupt occurs.
I noticed this was happening when doing traffic tests on FreeBSD-HEAD between 802.11 and ethernet interfaces. The atheros 802.11 hardware implemented interrupt moderation so there wouldn't be tens of thousands of interrupts a second being generated. But what I saw was occasionally the receive queue being filled by packets and not drained fast enough. When digging into it, I found that due to interrupt moderation, if the interrupt came in just before WAIT was executed, the 802.11 receive function was taking a long time (sometimes up to milliseconds) to run after the interrupt came in and the ithread was actually scheduled. If I had an interrupt for each received packet, the amount of time between interrupts would have been very small (20,000 packets a second, so around 1/20000 sec per interrupt) and this problem would've been masked. But with moderated interrupts, it would be 750 microseconds or so before the next receive interrupt was generated.
Now, this is messy. There's some hacks in the idle loop code to try and skip the halt bit if the scheduler detects there's something to run. But there's still a small race window there which needs to be closed.
How can this be solved?
Apparently - the two instructions STI;HLT on x86 are atomic. Ie, there's no race window between them. If an interrupt comes through, HLT doesn't run. This doesn't happen for MWAIT or the ACPI sleep states and I am concerned we're still possibly hitting this race window from time to time. The specific behaviour is that the STI causes interrupts to be enabled following the next instruction. So yes, there's no window.
All that's left now is to make sure that interrupts are disabled before you do the scheduler check so no new interrupt processing on that thread can be scheduled. Ie, when entering cpu_idle():
All that's left now is to make sure that interrupts are disabled before you do the scheduler check so no new interrupt processing on that thread can be scheduled. Ie, when entering cpu_idle():
- Call critical_enter();
- Disable interrupts;
- See if the scheduler has anything to do - if so, enable interrupts and skip calling the idle loop
- If there's nothing to do - and since interrupts are disabled, nothing new will have happened (like an interrupt scheduling an ithread) then just call the idle function
- Which may or may not enable interrupts before entering the idle loop (you enter ACPI with interrupts disabled, but you enter HLT with interrupts enabled)
- .. and then call critical_exit(), which will let the kernel continue preempting.
For MIPS however, there's a clever hack. (No, not from me.)
Here's how it works:
Here's how it works:
- In the idle loop, it calls mips_wait()
- mips_wait() is a bit of assembly code that will:
- disable interrupts
- see if the scheduler sees anything running
- .. and if so, it doesn't bother running the WAIT instruction! Just enable interrupts and jump over the WAIT;
- .. but the bit of code that re-enables interrupts and calls WAIT is aligned to a 16 byte boundary and the address is a symbol (MipsWaitStart).
- Then in the exception handling code (MipsKernIntr), it sees if the instruction pointer where the exception occured is in the 16 bytes (4 instructions) at MipsWaitStart.
- .. if it is, it adjusts the return address from the interrupt to be after the WAIT instruction.
It's totally dirty and to be quite honest, I haven't at all tested it. Yes, it should be tested.