Non Maskable Interrupt

From OSDev.wiki
Latest revision as of 04:39, 9 June 2024

The Non-Maskable Interrupt (NMI) is a hardware-driven interrupt much like the PIC interrupts, but it goes either directly to the CPU or via another controller (e.g., the ISP), in which case it can be masked.

About

NMIs occur for RAM errors and unrecoverable hardware problems. On newer computers these conditions may instead be handled using machine check exceptions and/or SMI. The newest chipsets (at least Intel's) also add a pile of TCO ("total cost of ownership") logic that ties into all of this, with a special "TCO IRQ" and connections to SMI/SMM. The TCO logic can be connected to an onboard Ethernet controller, and at least part of it is intended for remote monitoring of the system. Unfortunately, the chipset documentation doesn't say how BIOSes normally configure the chipset, and the chipsets themselves support several different options in each case. A RAM error, for example, could be handled by the chipset itself, could generate an SMI (where the BIOS/SMM handler does "RAM scrubbing" in software), or could generate a "TCO interrupt". Add it all up and you get a huge, complex mess (TCO + SMI + SMBus + northbridge + PCI bus/controllers + PCI-to-LPC bridge + god-knows-what) that can be completely different between motherboards, even ones with the same chipset.

The short version of this story is that there are really only two reasons for an NMI. The first is a hardware failure. The second is a "watchdog timer", which can be used to detect when the kernel itself locks up (and is sometimes also used for more accurate profiling, since it allows EIP to be sampled even when IRQs are disabled).
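The lock-up detection idea can be sketched as follows. This is a minimal illustration, not taken from any particular kernel: the names watchdog_state, watchdog_heartbeat, watchdog_nmi_tick and the WATCHDOG_LIMIT threshold are all hypothetical. It assumes an ordinary (maskable) timer IRQ bumps a heartbeat counter, and that the NMI handler calls watchdog_nmi_tick() on every watchdog NMI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: consecutive watchdog NMIs without progress
 * before the kernel is considered locked up. */
#define WATCHDOG_LIMIT 5u

struct watchdog_state {
    volatile uint64_t heartbeat; /* bumped from a normal, maskable timer IRQ */
    uint64_t last_seen;          /* heartbeat value at the previous NMI */
    unsigned stalled_ticks;      /* NMIs seen since the heartbeat last moved */
};

/* Called from the ordinary timer IRQ handler. If IRQs stay disabled
 * (or the kernel hangs), this stops being called. */
void watchdog_heartbeat(struct watchdog_state *w)
{
    w->heartbeat++;
}

/* Called from the NMI handler on every watchdog NMI.
 * Returns true once the heartbeat has not moved for WATCHDOG_LIMIT NMIs. */
bool watchdog_nmi_tick(struct watchdog_state *w)
{
    uint64_t now = w->heartbeat;

    if (now != w->last_seen) {
        /* The kernel made progress since the last NMI. */
        w->last_seen = now;
        w->stalled_ticks = 0;
        return false;
    }

    w->stalled_ticks++;
    return w->stalled_ticks >= WATCHDOG_LIMIT;
}
```

Because the NMI still arrives while IRQs are disabled, the handler can see that the heartbeat has stopped even though the timer IRQ itself can no longer run.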

If a hardware failure caused an NMI, there is often no way to pinpoint exactly which piece of hardware caused it (the system control ports described under Usage give only a coarse indication). In this case you may want to inform the user that a hardware error has occurred, after which the kernel should shut down or reset the machine.

The watchdog timer must be set up by the OS first. This can be done even when the chipset doesn't have a dedicated watchdog timer (e.g. by setting the PIT, RTC/CMOS IRQ or an HPET IRQ to "NMI, send to all CPUs" in the I/O APIC). In this case you want handling of the watchdog timer to be fast (i.e. no slow hardware task switching and cache flushing), and you'd also want all CPUs to share the same timer, which means all CPUs would receive the same IRQ at the same time.
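As a rough sketch of the "NMI, send to all CPUs" setting, the helper below builds a 64-bit I/O APIC redirection table entry with the delivery mode field set to NMI. The function names are hypothetical. The bit positions follow the commonly documented 82093AA-style layout (delivery mode in bits 8 to 10, destination in bits 56 to 63), and using physical destination 0xFF as a broadcast is an assumption that should be checked against the APIC mode actually in use.

```c
#include <stdint.h>

/* Redirection entry fields (82093AA-style layout, assumed here). */
#define IOAPIC_DELIVERY_NMI (4ull << 8) /* delivery mode 100b = NMI */
#define IOAPIC_DEST_SHIFT   56          /* destination field, bits 56..63 */

/* Register index of the low/high dword of the entry for input pin n. */
uint8_t ioapic_redtbl_low(uint8_t pin)  { return (uint8_t)(0x10 + 2 * pin); }
uint8_t ioapic_redtbl_high(uint8_t pin) { return (uint8_t)(0x11 + 2 * pin); }

/* Build an unmasked, edge-triggered, physical-destination entry that is
 * delivered as an NMI. The vector field is left at 0 because it is not
 * used for NMI delivery. */
uint64_t ioapic_nmi_entry(uint8_t dest)
{
    return IOAPIC_DELIVERY_NMI | ((uint64_t)dest << IOAPIC_DEST_SHIFT);
}
```

The two halves of the returned value would then be written to the registers given by ioapic_redtbl_low() and ioapic_redtbl_high() for whichever I/O APIC input the chosen timer is wired to, which varies between systems.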

As an alternative, you could also use the local APIC's timer or the performance monitoring counter overflow for a "per CPU" watchdog timer.

Usage

The NMI line is asserted (set high) by the memory module when a memory parity error occurs.

Be careful about disabling the NMI and the PIC for extended periods of time (mind you, watchdog timers typically use NMIs).

On the XT the NMI is masked by clearing bit 7 of I/O port 0xA0 (setting the bit enables it). On the AT the NMI is masked by setting bit 7 of I/O port 0x70. This port is shared with the CMOS RAM index register, which uses bits 0 through 6. The CMOS RTC expects a read from or write to the data port 0x71 after any write to index port 0x70, or it may go into an undefined state. An I/O delay may also be needed between accessing the index and data registers. The index port 0x70 may be a write-only port that always returns 0xFF on read. In that case the bit masking below, which tries to preserve bits 0 through 6 of the CMOS index register, will not work, and the current state of the NMI mask cannot be read back from port 0x70.

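 /* Clear bit 7 of port 0x70 to unmask (enable) the NMI. */
 void NMI_enable(void) {
    outb(0x70, inb(0x70) & 0x7F);
    inb(0x71);  /* follow-up access to the data port, as the RTC expects */
 }

 /* Set bit 7 of port 0x70 to mask (disable) the NMI. */
 void NMI_disable(void) {
    outb(0x70, inb(0x70) | 0x80);
    inb(0x71);  /* follow-up access to the data port, as the RTC expects */
 }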
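Since port 0x70 may not be readable, one possible workaround is to keep a software copy ("shadow") of the NMI mask bit and the selected CMOS register, and to compute the byte to write from that copy instead of reading the port back. The sketch below shows only that bookkeeping. The struct and function names are hypothetical, and the actual port write (an outb to 0x70 followed by an access to 0x71, as in the functions above) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define CMOS_NMI_DISABLE 0x80u /* bit 7 of port 0x70: 1 = NMI masked */
#define CMOS_INDEX_MASK  0x7Fu /* bits 0..6: CMOS register index */

/* Software copy of what was last written to port 0x70. */
struct nmi_shadow {
    bool    masked; /* true if the NMI is currently masked */
    uint8_t reg;    /* currently selected CMOS register (0..127) */
};

/* Combine the shadowed state into the byte to write to port 0x70. */
static uint8_t nmi_shadow_byte(const struct nmi_shadow *s)
{
    return (uint8_t)((s->reg & CMOS_INDEX_MASK) |
                     (s->masked ? CMOS_NMI_DISABLE : 0u));
}

/* Change the NMI mask while keeping the selected register.
 * Returns the byte the caller should write to port 0x70. */
uint8_t nmi_shadow_set_mask(struct nmi_shadow *s, bool masked)
{
    s->masked = masked;
    return nmi_shadow_byte(s);
}

/* Select a CMOS register while keeping the NMI mask.
 * Returns the byte the caller should write to port 0x70. */
uint8_t nmi_shadow_select(struct nmi_shadow *s, uint8_t reg)
{
    s->reg = reg & CMOS_INDEX_MASK;
    return nmi_shadow_byte(s);
}
```

This only works if every write to port 0x70 in the kernel goes through the same shadow, and the initial state left by the firmware still has to be assumed rather than read.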

When an NMI occurs you can check System Control Ports A and B, at I/O addresses 0x92 and 0x61 respectively, to get an indication of what caused it:

System Control Port A (0x92) layout:

 Bit  Description
 0    Alternate hot reset
 1    Alternate gate A20
 2    Reserved
 3    Security Lock
 4*   Watchdog timer status
 5    Reserved
 6    HDD 2 drive activity
 7    HDD 1 drive activity

System Control Port B (0x61) layout:

 Bit  Description
 0    Timer 2 tied to speaker
 1    Speaker data enable
 2    Parity check enable
 3    Channel check enable
 4    Refresh request
 5    Timer 2 output
 6*   Channel check
 7*   Parity check

The important bits are marked with an '*'. The Channel check bit indicates a failure on the bus, probably caused by a peripheral device such as a modem, sound card, or NIC, while the Parity check bit indicates a memory read or write failure.
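The starred bits can be decoded along these lines. This is a minimal sketch: the enum and function names are hypothetical, and the two arguments are assumed to be the values an NMI handler has just read from ports 0x92 and 0x61 (for example with the inb() helper used earlier).

```c
#include <stdint.h>

/* Starred bits from the tables above. */
#define PORT_A_WATCHDOG      (1u << 4) /* port 0x92, bit 4 */
#define PORT_B_CHANNEL_CHECK (1u << 6) /* port 0x61, bit 6 */
#define PORT_B_PARITY_CHECK  (1u << 7) /* port 0x61, bit 7 */

/* Hypothetical flag set describing what the ports report. */
enum nmi_cause {
    NMI_CAUSE_WATCHDOG      = 1u << 0,
    NMI_CAUSE_CHANNEL_CHECK = 1u << 1, /* bus / peripheral failure */
    NMI_CAUSE_PARITY_CHECK  = 1u << 2  /* memory read or write failure */
};

/* Translate raw port values into a bitmask of enum nmi_cause flags.
 * Returns 0 if none of the starred bits is set. */
unsigned nmi_decode_cause(uint8_t port_a, uint8_t port_b)
{
    unsigned cause = 0;

    if (port_a & PORT_A_WATCHDOG)
        cause |= NMI_CAUSE_WATCHDOG;
    if (port_b & PORT_B_CHANNEL_CHECK)
        cause |= NMI_CAUSE_CHANNEL_CHECK;
    if (port_b & PORT_B_PARITY_CHECK)
        cause |= NMI_CAUSE_PARITY_CHECK;

    return cause;
}
```

A handler would typically call nmi_decode_cause(inb(0x92), inb(0x61)) and treat a result of 0 as an NMI of unknown origin.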