The x86 FPU was originally an optional addition to the processor that performed floating point math in hardware. It has since been integrated into the CPU proper and has, over the years, collected the majority of math-heavy instructions. "FPU" has become a legacy term for what are now the vector processing units, which happen to include the original floating point operations.

x86 FPU Legacy

Originally, the FPU was a dedicated coprocessor chip that sat alongside the actual processor. Since it performed calculations asynchronously from the core logic, its results only became available after the main processor had executed several other instructions. Since errors would also surface asynchronously, the original PC had the error line of the FPU wired to the interrupt controller. When the 486 added multiprocessor support, it became impossible to detect which of the FPUs had raised an exception, so the FPU was integrated on-die and an option was added to signal a regular exception rather than an interrupt. To provide backwards compatibility, the 486 was given a pin to replace the original FPU error line, which is routed to the PIC and then back into the CPU's IRQ line to simulate the original setup with a dedicated coprocessor. The unfortunate consequence is that, by default, floating point exceptions do not operate as recommended by the manual.

FPU configuration

Due to the many forms of FPUs and vector units, some logic is required to get them in the expected state.

Detecting an FPU

On 386s, FPUs were external and strictly optional. The 486 came in an FPU-included and an FPU-less package, with the "FPU upgrade" being just a modified 486 that disabled its lesser counterpart. From the Pentium onwards, FPUs were always integrated and present. To make things trickier, 386s were capable of operating with both a 287 (the 286's FPU) and the 387 (the intended FPU).

There are several ways to detect an FPU:

Check the FPU bit in CPUID.
Check the EM bit in CR0. If it is set, the FPU is not meant to be used.
Check the ET bit in CR0. If it is clear, the CPU did not detect an 80387 on boot.
Probe for an FPU.

The correct order is debatable. The current official manuals state that attempting to use the FPU when none is present will lock up the CPU. Many sources nevertheless contain probing code of varying complexity, with the common consensus that FWAIT and actual calculations should not be performed during the probe. The EM and ET bits can also be modified by code and might not hold the right values. Different wirings on actual hardware may also cause a 386 to not detect an FPU as an 80387, leaving the ET bit with the wrong value on boot.
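The register-based checks can be combined in a small decision helper. The following C sketch is only an illustration: the function name, the enum, and above all the ordering of the checks are assumptions, not an established recipe. The caller is expected to supply the register values (CPUID leaf 1 EDX, if the CPUID instruction exists at all, and CR0) read elsewhere, and the result is a hint rather than proof.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_EDX_FPU (1u << 0)  /* CPUID leaf 1, EDX bit 0: x87 FPU on chip */
#define CR0_EM        (1u << 2)  /* emulation requested, do not use the FPU */
#define CR0_ET        (1u << 4)  /* 386: set when an 80387 was detected */

/* Hypothetical result type: what the register checks suggest. */
enum fpu_hint { FPU_PRESENT, FPU_ABSENT, FPU_NEEDS_PROBE };

/*
 * Combine the non-probing checks.
 * have_cpuid: whether the CPUID instruction is available at all.
 * The ordering used here is an assumption, not an official recommendation.
 */
static enum fpu_hint fpu_hint_from_registers(bool have_cpuid,
                                             uint32_t cpuid_edx,
                                             uint32_t cr0)
{
    if (have_cpuid)
        return (cpuid_edx & CPUID_EDX_FPU) ? FPU_PRESENT : FPU_ABSENT;
    if (cr0 & CR0_EM)
        return FPU_ABSENT;       /* someone asked for emulation */
    if (cr0 & CR0_ET)
        return FPU_PRESENT;      /* the 386 saw an 80387 at reset */
    return FPU_NEEDS_PROBE;      /* could still be a 287, or a miswired 387 */
}
```

Since EM and ET may have been modified or mis-detected, a cautious kernel might treat anything other than a CPUID answer as a reason to probe anyway.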

The common way of testing for the presence of an FPU is to have it write its status somewhere and then check whether it actually did.

MOV EDX, CR0                            ; Start probe, get CR0
AND EDX, (-1) - (CR0_TS + CR0_EM)       ; clear TS and EM to force fpu access
MOV CR0, EDX                            ; write the modified value back to CR0
FNINIT                                  ; load defaults to FPU
FNSTSW [.testword]                      ; store status word
CMP word [.testword], 0                 ; compare the written status with the expected FPU state
JNE .nofpu                              ; jump if the FPU hasn't written anything (i.e. it's not there)
JMP .hasfpu

.testword: DW 0x55AA                    ; store garbage to be able to detect a change

To distinguish a 287 from a 387, you can test whether the FPU sees a difference between +infinity and -infinity.

FPU control

Whether or not an FPU is found, you should set up the control registers to match the result of the detection.

CR0.EM (bit 2; counting starts at bit 0 making this the third bit)

If the EM bit is set, all x87 FPU instructions raise #NM so that they can be EMulated in software, and MMX and SSE instructions raise #UD. It should be clear to actually be able to use the FPU.

CR0.ET (bit 4)

This bit is used on the 386 to tell it how to communicate with the coprocessor: 0 for a 287, 1 for a 387 or later. This bit is hardwired on 486+.

CR0.NE (bit 5)

When set, this bit enables Native Exception handling, which reports FPU errors as regular exceptions. When cleared, FPU errors are reported via the interrupt controller. It should be on for 486+, but not on 386s because they lack that bit.

CR0.TS (bit 3)

Task switched. The FPU state is designed to be switched lazily to save read and write cycles. If this bit is set, all meaningful operations will cause an #NM exception so that the OS can back up the FPU state. The bit is set automatically on a hardware task switch and can be cleared with the CLTS opcode. An OS that switches tasks in software may want to set this bit manually on each reschedule if it wants to store FPU state lazily.
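Lazy switching can be illustrated with a small simulation in C. Everything here is hypothetical: the structures, the function names, and the 512-byte state size merely stand in for a real scheduler, a real #NM handler, and real save and restore instructions. The sketch only models the bookkeeping: set TS on every reschedule, and move FPU state only when a task actually touches the FPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CR0_TS (1u << 3)

struct task { uint8_t fpu_state[512]; };  /* hypothetical per-task save area */

struct cpu {
    uint32_t cr0;            /* simulated CR0 */
    uint8_t fpu_regs[512];   /* simulated live FPU register file */
    struct task *fpu_owner;  /* task whose state currently lives in fpu_regs */
    struct task *current;    /* task currently running */
};

/* Software task switch: leave the FPU state alone, just arm the #NM trap. */
static void reschedule(struct cpu *c, struct task *next)
{
    c->current = next;
    c->cr0 |= CR0_TS;
}

/* What an #NM handler would do on the first FPU use after a switch. */
static void handle_nm(struct cpu *c)
{
    c->cr0 &= ~CR0_TS;                 /* the effect of CLTS */
    if (c->fpu_owner == c->current)
        return;                        /* the live state is still ours */
    if (c->fpu_owner != NULL)          /* stands in for saving the old state */
        memcpy(c->fpu_owner->fpu_state, c->fpu_regs, sizeof c->fpu_regs);
    /* stands in for restoring the new owner's state */
    memcpy(c->fpu_regs, c->current->fpu_state, sizeof c->fpu_regs);
    c->fpu_owner = c->current;
}
```

A task that never executes an FPU instruction never triggers handle_nm, so its time slices cost no FPU save or restore at all.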

CR0.MP (bit 1)

This bit does little more than decide whether an FWAIT opcode responds to the TS bit: when MP is set, FWAIT raises #NM if TS is set, and when MP is clear, FWAIT ignores TS. Since FWAIT forces serialisation of exceptions, MP should normally be set to the inverse of the EM bit, so that FWAIT actually causes an FPU state update when FPU instructions are asynchronous, and not when they are emulated.
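The CR0 recommendations above can be condensed into one pure function. This is a minimal sketch under assumptions: the function name and parameters are made up for illustration, the caller reads and writes CR0 itself, and clearing TS at setup time is a choice of this sketch rather than something the text prescribes.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_MP (1u << 1)
#define CR0_EM (1u << 2)
#define CR0_TS (1u << 3)
#define CR0_NE (1u << 5)

/*
 * Compute a new CR0 value from the old one.
 * has_fpu:    result of FPU detection.
 * has_ne_bit: true on a 486 or later, where CR0.NE exists.
 * ET is left untouched: it is set up at reset on a 386 and hardwired later.
 */
static uint32_t cr0_for_fpu(uint32_t cr0, bool has_fpu, bool has_ne_bit)
{
    cr0 &= ~(CR0_MP | CR0_EM | CR0_TS);  /* start from a known state */
    if (has_fpu)
        cr0 |= CR0_MP;                   /* MP is the inverse of EM */
    else
        cr0 |= CR0_EM;                   /* trap FPU opcodes for emulation */
    if (has_ne_bit)
        cr0 |= CR0_NE;                   /* native exception reporting */
    return cr0;
}
```

Keeping the computation separate from the privileged MOV to CR0 makes the policy easy to check on a development machine before it ever runs in a kernel.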

CR4.OSFXSR (bit 9)

Enables 128-bit SSE support. When clear, most SSE instructions will cause an invalid opcode, and FXSAVE and FXRSTOR will only include the legacy FPU state. When set, SSE is allowed and the XMM and MXCSR registers are accessible, which also means that your OS should maintain those additional registers. Trying to set this bit on a CPU without SSE will cause an exception, so you should check for SSE (or long mode) support first.

CR4.OSXMMEXCPT (bit 10)

Enables the #XF exception. When clear, SSE will work until an unmasked SIMD floating point exception would be generated, which is then reported as an invalid opcode instead. When set, the #XF handler is called and the problem may be diagnosed and reported. Again, you can't set this bit without ensuring that SSE support is present.

CR4.OSXSAVE (bit 18)

Enables the XSAVE extension, which is able to save SSE state as well as other next-generation register states. Again, check CPUID before setting this bit: long mode support is not sufficient in this case.
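The CR4 bits can be derived from CPUID in the same spirit. The sketch below is an illustration only: the function name is hypothetical, the caller supplies EDX and ECX from CPUID leaf 1, and requiring the FXSR feature bit in addition to SSE before setting OSFXSR is a cautious assumption of this sketch. CR4 is modelled as 32 bits because all bits of interest fit.

```c
#include <stdint.h>

#define CPUID1_EDX_FXSR  (1u << 24)  /* FXSAVE/FXRSTOR supported */
#define CPUID1_EDX_SSE   (1u << 25)  /* SSE supported */
#define CPUID1_ECX_XSAVE (1u << 26)  /* XSAVE supported */

#define CR4_OSFXSR       (1u << 9)
#define CR4_OSXMMEXCPT   (1u << 10)
#define CR4_OSXSAVE      (1u << 18)

/* Only request features that CPUID reports, to avoid faulting on the write. */
static uint32_t cr4_for_vector_unit(uint32_t cr4,
                                    uint32_t cpuid1_edx,
                                    uint32_t cpuid1_ecx)
{
    if ((cpuid1_edx & CPUID1_EDX_FXSR) && (cpuid1_edx & CPUID1_EDX_SSE))
        cr4 |= CR4_OSFXSR | CR4_OSXMMEXCPT;
    if (cpuid1_ecx & CPUID1_ECX_XSAVE)
        cr4 |= CR4_OSXSAVE;
    return cr4;
}
```

An OS that sets OSXSAVE also takes on the job of managing the extended state it enables, so a kernel without XSAVE-aware context switching may prefer to leave that bit alone.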

Vector unit

MMX, 3DNow! and the rare EMMI reuse the old FPU registers as vector units, aliasing them into 64-bit data registers. This means that they can be used safely without modifying the FPU handling. SSE, however, adds a whole new set of registers and is therefore disabled by default. To allow SSE instructions, CR4.OSFXSR should be set. Be careful, though, since writing it on a processor without SSE support causes an exception. When SSE is enabled, FXSAVE and FXRSTOR should be used to store the entire FPU and vector register file. It is good practice to enable the other SSE bit (CR4.OSXMMEXCPT) as well, so that SSE exceptions are routed to the #XF handler instead of being reported as invalid opcodes. The state of the art includes AVX, which adds

Long Mode

Long mode demands that SSE and SSE2 are available, and compilers are free to use the SSE registers instead of the old FPU registers for floating point operations. This means that your kernel will need to have SSE enabled before using any floating point operations, whereas 32-bit mode might just happen to work without touching CR0/CR4. Also, long mode doubles the registers for SSE, giving you 16 XMM registers rather than the 8 available in 32-bit mode, which implies that more data is in need of saving.
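To make the amount of state concrete, the memory image used by FXSAVE and FXRSTOR can be described as a C structure. This is a sketch, not a definitive definition: the struct and field names are invented here, the layout follows the 512-byte format described in the Intel and AMD manuals, and the alignment attribute assumes GCC as elsewhere in this article.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; the 512-byte, 16-byte-aligned FXSAVE memory image. */
struct fxsave_area {
    uint16_t fcw;            /* x87 control word */
    uint16_t fsw;            /* x87 status word */
    uint8_t  ftw;            /* abridged tag word */
    uint8_t  reserved0;
    uint16_t fop;            /* last x87 opcode */
    uint64_t fip;            /* last instruction pointer (format depends on mode) */
    uint64_t fdp;            /* last data pointer (format depends on mode) */
    uint32_t mxcsr;          /* SSE control and status */
    uint32_t mxcsr_mask;     /* which MXCSR bits may be set */
    uint8_t  st[8][16];      /* ST0-ST7 / MM0-MM7, one 16-byte slot each */
    uint8_t  xmm[16][16];    /* XMM0-XMM15; only XMM0-XMM7 outside long mode */
    uint8_t  reserved1[96];
} __attribute__((aligned(16)));
```

The area has the same size in both modes, so the extra eight XMM registers of long mode occupy space that a 32-bit kernel simply leaves unused. A compile-time check that sizeof(struct fxsave_area) is 512 is a cheap safeguard against accidental layout changes.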

FPU state

When the FPU is configured, the only thing left to do is to initialize its registers to their proper states. FNINIT will reset the user-visible part of the FPU state. This sets precision to 64-bit and rounding to nearest, which should be correct for most operations. It also masks all exceptions so that none of them causes an interrupt. You can change the control word by issuing an FLDCW. To diagnose broken code, you usually want to enable exceptions for invalid operands and stack overflows (by clearing bit 0). Clearing bit 2 allows you to catch divisions by zero as well. Some examples:

; FLDCW requires a memory operand, immediates do not work
FLDCW [value_37F]   ; writes 0x37f into the control word: the value written by F(N)INIT
FLDCW [value_37E]   ; writes 0x37e, the default with invalid operand exceptions enabled
FLDCW [value_37A]   ; writes 0x37a, both division by zero and invalid operands cause exceptions.
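The three values used above can be derived rather than memorised. The following C sketch shows the arithmetic; the macro and function names are invented for illustration, and the field positions (mask bits in the low byte, precision control in bits 8-9, rounding control in bits 10-11) are taken from the x87 control word layout in the processor manuals.

```c
#include <stdint.h>

#define FPU_CW_DEFAULT 0x037Fu    /* value loaded by F(N)INIT */
#define FPU_CW_IM      (1u << 0)  /* invalid operation mask */
#define FPU_CW_ZM      (1u << 2)  /* divide-by-zero mask */

/* Clear mask bits so that the selected exceptions are delivered. */
static uint16_t fpu_cw_unmask(uint16_t cw, uint16_t exceptions)
{
    return (uint16_t)(cw & ~exceptions);
}

/* Precision control, bits 8-9: 3 selects the 64-bit mantissa. */
static unsigned fpu_cw_precision(uint16_t cw) { return (cw >> 8) & 3u; }

/* Rounding control, bits 10-11: 0 selects round to nearest. */
static unsigned fpu_cw_rounding(uint16_t cw) { return (cw >> 10) & 3u; }
```

For instance, fpu_cw_unmask(FPU_CW_DEFAULT, FPU_CW_IM | FPU_CW_ZM) yields 0x37A, the last value in the listing above. The result still has to be stored in memory before FLDCW can load it.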

Using the MMX aliases for the FPU registers will cause those registers to be invalidated for floating point use. The EMMS instruction will reset the registers to non-vector use. The x86 calling convention assumes that the FPU register stack is ready for floating point use, so you will need to call EMMS before calling or returning to regular compiler-generated code. Both MMX instructions and EMMS preserve the control word you set with FLDCW, so you don't need to adjust it manually afterwards.

SSE operates mostly independently of the FPU registers. It has a separate MXCSR register that deals with control and exceptions, and that register has to be written separately.
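MXCSR uses a different layout from the x87 control word, so mask values cannot be shared between the two. The sketch below shows the equivalent of enabling invalid operation and divide-by-zero exceptions for SSE; the names are made up for illustration, while the bit positions and the 0x1F80 reset value follow the processor manuals. Actually loading the result would be done with LDMXCSR, which, like FLDCW, takes a memory operand.

```c
#include <stdint.h>

#define MXCSR_DEFAULT 0x1F80u    /* reset value: all exceptions masked */
#define MXCSR_IM      (1u << 7)  /* invalid operation mask */
#define MXCSR_ZM      (1u << 9)  /* divide-by-zero mask */

/* Clear mask bits so that the selected SSE exceptions are delivered. */
static uint32_t mxcsr_unmask(uint32_t mxcsr, uint32_t exceptions)
{
    return mxcsr & ~exceptions;
}
```

Unmasking only makes sense together with CR4.OSXMMEXCPT and an installed #XF handler, since otherwise the exception surfaces as an invalid opcode.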

Rent-a-coder

These functions can be used with GCC to perform some FPU operations without resorting to dedicated assembly:

#include <stdint.h>

/* Load a new x87 control word; FLDCW only accepts a memory operand. */
void fpu_load_control_word(const uint16_t control)
{
    asm volatile("fldcw %0" :: "m"(control));
}

References

Simply FPU, a practical guide covering the FPU basics from a userland perspective
Intel 80387 Programmer's Reference Manual, complete with example code
AMD Programmer's Manuals, has FPU instruction reference conveniently ordered by processor component.
Intel 64-bit Manuals, the Intel version of the manuals. More complete, but also more bloated.
