
Streaming SIMD Extensions (SSE)



SSE was introduced in the Pentium III and added 70 instructions to the Intel instruction set. SSE can increase data throughput through Single Instruction, Multiple Data (SIMD) instructions, which apply the same operation to multiple data elements in parallel.

SSE comes with 8 XMM registers (16 in 64-bit mode), named XMM0-XMM7 (XMM0-XMM15 in 64-bit mode), each 128 bits wide. Certain instructions in the SSE family (movntdqa, movdqa, movdqu, etc.) can load 16 bytes from memory or store 16 bytes to memory in a single operation. The SSE family also provides non-temporal hint instructions (such as movntdqa and movntdq) that mark one-shot memory accesses as non-temporal, so that those accesses do not pollute the small on-chip caches.

Because SSE added new registers, it is disabled by default: the typical operating system of the time could not yet save those registers on a task switch. To support SSE, you will need to implement separate code paths for saving and restoring SSE state (the instructions involved cause an exception on processors that do not support them), as well as handlers for the new exceptions. After that, you can tell the CPU to enable SSE use in userland tasks.

Checking for SSE

To check for SSE, execute CPUID with EAX = 1 and test that CPUID.01h:EDX.SSE[bit 25] is set:

mov eax, 0x1
cpuid				;feature flags are returned in ECX and EDX
test edx, 1<<25
jz .noSSE
;SSE is available

Adding support

In order to allow SSE instructions to be executed without generating a #UD, we need to alter the CR0 and CR4 registers.

clear the CR0.EM bit (bit 2) [ CR0 &= ~(1 << 2) ]
set the CR0.MP bit (bit 1) [ CR0 |= (1 << 1) ]
set the CR4.OSFXSR bit (bit 9) [ CR4 |= (1 << 9) ]
set the CR4.OSXMMEXCPT bit (bit 10) [ CR4 |= (1 << 10) ]

Here is an asm example:

;now enable SSE and the like
mov eax, cr0
and ax, 0xFFFB		;clear coprocessor emulation CR0.EM
or ax, 0x2			;set coprocessor monitoring  CR0.MP
mov cr0, eax
mov eax, cr4
or ax, 3 << 9		;set CR4.OSFXSR and CR4.OSXMMEXCPT at the same time
mov cr4, eax
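
The same bit manipulation can be sketched in C as pure functions that compute the values to write back. This is only an illustration: the function names are hypothetical, the values are treated as 32-bit as in the assembly above, and actually reading or writing CR0 and CR4 still requires privileged code such as the assembly example.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions taken from the list above. */
#define CR0_MP         (1u << 1)   /* coprocessor monitoring */
#define CR0_EM         (1u << 2)   /* coprocessor emulation */
#define CR4_OSFXSR     (1u << 9)   /* OS supports FXSAVE/FXRSTOR */
#define CR4_OSXMMEXCPT (1u << 10)  /* OS handles SIMD floating-point exceptions */

/* Hypothetical helper: CR0 value with EM cleared and MP set. */
static uint32_t cr0_for_sse(uint32_t cr0)
{
    uint32_t updated = (cr0 & ~CR0_EM) | CR0_MP;
    assert((updated & CR0_EM) == 0);  /* sanity check: emulation ends up off */
    return updated;
}

/* Hypothetical helper: CR4 value with OSFXSR and OSXMMEXCPT set. */
static uint32_t cr4_for_sse(uint32_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}

/* Nonzero if the given CR0/CR4 values already allow SSE instructions. */
static int sse_control_bits_ok(uint32_t cr0, uint32_t cr4)
{
    return (cr0 & CR0_EM) == 0
        && (cr0 & CR0_MP) != 0
        && (cr4 & (CR4_OSFXSR | CR4_OSXMMEXCPT)) == (CR4_OSFXSR | CR4_OSXMMEXCPT);
}
```

Keeping the computation separate from the privileged register access makes the bit logic easy to check on a host machine before it runs in a kernel.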


FXSAVE and FXRSTOR save and restore the complete SSE, x87 FPU, and MMX state to and from memory. The host needs to allocate 512 bytes of storage and pass a pointer to that memory as the operand of FXSAVE or FXRSTOR. Before using either instruction, check the CPUID features for the FXSR bit. Also, as with most SSE instructions, the memory operand must be 16-byte aligned or a #GP exception will occur. Remember to execute FXSAVE *before* any MXCSR modifications happen, or else the register will most likely get overwritten or set to 0, depending on the unknown state of MXCSR_MASK.

Example usage:

char fxsave_region[512] __attribute__((aligned(16)));
/* FXSAVE writes to its operand, so it is declared as an output ("=m") */
asm volatile("fxsave %0" : "=m"(fxsave_region));

or in asm:

segment .code
fxsave [SavedFloats]
segment .data
align 16
SavedFloats: TIMES 512 db 0

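Pitfall: with a single fixed buffer like the ones above, only one level of saving is supported; a second FXSAVE overwrites the previously saved state.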
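
One possible way around that pitfall is to give every task its own save area. The following is a minimal sketch under that assumption; `struct fxsave_area`, `struct task`, and the helper names are hypothetical and not part of any particular kernel. It only enforces the 512-byte size and 16-byte alignment described above and does not execute FXSAVE itself.

```c
#include <assert.h>
#include <stdint.h>

#define FXSAVE_AREA_SIZE 512

/* Hypothetical save area type: 512 bytes, 16-byte aligned. */
struct fxsave_area {
    _Alignas(16) uint8_t bytes[FXSAVE_AREA_SIZE];
};

_Static_assert(sizeof(struct fxsave_area) == FXSAVE_AREA_SIZE,
               "FXSAVE area must be exactly 512 bytes");

/* Hypothetical task structure carrying its own copy of the state. */
struct task {
    int id;
    struct fxsave_area fpu_state;
};

/* Nonzero if p satisfies the 16-byte alignment FXSAVE/FXRSTOR require. */
static int fxsave_ptr_ok(const void *p)
{
    return ((uintptr_t)p & 0xFu) == 0;
}

/* Return the task's save area, checking the alignment on the way. */
static struct fxsave_area *task_fpu_area(struct task *t)
{
    assert(fxsave_ptr_ok(&t->fpu_state));
    return &t->fpu_state;
}
```

The pointer returned by `task_fpu_area` would then be used as the memory operand of FXSAVE when switching away from a task and of FXRSTOR when switching back to it.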

MXCSR and its helpers LDMXCSR and STMXCSR

The MXCSR register holds all of the masking and flag information for use with SSE floating-point operations. Just like the x87 FPU control word, if you would like to mask certain exceptions from occurring or would like to specify the rounding mode, MXCSR will need to be modified. Bits 16-31 are reserved and will cause a #GP exception if set. LDMXCSR loads MXCSR from memory and STMXCSR stores it to memory. Both require a 32-bit memory operand. SSE support needs to already be set up before using either of these instructions (CR4.OSFXSR = 1, CR0.EM = 0, and CR0.TS = 0). If bits 7-12 are set, all SSE floating-point exceptions are masked. Bits 0-5 are exception status flags that are set if the corresponding exception has occurred. Bits 13-14 are the RC (Rounding Control) bits. RC:0 = to nearest, RC:1 = down, RC:2 = up, RC:3 = truncate.
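
The bit layout described above can be captured in a few C helpers that build a value for LDMXCSR or decode one returned by STMXCSR. This is a sketch only: the macro and function names are hypothetical, and the validity check covers just the reserved bits 16-31 mentioned above, not any further restrictions a given CPU reports through MXCSR_MASK.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_STATUS_FLAGS 0x003Fu       /* bits 0-5: exception status flags */
#define MXCSR_EXC_MASKS    0x1F80u       /* bits 7-12: exception mask bits */
#define MXCSR_RC_SHIFT     13            /* bits 13-14: rounding control */
#define MXCSR_RC_MASK      (3u << MXCSR_RC_SHIFT)
#define MXCSR_RESERVED     0xFFFF0000u   /* bits 16-31: must stay clear */

enum mxcsr_rounding {
    MXCSR_RC_NEAREST  = 0,
    MXCSR_RC_DOWN     = 1,
    MXCSR_RC_UP       = 2,
    MXCSR_RC_TRUNCATE = 3
};

/* Set bits 7-12 so that all SSE floating-point exceptions are masked. */
static uint32_t mxcsr_mask_all(uint32_t mxcsr)
{
    return mxcsr | MXCSR_EXC_MASKS;
}

/* Replace the RC field, leaving every other bit untouched. */
static uint32_t mxcsr_set_rounding(uint32_t mxcsr, enum mxcsr_rounding rc)
{
    assert((unsigned)rc <= 3u);
    return (mxcsr & ~MXCSR_RC_MASK) | ((uint32_t)rc << MXCSR_RC_SHIFT);
}

/* Extract the exception status flags (bits 0-5). */
static uint32_t mxcsr_status_flags(uint32_t mxcsr)
{
    return mxcsr & MXCSR_STATUS_FLAGS;
}

/* Nonzero if no reserved bit is set, i.e. this check alone would not #GP. */
static int mxcsr_reserved_clear(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_RESERVED) == 0;
}
```

For example, `mxcsr_set_rounding(mxcsr_mask_all(0), MXCSR_RC_TRUNCATE)` yields 0x7F80: all exceptions masked and truncating rounding selected.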

Updates to SSE

Later processors have added more instructions for different work to be performed on the vector registers. Supporting them with SSE support in place doesn't require any effort on the part of the OS (except for AVX, see below). The actual user of the instructions should however check if those instructions actually exist.

CPUID bits


Streaming SIMD Extensions 2 (SSE2)
The bit for SSE2 can be found on CPUID page 1, in EDX bit 26.


Streaming SIMD Extensions 3 (SSE3)
The bit for SSE3 can be found on CPUID page 1, in ECX bit 0.


Supplemental Streaming SIMD Extensions 3 (SSSE3)
The bit for SSSE3 can be found on CPUID page 1, in ECX bit 9.


Streaming SIMD Extensions 4 (SSE4)
The bit for SSE4.1 can be found on CPUID page 1, in ECX bit 19.

The bit for SSE4.2 can be found on CPUID page 1, in ECX bit 20.

The bit for SSE4A (AMD only) can be found on CPUID page 0x80000001, in ECX bit 6.


Streaming SIMD Extensions 5 (SSE5)
SSE5 was planned as one unit, but split into several:


The bit for XOP can be found on CPUID page 0x80000001, in ECX bit 11.


The bit for FMA4 can be found on CPUID page 0x80000001, in ECX bit 16.


The bit for CVT16 can be found on CPUID page 1, in ECX bit 29


The bit for AVX can be found on CPUID page 1, in ECX bit 28


The bit for XSAVE (needed to manage extended processor states) can be found on CPUID page 1, in ECX bit 26


The bit for AVX2 can be found on CPUID page 7, subleaf 0, in EBX bit 5.


The bits for AVX-512 state support are in CPUID page 0x0D, subleaf 0, EAX bits 5-7.

AVX-512 consists of separate features that can also be detected in CPUID page 7, subleaf 0. Basic support is indicated by the AVX512F bit (AVX-512 Foundation), EBX bit 16. The other AVX-512 feature bits are reported by the same CPUID function.
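
As a rough illustration, the page 1 and page 7 bits listed above can be collected in a lookup table. The sketch below assumes the caller has already executed CPUID and stored the relevant registers; `struct cpuid_snapshot`, the table, and `simd_has` are hypothetical names, and the AMD-specific extensions are left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Register values obtained from CPUID by the caller (hypothetical container). */
struct cpuid_snapshot {
    uint32_t leaf1_ecx;  /* CPUID page 1, ECX */
    uint32_t leaf1_edx;  /* CPUID page 1, EDX */
    uint32_t leaf7_ebx;  /* CPUID page 7, subleaf 0, EBX */
};

enum cpuid_reg { LEAF1_ECX, LEAF1_EDX, LEAF7_EBX };

struct simd_feature {
    const char *name;
    enum cpuid_reg reg;
    unsigned bit;
};

/* Bit positions as listed in the text above. */
static const struct simd_feature simd_features[] = {
    { "SSE",     LEAF1_EDX, 25 },
    { "SSE2",    LEAF1_EDX, 26 },
    { "SSE3",    LEAF1_ECX,  0 },
    { "SSSE3",   LEAF1_ECX,  9 },
    { "SSE4.1",  LEAF1_ECX, 19 },
    { "SSE4.2",  LEAF1_ECX, 20 },
    { "XSAVE",   LEAF1_ECX, 26 },
    { "AVX",     LEAF1_ECX, 28 },
    { "AVX2",    LEAF7_EBX,  5 },
    { "AVX512F", LEAF7_EBX, 16 },
};

/* Returns 1 if the feature bit is set, 0 if clear, -1 for an unknown name. */
static int simd_has(const struct cpuid_snapshot *s, const char *name)
{
    size_t n = sizeof simd_features / sizeof simd_features[0];
    for (size_t i = 0; i < n; i++) {
        const struct simd_feature *f = &simd_features[i];
        if (strcmp(f->name, name) != 0)
            continue;
        assert(f->bit < 32);
        uint32_t value = 0;
        switch (f->reg) {
        case LEAF1_ECX: value = s->leaf1_ecx; break;
        case LEAF1_EDX: value = s->leaf1_edx; break;
        case LEAF7_EBX: value = s->leaf7_ebx; break;
        }
        return (int)((value >> f->bit) & 1u);
    }
    return -1;
}
```

A set CPUID bit only says that the processor implements the feature; for AVX and AVX-512 the operating system must additionally have enabled the corresponding state.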


When the X86-64 architecture was introduced, AMD demanded a minimum level of SSE support to simplify OS code. Any system capable of long mode should support at least SSE and SSE2, which means that the kernel does not need to care about the old FPU save code. X86-64 adds 8 SSE registers (xmm8 - xmm15) to the mix. However, you can only access these in 64-bit mode.

Advanced Vector Extensions is a SIMD (Single Instruction, Multiple Data) instruction set introduced by Intel in 2011.


AVX needs to be enabled by the kernel before being used. Forgetting to do this will raise a #UD on the first AVX instruction. Both SSE and OSXSAVE (CR4 bit 18) must be enabled before enabling AVX; failing to do so will also produce a #UD.

AVX is enabled by setting bit 2 of the XCR0 register. Bit 1 of XCR0 must also be set (indicating SSE support).

Here is an example of assembly code enabling AVX after SSE has been enabled (you should check AVX and XSAVE are supported first, see above):

    push rax
    push rcx
    push rdx

    xor rcx, rcx
    xgetbv ;Load XCR0 register
    or eax, 7 ;Set AVX, SSE, X87 bits
    xsetbv ;Save back to XCR0

    pop rdx
    pop rcx
    pop rax

To enable AVX-512, set the OPMASK (bit 5), ZMM_Hi256 (bit 6), and Hi16_ZMM (bit 7) bits of XCR0. You must first ensure that these bits are valid (see above).
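
The XCR0 values involved can be sketched in C as follows. The helper names are hypothetical, and the code only computes the value that privileged code would pass to XSETBV. It assumes `supported` holds EAX from CPUID page 0x0D, subleaf 0, and it sets the AVX, SSE and x87 bits together with the AVX-512 bits, since the AVX-512 state components depend on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XCR0_X87       (1ull << 0)
#define XCR0_SSE       (1ull << 1)
#define XCR0_AVX       (1ull << 2)
#define XCR0_OPMASK    (1ull << 5)
#define XCR0_ZMM_HI256 (1ull << 6)
#define XCR0_HI16_ZMM  (1ull << 7)
#define XCR0_AVX512    (XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* XCR0 value with the x87, SSE and AVX bits set (the "or eax, 7" above). */
static uint64_t xcr0_with_avx(uint64_t xcr0)
{
    return xcr0 | XCR0_X87 | XCR0_SSE | XCR0_AVX;
}

/* Add the AVX-512 bits only if all three are reported as valid;
   otherwise return xcr0 unchanged. */
static uint64_t xcr0_with_avx512(uint64_t xcr0, uint32_t supported)
{
    if ((supported & XCR0_AVX512) != XCR0_AVX512)
        return xcr0;
    return xcr0_with_avx(xcr0) | XCR0_AVX512;
}

/* Split a 64-bit XCR0 value into the EDX:EAX pair that XSETBV expects. */
static void xcr0_split(uint64_t xcr0, uint32_t *eax, uint32_t *edx)
{
    assert(eax != NULL && edx != NULL);
    *eax = (uint32_t)xcr0;
    *edx = (uint32_t)(xcr0 >> 32);
}
```

Returning the unchanged value when the bits are not valid lets the caller fall back to plain AVX without a separate error path.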
