SSE

From OSDev.wiki
Revision as of 10:43, 18 February 2012 by osdev>Bellezzasolo (See also section)
Streaming SIMD Extensions (SSE)

Introduction

SSE was introduced with the Pentium III and added 70 instructions to the Intel instruction set. SSE instructions can increase data throughput because they are Single Instruction, Multiple Data (SIMD) instructions: a single instruction applies the same operation to multiple data elements in parallel.

SSE provides 8 XMM registers (16 in 64-bit mode), named XMM0-XMM7 (XMM0-XMM15), and each is 128 bits wide. Certain instructions in the SSE family (movaps and movups in SSE; movdqa and movdqu from SSE2 onward) can load 16 bytes from memory or store 16 bytes to memory in a single operation. The SSE family also provides non-temporal hint instructions (for example movntps in SSE, movntdq in SSE2, and movntdqa in SSE4.1) that let data which is only touched once bypass the cache hierarchy, so that those accesses do not pollute the small on-chip caches.
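To make the SIMD idea concrete, here is a minimal sketch that adds four floats with one SSE instruction. It assumes a compiler that ships the `<xmmintrin.h>` intrinsics header (GCC and Clang do); the helper name `add4_floats` is made up for this illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: computes out[i] = a[i] + b[i] for i = 0..3
   using a single packed single-precision add (addps). */
static void add4_floats(const float *a, const float *b, float *out)
{
    __m128 va  = _mm_loadu_ps(a);     /* unaligned 16-byte load into an XMM register */
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  /* four additions performed in parallel */
    _mm_storeu_ps(out, sum);          /* unaligned 16-byte store back to memory */
}
```

The unaligned load/store variants are used so the arrays need no special alignment; with 16-byte aligned buffers, `_mm_load_ps` and `_mm_store_ps` could be used instead.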

Adding support

In order to allow SSE instructions to be executed without generating a #UD, we need to alter the CR0 and CR4 registers.

clear the CR0.EM bit (bit 2) [ CR0 &= ~(1 << 2) ]
set the CR0.MP bit (bit 1) [ CR0 |= (1 << 1) ]
set the CR4.OSFXSR bit (bit 9) [ CR4 |= (1 << 9) ]
set the CR4.OSXMMEXCPT bit (bit 10) [ CR4 |= (1 << 10) ]

Here is an asm example:

;now enable SSE and the like
mov eax, cr0
and ax, 0xFFFB		;clear coprocessor emulation CR0.EM
or ax, 0x2			;set coprocessor monitoring  CR0.MP
mov cr0, eax
mov eax, cr4
or ax, 3 << 9		;set CR4.OSFXSR and CR4.OSXMMEXCPT at the same time
mov cr4, eax
ret
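The same bit manipulation can be sketched in C. The helpers below only compute the new register values; actually reading and writing CR0 and CR4 requires ring 0 and a `mov` to and from the control registers, which is left out here. The macro and function names are assumptions made for this example.

```c
#include <stdint.h>

/* Bit positions as listed above. */
#define CR0_MP         (1u << 1)   /* monitor coprocessor */
#define CR0_EM         (1u << 2)   /* x87 emulation */
#define CR4_OSFXSR     (1u << 9)   /* OS supports FXSAVE/FXRSTOR */
#define CR4_OSXMMEXCPT (1u << 10)  /* OS handles unmasked SIMD FP exceptions */

/* Hypothetical helper: returns the CR0 value with EM cleared and MP set. */
static uint32_t cr0_for_sse(uint32_t cr0)
{
    return (cr0 & ~CR0_EM) | CR0_MP;
}

/* Hypothetical helper: returns the CR4 value with OSFXSR and OSXMMEXCPT set. */
static uint32_t cr4_for_sse(uint32_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}
```

All other bits are passed through unchanged, which matches what the assembly does by only touching the low 16 bits of EAX.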

FXSAVE and FXRSTOR

FXSAVE and FXRSTOR save and restore the complete SSE, x87 FPU, and MMX state to and from memory. The caller needs to allocate 512 bytes of storage and pass a pointer to that memory as the operand of FXSAVE or FXRSTOR. Before using either instruction, check the CPUID feature flags for the FXSR bit (CPUID leaf 1, EDX bit 24). Also, like most SSE instructions, the memory operand needs to be 16-byte aligned or a #GP exception will occur. Remember to execute FXSAVE *before* any MXCSR modifications happen, or else the register will most likely get overwritten or set to 0, depending on the unknown state of the MXCSR_MASK.

Example usage:

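/* 512-byte save area; FXSAVE requires 16-byte alignment */
char fxsave_region[512] __attribute__((aligned(16)));
/* FXSAVE takes the save area as its memory operand and writes to it */
asm volatile("fxsave %0" : : "m"(fxsave_region) : "memory");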
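The CPUID check mentioned above might look like the following sketch. It assumes GCC or Clang, whose `<cpuid.h>` header provides `__get_cpuid`; the function name `cpu_has_sse_and_fxsr` is made up for this example. The bit positions (EDX bit 24 for FXSR, EDX bit 25 for SSE, both in CPUID leaf 1) come from the Intel manuals.

```c
#include <cpuid.h>    /* GCC/Clang helper: __get_cpuid */
#include <stdbool.h>

#define CPUID_EDX_FXSR (1u << 24)  /* FXSAVE/FXRSTOR supported */
#define CPUID_EDX_SSE  (1u << 25)  /* SSE supported */

/* Hypothetical helper: true if CPUID leaf 1 reports both FXSR and SSE. */
static bool cpu_has_sse_and_fxsr(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not available. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;

    return (edx & CPUID_EDX_FXSR) && (edx & CPUID_EDX_SSE);
}
```

A kernel would typically run a check like this once at boot, before touching CR4.OSFXSR or executing FXSAVE.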

MXCSR and its helpers LDMXCSR and STMXCSR

The MXCSR register holds all of the masking and flag information for use with SSE floating-point operations. Just like the x87 FPU control word, if you would like to mask certain exceptions from occurring or would like to specify rounding types, MXCSR will need to be modified. LDMXCSR loads MXCSR from memory and STMXCSR stores it to memory. They both require a 32-bit memory operand. SSE support needs to already be set up before using either of these instructions (CR4.OSFXSR = 1, CR0.EM = 0, and CR0.TS = 0). Bits 0-5 are exception status flags that are set if the corresponding exception has occurred. Bits 7-12 are the exception mask bits; if all of them are set, all SSE floating-point exceptions are masked. Bits 13-14 are the RC (Rounding Control) bits. RC:0 = to nearest, RC:1 = down, RC:2 = up, RC:3 = truncate. Bits 16-31 are reserved and will cause a #GP exception if set.
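As a rough illustration of the bit layout described above, the sketch below builds MXCSR values in C and applies a rounding mode using the `_mm_getcsr` and `_mm_setcsr` intrinsics from `<xmmintrin.h>`, which wrap STMXCSR and LDMXCSR. The macro and helper names are assumptions made for this example.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_getcsr (STMXCSR), _mm_setcsr (LDMXCSR) */

#define MXCSR_MASK_ALL   (0x3Fu << 7)  /* bits 7-12: exception mask bits */
#define MXCSR_RC_SHIFT   13            /* bits 13-14: rounding control */
#define MXCSR_RC_BITS    (0x3u << MXCSR_RC_SHIFT)
#define MXCSR_RC_NEAREST 0u
#define MXCSR_RC_DOWN    1u
#define MXCSR_RC_UP      2u
#define MXCSR_RC_TRUNC   3u

/* Hypothetical helper: returns mxcsr with every SSE FP exception masked. */
static uint32_t mxcsr_mask_all(uint32_t mxcsr)
{
    return mxcsr | MXCSR_MASK_ALL;
}

/* Hypothetical helper: returns mxcsr with the RC field replaced by rc (0-3). */
static uint32_t mxcsr_set_rounding(uint32_t mxcsr, uint32_t rc)
{
    return (mxcsr & ~MXCSR_RC_BITS) | ((rc & 0x3u) << MXCSR_RC_SHIFT);
}

/* Hypothetical helper: extracts the RC field from an MXCSR value. */
static uint32_t mxcsr_get_rounding(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_RC_BITS) >> MXCSR_RC_SHIFT;
}

/* Reads the live MXCSR, changes only the rounding mode, and writes it back. */
static void sse_set_rounding(uint32_t rc)
{
    _mm_setcsr(mxcsr_set_rounding(_mm_getcsr(), rc));
}
```

Because only the RC field is modified, the reserved upper bits are never set, so the write back to MXCSR does not risk the #GP described above.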

See Also

References

SSE Wikipedia article: [1]

A few SSE examples: User:01000101/optlib/