{{Floats}}
 
== Streaming SIMD Extensions (SSE) ==
=== Introduction ===
SSE was introduced with the Pentium III and added 70 instructions to the Intel instruction set. SSE instructions can increase data throughput because they are Single Instruction, Multiple Data (SIMD) instructions: a single instruction applies the same operation to multiple data elements in parallel.
 
SSE comes with 8 XMM registers (XMM0-XMM7; 16 in 64-bit mode, XMM0-XMM15), each 128 bits wide. Certain SSE-family instructions (movntdqa, movdqa, movdqu, etc.) can load 16 bytes from memory or store 16 bytes to memory in a single operation. The SSE family also includes non-temporal hint instructions (such as movntdqa and movntdq) for one-shot memory accesses, so that data which is only touched once does not pollute the small on-chip caches.
 
Since this change added new registers, it is disabled by default as the typical operating system of that time was not yet able to save those registers on a task switch. To support SSE, you will need to implement separate code paths for saving and restoring SSE state (as those instructions will cause an exception on processors that do not support it), and handlers for the new exceptions. After that, you can tell the CPU to enable SSE use in userland tasks.
 
=== Checking for SSE ===
To check for SSE, test whether CPUID.01h:EDX.SSE[bit 25] is set:
<syntaxhighlight lang="asm">
mov eax, 0x1
cpuid
test edx, 1<<25
jz .noSSE
;SSE is available
</syntaxhighlight>
 
=== Adding support ===
 
Here is an asm example:
<syntaxhighlight lang="asm">
;now enable SSE and the like
mov eax, cr0
; [...]
mov cr4, eax
ret
</syntaxhighlight>
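The bit arithmetic behind enabling SSE can be sketched in C as follows. This is not the routine from this page: the helper names are hypothetical, and setting CR0.MP and CR4.OSXMMEXCPT is an assumption taken from the usual architecture documentation. The functions only compute the new register values; the actual reads and writes of CR0 and CR4 still have to be done in ring 0.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Control-register bits involved in enabling SSE. */
#define CR0_MP          (1u << 1)   /* monitor coprocessor */
#define CR0_EM          (1u << 2)   /* x87 emulation; must be clear for SSE */
#define CR4_OSFXSR      (1u << 9)   /* OS supports FXSAVE/FXRSTOR */
#define CR4_OSXMMEXCPT  (1u << 10)  /* OS handles unmasked SIMD FP exceptions */

/* Hypothetical helper: CR0 value with emulation off and monitoring on.
 * CR0.TS is left untouched; it must also be 0 when SSE code runs. */
static inline uint32_t sse_cr0_value(uint32_t cr0)
{
    cr0 &= ~CR0_EM;
    cr0 |= CR0_MP;
    return cr0;
}

/* Hypothetical helper: CR4 value with both SSE-related OS bits set. */
static inline uint32_t sse_cr4_value(uint32_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}
</syntaxhighlight>

A kernel would read CR0, write back <code>sse_cr0_value(cr0)</code>, and then do the same for CR4 with <code>sse_cr4_value(cr4)</code>.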
 
=== FXSAVE and FXRSTOR ===
 
Example usage:
<syntaxhighlight lang="c">
char fxsave_region[512] __attribute__((aligned(16)));
/* FXSAVE writes the 512-byte state image, so the buffer is an output operand. */
asm volatile("fxsave %0" : "=m"(fxsave_region));
</syntaxhighlight>
or in asm:
<syntaxhighlight lang="asm">
segment .code
SaveFloats:
fxsave [SavedFloats]
segment .data
align 16
SavedFloats: TIMES 512 db 0
</syntaxhighlight>
Pitfall: these examples use a single static save area, so only one level of saving is supported; a second save overwrites the first.
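One way around the single-buffer limitation is to give every task its own save area. The sketch below assumes a hypothetical <code>struct task</code> and GCC/Clang inline assembly; the structure layout and function names are illustrative only and are not part of any existing kernel.

<syntaxhighlight lang="c">
#include <stdalign.h>
#include <stdint.h>

#define FXSAVE_SIZE 512

/* Hypothetical per-task structure; only the FPU/SSE part is shown. */
struct task {
    alignas(16) uint8_t fx_state[FXSAVE_SIZE];
};

/* Returns nonzero if the save area meets FXSAVE's 16-byte alignment rule. */
static inline int fx_area_is_aligned(const struct task *t)
{
    return ((uintptr_t)t->fx_state & 0xF) == 0;
}

#if defined(__i386__) || defined(__x86_64__)
/* Save the current x87/SSE state into the task's own buffer. */
static inline void task_save_fx(struct task *t)
{
    __asm__ volatile("fxsave %0" : "=m"(t->fx_state));
}

/* Reload the x87/SSE state previously saved for this task. */
static inline void task_restore_fx(const struct task *t)
{
    __asm__ volatile("fxrstor %0" : : "m"(t->fx_state));
}
#endif
</syntaxhighlight>

On a task switch the scheduler would call <code>task_save_fx</code> for the outgoing task and <code>task_restore_fx</code> for the incoming one, so saves no longer overwrite each other.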
 
=== MXCSR and its helpers LDMXCSR and STMXCSR ===
The MXCSR register holds all of the masking and flag information for SSE floating-point operations. Just like the x87 FPU control word, MXCSR must be modified if you want to mask certain exceptions or select a rounding mode. LDMXCSR loads MXCSR from memory and STMXCSR stores it to memory; both require a 32-bit memory operand. SSE support must already be set up before using either instruction (CR4.OSFXSR = 1, CR0.EM = 0, and CR0.TS = 0). The fields are:
* Bits 0-5: exception status flags, set when the corresponding exception has occurred.
* Bits 7-12: exception mask bits. If all of them are set, all SSE floating-point exceptions are masked.
* Bits 13-14: RC (Rounding Control). RC:0 = to nearest, RC:1 = down, RC:2 = up, RC:3 = truncate.
* Bits 16-31: reserved. Setting any of them causes a #GP exception.
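The MXCSR fields described above can be expressed as C constants and small helpers. This is a minimal sketch: the macro and function names are hypothetical, and the helpers only compute values. Loading the result into the register would still need LDMXCSR, for example through inline assembly.

<syntaxhighlight lang="c">
#include <stdint.h>

#define MXCSR_FLAGS_MASK  0x0000003Fu  /* bits 0-5: exception status flags */
#define MXCSR_EXC_MASKS   0x00001F80u  /* bits 7-12: exception mask bits */
#define MXCSR_RC_SHIFT    13           /* bits 13-14: rounding control */
#define MXCSR_RC_MASK     (3u << MXCSR_RC_SHIFT)
#define MXCSR_RESERVED    0xFFFF0000u  /* bits 16-31: must stay zero */

enum mxcsr_rounding {
    MXCSR_RC_NEAREST  = 0,
    MXCSR_RC_DOWN     = 1,
    MXCSR_RC_UP       = 2,
    MXCSR_RC_TRUNCATE = 3
};

/* Replace the rounding-control field, leaving all other bits alone. */
static inline uint32_t mxcsr_set_rounding(uint32_t mxcsr, enum mxcsr_rounding rc)
{
    return (mxcsr & ~MXCSR_RC_MASK) | ((uint32_t)rc << MXCSR_RC_SHIFT);
}

/* Mask every SSE floating-point exception. */
static inline uint32_t mxcsr_mask_all(uint32_t mxcsr)
{
    return mxcsr | MXCSR_EXC_MASKS;
}

/* Checks only the reserved bits 16-31 named above; a real CPU may
 * reject further bits depending on what it implements. */
static inline int mxcsr_is_loadable(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_RESERVED) == 0;
}
</syntaxhighlight>

For example, <code>mxcsr_set_rounding(mxcsr_mask_all(0), MXCSR_RC_TRUNCATE)</code> yields a value with all exceptions masked and truncation selected.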
== Updates to SSE ==
Later processors have added more instructions for different work to be performed on the vector registers. Supporting them with SSE support in place doesn't require any effort on the part of the OS (except for AVX, see below). The actual user of the instructions should however check if those instructions actually exist.
=== CPUID bits ===
=====SSE2=====
<small>'''Streaming SIMD Extensions 2 (SSE2)'''</small><br />
The bit for SSE2 can be found on CPUID page 1, in EDX bit 26.
=====SSE3=====
<small>'''Streaming SIMD Extensions 3 (SSE3)'''</small><br />
The bit for SSE3 can be found on CPUID page 1, in ECX bit 0.
=====SSSE3=====
<small>'''Supplemental Streaming SIMD Extensions 3 (SSSE3)'''</small><br />
The bit for SSSE3 can be found on CPUID page 1, in ECX bit 9.
 
=====SSE4=====
<small>'''Streaming SIMD Extensions 4 (SSE4)'''</small><br />
The bit for SSE4.1 can be found on CPUID page 1, in ECX bit 19.

The bit for SSE4.2 can be found on CPUID page 1, in ECX bit 20.

The bit for SSE4A (AMD only) can be found on CPUID page 0x80000001, in ECX bit 6.
 
=====SSE5=====
<small>'''Streaming SIMD Extensions 5 (SSE5)'''</small><br />
SSE5 was planned as one unit, but split into several:
======XOP======
The bit for XOP can be found on CPUID page 0x80000001, in ECX bit 11
======FMA4======
The bit for FMA4 can be found on CPUID page 0x80000001, in ECX bit 16
======CVT16======
The bit for CVT16 can be found on CPUID page 1, in ECX bit 29
======AVX======
The bit for AVX can be found on CPUID page 1, in ECX bit 28
======XSAVE======
The bit for XSAVE (needed to manage extended processor states) can be found on CPUID page 1, in ECX bit 26
======AVX2======
The bit for AVX2 can be found on CPUID page 7, 0, in EBX bit 5
=====AVX-512=====
The XCR0 state bits required by AVX-512 are reported as supported in CPUID page 0x0D, 0x0, EAX bits 5-7.

AVX-512 is split into separate features that are detected in CPUID page 7, 0. Basic support is indicated by the AVX512F bit (AVX-512 Foundation) in CPUID page 7, 0, EBX bit 16. The other AVX-512 feature bits are reported by the same CPUID function and are listed [[w:CPUID|here]].
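The page 1 and page 7 bits listed above can be decoded with a small C helper. This is a sketch under stated assumptions: the structure and function names are hypothetical, and the caller is expected to have executed CPUID already (for instance with GCC's <code>__get_cpuid_count</code> from <code>&lt;cpuid.h&gt;</code>) and to pass zeros for page 7 if the CPU does not report it.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Raw register values returned by one CPUID invocation. */
struct cpuid_regs { uint32_t eax, ebx, ecx, edx; };

struct simd_features {
    bool sse, sse2, sse3, ssse3, sse4_1, sse4_2;
    bool xsave, avx, f16c, avx2, avx512f;
};

static inline bool cpuid_bit(uint32_t reg, unsigned n)
{
    return ((reg >> n) & 1u) != 0;
}

/* leaf1 = CPUID page 1; leaf7 = CPUID page 7, subleaf 0. */
static inline struct simd_features
decode_simd_features(struct cpuid_regs leaf1, struct cpuid_regs leaf7)
{
    struct simd_features f;
    f.sse     = cpuid_bit(leaf1.edx, 25);
    f.sse2    = cpuid_bit(leaf1.edx, 26);
    f.sse3    = cpuid_bit(leaf1.ecx, 0);
    f.ssse3   = cpuid_bit(leaf1.ecx, 9);
    f.sse4_1  = cpuid_bit(leaf1.ecx, 19);
    f.sse4_2  = cpuid_bit(leaf1.ecx, 20);
    f.xsave   = cpuid_bit(leaf1.ecx, 26);
    f.avx     = cpuid_bit(leaf1.ecx, 28);
    f.f16c    = cpuid_bit(leaf1.ecx, 29);  /* listed above as CVT16 */
    f.avx2    = cpuid_bit(leaf7.ebx, 5);
    f.avx512f = cpuid_bit(leaf7.ebx, 16);
    return f;
}
</syntaxhighlight>

Keeping the decoding separate from the CPUID instruction itself makes the bit positions easy to check without running on the target CPU.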
===X86_64===
When the [[X86-64]] architecture was introduced, AMD demanded a minimum level of SSE support to simplify OS code. Any system capable of long mode should support at least SSE and SSE2, which means that the kernel does not need to care about the old FPU save code.
X86-64 adds 8 more SSE registers (xmm8 - xmm15). However, these can only be accessed in 64-bit mode.
 
'''Advanced Vector Extensions''' is a '''SIMD''' (Single Instruction, Multiple Data) instruction set introduced by Intel in 2011.
 
=== AVX ===
 
AVX needs to be enabled by the kernel before it is used. Forgetting to do this raises a #UD on the first AVX instruction.
Both SSE and OSXSAVE (CR4 bit 18) must be enabled before enabling AVX. Failing to do so also produces a #UD.

AVX is enabled by setting bit 2 of the XCR0 register. Bit 0 (x87) and bit 1 (SSE) of XCR0 must be set as well.
 
Here is an example of assembly code enabling AVX after SSE has been enabled (you should check AVX and XSAVE are supported first, see above):
 
<syntaxhighlight lang="asm">
enable_avx:
push rax
push rcx
push rdx
 
xor rcx, rcx
xgetbv ;Load XCR0 register
or eax, 7 ;Set AVX, SSE, X87 bits
xsetbv ;Save back to XCR0
 
pop rdx
pop rcx
pop rax
ret
</syntaxhighlight>
 
To enable AVX-512, set the OPMASK (bit 5), ZMM_Hi256 (bit 6), and Hi16_ZMM (bit 7) bits of XCR0, in addition to the bits required for AVX. You must first make sure that the CPU reports these bits as supported (see above).
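The choice of XCR0 bits can be sketched as a pure C function. The names below are hypothetical. <code>supported</code> is assumed to be the mask of XCR0 bits the CPU reports in CPUID page 0x0D, 0x0 (EAX for the low 32 bits), and <code>current</code> is the value read with XGETBV. The result would then be written back with XSETBV.

<syntaxhighlight lang="c">
#include <stdint.h>

#define XCR0_X87        (1ull << 0)
#define XCR0_SSE        (1ull << 1)
#define XCR0_AVX        (1ull << 2)
#define XCR0_OPMASK     (1ull << 5)
#define XCR0_ZMM_HI256  (1ull << 6)
#define XCR0_HI16_ZMM   (1ull << 7)
#define XCR0_AVX512     (XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* Hypothetical helper: compute the XCR0 value to write, enabling AVX and,
 * if all three of its state bits are supported, AVX-512 as well. */
static inline uint64_t xcr0_enable_vector_state(uint64_t current, uint64_t supported)
{
    uint64_t wanted = current | XCR0_X87;   /* bit 0 must always be set */

    if ((supported & (XCR0_SSE | XCR0_AVX)) == (XCR0_SSE | XCR0_AVX)) {
        wanted |= XCR0_SSE | XCR0_AVX;
        /* AVX-512 state is only enabled as a complete group. */
        if ((supported & XCR0_AVX512) == XCR0_AVX512)
            wanted |= XCR0_AVX512;
    }
    return wanted;
}
</syntaxhighlight>

Enabling the three AVX-512 bits as a group, and only on top of the SSE and AVX bits, avoids writing a combination that XSETBV would reject.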
 
==See Also==
* [[MMX]]
* The [[User:01000101/optlib/|optimisation library]] of 01000101, containing example code
*[[SSE2]]
 
=== References ===
* The [[Wikipedia:EN:Streaming_SIMD_Extensions|Wikipedia article]] on SSE
* The [[Wikipedia:EN:AVX-512|Wikipedia article]] on AVX-512
 
[[Category:X86]]
[[Category:SSE]]