Getting to Ring 3: Difference between revisions
Latest revision as of 15:45, 9 June 2024
The end goal of writing a kernel is to get to userspace, in other words, to go from ring 0 to ring 3. While one might expect that adding ring 3 GDT entries would be sufficient, it is more complicated. All of the following tasks must be completed:
- Add two new GDT entries (at least) configured for ring 3.
- These entries are needed for the user's code and data segments (one each)
- Set up a barebones TSS with an ESP0 stack.
- When an interrupt (be it fault, IRQ, or software interrupt) happens while the CPU is in user mode, the CPU needs to know where the kernel stack is located. This location is stored in the ESP0 (0 for ring 0) entry of the TSS.
- Set up an IDT entry for ring 3 system call interrupts (optional).
- System calls are the way user code requests the kernel to do I/O and process management. For more information see System Calls.
Requirements
- Ring 0 GDT and IDT
- IRQ handling
- Plans for multitasking with task switching
GDT
Following is an example of a GDT entry structure in C, utilizing bit-fields:
struct gdt_entry_bits {
unsigned int limit_low : 16;
unsigned int base_low : 24;
unsigned int accessed : 1;
unsigned int read_write : 1; // readable for code, writable for data
unsigned int conforming_expand_down : 1; // conforming for code, expand down for data
unsigned int code : 1; // 1 for code, 0 for data
unsigned int code_data_segment : 1; // should be 1 for everything but TSS and LDT
unsigned int DPL : 2; // privilege level
unsigned int present : 1;
unsigned int limit_high : 4;
unsigned int available : 1; // only used in software; has no effect on hardware
unsigned int long_mode : 1;
unsigned int big : 1; // 32-bit opcodes for code, 32-bit stack for data
unsigned int gran : 1; // 1 to count the limit in 4 KiB units, 0 to count it in bytes
unsigned int base_high : 8;
} __packed; // or `__attribute__((packed))` depending on compiler
Using this structure, two ring 3 segments, both with base of 0 and limit of 0xFFFFFFFF, can be added, as follows:
static struct gdt_entry_bits gdt[6]; // one null segment, two ring 0 segments, two ring 3 segments, TSS segment
// (ring 0 segments)
struct gdt_entry_bits *ring3_code = &gdt[3];
struct gdt_entry_bits *ring3_data = &gdt[4];
ring3_code->limit_low = 0xFFFF;
ring3_code->base_low = 0;
ring3_code->accessed = 0;
ring3_code->read_write = 1; // since this is a code segment, specifies that the segment is readable
ring3_code->conforming_expand_down = 0; // does not matter for ring 3 as no lower privilege level exists
ring3_code->code = 1;
ring3_code->code_data_segment = 1;
ring3_code->DPL = 3; // ring 3
ring3_code->present = 1;
ring3_code->limit_high = 0xF;
ring3_code->available = 1;
ring3_code->long_mode = 0;
ring3_code->big = 1; // it's 32 bits
ring3_code->gran = 1; // 4KB page addressing
ring3_code->base_high = 0;
*ring3_data = *ring3_code; // contents are similar so save time by copying
ring3_data->code = 0; // not code but data
write_tss(&gdt[5]); // the TSS descriptor goes in index 5 (write_tss is defined in the TSS section)
flush_tss();
Strictly speaking, the CPU can be put into user mode with just these two segments. However, it would then be impossible to return to ring 0 for system calls, faults, or even IRQs. That is where the TSS comes in.
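As a cross-check on the bit-field values above, the same two descriptors can be packed by hand into the raw 64-bit values that end up in the GDT. This is only a sketch: `gdt_pack` and `gdt_access` are hypothetical helper names that do not appear in the article, and the flags nibble 0xD assumes gran=1, big=1, long_mode=0 and available=1 as set above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: packs base, 20-bit limit, access byte and flags
 * nibble into the raw 64-bit value stored in one GDT slot. */
uint64_t gdt_pack(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
    assert(limit <= 0xFFFFF); /* the limit field is only 20 bits wide */
    assert(flags <= 0xF);     /* four flag bits: available, long_mode, big, gran */

    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);            /* limit bits 0..15  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;     /* base bits 0..23   */
    d |= (uint64_t)access << 40;                /* type, S, DPL, P   */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48; /* limit bits 16..19 */
    d |= (uint64_t)flags << 52;                 /* available, long_mode, big, gran */
    d |= (uint64_t)(base >> 24) << 56;          /* base bits 24..31  */
    return d;
}

/* Hypothetical helper: access byte of a present, readable/writable,
 * non-conforming, not-yet-accessed code or data segment. */
uint8_t gdt_access(unsigned dpl, int is_code)
{
    assert(dpl <= 3);
    return (uint8_t)(0x80               /* present           */
                   | (dpl << 5)         /* DPL               */
                   | 0x10               /* code_data_segment */
                   | (is_code ? 0x08 : 0x00) /* code         */
                   | 0x02);             /* read_write        */
}
```

With these helpers, `gdt_pack(0, 0xFFFFF, gdt_access(3, 1), 0xD)` yields 0x00DFFA000000FFFF for the ring 3 code segment; the ring 3 data segment differs only in the access byte (0xF2 instead of 0xFA).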
The TSS
The TSS can be used for multitasking, though it is recommended to use software multitasking for these reasons:
- Software task switching is faster (usually)
- When you port your OS to a different CPU, it probably won't have the TSS, so you'll have to implement software task switching anyway
- x86 64-bit mode does not allow you to use the TSS for task switching (the main reason, especially if your goal is to reach 64-bit mode)
This guide will use software multitasking. Because of this, most fields of the 32-bit TSS go unused. Here is the structure of the TSS:
struct tss_entry_struct {
uint32_t prev_tss; // The previous TSS - with hardware task switching these form a kind of backward linked list.
uint32_t esp0; // The stack pointer to load when changing to kernel mode.
uint32_t ss0; // The stack segment to load when changing to kernel mode.
// Everything below here is unused.
uint32_t esp1; // esp and ss 1 and 2 would be used when switching to rings 1 or 2.
uint32_t ss1;
uint32_t esp2;
uint32_t ss2;
uint32_t cr3;
uint32_t eip;
uint32_t eflags;
uint32_t eax;
uint32_t ecx;
uint32_t edx;
uint32_t ebx;
uint32_t esp;
uint32_t ebp;
uint32_t esi;
uint32_t edi;
uint32_t es;
uint32_t cs;
uint32_t ss;
uint32_t ds;
uint32_t fs;
uint32_t gs;
uint32_t ldt;
uint16_t trap;
uint16_t iomap_base;
} __packed;
typedef struct tss_entry_struct tss_entry_t;
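Because the CPU reads the TSS at fixed offsets, it can be worth checking the layout at compile time. The following is a minimal sketch, assuming a GCC- or Clang-style `__attribute__((packed))`; the name `tss32` and the helper `tss32_limit` are made up for this example, and the struct simply repeats the field order shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as tss_entry_struct, restated so the block compiles alone. */
struct tss32 {
    uint32_t prev_tss, esp0, ss0, esp1, ss1, esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs, ldt;
    uint16_t trap, iomap_base;
} __attribute__((packed));

/* A 32-bit TSS is 104 bytes; ESP0 and SS0 sit at offsets 4 and 8. */
static_assert(sizeof(struct tss32) == 104, "32-bit TSS must be 104 bytes");
static_assert(offsetof(struct tss32, esp0) == 4, "esp0 must be at offset 4");
static_assert(offsetof(struct tss32, ss0) == 8, "ss0 must be at offset 8");
static_assert(offsetof(struct tss32, iomap_base) == 102, "iomap_base must be at offset 102");

/* Hypothetical helper: the smallest valid descriptor limit for this TSS,
 * i.e. the offset of its last byte. */
uint32_t tss32_limit(void)
{
    return (uint32_t)sizeof(struct tss32) - 1;
}
```

If a field is added or reordered by mistake, the build fails instead of the kernel faulting on the first interrupt from ring 3.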
To set up this TSS structure, give it an initial esp0 stack with the correct ss0 segment.
// Note: the GDT entry struct is reused for the TSS descriptor, so some of its field names do not match their meaning in a TSS descriptor.
tss_entry_t tss_entry;
void write_tss(struct gdt_entry_bits *g) {
// Compute the base and limit of the TSS for use in the GDT entry.
uint32_t base = (uint32_t) &tss_entry;
uint32_t limit = sizeof(tss_entry) - 1; // the limit is the offset of the last valid byte
// Add a TSS descriptor to the GDT.
g->limit_low = limit;
g->base_low = base;
g->accessed = 1; // With a system entry (`code_data_segment` = 0), 1 indicates TSS and 0 indicates LDT
g->read_write = 0; // For a TSS, indicates busy (1) or not busy (0).
g->conforming_expand_down = 0; // always 0 for TSS
g->code = 1; // For a TSS, indicates 32-bit (1) or 16-bit (0).
g->code_data_segment=0; // indicates TSS/LDT (see also `accessed`)
g->DPL = 0; // ring 0, see the comments below
g->present = 1;
g->limit_high = (limit & (0xf << 16)) >> 16; // isolate top nibble
g->available = 0; // 0 for a TSS
g->long_mode = 0;
g->big = 0; // should leave zero according to manuals.
g->gran = 0; // limit is in bytes, not pages
g->base_high = (base >> 24) & 0xff; // isolate top byte
// Ensure the TSS is initially zero'd.
memset(&tss_entry, 0, sizeof tss_entry);
tss_entry.ss0 = REPLACE_KERNEL_DATA_SEGMENT; // Set the kernel stack segment.
tss_entry.esp0 = REPLACE_KERNEL_STACK_ADDRESS; // Set the kernel stack pointer.
//note that CS is loaded from the IDT entry and should be the regular kernel code segment
}
void set_kernel_stack(uint32_t stack) { // Sets the stack that is loaded when an interrupt occurs in user mode
tss_entry.esp0 = stack;
}
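The way write_tss spreads the base and limit over the descriptor fields can be tried out in isolation. The sketch below is illustrative only: `tss_desc_parts` and `tss_split` are hypothetical names, and the access byte 0x89 is just the combination of the field values used above (present, DPL 0, system segment, 32-bit TSS that is not busy).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the pieces of a TSS descriptor. */
struct tss_desc_parts {
    uint16_t limit_low;  /* limit bits 0..15  */
    uint32_t base_low;   /* base bits 0..23   */
    uint8_t  access;     /* P, DPL, S, type   */
    uint8_t  limit_high; /* limit bits 16..19 */
    uint8_t  base_high;  /* base bits 24..31  */
};

/* Splits a 32-bit base and a byte-granular limit into descriptor fields. */
struct tss_desc_parts tss_split(uint32_t base, uint32_t limit)
{
    /* A 32-bit TSS needs a limit of at least 0x67; the field holds 20 bits. */
    assert(limit >= 0x67 && limit <= 0xFFFFF);

    struct tss_desc_parts p;
    p.limit_low  = (uint16_t)(limit & 0xFFFF);
    p.base_low   = base & 0xFFFFFF;
    p.access     = 0x80 | 0x09; /* present | type 9 (available 32-bit TSS), DPL 0 */
    p.limit_high = (uint8_t)((limit >> 16) & 0xF);
    p.base_high  = (uint8_t)(base >> 24);
    return p;
}
```

For example, a TSS at the made-up address 0x12345678 with limit 0x67 splits into base_low 0x345678, base_high 0x12, limit_low 0x67 and limit_high 0.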
Finally, the implementation of the flush_tss function (Intel syntax):
; C declaration: void flush_tss(void);
global flush_tss
flush_tss:
mov ax, (5 * 8) | 0 ; GDT index 5 (each entry is 8 bytes), OR-ed with 0 for an RPL (requested privilege level) of 0
ltr ax
ret
At this point the kernel is ready to enter ring 3. Note that the DPL of the TSS descriptor in the GDT has nothing to do with the privilege level the task will run at: that depends on the DPL of the code segment loaded into CS. The DPL of the TSS descriptor determines from which privilege levels it is possible to CALL it, triggering a hardware context switch (32-bit only).
From the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3, Section 7.2.2 (TSS Descriptor):
In most systems, the DPLs of TSS descriptors are set to values less than 3, so that only privileged software can perform task switching. However, in multitasking applications, DPLs for some TSS descriptors may be set to 3 to allow task switching at the application (or user) privilege level.
Entering Ring 3
The x86 is a tricky CPU. No matter how you approach it, there is no easy way to enter user mode. Nonetheless, below are three ways to enter user mode.
iret method
One of the ways to get to ring 3 is to make the processor think it was already in ring 3 to start with. This can be accomplished with an iret. Following is a simple example of this trick:
global jump_usermode
extern test_user_function
jump_usermode:
mov ax, (4 * 8) | 3 ; ring 3 data with bottom 2 bits set for ring 3
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax ; SS is handled by iret
; set up the stack frame iret expects
mov eax, esp
push (4 * 8) | 3 ; data selector
push eax ; current esp
pushf ; eflags
push (3 * 8) | 3 ; code selector (ring 3 code with bottom 2 bits set for ring 3)
push test_user_function ; instruction address to return to
iret
This will jump to test_user_function, which will then be running in user mode. Have test_user_function execute a cli or other privileged instruction and you'll be greeted by a General Protection Fault.
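The selector arithmetic and the five values that iret pops can also be written down in C, which may help when building the same frame from a task-creation routine. This is a sketch under the GDT layout used in this article (ring 3 code at index 3, ring 3 data at index 4); `make_selector`, `iret_frame` and `make_user_frame` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Selector = GDT index * 8, with the RPL in the bottom two bits.
 * Bit 2 (table indicator) stays clear, meaning "GDT". */
uint16_t make_selector(unsigned index, unsigned rpl)
{
    assert(index < 8192 && rpl <= 3);
    return (uint16_t)((index << 3) | rpl);
}

/* The frame iret expects when returning to a lower privilege level,
 * listed from the lowest address (last pushed) to the highest. */
struct iret_frame {
    uint32_t eip;
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;
    uint32_t ss;
};

/* Fills a frame that would start ring 3 code at `entry` on `user_esp`. */
struct iret_frame make_user_frame(uint32_t entry, uint32_t user_esp, uint32_t eflags)
{
    struct iret_frame f;
    f.eip    = entry;
    f.cs     = make_selector(3, 3); /* ring 3 code, 0x1B */
    f.eflags = eflags;
    f.esp    = user_esp;
    f.ss     = make_selector(4, 3); /* ring 3 data, 0x23 */
    return f;
}
```

Here `make_selector(3, 3)` is 0x1B and `make_selector(4, 3)` is 0x23, the same values as `(3 * 8) | 3` and `(4 * 8) | 3` in the assembly, while the TSS selector loaded by ltr corresponds to `make_selector(5, 0)`, or 0x28.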
sysexit method
The second way is to use the sysexit instruction as follows:
global jump_usermode
extern test_user_function
jump_usermode:
mov ax, (4 * 8) | 3 ; user data segment with RPL 3
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax ; sysexit sets SS
; setup wrmsr inputs
xor edx, edx ; upper 32 bits of the value wrmsr writes (EDX:EAX); zero here
mov eax, 0x8 ; the segments are computed as follows: CS=MSR+0x10 (0x8+0x10=0x18), SS=MSR+0x18 (0x8+0x18=0x20).
mov ecx, 0x174 ; MSR specifier: IA32_SYSENTER_CS
wrmsr ; set sysexit segments
; setup sysexit inputs
mov edx, test_user_function ; to be loaded into EIP
mov ecx, esp ; to be loaded into ESP
sysexit
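The selector arithmetic in the comment above can be spelled out as a small model. This is a sketch of how the selectors are derived from IA32_SYSENTER_CS, not code that touches the MSR itself; the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four selectors derived from the MSR. */
struct sysenter_selectors {
    uint16_t kernel_cs; /* loaded by sysenter */
    uint16_t kernel_ss; /* loaded by sysenter */
    uint16_t user_cs;   /* loaded by sysexit  */
    uint16_t user_ss;   /* loaded by sysexit  */
};

/* Models the selector values implied by the low 16 bits of IA32_SYSENTER_CS. */
struct sysenter_selectors sysenter_selectors_from_msr(uint16_t sysenter_cs)
{
    /* A value whose index bits are all zero selects the null descriptor
     * and makes the instructions fault. */
    assert((sysenter_cs & 0xFFFC) != 0);

    struct sysenter_selectors s;
    s.kernel_cs = (uint16_t)(sysenter_cs & 0xFFFC);
    s.kernel_ss = (uint16_t)(s.kernel_cs + 8);
    s.user_cs   = (uint16_t)((sysenter_cs + 16) | 3); /* RPL forced to 3 */
    s.user_ss   = (uint16_t)((sysenter_cs + 24) | 3); /* RPL forced to 3 */
    return s;
}
```

With the value 0x8 written above, this gives kernel selectors 0x08 and 0x10 and user selectors 0x1B and 0x23, which is why the four code and data descriptors have to sit in the GDT in exactly the order ring 0 code, ring 0 data, ring 3 code, ring 3 data.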
sysret method
The third way is to use the sysret instruction as follows:
; note: this code is for 64-bit long mode only.
; it is unknown if it works in protected mode.
; using intel assembly style
global jump_usermode
extern test_user_function
jump_usermode:
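; enable the system call extensions (syscall and sysret)
mov rcx, 0xc0000082 ; IA32_LSTAR: syscall entry point; EDX:EAX must already hold the handler address
wrmsr
mov rcx, 0xc0000080 ; IA32_EFER
rdmsr
or eax, 1 ; set SCE (system call extensions)
wrmsr
mov rcx, 0xc0000081 ; IA32_STAR: segment selector bases
rdmsr
mov edx, 0x00180008 ; bits 63:48 = 0x18 (sysret base), bits 47:32 = 0x08 (syscall CS)
wrmsr
mov ecx, test_user_function ; to be loaded into RIP (zero-extended, so the address must be below 4 GiB)
mov r11, 0x202 ; to be loaded into RFLAGS (interrupts enabled)
sysretq ; use "o64 sysret" if you assemble with NASM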
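The value written to the upper half of IA32_STAR fixes which selectors syscall and a 64-bit sysret will load. The model below decodes it; it is a sketch for reasoning about GDT layout, and `star_selectors` and `decode_star` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the selectors implied by IA32_STAR. */
struct star_selectors {
    uint16_t syscall_cs; /* kernel CS loaded by syscall      */
    uint16_t syscall_ss; /* kernel SS loaded by syscall      */
    uint16_t sysret_cs;  /* user CS loaded by 64-bit sysret  */
    uint16_t sysret_ss;  /* user SS loaded by sysret         */
};

/* Decodes the selector bases held in bits 63:32 of IA32_STAR. */
struct star_selectors decode_star(uint64_t star)
{
    uint16_t kernel_base = (uint16_t)((star >> 32) & 0xFFFF);
    uint16_t user_base   = (uint16_t)((star >> 48) & 0xFFFF);
    assert(user_base <= 0xFFEF); /* user_base + 16 must still fit in 16 bits */

    struct star_selectors s;
    s.syscall_cs = (uint16_t)(kernel_base & 0xFFFC);
    s.syscall_ss = (uint16_t)(kernel_base + 8);
    s.sysret_cs  = (uint16_t)((user_base + 16) | 3); /* 64-bit return */
    s.sysret_ss  = (uint16_t)((user_base + 8) | 3);
    return s;
}
```

For the 0x00180008 loaded into EDX above, sysretq ends up with CS 0x2B and SS 0x23. In the 32-bit GDT built earlier in this article, selector 0x28 is the TSS slot, so a long mode kernel using this value would presumably need its own layout with a 64-bit ring 3 code descriptor at index 5 and ring 3 data at index 4.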
Multitasking considerations
There are a lot of subtle aspects of user mode and task switching. Whenever an interrupt or system call arrives while the CPU is in ring 3, the CPU first loads the stack pointer from ESP0 and then pushes the interrupted context (SS, ESP, EFLAGS, CS and EIP) onto that stack before the interrupt handler is entered, so the handler is already running on the ESP0 stack. This becomes a problem with two ring 3 tasks if a task switch merely pushes context information and changes esp: the current esp lies within the ESP0 stack, and the other task's saved esp lies within the same ESP0 stack, so the two tasks overwrite each other's data. To avoid this, give every task its own kernel stack and change ESP0 (along with the interrupt-pushed ESP) on each task switch.
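The bookkeeping this implies can be sketched without any hardware involved. In the model below every task owns a kernel stack, and the switch routine both records the outgoing task's stack pointer and points ESP0 at the incoming task's stack. All names (`task`, `tss_stub`, `switch_task`) are hypothetical; in a real kernel the ESP0 update would go through something like set_kernel_stack from the TSS section and the actual esp change would be done in assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task state: where the task's kernel stack starts and
 * where its saved context currently sits on that stack. */
struct task {
    uint32_t kstack_top;
    uint32_t saved_esp;
};

/* Stand-in for the one TSS field that matters here. */
struct tss_stub {
    uint32_t esp0;
};

/* Returns the esp value the low-level switch code would load next. */
uint32_t switch_task(struct tss_stub *tss, struct task *prev,
                     struct task *next, uint32_t current_esp)
{
    /* Two tasks sharing one kernel stack is exactly the bug described above. */
    assert(prev->kstack_top != next->kstack_top);

    prev->saved_esp = current_esp; /* remember where prev's context was pushed */
    tss->esp0 = next->kstack_top;  /* next ring 3 -> ring 0 entry uses next's stack */
    return next->saved_esp;
}
```

Because ESP0 always names the stack of the task that is about to run, an interrupt taken in ring 3 can only push its frame onto that task's own kernel stack.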