Getting to Ring 3

From OSDev.wiki
Latest revision as of 15:45, 9 June 2024

This page is a work in progress.
This page may thus be incomplete. Its content may be changed in the near future.

The end goal of writing a kernel is to get to userspace, or, in other words, to go from ring 0 to ring 3. While one might expect that adding ring 3 GDT entries would be sufficient, more is required. All of the following tasks must be completed:

  • Add two new GDT entries (at least) configured for ring 3.
    • These entries are needed for the user's code and data segments (one each)
  • Set up a barebones TSS with an ESP0 stack.
    • When an interrupt (be it fault, IRQ, or software interrupt) happens while the CPU is in user mode, the CPU needs to know where the kernel stack is located. This location is stored in the ESP0 (0 for ring 0) entry of the TSS.
  • Set up an IDT entry for ring 3 system call interrupts (optional).
    • System calls are the way user code requests the kernel to do IO and process management. For more information see System Calls

Requirements

  • Ring 0 GDT and IDT
  • IRQ handling
  • Plans for multitasking with task switching

GDT

Following is an example of a GDT entry structure in C, utilizing bit-fields:

struct gdt_entry_bits {
	unsigned int limit_low              : 16;
	unsigned int base_low               : 24;
	unsigned int accessed               :  1;
	unsigned int read_write             :  1; // readable for code, writable for data
	unsigned int conforming_expand_down :  1; // conforming for code, expand down for data
	unsigned int code                   :  1; // 1 for code, 0 for data
	unsigned int code_data_segment      :  1; // should be 1 for everything but TSS and LDT
	unsigned int DPL                    :  2; // privilege level
	unsigned int present                :  1;
	unsigned int limit_high             :  4;
	unsigned int available              :  1; // only used in software; has no effect on hardware
	unsigned int long_mode              :  1;
	unsigned int big                    :  1; // 32-bit opcodes for code, uint32_t stack for data
	unsigned int gran                   :  1; // 1 to use 4k page addressing, 0 for byte addressing
	unsigned int base_high              :  8;
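} __packed; // or `__attribute__((packed))` depending on compiler
typedef struct gdt_entry_bits gdt_entry_bits; // lets the code below write `gdt_entry_bits` without `struct`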

Using this structure, two ring 3 segments, both with base of 0 and limit of 0xFFFFFFFF, can be added, as follows:

static gdt_entry_bits gdt[6]; // one null segment, two ring 0 segments, two ring 3 segments, TSS segment
// (ring 0 segments)

gdt_entry_bits *ring3_code = &gdt[3];
gdt_entry_bits *ring3_data = &gdt[4];

ring3_code->limit_low = 0xFFFF;
ring3_code->base_low = 0;
ring3_code->accessed = 0;
ring3_code->read_write = 1; // since this is a code segment, specifies that the segment is readable
ring3_code->conforming_expand_down = 0; // does not matter for ring 3 as no lower privilege level exists
ring3_code->code = 1;
ring3_code->code_data_segment = 1;
ring3_code->DPL = 3; // ring 3
ring3_code->present = 1;
ring3_code->limit_high = 0xF;
ring3_code->available = 1;
ring3_code->long_mode = 0;
ring3_code->big = 1; // it's 32 bits
ring3_code->gran = 1; // 4KB page addressing
ring3_code->base_high = 0;

*ring3_data = *ring3_code; // contents are similar so save time by copying
ring3_data->code = 0; // not code but data

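write_tss(&gdt[5]); // the TSS descriptor goes in index 5 (the sixth entry); write_tss is defined below

flush_tss(); // loads the task register; call it only after the GDT has been loaded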
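
The layout of C bit-fields is implementation-defined, so some kernels build descriptors with explicit shifts instead. The following is a minimal sketch of that alternative, not the structure used above; `encode_descriptor` and its parameter names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: packs an 8-byte segment descriptor using shifts.
 * limit  - 20-bit segment limit
 * access - access byte (present, DPL, code_data_segment, type bits)
 * flags  - upper nibble of the granularity byte:
 *          bit 3 = gran, bit 2 = big, bit 1 = long_mode, bit 0 = available */
static uint64_t encode_descriptor(uint32_t base, uint32_t limit,
                                  uint8_t access, uint8_t flags)
{
	uint64_t d = 0;
	d |= (uint64_t)(limit & 0xFFFF);             /* limit bits 0-15  */
	d |= (uint64_t)(base & 0xFFFFFF) << 16;      /* base bits 0-23   */
	d |= (uint64_t)access << 40;                 /* access byte      */
	d |= (uint64_t)((limit >> 16) & 0xF) << 48;  /* limit bits 16-19 */
	d |= (uint64_t)(flags & 0xF) << 52;          /* flag nibble      */
	d |= (uint64_t)((base >> 24) & 0xFF) << 56;  /* base bits 24-31  */
	return d;
}
```

With the field values used above (available set to 1), the ring 3 code segment corresponds to `encode_descriptor(0, 0xFFFFF, 0xFA, 0xD)` and the ring 3 data segment to `encode_descriptor(0, 0xFFFFF, 0xF2, 0xD)`.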

Strictly speaking, these two segments are enough to put the CPU into user mode. However, without a TSS it is impossible to return to ring 0 for system calls, faults, or even IRQs. That is where the TSS comes in.

The TSS

The TSS can be used for multitasking, though it is recommended to use software multitasking for these reasons:

  • Software task switching is faster (usually)
  • When you port your OS to a different CPU, it probably won't have the TSS, so you'll have to implement software task switching anyway
  • x86 64-bit mode does not allow you to use the TSS for task switching (the main reason, especially if your goal is to reach 64-bit mode)

This guide will use software multitasking. Because of this the 32-bit TSS will contain a lot of junk we don't need. Here is the structure of the TSS:

struct tss_entry_struct {
	uint32_t prev_tss; // The previous TSS - with hardware task switching these form a kind of backward linked list.
	uint32_t esp0;     // The stack pointer to load when changing to kernel mode.
	uint32_t ss0;      // The stack segment to load when changing to kernel mode.
	// Everything below here is unused.
	uint32_t esp1; // esp and ss 1 and 2 would be used when switching to rings 1 or 2.
	uint32_t ss1;
	uint32_t esp2;
	uint32_t ss2;
	uint32_t cr3;
	uint32_t eip;
	uint32_t eflags;
	uint32_t eax;
	uint32_t ecx;
	uint32_t edx;
	uint32_t ebx;
	uint32_t esp;
	uint32_t ebp;
	uint32_t esi;
	uint32_t edi;
	uint32_t es;
	uint32_t cs;
	uint32_t ss;
	uint32_t ds;
	uint32_t fs;
	uint32_t gs;
	uint32_t ldt;
	uint16_t trap;
	uint16_t iomap_base;
} __packed;

typedef struct tss_entry_struct tss_entry_t;
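
A mistake in the field order or widths silently breaks the ESP0 lookup, so it can help to check the layout at compile time. The sketch below restates the structure compactly and assumes a C11 compiler (`_Static_assert`) with GCC/Clang-style `__attribute__((packed))`; the expected offsets follow from the field list above.

```c
#include <stddef.h>
#include <stdint.h>

/* Same fields as the structure above, grouped to save space. */
struct tss_entry_struct {
	uint32_t prev_tss, esp0, ss0, esp1, ss1, esp2, ss2;
	uint32_t cr3, eip, eflags;
	uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
	uint32_t es, cs, ss, ds, fs, gs, ldt;
	uint16_t trap, iomap_base;
} __attribute__((packed));
typedef struct tss_entry_struct tss_entry_t;

/* 25 dwords plus 2 words: the 32-bit TSS is 104 (0x68) bytes long. */
_Static_assert(sizeof(tss_entry_t) == 104, "32-bit TSS must be 104 bytes");
_Static_assert(offsetof(tss_entry_t, esp0) == 4, "esp0 must be at offset 4");
_Static_assert(offsetof(tss_entry_t, ss0) == 8, "ss0 must be at offset 8");
_Static_assert(offsetof(tss_entry_t, iomap_base) == 102, "iomap_base must be at offset 102");
```

If any assertion fails, the build stops instead of producing a kernel that faults on the first interrupt from ring 3.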

To set up this TSS structure, give it an initial esp0 stack with the correct ss0 segment.

// Note: the GDT entry struct is reused for the TSS descriptor, so some field names do not match their meaning here.
tss_entry_t tss_entry;

void write_tss(gdt_entry_bits *g) {
	// Compute the base and limit of the TSS for use in the GDT entry.
	uint32_t base = (uint32_t) &tss_entry;
	uint32_t limit = sizeof tss_entry - 1; // the limit is the offset of the last valid byte

	// Add a TSS descriptor to the GDT.
	g->limit_low = limit;
	g->base_low = base;
	g->accessed = 1; // With a system entry (`code_data_segment` = 0), 1 indicates TSS and 0 indicates LDT
	g->read_write = 0; // For a TSS, indicates busy (1) or not busy (0).
	g->conforming_expand_down = 0; // always 0 for TSS
	g->code = 1; // For a TSS, indicates 32-bit (1) or 16-bit (0).
	g->code_data_segment = 0; // indicates TSS/LDT (see also `accessed`)
	g->DPL = 0; // ring 0, see the comments below
	g->present = 1;
	g->limit_high = (limit & (0xf << 16)) >> 16; // isolate top nibble
	g->available = 0; // 0 for a TSS
	g->long_mode = 0;
	g->big = 0; // should leave zero according to manuals.
	g->gran = 0; // limit is in bytes, not pages
	g->base_high = (base >> 24) & 0xff; // isolate top byte (avoids shifting a signed int into its sign bit)

	// Ensure the TSS is initially zero'd.
	memset(&tss_entry, 0, sizeof tss_entry);

	tss_entry.ss0  = REPLACE_KERNEL_DATA_SEGMENT;  // Set the kernel stack segment.
	tss_entry.esp0 = REPLACE_KERNEL_STACK_ADDRESS; // Set the kernel stack pointer.
	//note that CS is loaded from the IDT entry and should be the regular kernel code segment
}

void set_kernel_stack(uint32_t stack) { // Updates the ESP0 stack that is loaded when an interrupt occurs in ring 3
	tss_entry.esp0 = stack;
}
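
Since the field names in `write_tss` come from code and data descriptors, it may be easier to see the result as a single access byte. The following sketch shows how the bits combine, assuming the standard descriptor layout; `tss_access_byte` is a hypothetical helper, not part of the code above.

```c
#include <stdint.h>

/* Hypothetical helper: access byte of a 32-bit TSS descriptor.
 * bit 7     = present
 * bits 5-6  = DPL
 * bit 4     = 0 for a system segment (code_data_segment above)
 * bits 0-3  = type: bit 0 (accessed) = 1, bit 1 (read_write) = busy,
 *             bit 2 (conforming_expand_down) = 0, bit 3 (code) = 1 for 32-bit */
static uint8_t tss_access_byte(unsigned dpl, int busy)
{
	uint8_t type = busy ? 0xB : 0x9; /* 0x9 = available, 0xB = busy */
	return (uint8_t)(0x80 | ((dpl & 3) << 5) | type);
}
```

The values set in `write_tss` (present, DPL 0, not busy) correspond to `tss_access_byte(0, 0)`, which is 0x89.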

Finally, the implementation of the flush_tss function (Intel syntax):

; C declaration: void flush_tss(void);
global flush_tss
flush_tss:
	mov ax, (5 * 8) | 0 ; GDT index 5 (each entry is 8 bytes), symbolically OR-ed with 0 to set the RPL (requested privilege level).
	ltr ax
	ret
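
The selector arithmetic used here, `(index * 8) | RPL`, recurs throughout the rest of this page. As a small sketch, it can be written as a C helper; `make_selector` is a hypothetical name, and it assumes the table indicator bit (bit 2) stays 0 so that the selector refers to the GDT.

```c
#include <stdint.h>

/* Hypothetical helper: builds a GDT segment selector.
 * bits 3-15 = index into the GDT, bit 2 = 0 (GDT), bits 0-1 = RPL */
static uint16_t make_selector(unsigned index, unsigned rpl)
{
	return (uint16_t)((index << 3) | (rpl & 3));
}
```

With the GDT layout used on this page, `make_selector(5, 0)` gives the TSS selector 0x28, `make_selector(3, 3)` the ring 3 code selector 0x1B, and `make_selector(4, 3)` the ring 3 data selector 0x23.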

At this point the kernel is ready to enter ring 3. It's worth noting that the DPL of the TSS descriptor in the GDT has nothing to do with the privilege level the task will run at: that depends on the DPL of the code segment loaded into CS. The DPL of the TSS descriptor determines the privilege level from which it is possible to CALL the TSS, triggering a hardware context switch (32-bit only).

From the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3, Section 7.2.2 (TSS Descriptor):

In most systems, the DPLs of TSS descriptors are set to values less than 3, so that only privileged software can perform task switching. However, in multitasking applications, DPLs for some TSS descriptors may be set to 3 to allow task switching at the application (or user) privilege level.

Entering Ring 3

The x86 is a tricky CPU. No matter how you approach it, there is no easy way to enter user mode. Nonetheless, below are three ways to enter user mode.

iret method

One of the ways to get to ring 3 is to make the processor think it was already in ring 3 to start with. This can be accomplished with an iret. Following is a simple example of this trick:

global jump_usermode
extern test_user_function
jump_usermode:
	mov ax, (4 * 8) | 3 ; ring 3 data with bottom 2 bits set for ring 3
	mov ds, ax
	mov es, ax 
	mov fs, ax 
	mov gs, ax ; SS is handled by iret

	; set up the stack frame iret expects
	mov eax, esp
	push (4 * 8) | 3 ; data selector
	push eax ; current esp
	pushf ; eflags
	push (3 * 8) | 3 ; code selector (ring 3 code with bottom 2 bits set for ring 3)
	push test_user_function ; instruction address to return to
	iret

This will jump to test_user_function, which will then be running in user mode! Have test_user_function execute a cli or another privileged instruction and you'll be greeted by a General Protection Fault.
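
The five values pushed before the iret can also be described as a C structure, which is handy when a scheduler has to prepare the initial frame for a new user task. This is a sketch under assumptions: `iret_frame` and `make_user_frame` are hypothetical names, and unlike the assembly above (which reuses the current flags via pushf) it sets the interrupt flag explicitly so that the task starts with interrupts enabled.

```c
#include <stdint.h>

#define USER_CODE_SELECTOR ((3 * 8) | 3) /* 0x1B: GDT index 3, RPL 3 */
#define USER_DATA_SELECTOR ((4 * 8) | 3) /* 0x23: GDT index 4, RPL 3 */
#define EFLAGS_RESERVED    0x002         /* bit 1 is always set */
#define EFLAGS_IF          0x200         /* interrupt enable flag */

/* Values iret pops when returning to a lower privilege level,
 * listed from the lowest stack address upwards. */
struct iret_frame {
	uint32_t eip, cs, eflags, esp, ss;
};

/* Hypothetical helper: fills a frame that starts a task in ring 3. */
static struct iret_frame make_user_frame(uint32_t entry, uint32_t user_stack)
{
	struct iret_frame f;
	f.eip    = entry;
	f.cs     = USER_CODE_SELECTOR;
	f.eflags = EFLAGS_RESERVED | EFLAGS_IF;
	f.esp    = user_stack;
	f.ss     = USER_DATA_SELECTOR;
	return f;
}
```

The field order is the reverse of the push order in the assembly: the value pushed last (the instruction address) sits at the lowest address.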

sysexit method

The second way is to use the sysexit instruction as follows:

global jump_usermode
extern test_user_function
jump_usermode:
	mov ax, (4 * 8) | 3 ; user data segment with RPL 3
	mov ds, ax
	mov es, ax
	mov fs, ax
	mov gs, ax ; sysexit sets SS

	; setup wrmsr inputs
	xor edx, edx ; high half of the value written by wrmsr (EDX:EAX); zeroed
	mov eax, 0x8 ; the segments are computed as follows: CS=MSR+0x10 (0x8+0x10=0x18), SS=MSR+0x18 (0x8+0x18=0x20).
	mov ecx, 0x174 ; MSR specifier: IA32_SYSENTER_CS
	wrmsr ; set sysexit segments

	; setup sysexit inputs
	mov edx, test_user_function ; to be loaded into EIP
	mov ecx, esp ; to be loaded into ESP
	sysexit
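
The selector arithmetic from the comment above can be checked with a short C sketch. The helper names are hypothetical; the offsets (+16 for CS, +24 for SS, RPL forced to 3) are the ones sysexit applies in protected mode.

```c
#include <stdint.h>

/* Hypothetical helpers: selectors sysexit loads, given the value
 * stored in IA32_SYSENTER_CS. The RPL is forced to 3. */
static uint16_t sysexit_cs(uint16_t sysenter_cs)
{
	return (uint16_t)((sysenter_cs + 16) | 3);
}

static uint16_t sysexit_ss(uint16_t sysenter_cs)
{
	return (uint16_t)((sysenter_cs + 24) | 3);
}
```

With the MSR set to 0x8 as above, these give 0x1B and 0x23, which match the ring 3 code and data selectors of the GDT built earlier. This is why sysexit requires the ring 0 code, ring 0 data, ring 3 code and ring 3 data descriptors to be consecutive.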

sysret method

The third way is to use the sysret instruction as follows:

; note: this code is for 64-bit long mode only.
;       Intel CPUs support syscall/sysret only in 64-bit mode.
;       using intel assembly style
global jump_usermode
extern test_user_function
jump_usermode:
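	; IA32_LSTAR (0xc0000082) holds the syscall entry point and is written from EDX:EAX,
	; so load the handler address into EDX:EAX before this wrmsr
	mov rcx, 0xc0000082
	wrmsr
	; enable the syscall and sysret instructions: set SCE (bit 0) in IA32_EFER (0xc0000080)
	mov rcx, 0xc0000080
	rdmsr
	or eax, 1
	wrmsr
	; IA32_STAR (0xc0000081): EDX[15:0] = kernel CS used by syscall (0x08),
	; EDX[31:16] = base selector used by sysret (0x18): SS = base + 8, 64-bit CS = base + 16
	mov rcx, 0xc0000081
	rdmsr
	mov edx, 0x00180008
	wrmsr

	mov ecx, test_user_function ; to be loaded into RIP (zero-extended, so the address must be below 4 GiB)
	mov r11, 0x202 ; to be loaded into RFLAGS (interrupts enabled)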
	sysretq ;use "o64 sysret" if you assemble with NASM
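
The constant 0x00180008 written to the upper half of IA32_STAR packs two selectors. The sketch below shows how the value is composed and which selectors a 64-bit sysret derives from it; the helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: composes IA32_STAR.
 * bits 32-47 = kernel CS loaded by syscall
 * bits 48-63 = base selector used by sysret
 * (the low 32 bits are left as zero in this sketch) */
static uint64_t make_star(uint16_t syscall_cs, uint16_t sysret_base)
{
	return ((uint64_t)sysret_base << 48) | ((uint64_t)syscall_cs << 32);
}

/* Selectors loaded by a 64-bit sysret (sysretq); the RPL is forced to 3. */
static uint16_t sysret_cs64(uint16_t sysret_base)
{
	return (uint16_t)((sysret_base + 16) | 3);
}

static uint16_t sysret_ss(uint16_t sysret_base)
{
	return (uint16_t)((sysret_base + 8) | 3);
}
```

With a base of 0x18, sysretq loads SS = 0x23 and CS = 0x2B. In the GDT layout used earlier on this page, index 5 (selector 0x28) holds the TSS descriptor, so a long mode kernel using this code would most likely need a different layout with a 64-bit ring 3 code segment in that slot.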

Multitasking considerations

User mode adds some subtle requirements to task switching. Whenever an interrupt (including a system call interrupt) arrives while the CPU is in ring 3, the CPU first loads the stack pointer from ESP0 and then pushes the interrupted context onto that stack, so the interrupt handler runs on the ESP0 stack. This becomes a problem with two ring 3 tasks if a task switch merely pushes the context and changes esp: the esp being saved points into the ESP0 stack, and the other task's saved esp points into the same ESP0 stack, so the tasks overwrite each other's data. To avoid this, the handler must change the ESP0 stack (along with the interrupt-pushed ESP stack) on each task switch.
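
A common way to satisfy this requirement is to give every task its own kernel stack and to point ESP0 at it during the switch. The following is a minimal sketch of that idea, not a complete scheduler: `struct task`, `current_task` and `switch_to` are hypothetical, and the TSS is reduced to a stub with only its leading fields so that the example stands alone.

```c
#include <stdint.h>

/* Stub with only the leading TSS fields; a real kernel would use
 * the full tss_entry_t. */
struct tss_stub {
	uint32_t prev_tss, esp0, ss0;
};
static struct tss_stub tss_entry;

/* Same role as set_kernel_stack above. */
static void set_kernel_stack(uint32_t stack)
{
	tss_entry.esp0 = stack;
}

/* Hypothetical per-task bookkeeping: each task owns a kernel stack. */
struct task {
	uint32_t kernel_stack_top;
};
static struct task *current_task;

static void switch_to(struct task *next)
{
	/* The next interrupt from ring 3 must land on the stack of the
	 * task that will be running, not on the previous task's stack. */
	set_kernel_stack(next->kernel_stack_top);
	current_task = next;
	/* Saving and restoring the register context would go here. */
}
```

Because ESP0 always points at the top of the running task's own kernel stack, two ring 3 tasks never push their interrupt frames onto the same memory.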