Getting to Ring 3
The end goal of writing a kernel is to get to userspace, or, in other words, to go from ring 0 to ring 3. While one might expect that a pair of ring 3 GDT entries would be sufficient, it is more complicated. All of the following tasks must be completed:
- Add two new GDT entries (at least) configured for ring 3.
  - These entries are needed for the user's code and data segments (one each).
- Set up a barebones TSS with an ESP0 stack.
  - When an interrupt (be it fault, IRQ, or software interrupt) happens while the CPU is in user mode, the CPU needs to know where the kernel stack is located. This location is stored in the ESP0 (0 for ring 0) entry of the TSS.
- Set up an IDT entry for ring 3 system call interrupts (optional).
  - System calls are the way user code requests the kernel to do I/O and process management. For more information, see System Calls.
Requirements
- Ring 0 GDT and IDT
- IRQ handling
- Plans for multitasking with task switching
GDT
Following is an example of a GDT entry structure in C, utilizing bit-fields:
struct gdt_entry_bits {
unsigned int limit_low : 16;
unsigned int base_low : 24;
unsigned int accessed : 1;
unsigned int read_write : 1; // readable for code, writable for data
unsigned int conforming_expand_down : 1; // conforming for code, expand down for data
unsigned int code : 1; // 1 for code, 0 for data
unsigned int code_data_segment : 1; // should be 1 for everything but TSS and LDT
unsigned int DPL : 2; // privilege level
unsigned int present : 1;
unsigned int limit_high : 4;
unsigned int available : 1; // only used in software; has no effect on hardware
unsigned int long_mode : 1;
unsigned int big : 1; // 32-bit default operand size for code, 32-bit stack pointer for data
unsigned int gran : 1; // 1 if the limit is counted in 4 KiB pages, 0 if it is counted in bytes
unsigned int base_high : 8;
} __packed; // or `__attribute__((packed))` depending on compiler
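The layout of bit-fields is implementation-defined in C, so the structure above relies on the compiler allocating fields from the least significant bit upward (as GCC and Clang do on x86). As a cross-check, the same 8-byte descriptor can be assembled with explicit shifts. The following is a minimal sketch; `gdt_encode` is a hypothetical helper introduced here for illustration and is not part of the code in this article.

```c
#include <stdint.h>

/* Hypothetical helper: pack a segment descriptor by hand.
 * access = P | DPL | S | type (one byte)
 * flags  = G | D/B | L | AVL  (low nibble) */
static inline uint64_t gdt_encode(uint32_t base, uint32_t limit,
                                  uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);              /* bits 0-15:  limit 15:0  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;       /* bits 16-39: base 23:0   */
    d |= (uint64_t)access << 40;                  /* bits 40-47: access byte */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;   /* bits 48-51: limit 19:16 */
    d |= (uint64_t)(flags & 0xF) << 52;           /* bits 52-55: flags       */
    d |= (uint64_t)((base >> 24) & 0xFF) << 56;   /* bits 56-63: base 31:24  */
    return d;
}
```

For a flat ring 3 code segment (access byte 0xFA, flags 0xC, limit 0xFFFFF, base 0) this yields 0x00CFFA000000FFFF; the matching data segment uses access byte 0xF2. Comparing these values against the raw bytes of a bit-field entry is a quick way to confirm that the compiler lays the structure out as expected.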
Using this structure, two ring 3 segments, both with base of 0 and limit of 0xFFFFFFFF, can be added, as follows:
static struct gdt_entry_bits gdt[6]; // one null segment, two ring 0 segments, two ring 3 segments, TSS segment
// (ring 0 segments)
struct gdt_entry_bits *ring3_code = &gdt[3];
struct gdt_entry_bits *ring3_data = &gdt[4];
ring3_code->limit_low = 0xFFFF;
ring3_code->base_low = 0;
ring3_code->accessed = 0;
ring3_code->read_write = 1; // since this is a code segment, specifies that the segment is readable
ring3_code->conforming_expand_down = 0; // does not matter for ring 3 as no lower privilege level exists
ring3_code->code = 1;
ring3_code->code_data_segment = 1;
ring3_code->DPL = 3; // ring 3
ring3_code->present = 1;
ring3_code->limit_high = 0xF;
ring3_code->available = 1;
ring3_code->long_mode = 0;
ring3_code->big = 1; // it's 32 bits
ring3_code->gran = 1; // 4KB page addressing
ring3_code->base_high = 0;
*ring3_data = *ring3_code; // contents are similar so save time by copying
ring3_data->code = 0; // not code but data
write_tss(&gdt[5]); // the TSS descriptor goes at index 5 (the sixth entry); write_tss is defined below
flush_tss();
In actuality, the CPU can be put into user mode with just these two segments. However, it would then be impossible to return to ring 0 for system calls, faults, or even IRQs. That is where the TSS comes in.
The TSS
The TSS can be used for multitasking, though it is recommended to use software multitasking for these reasons:
- Software task switching is faster (usually)
- When you port your OS to a different CPU, it probably won't have the TSS, so you'll have to implement software task switching anyway
- x86 64-bit mode does not allow you to use the TSS for task switching (the main reason, especially if your goal is to reach 64-bit mode)
This guide will use software multitasking. Because of this the 32-bit TSS will contain a lot of junk we don't need. Here is the structure of the TSS:
struct tss_entry_struct {
uint32_t prev_tss; // The previous TSS - with hardware task switching these form a kind of backward linked list.
uint32_t esp0; // The stack pointer to load when changing to kernel mode.
uint32_t ss0; // The stack segment to load when changing to kernel mode.
// Everything below here is unused.
uint32_t esp1; // esp and ss 1 and 2 would be used when switching to rings 1 or 2.
uint32_t ss1;
uint32_t esp2;
uint32_t ss2;
uint32_t cr3;
uint32_t eip;
uint32_t eflags;
uint32_t eax;
uint32_t ecx;
uint32_t edx;
uint32_t ebx;
uint32_t esp;
uint32_t ebp;
uint32_t esi;
uint32_t edi;
uint32_t es;
uint32_t cs;
uint32_t ss;
uint32_t ds;
uint32_t fs;
uint32_t gs;
uint32_t ldt;
uint16_t trap;
uint16_t iomap_base;
} __packed;
typedef struct tss_entry_struct tss_entry_t;
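A 32-bit TSS without an I/O bitmap is 104 bytes long, and the CPU reads ESP0 and SS0 from fixed offsets. A few compile-time checks can catch a missing `packed` attribute or a mistyped field early. The following is a minimal sketch, assuming a GCC-compatible compiler; `struct tss32` and `TSS32_LIMIT` are illustrative names that restate the structure above in compact form.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact restatement of the TSS layout above (illustrative name). */
struct tss32 {
    uint32_t prev_tss;
    uint32_t esp0, ss0;
    uint32_t esp1, ss1, esp2, ss2;
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs, ldt;
    uint16_t trap, iomap_base;
} __attribute__((packed));

/* The descriptor limit is the offset of the last valid byte. */
enum { TSS32_LIMIT = sizeof(struct tss32) - 1 };

_Static_assert(sizeof(struct tss32) == 104, "32-bit TSS must be 104 bytes");
_Static_assert(offsetof(struct tss32, esp0) == 4, "ESP0 lives at offset 4");
_Static_assert(offsetof(struct tss32, ss0) == 8, "SS0 lives at offset 8");
_Static_assert(offsetof(struct tss32, iomap_base) == 102, "I/O map base at offset 102");
```

If any of these assertions fails, the build stops before a malformed TSS can be loaded.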
To set up this TSS structure, give it an initial esp0 stack with the correct ss0 segment.
// Note: some of the GDT entry struct field names may not match perfectly to the TSS entries.
tss_entry_t tss_entry;
void write_tss(struct gdt_entry_bits *g) {
    // Compute the base and limit of the TSS for use in the GDT entry.
    uint32_t base = (uint32_t) &tss_entry;
    uint32_t limit = sizeof tss_entry - 1; // the limit is the offset of the last valid byte
    // Add a TSS descriptor to the GDT.
    g->limit_low = limit & 0xFFFF;
    g->base_low = base & 0xFFFFFF;
    g->accessed = 1; // With a system entry (`code_data_segment` = 0), 1 indicates TSS and 0 indicates LDT
    g->read_write = 0; // For a TSS, indicates busy (1) or not busy (0).
    g->conforming_expand_down = 0; // always 0 for TSS
    g->code = 1; // For a TSS, indicates 32-bit (1) or 16-bit (0).
    g->code_data_segment = 0; // indicates TSS/LDT (see also `accessed`)
    g->DPL = 0; // ring 0, see the comments below
    g->present = 1;
    g->limit_high = (limit >> 16) & 0xF; // isolate top nibble
    g->available = 0; // 0 for a TSS
    g->long_mode = 0;
    g->big = 0; // should leave zero according to manuals.
    g->gran = 0; // limit is in bytes, not pages
    g->base_high = (base >> 24) & 0xFF; // isolate top byte (avoids shifting a signed constant)
    // Ensure the TSS is initially zeroed.
    memset(&tss_entry, 0, sizeof tss_entry);
    tss_entry.ss0 = REPLACE_KERNEL_DATA_SEGMENT; // Set the kernel stack segment.
    tss_entry.esp0 = REPLACE_KERNEL_STACK_ADDRESS; // Set the kernel stack pointer.
    // Note that CS is loaded from the IDT entry and should be the regular kernel code segment.
}
void set_kernel_stack(uint32_t stack) { // Sets the stack the CPU switches to when an interrupt occurs in ring 3
    tss_entry.esp0 = stack;
}
Finally, the implementation of the flush_tss function (Intel syntax):
; C declaration: void flush_tss(void);
global flush_tss
flush_tss:
mov ax, (5 * 8) | 0 ; GDT index 5 (each entry is 8 bytes), symbolically OR-ed with 0 to set the RPL (requested privilege level).
ltr ax
ret
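The selector arithmetic used in the assembly above, such as `(5 * 8) | 0`, follows from the selector format: the GDT index sits in bits 3 and up, bit 2 selects the GDT (0) or LDT (1), and the low two bits hold the RPL. A minimal sketch in C, where `make_selector` is a hypothetical helper that is not part of this article's code:

```c
#include <stdint.h>

/* Hypothetical helper: build a GDT selector from an index and an RPL.
 * Bit 2 (table indicator) is left at 0, meaning "look in the GDT". */
static inline uint16_t make_selector(uint16_t index, uint16_t rpl)
{
    return (uint16_t)((index << 3) | (rpl & 3));
}
```

With the GDT layout used here, `make_selector(5, 0)` gives 0x28 for the TSS, `make_selector(3, 3)` gives 0x1B for ring 3 code, and `make_selector(4, 3)` gives 0x23 for ring 3 data.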
At this point the kernel is ready to enter ring 3. It's worth noting that the DPL of the TSS descriptor in the GDT has nothing to do with the privilege level the task will run at: that depends on the DPL of the code segment used to set CS. The DPL of the TSS descriptor determines at which privilege level it is possible to CALL it, triggering a hardware context switch (32-bit only).
From the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3, Section 7.2.2 (TSS Descriptor):
In most systems, the DPLs of TSS descriptors are set to values less than 3, so that only privileged software can perform task switching. However, in multitasking applications, DPLs for some TSS descriptors may be set to 3 to allow task switching at the application (or user) privilege level.
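The access rule behind this quotation can be stated as a small predicate. The sketch below assumes the usual data-access style check for a JMP or CALL to a TSS descriptor, namely that both the CPL and the selector's RPL must be numerically less than or equal to the descriptor's DPL; `tss_switch_allowed` is a hypothetical name used only for illustration.

```c
#include <stdbool.h>

/* Hypothetical predicate: may code at privilege level `cpl`, using a
 * selector with requested privilege level `rpl`, CALL or JMP to a TSS
 * descriptor whose DPL is `tss_dpl`? Lower numbers are more privileged. */
static inline bool tss_switch_allowed(unsigned cpl, unsigned rpl, unsigned tss_dpl)
{
    unsigned effective = cpl > rpl ? cpl : rpl;  /* the weaker of the two */
    return effective <= tss_dpl;
}
```

With the DPL of 0 chosen in write_tss, ring 3 code cannot trigger a hardware task switch, which is the desired behaviour for a kernel that switches tasks in software.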
Entering Ring 3
The x86 is a tricky CPU. No matter how you approach it, there is no easy way to enter user mode. Nonetheless, below are three ways to enter user mode.
iret method
One of the ways to get to ring 3 is to make the processor think it was already in ring 3 to start with. This can be accomplished with an iret. Following is a simple example of this trick:
global jump_usermode
extern test_user_function
jump_usermode:
mov ax, (4 * 8) | 3 ; ring 3 data with bottom 2 bits set for ring 3
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax ; SS is handled by iret
; set up the stack frame iret expects
mov eax, esp
push (4 * 8) | 3 ; data selector
push eax ; current esp
pushf ; eflags
push (3 * 8) | 3 ; code selector (ring 3 code with bottom 2 bits set for ring 3)
push test_user_function ; instruction address to return to
iret
This will jump to test_user_function, which will then be running in user mode! Have test_user_function execute a cli or another privileged instruction and you'll be greeted by a General Protection Fault.
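The five pushes in the iret example build a fixed frame on the stack. Viewed from C, with the last value pushed at the lowest address, the frame looks roughly like the structure below. This is a sketch for illustration only; `struct iret_frame` and `make_user_frame` are hypothetical names, and the selector values assume the GDT layout used in this article.

```c
#include <stdint.h>

/* Frame popped by iret when returning to a less privileged ring,
 * listed from the lowest address (top of stack) upward. */
struct iret_frame {
    uint32_t eip;     /* pushed last:  instruction to resume at  */
    uint32_t cs;      /* code selector, RPL 3                    */
    uint32_t eflags;
    uint32_t esp;     /* user stack pointer                      */
    uint32_t ss;      /* pushed first: stack selector, RPL 3     */
};

/* Hypothetical helper: fill in a frame that enters ring 3 at `entry`. */
static inline struct iret_frame make_user_frame(uint32_t entry,
                                                uint32_t user_esp,
                                                uint32_t eflags)
{
    struct iret_frame f;
    f.eip = entry;
    f.cs = (3 * 8) | 3;      /* GDT index 3, ring 3 code */
    f.eflags = eflags;
    f.esp = user_esp;
    f.ss = (4 * 8) | 3;      /* GDT index 4, ring 3 data */
    return f;
}
```

A kernel that creates user tasks dynamically could place such a frame at the top of a new task's kernel stack and let the common interrupt return path execute the iret.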
sysexit method
The second way is to use the sysexit instruction as follows:
global jump_usermode
extern test_user_function
jump_usermode:
mov ax, (4 * 8) | 3 ; user data segment with RPL 3
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax ; sysexit sets SS
; set up wrmsr inputs (the value to write is taken from EDX:EAX)
xor edx, edx ; upper half of the value; not strictly necessary, but set to 0
mov eax, 0x8 ; the segments are computed as follows: CS=MSR+0x10 (0x8+0x10=0x18), SS=MSR+0x18 (0x8+0x18=0x20).
mov ecx, 0x174 ; MSR specifier: IA32_SYSENTER_CS
wrmsr ; set sysexit segments
; setup sysexit inputs
mov edx, test_user_function ; to be loaded into EIP
mov ecx, esp ; to be loaded into ESP
sysexit
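The selectors that sysexit loads are derived from the IA32_SYSENTER_CS value rather than taken from the stack, which is why the GDT entries must sit in a fixed order. The arithmetic from the comment in the code above can be sketched as follows; `sysexit_cs` and `sysexit_ss` are hypothetical helper names, and the sketch covers only the 32-bit (protected mode) form of the instruction.

```c
#include <stdint.h>

/* User code selector loaded by a 32-bit sysexit: MSR value + 16, RPL forced to 3. */
static inline uint16_t sysexit_cs(uint32_t sysenter_cs)
{
    return (uint16_t)((sysenter_cs + 16) | 3);
}

/* User stack selector loaded by a 32-bit sysexit: MSR value + 24, RPL forced to 3. */
static inline uint16_t sysexit_ss(uint32_t sysenter_cs)
{
    return (uint16_t)((sysenter_cs + 24) | 3);
}
```

With the MSR set to 0x8 as above, this gives 0x1B and 0x23, which match the ring 3 code and data selectors at GDT indices 3 and 4.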
sysret method
The third way is to use the sysret instruction as follows:
; note: this code is for 64-bit long mode only.
; it is unknown if it works in protected mode.
; using intel assembly style
global jump_usermode
extern test_user_function
jump_usermode:
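; enable the system call extensions, which enable sysret and syscall
mov rcx, 0xc0000082 ; IA32_LSTAR (syscall entry point); EDX:EAX must already hold its address here
wrmsr
mov rcx, 0xc0000080 ; IA32_EFER
rdmsr
or eax, 1 ; set SCE (system call extensions enable)
wrmsr
mov rcx, 0xc0000081 ; IA32_STAR
rdmsr
mov edx, 0x00180008 ; bits 63:48 = sysret selector base (0x18), bits 47:32 = syscall CS (0x08)
wrmsr
mov ecx, test_user_function ; to be loaded into RIP (zero-extended into RCX)
mov r11, 0x202 ; to be loaded into RFLAGS (interrupts enabled)
sysretq ; use "o64 sysret" if you assemble with NASM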
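As with sysexit, sysret computes the user selectors from an MSR, in this case bits 63:48 of IA32_STAR. The sketch below works through that arithmetic for the 64-bit form of the instruction; `sysret64_cs` and `sysret_ss` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* User code selector loaded by a 64-bit sysret: STAR[63:48] + 16, RPL forced to 3. */
static inline uint16_t sysret64_cs(uint64_t star)
{
    return (uint16_t)((((star >> 48) & 0xFFFF) + 16) | 3);
}

/* User stack selector loaded by sysret: STAR[63:48] + 8, RPL forced to 3. */
static inline uint16_t sysret_ss(uint64_t star)
{
    return (uint16_t)((((star >> 48) & 0xFFFF) + 8) | 3);
}
```

With the upper half of STAR set to 0x00180008, the base is 0x18, so the CPU loads CS = 0x2B and SS = 0x23. This implies a GDT with the user data descriptor at index 4 and a 64-bit user code descriptor at index 5, which is not the same layout as the 32-bit GDT built earlier in this article (where index 5 holds the TSS). A long mode kernel using this method would need to arrange its GDT accordingly.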
Multitasking considerations
There are a lot of subtle aspects of user mode and task switching. Whenever an interrupt (such as a system call interrupt) happens while the CPU is in ring 3, ESP0 is loaded into the stack pointer and the return state (SS, ESP, EFLAGS, CS, and EIP) is pushed before the interrupt handler is entered. This could become a problem with two ring 3 tasks. Imagine that the first task has been interrupted, so esp currently points into the ESP0 stack, and the kernel then switches to the second task. When an interrupt is received in the second task, esp is set to ESP0 again, which is the same stack, and the new frame overwrites the first task's saved state. To avoid overwriting data, the handler must change the ESP0 stack (along with the interrupt-pushed ESP stack) on each task switch.
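One common way to satisfy this requirement is to give every task its own kernel stack and to write that stack's top into ESP0 whenever the scheduler picks the task. The following is a minimal sketch of the idea that can be compiled and run in user space; `struct task`, `struct tss_min`, and `switch_to` are hypothetical names, and the actual register save and restore of a real context switch is deliberately omitted.

```c
#include <stdint.h>

/* Only the leading TSS fields matter for this sketch. */
struct tss_min {
    uint32_t prev_tss;
    uint32_t esp0;
    uint32_t ss0;
};

/* Hypothetical per-task state: each task owns a private kernel stack. */
struct task {
    uint32_t kernel_stack_top;
};

static struct tss_min tss;

static inline void set_kernel_stack(uint32_t stack)
{
    tss.esp0 = stack;
}

/* Called by the scheduler when `next` is about to run. */
static inline void switch_to(const struct task *next)
{
    /* The next ring 3 -> ring 0 transition will land on next's own stack. */
    set_kernel_stack(next->kernel_stack_top);
    /* A real kernel would also switch registers and the address space here. */
}
```

Because each task's interrupt frames then live on a separate stack, an interrupt taken in one task can no longer overwrite the state saved for another.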