System Calls: Difference between revisions

Line 10:
For example, on i386, the Linux kernel receives the system call number in <code>eax</code> and its arguments in <code>ebx, ecx, edx, esi, edi, and ebp</code>, in that order. The ABI, however, places all arguments on the stack in reverse order. Linux proceeds to construct a <code>pt_regs</code> structure on the stack and passes a pointer to it to a C function that handles the call itself. This can be simplified into something like this:
 
<syntaxhighlight lang="asm">
Int128Handler:
; already on stack: ss, sp, flags, cs, ip.
; ...
add esp, 4
iretd
</syntaxhighlight>
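
The C side of such a handler can be sketched as follows. This is only an illustration: <code>struct syscall_frame</code>, <code>syscall_entry_c</code> and <code>sys_demo_add</code> are hypothetical names, and the structure layout is an assumption that does not match Linux's real <code>pt_regs</code>. The assembly stub is assumed to push the registers, pass a pointer to them, and pop them again afterwards, so writing to the saved <code>eax</code> sets the return value seen by user space.

<syntaxhighlight lang="c">
#include <errno.h>
#include <stdint.h>

/* Hypothetical register frame built by the assembly stub (NOT Linux's pt_regs). */
struct syscall_frame {
    uint32_t ebx, ecx, edx, esi, edi, ebp; /* arguments 1..6 */
    uint32_t eax;                          /* call number in, return value out */
};

/* Hypothetical demo call: returns the sum of its first two arguments. */
static int32_t sys_demo_add(uint32_t a, uint32_t b)
{
    return (int32_t)(a + b);
}

/* Called from the assembly stub with a pointer to the saved registers. */
void syscall_entry_c(struct syscall_frame *f)
{
    int32_t ret;

    switch (f->eax) {
    case 0:
        ret = sys_demo_add(f->ebx, f->ecx);
        break;
    default:
        ret = -ENOSYS; /* unknown call number */
        break;
    }

    /* The stub restores eax from the frame, so this becomes the result. */
    f->eax = (uint32_t)ret;
}
</syntaxhighlight>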
 
Many protected mode OSes use EAX to hold the function code. DOS uses the AX register: AH selects the service and AL the function within that service, or AH selects the function directly if there are no services. For example, suppose you have read() and write(), with code 1 for read() and code 2 for write(), both reached through interrupt 0A9h (an arbitrary choice, possibly wrong). You can write:
<syntaxhighlight lang="asm">
IntA9Handler:
CMP AH, 1
; ...
.done:
IRETD
</syntaxhighlight>
 
However, if all function codes are small contiguous numbers, a better option might be a function table, such as:
 
<syntaxhighlight lang="asm">
dispatch_syscall:
cmp eax, NR_syscalls
; ...
mov eax, -ENOSYS
ret
</syntaxhighlight>
 
Note that this assumes the syscall table contains no NULL entries. If there is a hole in the table, fill it with a pointer to a function that returns an error code!
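
The same idea can be sketched in C. The names below (<code>syscall_table</code>, <code>syscall_table_init</code>, <code>sys_ni</code> and the two demo handlers) are made up for illustration, and the three-argument signature is an assumption. Holes in the table are filled once at start-up, so the dispatcher only needs a bounds check and never calls through a NULL pointer.

<syntaxhighlight lang="c">
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long a, long b, long c);

/* Stub used for every hole in the table: "not implemented". */
static long sys_ni(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;
}

/* Hypothetical demo handlers; real ones would do actual work. */
static long sys_demo_read(long a, long b, long c)  { (void)a; (void)b; (void)c; return 100; }
static long sys_demo_write(long a, long b, long c) { (void)a; (void)b; (void)c; return 200; }

enum { NR_SYSCALLS = 8 };

/* Entries that are not listed start out as NULL (holes). */
static syscall_fn syscall_table[NR_SYSCALLS] = {
    [1] = sys_demo_read,
    [2] = sys_demo_write,
};

/* Replace every NULL entry with the error-returning stub. */
void syscall_table_init(void)
{
    for (size_t i = 0; i < NR_SYSCALLS; i++) {
        if (syscall_table[i] == NULL)
            syscall_table[i] = sys_ni;
    }
}

/* Bounds-check the call number, then jump through the table. */
long dispatch_syscall(unsigned long nr, long a, long b, long c)
{
    if (nr >= NR_SYSCALLS)
        return -ENOSYS;
    return syscall_table[nr](a, b, c);
}
</syntaxhighlight>

<code>syscall_table_init()</code> must run before the first call is dispatched. Taking the call number as an unsigned value means a negative number from user space fails the bounds check instead of indexing before the table.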
Line 80:
In 64-bit mode, the flags register is masked on entry according to the SFMASK MSR: every bit that is set in SFMASK is cleared in RFLAGS. The original RFLAGS value is saved in r11.
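
The effect of the mask on the flags can be modelled in a few lines of C. This is a simplified model for illustration only, not kernel code: <code>model_syscall_flags</code> and the structure are hypothetical names, and other side effects of the instruction are ignored. The bit positions of TF, IF and DF are the architectural ones.

<syntaxhighlight lang="c">
#include <stdint.h>

#define RFLAGS_TF (UINT64_C(1) << 8)   /* trap flag */
#define RFLAGS_IF (UINT64_C(1) << 9)   /* interrupt enable flag */
#define RFLAGS_DF (UINT64_C(1) << 10)  /* direction flag */

/* Flag-related state right after the kernel has been entered. */
struct syscall_flags_state {
    uint64_t rflags; /* flags the kernel starts running with */
    uint64_t r11;    /* copy of the caller's original flags */
};

/* Simplified model: save the old flags in r11, then clear every bit
 * that is set in the mask. */
struct syscall_flags_state model_syscall_flags(uint64_t rflags, uint64_t sfmask)
{
    struct syscall_flags_state s;
    s.r11 = rflags;
    s.rflags = rflags & ~sfmask;
    return s;
}
</syntaxhighlight>

A mask of <code>RFLAGS_TF | RFLAGS_IF | RFLAGS_DF</code>, for example, would make the kernel start with interrupts disabled, single-stepping off, and the direction flag cleared, whatever user space had set.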
 
Note that, although these instructions were introduced in pairs, there is no actual need to keep them paired. With a properly constructed stack frame, a system call that was started with <code>syscall</code> can be ended with <code>iret</code>, or else <code>sysret</code> might be used whenever an interrupt returns to user space. The sky is the limit!
 
The kernel can specify, as part of its syscall ABI, which registers are preserved and which are lost on SYSENTER or SYSCALL (with the exception of rcx and r11 on SYSCALL in 64-bit mode, which the CPU always overwrites with the return address and the saved flags). It then does not need to save all registers, only those specified as preserved. Most commonly, the C calling convention in use is followed. If user space enters the kernel through a tiny assembler stub that executes SYSENTER or SYSCALL, the C compiler will safeguard the caller-saved registers. The kernel entry point for SYSENTER or SYSCALL can then be another small assembler stub that avoids changing any callee-saved register before calling a C function for the syscall. That way, only the user space stack pointer (and rcx and r11 for SYSCALL in 64-bit mode) needs to be saved, as everything else is either preserved by the C compiler or allowed to be destroyed.
 
Note that for security reasons the kernel should zero all the registers that are not preserved across SYSENTER or SYSCALL so no information is accidentally leaked from kernel to userspace.
 
===Trap===
Line 94 ⟶ 98:
To use the gate, user-space code must use a far-call instruction. The offset will be ignored. Assuming the gate is the first entry in the GDT after the null descriptor, selector 0x0b will have to be requested (offset 8 and RPL 3):
 
<syntaxhighlight lang="asm">
call far 0x0b:0
</syntaxhighlight>
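
The selector value follows from the x86 selector layout: the descriptor index sits in bits 3 and up, bit 2 is the table indicator (0 for the GDT), and bits 0 and 1 hold the requested privilege level. A small helper makes the arithmetic explicit (<code>make_selector</code> is a hypothetical name):

<syntaxhighlight lang="c">
#include <stdint.h>

/* Build a segment selector from its three fields.
 * index: descriptor index in the table (each descriptor is 8 bytes)
 * ldt:   0 = GDT, 1 = LDT
 * rpl:   requested privilege level, 0..3 */
uint16_t make_selector(unsigned index, unsigned ldt, unsigned rpl)
{
    return (uint16_t)((index << 3) | ((ldt & 1u) << 2) | (rpl & 3u));
}
</syntaxhighlight>

With index 1 (byte offset 8 in the GDT) and RPL 3, this gives 8 + 3 = 0x0b, the selector used in the far call above.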
 
In 64-bit mode, the descriptor size is doubled, with the high half of the handler address directly after the rest of the descriptor described above. Also, the argument count has to be zero, and the second DWORD of the second descriptor has to be all zeros. Otherwise, no changes.
Line 134 ⟶ 138:
 
==Security/safety implications==
Since the kernel runs at a higher privilege level than the user mode code calling it, it is imperative to check everything. This is not merely paranoia for fear of malicious programs, but also protects your kernel from broken applications. It is therefore necessary to check that all arguments are in range and that all pointers are actual userland pointers (note that Linux apparently fails to do this). The kernel can write anywhere, but you would not want a specially crafted <code>read()</code> system call to overwrite the credentials of some process with zeroes (thus giving it root access).
 
As for making sure that pointers are in range, checking if they point to user or kernel memory can be difficult to do efficiently unless you are writing a [[Higher Half Kernel]]. For checking all user space accesses for being valid, you can either check with your [[Page Frame Allocation|Virtual Memory Manager]] to see if the requested bytes are mapped, or else you can just access them and handle the resulting page faults. Linux switched to doing the latter from version 2.6 onwards.
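
For a higher half kernel, the range check reduces to a comparison against the start of kernel space. The sketch below assumes a 32-bit address space with the kernel mapped from <code>0xC0000000</code> upwards; both <code>KERNEL_BASE</code> and <code>user_range_ok</code> are hypothetical names. It is written so that <code>addr + len</code> is never computed, because that sum could wrap around and make a kernel address look valid.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Assumed start of kernel space in a 32-bit higher half layout. */
#define KERNEL_BASE UINT32_C(0xC0000000)

/* Return true if the byte range [addr, addr + len) lies entirely
 * below KERNEL_BASE. Avoids computing addr + len, which can overflow. */
bool user_range_ok(uint32_t addr, uint32_t len)
{
    if (len > KERNEL_BASE)
        return false;                /* cannot possibly fit in user space */
    return addr <= KERNEL_BASE - len; /* equivalent to addr + len <= KERNEL_BASE */
}
</syntaxhighlight>

This only establishes that the range is in the user half of the address space. Whether the bytes are actually mapped still has to be settled by one of the two approaches described above.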
Line 160 ⟶ 164:
* [http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/x86-system-calls.html FreeBSD Developers' Handbook - System Calls] - Discusses System Calls in FreeBSD from the usermode perspective.
 
[[Category:System Calls]]
[[Category:OS theory]]