When PS=0, the page table address field holds the physical address of the page table that manages the 4 MiB region covered by this entry. This address must be 4-KiB aligned, because the low 12 bits of the 32-bit entry hold the flags (access bits and the like) rather than part of the address. Similarly, when PS=1, the address must be 4-MiB aligned.
 
* PAT, or ''''P'''age '''A'''ttribute '''T'''able'. If [https://en.wikipedia.org/wiki/Page_attribute_table PAT] is supported, then PAT along with PCD and PWT shall indicate the memory caching type. Otherwise, it is reserved and must be set to 0.
* G, or ''''G'''lobal' tells the processor not to invalidate the TLB entry corresponding to the page upon a MOV to CR3 instruction. Bit 7 (PGE) in CR4 must be set to enable global pages.
* PS, or ''''P'''age '''S'''ize' stores the page size for that specific entry. If the bit is set, then the PDE maps to a page that is 4 MiB in size. Otherwise, it maps to a 4 KiB page table. Please note that 4-MiB pages require PSE to be enabled.
* D, or ''''D'''irty' is used to determine whether a page has been written to.
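
The alignment rules and flag bits described above can be sketched as two small helpers that build 32-bit (non-PAE) page directory entries. This is a minimal sketch: the helper and macro names are hypothetical, and returning 0 (a not-present entry) for a misaligned address is just one possible way to report the error.

<syntaxhighlight lang="C">
#include <stdint.h>

/* Flag bits of a 32-bit (non-PAE) page directory entry. */
#define PDE_PRESENT   (1u << 0)
#define PDE_RW        (1u << 1)
#define PDE_USER      (1u << 2)
#define PDE_PWT       (1u << 3)
#define PDE_PCD       (1u << 4)
#define PDE_ACCESSED  (1u << 5)
#define PDE_DIRTY     (1u << 6)   /* only meaningful when PS=1 */
#define PDE_PS        (1u << 7)   /* 1 = maps a 4 MiB page (requires CR4.PSE) */
#define PDE_GLOBAL    (1u << 8)   /* only meaningful when PS=1; requires CR4.PGE */

/* PS=0: point the PDE at a page table. The table must be 4 KiB aligned,
 * because the low 12 bits of the entry hold flags.
 * Returns 0 (not present) if the address is misaligned. */
uint32_t make_pde_table(uint32_t table_phys, uint32_t flags)
{
    if (table_phys & 0xFFFu)
        return 0;
    return table_phys | (flags & 0xFFFu & ~PDE_PS);
}

/* PS=1: map a 4 MiB page directly. The page must be 4 MiB aligned.
 * Bit 12 of flags is passed through as the PAT bit.
 * Returns 0 (not present) if the address is misaligned. */
uint32_t make_pde_4mib(uint32_t page_phys, uint32_t flags)
{
    if (page_phys & 0x3FFFFFu)
        return 0;
    return page_phys | (flags & 0x1FFFu) | PDE_PS;
}
</syntaxhighlight>
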
Say the kernel is loaded at 0x100000 but needs to be remapped to 0xC0000000. After loading the kernel, it initiates paging and sets up the appropriate tables (see [[Higher Half Kernel]]). After [[Identity Paging|identity paging]] the first megabyte, it needs to create a second page table (i.e. at entry #768 in the page directory, since 0xC0000000 >> 22 = 768) to map 0x100000 to 0xC0000000. The code may look like this:
 
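<syntaxhighlight lang="ASM">
mov eax, 0x0
mov ebx, 0x100000
.fill_table:
    mov ecx, ebx
    or ecx, 3                    ; present | read/write
    mov [table_768+eax*4], ecx   ; table_768: the page table for PDE #768 (label name assumed)
    add ebx, 4096
    inc eax
    cmp eax, 1024
    jne .fill_table
</syntaxhighlight>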
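
For comparison, here is a sketch of the same table fill in C, operating on an ordinary array so the arithmetic can be checked outside a kernel. The function names are hypothetical, and the entry flags 0x3 (present, read/write) are an assumed choice.

<syntaxhighlight lang="C">
#include <stdint.h>

#define PAGE_SIZE   4096u
#define PT_ENTRIES  1024u
#define PTE_FLAGS   0x3u   /* present | read/write (assumed) */

/* Index of the page directory entry that covers a virtual address. */
uint32_t pd_index(uint32_t vaddr)
{
    return vaddr >> 22;
}

/* Fill one page table so that it maps 4 MiB starting at phys_base:
 * entry i maps the physical page phys_base + i * 4 KiB. */
void fill_higher_half_table(uint32_t table[PT_ENTRIES], uint32_t phys_base)
{
    for (uint32_t i = 0; i < PT_ENTRIES; i++)
        table[i] = (phys_base + i * PAGE_SIZE) | PTE_FLAGS;
}
</syntaxhighlight>

Installing the table's physical address (with the same flags) in page directory entry <code>pd_index(0xC0000000)</code>, which is 768, then makes virtual address 0xC0000000 + ''x'' resolve to 0x100000 + ''x'' for any ''x'' below 4 MiB.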
 
== 64-Bit Paging ==
Enabling paging is simple: load CR3 with the physical address of the page directory, then set the paging (PG, bit 31) and protection (PE, bit 0) bits of CR0.
 
<syntaxhighlight lang="ASM">
mov eax, page_directory
mov cr3, eax

mov eax, cr0
or eax, 0x80000001
mov cr0, eax
</syntaxhighlight>
 
Note: setting the paging flag when the protection flag is clear causes a [[Exceptions#General_Protection_Fault|general protection exception]]. Also, once paging has been enabled, any attempt to enable long mode by setting LME (bit 8) of the [[CPU_Registers_x86-64#IA32_EFER|EFER register]] will trigger a [[Exceptions#General_Protection_Fault|GPF]]. CR0.PG must be cleared before EFER.LME can be set.
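
These rules can be written down as pure functions over register values, which may be handy in an emulator or in unit tests for boot code. A minimal sketch with hypothetical names; it models the checks only and never touches real control registers. Per the Intel SDM, #GP is raised by any change to EFER.LME while CR0.PG is set, which includes setting it.

<syntaxhighlight lang="C">
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE   (1u << 0)
#define CR0_PG   (1u << 31)
#define EFER_LME (1u << 8)

/* CR0 value with paging enabled: the 0x80000001 mask (PG | PE)
 * that the assembly above ORs into CR0. */
uint32_t cr0_with_paging(uint32_t cr0)
{
    return cr0 | CR0_PG | CR0_PE;
}

/* Writing a CR0 value with PG set but PE clear raises #GP. */
bool cr0_write_faults(uint32_t new_cr0)
{
    return (new_cr0 & CR0_PG) && !(new_cr0 & CR0_PE);
}

/* Changing EFER.LME while CR0.PG is set raises #GP. */
bool efer_write_faults(uint32_t cr0, uint64_t old_efer, uint64_t new_efer)
{
    return ((old_efer ^ new_efer) & EFER_LME) && (cr0 & CR0_PG);
}
</syntaxhighlight>
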
To enable PSE (4 MiB pages), set the PSE bit (bit 4) of CR4:
 
<syntaxhighlight lang="ASM">
mov eax, cr4
or eax, 0x00000010
mov cr4, eax
</syntaxhighlight>
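
With PSE enabled, a PDE that has PS set maps its whole 4 MiB region directly, so the page table level is skipped: bits 22-31 of the entry give the page frame and the low 22 bits of the virtual address are the offset. A sketch of that lookup, assuming plain 32-bit paging and ignoring PSE-36 (physical addresses above 4 GiB); the function name is hypothetical.

<syntaxhighlight lang="C">
#include <stdbool.h>
#include <stdint.h>

#define CR4_PSE     (1u << 4)   /* 0x00000010, the bit set in the assembly above */
#define PDE_PRESENT (1u << 0)
#define PDE_PS      (1u << 7)

/* Translate vaddr through a page directory whose entries may map 4 MiB
 * pages. Returns false if the PDE is not present, or if it refers to a
 * page table (PS=0 or PSE disabled), which needs a second-level lookup. */
bool translate_4mib(const uint32_t pd[1024], uint32_t cr4, uint32_t vaddr,
                    uint32_t *paddr)
{
    uint32_t pde = pd[vaddr >> 22];
    if (!(pde & PDE_PRESENT))
        return false;
    /* PS is only honored when CR4.PSE is set. */
    if (!(cr4 & CR4_PSE) || !(pde & PDE_PS))
        return false;
    *paddr = (pde & 0xFFC00000u) | (vaddr & 0x003FFFFFu);
    return true;
}
</syntaxhighlight>
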
 
=== 64-bit Paging ===
Enabling paging in long mode requires a few additional steps. Since it is not possible to enter long mode without paging with PAE active, the order in which the bits are enabled is important. First, paging must not be active (i.e. CR0.PG must be cleared). Then, CR4.PAE (bit 5) and EFER.LME (bit 8 of MSR 0xC0000080) are set. If 57-bit virtual addresses are to be enabled, CR4.LA57 (bit 12) is set as well. CR3 must be loaded with the physical address of the top-level paging structure (the PML4, or the PML5 when LA57 is set) before paging is turned on. Finally, CR0.PG is set to enable paging.
 
<syntaxhighlight lang="ASM">
; Skip these 3 lines if paging is already disabled
mov ebx, cr0
and ebx, ~(1 << 31)
mov cr0, ebx

; ...

hlt ; Done. Replace these lines with your own code
jmp reloadCS
</syntaxhighlight>
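
The required ordering can be modeled as a sequence of updates to simulated register values. This is only a sketch: <code>struct cpu_regs</code> and <code>enter_long_mode_paging</code> are hypothetical, and real code performs each step with <code>mov</code> to CR0, CR3 and CR4 and <code>wrmsr</code> to MSR 0xC0000080. Loading CR3 before setting CR0.PG is included because the processor uses the paging structures as soon as paging is enabled.

<syntaxhighlight lang="C">
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE    (1ull << 0)
#define CR0_PG    (1ull << 31)
#define CR4_PAE   (1ull << 5)
#define CR4_LA57  (1ull << 12)
#define EFER_LME  (1ull << 8)   /* EFER is MSR 0xC0000080 */

/* Hypothetical stand-in for the real registers. */
struct cpu_regs {
    uint64_t cr0, cr3, cr4, efer;
};

/* Apply the steps in the order long mode requires. */
void enter_long_mode_paging(struct cpu_regs *r, uint64_t top_table_phys,
                            bool use_la57)
{
    r->cr0 &= ~CR0_PG;              /* 1. paging must be off */
    r->cr4 |= CR4_PAE;              /* 2. enable PAE ... */
    if (use_la57)
        r->cr4 |= CR4_LA57;         /*    ... and optionally 5-level paging */
    r->efer |= EFER_LME;            /* 3. long mode enable */
    r->cr3 = top_table_phys;        /* 4. PML4 (or PML5) physical address */
    r->cr0 |= CR0_PG | CR0_PE;      /* 5. paging (and protection) on */
}
</syntaxhighlight>
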
 
Once paging has been enabled, you cannot switch from 4-level paging to 5-level paging (and vice-versa) directly. The same is true for switching to legacy 32-bit paging. You must first disable paging by clearing CR0.PG before making changes. Failure to do so will result in a [[Exceptions#General_Protection_Fault|general protection fault]].
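Many prefer to map the last PDE to the page directory itself. The page directory then looks like a page table to the system, so the directory appears at virtual address 0xFFFFF000 and the page table for PDE ''n'' appears at 0xFFC00000 + ''n'' × 0x1000. Getting the physical address of any virtual address in the range 0x00000000-0xFFFFF000 is then just a matter of: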
 
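<syntaxhighlight lang="C">
void *get_physaddr(void *virtualaddr) {
    unsigned long pdindex = (unsigned long)virtualaddr >> 22;
    unsigned long ptindex = (unsigned long)virtualaddr >> 12 & 0x03FF;
    unsigned long *pd = (unsigned long *)0xFFFFF000;
    // Here you need to check whether pd[pdindex] is present.
    unsigned long *pt = ((unsigned long *)0xFFC00000) + (0x400 * pdindex);
    // Here you need to check whether pt[ptindex] is present.
    return (void *)((pt[ptindex] & ~0xFFF) + ((unsigned long)virtualaddr & 0xFFF));
}
</syntaxhighlight>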
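
With the recursive entry in slot 1023, the virtual address through which any PDE or PTE can be read or written follows directly from the indices. A minimal sketch of that arithmetic, assuming 32-bit non-PAE paging; the function names are hypothetical.

<syntaxhighlight lang="C">
#include <stdint.h>

/* Virtual address of the PDE that maps vaddr (directory seen at 0xFFFFF000). */
uint32_t pde_vaddr(uint32_t vaddr)
{
    return 0xFFFFF000u + (vaddr >> 22) * 4u;
}

/* Virtual address of the PTE that maps vaddr (page tables seen from 0xFFC00000). */
uint32_t pte_vaddr(uint32_t vaddr)
{
    uint32_t pdindex = vaddr >> 22;
    uint32_t ptindex = (vaddr >> 12) & 0x3FFu;
    return 0xFFC00000u + pdindex * 0x1000u + ptindex * 4u;
}
</syntaxhighlight>

For 0xFFFFF000 both functions return 0xFFFFFFFC: the recursive slot is at the same time a PDE and a PTE.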
 
Mapping a virtual address to a physical address can be done as follows:
 
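<syntaxhighlight lang="C">
void map_page(void *physaddr, void *virtualaddr, unsigned int flags) {
    // Make sure that both addresses are page-aligned.
    unsigned long pdindex = (unsigned long)virtualaddr >> 22;
    unsigned long ptindex = (unsigned long)virtualaddr >> 12 & 0x03FF;
    // Check that pd[pdindex] (at 0xFFFFF000) is present; if not, create an empty PT first.
    unsigned long *pt = ((unsigned long *)0xFFC00000) + (0x400 * pdindex);
    // Check whether pt[ptindex] is already present (an existing mapping).
    pt[ptindex] = ((unsigned long)physaddr) | (flags & 0xFFF) | 0x01; // Present
    // Now you need to flush the entry in the TLB
    // or you might not notice the change.
}
</syntaxhighlight>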
 
Unmapping an entry is essentially the same as above, but instead of assigning a value to <code>pt[ptindex]</code>, you set it to 0x00000000 (i.e. not present) and then flush that entry from the TLB. When the entire page table is empty, you may want to remove it and mark the page directory entry 'not present'. Of course, you don't need the 'flags' or 'physaddr' arguments for unmapping.
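
The map and unmap logic, including releasing a page table once it is empty, can be exercised outside a kernel by replacing the recursive mapping with ordinary arrays. This is a sketch under that assumption: page tables come from a small static pool, a simulated PDE stores the pool index where a real one stores a physical address, and TLB flushing is not modeled. All names are hypothetical.

<syntaxhighlight lang="C">
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES     1024
#define MAX_TABLES  4
#define PG_PRESENT  0x1u
#define PG_RW       0x2u

static uint32_t pd[ENTRIES];                 /* simulated page directory */
static uint32_t tables[MAX_TABLES][ENTRIES]; /* simulated page tables */
static bool table_used[MAX_TABLES];

/* A simulated PDE keeps the pool index in bits 12+ and flags below. */
static uint32_t *pt_for(uint32_t pde) { return tables[pde >> 12]; }

bool map_page_sim(uint32_t phys, uint32_t virt, uint32_t flags)
{
    if ((phys | virt) & 0xFFFu)
        return false;                        /* both must be page-aligned */
    uint32_t pdi = virt >> 22, pti = (virt >> 12) & 0x3FFu;
    if (!(pd[pdi] & PG_PRESENT)) {           /* no page table yet: create one */
        int t = 0;
        while (t < MAX_TABLES && table_used[t])
            t++;
        if (t == MAX_TABLES)
            return false;
        table_used[t] = true;
        memset(tables[t], 0, sizeof tables[t]);
        pd[pdi] = ((uint32_t)t << 12) | PG_PRESENT | PG_RW;
    }
    uint32_t *pt = pt_for(pd[pdi]);
    if (pt[pti] & PG_PRESENT)
        return false;                        /* already mapped */
    pt[pti] = phys | (flags & 0xFFFu) | PG_PRESENT;
    return true;
}

void unmap_page_sim(uint32_t virt)
{
    uint32_t pdi = virt >> 22, pti = (virt >> 12) & 0x3FFu;
    if (!(pd[pdi] & PG_PRESENT))
        return;
    uint32_t *pt = pt_for(pd[pdi]);
    pt[pti] = 0;                             /* not present */
    for (int i = 0; i < ENTRIES; i++)
        if (pt[i] & PG_PRESENT)
            return;                          /* table still in use */
    table_used[pd[pdi] >> 12] = false;       /* table empty: release it */
    pd[pdi] = 0;                             /* and mark the PDE not present */
}

bool translate_sim(uint32_t virt, uint32_t *phys)
{
    uint32_t pdi = virt >> 22, pti = (virt >> 12) & 0x3FFu;
    if (!(pd[pdi] & PG_PRESENT))
        return false;
    uint32_t pte = pt_for(pd[pdi])[pti];
    if (!(pte & PG_PRESENT))
        return false;
    *phys = (pte & ~0xFFFu) | (virt & 0xFFFu);
    return true;
}
</syntaxhighlight>

In a kernel, each change would be followed by an INVLPG of the affected page, and a released page table would be handed back to the physical memory allocator.
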
== INVLPG ==
 
INVLPG is an instruction, available since the i486, that invalidates a single page in the TLB. Intel notes that this instruction may be implemented differently on future processors, but that this alternate behavior must be explicitly enabled. INVLPG modifies no flags.
 
NASM example:
 
<syntaxhighlight lang="ASM">
invlpg [0]
</syntaxhighlight>
 
Inline assembly for GCC (from Linux kernel source):
 
<syntaxhighlight lang="C">
static inline void __native_flush_tlb_single(unsigned long addr) {
    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}
</syntaxhighlight>
 
This only invalidates the page on the current processor. If you're using SMP, you'll need to send an IPI to the other processors so that they can also invalidate the page (this is called a TLB shootdown, and it is very slow), making sure to avoid any nasty race conditions. You may want to do this only when removing a mapping. When adding a mapping, you can instead let the page fault handler on another processor look through the page directory and, if the mapping is in fact present, invalidate the stale entry locally; again, race conditions must be avoided.
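
When a run of consecutive pages changes, INVLPG has to be issued once per page. A sketch of that walk: the invalidation primitive is passed in as a function pointer (in a kernel, a small wrapper that executes <code>invlpg</code> as in the inline assembly above), and a NULL pointer only counts pages so the loop can be tested in user space. The names are hypothetical, and this covers the current processor only; other processors still need a shootdown.

<syntaxhighlight lang="C">
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

typedef void (*invalidate_fn)(uintptr_t addr);

/* Invalidate every page overlapping [start, start + len) and return how
 * many pages that was. A NULL callback only counts (useful for testing). */
size_t flush_tlb_range(uintptr_t start, uintptr_t len, invalidate_fn invalidate)
{
    if (len == 0)
        return 0;
    uintptr_t mask = ~(uintptr_t)(PAGE_SIZE - 1);
    uintptr_t first = start & mask;
    uintptr_t last = (start + len - 1) & mask;
    size_t count = 0;
    /* Break on the last page instead of comparing against an end address,
     * so a range ending at the top of the address space cannot wrap. */
    for (uintptr_t page = first; ; page += PAGE_SIZE) {
        if (invalidate)
            invalidate(page);
        count++;
        if (page == last)
            break;
    }
    return count;
}
</syntaxhighlight>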
 
For memory efficiency, two or more processes can share pages as read-only. If one process were to write to its page, then a page fault would occur and the system could duplicate the page and then mark it as read-write. This is known as copy-on-write (COW). Copy-on-write allows the system to delay memory allocation until a process actually requires it, preventing unnecessary copying.
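
A sketch of the write-fault path for copy-on-write, assuming each physical frame carries a reference count. Everything here is a hypothetical user-space simulation: frames are tiny byte arrays and a <code>struct mapping</code> stands in for a page table entry. A kernel would additionally flush the TLB entry after changing the mapping.

<syntaxhighlight lang="C">
#include <stdbool.h>
#include <string.h>

#define NFRAMES   8
#define FRAME_SZ  64   /* tiny "pages" for the simulation */

static unsigned char frames[NFRAMES][FRAME_SZ];
static int refcount[NFRAMES];

struct mapping {
    int frame;      /* simulated physical frame */
    bool writable;  /* R/W bit */
};

static int alloc_frame(void)
{
    for (int f = 0; f < NFRAMES; f++)
        if (refcount[f] == 0) {
            refcount[f] = 1;
            return f;
        }
    return -1;
}

/* Share src's frame read-only with dst (e.g. when forking). */
void cow_share(struct mapping *src, struct mapping *dst)
{
    src->writable = false;
    *dst = *src;
    refcount[src->frame]++;
}

/* Handle a write fault on a read-only COW mapping.
 * Returns false if no frame is available for the copy. */
bool cow_write_fault(struct mapping *m)
{
    if (refcount[m->frame] == 1) {      /* last user: just make it writable */
        m->writable = true;
        return true;
    }
    int copy = alloc_frame();
    if (copy < 0)
        return false;
    memcpy(frames[copy], frames[m->frame], FRAME_SZ);
    refcount[m->frame]--;
    m->frame = copy;
    m->writable = true;                 /* a kernel would now INVLPG this page */
    return true;
}
</syntaxhighlight>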
 
== PAT ==
The Page Attribute Table determines caching attributes on a page granularity. This is similar to [[MTRR]]s, but those apply to physical addresses and are more limited.
 
The PAT is programmed through the IA32_PAT [[MSR]] (0x277). It has 8 entries, one per byte; each entry uses the low-order 3 bits of its byte, in standard little-endian order, so the low byte is PAT0 and the high byte is PAT7.
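
Packing and unpacking the eight entries is plain byte arithmetic. A minimal sketch with hypothetical names; actually loading the value into MSR 0x277 requires <code>wrmsr</code>, which is not shown.

<syntaxhighlight lang="C">
#include <stdint.h>

/* Memory types as encoded in a PAT entry (2 and 3 are reserved). */
enum pat_type {
    PAT_UC = 0, PAT_WC = 1, PAT_WT = 4, PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Read entry i (0..7): byte i of the MSR, low 3 bits only. */
unsigned pat_get(uint64_t msr, unsigned i)
{
    return (unsigned)(msr >> (i * 8)) & 0x7u;
}

/* Return msr with entry i replaced by type. */
uint64_t pat_set(uint64_t msr, unsigned i, enum pat_type type)
{
    unsigned shift = i * 8;
    return (msr & ~(0xFFull << shift)) | ((uint64_t)type << shift);
}

/* Build a full MSR value from entries[0] (PAT0) .. entries[7] (PAT7). */
uint64_t pat_pack(const enum pat_type entries[8])
{
    uint64_t msr = 0;
    for (unsigned i = 0; i < 8; i++)
        msr = pat_set(msr, i, entries[i]);
    return msr;
}
</syntaxhighlight>

Packing the default entries (WB, WT, UC-, UC, repeated once) reproduces the reset value 0x0007040600070406.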
 
The following are the different caching types.
{| class="wikitable" border="1"
|-
! Number
! Name
! Description
|-
| 0
| UC — Uncacheable
| All accesses are uncacheable. Write combining is not allowed. Speculative accesses are not allowed.
|-
| 1
| WC — Write-Combining
| All accesses are uncacheable. Write combining is allowed. Speculative reads are allowed.
|-
| 4
| WT — Writethrough
| Reads allocate cache lines on a cache miss. Cache lines are not allocated on a write miss.
Write hits update the cache and main memory.
|-
| 5
| WP — Write-Protect
| Reads allocate cache lines on a cache miss. All writes update main memory.
Cache lines are not allocated on a write miss. Write hits invalidate the cache
line and update main memory.
|-
| 6
| WB — Writeback
| Reads allocate cache lines on a cache miss, and can allocate to either the shared,
exclusive, or modified state. Writes allocate to the modified state on a cache miss.
|-
| 7
| UC- — Uncached
| Same as uncacheable, ''except'' that this can be overridden by Write-Combining MTRRs.
|}
 
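The PAT has a reset value of 0x0007040600070406, which ensures compatibility with non-PAT usage. It corresponds to the following entries, from PAT7 (high byte) down to PAT0 (low byte):
{| class="wikitable" border="1"
|-
! PAT7
! PAT6
! PAT5
! PAT4
! PAT3
! PAT2
! PAT1
! PAT0
|-
|UC
|UC-
|WT
|WB
|UC
|UC-
|WT
|WB
|}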
 
The PAT entry is selected by three bits of the page table entry, which form a 3-bit index (index = 4 × PAT + 2 × PCD + PWT):
{| class="wikitable" border="1"
|-
|PAT
|PCD
|PWT
|}
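The PAT bit is reserved when the PAT is not supported. The default value of the MSR ensures backwards compatibility with the PCD and PWT bits: with PAT=0, entries 0-3 are WB, WT, UC- and UC, which matches what the PCD and PWT combinations select without a PAT.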
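
Collecting the three bits from an entry can be sketched as follows, assuming the standard entry layout: PWT is bit 3 and PCD is bit 4, while the PAT bit is bit 7 in an entry that maps a 4 KiB page but bit 12 in a PDE (or PDPTE) that maps a large page, where bit 7 is PS. The names are hypothetical.

<syntaxhighlight lang="C">
#include <stdbool.h>
#include <stdint.h>

#define PTE_PWT       (1ull << 3)
#define PTE_PCD       (1ull << 4)
#define PTE_PAT_4K    (1ull << 7)   /* PAT bit in an entry mapping a 4 KiB page */
#define PDE_PAT_LARGE (1ull << 12)  /* PAT bit in a large-page PDE/PDPTE */

/* PAT entry index (0..7) selected by a paging entry: 4*PAT + 2*PCD + PWT. */
unsigned pat_index(uint64_t entry, bool large_page)
{
    uint64_t pat_bit = large_page ? PDE_PAT_LARGE : PTE_PAT_4K;
    return ((entry & pat_bit) ? 4u : 0u)
         | ((entry & PTE_PCD) ? 2u : 0u)
         | ((entry & PTE_PWT) ? 1u : 0u);
}
</syntaxhighlight>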
 
You will need to modify the PAT if you want Write-Combining cache, which is very useful for framebuffers.
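
One possible way to get Write-Combining, sketched under stated assumptions: because the reset value repeats entries 0-3 in entries 4-7, one of the upper entries can be repurposed for WC without losing any type reachable with PAT=0. Using entry 4 is an arbitrary choice here, and the names are hypothetical. Framebuffer pages would then set only the PAT bit (PCD=0, PWT=0) in their 4 KiB page table entries.

<syntaxhighlight lang="C">
#include <stdint.h>

#define PAT_MSR_RESET 0x0007040600070406ull
#define PAT_TYPE_WC   1u
#define WC_ENTRY      4u   /* arbitrary choice: by default a duplicate of entry 0 (WB) */

/* Bits of a 4 KiB page table entry that select PAT entry idx. */
uint32_t pte_cache_bits(unsigned idx)
{
    return ((idx & 4u) ? (1u << 7) : 0u)   /* PAT */
         | ((idx & 2u) ? (1u << 4) : 0u)   /* PCD */
         | ((idx & 1u) ? (1u << 3) : 0u);  /* PWT */
}

/* MSR value with entry WC_ENTRY switched to write-combining. */
uint64_t pat_with_wc(uint64_t msr)
{
    unsigned shift = WC_ENTRY * 8u;
    return (msr & ~(0xFFull << shift)) | ((uint64_t)PAT_TYPE_WC << shift);
}
</syntaxhighlight>

The new MSR value would be written with <code>wrmsr</code> to MSR 0x277 on every processor, so that they agree, before any mapping selects the new entry.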
 
== See Also ==
 
[[Category:Memory management]]
[[Category:Paging]]
[[Category:Virtual Memory]]
[[Category:Security]]
[[de:Paging]]