Ext4
While the ext4 filesystem originated as a series of patches to the ext3 filesystem, it was later rebranded as a dedicated filesystem that shares much of its design with ext2 and ext3. Like ext3, it supports journaling. Among the upgrades are larger maximums (file size, filesystem size, files per folder, folders per folder, etc.) and features inspired by existing filesystems such as XFS.
This information is based on the Ext4 implementation as of Linux 5.9rc3. Ext4 is subject to change, so one may wish to also check the latest kernel headers for new information.
Basic Concepts
Superblock
See Ext2 wiki page for an easier introduction to the concept of a superblock. All values are little endian unless otherwise specified.
Starting Byte | Ending Byte | Size | Description |
---|---|---|---|
0 | 3 | 4 | Total number of inodes in file system |
4 | 7 | 4 | Total number of blocks in file system |
8 | 11 | 4 | Number of reserved blocks |
12 | 15 | 4 | Total number of unallocated blocks |
16 | 19 | 4 | Total number of unallocated inodes |
20 | 23 | 4 | Block number of the block containing the superblock. This is 1 on 1024 byte block size filesystems, and 0 for all others. |
24 | 27 | 4 | log2 (block size) - 10. (In other words, the number to shift 1,024 to the left by to obtain the block size) |
28 | 31 | 4 | log2 (fragment size) - 10. (In other words, the number to shift 1,024 to the left by to obtain the fragment size) |
32 | 35 | 4 | Number of blocks in each block group |
36 | 39 | 4 | Number of fragments in each block group |
40 | 43 | 4 | Number of inodes in each block group |
44 | 47 | 4 | Last mount time (in POSIX time) |
48 | 51 | 4 | Last written time (in POSIX time) |
52 | 53 | 2 | Number of times the volume has been mounted since its last consistency check (fsck) |
54 | 55 | 2 | Number of mounts allowed before a consistency check (fsck) must be done |
56 | 57 | 2 | Magic signature (0xef53), used to help confirm the presence of Ext4 on a volume |
58 | 59 | 2 | File system state. |
60 | 61 | 2 | What to do when an error is detected |
62 | 63 | 2 | Minor portion of version (combine with Major portion below to construct full version field) |
64 | 67 | 4 | POSIX time of last consistency check (fsck) |
68 | 71 | 4 | Interval (in POSIX time) between forced consistency checks (fsck) |
72 | 75 | 4 | Operating system ID from which the filesystem on this volume was created (see below) |
76 | 79 | 4 | Major portion of version (combine with Minor portion above to construct full version field) |
80 | 81 | 2 | User ID that can use reserved blocks |
82 | 83 | 2 | Group ID that can use reserved blocks |
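As a rough illustration of the table above, the following Python sketch decodes a few of the base fields from a superblock buffer. The function name `parse_base_superblock` and the dictionary keys are invented for this example. It assumes the caller has already read the superblock from byte offset 1024 of the volume (as on Ext2), and it only uses the low 32 bits of the block count.

```python
import struct

EXT4_MAGIC = 0xEF53  # magic signature stored at bytes 56-57


def parse_base_superblock(raw: bytes) -> dict:
    """Decode a handful of base superblock fields (all little endian)."""
    if len(raw) < 84:
        raise ValueError("need at least the first 84 bytes of the superblock")
    magic, = struct.unpack_from("<H", raw, 56)
    if magic != EXT4_MAGIC:
        raise ValueError("bad magic 0x%04x" % magic)
    # Bytes 0-27 hold seven consecutive 32-bit fields.
    (inodes, blocks, _reserved, free_blocks, free_inodes,
     first_data_block, log_block_size) = struct.unpack_from("<7I", raw, 0)
    blocks_per_group, = struct.unpack_from("<I", raw, 32)
    inodes_per_group, = struct.unpack_from("<I", raw, 40)
    if blocks_per_group == 0:
        raise ValueError("blocks per group must not be zero")
    return {
        "inodes": inodes,
        "blocks": blocks,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        # The field stores log2(block size) - 10, so shift 1024 left by it.
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        # Round up: the last block group may be only partly filled.
        "group_count": (blocks - first_data_block + blocks_per_group - 1)
                       // blocks_per_group,
    }
```

Checking the magic signature before trusting any other field is a cheap way to reject buffers that do not hold an Ext2/3/4 superblock at all.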
These fields are for ext4 dynamic superblocks only. If a driver does not recognize a bit that is set in the required feature set, it must refuse to mount the filesystem. Filesystem checks, however, must abort on any unrecognized flag in either the optional or the required features.
Starting Byte | Ending Byte | Size in Bytes | Description |
---|---|---|---|
84 | 87 | 4 | First non-reserved inode in file system. |
88 | 89 | 2 | Size of each inode structure in bytes. |
90 | 91 | 2 | Block group that this superblock is part of for backup copies. |
92 | 95 | 4 | Optional features present. |
96 | 99 | 4 | Required features present. |
100 | 103 | 4 | Features that if not supported the volume must be mounted read-only. |
104 | 119 | 16 | File system UUID. |
120 | 135 | 16 | Volume name. |
136 | 199 | 64 | Path Volume was last mounted to. |
200 | 203 | 4 | Compression algorithm used. |
204 | 204 | 1 | Number of blocks to preallocate for files. |
205 | 205 | 1 | Number of blocks to preallocate for directories. |
206 | 207 | 2 | Number of reserved GDT entries for filesystem expansion. |
208 | 223 | 16 | Journal UUID. |
224 | 227 | 4 | Journal Inode. |
228 | 231 | 4 | Journal Device number. |
232 | 235 | 4 | Head of orphan inode list. |
236 | 251 | 16 | HTREE hash seed in an array of 32 bit integers. |
252 | 252 | 1 | Hash algorithm to use for directories. |
253 | 253 | 1 | Journal blocks field contains a copy of the inode's block array and size. |
254 | 255 | 2 | Size of group descriptors in bytes, for 64 bit mode. |
256 | 259 | 4 | Default mount options. |
260 | 263 | 4 | First metablock block group, if enabled. |
264 | 267 | 4 | Filesystem Creation Time. |
268 | 335 | 68 | Journal Inode Backup in an array of 32 bit integers. |
Valid if the 64bit feature is set.
Starting Byte | Ending Byte | Size in Bytes | Description |
---|---|---|---|
336 | 339 | 4 | High 32-bits of the total number of blocks. |
340 | 343 | 4 | High 32-bits of the total number of reserved blocks. |
344 | 347 | 4 | High 32-bits of the total number of unallocated blocks. |
348 | 349 | 2 | Minimum inode size. |
350 | 351 | 2 | Minimum inode reservation size. |
352 | 355 | 4 | Misc flags, such as sign of directory hash or development status. |
356 | 357 | 2 | Number of logical blocks read or written per disk in a RAID array (the stride). |
358 | 359 | 2 | Number of seconds to wait in Multiple Mount Protection checking. |
360 | 367 | 8 | Block number of the Multiple Mount Protection (MMP) block. |
368 | 371 | 4 | Number of blocks to read or write before returning to the current disk in a RAID array. Number of disks * stride. |
372 | 372 | 1 | log2 (groups per flex). (In other words, the number to shift 1 to the left by to obtain the number of block groups per flex block group) |
373 | 373 | 1 | Metadata checksum algorithm used. Linux only supports crc32c. |
374 | 374 | 1 | Encryption version level. |
375 | 375 | 1 | Reserved padding. |
376 | 383 | 8 | Number of kilobytes written over the filesystem's lifetime. |
384 | 387 | 4 | Inode number of the active snapshot. |
388 | 391 | 4 | Sequential ID of active snapshot. |
392 | 399 | 8 | Number of blocks reserved for active snapshot. |
400 | 403 | 4 | Inode number of the head of the disk snapshot list. |
404 | 407 | 4 | Number of errors detected. |
408 | 411 | 4 | First time an error occurred in POSIX time. |
412 | 415 | 4 | Inode number in the first error. |
416 | 423 | 8 | Block number in the first error. |
424 | 455 | 32 | Function where the first error occurred. |
456 | 459 | 4 | Line number where the first error occurred. |
460 | 463 | 4 | Most recent time an error occurred in POSIX time. |
464 | 467 | 4 | Inode number in the last error. |
468 | 475 | 8 | Block number in the last error. |
476 | 507 | 32 | Function where the most recent error occurred. |
508 | 511 | 4 | Line number where the most recent error occurred. |
512 | 575 | 64 | Mount options. (C-style string: characters terminated by a 0 byte) |
576 | 579 | 4 | Inode number for user quota file. |
580 | 583 | 4 | Inode number for group quota file. |
584 | 587 | 4 | Overhead blocks/clusters in filesystem. Zero means the kernel calculates it at runtime. |
588 | 595 | 8 | Block groups with backup Superblocks, if the sparse superblock flag is set. |
596 | 599 | 4 | Encryption algorithms used, as an array of unsigned char. |
600 | 615 | 16 | Salt for the `string2key` algorithm. |
616 | 619 | 4 | Inode number of the lost+found directory. |
620 | 623 | 4 | Inode number of the project quota tracker. |
624 | 627 | 4 | Checksum of the UUID, used for the checksum seed. (crc32c(~0, UUID)) |
628 | 628 | 1 | High 8-bits of the last written time field. |
629 | 629 | 1 | High 8-bits of the last mount time field. |
630 | 630 | 1 | High 8-bits of the Filesystem creation time field. |
631 | 631 | 1 | High 8-bits of the last consistency check time field. |
632 | 632 | 1 | High 8-bits of the first time an error occurred time field. |
633 | 633 | 1 | High 8-bits of the latest time an error occurred time field. |
634 | 634 | 1 | Error code of the first error. |
635 | 635 | 1 | Error code of the latest error. |
636 | 637 | 2 | Filename charset encoding. |
638 | 639 | 2 | Filename charset encoding flags. |
640 | 1019 | 380 | Padding. |
1020 | 1023 | 4 | Checksum of the superblock. |
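Several of the fields above only extend earlier 32-bit fields. A minimal sketch of how the halves might be recombined is shown below; the function names are hypothetical, and the code assumes `raw` holds the full 1024-byte superblock and that unused high bytes are zero on filesystems that never set them.

```python
import struct

REQUIRED_64BIT = 0x0080  # required feature flag "Filesystem uses 64 bit features"


def total_blocks(raw: bytes) -> int:
    """Combine the low (byte 4) and high (byte 336) halves of the block count."""
    low, = struct.unpack_from("<I", raw, 4)
    required, = struct.unpack_from("<I", raw, 96)
    if required & REQUIRED_64BIT:
        high, = struct.unpack_from("<I", raw, 336)
        return (high << 32) | low
    # Without the 64bit feature the high half is not valid and is ignored.
    return low


def last_mount_time(raw: bytes) -> int:
    """Extend the 32-bit mount time (byte 44) with its high 8 bits (byte 629)."""
    low, = struct.unpack_from("<I", raw, 44)
    high, = struct.unpack_from("<B", raw, 629)
    return (high << 32) | low
```

The same pattern applies to the other low/high pairs, such as the reserved and unallocated block counts and the remaining timestamps.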
Required Features:
Flag Value | Description |
---|---|
0x00001 | Compression is used. |
0x00002 | Directory entries contain a type field. |
0x00004 | Filesystem needs to replay the Journal to recover data. |
0x00008 | Filesystem uses a journal device. |
0x00010 | Filesystem uses Meta Block Groups. |
0x00040 | Filesystem uses extents for files. |
0x00080 | Filesystem uses 64 bit features. |
0x00100 | Filesystem uses Multiple Mount Protection. |
0x00200 | Filesystem uses Flex Block Groups. |
0x00400 | Filesystem uses Extended Attributes in Inodes. |
0x01000 | Filesystem uses Data in Directory Entries. This is not implemented as of Linux 5.9rc3. |
0x02000 | Filesystem stores the metadata checksum seed in the superblock. This allows for changing the UUID without rewriting all of the metadata blocks. |
0x04000 | Directories may be larger than 4GiB and have a maximum HTREE depth of 3. |
0x08000 | Data may be stored in the inode. See Inline Data for a discussion of this feature. |
0x10000 | Filesystem uses Encryption. |
0x20000 | Filesystem uses case folding, storing the filesystem-wide encoding in inodes. |
Optional Features:
Flag Value | Description |
---|---|
0x0001 | Preallocate some number of blocks (see byte 205 in the superblock) to a directory when creating a new one. |
0x0002 | Possibly unused, "imagic inodes" |
0x0004 | Filesystem uses a Journal |
0x0008 | Inodes have Extended Attributes. |
0x0010 | Filesystem can resize itself for larger partitions. |
0x0020 | Directories use hash index. |
0x0200 | Backup the superblock in other block groups. |
0x0800 | Inode numbers do not change during resize. |
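The mounting rules for the three feature sets can be sketched as a small decision function. This is only an illustration: `mount_mode` and its return strings are invented here, and the sets of supported flags are whatever a given driver chooses to implement.

```python
def mount_mode(required: int, read_only: int,
               supported_required: int, supported_read_only: int) -> str:
    """Decide how a driver may mount, given the superblock feature fields.

    required  -- value of the "Required features present" field (byte 96)
    read_only -- value of the read-only feature field (byte 100)
    The optional features field never prevents mounting, so it is not needed.
    """
    if required & ~supported_required:
        return "refuse"       # unknown required feature: must not mount
    if read_only & ~supported_read_only:
        return "read-only"    # unknown read-only feature: no writes allowed
    return "read-write"


# Example: a driver that understands directory entry types, extents,
# 64 bit features and flex block groups from the required feature table.
DRIVER_REQUIRED = 0x00002 | 0x00040 | 0x00080 | 0x00200
```

For instance, `mount_mode(0x00042, 0, DRIVER_REQUIRED, 0)` allows a read-write mount, while setting the encryption flag `0x10000` in the first argument makes the same call return `"refuse"`.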
Block Group Descriptor
See Ext2 wiki page for an introduction to block group descriptor tables.
In Ext4, the block group descriptors keep their Ext2 role of recording the location of important data structures, and also support new features such as flex block groups and meta block groups.
In a flex block group, several block groups are tied together into one logical group. Because each group descriptor records the location of both bitmaps and the inode table, these structures can be placed together for the whole flex group, which improves data locality.
In a meta block group, the filesystem is partitioned into multiple 'metablock' groups, each sized so that its block group descriptors fit in a single block. This strategy also increases the maximum filesystem size from 256TiB to 512PiB.
Locating the Block Group Descriptors
Locating the Block Group Descriptors is similar to Ext2, except for metablock and flex block group locations.
One can check for flex block groups by checking the required option 'Flex Block Group'. The structure for that is found in the superblock data structure.
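For the plain case, the byte offset of a descriptor can be computed as in the sketch below. It assumes the Ext2-style layout, where the descriptor table starts in the block immediately after the one holding the superblock, and it deliberately ignores meta block groups. The function names are hypothetical.

```python
REQUIRED_64BIT = 0x0080  # required feature flag "Filesystem uses 64 bit features"


def descriptor_size(required_features: int, sb_desc_size: int) -> int:
    """Descriptors are 32 bytes unless the 64bit feature gives another size."""
    if required_features & REQUIRED_64BIT:
        return sb_desc_size  # superblock bytes 254-255
    return 32


def descriptor_offset(group: int, block_size: int,
                      first_data_block: int, desc_size: int = 32) -> int:
    """Byte offset of a group's descriptor, assuming no meta block groups."""
    table_start_block = first_data_block + 1  # block after the superblock
    return table_start_block * block_size + group * desc_size
```

With 1024-byte blocks the superblock lives in block 1, so the table starts at byte 2048; with larger blocks the superblock lives in block 0 and the table starts at the beginning of block 1.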
Flex block group info structure (these fields are atomic integers):
Starting Byte | Ending Byte | Size in Bytes | Description |
---|---|---|---|
0 | 7 | 8 | Atomic 64 bit free clusters. |
8 | 11 | 4 | Atomic free inodes. |
12 | 15 | 4 | Atomic used directories. |
Block Group Descriptor
Block group descriptor Structure:
Starting Byte | Ending Byte | Size in Bytes | Description |
---|---|---|---|
0 | 3 | 4 | Low 32bits of block address of block usage bitmap. |
4 | 7 | 4 | Low 32bits of block address of inode usage bitmap. |
8 | 11 | 4 | Low 32bits of starting block address of inode table. |
12 | 13 | 2 | Low 16bits of number of unallocated blocks in group. |
14 | 15 | 2 | Low 16bits of number of unallocated inodes in group. |
16 | 17 | 2 | Low 16bits of number of directories in group. |
18 | 19 | 2 | Block group features present. |
20 | 23 | 4 | Low 32-bits of block address of snapshot exclude bitmap. |
24 | 25 | 2 | Low 16-bits of Checksum of the block usage bitmap. |
26 | 27 | 2 | Low 16-bits of Checksum of the inode usage bitmap. |
28 | 29 | 2 | Low 16-bits of number of unused inodes in the inode table. This allows us to optimize inode searching. |
30 | 31 | 2 | Checksum of the block group, CRC16(UUID+group+desc). |
These fields are valid if the 64bit feature is set and the superblock's group descriptor size is greater than 32.
Starting Byte | Ending Byte | Size in Bytes | Description |
---|---|---|---|
32 | 35 | 4 | High 32-bits of block address of block usage bitmap. |
36 | 39 | 4 | High 32-bits of block address of inode usage bitmap. |
40 | 43 | 4 | High 32-bits of starting block address of inode table. |
44 | 45 | 2 | High 16-bits of number of unallocated blocks in group. |
46 | 47 | 2 | High 16-bits of number of unallocated inodes in group. |
48 | 49 | 2 | High 16-bits of number of directories in group. |
50 | 51 | 2 | High 16-bits of number of unused inodes in the inode table. |
52 | 55 | 4 | High 32-bits of block address of snapshot exclude bitmap. |
56 | 57 | 2 | High 16-bits of checksum of the block usage bitmap. |
58 | 59 | 2 | High 16-bits of checksum of the inode usage bitmap. |
60 | 63 | 4 | Reserved as of Linux 5.9rc3. |
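The split between low and high halves in the two tables above can be handled as in the following sketch. `parse_group_descriptor` and its dictionary keys are invented for this example, and only the location and counter fields are decoded; the descriptor fields are assumed to be little endian like the superblock.

```python
import struct


def parse_group_descriptor(raw: bytes, desc_size: int = 32) -> dict:
    """Decode location and counter fields of one block group descriptor."""
    (block_bitmap, inode_bitmap, inode_table,
     free_blocks, free_inodes, used_dirs, flags) = struct.unpack_from(
        "<3I4H", raw, 0)
    if desc_size > 32:
        # The high halves start at byte 32 and mirror the low fields.
        (bb_hi, ib_hi, it_hi,
         fb_hi, fi_hi, ud_hi) = struct.unpack_from("<3I3H", raw, 32)
        block_bitmap |= bb_hi << 32
        inode_bitmap |= ib_hi << 32
        inode_table |= it_hi << 32
        free_blocks |= fb_hi << 16
        free_inodes |= fi_hi << 16
        used_dirs |= ud_hi << 16
    return {
        "block_bitmap": block_bitmap,
        "inode_bitmap": inode_bitmap,
        "inode_table": inode_table,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        "used_dirs": used_dirs,
        "flags": flags,
    }
```

Passing the descriptor size explicitly keeps the function usable for both 32-byte descriptors and the larger 64bit ones.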
Block group flags:
Flag Value | Description |
---|---|
0x0001 | Block group's inode bitmap/table is unused. |
0x0002 | Block group's block bitmap is unused. |
0x0004 | Block group's inode table is zeroed. |
Multiple Mount Protection
Multiple Mount Protection (MMP) protects a filesystem against being mounted by several hosts at once, which would lead to dangerous data races. This feature writes a sequence number into the block referenced by the MMP block superblock field.
To check for MMP, check the required feature flag Multiple Mount Protection, and the magic field in the block referenced by the MMP block superblock field.
In MMP, the driver checks the sequence number in the MMP block. If the sequence number is the 'fsck running' code or any unknown code above the maximum MMP sequence value, the filesystem is not safe to mount, even if the timestamp is outdated.
While running, the driver checks the MMP block sequence number at the interval specified in the MMP check superblock field. If it does not match the in-memory sequence number, a different host has mounted the filesystem, and the driver must remount it read-only. If it does match, the driver increments the number in memory and on disk. The driver also writes the hostname and mount path into the MMP block when open() succeeds.
The minimum MMP check interval is 5 seconds and the maximum is 300 seconds.
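The checks described above might look roughly like the sketch below. The three sequence constants are written from memory of the Linux ext4 header and should be verified against it. Wrapping the counter back to 1 after the maximum is likewise an assumption, and all function names are hypothetical.

```python
# Assumed values; check them against the Linux ext4 header before use.
MMP_SEQ_CLEAN = 0xFF4D4D50  # written on a clean unmount
MMP_SEQ_FSCK = 0xE24D4D50   # written while a filesystem check is running
MMP_SEQ_MAX = 0xE24D4D4F    # largest ordinary sequence number

MMP_MIN_INTERVAL = 5    # seconds
MMP_MAX_INTERVAL = 300  # seconds


def mmp_forbids_mount(seq: int) -> bool:
    """True if the on-disk sequence number alone makes mounting unsafe."""
    if seq == MMP_SEQ_FSCK:
        return True
    # Codes above the maximum are special; only the clean code is known.
    return seq > MMP_SEQ_MAX and seq != MMP_SEQ_CLEAN


def clamp_check_interval(seconds: int) -> int:
    """Keep the check interval inside the allowed 5 to 300 second range."""
    return max(MMP_MIN_INTERVAL, min(MMP_MAX_INTERVAL, seconds))


def periodic_check(on_disk_seq: int, in_memory_seq: int):
    """One periodic check: returns (action, new in-memory sequence number)."""
    if on_disk_seq != in_memory_seq:
        # Another host has written the MMP block since our last update.
        return "remount-read-only", in_memory_seq
    next_seq = in_memory_seq + 1
    if next_seq > MMP_SEQ_MAX:
        next_seq = 1  # assumed wrap-around behaviour
    return "ok", next_seq
```

A real driver would also wait and re-read the block before the first mount, and write the new sequence number back to disk after each successful check; both steps are omitted here.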
MMP structure
Starting Byte | Ending Byte | Size in Bytes | Description |
---|---|---|---|
0 | 3 | 4 | MMP signature (0x004d4d50) |
4 | 7 | 4 | Sequence Number (see below) |
8 | 15 | 8 | Last updated time (does not affect algorithm) |
16 | 79 | 64 | Hostname of the system that opened the filesystem (does not affect algorithm) |
80 | 111 | 32 | Mount path on the system that opened the filesystem (does not affect algorithm) |
112 | 113 | 2 | Interval to check MMP block |
114 | 115 | 2 | Padding. |
116 | 1019 | 904 | Padding. |
1020 | 1023 | 4 | Checksum (crc32c(UUID+MMP Block number)) |
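A minimal sketch of decoding this structure follows. It assumes the fields are little endian like the superblock and that the two name fields are NUL-padded strings. `parse_mmp_block` is a made-up name, and the checksum is returned without being verified.

```python
import struct

MMP_MAGIC = 0x004D4D50  # MMP signature from the table above


def parse_mmp_block(raw: bytes) -> dict:
    """Decode the MMP block; raises ValueError if the signature is wrong."""
    magic, sequence, updated = struct.unpack_from("<IIQ", raw, 0)
    if magic != MMP_MAGIC:
        raise ValueError("not an MMP block")
    interval, = struct.unpack_from("<H", raw, 112)
    checksum, = struct.unpack_from("<I", raw, 1020)
    return {
        "sequence": sequence,
        "updated": updated,
        # Keep only the bytes before the first NUL in each name field.
        "hostname": raw[16:80].split(b"\0", 1)[0].decode("ascii", "replace"),
        "mount_path": raw[80:112].split(b"\0", 1)[0].decode("ascii", "replace"),
        "interval": interval,
        "checksum": checksum,
    }
```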
Journaling
See Journaling for a high level description of a filesystem journal.
Ext4 uses the Jbd2 Journaling layer.
Ext4 defines the journal inode as inode 8. The superblock contains a 68-byte backup copy of the journal inode's block array and size. The journal is a hidden file in the filesystem, usually occupying an entire block group, and is preferably placed in the middle of the volume.
The optional filesystem journal feature protects against filesystem corruption if the system crashes. The driver first writes important data to the journal, a small contiguous sliver of disk. Once this is flushed to disk, the driver writes a commit record to the journal. Later, the driver can write the transactions to their final locations on disk. If the system crashes during this second write, the driver can simply replay the journal up to the last sync. If the write succeeds, the write's record is removed from the journal.
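The write-twice idea can be illustrated with a toy in-memory model. This is not the Jbd2 on-disk format: the class `ToyJournal`, its methods, and the `crash_after` parameter are all invented purely to show why replaying committed transactions recovers from a crash during the second write.

```python
class ToyJournal:
    """In-memory model of journal, commit record, checkpoint and replay."""

    def __init__(self):
        self.disk = {}     # final locations: block number -> data
        self.journal = []  # pending transactions, oldest first

    def log(self, blocks: dict) -> dict:
        """First write: copy the blocks into the journal, then commit."""
        txn = {"blocks": dict(blocks), "committed": False}
        self.journal.append(txn)
        txn["committed"] = True  # stands in for the commit record
        return txn

    def checkpoint(self, txn: dict, crash_after=None) -> bool:
        """Second write: copy blocks to their final locations."""
        for count, (block, data) in enumerate(txn["blocks"].items()):
            if crash_after is not None and count >= crash_after:
                return False  # simulated crash; journal entry is kept
            self.disk[block] = data
        self.journal.remove(txn)  # success: drop the record
        return True

    def replay(self) -> None:
        """After a crash, redo every committed transaction in order."""
        for txn in list(self.journal):
            if txn["committed"]:
                self.disk.update(txn["blocks"])
            self.journal.remove(txn)
```

For example, logging two blocks, calling `checkpoint(txn, crash_after=1)` and then `replay()` leaves both blocks on the toy disk even though the first attempt only wrote one of them.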
The default journaling strategy is 'ordered', which writes only filesystem metadata through the journal. If stronger guarantees are preferred, the filesystem can use the 'journal' strategy, which writes both metadata and data through the journal at the cost of slower operation. The filesystem can also use the 'writeback' strategy, where data is not flushed to disk before a metadata update.
Ext4 may also use a separate journal device, specified by the superblock's journal UUID. The separate journal device has 1024 bytes of padding, then an ext4 superblock with a matching UUID. The journal follows in the next complete block.
Journal Superblock Fields
Ext4 uses Jbd2 as the journal.
All fields are big endian unless otherwise specified.
Starting Byte | Ending Byte | Size | Description |
---|---|---|---|
0 | 11 | 12 | Journal header (see below) |
12 | 15 | 4 | Block size of the journal device. |
16 | 19 | 4 | Total number of blocks in the journal device. |
20 | 23 | 4 | First block of journal information. |
24 | 27 | 4 | First journal transaction expected. |
28 | 31 | 4 | First block of the journal. |
32 | 35 | 4 | Errno, if the journal has an error. |
36 | 39 | 4 | Optional features present. |
40 | 43 | 4 | Required features present. |
44 | 47 | 4 | Features that if not supported the journal must be mounted read-only. There are no read-only features as of Linux 5.9rc3. |
48 | 63 | 16 | Journal UUID. |
64 | 67 | 4 | Number of filesystems using this journal. |
68 | 71 | 4 | Block number of the journal superblock copy. |
72 | 75 | 4 | Maximum journal blocks per transaction. This is unused as of Linux 5.9rc3. |
76 | 79 | 4 | Maximum data blocks per transaction. This is unused as of Linux 5.9rc3. |
80 | 80 | 1 | Checksum algorithm. |
81 | 83 | 3 | Padding. |
84 | 251 | 168 | Padding. |
252 | 255 | 4 | Checksum of the journal superblock. |
256 | 1023 | 768 | UUIDs of the filesystems sharing this journal (16 bytes each). |
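Unlike the Ext4 superblock, these fields are big endian, which is easy to get wrong. The sketch below decodes a few of them; `parse_journal_superblock` and its keys are hypothetical, the magic value is the journal header signature, and the code assumes the 768-byte area at byte 256 holds consecutive 16-byte UUIDs.

```python
import struct
import uuid

JBD2_MAGIC = 0xC03B3998  # journal header magic signature


def parse_journal_superblock(raw: bytes) -> dict:
    """Decode selected journal superblock fields (all big endian)."""
    magic, block_type, _transaction = struct.unpack_from(">3I", raw, 0)
    if magic != JBD2_MAGIC:
        raise ValueError("not a journal block")
    # Bytes 12-35: six consecutive 32-bit fields.
    (block_size, total_blocks, first_info_block,
     first_transaction, start_block, error) = struct.unpack_from(">6I", raw, 12)
    user_count, = struct.unpack_from(">I", raw, 64)
    # The 768-byte area has room for at most 48 UUIDs of 16 bytes each.
    users = [uuid.UUID(bytes=bytes(raw[256 + 16 * i:272 + 16 * i]))
             for i in range(min(user_count, 48))]
    return {
        "block_type": block_type,
        "block_size": block_size,
        "total_blocks": total_blocks,
        "first_info_block": first_info_block,
        "first_transaction": first_transaction,
        "start_block": start_block,
        "error": error,
        "journal_uuid": uuid.UUID(bytes=bytes(raw[48:64])),
        "users": users,
    }
```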
Journal Header
All fields are big endian unless otherwise specified.
Starting Byte | Ending Byte | Size | Description |
---|---|---|---|
0 | 3 | 4 | Magic signature (0xc03b3998) |
4 | 7 | 4 | Block Type (see below) |
8 | 11 | 4 | Journal transaction for this block. |
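Since every journal metadata block begins with this 12-byte header, a simple way to explore a journal image is to walk it block by block and report the blocks that carry the magic signature. The helper names below are made up, and blocks without the signature are simply skipped.

```python
import struct

JBD2_MAGIC = 0xC03B3998  # magic signature from the table above


def parse_journal_header(block: bytes):
    """Return (block type, transaction) or None if the magic is absent."""
    if len(block) < 12:
        return None
    magic, block_type, transaction = struct.unpack_from(">3I", block, 0)
    if magic != JBD2_MAGIC:
        return None
    return block_type, transaction


def scan_journal(journal: bytes, block_size: int) -> list:
    """List (block index, block type, transaction) for each header found."""
    found = []
    for index in range(len(journal) // block_size):
        start = index * block_size
        header = parse_journal_header(journal[start:start + block_size])
        if header is not None:
            found.append((index,) + header)
    return found
```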
See Also
External Links
- ext4 Data Structures and Algorithms - detailed description of ext4 from the Linux kernel documentation.
- Linux ext4 header - This header contains useful definitions and declarations.