Opcode syntax

From OSDev.wiki

AT&T syntax, as understood by GAS (the GNU assembler), is the standard syntax on most non-Intel platforms but remains comparatively rare on x86. It is nevertheless the default for GCC Inline Assembly, and it is what objdump prints by default, so you will meet it when debugging your kernel.

NASM and FASM use Intel syntax, and Intel syntax is what the Bochs debugger will provide you when debugging your kernel.

Important Details

There are some substantial differences between AT&T syntax and Intel syntax that a programmer intending to use the GNU tools should be aware of. Here are the key things to look for when moving to AT&T syntax:

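  • Case Sensitivity: Symbol names are case-sensitive: FOO is not the same as foo. (On x86, GAS accepts instruction mnemonics and register names in either case, but lowercase is the convention.)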
  • Numerical Base: expressed as in C: 1 for decimal, 01 for octal, 0x01 for hex. (The Intel postfix-h syntax is not supported.)
  • Escapes: Special characters are written as C-style escapes (\n, \", \#, \\, ...).
  • Comments: Either C-style ( /* ... */ ) or shell style (# ...).
  • Directive syntax: Directives begin with a period (".align 4" to align on a 32-bit boundary, ".word 0x1234" is the equivalent of "DW 1234h").
  • Strings: Defined using special directives, .ascii (or .asciz for a zero-terminated string). Example: msg: .ascii "Hello, World!\n"
  • Current location address: Indicated by a period (".", equivalent to Intel syntax "$").
  • Initializing Memory: Done with .fill repeat, size, value (roughly equivalent to Intel syntax 'times db'). Example: .fill 0x1fe - (. - START), 1, 0 (where '1' is the size in bytes of each filled unit, '0' is the fill value, and START is a label marking the entry point of the code) is equivalent to Intel syntax times 1FEh - ($-$$) db 0. (The .skip and .space directives can be used in a similar manner.)
  • Code Counter: Can be set multiple times using the .org directive (as in .org 0x1fe + START, where START is a label marking the entry point of the code). The location-assignment form, '. =', can be used in the same manner.
  • Code Size: 16-bit or 32-bit code is generated by stating .code16 or .code32 (equivalent to Intel syntax [BITS 16] and [BITS 32], respectively).
  • Target CPU: Set with the .arch directive. It is a Good Idea to set it, even if you are sure that the default is 'i386'.
  • Label Declarations: Always end in a colon.
  • Equivalence Statements: A new identifier appearing at the beginning of a line and not ending in a colon is assumed to be part of an equivalence statement, and must be followed by an equals sign and an assigned value. Example: FOO = 0xF00
  • End of Instruction: Designated either by a newline or with a semi-colon; the latter is primarily seen in macros, to allow multiple lines of code.
  • Line Continuation: As in C, with a backslash ('\') as last character in a line. This also is mostly used in macros.
  • Registers: Always prefixed with a percent sign: %eax, %cs, %esp, etc.
  • Source, Destination: Move, load, store, and similar operations always take their operands in the order 'source, destination', the reverse of Intel syntax. Thus, "movl %eax, %ebx" moves the value of %eax into %ebx. This is the part that seems to confuse people the most, since the Intel form puts the operand being modified first:
    Opcode    Register/Memory-being-modified, Data, Data
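  • Operand Size Suffixes: Appended to the instruction mnemonic (with the exception of ljmp, lcall, and lret on the x86): movb for "move byte", movw for "move word", movl for "move long", etc. GAS can infer the size when one operand is a register, but the suffix is required when nothing else determines it, such as when storing an immediate to memory.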
  • Direct-address Operands: are not prefixed. Thus, "movl foo, %eax" moves the contents of memory location "foo" into %eax.
  • Immediate Operands: are prefixed with a dollar sign ($): "pushl $4" pushes 0x00000004 onto the stack. This applies to labels as well: "movl $foo, %eax" moves the value of the label foo (that is, the address of variable foo) into %eax.
  • Indexed / Indirect Operands: are written in the format segment:displacement(base, index, scale), like so: movl %eax, %ss:8(%ebp, %esi, 4) (which is equivalent to Intel syntax mov dword [ss:ebp + esi * 4 + 8], eax, that is, it moves the value of %eax to offset (%ebp + (%esi * 4) + 8) in segment %ss). The base and index must be registers, and the scale must be 1, 2, 4, or 8. Any of the five components of an indirect address may be omitted.
  • Relative Addressing: Used by default in jump and call instructions whose operand is a label. To jump or call through a register or memory operand (an indirect, absolute target), the operand must be prefixed with an asterisk (*), as in "jmp *%eax".
  • Far jumps / calls / returns: Use the special mnemonics 'ljmp', 'lcall' and 'lret'.
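
The rules above can be seen together in a short fragment. This is only a sketch: it assumes 32-bit code assembled with "as --32", and the names demo, helper, counter, table, msg and FLAG are made up for illustration. The Intel-syntax equivalents are given in the comments.

```asm
        .code32
        .arch   i386
        .text
        .globl  demo                    # hypothetical entry label

FLAG = 0x01                             /* equivalence statement: no colon */

demo:
        movl    $FLAG, %eax             # immediate          (Intel: mov eax, FLAG)
        movl    counter, %ebx           # contents of memory (Intel: mov ebx, [counter])
        movl    $counter, %esi          # address of label   (Intel: mov esi, counter)
        movl    %eax, 8(%ebp, %ecx, 4)  # Intel: mov [ebp + ecx*4 + 8], eax
        movb    $0x41, %es:(%edi)       # suffix needed: no register fixes the size
        addl    %ebx, %eax; incl %ecx   # ';' separates two instructions
        call    helper                  # relative call to a label
        jmp     *table(, %ecx, 4)       # indirect jump through a memory operand
        ljmp    $0x08, $demo            # far jump: segment selector, then offset

helper:
        ret

        .data
        .align  4
counter:
        .long   0
table:
        .long   demo, helper
msg:
        .ascii  "Hello, World!\n"
msg_len = . - msg                       # '.' is the current location (Intel: $ - msg)
        .fill   16, 1, 0                # Intel: times 16 db 0
        .word   0x1234                  # Intel: dw 1234h
```

The fragment is meant to be assembled and inspected (for instance with objdump), not run: the far jump and the selector 0x08 are there only to show the operand forms.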

The AT&T syntax format for macros:

.macro <name> <args>
<operations>
.endm

Example (inside the body, an argument is referenced by its name prefixed with a backslash):

.macro write string
   movw $\string, %si
   call printstr
.endm

This would be equivalent to the NASM macro:

%macro write 1
   mov si, %1
   call printstr
%endmacro

Additionally, the cpp and M4 macro preprocessors are often used for macro handling.
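
The argument substitution done by .macro can be sketched as follows. The example assumes 16-bit code and a label-valued argument; start, msg and the stub printstr routine are placeholders invented for this sketch.

```asm
        .code16

# "\string" is replaced by the text of the argument at each use.
# The '$' makes the result an immediate, matching NASM's "mov si, %1".
.macro write string
        movw    $\string, %si
        call    printstr
.endm

start:
        write   msg                     # expands to: movw $msg, %si
                                        #             call printstr
        hlt

printstr:                               # placeholder routine for the sketch
        ret

msg:
        .asciz  "Hello, World!\n"
```

Without the backslash, GAS would treat "string" as an ordinary symbol name and leave it unchanged.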

Converting small snippets of code from Intel syntax to AT&T

You can use the following script to convert short snippets of code (one-liners) from Intel syntax to AT&T syntax. It assembles the snippet with NASM and then disassembles the result with objdump:

#!/bin/bash
set -e

# Usage:
#
# ./inteltoatt [16|32|64] "mov eax, eax \n xor ecx, edx"
#

case "$1" in
16|32|64)
	bits="$1"
	shift ;;
*)
	bits="32" ;;
esac
code="$1"

nasm="$(mktemp)"
obj="$(mktemp)"
objdump="$(mktemp)"
# Remove the temporary files when the script exits.
trap 'rm -f "$nasm" "$obj" "$objdump"' EXIT

case "$bits" in
	16) m="i8086"       ;;
	32) m="i386"        ;;
	64) m="i386:x86-64" ;;
esac

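# %b expands backslash escapes such as \n inside the snippet.
printf 'BITS %s\n%b\n' "$bits" "$code" > "$nasm"

nasm "$nasm" -o "$obj"
objdump -D -b binary -m "$m" "-Maddr${bits},data${bits}" "$obj" > "$objdump"

# Skip the objdump header: print everything after the "<.data>:" line.
lineno="$(grep -E -m 1 -n '<\.data>:$' "$objdump" | cut -d':' -f1)"
lineno=$((lineno+1))

tail -n +"$lineno" "$objdump"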

See Also

External Links