C++ to ASM linkage in GCC

From OSDev.wiki

A small note before we begin: GCC (the GNU Compiler Collection), a very versatile compiler that has been around for a very long time, is pretty much the standard for OS development; as you probably already know, it is the compiler the Linux kernel is meant to be built with. It has lots of useful extensions (such as __attribute__((...))) which ease development by leaps and bounds, should you take the time to read about them.

Also, this article by itself is not sufficient for a full understanding of linking to C++ methods from C or assembly. C++ member functions use the hidden 'this' pointer, which will be discussed in another article proposed for a later date.

We will assume the use of the GCC compiler for your HLL development, and the use of C++ (C wouldn't be that different) for your little HLL - Assembly linkage escapade.

C++ Name Mangling

GCC follows the Itanium C++ ABI, which is specified in a publicly available collection of documents. One of the things which prevents C++ libraries from being portable between compilers is the fact that different compilers use different name mangling schemes.

Why Mangle Symbol Names?

In C++, the symbols you define are generally not exclusive. For example, the function getObjId() in C would simply be encoded as getObjId in the object file output. But in C++, since this function may be overloaded, its symbol name needs to carry extra information about its parameter types (its signature, to use the correct term), so that when a call has to be matched to a definition, the compiler and linker know which function is to be linked to.

Take the following example: a global function (i.e. with no namespace attached) getObjId() in C++ may be overloaded into these three instances:

int getObjId(void);
int getObjId(int);
int getObjId(unsigned int);

From the source point of view, you might believe that these three have the same name in the object file, and that the compiler magically knows which one to choose for a call. However, three distinct symbols are generated. Under the Itanium C++ ABI scheme that GCC uses, they are, respectively:

_Z8getObjIdv, _Z8getObjIdi, _Z8getObjIdj.

The compiler uses the generated symbol name to encode information about the symbol. The prefix says 'this is a mangled symbol' (_Z). Next comes 'the following name is 8 characters long' (8), and those characters are [getObjId]. After the name, GCC encodes the parameter types as a sequence of letters and, where needed, namespace/class names (v=void, i=int, j=unsigned int...). A name nested inside a namespace or class is wrapped in N...E instead: the N opens a list of length-prefixed name components and the E ends it, as in _ZN3foo3barEv for foo::bar(void).
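The decoding described above can also be checked programmatically. The following is a minimal sketch, assuming a GCC (or Clang) toolchain whose C++ runtime provides abi::__cxa_demangle in <cxxabi.h>; the helper name demangle is made up for this example and is not part of any library.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang C++ runtime)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper: returns the human-readable form of an Itanium-ABI
// mangled name, or the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return mangled;
    }
    std::string result(readable);
    std::free(readable);  // the caller owns the malloc'ed buffer
    return result;
}
```

For instance, demangle("_Z8getObjIdj") gives back "getObjId(unsigned int)", and demangle("_ZN3foo3barEv") gives back "foo::bar()". On the command line, the c++filt tool from binutils does the same job.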

After seeing this, you can tell that in order to call a C++ function from assembly, if the compiler has mangled its name (there are cases in which no mangling is needed and the compiler leaves the symbol alone, e.g. global variables outside any namespace), you need to use a symbol-listing tool such as nm or objdump to see the name GCC used in the generated object file.

Note well that different compilers DO NOT necessarily use the same mangling scheme. The C++ standard leaves name mangling unspecified, so each compiler (or, more precisely, each ABI) is free to define its own scheme as it sees fit.

Keep this bit of information in mind.

Essentially, then, what is 'linking' to an external function or variable?

Symbols in C++, C, and many other languages have several levels of visibility (their linkage): how far a symbol defined in one object file can be seen by the code in another object file. Within a single source file, unless you take steps to ensure otherwise, all file-level symbols, whether static or global, are available to the code in that file.

Global Symbols

Global symbols are those which are seen universally by all object code in the entire program, at the linking phase. Technically, during execution, the whole program is just bytes, so actually, every symbol is just an address, so every function has access to every symbol. But it is useful to programmers to be able to abstract access to symbols.

In C and C++, you make a symbol global by defining it outside of any function, without the static keyword. In C++, you may still hide the symbol by having it inside a 'private' section of a global symbol. But this is irrelevant to this article, seeing as anyone who is reading this article should be attempting to develop an OS. We assume you already understand your language.

In assembly (NASM specifically; the author does not use GAS, and anyone who knows how GAS works may add to this article as they see fit), all symbols are automatically local to the particular assembly file in which they appear. To make a symbol global, you must use the 'global SYMBOL_NAME' directive. In GAS, the equivalent is '.globl SYMBOL_NAME' (also spelled '.global').

;-------------------------------------
; I usually like to place all my directives in a section above the code:
;-------------------------------------
global mysymbol, mycodesymbol
;-------------------------------------

mysymbol:
   dd 1, 2, 3

; NASM labels are case-sensitive, so this must match the 'global' line exactly.
mycodesymbol:
   push eax
   pop eax
   ret

;; This symbol is confined to be seen only within this file, since the linker can't see it.
mylocalsymbol:
   push eax
   pop eax
   ret

To clarify, technically, the linker can 'see' all symbols. It just chooses to ignore linkage between files for symbols not explicitly declared global. If you were to write a second file:

;--------------------------------------
; Extern directives here
;--------------------------------------
extern mysymbol, mycodesymbol
;; This one won't work.
extern mylocalsymbol
;--------------------------------------

This file will assemble, but when it is passed through the linker (and once its code actually references the symbols), you will receive a message to the effect that 'mylocalsymbol' is undefined.

Local Symbols

These are either function local (on the stack, and therefore definitely not available to any other code, since the storage only exists while the function is running and is released when it returns) or file local (like mylocalsymbol above).

In C/C++, we know that a static variable or function defined outside of any function is file local. It cannot be linked to from outside of that particular file.
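On the C++ side, the global versus file-local distinction might look like the following sketch. All of the names (bootStage, scratchCounter, bumpScratch, nextScratchValue) are hypothetical and chosen only for illustration.

```cpp
// All names here are hypothetical, chosen only for illustration.

int bootStage = 3;              // global: other object files can link to it

static int scratchCounter = 7;  // file local: 'static' at file level hides it

static int bumpScratch() {      // file-local function, same rule as above
    return ++scratchCounter;
}

int nextScratchValue() {        // global function that uses the file-local pieces
    return bumpScratch();
}
```

Another source file could declare 'extern int bootStage;' and 'int nextScratchValue();' and link against them, whereas 'extern int scratchCounter;' would fail at link time. Running nm on the object file typically marks global symbols with an uppercase type letter and local ones with a lowercase letter.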

So now we get back to the main question presented at the top of this section of the article: What is external linkage? External linkage is linking to a global symbol which is not defined in the file you're working in. The linker will place the referenced symbol's address wherever it is referenced in the referencing file.

Linking to External Symbols in C++

In C++, the compiler has no idea what kind of symbol you intend to link to. There is technically no such thing as different 'kinds' of symbols, so we'll immediately discard that idea.

Technically, you can link to any kind of symbol using the "C" linkage style. I'll explain that now.

"C" Linkage

Well, let's think this through. In C, all symbols are simply self representing. If I declare a function, getObjId, it will be called 'getObjId' in the output file. C uses no name mangling since you cannot have different symbols with the same name occurring in the same namespace in C.

In other words, when you tell the compiler that you are linking to an external variable with "C" style linkage, you are telling the compiler: "I am linking to a symbol of name XYZ. This is EXACTLY how the symbol looks, and there is no name mangling.". That is all 'extern "C"' means. It explicitly tells the compiler that there is nothing special about the symbol's name, and that it is to be taken exactly as you type it.

With this in mind, we now understand why in C there is no need to specify a linkage style: C only understands symbols as you type them. C does not have any kind of name mangling, and expects plain, absolute symbol names. So to link to an assembly symbol from C, you just say 'extern SYMBOL_NAME' (with its type), and not 'extern "C" SYMBOL_NAME'.

So a declaration such as 'extern "C" int getObjId(void);' tells the compiler to insert references to a symbol of the exact name 'getObjId' within your output object file. The linker will see your references, and look for a global symbol of that exact name, and if it is found, and there are no duplicates, it simply places the address of that symbol wherever the compiler placed a reference to it.
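The same rule applies in the other direction: a function defined with "C" linkage is emitted under its plain name, which is the name an assembly file would then declare extern. A minimal sketch, with made-up function bodies:

```cpp
// Hypothetical functions, for illustration only.

// "C" linkage: the object file contains the symbol exactly as written, 'getObjId'.
extern "C" int getObjId(void) {
    return 7;
}

// Default C++ linkage: GCC mangles this overload to _Z8getObjIdi.
// Only one function of a given name may have "C" linkage; further
// overloads must stay C++ functions like this one.
int getObjId(int base) {
    return base + getObjId();
}
```

Listing the resulting object file with nm should show both getObjId and _Z8getObjIdi as global symbols.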

Now, let's go back to name mangling. Name mangling is simply the generation of a symbol as you define it, but with extra information encoded so as to ensure that symbols in an output object file are unique. Remember: No two global symbols may have the same name. No two File local symbols within the same file may have the same name either. When the linker links together object files with identical symbols, if the symbols are file local, it doesn't matter.

But the most important thing is that AFTER the file is compiled, the symbols are already determined, and so you can then link to any kind of symbol as a "C" style symbol if you know its absolute, post-mangled name, not so?

If say, you have a class 'foo' with a method 'bar(void)', the mangled symbol after compiling looks something like this (if compiled with GCC):

_ZN3foo3barEv.

This is the absolute symbol name for foo::bar. You may link to it therefore, from plain C, provided that the function does not expect a 'this' pointer, which is another matter altogether. Technically it won't work until you satisfy the 'this' pointer condition, but I'm just trying to help you understand exactly what it means when you say a symbol has "C" linkage, and what the compiler takes that to mean.
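One common way to sidestep both the mangled name and the 'this' pointer condition is a small "C" linkage wrapper that takes the object explicitly. This is only a sketch of that idea, not something this article prescribes: the constructor, the id_ member, and the wrapper name foo_bar are assumptions added for the example.

```cpp
class foo {
   public:
      explicit foo(int id) : id_(id) {}   // hypothetical constructor
      int bar(void) { return id_; }
   private:
      int id_;                            // hypothetical member
};

// Hypothetical wrapper: exported under the unmangled name 'foo_bar'.
// The object is passed as an ordinary pointer argument, so the caller
// (C or assembly) never needs to know how 'this' is passed.
extern "C" int foo_bar(foo* self) {
    return self->bar();
}
```

C or assembly code can then reference foo_bar like any other "C" style symbol, as long as it has a valid pointer to a foo object to pass in.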

So linking to 'extern "C"' symbols is the same as telling the compiler to just trust you, and place references to XYZ symbol name as is, even though it may never see the definition of that symbol within the current file. The symbol is defined elsewhere. It may even be defined in a shared library, and be expected to be linked in by the OS at runtime (Dynamic Linking), although this usually requires a little more, and is usually handled by the compiler and linker as set up by the host OS (this is where OS libraries come in).

"C++" Linkage

Surprised? Yes, it does exist. However, a C++ compiler uses this linkage style by default. If you explicitly state it, it won't actually change anything. I know for a fact that GCC/G++ will not complain if you tell it to link to a symbol in C++ style.

So what does 'extern "C++"' tell the compiler? It means: "Do your thing."

If you do the following:

class foo {
   public:
      int bar(void);
};

extern foo fubar;
extern "C++" foo fubar;

The compiler will take them both to mean the same thing: 'Link to an external symbol which has the demangled name fubar, and is of type foo'.
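One place where spelling out "C++" linkage does make a difference is inside an 'extern "C"' block, where it switches back to the default. A short sketch, with hypothetical function names:

```cpp
// Hypothetical names, for illustration only.
extern "C" {
    int plainName(void);         // "C" linkage: symbol is 'plainName'

    extern "C++" {
        int mangledName(int);    // back to "C++" linkage: symbol is mangled
    }
}

int plainName(void) { return 1; }          // keeps the "C" linkage declared above
int mangledName(int x) { return x + 1; }   // ordinary C++ function
```

Outside of such a block, the 'extern "C++"' wrapper around mangledName could be dropped without changing the generated symbol.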

The compiler will apply its OWN mangling scheme to the symbol, then insert references to the symbol in its own mangling scheme. So say a symbol was compiled with MSVC, and you then reference it externally in a C++ file to be compiled with GCC. Will the symbol link?

No. These two compilers will do the following: The symbol is compiled in one object file by MSVC. MSVC places the symbol, all mangled in its own way, into the output object file.

Then GCC is told to link to that symbol. It places a reference to a symbol name, all mangled in its own way. The linker is called on both object files. The second one is referencing a symbol which the linker sees nowhere. You are told the symbol referenced by the second object file does not exist.

C++ - Assembly linkage

I'm sorry I took this long to get to this part of the article, but the facts given above are pertinent.

To link to a C/C++ symbol from an assembly file, you simply tell the Assembler that the symbol is external. Assemblers always insert references as they see them, so they are always "C" style, if you please. The assembler takes you at your word. This is why you can look for the mangled symbol name of a function, or variable, and then actually place the mangled name into your assembly file (extern _ZN3foo3barEv) and it will work (depending on whether or not a 'this' pointer is involved).

To link to an assembly symbol from C/C++, you must know what the absolute symbol name in the assembly file is, and ensure it's global, then link to the absolute symbol name "C" style. (extern "C" SYMBOL_NAME).
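To keep an example of this in a single file, the sketch below uses a file-level asm() block as a stand-in for a separate assembly file. It assumes GCC or Clang targeting ELF, and it uses GAS syntax rather than NASM because that is what the compiler passes to its assembler. The symbol asm_table and the function sumAsmTable are hypothetical names.

```cpp
#include <cstdint>

// Assumptions: GCC or Clang, ELF target, GAS syntax. 'asm_table' is a
// hypothetical name. This block stands in for a separate assembly file.
asm(
    ".pushsection .data\n"    // emit into .data without disturbing the compiler's section
    ".balign 4\n"
    ".globl asm_table\n"      // make the symbol global, like NASM's 'global'
    "asm_table:\n"
    "    .long 1, 2, 3\n"     // three 32-bit values, like NASM's 'dd 1, 2, 3'
    ".popsection\n"
);

// "C" style: reference the exact symbol name, with no mangling applied.
extern "C" std::int32_t asm_table[3];

int sumAsmTable() {
    return asm_table[0] + asm_table[1] + asm_table[2];
}
```

With a real NASM file in place of the asm() block, only the 'extern "C"' declaration and the code using it would remain on the C++ side.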

The 'this' pointer issue is another thing altogether, and is actually a very serious consideration you should take into account when designing your kernel, or when choosing whether or not to use C++ at all. C makes library generation and linking easier. C++ makes design and restructuring (you will restructure your design many times, so this is a big plus) much easier.

See Also

Calling Conventions