Code Management

From OSDev.wiki

An operating system is, by its very nature, a large project. As such, it requires proper code management; otherwise your code base is likely to turn into a mess that keeps getting harder to maintain as the project grows, and you will end up rewriting it every month because it has become unmaintainable, instead of actually implementing new features. On the other hand, code management must be scaled to the project: a project that is not meant to grow very big does not need a complicated organization.

Consistency in the codebase

The first step in writing a codebase that is easy to maintain, however many people work on it, is to ensure consistency. This applies to several areas, which are detailed in this section.

At the source level

In a large project, it is important that all source files be written the same way. It starts with the coding style. Everybody has their own and likes to use it, but a large project needs to impose a single coding style, or potential contributors will spend all their time deciphering each other's styles and lose any will to help out, which can end with your project being abandoned. Most projects also require each source file to start with a heading comment, which typically contains the filename, a description of the file's purpose, and its licensing conditions (or a reference to the license, in the case of long licenses like the GPL). Additionally, the coding standard should settle other points, such as the tab policy (whether indentation uses tab characters or spaces, and how many spaces per indentation level), the preferred comment style (C++-style // or C-style /* ... */, and which to use for which purpose), and the encoding (pure ASCII, UTF-8, etc.) of all source files.

Perhaps the most essential thing to define is the naming convention (that is, how functions, structures, types, macros, variables etc. are named) in the interface (see Interfaces, Implementations and Black Boxes below for more details). This is less important for internal (e.g. static in C) functions which are not exposed, though it's good practice to keep the same standards everywhere.
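As an illustration, here is a minimal sketch of what one such convention could look like in a single C source file. The module, its "ringbuf_" prefix, the "_t" suffix for types and the wording of the heading comment are all hypothetical choices made for this example, not rules your project has to adopt.

```c
/*
 * ringbuf.c - fixed-size byte ring buffer (hypothetical example module)
 *
 * Licensing: see the LICENSE file at the top of the source tree.
 */
#include <stddef.h>

#define RINGBUF_CAPACITY 8          /* macros: upper case, module prefix */

typedef struct ringbuf {            /* types: module prefix, "_t" suffix */
    unsigned char data[RINGBUF_CAPACITY];
    size_t head;                    /* index of the oldest stored byte */
    size_t count;                   /* number of bytes currently stored */
} ringbuf_t;

/* Internal helper: static, so it is not part of the interface and does
 * not strictly need the prefix. */
static size_t wrap_index(size_t index)
{
    return index % RINGBUF_CAPACITY;
}

/* Exported functions: lower case, module prefix first. */
void ringbuf_init(ringbuf_t *rb)
{
    rb->head = 0;
    rb->count = 0;
}

/* Returns 0 on success, -1 if the buffer is full. */
int ringbuf_put(ringbuf_t *rb, unsigned char byte)
{
    if (rb->count == RINGBUF_CAPACITY)
        return -1;
    rb->data[wrap_index(rb->head + rb->count)] = byte;
    rb->count++;
    return 0;
}

/* Returns 0 on success, -1 if the buffer is empty. */
int ringbuf_get(ringbuf_t *rb, unsigned char *out)
{
    if (rb->count == 0)
        return -1;
    *out = rb->data[rb->head];
    rb->head = wrap_index(rb->head + 1);
    rb->count--;
    return 0;
}
```

Whatever rules you pick, the point is that a reader can tell from a name alone which module it belongs to and whether it is a type, a macro or a function.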

The last concern is how tasks that recur in many places across the OS are accomplished. For example, C++ offers two ways of allocating memory (the new operator and C's malloc), and memory obtained with one must be released with its counterpart (delete or free), so mixing them freely invites bugs. A good project has well-defined conventions for such tasks, and they are followed throughout the project.

Fortunately, with the evolution of modern IDEs, following the coding conventions of a project has become a simple matter of setting up the IDE to use the correct encoding/tab policy/headers, and using the IDE's automated source formatting utilities to produce correctly formatted source files. Some IDEs can even create the heading comment block directly at the start of each source file you create, so that you only have to fill in the gaps.

In the source tree

The other place where it is essential to have a consistent organization is the project tree. In other words, the way files are sorted into subdirectories must be consistent. This is more of a concern in big projects, and becomes essential when developing an operating system. Because one day or another you will want to port your operating system to other architectures, it is necessary to keep architecture-dependent source files (e.g. x86-specific paging management) apart from architecture-independent ones (e.g. your command-line shell) (see Portability). GRUB does that with an "arch" directory, next to the other source files, containing subdirectories for the different platforms (e.g. x86, ARM, SPARC, MIPS...). Then, you should name your source files in a uniform manner. For example, C++ headers must have the same extension (.h, .hpp or .hxx) throughout the project. Here's another example: let's say you have an extensible command-line shell, with commands implemented in different source files. You must decide whether you're going to create a subdirectory for the commands (which would be better, e.g. "cmd/xxxxx.c" and "cmd/yyyyy.c"), name the files "cmdxxxxx.c" and "cmdyyyyy.c", or use some other naming scheme.

In the versioning scheme

Although this is usually less important than the previous concerns, you need to number your versions consistently. Some projects number versions sequentially (e.g. 0.1, then 0.2, then 0.3...), while others increment the minor version number (e.g. the 2 in 1.2) for small updates and increment the major version number (the first number in the version) only when they add a major improvement to the project. Some also add revision and build numbers, which are bumped every time a change is made to a file; this gives unwieldy version numbers such as 2.2.11127.56150, which are hard to remember.

Additionally, you may give each release a special name. Projects that use that strategy include Mac OS X (e.g. Leopard, Lion, Mountain Lion...), Android (e.g. Gingerbread, Ice Cream Sandwich, Jelly Bean...), Windows (e.g. Millennium, XP, Vista), and many others. Just pick the scheme you prefer, or something completely different, or something even simpler, such as MyOS 1, then MyOS 2, then MyOS 3... it's your project, after all!

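"Semantic Versioning" is an attempt to unify versioning schemes. It uses three numbers, MAJOR.MINOR.PATCH: the major number is incremented for incompatible changes, the minor number for backward-compatible additions, and the patch number for backward-compatible bug fixes. You probably already use a scheme that is close to this one.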
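If the version number is used by code (for instance to check whether a module is recent enough), it helps to store it as separate numbers rather than as a string. The following is a minimal sketch of that idea for a three-part major/minor/patch scheme; the version_t type and the two functions are hypothetical names invented for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical three-part version number. */
typedef struct version {
    unsigned major;   /* bumped for major improvements */
    unsigned minor;   /* bumped for small updates */
    unsigned patch;   /* bumped for fixes */
} version_t;

/* Returns -1, 0 or 1 when a is less than, equal to or greater than b. */
static int compare_field(unsigned a, unsigned b)
{
    return (a > b) - (a < b);
}

/* Compares field by field, most significant first. The result is negative
 * if a is older than b, zero if they are equal, positive if a is newer. */
int version_compare(version_t a, version_t b)
{
    if (a.major != b.major)
        return compare_field(a.major, b.major);
    if (a.minor != b.minor)
        return compare_field(a.minor, b.minor);
    return compare_field(a.patch, b.patch);
}

/* Writes "MAJOR.MINOR.PATCH" into buf and returns snprintf's result. */
int version_format(version_t v, char *buf, size_t size)
{
    return snprintf(buf, size, "%u.%u.%u", v.major, v.minor, v.patch);
}
```

Comparing the numbers rather than the text matters: as strings, "1.10.0" sorts before "1.2.3", whereas version_compare correctly treats 1.10.0 as the newer release.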

Interfaces, Implementations and Black Boxes

When dealing with a large codebase, such as that of an operating system, code can quickly turn into a big mess. To avoid that, you need proper code organization. Note that it's still perfectly possible to code without this organization, but organizing your codebase greatly increases the chances that your project will succeed and be stable.

It's all about splitting the interface from the implementation. The interface is made of function prototypes, structure definitions, typedefs and possibly preprocessor macros in C, plus classes in C++. Note that languages differ in how they support interfaces: some, such as D or Java, provide built-in support through special keywords and constructs, while others don't (assembly is an example, though few OSes are written completely in assembly).

Interface

The interface is the part that is visible to the user. It usually lives in header files and consists, for the C language at least, of function prototypes, structure and type definitions, and preprocessor symbols and macros. This part should not contain any code, except for small inline functions, which usually just forward the call to another function with the arguments in a different format. Examples are the ubiquitous inb/inw/inl/outb/outw/outl utility functions, which wrap a single inline assembly instruction.
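Here is a minimal sketch of such a header, with one prototype and one small inline wrapper that only reformats its arguments. The console_write and console_puts names are hypothetical, and the implementation underneath is a stand-in that records output in a buffer so the sketch can be compiled and run as an ordinary hosted program; in a real tree the two halves would live in separate files.

```c
#include <stddef.h>
#include <string.h>

/* ---- console.h (hypothetical interface) ---- */

/* Writes len bytes from buf to the console; returns the number written. */
size_t console_write(const char *buf, size_t len);

/* Small inline wrapper: same operation, different argument format. */
static inline size_t console_puts(const char *s)
{
    return console_write(s, strlen(s));
}

/* ---- console.c (stand-in implementation, not real hardware access) ---- */

static char console_log[64];   /* captured output */
static size_t console_used;    /* bytes of console_log in use */

size_t console_write(const char *buf, size_t len)
{
    size_t room = sizeof console_log - console_used;

    if (len > room)
        len = room;            /* truncate instead of overflowing */
    memcpy(console_log + console_used, buf, len);
    console_used += len;
    return len;
}
```

Code that includes the header only depends on the two declarations at the top; everything below the second marker can be replaced without its callers noticing.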

When writing an operating system kernel, one usually wants the interface to be platform-independent. This can be achieved by using typedefs for each kind of value. For example, when implementing paging, you may use one typedef for physical addresses and another for linear addresses. If you refer to physical addresses as phys_addr_t, you will be able to reuse the same interface when adapting your kernel to x86_64 (assuming you started with 32-bit x86) by simply changing the typedef of phys_addr_t from uint32_t to uint64_t. You can also use conditional compilation, or some similar feature your language of choice provides, to select the right definition.
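The typedef approach can be sketched as follows. The phys_addr_t name comes from the text above; the architecture test on compiler-defined macros, the PAGE_BYTES constant and the two helper functions are assumptions made for this example (a real build would more likely pick the definition from an architecture-specific header or the build system).

```c
#include <stdint.h>

/* Select the width of a physical address once, in one place. */
#if defined(__x86_64__) || defined(__aarch64__)
typedef uint64_t phys_addr_t;
#else
typedef uint32_t phys_addr_t;
#endif

#define PAGE_BYTES 4096u   /* assumed page size for this sketch */

/* Rounds addr down to the start of the page that contains it. */
static inline phys_addr_t page_align_down(phys_addr_t addr)
{
    return addr & ~(phys_addr_t)(PAGE_BYTES - 1);
}

/* Rounds addr up to the next page boundary (unchanged if already aligned). */
static inline phys_addr_t page_align_up(phys_addr_t addr)
{
    return page_align_down(addr + (PAGE_BYTES - 1));
}
```

The helpers never mention uint32_t or uint64_t, so they compile unchanged whichever definition of phys_addr_t is selected.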

It is very important that the interface be completely independent of the implementation. That will allow you to write a second implementation of your Hardware Abstraction Layer (the part of a kernel that abstracts the hardware, for beginners) for a different platform later, without having to change the interface. And assuming that your interface is platform-independent enough, and your kernel uses it and only it to access the hardware, then porting to a new platform will simply be a matter of writing a new HAL, and the rest will magically compile and work perfectly on the new architecture (for more info, see Portability).

For this to work, you also have to ensure that the rest of the kernel is really platform-independent. In fact, the interface of your kernel should never directly use fixed-width types (such as uint32_t). If a fixed width is required by the platform (as for special registers), then you should have a dedicated typedef for it (e.g. special_register_t). Otherwise, if you just want to ensure a given capacity, use the standard type that is closest to what you want. A good example is the C standard library, which uses special types for sizes, times and file positions (size_t, time_t, fpos_t), but plain (unsigned) char/int/long wherever those are sufficient. And as you can see, the interface to the C library remains the same across countless different platforms. A well-designed kernel interface allows the same thing to apply to... your kernel.

Implementation

The implementation is the opposite of the interface. That is, it contains the actual code of the functions declared in the interface. The implementation is allowed to depend on the platform (and in fact, in the context of a kernel, it has to). It is in the implementation that you write code that directly accesses the hardware (most notably inline assembly); code that addresses a specific device should be placed in a separate driver, according to your kernel design. However, if you want your code to remain readable, I suggest that you still use the C preprocessor to give explicit names to magic values, such as the fixed addresses of memory-mapped devices (e.g. VGA_TEXT_MEMORY for 0xB8000, the address of the VGA's text mode video memory).
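A minimal sketch of named constants in an implementation file follows. VGA_TEXT_MEMORY and its value come from the text above; the other macro names and the two functions are hypothetical. The buffer pointer is passed in as a parameter so that the sketch can be tried against an ordinary array on a development machine; a kernel would pass (volatile uint16_t *)VGA_TEXT_MEMORY instead, assuming that address range is accessible.

```c
#include <stddef.h>
#include <stdint.h>

#define VGA_TEXT_MEMORY  0xB8000u   /* address of text mode video memory */
#define VGA_TEXT_COLUMNS 80         /* standard 80x25 text mode */
#define VGA_TEXT_ROWS    25
#define VGA_ATTR_LIGHT_GREY_ON_BLACK 0x07

/* A text cell is 16 bits: character in the low byte, attribute above it. */
static inline uint16_t vga_make_cell(char c, uint8_t attr)
{
    return (uint16_t)((uint16_t)(unsigned char)c | ((uint16_t)attr << 8));
}

/* Stores c at (row, col) in buffer; out-of-range positions are ignored. */
void vga_put_at(volatile uint16_t *buffer, size_t row, size_t col, char c)
{
    if (row >= VGA_TEXT_ROWS || col >= VGA_TEXT_COLUMNS)
        return;
    buffer[row * VGA_TEXT_COLUMNS + col] =
        vga_make_cell(c, VGA_ATTR_LIGHT_GREY_ON_BLACK);
}
```

Compare the index computation with a version that reads buffer[row * 80 + col] = c | 0x0700: the behaviour is the same, but the named version tells the reader what each number means.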

The implementation can be as ugly as it wants or needs to be, as long as it properly implements the interface. However, to achieve proper isolation between interface and implementation, it is important that functions that are internal to the implementation not be accessible from outside it. The C language provides the static keyword for that.

Bigger components, whose implementation can be considered a separate library, are a special case. There, a common practice is to have an internal, implementation-specific interface. However, since functions shared between several source files of the implementation can't be declared static, it's up to the developer to ensure that they are not used from outside the implementation. There are countless professional libraries using such techniques; a good example is the widely used Boost C++ libraries with their "detail" directories. Examples of such components are complicated device drivers, such as those for hardware-accelerated video cards, or complicated filesystems.

Black Boxes

In time, you will probably want to expand your development team, if you have not already done so. But once more people are working on the codebase, you will face a problem: the more developers there are, the longer development takes. The reason is that each developer in charge of one part of the kernel needs to take into account the code written by the developers in charge of the other parts. That's when black boxes come into play. The black box approach consists of treating your kernel as a set of separate libraries (the Hardware Abstraction Layer is one library, the VFS is another, etc.). Once your project is properly divided into pieces, you can give each developer their own library, and no longer worry about conflicts. Combined with the isolation of interface and implementation, this lets each team concentrate on implementing its part of the interface as well as possible. With a properly designed interface, this makes the development process much faster and, for the project leader, easier to manage.
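In C, one common way to build such a black box is an opaque type: the interface only declares that a structure exists, and its layout is known to the implementation alone. The following is a minimal sketch using a hypothetical block device library backed by a tiny RAM disk; all names and sizes are invented for this example, and the two halves would normally be separate files owned by different people.

```c
#include <stddef.h>
#include <string.h>

/* ---- blockdev.h: everything the rest of the kernel is allowed to see ---- */

#define BLOCKDEV_SECTOR_SIZE 16        /* deliberately tiny for the sketch */

typedef struct blockdev blockdev_t;    /* opaque: the layout stays private */

blockdev_t *ramdisk_open(void);
size_t blockdev_sector_count(const blockdev_t *dev);
int blockdev_read(const blockdev_t *dev, size_t sector, void *out);
int blockdev_write(blockdev_t *dev, size_t sector, const void *in);

/* ---- ramdisk.c: one team's implementation, free to change at will ---- */

#define RAMDISK_SECTORS 4

struct blockdev {
    unsigned char storage[RAMDISK_SECTORS][BLOCKDEV_SECTOR_SIZE];
};

static struct blockdev ramdisk_instance;

blockdev_t *ramdisk_open(void)
{
    return &ramdisk_instance;
}

size_t blockdev_sector_count(const blockdev_t *dev)
{
    (void)dev;                         /* a single fixed-size disk here */
    return RAMDISK_SECTORS;
}

/* Both transfer functions return 0 on success, -1 for a bad sector number. */
int blockdev_read(const blockdev_t *dev, size_t sector, void *out)
{
    if (sector >= RAMDISK_SECTORS)
        return -1;
    memcpy(out, dev->storage[sector], BLOCKDEV_SECTOR_SIZE);
    return 0;
}

int blockdev_write(blockdev_t *dev, size_t sector, const void *in)
{
    if (sector >= RAMDISK_SECTORS)
        return -1;
    memcpy(dev->storage[sector], in, BLOCKDEV_SECTOR_SIZE);
    return 0;
}
```

A filesystem developer who only includes the header cannot reach into the structure, so the RAM disk team can change the storage layout, or swap in a real disk driver, without breaking anyone else's code.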

That approach will also help you when recruiting for your development team. You will find that many people start their own OS project with the goal of making the best possible GUI, but lose any will to continue when they realize that they have to implement countless other things first. If you announce on the forum that you are looking for people to design a GUI on top of an already mature kernel, these people will be happy to join your project, since it allows them to work only on what they like, without having to worry about the rest.

It will also make it easier to fix bugs. Because each part of the kernel is clearly separated from the rest, it's easy to locate a problem in the code and fix it, without also having to modify many other parts of the kernel that relied on the faulty behaviour for lack of a precise interface. As a result, your kernel is likely to end up more stable than average.

Now comes the problem of the interface. Since every component is neatly isolated from the rest and implements the interface, you need a well-designed interface and clear rules about who can modify it. If anybody can change the interface, it is useless to have one, because your team will spend its time adapting the implementation to the new interface instead of actually coding things. A scheme that has proved to work is to have, on your project page, a forum with a topic dedicated to discussing the interface. Then, when somebody wants to change the interface, they need to present the change, with arguments, to the whole developer community, and it is up to you to decide whether the change should be made.

Working across changes

It is obvious that your code will have to change, and sometimes a change is not as good as you thought it would be. In this case, you will want to roll back the change. You may also encounter situations where you remove a part of your code (e.g. you want to completely change your scheduler) and later want to see that code again (e.g. your new scheduler supports different methods and you now want to re-include your old scheduler's code).

Note: If you are already using a version control system, you can skip this section :)

Version Control Systems

A Version Control System (VCS) is a program that manages changes to your source files (for Mac OS X users, this is much like Time Machine, but specialized for source code). Concretely, you perform a commit operation after each significant change (e.g. after you rewrote your scheduler or added a new driver; don't commit after each modified line), and the VCS remembers the state of the files before and after the change. This means that if you change your mind and want to see your old code again, the VCS will be able to give it back. A VCS also enables you to generate patches, which are files that contain only the differences between two versions of the same file.

But the main point of using a VCS is that it enables two (or more) people to work on the same codebase at once without interfering with each other. A VCS also allows you to place your source code on the Internet and have your whole team co-operate using a single server. Many source code hosting websites (such as GitHub or GitLab) support accessing your codebase through a VCS, and using such tools for your code improves your chances of getting people to contribute to your project.