Optimizing: Difference between revisions

From OSDev.wiki
Jump to navigation Jump to search
[unchecked revision][unchecked revision]
Content deleted Content added
mNo edit summary
 
(17 intermediate revisions by 9 users not shown)
Line 1: Line 1:
Optimization is the practise of taking functioning code and modifying it in order to make it execute faster. Optimization is already discussed very thoroughly in other places for various userland problem domains, and below are a few areas that are important for operating system developers to consider.
'''Optimization''' is the practise of taking functioning code and modifying it in order to make it execute faster. Optimization is already discussed very thoroughly in other places for various userland problem domains, and below are a few areas that are important for operating system developers to consider.


Note that to fully understand the ideas on this page, it is important to have fairly good grasp of assembly (x86 examples will be used), and a thorough knowledge of the x86 architecture is a must. But you've already read the Intel CPU manuals, haven't you?
Note that to fully understand the ideas on this page, it is important to have fairly good grasp of assembly (x86 examples will be used), and a thorough knowledge of the x86 architecture is a must. But you've already read the Intel CPU manuals, haven't you?
Line 8: Line 8:
====Things to Optimize====
====Things to Optimize====
* '''Deeply-nested loops.''' If possible, try to eliminate some of the nesting of the loop. Any instruction taken out of a loop will have a ''much'' more significant speed impact than an instruction taken out of a general routine. If you run though a loop one million times, and there is just one "unnecessary line of code", then there is a significant portion of a second wasted, especially on slower computers.
* '''Deeply-nested loops.''' If possible, try to eliminate some of the nesting of the loop. Any instruction taken out of a loop will have a ''much'' more significant speed impact than an instruction taken out of a general routine. If you run though a loop one million times, and there is just one "unnecessary line of code", then there is a significant portion of a second wasted, especially on slower computers.
* '''Code than is run often.'''
* '''Code that is run often.'''
* '''Graphics code.'''
* '''Graphics code.'''
* '''API code.'''
* '''API code.'''
Line 21: Line 21:


==Compiler Optimization==
==Compiler Optimization==

If you're using one of the more complex compilers (such as [[GCC]]), you have available to you a lot of optimizations that can be done without having to write (or remove) a single line of code. One thing to be aware of when using compiler optimizations is that when overdone (or even done at all) they can cause previously functioning code to become buggy, or not work at all. It is important to remember that this is not so much a problem with the optimizations the compiler does, but the fact that the code had underlying errors in the first place. That is to say, not using the [[Volatile (keyword)|volatile]] keyword correctly.
Modern compilers (such as [[GCC]]) provide inbuilt optimization logic, which is available to you without having to write (or remove) a single line of code. Enabling these compiler optimizations might seem to break previously functioning code; it is important to remember that this is because of bugs in your code that have been there all along, that are just exposed by the optimizations. One such example is when the [[Volatile (keyword)|volatile]] keyword is not used where necessary.

Another side-effect of optimization is that source-level debugging might not always produce the expected results, as variables or whole lines of code are optimized away.

===GCC===


To use compiler optimizations in GCC, simply add the -O(x) switch to the command line when compiling, where (x) is one of the following:
To use compiler optimizations in GCC, simply add the -O(x) switch to the command line when compiling, where (x) is one of the following:

{| {{wikitable}}
{| {{wikitable}}
! Switch
! Switch
! Use
! Use
|-
|-
| 0 || Disable optimization. (This is the default.)
| s || Optimize for size.
|-
|-
| 1 || Try to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
| 1 || Do mild speed optimizations.
|-
|-
| 2 || Do more speed optimizations.
| 2 || Perform nearly all supported optimizations that do not involve a space-speed tradeoff.
|-
|-
| 3 || Do lots of speed optimizations (not recommended).
| 3 || Perform nearly all supported optimizations ''including'' those involving a space-speed tradeoff.
|-
| fast || Optimize for speed. This includes all optimizations from -O3, as well as aggressive optimizations that may violate strict compliance with language standards. (For example, use a fused multiply-add instruction for floating point numbers instead of multiplying and adding separately.)
|-
| s || Optimize for size. This includes all optimizations from -O2 that do not typically increase code size, plus additional size-reducing optimizations.
|-
| g || Turn on optimizations that do not interfere with debugging (see the [https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html GCC docs]). Available since 4.8.
|}
|}
Note that only one command line switch can be used on a single compilation, and that each level of speed optimizations does everything the level below it does, and more. For more information, read the GCC manual.


Numerous sources suggest the existence of further optimization levels, like "-O4", "-O6" or "-O99"; but only -O0 through -O3 and -Os (and -Og since 4.8) are ''documented'' and can be relied on.
The main reason for code to become buggy or not work after adding compiler optimizations is that it removes variables, or assumes that it doesn't have to reload variables because their value isn't changed in any other code (which isn't true for most components of kernels, which are hugely interdependent). To prevent this from happening, use the [[Volatile (keyword)|volatile]] keyword.


For more information, read the GCC manual; it lists the exact optimizations done at each level, plus some additional options not enabled at any of the above optimization levels.
==Inline Functions==


==== What? ====
===Clang===
The '''inline''' keyword is a way of requesting that the compiler insert the code instead of calling it. This means that if you create an inline function and then call it from another function, the compiler will try to replace your call with the function body (similar to a preprocessor define, but with none of its disadvantages).


Clang supports all of the same optimization flags as GCC, in addition to <code>-Oz</code> which is similar to <code>-Os</code> but reduces code size even more.
==== Advantages ====
Inline functions save the clock cycles involved in actually calling a function (argument and return code pushes, call and return jump). Compared to preprocessor defines, they add type safety, and avoid side effects that make macros a hazard. Inlining can also be triggered on and off using compiler options so you can measure the size / speed tradeoff.


'''ToDo: Describe Link-time optimizations.'''
==== Disadvantages ====
A normal function called from five different locations will only reside in memory once. If you inline that function, it will reside in memory ''five'' times. Your code becomes faster, but it also becomes bigger.


==== Caveats ====
==Inline Functions==
The ''inline'' keyword is merely a ''request'' to the compiler, which the compiler can choose to ignore if it deems the size / speed tradeoff to be detrimental.
The '''inline''' keyword is a way of requesting that the compiler insert the code instead of calling it. This means that if you create an inline function and then call it from another function, the compiler will try to replace your call with the function body (similar to a preprocessor define, but with none of its disadvantages).


Inline functions save the clock cycles involved in actually calling a function (argument and return code pushes, call and return jump). Compared to preprocessor defines, they add type safety, and avoid side effects that make macros a hazard. Inlining can also be triggered on and off using compiler options so you can measure the size / speed tradeoff. On the downside, a normal function called from five different locations will only reside in memory once. If you inline that function, it will reside in memory five times. Your code becomes faster, but it also becomes bigger. It is important to note that the inline keyword is merely a request to the compiler, which the compiler can choose to ignore if it deems the size / speed tradeoff to be detrimental.
==== When To Use It? ====
It is best to only use the '''inline''' keyword on small functions, where it is known to work best and to provide the largest speed boost. An example of small function is an accessor or a mutator inside a C++ class, or a function that simply sets the value.


It is best to only use the '''inline''' keyword on small functions, where it is known to work best and to provide the largest speed boost. An example of small function is an accessor or a mutator inside a C++ class, or a function that simply sets the value. The inline keyword is simply specified before the function type, just like the '''static''' keyword:
==== How To Use It? ====
The inline keyword is simply specified before the function type, just like the '''static''' keyword:


<pre>
<pre>
Line 70: Line 76:


==The Cache==
==The Cache==
{{Main|CPU Caches}}
The CPU's cache contains a high-bandwidth store of currently used data. On a modern CPU, it usually takes 2 cycles to do a read or write from level 1 (L1) cache (compared with 1 cycle for a register), 5 cycles for level 2 (L2) cache, and perhaps around 30 cycle for main memory. As you can see, reading from memory is very expensive compared to reading from a register or cache. Unfortunately, this trade-off for speed comes at a size cost: L1 cache is only around 8KiB per processor core, with L2 cache sitting around 4MiB to 12MiB per Core 2 chip. For more information, see [[Wikipedia:Memory_hierarchy|this]] Wikipedia entry.
The CPU's cache contains a high-bandwidth store of currently used data. On a modern CPU, it usually takes 2 cycles to do a read or write from level 1 (L1) cache (compared with 1 cycle for a register), 5 cycles for level 2 (L2) cache, and perhaps around 30 cycle for main memory. As you can see, reading from memory is very expensive compared to reading from a register or cache. Unfortunately, this trade-off for speed comes at a size cost: L1 cache is only around 8KiB per processor core, with L2 cache sitting around 4MiB to 12MiB per Core 2 chip. For more information, see [[Wikipedia:Memory_hierarchy|this]] Wikipedia entry.


Most processors try to keep the most recently used data in the two caches (though the method used can differ, check the manuals). For this reason, it pays to keep in mind what data is likely to be in the caches at any given time. If you for instance (and this is a very simplistic example) set your multitasking code to switch processes too fast, you'll waste a higher percentage of time going through cache 'flushes', refilling the cache with new data.
Most processors try to keep the most recently used data in the two caches (though the method used can differ, check the manuals). For this reason, it pays to keep in mind what data is likely to be in the caches at any given time. If you, for example (and this is a very simplistic example), set your multitasking code to switch processes too fast, you'll waste a higher percentage of time going through cache 'flushes', refilling the cache with new data.


==CPU Pipelines==
'''ToDo: More info on how we can keep the cache in mind while programming.'''
{{Stub}}

== Piping ==
in the older days, where dual cores did not exist, and the CPUs were slow, they need to speed things up. of cause there are many ways to
achieve this. Intel decided to use the Pipe idea, used in RICS CPUs and other high speed CPUs. The idea behind piping is that you can execute more instructions in only 1 cycle. This means that if you have 2 pipes ( 1 pipe = 1 instruction / cycle), you can execute 2 instructions in 1 time. Now there are only 1 big problem. How can that be possible if you change a register in the fist instruction, and then use it in the next?, the solution is bad because there is no other solution: wait until the first instruction has completed. So the upside of this is simply:
you can increase the speed many times, but you can also make the code extremely slow.
today, Intel says that their CPUs can have up to 6 pipes, so at a assembly level we can make the code up to 600 % faster, in case that we don't think about piping.
here is a example for 2 pipes, the first is not optimized, the other is:
mov eax,0x0000aa55 ; don't think about the values, only the instruction order
mov esi,eax
mov ebx,0x0000beef
mov ecx,ebx
mov es,cx
mov edi,ebx
; and so could we continue all the day long...
now, can you see the problem ?, it should be very simple to image.
however, here is a example witch is optimized for 2 pipes:


==See Also==
mov eax,0x0000aa55, the code does the same, but in a different order.
===Articles===
mov ebx,0x0000beef
* [[User:01000101/optlib/|01000101's Optimized Library]]
mov esi,eax
mov ecx,ebx
mov edi,ebx
mov es,cx

this example will run 3 times as fast as the unoptimized. The 2 pipes names is "V" and the "U" pipe, one thing to notice, is that the "U" pipe can NOT execute all types of instructions, only the most common instructions.

{{Stub}}


==Links==
===Threads===
===Forum Topics===
* [[topic:18119|01000101 asks about optimizing memcpy() and memset()]]
* [[topic:18119|01000101 asks about optimizing memcpy() and memset()]]
* [[topic:18063|A discussion on the best way to do multi-processor scheduling]]
* [[topic:18063|A discussion on the best way to do multi-processor scheduling]]


===External===
===External Links===
* [http://www.eventhelix.com/realtimemantra/basics/optimizingcandcppcode.htm Handy tips on optimizing C/C++ code.]
* [http://www.eventhelix.com/realtimemantra/basics/optimizingcandcppcode.htm Handy tips on optimizing C/C++ code.]
* [https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html GCC Optimization Documentation.]
[[de:Codeoptimierung]]

Latest revision as of 20:52, 4 October 2020

Optimization is the practise of taking functioning code and modifying it in order to make it execute faster. Optimization is already discussed very thoroughly in other places for various userland problem domains, and below are a few areas that are important for operating system developers to consider.

Note that to fully understand the ideas on this page, it is important to have fairly good grasp of assembly (x86 examples will be used), and a thorough knowledge of the x86 architecture is a must. But you've already read the Intel CPU manuals, haven't you?

When and What to Optimize

Please take the following list with some common sense: your kernel might be different.

Things to Optimize

  • Deeply-nested loops. If possible, try to eliminate some of the nesting of the loop. Any instruction taken out of a loop will have a much more significant speed impact than an instruction taken out of a general routine. If you run though a loop one million times, and there is just one "unnecessary line of code", then there is a significant portion of a second wasted, especially on slower computers.
  • Code that is run often.
  • Graphics code.
  • API code.

Things not to bother Optimizing

  • Boot code. Code that only runs once (and doesn't take very long, such as loading an IDT), is not worth optimizing.

Things not to Optimize

  • Untested or nonworking code. If you try to prematurely optimize, you will almost certainly make any problems you might have worse. 'First make it run, then make it run fast.'
  • Code that needs to retain clarity. It is generally fairly hard to keep optimized code looking as neat and understandable as code that is designed to be easy to understand. Comments can help (and should be used anyway, especially to mark why optimizations were made).
  • Crash recovery code. This will depend on your personal stance. It may be a good idea to err on the side of caution and concentrate on stability of the recovery code, not speed.

Compiler Optimization

Modern compilers (such as GCC) provide inbuilt optimization logic, which is available to you without having to write (or remove) a single line of code. Enabling these compiler optimizations might seem to break previously functioning code; it is important to remember that this is because of bugs in your code that have been there all along, that are just exposed by the optimizations. One such example is when the volatile keyword is not used where necessary.

Another side-effect of optimization is that source-level debugging might not always produce the expected results, as variables or whole lines of code are optimized away.

GCC

To use compiler optimizations in GCC, simply add the -O(x) switch to the command line when compiling, where (x) is one of the following:

Switch Use
0 Disable optimization. (This is the default.)
1 Try to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
2 Perform nearly all supported optimizations that do not involve a space-speed tradeoff.
3 Perform nearly all supported optimizations including those involving a space-speed tradeoff.
fast Optimize for speed. This includes all optimizations from -O3, as well as aggressive optimizations that may violate strict compliance with language standards. (For example, use a fused multiply-add instruction for floating point numbers instead of multiplying and adding separately.)
s Optimize for size. This includes all optimizations from -O2 that do not typically increase code size, plus additional size-reducing optimizations.
g Turn on optimizations that do not interfere with debugging (see the GCC docs). Available since 4.8.

Numerous sources suggest the existence of further optimization levels, like "-O4", "-O6" or "-O99"; but only -O0 through -O3 and -Os (and -Og since 4.8) are documented and can be relied on.

For more information, read the GCC manual; it lists the exact optimizations done at each level, plus some additional options not enabled at any of the above optimization levels.

Clang

Clang supports all of the same optimization flags as GCC, in addition to -Oz which is similar to -Os but reduces code size even more.

ToDo: Describe Link-time optimizations.

Inline Functions

The inline keyword is a way of requesting that the compiler insert the code instead of calling it. This means that if you create an inline function and then call it from another function, the compiler will try to replace your call with the function body (similar to a preprocessor define, but with none of its disadvantages).

Inline functions save the clock cycles involved in actually calling a function (argument and return code pushes, call and return jump). Compared to preprocessor defines, they add type safety, and avoid side effects that make macros a hazard. Inlining can also be triggered on and off using compiler options so you can measure the size / speed tradeoff. On the downside, a normal function called from five different locations will only reside in memory once. If you inline that function, it will reside in memory five times. Your code becomes faster, but it also becomes bigger. It is important to note that the inline keyword is merely a request to the compiler, which the compiler can choose to ignore if it deems the size / speed tradeoff to be detrimental.

It is best to only use the inline keyword on small functions, where it is known to work best and to provide the largest speed boost. An example of small function is an accessor or a mutator inside a C++ class, or a function that simply sets the value. The inline keyword is simply specified before the function type, just like the static keyword:

inline int max_of_three(int x, int y, int z)
{
    if ( x > y && x > z )
        return x;
    return ( y > z ) ? y : z;
}

The Cache

Main article: CPU Caches

The CPU's cache contains a high-bandwidth store of currently used data. On a modern CPU, it usually takes 2 cycles to do a read or write from level 1 (L1) cache (compared with 1 cycle for a register), 5 cycles for level 2 (L2) cache, and perhaps around 30 cycle for main memory. As you can see, reading from memory is very expensive compared to reading from a register or cache. Unfortunately, this trade-off for speed comes at a size cost: L1 cache is only around 8KiB per processor core, with L2 cache sitting around 4MiB to 12MiB per Core 2 chip. For more information, see this Wikipedia entry.

Most processors try to keep the most recently used data in the two caches (though the method used can differ, check the manuals). For this reason, it pays to keep in mind what data is likely to be in the caches at any given time. If you, for example (and this is a very simplistic example), set your multitasking code to switch processes too fast, you'll waste a higher percentage of time going through cache 'flushes', refilling the cache with new data.

CPU Pipelines

This page is a stub.
You can help the wiki by accurately adding more contents to it.

See Also

Articles

Threads

External Links