Troubleshooting: Difference between revisions

← Older edit

Troubleshooting (view source)

Revision as of 05:40, 9 June 2024

1,825 bytes added , 29 days ago

m

Bot: Replace deprecated source tag with syntaxhighlight

[unchecked revision]

VisualWikitext

Xtex

Interface administrators, Administrators

972

edits

Revision as of 17:02, 19 July 2008 (view source) osdev>Adek336 No edit summary ← Older edit		Latest revision as of 05:40, 9 June 2024 (view source) Xtex (talk \| contribs) m (Bot: Replace deprecated source tag with syntaxhighlight)
(18 intermediate revisions by 15 users not shown)
Line 1: {{Tone}} ~~==Providing Basic Debugging Environment==~~ == Providing a basic debugging environment == ~~===Exception Handlers===~~ === Exception handlers === The first thing ever to do is to implement a reliable 'exception handler' which will tell you what went wrong. Under an emulator like [[Bochs]], the absence of such a handler will lead to a '3rd Exception without resolution' panic message (a.k.a [[Triple Fault]]), if the emulator is configured to do so... On bare hardware, it will simply reset your computer with a laconic 'bip'. Every time the CPU is unable to call some exception handler, it tries to execute the [[Double Fault]] exception handler. If it fails to call it either, a [[Triple Fault]] occurs. Also keep in mind that exceptions cannot be masked, so either your code is perfect or you need exception handlers. Also keep in mind that when you run applications '''''their''''' code must be perfect without exception handlers, so it's a good idea to get them quite quickly. The first thing ever to do is to implement a reliable 'exception handler' that will tell you what was wrong. Under an emulator like BOCHS, the absence of such a handler will lead to a '3rd Exception without resolution' panic message, if the emulator is configured to do so... On bare hardware, it will simply reset your computer with a laconic bip. ~~Every~~It's ~~time~~quite ~~the CPU is unable~~convenient to ~~call~~have ~~some~~the exception handler, itshowing ~~tries~~what tokind ~~execute the Double Fault~~of exception ~~handler.~~occurred Ifbefore itanything ~~fails~~'hazardous' tois ~~call~~attempted. ~~it either~~Displaying, afor ~~Triple~~instance, ~~Fault~~the ~~occurs.~~(hexadecimal) ~~Also~~exception ~~keep~~number in ~~mind~~a ~~that~~corner ~~exceptions~~of ~~cannot~~the bescreen, ~~masked~~can ~~so either your code is perfect or~~save you ~~need~~hours ~~exception~~of ~~handlers~~debugging. :) It's quite convenient to have the exception handler show what kind of exception occurred before anything 'hazardous' is attempted. Displaying (as hex) the exception number in a corner of the screen, for instance can save you hours of debugging :) ~~<pre>~~ ~~AsmExample:~~ <!-- The following code example should be more generalized (e.g. exc_0d_handler, gpfExcHandler renamed to more meaningful names).--> <syntaxhighlight lang="asm"> exc_0d_handler: push gs Line 38 ⟶ 37: pop gs iret </syntaxhighlight> ~~</pre>~~ Once you have implemented such a technique, it may be wise to test it, deliberately issuing 'faulty' instructions to see if the correct code is displayed. Having the 'double fault' exception (08) displayed somewhere else on the screen may also be a smart move. ==== What to do if characters cannot be displayed ==== Such things occurs for instance when your GDT or paging tables has been badly configured (e.g. 0xb8000 no longer refers to the video memory). Fortunately enough, the video memory is not your sole communication technique with your kernel: * you may use the [[PS2 Keyboard\|keyboard]] LEDs to report some events (for instance enabling the 'scroll lock' LED when you're a handler and disabling it when you're out). * you may use the internal [[PC Speaker]] to make Morse-code-like signals reporting early errors. (May be disgraceful if coding late, though.) * you may use keyboard LEDs to report some events (for instance enabling the 'scroll lock' led when you're a handler and disabling it when you're out) * you may use [[VGA Hardware\|VGA registers]] to change the background color or the overscan (screen border) color to report the current 'state' of your kernel. (E.g. black = normal operations, yellow = processing interrupt, red = crash condition occurred (just before cli:hlt), blue = processing an exception, etc...) * you may use the internal PC speaker to make Morse-code-like signals reporting early errors (may be disgraceful if coding late, though) Refer to [[:Category:VGA\|VGA Resources]] to see how you can modify colors. The [[resources]] page should have all the documentation for LED flashing and speaker beeping. * you may use VGA registers to change the background color or the overscan (screen border) color to report the current 'state' of your kernel (black = normal operations, yellow = processing interrupt, red = crash condition occurred (just before cli:hlt), blue = processing an exception, etc) * you may output bytes via the [[serial ports\|serial port]]. Most emulators allow you to redirect these characters into a file and unlike the screen the number of characters is not limited. A driver for the serial port is also very easy to implement. ~~Refer to [[Vga Resources]] to see how you can modify colors. "Miscellaneous devices" in OSRC should have all the documentation for LED flashing and speaker beeping.~~ ~~One Step Further ...~~ ~~Avoiding exception loops~~ ==== Avoiding exception loops ==== So we know when exceptions occur and which exception occurred. That's better but still not especially useful. Your exception handler is likely to become something complex as your kernel will evolve, and you'll discover that exceptions mainly occur ... in exception handlers. In order to avoid recursive exceptions to occur endlessly, you can easily maintain a 'nested exceptions counter' that will be incremented every time you enter an exception handler and decremented just before you leave that handler. If the counter is above a certain threshold of a few units (3 should give interesting enough results), the kernel will abort trying to solve the exception and enter a 'panic' mode (red background, flashing ~~led~~LED, whatever). <syntaxhighlight lang="c"> ~~<pre>~~ int nestexc = 0; // called by the stub void gpfExcHandler(void) { if (nestexc > MAX_NESTED_EXCEPTIONS) panic(); nestexc++; Line 71 ⟶ 68: return; } </syntaxhighlight> ~~</pre>~~ You need to know, of course, that some exceptions are not 'resumable'. If your kernel issued a division by zero, trying to return to the 'div' instruction will only trigger the exception one more time (yeah! altogether, now :). Such loops cannot be solved by the 'nestexc' counter ==== Showing the stack content ==== Much of your program's state (function arguments, return address, local variables) is stored on the [[stack]], especially when using C/C++ code. A complete debugger (like GDB) will inspect the debugging info to give names to the stack content, provide a list of calls, etc. This is a bit complex to do ourselves, but if your kernel can simply ''show'' the content of the stack and if you know ''where'' in the code the process halted, you can already fix quite a lot of bugs by doing the job of the debugger yourself, guessing which stack location holds which variable, where the return addresses are, etc. The stack content is still in memory. The [[EBP]] value of the erroring process is still in memory, and points to the start of the stack frame for the current function. Everything from this address and up was the current stack. Now, you can use the value in ebp as the source. Just use the following call: Much of you program's state (function arguments, return address, local variables) is stored on the [[stack]], especially when using C/C++ code. A complete debugger (like GDB) will inspect the debugging info to give names to the stack content, provide a list of calls, etc. This is a bit complex for us, but if your kernel can simply show the content of the stack and if you know where in the code the process halted, you can already fix quite alot of bugs by doing the job of the debugger yourself, guessing which stack location holds which variable, where are return addresses, etc. <syntaxhighlight lang="asm"> The stack content is still in memory. The EBP value of the erroring process is still in memory, and points to the start of the stack frame for the current function. Everything from this address and up was the current stack. Now, you can use the value in ebp as the source. Just use the following call: ~~<pre>~~ stack_dump: push ebp Line 88 ⟶ 84: pop ebp ret ; note that this is not going to work, but it should be here for completion. </syntaxhighlight> ~~</pre>~~ and use <tt>void dump_hex(char stack)</tt>. ~~Locating the Faulty instruction~~ ==== Locating the Faulty instruction ==== In most cases, when your exception handler is called, the address of the faulty instruction is somewhere on the stack. The first step here is to print out the address of this instruction. Once this is done, you (as a human) can inspect the ''linker map'' and find out in which object file the problem was. You can request a map with <tt>ld </tt>''<usual options>''<tt> -Map </tt>''<filename.map>''. <pre> Line 114 ⟶ 110: </pre> Is an example of what a map can look like. If the error address was 0x554f, we can tell from this that the error is somewhere in init.o, and most likely in <tt>kinit()</tt> function (it may still be in some other function if there are some ''static'' functions in the source file). All we know is that the error ~~occured~~occurred at offset +21f in the file. Now, we can use <tt>objdump -drS bin/init.o</tt> to get a look at the disassembled output. Note that this step will work properly only if you had enabled debug information in those separated <tt>.o</tt> files... <syntaxhighlight lang="c"> ~~<pre>~~ #ifdef __DEBUG__ kprint("kernel in debug-mode(%x) press [SHIFT+SPACE] to bypass anykey()\n", Line 131 ⟶ 127: DbMsk); #endif </syntaxhighlight> Of course, as I picked up a random address, there's nothing wrong to see at +21f, but I guess you got my point. :) ==== Locating the offending line of source code ==== Once you have found the address of the faulty instruction in the previous step, you can identify the corresponding line of source code by running <pre> addr2line -e <your_kernel.elf> <address of faulty instruction> </pre> === Enhanced debugging techniques === ~~Of course, as i picked up a random address, there's nothing wrong to see at +21f, but i guess you got my point :)~~ ==== Stack tracing ==== ~~Enhanced debugging techniques~~ By analyzing the default way to create a stack frame, you can rip off a stack frame at a time, resulting in the call sequence that leads to the fault. For a single bonus point, also extract the arguments and dump them as well. For multiple bonus points, use C++ name mangling, and export the arguments in readable form in the correct type. ~~Stack tracing~~ Each time a function is called it gets the following head/tail: ([[GCC]] 3.3.2) By analyzing the default way to create a stack frame, you can rip off a stack frame at a time, resulting in the call sequence that lead to the fault. For a single bonus point, also extract the arguments and dump them as well. For multiple bonus points, use C++ name mangling, and export the arguments in readable form in the correct type. <syntaxhighlight lang="asm"> ~~Each time a function is called it gets the following head/tail: (GCC 3.3.2)~~ ~~<pre>~~ push ebp mov ebp, esp Line 147 ⟶ 150: leave ret </syntaxhighlight> ~~</pre>~~ On the place of the ... the rest of the code is filled in. Now, if you analyze the stack output, it looks something like: <pre> 0000FFC0 0000FFD0 -> this is the result of a push ~~ebp~~EBP (which pushes the esp at the start of the previous function) 0000FFC4 001023A5 -> this was the old EIP, which can be looked up in the function table (map file) 0000FFC8 01234567 -> this is an argument 0000FFCC 89ABCDEF -> this is another argument 0000FFD0 0000FFF0 -> this is again another ~~ebp~~EBP 0000FFD4 00105BC3 -> this is again an ~~eip~~EIP 0000FFD8 001023A5 -> this is an argument (could be a function pointer) 0000FFDC 01234567 -> this is another argument Line 164 ⟶ 167: 0000FFE8 FFC00000 -> this is again an argument to a different function 0000FFEC 00010000 -> this is again another argument, but again not to this function. 0000FFF0 0000FFFC -> this is the previous ~~ebp~~EBP again (note this is in this case the top) 0000FFF4 0010002C -> this is an old EIP 0000FFF8 00000001 -> this is an argument 0000FFFC 00000000 -> this is the ~~ebp~~EBP at the start of the first function, not necessarily valid! </pre> Now, you can traverse along the path of execution. The content of ~~ebp~~EBP is the old value of ~~ebp~~EBP, that is, the one of the last stack frame. The value above that is the old instruction pointer (which points inside the current function), and the values above that, up to but not including the value pointed to by the old ~~ebp~~EBP, are the arguments. Note that the arguments don't have to belong to this function, GCC occasionally saves an add to esp by not popping the values. By then pretending the old ~~ebp~~EBP is the current ~~ebp~~EBP, you can unwind another call. Do this until you are fed up by it, you have enough output or the stack ends. If the last one, watch out for not generating a double fault. If you use C++ name mangling, the arguments are encoded in the function name. If you can read that, you can decode what the value on the stack must be, so you can actually present it to the user in the form of a normal function call with legible arguments and everything. This is the ~~creme~~'crème de la ~~creme~~crème' of stack dumping methods, so most aren't expected to do this. While I program my kernel in C, I actually thought of writing a script that would parse the header files for function declarations, extract the debugging symbols from the compiled kernel image using <tt>objdump</tt>, and write a system map which would provide the types. Forgot it after falling in love with ~~BOCHS~~[[Bochs]]'s debugger though. Similarly, typemaps for structured types could be created, which would allow the same kind of browsing that GDB or [[Visual Studio]] give you. THIS would be the ~~creme~~crème de la ~~creme~~crème. === Debugging techniques === If your function <tt>x()</tt> wreaks havoc only after 1000 calls it may not suffice to put a "<tt>panic()"</tt> statement inside the functions to see where the functions breaks. You may want to know which call is malignant. To do this, one might use a global or static var to count calls and panic() after an amount to see if it managed to crash. If not, you try twice that amount; if it does crash, you try bisection to find the amount. ~~void scheduler_choose_task() {~~ ~~static uint32 Z=0;~~ ~~Z++;~~ ~~uint32 N = 1000;~~ ~~if (Z > N) panic(); //find the largest integer N for which it crashes not~~ ~~if (in_critical_section()) return ;~~ ~~....~~ } ~~and then check how far does it go~~ ~~Z++;~~ ~~uint32 N = 1000;~~ ~~//we get here,~~ ~~if (in_critical_section()) return;~~ ~~if (Z > N) panic(); //do we get here to panic() before a crash?~~ <syntaxhighlight lang="c"> However as complexity rises or multithreading is involved it is less probable that a crash would be consistenlty occuring in the same point, after the same amount of calls everytime. Then it would not be possible to find the number of the call to scheduler_choose_talk() that crashes it (because that number changes). Debugging needs some imagination; what if you knew, by tracing the program flow with print(__LINE__) that scheduler_choose_task() crashes only when a call to fun1() is in progress? You might use a global var void scheduler_choose_task() { ~~uint32 dbg;~~ static uint32_t Z=0; ~~or an array of various dbg vars (which are used only in debugging code which is cleaned after the programmer ceases to debug)~~ Z++; ~~uint32 dbg[20];~~ uint32_t N = 1000; ~~in a manner such as~~ if (Z > N) panic(); //find the largest integer N for which it crashes not ~~void fun1() {~~ if (in_critical_section()) return; ~~dbg[3]= 1;~~ ... } ~~if (x()) { dbg[3]=0; return ; }~~ </syntaxhighlight> ~~...~~ ...and then check how far does it go: ~~dbg[3]= 0;~~ <syntaxhighlight lang="c"> } Z++; uint32_t N = 1000; ~~and~~ //we get here, if (in_critical_section()) return; ~~void scheduler_choose_task() {~~ // if (~~dbg[3]==1~~Z > N) panic(); //~~check~~do we get here.. ato panic() ~~saves~~before ~~the day from~~a ~~crashing!~~crash? </syntaxhighlight> ~~if (in_critical_section()) return;~~ However as complexity rises or multithreading is involved, it is less probable that a crash would be consistently occurring at the same point, after the same amount of calls every time. Then it would not be possible to find the number of the call to <tt>scheduler_choose_talk()</tt> that crashes it (because that number changes). Debugging needs some imagination; what if you knew, by tracing the program flow with <tt>print(__LINE__)</tt> that <tt>scheduler_choose_task()</tt> crashes only when a call to <tt>fun1()</tt> is in progress? You might use a global var <tt>uint32_t dbg</tt> or an array (<tt>uint32_t dbg[20]</tt>) of various <tt>dbg</tt> vars (which are used only in debugging code which is cleaned after the programmer ceases to debug) in a manner such as: ~~if (dbg[3]==1) panic(); //check here.. it crashes~~ <syntaxhighlight lang="c"> } void fun1() { ~~(or mix it with a call count, Z++; if (Z>5 && dbg[3] == 1) panic() )~~ dbg[3] = 1; ... if (x()) { dbg[3]=0; return; } ~~Using the __LINE__ aids tracing the program flow:~~ ... ~~print(__LINE__);~~ dbg[3] = 0; } ~~See [[C_preprocessor#Uses_for_debugging]] for more info.~~ </syntaxhighlight> ...and: <syntaxhighlight lang="c"> void scheduler_choose_task() { // if (dbg[3]==1) panic(); //check here.. a panic saves the day from crashing! if (in_critical_section()) return; if (dbg[3]==1) panic(); //check here.. it crashes } </syntaxhighlight> (Or mix it with a call count, <tt>Z++; if (Z>5 && dbg[3] == 1) panic()</tt>.) Using the <tt>__LINE__</tt> aids tracing the program flow: <syntaxhighlight lang="c"> print(__LINE__); </syntaxhighlight> See [[C preprocessor#Uses for debugging\|Uses for debugging]] for more info. == External assistance == So far, we assumed that the kernel was containing all the information required for debugging (like symbols names, etc). In production-stage, however, this information, as well as most of the 'debugging prints' have usually been stripped out of the final binary object. Still, one could imagine a kernel that would be equipped with serial-line communication code and connected to another computer that would have all the 'removed' information (like the symbols map, or the debugging-info featured intermediate binaries). In case of a panic, the kernel could for instance send the value of <tt>eip</tt> over the serial line and expect the helper PC to reply with the function name and line number corresponding to that address (or dump it on the helper PC's screen, that's a matter of choice :) ). ~~Debugging interface~~ === Debugging interface === ~~Does anybody else know of an OS that allows the hacker to interactively probe the system state when crashes occur ?~~ ''Now we have plenty of information about what was wrong... can we ask for more? What do Mobius' debugging shell and Clicker's information panels tell us...'' ~~AmigaOS~~''Does ~~allowed~~anybody else know of an OS that allows the hacker to interactively probe the system state when crashes occur~~. It offered a mode that allowed debugging over serial line even after the system went into Guru Meditation. That was possible because AmigaOS enjoyed a 256 / 512 kByte ROM image that could not get~~ ~~corrupted.~~?'' :''Yes. Guess. Correct: AmigaOS. ;-) It offered a mode that allowed debugging over serial line even after the system went into Guru Meditation. That was possible because AmigaOS enjoyed a 256 / 512 kbyte ROM image that could not get corrupted.'' - [[User:Solar\|MartinBaute]] ~~Unix systems traditionnally write their state to /dev/core for offline guru meditation.~~ ''Unix systems traditionally write their state to /dev/core for offline guru meditation'' ~~==See Also==~~ [[How Do I Use A Debugger With My OS]] [[Help My Kernel Crashed] == See Also == === Articles === [[How Do I Use A Debugger With My OS]] [[Category:Troubleshooting\| ]] [[Category:Debugging]] [[de:Debugging]]