Troubleshooting: Difference between revisions


Revision as of 05:40, 9 June 2024


Bot: Replace deprecated source tag with syntaxhighlight


Revision as of 20:22, 14 December 2010, by osdev>Jrw (→‎What to do if characters cannot be displayed)
Latest revision as of 05:40, 9 June 2024, by Xtex, m (Bot: Replace deprecated source tag with syntaxhighlight)
(11 intermediate revisions by 10 users not shown)
{{Tone}}
== Providing a basic debugging environment ==
=== Exception handlers ===
The first thing to do is to implement a reliable 'exception handler' that tells you what went wrong. Under an emulator like [[Bochs]], the absence of such a handler leads to a '3rd Exception without resolution' panic message (a.k.a. [[Triple Fault]]), if the emulator is configured to report it. On bare hardware, it will simply reset your computer with a laconic 'beep'.

Every time the CPU is unable to call an exception handler, it tries to execute the [[Double Fault]] exception handler. If it fails to call that one too, a [[Triple Fault]] occurs. Keep in mind that exceptions cannot be masked, so either your code is perfect or you need exception handlers. Also keep in mind that when you run applications, '''''their''''' code would have to be perfect if you had no exception handlers, so it's a good idea to get them in place quite quickly.

It's quite convenient to have the exception handler show what kind of exception occurred before anything 'hazardous' is attempted. Displaying, for instance, the (hexadecimal) exception number in a corner of the screen can save you hours of debugging. :)

<!-- The following code example should be more generalized (e.g. exc_0d_handler, gpfExcHandler renamed to more meaningful names).-->
<syntaxhighlight lang="asm">
exc_0d_handler:
    push gs
    ...
    pop gs
    iret
</syntaxhighlight>

Once you have implemented such a technique, it is wise to test it by deliberately issuing 'faulty' instructions to see if the correct code is displayed. Having the 'double fault' exception (08) displayed somewhere else on the screen may also be a smart move.

==== What to do if characters cannot be displayed ====
This happens, for instance, when your GDT or paging tables have been badly configured (e.g. 0xb8000 no longer refers to the video memory).
Fortunately, the video memory is not your only means of communicating with your kernel:
* you may use the [[PS2 Keyboard|keyboard]] LEDs to report some events (for instance enabling the 'scroll lock' LED when you're in a handler and disabling it when you're out).

==== Avoiding exception loops ====
So we know when exceptions occur and which exception occurred. That's better but still not especially useful. Your exception handler is likely to grow complex as your kernel evolves, and you'll discover that exceptions mainly occur ... in exception handlers. To keep recursive exceptions from occurring endlessly, you can maintain a 'nested exceptions counter' that is incremented every time you enter an exception handler and decremented just before you leave that handler. If the counter rises above a small threshold (3 should give interesting enough results), the kernel gives up trying to resolve the exception and enters a 'panic' mode (red background, flashing LED, whatever).

<syntaxhighlight lang="c">
int nestexc = 0;
...
    return;
}
</syntaxhighlight>

You need to know, of course, that some exceptions are not 'resumable'. If your kernel issued a division by zero, trying to return to the 'div' instruction will only trigger the exception one more time (yeah! altogether, now :). Such loops cannot be solved by the 'nestexc' counter.

==== Showing the stack content ====
Much of your program's state (function arguments, return address, local variables) is stored on the [[stack]], especially when using C/C++ code. A complete debugger (like GDB) will inspect the debugging info to give names to the stack content, provide a list of calls, etc.
This is a bit complex to do ourselves, but if your kernel can simply ''show'' the content of the stack and if you know ''where'' in the code the process halted, you can already fix quite a lot of bugs by doing the job of the debugger yourself, guessing which stack location holds which variable, where the return addresses are, etc.

The stack content is still in memory. The [[EBP]] value of the faulting process is still in memory, and points to the start of the stack frame for the current function. Everything from this address and up was the current stack. Now, you can use the value in ebp as the source. Just use the following call:

<syntaxhighlight lang="asm">
stack_dump:
    push ebp
    ...
    pop ebp
    ret ; note that this is not going to work, but it should be here for completion.
</syntaxhighlight>

and use <tt>void dump_hex(char *stack)</tt>.

==== Locating the Faulty instruction ====
In most cases, when your exception handler is called, the address of the faulty instruction is somewhere on the stack. The first step here is to print out the address of this instruction.

Now, we can use <tt>objdump -drS bin/init.o</tt> to look at the disassembled output. Note that this step will work properly only if you enabled debug information in those separate <tt>.o</tt> files...

<syntaxhighlight lang="c">
#ifdef __DEBUG__
    kprint("kernel in debug-mode(%x) press [SHIFT+SPACE] to bypass anykey()\n",
...
           DbMsk);
#endif
</syntaxhighlight>

Of course, as I picked a random address, there's nothing wrong to see at +21f, but I guess you got my point.
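
As a rough sketch of the first step in this section (printing the address of the faulty instruction), the C fragment below picks the saved EIP out of the words the CPU pushes for a 32-bit x86 exception and formats it as hexadecimal. The helper names <tt>saved_eip</tt> and <tt>format_hex32</tt> are made up for this example, and it assumes your handler stub passes a pointer to the lowest-addressed word the CPU pushed and knows whether that exception carries an error code.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical helper: returns the saved EIP from the CPU-pushed words.
 * 'pushed' points at the lowest-addressed word pushed by the CPU.
 * Layout from low to high address: [error code,] EIP, CS, EFLAGS. */
static uint32_t saved_eip(const uint32_t *pushed, int has_error_code)
{
    return has_error_code ? pushed[1] : pushed[0];
}

/* Hypothetical helper: writes 'value' as 8 upper-case hex digits followed
 * by a terminating NUL. 'out' must have room for 9 characters. */
static void format_hex32(uint32_t value, char out[9])
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 8; i++) {
        /* Most significant nibble first. */
        out[i] = digits[(value >> (28 - 4 * i)) & 0xF];
    }
    out[8] = '\0';
}
</syntaxhighlight>

The resulting string can be handed to whatever output routine still works at that point (screen, serial port), and the address can then be looked up in the <tt>objdump</tt> output.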
==== Locating the offending line of source code ====
Once you have found the address of the faulty instruction in the previous step, you can identify the corresponding line of source code by running
<pre>
addr2line -e <your_kernel.elf> <address of faulty instruction>
</pre>

=== Enhanced debugging techniques ===
==== Stack tracing ====
By analyzing the default way to create a stack frame, you can peel off one stack frame at a time, resulting in the call sequence that leads to the fault. For a single bonus point, also extract the arguments and dump them as well. For multiple bonus points, use C++ name mangling, and export the arguments in readable form in the correct type.

Each time a function is called it gets the following head/tail: ([[GCC]] 3.3.2)
<syntaxhighlight lang="asm">
push ebp
mov ebp, esp
...
leave
ret
</syntaxhighlight>

The ... marks where the rest of the code is filled in. Now, if you analyze the stack output, it looks something like:

Now, you can traverse along the path of execution. The value stored at the address in EBP is the old value of EBP, that is, the one of the last stack frame. The value above that is the old instruction pointer (which points inside the current function), and the values above that, up to but not including the value pointed to by the old EBP, are the arguments. Note that the arguments don't have to belong to this function; GCC occasionally saves an add to esp by not popping the values. By then pretending the old EBP is the current EBP, you can unwind another call.
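
The unwinding just described can be sketched in C. This is only an illustration, assuming every function on the path uses the standard EBP frame shown above (no frame-pointer omission) and that the outermost frame has a saved EBP of zero; <tt>struct stack_frame</tt>, <tt>walk_stack</tt> and the <tt>max_frames</tt> limit are names invented for this example.

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

/* Assumed frame layout: the word at EBP is the caller's EBP, and the word
 * just above it is the return address. */
struct stack_frame {
    struct stack_frame *caller;   /* saved (old) EBP */
    uintptr_t return_address;     /* saved instruction pointer */
};

/* Follows the chain of saved EBP values starting at 'frame', stores up to
 * 'max_frames' return addresses in 'out', and returns how many were stored.
 * The walk stops at a NULL saved EBP or when 'out' is full. */
static size_t walk_stack(const struct stack_frame *frame,
                         uintptr_t *out, size_t max_frames)
{
    size_t count = 0;
    while (frame != NULL && count < max_frames) {
        out[count++] = frame->return_address;
        frame = frame->caller;    /* pretend the old EBP is the current one */
    }
    return count;
}
</syntaxhighlight>

In a kernel, the starting frame would be the EBP value saved by the exception handler. The <tt>max_frames</tt> bound limits the damage if the chain is corrupted, although a bad EBP can still cause a fault while it is being dereferenced.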
Do this until you are fed up with it, you have enough output, or the stack ends. In the last case, take care not to generate a double fault.

If you use C++ name mangling, the arguments are encoded in the function name. If you can read that, you can decode what the value on the stack must be, so you can actually present it to the user in the form of a normal function call with legible arguments and everything. This is the 'crème de la crème' of stack dumping methods, so most aren't expected to do this.

While I program my kernel in C, I actually thought of writing a script that would parse the header files for function declarations, extract the debugging symbols from the compiled kernel image using <tt>objdump</tt>, and write a system map which would provide the types. Forgot it after falling in love with [[Bochs]]'s debugger though. Similarly, typemaps for structured types could be created, which would allow the same kind of browsing that GDB or [[Visual Studio]] give you. THIS would be the crème de la crème.

=== Debugging techniques ===
If your function <tt>x()</tt> wreaks havoc only after 1000 calls, it may not suffice to put a <tt>panic()</tt> statement inside the function to see where it breaks. You may want to know which call is malignant. To do this, you can use a global or static variable to count calls and <tt>panic()</tt> after a given number of calls, to see whether the crash happens first. If it does not crash, try twice that number; if it does crash, use bisection to find the number.

<syntaxhighlight lang="c">
void scheduler_choose_task() {
    static uint32_t Z = 0;
    Z++;
    uint32_t N = 1000;
    if (Z > N) panic(); // find the largest integer N for which it does not crash
    if (in_critical_section()) return;
    ...
}
</syntaxhighlight>

...and then check how far it gets:
<syntaxhighlight lang="c">
    Z++;
    uint32_t N = 1000;
    // we get here,
    if (in_critical_section()) return;
    if (Z > N) panic(); // do we get here to panic() before a crash?
</syntaxhighlight>

However, as complexity rises or multithreading is involved, a crash becomes less likely to occur consistently at the same point, after the same number of calls every time. Then it is not possible to find the number of the call to <tt>scheduler_choose_task()</tt> that crashes it (because that number changes). Debugging needs some imagination; what if you knew, by tracing the program flow with <tt>print(__LINE__)</tt>, that <tt>scheduler_choose_task()</tt> crashes only when a call to <tt>fun1()</tt> is in progress? You might use a global variable <tt>uint32_t dbg</tt> or an array (<tt>uint32_t dbg[20]</tt>) of various <tt>dbg</tt> variables (used only in debugging code that is removed once the programmer has finished debugging) in a manner such as:
<syntaxhighlight lang="c">
void fun1() {
    dbg[3] = 1;
    ...
    dbg[3] = 0;
}
</syntaxhighlight>

...and:
<syntaxhighlight lang="c">
void scheduler_choose_task() {
    // if (dbg[3]==1) panic(); //check here.. a panic saves the day from crashing!
    ...
    if (dbg[3]==1) panic(); //check here.. it crashes
}
</syntaxhighlight>

(Or mix it with a call count, <tt>Z++; if (Z>5 && dbg[3] == 1) panic()</tt>.)

Using <tt>__LINE__</tt> helps trace the program flow:
<syntaxhighlight lang="c">
print(__LINE__);
</syntaxhighlight>

See [[C preprocessor#Uses for debugging|Uses for debugging]] for more info.

== External assistance ==

=== Debugging interface ===
''Now we have plenty of information about what was wrong... can we ask for more?
What do Mobius' debugging shell and Clicker's information panels tell us ...''

''Does anybody else know of an OS that allows the hacker to interactively probe the system state when crashes occur?''
:''Yes. Guess. Correct: AmigaOS. ;-) It offered a mode that allowed debugging over serial line even after the system went into Guru Meditation. That was possible because AmigaOS enjoyed a 256 / 512 kbyte ROM image that could not get corrupted.'' - [[User:Solar|MartinBaute]]

''Unix systems traditionally write their state to /dev/core for offline guru meditation''

[[Category:Troubleshooting]]
[[Category:Debugging]]
[[de:Debugging]]