Bare Metal PC Hacking 2 - early protected mode debugging

by John Tsiombikas
Last update: 24 April 2018.

The situation

I wrote the bit of code necessary to switch the processor to 32bit protected mode in my second stage boot loader. It entails a bunch of things which I won't go into in detail, like enabling the A20 line (in multiple ways, because no one method is guaranteed to work), setting up a temporary GDT (Global Descriptor Table) with descriptors for the code and data segments, and setting the PE (protection enable) bit in cr0.

        # load initial GDT
        lgdt (gdt_lim)
        # enable protection
        mov %cr0, %eax
        or $1, %eax
        mov %eax, %cr0
        # inter-segment jump to set cs selector to segment 1
        ljmp $0x8,$0f

        .code32
        # set all data selectors to segment 2
0:      mov $0x10, %ax
        mov %ax, %ds
        mov %ax, %ss
        mov %ax, %es
        mov %ax, %gs
        mov %ax, %fs
        ...

        .align 4
        .word 0
gdt_lim: .word 23
gdt_base:.long gdt

        .align 8
gdt:    # 0: null segment
        .long 0
        .long 0
        # 1: code - base:0, lim:4g, G:4k, 32bit, avl, pres|app, dpl:0, type:code/non-conf/rd
        .long 0x0000ffff
        .long 0x00cf9a00
        # 2: data - base:0, lim:4g, G:4k, 32bit, avl, pres|app, dpl:0, type:data/rw
        .long 0x0000ffff
        .long 0x00cf9200
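
To make the descriptor comments above less cryptic, here is a minimal C sketch that pulls the fields back out of the two 32bit words of each descriptor. The struct and function names are made up for this illustration and are not part of the boot loader; the bit positions are the standard x86 segment descriptor layout.

```c
#include <stdint.h>

/* One 8-byte segment descriptor, split into the two 32bit words
   used in the listing above (low word first). */
struct seg_desc {
    uint32_t lo, hi;
};

/* Segment base: bits 16-31 of lo, bits 0-7 and 24-31 of hi. */
static uint32_t desc_base(struct seg_desc d)
{
    return (d.lo >> 16) | ((d.hi & 0xffu) << 16) | (d.hi & 0xff000000u);
}

/* Raw 20-bit limit: bits 0-15 of lo, bits 16-19 of hi. */
static uint32_t desc_raw_limit(struct seg_desc d)
{
    return (d.lo & 0xffffu) | (d.hi & 0x000f0000u);
}

/* Granularity bit (hi bit 23): the limit counts 4k pages when set. */
static int desc_page_granular(struct seg_desc d)
{
    return (d.hi >> 23) & 1;
}

/* Last valid byte offset inside the segment. */
static uint32_t desc_last_offset(struct seg_desc d)
{
    uint32_t lim = desc_raw_limit(d);
    return desc_page_granular(d) ? (lim << 12) | 0xfffu : lim;
}

/* Access byte (hi bits 8-15): present, dpl, app/system, type. */
static unsigned desc_access(struct seg_desc d)
{
    return (d.hi >> 8) & 0xffu;
}

/* Index of the descriptor a selector refers to (selector bits 3-15). */
static unsigned sel_index(uint16_t sel)
{
    return sel >> 3;
}
```

Feeding it the code descriptor (0x0000ffff, 0x00cf9a00) gives base 0, a last valid offset of 0xffffffff, and access byte 0x9a; the data descriptor differs only in its access byte, 0x92. sel_index maps the selectors 0x8 and 0x10 used above to descriptors 1 and 2.
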

All well and good, but what if after a random interval the machine just reboots? And what if, to make matters worse, this only happens on real computers, and not on any of the emulators (qemu, bochs) which would allow me to attach a debugger and see what's going on?

Debugging boot code with QEMU

As a quick aside, here's how to use a debugger with qemu:

Tell qemu to start with the gdb server stub enabled, and wait for gdb to connect before starting to boot:

        qemu-system-i386 -fda floppy.img -s -S

Start gdb, then tell it to connect to the qemu gdb server, which is waiting for TCP/IP connections at port 1234:

        target remote localhost:1234

Instruct gdb to print the value of the program counter (eip) at every step, and also to disassemble the instruction it points to:

        display/i $pc

Set a breakpoint at a particular address before giving the continue (c) command. For instance, to set a breakpoint at the start of the first stage boot loader, set a breakpoint at address 7c00:

        b *0x7c00

Finally, when debugging 16bit code, it might be necessary to let gdb know that, to decode instructions correctly:

        set architecture i8086

To figure out where to set a breakpoint, we need to know the address of some instruction. We can disassemble the ELF binary to get that information:

        objdump -D test.elf -m i8086 >disasm

Triple-fault on the real thing

On the real thing, debugging immediately becomes more complicated. The good news is that, as I mentioned in the previous article, I had the forethought to include serial output in my boot loader for printf-debugging. That's a start, because I can see how far the code goes before the computer reboots.

When hacking PCs down to the metal, an abrupt reboot is generally due to a condition called a "triple fault". It all starts when something raises an exception. It could be anything: illegal instructions, general protection faults, page faults, numeric exceptions, etc. When an exception is raised, the processor tries to jump to the appropriate interrupt vector. If the interrupt vectoring fails, for instance because the descriptor for that vector is missing or invalid, or because it points to an invalid code segment, the processor raises a further exception, typically a general protection fault (vector 13). If that exception cannot be delivered either, the processor raises a "double fault", and tries to jump to the double fault exception vector (8). If a further exception is raised at that point, the processor detects this as a triple fault, and simply resets itself.
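
The escalation rules can be summarized in a few lines of C. This is only a sketch of the classification Intel documents for exceptions raised while another exception is being delivered. The enum and function names are hypothetical, it covers only the classic vectors 0 to 19, and the vector numbers are meant as processor exceptions, not as external interrupts that happen to land on the same numbers.

```c
enum exc_class { BENIGN, CONTRIBUTORY, PAGE_FAULT };
enum outcome { HANDLE_SECOND, DOUBLE_FAULT, SHUTDOWN };

/* Classify a processor exception by its vector number. */
static enum exc_class classify(unsigned vec)
{
    switch (vec) {
    case 0:     /* divide error */
    case 10:    /* invalid TSS */
    case 11:    /* segment not present */
    case 12:    /* stack fault */
    case 13:    /* general protection */
        return CONTRIBUTORY;
    case 14:    /* page fault */
        return PAGE_FAULT;
    default:
        return BENIGN;
    }
}

/* What happens when exception `second` is raised while the processor
   is still trying to deliver exception `first`. */
static enum outcome escalate(unsigned first, unsigned second)
{
    enum exc_class a = classify(first), b = classify(second);

    if (first == 8)     /* already delivering a double fault */
        return b == BENIGN ? HANDLE_SECOND : SHUTDOWN;
    if (a == CONTRIBUTORY && b == CONTRIBUTORY)
        return DOUBLE_FAULT;
    if (a == PAGE_FAULT && b != BENIGN)
        return DOUBLE_FAULT;
    return HANDLE_SECOND;
}
```

For instance escalate(13, 13), a general protection fault raised while trying to deliver a general protection fault, yields DOUBLE_FAULT, and escalate(8, 13) yields SHUTDOWN, which is the triple fault.
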

The first attempt to figure out where the problem was, by sprinkling putstr calls, was inconclusive, so I decided to find out exactly which instruction starts the fault avalanche by installing an interrupt handler. I didn't want to have to install all possible exception handlers just for this, so I opted for populating only the general protection exception vector, to catch the problem once the faults start to cascade. When my interrupt handler is called, I can examine the value of eip in the interrupt stack frame, which shows where the code was executing when the fault was triggered: for a fault it points to the faulting instruction itself, and for an interrupt that could not be delivered, to the instruction that would have executed next.

I hadn't bothered setting up an IDT up to this point, because I was running with interrupts disabled until the startup code is done and I'm ready to start executing the main program (or so I thought).

In protected mode, interrupt vectors are installed by populating gate descriptors in the interrupt descriptor table (IDT). Each descriptor is 8 bytes, and holds the handler's address, a code segment selector, and the gate type. The address and size of the table are set in the idtr register with an lidt instruction, similar to how we set the gdtr previously for the global descriptor table containing our memory segment descriptors.
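
As a rough illustration of what goes into one of those descriptors, here is a C sketch that packs a 32bit gate and computes the limit value for lidt. The names (gate_desc, make_gate, idt_limit) and the handler address used in the example are hypothetical; the field layout is the standard one for 32bit interrupt and trap gates.

```c
#include <stdint.h>

#define GATE_PRESENT 0x80u
#define GATE_INTR32  0x0eu   /* interrupt gate: clears IF on entry */
#define GATE_TRAP32  0x0fu   /* trap gate: leaves IF untouched */

struct gate_desc {
    uint16_t offs_lo;   /* handler offset bits 0-15 */
    uint16_t sel;       /* code segment selector */
    uint16_t attr;      /* low byte: reserved (0), high byte: P, dpl, type */
    uint16_t offs_hi;   /* handler offset bits 16-31 */
};

/* Build a present gate pointing at offs inside the segment sel. */
static struct gate_desc make_gate(uint32_t offs, uint16_t sel,
                                  unsigned dpl, unsigned type)
{
    struct gate_desc g;
    g.offs_lo = (uint16_t)(offs & 0xffffu);
    g.sel = sel;
    g.attr = (uint16_t)((GATE_PRESENT | ((dpl & 3u) << 5) | (type & 0xfu)) << 8);
    g.offs_hi = (uint16_t)(offs >> 16);
    return g;
}

/* Limit field loaded by lidt: size of the table in bytes, minus one. */
static uint16_t idt_limit(unsigned num_vectors)
{
    return (uint16_t)(num_vectors * 8u - 1u);
}
```

With a made-up handler address of 0x8123, make_gate(0x8123, 0x8, 0, GATE_TRAP32) produces the four words 0x8123, 0x0008, 0x8f00, 0x0000, and idt_limit(14) is 111, enough to cover vectors 0 through 13.
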

        lidt (idt_lim)
        ...

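prot_fault:
        # grab the error code from the stack frame
        mov (%esp), %eax
        # bits 3-15 of the error code hold the index of the offending descriptor
        shr $3, %eax
        call print_num
        # print '@' as a separator
        mov $64, %al
        call putchar
        # grab the value of eip from the stack frame
        mov 4(%esp), %eax
        call print_num
        # print a newline
        mov $10, %al
        call putchar
        # stop here
        hlt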

gpf_msg: .asciz "GP fault "

        .align 4
        .word 0
idt_lim: .word 111
idt_base:.long idt

        .align 8
idt:    .space 104
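        # trap gate 13: general protection fault
        # handler offset, low 16 bits
        .short prot_fault
        # code segment selector (segment 1)
        .short 0x8
        # type: trap, present, default
        .short 0x8f00
        # handler offset, high 16 bits (0: assumes the handler is below 64k)
        .short 0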
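
The shr $3 in the handler relies on the layout of the error code which the processor pushes for a general protection fault that refers to a descriptor. A minimal C sketch of that decoding follows; the struct and function names are hypothetical, and the error code in the example is made up rather than taken from the actual debugging session.

```c
#include <stdint.h>

struct sel_err {
    unsigned ext;    /* bit 0: caused by an event external to the program */
    unsigned idt;    /* bit 1: the index refers to a gate in the IDT */
    unsigned ti;     /* bit 2: when idt is 0, 0 = GDT and 1 = LDT */
    unsigned index;  /* bits 3-15: descriptor or vector index */
};

/* Split a selector-style error code into its fields. */
static struct sel_err decode_sel_err(uint32_t code)
{
    struct sel_err e;
    e.ext = code & 1u;
    e.idt = (code >> 1) & 1u;
    e.ti = (code >> 2) & 1u;
    e.index = (code >> 3) & 0x1fffu;
    return e;
}
```

A hypothetical error code of 0x62 decodes to idt = 1 and index = 12, which would mean the fault was caused by IDT entry 12.
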

Turns out the fault was being triggered by the serial output code. The UART was raising an interrupt, which I couldn't handle, because I hadn't populated the IDT yet. Of course this shouldn't happen if I were running with interrupts disabled, as I thought I was. But it turns out that when I called the BIOS sector read functions to read from the boot medium, the BIOS re-enabled interrupts and left them enabled.

This wouldn't happen with either of the emulators, because they both use the same SeaBIOS implementation, which apparently doesn't do that, or makes sure to restore the original state before returning control to my code.
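
One reason a stray hardware interrupt is so destructive at this stage is where it lands. With the interrupt controller setup the PC BIOS traditionally leaves behind, IRQs 0 to 7 arrive on vectors 8 to 15, which are exception vectors in protected mode. The sketch below shows that mapping, along with a test for the interrupt enable flag. The function names are hypothetical, and the assumption that the UART sits on IRQ 4 (the conventional COM1 setting) is mine, not something stated above.

```c
#include <stdint.h>

/* Vector bases traditionally programmed into the two 8259 PICs by the BIOS. */
#define BIOS_PIC1_BASE 0x08u
#define BIOS_PIC2_BASE 0x70u

/* Interrupt vector on which a given IRQ line (0-15) is delivered. */
static unsigned irq_to_vector(unsigned irq, unsigned pic1_base, unsigned pic2_base)
{
    return irq < 8 ? pic1_base + irq : pic2_base + (irq - 8);
}

/* Bit 9 of the flags register is the interrupt enable flag (IF). */
static int interrupts_enabled(uint32_t eflags)
{
    return (int)((eflags >> 9) & 1u);
}
```

irq_to_vector(4, BIOS_PIC1_BASE, BIOS_PIC2_BASE) gives 12, a vector for which the IDT above has no valid gate. Checking interrupts_enabled on a saved copy of the flags register after a BIOS call would show whether the call left interrupts on.
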
