Re: [Trion-kernel-dev] v0.2 sources -- tarball time?


> I tested it on MP bochs and the first AP does not run into a triple fault
> just after booting anymore - I just found out that this happens because
> it is not even booted anymore... Instead, a page fault occurs on the BSP
> at 0xe0001a34, which is somewhere in mp_detect::DetectFloatingPointer.
> This is also reported correctly by the kernel. The same thing happens on
> UP bochs.

Since my last update on Friday, DetectFloatingPointer() no longer allocates
a big block of memory to search for the structure, but rather uses a
single page that is assigned a new physical address on each loop
iteration. Although I had a really thorough look at the function, I
couldn't find the problem, and thus now assume that it must be some bug in
either the heap_manager or the mmu class.
What you might want to try is to add a cout statement at the beginning of
the main loop (line 137) that prints the current virtual
(virt_range.GetVirtualBase()) as well as the physical (pbase + i*4096)
address. While the physical address should increase in 4 kB steps, the
virtual address must always be 0xE003F000; if it isn't, there's some
problem in the heap_manager.
Also check on which run of the loop the page fault occurs. If it works
the first time but crashes on the second run, something must have
gone wrong when the page was assigned its new physical address.
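
The check described above could look roughly like the following. This is only a host-side sketch: TraceIteration, TraceToString, kExpectedVirt and kPageSize are names made up for illustration, and std::ostream stands in for the kernel's cout, which may well behave differently. In the kernel, `virt` would come from virt_range.GetVirtualBase().

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical constants: the virtual address the scratch page is expected
// to keep on every iteration (value from the mail) and the page size.
constexpr std::uint32_t kExpectedVirt = 0xE003F000u;
constexpr std::uint32_t kPageSize = 4096u;

// Prints one trace line per loop iteration and reports whether the virtual
// address is still the expected one. The physical address is computed the
// same way as in the mail: pbase + i * 4096.
bool TraceIteration(std::ostream& out, std::uint32_t i,
                    std::uint32_t virt, std::uint32_t pbase) {
    const std::uint32_t phys = pbase + i * kPageSize;
    out << "run " << std::dec << i
        << " virt=0x" << std::hex << std::uppercase << virt
        << " phys=0x" << phys << '\n';
    return virt == kExpectedVirt;
}

// Convenience wrapper so the output can be inspected on a host machine.
std::string TraceToString(std::uint32_t i, std::uint32_t virt,
                          std::uint32_t pbase) {
    std::ostringstream s;
    TraceIteration(s, i, virt, pbase);
    return s.str();
}
```

The last run number printed before the page fault tells you whether the first mapping or a later remapping is the one that fails.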
I just had a look at the old kernel's page class, and the only real
difference I could find is that it invalidates the TLB entry before the
new physical address gets written to it, while I do it the other way
round. On real hardware this doesn't make any difference, but if your
bochs loads the TLB right away (and not only once the page gets accessed),
it might simply miss the new mapping. You should therefore try to reverse
the order of these two instructions (/hal/mmu.cpp line 103 <-> 106).

> Even more interesting is the result on AMD SimNow!: The UP version boots
> correctly and reports that there is only one cpu present. On the MP
> version, our kernel correctly finds 2 cpus and reports that all AP's
> have been booted - but the AP does not print its "Hello world" message.
> (There was a small bug in mp_detect::BootAllProcessors - it always
> returned true.) In fact, the AP is still not booted.

Could it be that you're using a slightly outdated version of the code?
The latest version uses an integer return value that either holds the  
number of processors booted, or zero if there were some problems during  
startup.

> IMHO the usage of bit fields with "empty bits" is quite dangerous because
> the values of these empty bits are undefined and may cause conflicts.
> However, I just checked the value of the ICR and it is absolutely  
> correct - but the AP still does not respond. I'll keep on trying...

I would argue that it's not any worse than with traditional flags. If you
want to make sure that the unused bits are all set to zero, all you have
to do is initialize the bitfield (lapic_icr icr = {0}). Regarding upwards
compatibility this is however hardly any better, as it might as well be that
some of the reserved bits actually have to be set in the future. To be on
the safe side one would therefore have to read the register out first and
only mask those bits that really have to be altered. This however should
also be possible using a bitfield:

lapic_icr icr = GetICR();
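
Spelled out a bit further, the read-modify-write idea might look like the sketch below. Only vector, delivery_mode and dest_shorthand are field names from our code as mentioned in this mail; the remaining field names, the g_fake_icr0 stand-in for the real register, the SetICR signature and SendStartupIpi are assumptions made so that the example runs on a host. Bit-field ordering is implementation-defined, so the layout shown is what gcc on x86 is expected to produce, not something the language guarantees.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical layout of the low ICR dword (field order as laid out by
// gcc on x86; names other than vector, delivery_mode and dest_shorthand
// are assumptions).
struct lapic_icr {
    std::uint32_t vector          : 8;
    std::uint32_t delivery_mode   : 3;
    std::uint32_t dest_mode       : 1;
    std::uint32_t delivery_status : 1;
    std::uint32_t reserved0       : 1;
    std::uint32_t level           : 1;
    std::uint32_t trigger_mode    : 1;
    std::uint32_t reserved1       : 2;
    std::uint32_t dest_shorthand  : 2;
    std::uint32_t reserved2       : 12;
};
static_assert(sizeof(lapic_icr) == sizeof(std::uint32_t),
              "ICR low dword must be exactly 32 bits");

// Stand-in for the memory-mapped register so the sketch runs on a host.
static std::uint32_t g_fake_icr0 = 0;

// Reads the current register contents into the bitfield.
lapic_icr GetICR() {
    lapic_icr icr;
    std::memcpy(&icr, &g_fake_icr0, sizeof icr);
    return icr;
}

// Writes the bitfield back to the (simulated) register.
void SetICR(const lapic_icr& icr) {
    std::memcpy(&g_fake_icr0, &icr, sizeof icr);
}

// Read-modify-write: start from the current contents and change only the
// fields that matter, so the reserved bits keep whatever they held.
void SendStartupIpi(std::uint8_t vector) {
    lapic_icr icr = GetICR();
    icr.vector = vector;
    icr.delivery_mode = 6;   // 110b = start-up IPI
    icr.dest_shorthand = 0;  // no shorthand, use the destination field
    SetICR(icr);
}
```

Compared with lapic_icr icr = {0}, this leaves every bit we don't know about untouched.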

You might nevertheless be right that the unused bits are part of the
problem, as AMD's 64-bit processors may use an updated APIC version
that requires new flags. Try what happens if you only mask those bits of
the ICR that are really necessary (use lapic_icr icr = GetICR(), then only
set vector, delivery_mode & dest_shorthand). Also check if you can get
the code to work if you update the ICR's value manually:

46: WriteRegister(lapic_reg(ICR1), target << 24);
47: // SetICR(flags);
48: uint value = (ReadRegister(lapic_reg(ICR0)) & 0xFFF3F000) | 0x0608;
49: WriteRegister(lapic_reg(ICR0), value);
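
To make it clearer what the mask and the OR value do, here is the same computation as a small standalone sketch. The names kIcrKeepMask, kIcrStartup and MakeIcrLow are made up for this example; the two constants are the ones from the snippet above.

```cpp
#include <cstdint>

// Keeps everything except bits 0-11 (vector, delivery mode, destination
// mode) and bits 18-19 (destination shorthand).
constexpr std::uint32_t kIcrKeepMask = 0xFFF3F000u;

// Delivery mode 110b (start-up IPI) in bits 8-10, vector 0x08 in bits 0-7.
constexpr std::uint32_t kIcrStartup = 0x00000608u;

// Computes the new low ICR dword from the value currently in the register.
constexpr std::uint32_t MakeIcrLow(std::uint32_t current) {
    return (current & kIcrKeepMask) | kIcrStartup;
}

// Compile-time checks of the two extreme cases.
static_assert(MakeIcrLow(0x00000000u) == 0x00000608u,
              "only the start-up bits are set on an empty register");
static_assert(MakeIcrLow(0xFFFFFFFFu) == 0xFFF3F608u,
              "all other bits survive the update");
```

All bits outside the masked ranges pass through unchanged, which is the whole point of reading the register first.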

To exclude the faintest possibility that gcc for some reason changes
memory ordering when the ICRs are accessed, you might try to declare the
local APIC's base address (/hal/apic_local line 152) as volatile.
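
What I mean by that is roughly the following. Again this is only a sketch that runs on a host: g_fake_apic replaces the real register window, and the accessor signatures (plain byte offsets instead of lapic_reg) as well as the names lapic_base, kIcr0 and kIcr1 are assumptions, not what the file actually contains. The offsets 0x300 and 0x310 are the documented positions of the low and high ICR dwords.

```cpp
#include <cstdint>

// Stand-in for the local APIC register window; in the kernel this would be
// the virtual address the APIC's registers are mapped to.
static std::uint32_t g_fake_apic[1024] = {};

// The pointee is volatile: the compiler may neither drop these accesses
// nor reorder them relative to other volatile accesses.
static volatile std::uint32_t* const lapic_base = g_fake_apic;

constexpr std::uint32_t kIcr0 = 0x300;  // low dword of the ICR
constexpr std::uint32_t kIcr1 = 0x310;  // high dword of the ICR

// Writes one 32-bit register at the given byte offset.
inline void WriteRegister(std::uint32_t byte_offset, std::uint32_t value) {
    lapic_base[byte_offset / 4] = value;
}

// Reads one 32-bit register at the given byte offset.
inline std::uint32_t ReadRegister(std::uint32_t byte_offset) {
    return lapic_base[byte_offset / 4];
}
```

With this, a write to ICR1 followed by a write to ICR0 is emitted in exactly that order, which matters because the write to the low dword is what sends the IPI.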

cheers,
Daniel