Disregard this for now; there's another bug to work out, then I'll submit a newer patch.
AMD PauseFilterThreshold support
Out of bounds memory access in memory.cc
AMD SVM VMCBPTR not saved on snapshot
Ok I'm kinda getting closer. What's happening is first the MSR KernelGSBase (0xC0000102) is being accessed and this is in the MSR bitmap and NOT being intercepted, so that occurs in the guest without VMEXIT. Now when the MSR 0x40000071 follows, this somehow leads to a memory access exception (still need to trace where exactly). But if I intercept all MSRs, meaning the KernelGSBase MSR leads to a VMEXIT instead, then it works fine and the following 0x40000071 MSR VMEXITs and doesn't cause any more...
Actually no, I'm so confused. When the guest does a wrmsr to 0x40000071, the MSR is 0x40000071. However, when I check "if(msr == 0x40000071)" the if somehow fails? I know the MSR is 0x40000071 from BX_INFO printing it, yet it also passes the check "else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff))" and enters that branch when it shouldn't. So confused about what's going on here at the moment haha.
Hey man, new issue to fix, this one is why I hate C lol. In svm.cc you have SvmInterceptMSR, and there's a bunch of if and else if statements like: if (msr <= 0x1fff) msr_map_offset = 0; else if (msr >= 0xc0000000 && msr <= 0xc0001fff) msr_map_offset = 2048; else if (msr >= 0xc0010000 && msr <= 0xc0011fff) msr_map_offset = 4096; There's a problem here, specifically the double conditions inside the brackets like "msr >= 0xc0000000 && msr <= 0xc0001fff". It's not being calculated properly because you need...
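For reference, here is a minimal sketch of how the three MSR ranges quoted above map into the SVM MSR permission map, where each MSR owns two bits (read intercept, then write intercept). The comment above is cut off before it states the actual fix, so this is not that fix. `MsrpmSlot` and `msrpm_slot` are hypothetical names, not Bochs code, and the explicit `uint32_t` parameter and unsigned literals are assumptions made to keep the comparisons unambiguous.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Location of one MSR's permission bits inside the MSR permission map.
struct MsrpmSlot {
    uint32_t byte_offset;  // byte index into the map
    uint32_t read_bit;     // bit in that byte for RDMSR; WRMSR uses read_bit + 1
};

// Hypothetical helper (not Bochs code). Returns std::nullopt for MSRs outside
// the three mapped ranges; such accesses are always intercepted.
inline std::optional<MsrpmSlot> msrpm_slot(uint32_t msr) {
    uint32_t base_offset = 0;
    uint32_t index = 0;
    if (msr <= 0x1fffu) {
        base_offset = 0;
        index = msr;
    } else if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
        base_offset = 2048;
        index = msr - 0xc0000000u;
    } else if (msr >= 0xc0010000u && msr <= 0xc0011fffu) {
        base_offset = 4096;
        index = msr - 0xc0010000u;
    } else {
        return std::nullopt;
    }
    assert(index <= 0x1fffu);         // each range covers 0x2000 MSRs
    const uint32_t bit = index * 2;   // two permission bits per MSR
    return MsrpmSlot{base_offset + bit / 8, bit % 8};
}
```

With this mapping, KernelGSBase (0xC0000102) lands inside the second range, while 0x40000071 falls in none of them, so it should never match the 0xc0000000 branch.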
Yep, with these changes AMD Hyper-V boots and works! Thanks for your assistance with finding the bugs and patches.
Yeah that works
Yeah the hack works, Hyper-V AMD boots fine with that alongside the Guest EFER.SVME change I mentioned earlier too.
I checked with your changes in event.cc: it doesn't have the intercept set so it doesn't vmexit, it goes the "take it normally" route. Also:
- The SMI comes via apic_bus_deliver_smi, called via iodev/acpi.cc in generate_smi with the value 0xf1
- When doing the problematic page walk, status is: GUEST_NXE:0, HOST_NXE:1, IN_SVM:1, IN_SMM:1, RW:2, is_page_walk:0
I'll look into it more tomorrow too, but for now I can provide a dump of the debug prints right before it happens (with some extras in there):
07119718387i[CPU0 ] GUEST_NXE:1, HOST_NXE:1, IN_SVM:1
07119718387d[CPU0 ] Nested walk for guest paddr 0x000004605000
07119718387i[CPU0 ] GUEST_NXE:1, HOST_NXE:1, IN_SVM:1
07119718387d[CPU0 ] Nested walk for guest paddr 0x000004606070
07119718387i[CPU0 ] GUEST_NXE:1, HOST_NXE:1, IN_SVM:1
07119718387d[CPU0 ] Nested walk for guest paddr 0x00000010e04c
07119718393i[CPU0...
Hey, so when the NX fault occurs, the EFER.NXE status for guest and host is: GUEST_NXE:0, HOST_NXE:1. Also, yes, your fix for the nested_page_fault issue above seems good.
Yeah, all the changes were necessary, reasons being: For the paging.cc nested_walk change, without this change the exitinfo1 was 0000000200000004 as mentioned, and when this went to Hyper-V it must have injected an exception or something, because code execution jumps to a BSOD in the guest. With this change it doesn't: the exitinfo1 becomes 0000000100000004 and the guest continues. This code is hit when the guest attempts to read addresses like 0xfee00320 and 0xfee00340. For the paging.cc *nx_fault...
Well, I actually got Hyper-V AMD to boot properly with the following changes:
- Change paging.cc line ~1384 from nested_walk(paddress, rw, 0); to nested_walk(paddress, rw, 1);
- Change paging.cc line ~657 by removing the "*nx_fault = 1;" line, as this was being hit for some unknown reason
- Change cpu/svm.cc line ~430 by removing the guest EFER.SVME requirement, commenting out the lines "BX_ERROR(("VMRUN: Guest EFER.SVME = 0"));" and "return 0;"
With those changes, I can actually boot into windows...
Note that patching line 1384 in cpu/paging.cc to "nested_walk(paddress, rw, 1);" instead of "nested_walk(paddress, rw, 0);" does make Bochs send the proper exitinfo1 value, which allows Windows to boot further. However, we then somehow get a "PAE PTE: non-executable page fault occurred" and "SVM VMEXIT reason=1024 exitinfo1=0000000100000015 exitinfo2=00000000000a8000", and further down a panic happens: "[BXVGA ] >>PANIC<< update: select_high_bank != 1". Not sure if the patch I mentioned is actually correct,...
Hey, so I looked into this. It turns out the ExitInfo1 value Bochs provides when an NPF occurs doesn't match hardware. For example, on the same build of Windows, guest accesses to 0xfee00320 cause an ExitInfo1 code of "0000000100000004" on hardware, whereas under Bochs we see "0000000200000004". I'm not familiar with how the codes are used for NPF yet; in the meantime, could you reconfirm the accuracy of how Bochs sets the ExitInfo1 code for nested page faults under SVM, as they're different to hardware. Though I also...
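To make the difference between the two codes easier to see, here is a small decoding sketch. It follows my reading of the AMD manual's NPF EXITINFO1 layout: the low bits mirror a page-fault error code, bit 32 means the fault hit while translating the guest's final physical address, and bit 33 means it hit while translating the guest page tables. `NpfInfo`, `decode_npf_exitinfo1`, and `check_thread_examples` are hypothetical names for illustration, not Bochs code.

```cpp
#include <cassert>
#include <cstdint>

// Decoded view of an SVM nested-page-fault EXITINFO1 value.
struct NpfInfo {
    bool present;            // bit 0: fault on a present nested page
    bool write;              // bit 1: access was a write
    bool user;               // bit 2: user-mode bit of the error code
    bool reserved_bit;       // bit 3: reserved bit set in a nested table entry
    bool instruction_fetch;  // bit 4: access was a code fetch
    bool final_translation;  // bit 32: fault on the guest's final physical address
    bool page_table_walk;    // bit 33: fault while walking the guest page tables
};

// Hypothetical helper: splits EXITINFO1 into the flags above.
inline NpfInfo decode_npf_exitinfo1(uint64_t exitinfo1) {
    auto bit = [exitinfo1](unsigned n) { return ((exitinfo1 >> n) & 1u) != 0; };
    return NpfInfo{bit(0), bit(1), bit(2), bit(3), bit(4), bit(32), bit(33)};
}

// The two values quoted in this thread differ only in bits 32 and 33.
inline void check_thread_examples() {
    const NpfInfo hw = decode_npf_exitinfo1(0x0000000100000004ull);
    const NpfInfo bochs = decode_npf_exitinfo1(0x0000000200000004ull);
    assert(hw.final_translation && !hw.page_table_walk);
    assert(!bochs.final_translation && bochs.page_table_walk);
    assert(hw.user && bochs.user);
}
```

Read this way, the hardware value reports a fault on the final guest physical address (the APIC access itself), while the Bochs value reports a fault during a guest page-table walk.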
Hey, so I've got another crash. In Windows with AMD SVM enabled alongside nested paging, the Windows guest tries to access the APIC address 0xfee00320. This leads to a nested page fault vmexit, and the hypervisor ends up injecting a general protection fault exception that crashes the guest. I'm trying to look into it, but if you have any thoughts on this that'd be great. Thanks.
Ok, so with the code in the repo now, it's like my last dot point above. Execution continues until the kernel panics due to corrupted registers, which I think is from the stale VMCB bug I described. Thanks for the fix; I figure the PAT bug / incorrect handling was just unrelated to the VMCB issues I'm having. I'm continuing to look into this, but any ideas are appreciated.
Ok, so a couple of updates:
- With the current code in the repo, booting AMD Hyper-V in Bochs results in a CPU exception that panics and kills execution
- By adding in the hardcoding of the PAT, no panics occur but the host just seems to hang and nothing really happens
- By adding in code to set the guest PAT properly before vmrun (setting msr.pat to the saved guest PAT in the VMCB in SvmEnterLoadCheckControls), code progresses further to the original BSOD I was getting when I started this thread
- By adding...
Sorry, I should have confirmed: no, Hyper-V will not load without nested paging support (it just loads into base Windows without Hyper-V if it doesn't detect nested paging support).
Oh, also, I don't see the guest PAT actually being loaded before vmrun? I see it being checked in SvmEnterLoadCheckControls, but not actually applied to msr.pat?
Actually, you will want to save/restore the host PAT. Right now, with the guest PAT change just made, Hyper-V AMD doesn't boot nearly as far as it did; at the first SVM exit it just resets. I think the host PAT being corrupted by not being restored is definitely affecting this.
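Roughly what I mean by save/restore, as a standalone sketch rather than a patch. The `CpuModel`, `VmcbModel`, and `HostSaveModel` types and both functions are hypothetical stand-ins, not the real Bochs structures. It also assumes the guest PAT only applies when nested paging is enabled and that the guest value should be written back to the VMCB on exit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the CPU MSR state, the VMCB, and the host save area.
struct CpuModel { uint64_t pat; };
struct VmcbModel { bool nested_paging; uint64_t g_pat; };
struct HostSaveModel { uint64_t pat; };

// On VMRUN: remember the host PAT, then switch to the guest PAT from the VMCB.
inline void vmrun_load_pat(CpuModel& cpu, const VmcbModel& vmcb, HostSaveModel& host) {
    host.pat = cpu.pat;
    if (vmcb.nested_paging) {
        cpu.pat = vmcb.g_pat;
    }
}

// On #VMEXIT: write the guest PAT back (assumption), then restore the host PAT.
inline void vmexit_restore_pat(CpuModel& cpu, VmcbModel& vmcb, const HostSaveModel& host) {
    if (vmcb.nested_paging) {
        vmcb.g_pat = cpu.pat;
    }
    cpu.pat = host.pat;
}

// Round trip: the host PAT must survive a VMRUN / #VMEXIT pair unchanged.
inline void check_pat_round_trip() {
    CpuModel cpu{0x0007040600070406ull};
    VmcbModel vmcb{true, 0x0007010600070106ull};
    HostSaveModel host{0};
    vmrun_load_pat(cpu, vmcb, host);
    assert(cpu.pat == 0x0007010600070106ull);
    vmexit_restore_pat(cpu, vmcb, host);
    assert(cpu.pat == 0x0007040600070406ull);
}
```

If the restore step is skipped, the host keeps running with the guest's PAT after the first exit, which would fit the reset I'm seeing.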
Hello, it looks like there are MSRs (like the PAT) that aren't saved/restored for the guest and host in svm.cc. This appears to apply to all MSRs defined after SVM_GUEST_PAT too. I think this is the issue preventing Hyper-V from booting properly. Thanks.
AMD SVM Hyper-V fails (bug)