From: Chris L. <cla...@re...> - 2008-04-23 18:23:30
Attachments:
kvm-66-msr-k7.patch
|
Avi, Joerg, While trying to boot a RHEL-4 guest on latest KVM tip on an AMD machine, I found that the guest would consistently crash when trying to setup the NMI watchdog. I traced it down to the following commit: 51ef1ac7b23ee32bfcc61c229d634fdc1c68b38a It seems that in that commit, the K7_EVNTSEL MSR's were set to fail if the data != 0. That test is actually fine, the problem is how the code around it is generated. That is, we are only supposed to go to unhandled if data != 0; but for some reason, we are *always* going to unhandled, even when the data == 0. That causes RHEL-4 kernel to crash. If I rearrange the code to look like this: case MSR_K7_EVNTSEL0: case MSR_K7_EVNTSEL1: case MSR_K7_EVNTSEL2: case MSR_K7_EVNTSEL3: if (data != 0) return kvm_set_msr_common(vcpu, ecx, data); default: return kvm_set_msr_common(vcpu, ecx, data); } Then everything works again. A patch that does just this is attached. It might be slightly nicer to say "if (data == 0) return 0" and then just fall through to the default case, but I don't much care either way. Signed-off-by: Chris Lalancette <cla...@re...> |
From: Avi K. <av...@qu...> - 2008-04-24 07:35:49
|
Chris Lalancette wrote: > Avi, Joerg, > While trying to boot a RHEL-4 guest on latest KVM tip on an AMD machine, I > found that the guest would consistently crash when trying to setup the NMI > watchdog. I traced it down to the following commit: > > 51ef1ac7b23ee32bfcc61c229d634fdc1c68b38a > > It seems that in that commit, the K7_EVNTSEL MSR's were set to fail if the data > != 0. That test is actually fine, the problem is how the code around it is > generated. That is, we are only supposed to go to unhandled if data != 0; but > for some reason, we are *always* going to unhandled, even when the data == 0. > That causes RHEL-4 kernel to crash. If I rearrange the code to look like this: > > case MSR_K7_EVNTSEL0: > case MSR_K7_EVNTSEL1: > case MSR_K7_EVNTSEL2: > case MSR_K7_EVNTSEL3: > > if (data != 0) > return kvm_set_msr_common(vcpu, ecx, data); > > default: > return kvm_set_msr_common(vcpu, ecx, data); > } > > Then everything works again. A patch that does just this is attached. It might > be slightly nicer to say "if (data == 0) return 0" and then just fall through to > the default case, but I don't much care either way. > You mean the gcc generates wrong code? It seems fine here (though wonderfully obfuscated). Can you attach an objdump -Sr svm.o? Also, what gcc version are you using? -- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-04-24 15:20:41
|
Chris Lalancette wrote: > Avi Kivity wrote: > >> You mean the gcc generates wrong code? It seems fine here (though >> wonderfully obfuscated). >> >> Can you attach an objdump -Sr svm.o? Also, what gcc version are you using? >> >> > > (sending attachment in private mail, so I don't spam the whole list with 189K of > objdump). Note that this is an objdump -Sr of the current code, with my patch > *not* applied. > > gcc is gcc-4.3.0-7 in Fedora 9. > > It's a gcc bug. svm_set_msr() places ecx in %rsi, and consistently uses %esi to refer to the first 32 bits. But when it compiles this bit: > case MSR_K7_EVNTSEL0: > case MSR_K7_EVNTSEL1: > case MSR_K7_EVNTSEL2: > case MSR_K7_EVNTSEL3: > /* > * only support writing 0 to the performance counters for now > * to make Windows happy. Should be replaced by a real > * performance counter emulation later. > */ > if (data != 0) > goto unhandled; > break; (where MSR_K7_EVENTSEL[0123] == 0xc001000[0123]) it compiles it into > 1811: 8d 86 00 00 ff 3f lea 0x3fff0000(%rsi),%eax > 1817: 83 f8 03 cmp $0x3,%eax > 181a: 0f 87 e2 01 00 00 ja 1a02 <svm_set_msr+0x27f> Now it uses %rsi instead of %esi, and any junk in the upper bits will cause the ja to be taken. We need to get a reduced testcase to the gcc folks, this is a serious bug. Any changes in the code to work around this would be fragile. -- error compiling committee.c: too many arguments to function |
From: Chris L. <cla...@re...> - 2008-04-24 15:44:07
|
Avi Kivity wrote: > >> 1811: 8d 86 00 00 ff 3f lea 0x3fff0000(%rsi),%eax >> 1817: 83 f8 03 cmp $0x3,%eax >> 181a: 0f 87 e2 01 00 00 ja 1a02 <svm_set_msr+0x27f> > Now it uses %rsi instead of %esi, and any junk in the upper bits will > cause the ja to be taken. > > We need to get a reduced testcase to the gcc folks, this is a serious > bug. Any changes in the code to work around this would be fragile. > Ouch, I missed this on my reading of it. I'll try to come up with a standalone program that shows this. Thanks, Avi. Chris Lalancette |
From: Chris L. <cla...@re...> - 2008-04-24 22:13:13
|
Avi Kivity wrote: > Now it uses %rsi instead of %esi, and any junk in the upper bits will > cause the ja to be taken. > > We need to get a reduced testcase to the gcc folks, this is a serious > bug. Any changes in the code to work around this would be fragile. > Avi, I've now filed a bug in the upstream gcc database: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36040 And I came up with a reduced test case, available here: http://people.redhat.com/clalance/rsi-test-case.tar.bz2 If I compile the code in the above and look at the disassembly, it shows the problem; however, I can't reproduce the bug by actually running the code. I suspect the %rsi register is always 0 when we start in this userland code, so I never run into the bogus ja, but I just thought I'd mention it. Chris Lalancette |
From: Avi K. <av...@qu...> - 2008-04-25 07:31:49
|
Chris Lalancette wrote: > Avi Kivity wrote: > >> Now it uses %rsi instead of %esi, and any junk in the upper bits will >> cause the ja to be taken. >> >> We need to get a reduced testcase to the gcc folks, this is a serious >> bug. Any changes in the code to work around this would be fragile. >> >> > > Avi, > I've now filed a bug in the upstream gcc database: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36040 > > And I came up with a reduced test case, available here: > > http://people.redhat.com/clalance/rsi-test-case.tar.bz2 > > If I compile the code in the above and look at the disassembly, it shows the > problem; however, I can't reproduce the bug by actually running the code. I > suspect the %rsi register is always 0 when we start in this userland code, so I > never run into the bogus ja, but I just thought I'd mention it. > > Hmm, looking back at the dump: > 1811: 8d 86 00 00 ff 3f lea 0x3fff0000(%rsi),%eax > 1817: 83 f8 03 cmp $0x3,%eax > 181a: 0f 87 e2 01 00 00 ja 1a02 <svm_set_msr+0x27f> So while gcc is using %rsi, it loads the result back into %eax, which has the effect of dropping back into 32-bits. So looks like gcc was right here. Sorry for spreading confusion and apologies to gcc. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. |
From: Chris L. <cla...@re...> - 2008-04-25 13:08:08
|
Avi Kivity wrote: > > Hmm, looking back at the dump: > >> 1811: 8d 86 00 00 ff 3f lea 0x3fff0000(%rsi),%eax >> 1817: 83 f8 03 cmp $0x3,%eax >> 181a: 0f 87 e2 01 00 00 ja 1a02 <svm_set_msr+0x27f> > > So while gcc is using %rsi, it loads the result back into %eax, which > has the effect of dropping back into 32-bits. So looks like gcc was > right here. Sorry for spreading confusion and apologies to gcc. > OK. Well, then I can't explain why we are unconditionally calling kvm_set_msr_common(), regardless of whether data == 0 or not. Avi, you said it works for you; what version of gcc are you using, and can you send me your objdump -Sr? I'd like to compare the assembly output with what 4.3.0 is spitting out. Chris Lalancette |
From: Chris L. <cla...@re...> - 2008-04-25 18:43:36
|
Avi Kivity wrote: > > Hmm, looking back at the dump: > >> 1811: 8d 86 00 00 ff 3f lea 0x3fff0000(%rsi),%eax >> 1817: 83 f8 03 cmp $0x3,%eax >> 181a: 0f 87 e2 01 00 00 ja 1a02 <svm_set_msr+0x27f> > > So while gcc is using %rsi, it loads the result back into %eax, which > has the effect of dropping back into 32-bits. So looks like gcc was > right here. Sorry for spreading confusion and apologies to gcc. > Avi, Arg. I was completely, utterly wrong about the problem here (although there is definitely still a problem). I'm sorry for making a confusing mess out of this. Here is what is actually happening: During startup, the RHEL-4 x86_64 kernel (2.6.9-67.EL, if you care) setups up the NMI watchdog. It does the following: for(i = 0; i < 4; ++i) { /* Simulator may not support it */ if (checking_wrmsrl(MSR_K7_EVNTSEL0+i, 0UL)) return; wrmsrl(MSR_K7_PERFCTR0+i, 0UL); } checking_wrmsrl() just does a "test write" to the msr; because of the code that is currently in there, this succeeds. However, when it tries to do the MSR_K7_PERFCTR0 wrmsr, *that* is where it fails, since we don't currently handle that MSR, and KVM injects a GPF into the guest (which kills it). My previous patch just happened to fix this because it was making checking_wrmsrl() fail on the EVNTSEL0, so we just returned out of this loop rather than trying to write to the PERFCTR0. Unfortunately, we can't just "fake emulate" MSR_K7_PERFCTR[0-3] like we are doing for MSR_K7_EVNTSEL[0-3]; if they are there, linux expects to be able to put values into them. I think the correct solution here is to emulate MSR_K7_PERFCTR[0-3] and MSR_K7_EVNTSEL[0-3] for real. I'm working on a patch to do this now. Chris Lalancette |
From: Joerg R. <joe...@am...> - 2008-04-28 12:46:17
|
On Fri, Apr 25, 2008 at 02:43:19PM -0400, Chris Lalancette wrote: > Unfortunately, we can't just "fake emulate" MSR_K7_PERFCTR[0-3] like we are > doing for MSR_K7_EVNTSEL[0-3]; if they are there, linux expects to be able to > put values into them. I think the correct solution here is to emulate > MSR_K7_PERFCTR[0-3] and MSR_K7_EVNTSEL[0-3] for real. I'm working on a patch to > do this now. We already discussed the emulation of the performance counter registers in the past. The conclusion is, that we loose live migration with that emulation because performance monitoring is implemented differently between AMD and Intel systems. Maybe discarding _any_ writes to the performance counter MSRs will fix the guest crash. What we should not do is injecting GPF on zero writes to the MSRs because that will break Windows XP 64bit installation. Joerg -- | AMD Saxony Limited Liability Company & Co. KG Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany System | Register Court Dresden: HRA 4896 Research | General Partner authorized to represent: Center | AMD Saxony LLC (Wilmington, Delaware, US) | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy |
From: Chris L. <cla...@re...> - 2008-04-28 14:19:57
Attachments:
kvm-66-fix-k7-msr.patch
|
Joerg Roedel wrote: > We already discussed the emulation of the performance counter registers > in the past. The conclusion is, that we loose live migration with that > emulation because performance monitoring is implemented differently > between AMD and Intel systems. Maybe discarding _any_ writes to > the performance counter MSRs will fix the guest crash. What we should > not do is injecting GPF on zero writes to the MSRs because that will > break Windows XP 64bit installation. OK, yeah, I read some of those threads over the weekend. It is a larger problem than just this crash, and probably one to be solved later. I have attached a patch which just discards all writes, as you suggested; this allows my RHEL-4 guest to successfully boot, although obviously the NMI watchdog which it thinks it set up is not actually working. Joerg, can you test this on your 64-bit Windows XP guest to make sure it is still working? Thanks, Chris Lalancette |
From: Andi K. <an...@fi...> - 2008-04-28 16:42:10
|
Chris Lalancette <cla...@re...> writes: > this allows my RHEL-4 guest to successfully boot, although obviously the NMI > watchdog which it thinks it set up is not actually working. That's fine. On a virtualized environment you don't really need a NMI watchdog because you can always debug from the outside -Andi |
From: Joerg R. <joe...@am...> - 2008-04-28 15:58:30
|
On Mon, Apr 28, 2008 at 10:17:37AM -0400, Chris Lalancette wrote: > Joerg Roedel wrote: > > We already discussed the emulation of the performance counter registers > > in the past. The conclusion is, that we loose live migration with that > > emulation because performance monitoring is implemented differently > > between AMD and Intel systems. Maybe discarding _any_ writes to > > the performance counter MSRs will fix the guest crash. What we should > > not do is injecting GPF on zero writes to the MSRs because that will > > break Windows XP 64bit installation. > > OK, yeah, I read some of those threads over the weekend. It is a larger > problem than just this crash, and probably one to be solved later. > I have attached a patch which just discards all writes, as you suggested; > this allows my RHEL-4 guest to successfully boot, although obviously the NMI > watchdog which it thinks it set up is not actually working. Joerg, can you test > this on your 64-bit Windows XP guest to make sure it is still working? XP 64 bit installs and works fine with your patch. Joerg -- | AMD Saxony Limited Liability Company & Co. KG Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany System | Register Court Dresden: HRA 4896 Research | General Partner authorized to represent: Center | AMD Saxony LLC (Wilmington, Delaware, US) | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy |
From: Chris L. <cla...@re...> - 2008-04-28 16:41:33
|
Joerg Roedel wrote: >> OK, yeah, I read some of those threads over the weekend. It is a larger >> problem than just this crash, and probably one to be solved later. >> I have attached a patch which just discards all writes, as you suggested; >> this allows my RHEL-4 guest to successfully boot, although obviously the NMI >> watchdog which it thinks it set up is not actually working. Joerg, can you test >> this on your 64-bit Windows XP guest to make sure it is still working? > > XP 64 bit installs and works fine with your patch. Great, thanks for testing, I'll post the patch again with a proper changelog for Avi. Chris Lalancette |