[SSI-devel] Re: I think I've fixed the problems in the OPENSSI-FC-1-2-0-STABLE branch
From: Jose A. R. <jos...@ac...> - 2005-01-17 12:34:07
You are right, I updated the repository and recompiled the kernel. That
bug is gone now, but I got another one! :(

Oops on loadlevel_log+0x79 (called by dvp_rexecve+0x1e2).

I commented out this function (loadlevel_log) and now the cluster is
working with 78 nodes! I'm running the linpack to test it.

On Fri, Jan 14, 2005 at 11:11:40AM -0800, John Byrne wrote:
> Jose A. Rodriguez wrote:
> > Sorry, I've tried it and I got more or less the same problem,
> > though the cluster booted more nodes. Following is the crash dump
> > (I didn't compare it, but it seems the same).
>
> It has the signature of the memory stomp I thought I got rid of. I don't
> see how it can still be there and I don't know of anything else that
> would have that signature. If you have the uncompressed version
> (vmlinux) of that kernel around, can you run "gdb vmlinux" and send the
> output of the command "disass loadlevel_log_init". That should tell me
> if you got the most critical part of the fix into your kernel. If it
> looks like this, it is wrong:
>
> 0xc02b9370 <loadlevel_log_init+0>:    push   %ebp
> 0xc02b9371 <loadlevel_log_init+1>:    mov    %esp,%ebp
> 0xc02b9373 <loadlevel_log_init+3>:    sub    $0x8,%esp
> 0xc02b9376 <loadlevel_log_init+6>:    movl   $0x1f0,0x4(%esp)
> 0xc02b937e <loadlevel_log_init+14>:   movl   $0x40,(%esp)
> 0xc02b9385 <loadlevel_log_init+21>:   call   0xc035e050 <nsc_log_init>
> 0xc02b938a <loadlevel_log_init+26>:   mov    %eax,0xc06feb88
> 0xc02b938f <loadlevel_log_init+31>:   mov    %ebp,%esp
> 0xc02b9391 <loadlevel_log_init+33>:   pop    %ebp
> 0xc02b9392 <loadlevel_log_init+34>:   ret
>
> The right code is:
>
> 0xc023bc60 <loadlevel_log_init+0>:    push   %ebp
> 0xc023bc61 <loadlevel_log_init+1>:    mov    %esp,%ebp
> 0xc023bc63 <loadlevel_log_init+3>:    push   %edi
> 0xc023bc64 <loadlevel_log_init+4>:    push   %ebx
> 0xc023bc65 <loadlevel_log_init+5>:    sub    $0x8,%esp
> 0xc023bc68 <loadlevel_log_init+8>:    movl   $0x1f0,0x4(%esp)
> 0xc023bc70 <loadlevel_log_init+16>:   movl   $0x40,(%esp)
> 0xc023bc77 <loadlevel_log_init+23>:   call   0xc02994d0 <nsc_log_init>
> 0xc023bc7c <loadlevel_log_init+28>:   mov    %eax,0xc05ef308
> 0xc023bc81 <loadlevel_log_init+33>:   test   %eax,%eax
> 0xc023bc83 <loadlevel_log_init+35>:   je     0xc023bcd4 <loadlevel_log_init+116>
> 0xc023bc85 <loadlevel_log_init+37>:   mov    0x10(%eax),%eax
> 0xc023bc88 <loadlevel_log_init+40>:   lea    (%eax,%eax,4),%eax
> 0xc023bc8b <loadlevel_log_init+43>:   lea    0x0(,%eax,4),%ebx
> 0xc023bc92 <loadlevel_log_init+50>:   cmp    $0x20000,%ebx
> 0xc023bc98 <loadlevel_log_init+56>:   ja     0xc023bcdb <loadlevel_log_init+123>
> 0xc023bc9a <loadlevel_log_init+58>:   lea    0x0(%esi),%esi
> 0xc023bca0 <loadlevel_log_init+64>:   movl   $0x1f0,0x4(%esp)
> 0xc023bca8 <loadlevel_log_init+72>:   mov    %ebx,(%esp)
> 0xc023bcab <loadlevel_log_init+75>:   call   0xc014d7d0 <kmalloc>
> 0xc023bcb0 <loadlevel_log_init+80>:   test   %eax,%eax
> 0xc023bcb2 <loadlevel_log_init+82>:   mov    %eax,%edx
> 0xc023bcb4 <loadlevel_log_init+84>:   je     0xc023bca0 <loadlevel_log_init+64>
> 0xc023bcb6 <loadlevel_log_init+86>:   mov    %ebx,%ecx
> 0xc023bcb8 <loadlevel_log_init+88>:   xor    %eax,%eax
> 0xc023bcba <loadlevel_log_init+90>:   shr    $0x2,%ecx
> 0xc023bcbd <loadlevel_log_init+93>:   mov    %edx,%edi
> 0xc023bcbf <loadlevel_log_init+95>:   repz stos %eax,%es:(%edi)
> 0xc023bcc1 <loadlevel_log_init+97>:   test   $0x2,%bl
> 0xc023bcc4 <loadlevel_log_init+100>:  je     0xc023bcc8 <loadlevel_log_init+104>
> 0xc023bcc6 <loadlevel_log_init+102>:  stos   %ax,%es:(%edi)
> 0xc023bcc8 <loadlevel_log_init+104>:  test   $0x1,%bl
> 0xc023bccb <loadlevel_log_init+107>:  je     0xc023bcce <loadlevel_log_init+110>
> 0xc023bccd <loadlevel_log_init+109>:  stos   %al,%es:(%edi)
> 0xc023bcce <loadlevel_log_init+110>:  mov    %edx,0xc06faa04
> 0xc023bcd4 <loadlevel_log_init+116>:  add    $0x8,%esp
> 0xc023bcd7 <loadlevel_log_init+119>:  pop    %ebx
> 0xc023bcd8 <loadlevel_log_init+120>:  pop    %edi
> 0xc023bcd9 <loadlevel_log_init+121>:  pop    %ebp
> 0xc023bcda <loadlevel_log_init+122>:  ret
>
> If the code is correct and you're sure you tested this kernel, then I'll
> go back to debugging.
>
> If the code is incorrect, one thing I didn't think of when I saw your
> e-mail last night is that there is a lag in anonymous CVS access in
> SourceForge. (Developers don't see this lag.) It is possible you hit
> this. Try updating your CVS tree today and see if you get different results.
>
> John Byrne
>
> <...snipped...>

Jose
____________________________________________________________________________
Jose A. Rodriguez   OOO  Universitat Politecnica de Catalunya (UPC)
jo...@ac...         OOO  Departament d'Arquitectura de Computadors
Tel. 16990          OOO  -*- LCAC -*- UPC
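
For anyone repeating the check John describes, the comparison of the two quoted listings can be scripted. This is only a rough sketch: the helper names `calls_in_disassembly` and `looks_fixed` are made up for illustration, and the heuristic (the fixed loadlevel_log_init calls kmalloc after nsc_log_init, the old one does not) is read off the two listings in this thread, not taken from the OpenSSI source. It assumes the output of "disass loadlevel_log_init" has been saved as text.

```python
import re


def calls_in_disassembly(disassembly: str) -> set:
    """Return the symbol names targeted by `call` instructions in gdb output.

    gdb prints a call as, for example:
        0xc023bcab <loadlevel_log_init+75>: call 0xc014d7d0 <kmalloc>
    Jumps (je, ja, ...) are ignored because only `call` is matched.
    """
    return set(re.findall(r"\bcall\s+0x[0-9a-fA-F]+\s+<([^>+]+)", disassembly))


def looks_fixed(disassembly: str) -> bool:
    """Heuristic (an assumption): the fixed routine calls both functions."""
    calls = calls_in_disassembly(disassembly)
    return "nsc_log_init" in calls and "kmalloc" in calls


# Short excerpts of the two listings quoted in this thread.
OLD_EXCERPT = """\
0xc02b9385 <loadlevel_log_init+21>: call 0xc035e050 <nsc_log_init>
0xc02b938a <loadlevel_log_init+26>: mov %eax,0xc06feb88
"""

FIXED_EXCERPT = """\
0xc023bc77 <loadlevel_log_init+23>: call 0xc02994d0 <nsc_log_init>
0xc023bc83 <loadlevel_log_init+35>: je 0xc023bcd4 <loadlevel_log_init+116>
0xc023bcab <loadlevel_log_init+75>: call 0xc014d7d0 <kmalloc>
"""

if __name__ == "__main__":
    print("old listing looks fixed:  ", looks_fixed(OLD_EXCERPT))
    print("fixed listing looks fixed:", looks_fixed(FIXED_EXCERPT))
```

Addresses differ between kernel builds, so the sketch compares call targets by symbol name only. A match is a hint that the fix was compiled in, not proof that the memory stomp is gone.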