From: SourceForge.net <no...@so...> - 2009-04-15 20:10:49
Bugs item #2756909, was opened at 2009-04-13 00:23
Message generated for change (Comment added) made by henryn
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=622063&aid=2756909&group_id=98788

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
>Category: Linux Kernel
Group: v0.8.x (devel)
>Status: Closed
>Resolution: Fixed
>Priority: 8
Private: No
Submitted By: Nobody/Anonymous (nobody)
>Assigned to: Henry N. (henryn)
Summary: sleep crashing with "Assertion `0 <= seconds' failed."

Initial Comment:
While researching bug 2748015
(http://sourceforge.net/tracker/?func=detail&aid=2748015&group_id=98788&atid=622063),
we came across another problem with the 0.7.4-rc1 and 0.8.0 code base.

When starting command "while true; do sleep 0.1; done" and starting command
"openssl genrsa -out /dev/null 4096" in another session, the sleep command in
the first session occasionally aborts with this error:

sleep: xnanosleep.c:67: xnanosleep: Assertion `0 <= seconds' failed.
Aborted

This problem can be reproduced even when the two commands are started as
different unprivileged users. It looks as if the kernel is changing the memory
of random processes.

Tested versions:
colinux 0.8.0: suffers this bug
colinux 0.7.4-rc1: suffers this bug
colinux 0.7.3: does not suffer this bug

Test system:
AMD AthlonXP 3800+
Windows XP SP3 + all updates to date
Guest OS is ArchLinux (ver 2009.02) (using only prebuilt packages from the
ArchLinux repositories)

While researching this further, I discovered this thread, which describes a
bug in the User Mode Linux kernel from almost a year ago.
http://fixunix.com/openssl/518688-re-uml-devel-dev-random-problems-fp-registers-corruption.html

I have not been able to link this to a bug on the UML Sourceforge.net
development page.

Keith

----------------------------------------------------------------------

>Comment By: Henry N. (henryn)
Date: 2009-04-15 22:10

Message:
This bug is fixed now by reverting the changes from SVN r1237 (Floating
point optimizations for operating switch). It's committed as SVN revision
r1243 (devel) and r1245 (stable).

New snapshots are available on http://www.colinux.org/snapshots/

Keith, many thanks for reporting and for the helpful test environments.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2009-04-14 08:54

Message:
Hi,
I have read the email carefully.
I have tried your code, both the dbl and the int version.
I cannot see the problem.
It would be very interesting if you (nobody :=) ) could test my colinux
version.
If you can, please contact me at paolo DOT minazzi AT gmail DOT com
I will send you a link so you can try a slightly different version.
It would help us understand the problem better.
On my hardware (3 PCs) I cannot see this problem.
Thanks,
Paolo

----------------------------------------------------------------------

Comment By: Henry N. (henryn)
Date: 2009-04-13 23:40

Message:
I called Stano, who has reported the same problem on UML. He said that this
bug is not solved in UML, and he has this workaround:
"The only thing that helps in my case is running the guest with mode=skas0.
This eliminated the problem completely and the guest is running for months
without any problem."

Currently I cannot find out what skas0 does; I have not found where the mode
is changed.

I have modified the test further, so I can say it is not stack clobbering:

#include <stdio.h>
#include <unistd.h>   /* for usleep() */

volatile double theDouble;

int main(int argc, char* argv[]){
    theDouble = 1;
    while(1){
        usleep(0);
        if(theDouble != 1){
            printf("Double test fails!\n");
            printf("- current Double: %f (%llX)\n", theDouble, theDouble);
            printf("- current Double: %f (%llX)\n", theDouble, theDouble);
            printf("- current Double: %f (%llX)\n", theDouble, theDouble);
            break;
        }
    }
    return 0;
}

Compiled with "gcc -ggdb -o dblchange dblchange.c" on Debian 4.0 (gcc 4.1.2).
Here are some of the errors:

Double test fails!
- current Double: nan (FFF8000000000000)
- current Double: nan (FFF8000000000000)
- current Double: nan (FFF8000000000000)

Double test fails!
- current Double: 1.000000 (3FF0000000000000)
- current Double: 1.000000 (3FF0000000000000)
- current Double: 1.000000 (3FF0000000000000)

Double test fails!
- current Double: 1.000000 (3FF0000000000000)
- current Double: nan (FFF8000000000000)
- current Double: 1.000000 (3FF0000000000000)

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2009-04-13 22:58

Message:
Aah, yeah, the double instead of integer thing wasn't that clever :)
I changed the code to really use integers this time and it doesn't crash
anymore. It seems only the double operations/registers get hosed.
Also, if the int registers were unreliable I would expect a lot more errors
popping up when running the guest system.

As for reverting, I think that's the best solution for now. But I did notice
the speed improvements in the 0.8.0 code base as opposed to the 0.7.3 one and
it was quite nice, so I would love to see a working version of the speed
enhancements.

Henry, thanks for helping hunt this bug down and good luck with the
colinux development. It's been real.

Keith

----------------------------------------------------------------------

Comment By: Henry N. (henryn)
Date: 2009-04-13 17:32

Message:
This comes from changes in SVN revision r1237.

2009-03-21T23:56:07 henryn r1237
* Remove co_switch_wrapper_protected and all workaround for SSE/MMX on raid
  modules. Reverts the workaround from SVN r1212, related Bugs #2524658,
  #2551241.

r1236 20090319-Snapshot runs
r1237 20090321-Snapshot fails

Tested dblchange-nosleep.c and "openssl genrsa -out /dev/null 4096" in the
fltk console.
Snapshots are from
http://www.henrynestler.com/colinux/testing/devel-0.8.0/

I feel we should revert this change in this release to the slower but
safer code.
Henry

----------------------------------------------------------------------

Comment By: Henry N. (henryn)
Date: 2009-04-13 17:11

Message:
Hello Keith,

in intchange.c:
> double theDouble;
> double theLastDouble;
there you also used double, not integer.

But it is very nice to see that your dblchange.c fails very shortly after
starting "openssl genrsa -out /dev/null 4096" in a second console.

I have tuned the failure case a little by removing some calculations and
forcing a task switch before the comparison:

dblchange-nosleep.c:
#include <stdio.h>
#include <unistd.h>   /* for sleep() */

int main(int argc, char* argv[]){
    double theDouble, theLastDouble;

    theDouble = 1;
    while(1){
        theDouble += 1;
        theLastDouble = theDouble;
        sleep(0); /* force task switch here */
        if(theLastDouble != theDouble){
            printf("Double test fails!\n");
            printf("- previous Double: %f (%LX)\n", theLastDouble, theLastDouble);
            printf("- current Double: %f (%LX)\n", theDouble, theDouble);
            break;
        }
    }
    return 0;
}

Some example failures:

Double test fails!
- previous Double: 151.000000 (4062E00000000000)
- current Double: 151.000000 (4062E00000000000)

Double test fails!
- previous Double: 3.000000 (4008000000000000)
- current Double: 3.000000 (4008000000000000)

Double test fails!
- previous Double: nan (FFF8000000000000)
- current Double: nan (FFF8000000000000)

Double test fails!
- previous Double: 2.000000 (4000000000000000)
- current Double: 2.000000 (4000000000000000)

Double test fails!
- previous Double: 6.000000 (4018000000000000)
- current Double: 6.000000 (4018000000000000)

Double test fails!
- previous Double: 3.000000 (4008000000000000)
- current Double: 3.000000 (4008000000000000)

This also fails if I remove the "sleep" completely:

- previous Double: nan (FFF8000000000000)
- current Double: nan (FFF8000000000000)

Double test fails!
- previous Double: nan (FFF8000000000000)
- current Double: 27945100.000000 (417AA688C0000000)

Double test fails!
- previous Double: 56525465.000000 (418AF414C8000000)
- current Double: 56525465.000000 (418AF414C8000000)

So it is not the sleep itself. It is the task switch, and/or something
stupid in the keygen.

I will check this on some revisions from before we changed the FPU
save/restore (20090321, SVN r1237).

Henry

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2009-04-13 11:52

Message:
To test this problem I simplified the program listed in the fixunix thread
above.

dblchange.c:
#include <stdio.h>
#include <unistd.h>   /* for sleep() */

#define true 1
#define false 0

int main(int argc, char* argv[]){
    double theDouble;
    double theLastDouble;

    theDouble = 1;
    while(true){
        theLastDouble = theDouble;
        theDouble += 1;
        if(theLastDouble + 1 != theDouble){
            printf("Double test fails!\n");
            printf("- previous double: %f (%LX)\n", theLastDouble, theLastDouble);
            printf("- current double: %f (%LX)\n", theDouble, theDouble);
            break;
        }
        sleep(1);
    }
    return 0;
}

intchange.c:
#include <stdio.h>
#include <unistd.h>   /* for sleep() */

#define true 1
#define false 0

int main(int argc, char* argv[]){
    double theInteger;
    double theLastInteger;

    theInteger = 1;
    while(true){
        theLastInteger = theInteger;
        theInteger += 1;
        if(theLastInteger + 1 != theInteger){
            printf("Integer test fails!\n");
            printf("- previous int: %d (%X)\n", theLastInteger, theLastInteger);
            printf("- current int: %d (%X)\n", theInteger, theInteger);
            break;
        }
        sleep(1);
    }
    return 0;
}

Judging by the error thrown by sleep, it seems the double value which
specifies how long to sleep gets changed outside of the program's control.

First I adapted the fixunix program to test for doubles being changed. It
runs smoothly until I start the openssl key generation operation. Then it
errors after several seconds:

Double test fails!
- previous double: nan (FFF8000000000000)
- current double: nan (FFF8000000000000)

By injecting some other printfs I've seen that in the fatal iteration the
second read of the previous double goes wrong.
But this doesn't matter that much because it gets overwritten by the current
double variable. After that both variables are good again, but when
increasing the current variable the outcome becomes the NAN value. Output
below:

+++
- previous double: 4.000000 (FFF8000000000000)
- current double: 5.000000 (4014000000000000)
theLastDouble = theDouble;
- previous double: 5.000000 (4014000000000000)
- current double: 5.000000 (4014000000000000)
theDouble += 1;
- previous double: 5.000000 (4014000000000000)
- current double: nan (FFF8000000000000)
Double test fails!
- previous double: 5.000000 (4014000000000000)
- current double: nan (FFF8000000000000)

Note: at the "Double test fails!" piece the previous double does not have a
NAN value. This only occurs when I add printfs, so I blame this on the
printfs doing stuff in between which changes the data flow.

After a little further investigation it turns out that the previous double
gets corrupted because, in the final iteration, the second read of any double
gets turned into the NAN value. This means the current value will be read as
NAN and then copied to the previous value.

After the double catastrophe I was curious whether integers would also be
affected, so I wrote intchange.c. This showed that even integers are affected
by this bug. Output below:

Integer test fails!
- previous int: 0 (FFF80000)
- current int: 0 (FFF80000)

The hex pattern seems to be the same as the corruption which doubles seem to
get.

Keith

----------------------------------------------------------------------
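The test programs in this thread print the hex pattern by passing a double
to %llX / %LX, which is undefined behaviour in C and only happened to work on
the 32-bit x86 guest. Below is a minimal sketch of a variant that copies the
bits out with memcpy instead. The names double_bits, double_from_bits,
QNAN_INDEFINITE and the DBLCHECK_DEMO switch are hypothetical and not from the
original programs; sched_yield() stands in for the usleep(0)/sleep(0) calls
and assumes a POSIX system. The recurring value FFF8000000000000 is the
default quiet NaN that the x86 FPU produces for an invalid operation, which
fits corrupted floating point state rather than corrupted memory.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the raw IEEE 754 bit pattern of a double.
   memcpy is well defined, unlike passing a double to %llX. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Rebuild a double from a raw bit pattern (inverse of double_bits). */
static double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Pattern reported throughout the thread: sign bit set, exponent all
   ones, top fraction bit set (the x86 default quiet NaN). */
#define QNAN_INDEFINITE UINT64_C(0xFFF8000000000000)

#ifdef DBLCHECK_DEMO   /* hypothetical switch: build with -DDBLCHECK_DEMO */
#include <sched.h>

int main(void)
{
    volatile double value = 1.0;

    for (;;) {
        sched_yield();            /* invite a task switch */
        double loaded = value;    /* load through the floating point unit */
        if (loaded != 1.0) {      /* floating point comparison, as in the thread */
            printf("Double test fails!\n");
            printf("- current double: %f (%016" PRIX64 ")\n",
                   loaded, double_bits(loaded));
            return 1;
        }
    }
}
#endif
```

Built with "gcc -DDBLCHECK_DEMO -o dblcheck dblcheck.c" (file name assumed),
the loop runs until a load or comparison returns something other than 1.0,
as the original dblchange.c did.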