From: Gene C. <ge...@cc...> - 2011-07-29 16:44:04
Thanks for the further explanation, Nick. I'll cc this message to
dmtcp-forum, in case this part is still of general interest.

We can get further debugging information out of this. It will require a
little more effort on your side, but we're very happy to work jointly on
this.

For further debugging information, could you re-compile DMTCP with
debugging? I assume here that we're using DMTCP-1.2.3. It seems like
you already tried some of this, but let's take it from the top.

  ./configure --enable-debug
  Edit mtcp/Makefile to change CFLAGS. It should look like this when done:
    # CFLAGS += -O0 -g ...
    CFLAGS += -O0 -g -DDEBUG -DTIMING -Wall
  make clean
  make    [or "make -j10" is faster and usually safe.]
  make check    [quick sanity test for a correct build. I expect no
                 errors, but even if one or two tests fail, let's
                 continue. ]
  rm -rf /tmp/dmtcp-USER@HOST    [with USER and HOST substituted, of course]
  Run your software
  When it hangs, keep it alive for now. We'll use gdb on it.
  (cd /tmp; tar zcvf dmtcp-USER@HOST.tar.gz ./dmtcp-USER@HOST)
  Copy the dmtcp-USER@HOST.tar.gz tarball somewhere. We'll want you to
  mail that to us.

Next, let's inspect in gdb why it's hanging. I think you said that you
have only one process, running with OMP (many threads). We'll attach
with gdb, where PROGRAM is an absolute path to your program and PID is
its process id. We'll want you to capture the output from your
screen/window. Copying with the mouse and pasting into an editor works
well, but any way you like.

  gdb PROGRAM PID
  (gdb) info threads
    [ typically, thread 1 is your original thread, and thread 2 is our
      checkpoint thread. The rest are yours again. ]
  (gdb) thread 1
  (gdb) where
  (gdb) thread 2
  (gdb) where

  [ If there are not too many threads, doing this for all threads is
    nice. If there are many threads, do "info threads" again, and
    you'll see some repeated patterns. We just need to see the stack
    ["where"] for one example from each of the patterns. If the stack
    is huge, it's enough to see one screenful of lines (about 20 or 30
    call frames). ]

  [ While you're doing this, you may already get some strong clues
    about where it's hanging. I suspect some kind of deadlock. Some
    locks to look out for are lll_lock (glibc low level lock), C++
    locks, maybe a specialized OMP lock, our own DMTCP lock. As you
    probably already know, the keywords "wait", "lock", "mutex" are all
    significant. ]

  [ From those keywords, you may be able to identify the two threads
    (or more than two in an unusual case) that are involved with locks.
    I expect one thread to directly hold a lock in the latest call
    frame. But other threads may have acquired a lock higher up. See
    next note. ]

  [ You may also see references to "mtcp_futex". Those are harmless.
    That's where DMTCP causes a user thread to wait during
    checkpointing. Nevertheless, one of those threads waiting on a
    futex (Linux internal form of mutex) could also be holding a
    different lock at the same time as it's waiting on a futex. So,
    it's still useful to quickly sample some of the threads waiting on
    mtcp_futex to see if they're also holding another lock. ]

And so, if you see anything interesting here also, please report it to
us. Then send us:
  1. dmtcp-USER@HOST.tar.gz
  2. the screen copy of the results of inspecting with gdb
  3. any other observations you may have made while inspecting with gdb

Then we'll inspect that here, and see what to suggest next. If we're
lucky, we'll get this on the first round. But one never knows with
bugs. :-)

Thanks again for reporting this bug,
- Gene

On Fri, Jul 29, 2011 at 10:29:56AM -0500, Nick Hall wrote:
> Hello Gene. No problem, I actually left to go out of town on the 22nd and
> just got back yesterday so it didn't impact me too much. This is actually
> 64-bit linux, Ubuntu 10.10 to be specific, and I'm running kernel 2.6.38.
> The machine has 16 GB of memory. Unfortunately the software is for an
> internal project so the code is not available.
> Is there any way I can enable
> more debug output or something that would be more helpful to you to see what
> the problem is? I understand it would be easiest if you were on the box but
> I would have to think about how to best do that as the box is currently
> locked down, there is a little bit of complexity to set the app up and it
> probably takes an hour of running before it gets in the state where the
> checkpointing fails (it seems to work if I do it soon, when the memory
> footprint is small), so it would be easiest for me if I could simply send
> more detailed trace...I have no problem compiling a different version of
> dmtcp either if you want me to try something.
>
> As a side note, I said the program takes up almost 2GB of memory in my
> previous message, I am referring to the "resident memory" (the RES column if
> you do top). At this point the VIRT column (virtual memory space) is 5-6+
> GB, and it should be noted that this machine actually doesn't have swap
> enabled so I think the reason for the discrepancy between virtual and
> resident memory is due to the fact that it is running with 5 threads,
> although I'm somewhat lacking in my knowledge about how the kernel handles
> its memory management so I could very well be wrong about that. I know part
> of it is shared libraries but they should make up a pretty small amount of
> the difference.
>
> Thanks,
> Nick
> On Thu, Jul 28, 2011 at 7:26 PM, Gene Cooperman <ge...@cc...> wrote:
>
> > Thanks very much for the bug report. This is very helpful.
> >
> > First, a quick question. Is this a 32-bit Linux? If it is, the
> > 2 GB (half the total virtual memory) could be significant.
> > Beyond that, which Linux distro is this?
> >
> > Beyond that, it will be easiest if we can reproduce the bug locally.
> > We can then poke into it with gdb, and probably quickly arrive at a
> > diagnosis.
> > There are several possibilities. Which might work best for you?
> > 1. If your code is freely available and easy to run, could you send
> >    us a copy so we can reproduce it?
> > 2. Alternatively, is it reasonable for us to get a short-term guest
> >    account on your machine and try it out after you've configured
> >    the software?
> > 3. Alternatively, would you like a guest account on our machine here,
> >    so you can reproduce the bug at our site?
> > 4. Alternatively, we could use VNC and a phone conversation, although
> >    that would take more of your time.
> >
> > Do some of those options work for you? Thanks,
> > - the DMTCP team