From: Daniel P. <dp...@gm...> - 2015-06-12 19:01:44
BTW, a good way to debug a zombie process is to look at its parent PID
(PPID) and check what that process is doing. E.g. is it stopped? If so,
why? Or maybe it's busy doing something else.

On Fri, Jun 12, 2015 at 1:59 PM, Daniel Povey <dp...@gm...> wrote:
> This is not a Kaldi problem; it's almost certainly a problem either
> with your GridEngine software or equivalent, or with your machine
> (e.g. the Linux OOM killer might be being invoked). Check the system
> logs and the GridEngine logs.
> Dan
>
>
> On Fri, Jun 12, 2015 at 6:30 AM, Xingyu Na <asr...@gm...> wrote:
>> No, the user didn't kill the script. And the terminal is alive.
>> It happens rather randomly, but only when the job is submitted to a certain
>> node, called "g05".
>> The log hangs at
>> =======================================
>> # Running on g05
>> # Started at Fri Jun 12 17:02:47 CST 2015
>> # nnet-shuffle-egs --buffer-size=5000 --srand=2094
>> ark:exp/nnet4d_gpu/egs/egs.12.113.ark ark:- | nnet-train-simple
>> --minibatch-size=512 --srand=2094 exp/nnet4d_gpu/2094.mdl ark:-
>> exp/nnet4d/2095.12.mdl
>> nnet-train-simple --minibatch-size=512 --srand=2094 exp/nnet4d_gpu/2094.mdl
>> ark:- exp/nnet4d_gpu/2095.12.mdl
>> nnet-shuffle-egs --buffer-size=5000 --srand=2094
>> ark:exp/nnet4d_gpu/egs/egs.12.113.ark ark:-
>> =======================================
>>
>> It seems that nnet-shuffle-egs and nnet-train-simple do not cooperate on
>> this specific job. Weird.....
>>
>> Best,
>> X.
>>
>>
>> On 06/12/2015 01:05 PM, Daniel Povey wrote:
>>>
>>> Possibly it is in zombie status because something interrupted or
>>> killed the run.pl process that had launched that process. E.g. a user
>>> did ctrl-z to the top-level script, maybe.
>>>
>>> Dan
>>>
>>>
>>> On Thu, Jun 11, 2015 at 11:11 PM, Xingyu Na <asr...@gm...>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> A user reported this when he was using the train_pnorm_fast script.
>>>> Top gave this:
>>>> 60442 zhangpe+ 20 0 1193408 13112 10704 S 0.0 0.0 0:02.30 nnet-shuffle-eg
>>>> 60443 zhangpe+ 20 0       0     0     0 Z 0.0 0.0 0:02.19 nnet-train-simp
>>>>
>>>> It remains in zombie status forever....
>>>> Any idea how this goes wrong?
>>>>
>>>> Best,
>>>> Xingyu
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> _______________________________________________
>>>> Kaldi-users mailing list
>>>> Kal...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users
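
The parent-PID check suggested at the top of this message could be scripted roughly as below. This is only a sketch, and it assumes a Linux node with /proc mounted. The names `parse_stat`, `report` and `STATE_NAMES` and the script name are made up for illustration; none of this is part of Kaldi or GridEngine. It reads /proc/<pid>/stat, prints the state of the process, then does the same for its parent, so you can see whether the parent is stopped (state T) or busy elsewhere.

```python
import sys

# Human-readable names for the single-letter states in /proc/<pid>/stat.
STATE_NAMES = {
    "R": "running",
    "S": "sleeping (interruptible)",
    "D": "uninterruptible sleep (usually I/O)",
    "T": "stopped by a signal (e.g. ctrl-z)",
    "t": "stopped by a tracer",
    "Z": "zombie",
}


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    rest = text[rparen + 1:].split()
    # After the command name, the next two fields are state and parent PID.
    return pid, comm, rest[0], int(rest[1])


def read_stat(pid):
    """Read and parse /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())


def report(pid):
    """Describe a process and its parent, one line each."""
    pid, comm, state, ppid = read_stat(pid)
    lines = [f"{pid} ({comm}): {STATE_NAMES.get(state, state)}, parent {ppid}"]
    if ppid > 0:
        _, pcomm, pstate, _ = read_stat(ppid)
        lines.append(f"parent {ppid} ({pcomm}): {STATE_NAMES.get(pstate, pstate)}")
    return "\n".join(lines)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        print(report(int(sys.argv[1])))
```

Saved as, say, zombie_parent.py, it would be run on the affected node as `python zombie_parent.py 60443`, using the PID that top shows in state Z. A zombie only goes away once its parent reaps it with wait(), so the parent's state is the useful part of the output.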