From: Xingyu Na <asr...@gm...> - 2015-06-13 02:36:19
Thank you Dan. I'll check the things you suggested. :-)

On 06/13/2015 03:01 AM, Daniel Povey wrote:
> BTW, a good way to debug a zombie process is to look at its parent PID
> (ppid) and check what that process is doing. E.g. is it stopped? If
> so, why? Or maybe it's busy doing something else.
>
>
> On Fri, Jun 12, 2015 at 1:59 PM, Daniel Povey <dp...@gm...> wrote:
>> This is not a Kaldi problem; it's almost certainly a problem either
>> with your GridEngine software or equivalent, or with your machine
>> (e.g. the Linux OOM killer might be getting invoked). Check the system
>> logs and the GridEngine logs.
>> Dan
>>
>>
>> On Fri, Jun 12, 2015 at 6:30 AM, Xingyu Na <asr...@gm...> wrote:
>>> No, the user didn't kill the script. And the terminal is alive.
>>> It happens rather randomly, but only when the job is submitted to a certain
>>> node, called "g05".
>>> The log hangs at
>>> =======================================
>>> # Running on g05
>>> # Started at Fri Jun 12 17:02:47 CST 2015
>>> # nnet-shuffle-egs --buffer-size=5000 --srand=2094
>>> ark:exp/nnet4d_gpu/egs/egs.12.113.ark ark:- | nnet-train-simple
>>> --minibatch-size=512 --srand=2094 exp/nnet4d_gpu/2094.mdl ark:-
>>> exp/nnet4d/2095.12.mdl
>>> nnet-train-simple --minibatch-size=512 --srand=2094 exp/nnet4d_gpu/2094.mdl
>>> ark:- exp/nnet4d_gpu/2095.12.mdl
>>> nnet-shuffle-egs --buffer-size=5000 --srand=2094
>>> ark:exp/nnet4d_gpu/egs/egs.12.113.ark ark:-
>>> =======================================
>>>
>>> It seems that nnet-shuffle-egs and nnet-train-simple do not cooperate on
>>> this specific job. Weird.....
>>>
>>> Best,
>>> X.
>>>
>>>
>>> On 06/12/2015 01:05 PM, Daniel Povey wrote:
>>>> Possibly it is in zombie status because something interrupted or
>>>> killed the run.pl process that had launched that process. E.g. a user
>>>> did ctrl-z to the top-level script, maybe.
>>>>
>>>> Dan
>>>>
>>>>
>>>> On Thu, Jun 11, 2015 at 11:11 PM, Xingyu Na <asr...@gm...>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> A user reported this when he was using the train_pnorm_fast script. Top
>>>>> gave this:
>>>>> 60442 zhangpe+ 20 0 1193408 13112 10704 S 0.0 0.0 0:02.30 nnet-shuffle-eg
>>>>> 60443 zhangpe+ 20 0 0 0 0 Z 0.0 0.0 0:02.19 nnet-train-simp
>>>>>
>>>>> It remains in zombie status forever....
>>>>> Any idea how this goes wrong?
>>>>>
>>>>> Best,
>>>>> Xingyu
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> _______________________________________________
>>>>> Kaldi-users mailing list
>>>>> Kal...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users
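
For anyone who wants to follow the ppid suggestion in this thread, here is a minimal sketch, assuming a Linux node where /proc is mounted. It is not part of Kaldi or GridEngine; the function names (parse_stat, read_stat, zombies_with_parents) are hypothetical and only for illustration. It lists every zombie together with the state of its parent, which is the process that still has to reap it.

```python
from pathlib import Path


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into a small dict."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split around the first '(' and the last ')'.
    lpar = text.index("(")
    rpar = text.rindex(")")
    rest = text[rpar + 1:].split()
    return {
        "pid": int(text[:lpar]),
        "comm": text[lpar + 1:rpar],
        "state": rest[0],      # e.g. R, S, D, T (stopped), Z (zombie)
        "ppid": int(rest[1]),
    }


def read_stat(pid, proc_root="/proc"):
    """Read and parse the stat file of one process."""
    return parse_stat(Path(proc_root, str(pid), "stat").read_text())


def zombies_with_parents(proc_root="/proc"):
    """Return (zombie, parent) info pairs for every zombie under proc_root."""
    pairs = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            info = parse_stat((entry / "stat").read_text())
            if info["state"] != "Z":
                continue
            parent = read_stat(info["ppid"], proc_root)
        except (OSError, ValueError):
            continue  # process vanished or is unreadable; skip it
        pairs.append((info, parent))
    return pairs


if __name__ == "__main__":
    for zombie, parent in zombies_with_parents():
        print(
            f"zombie {zombie['pid']} ({zombie['comm']}) <- parent "
            f"{parent['pid']} ({parent['comm']}) state={parent['state']}"
        )
```

Run it on the node where the job hangs. A parent in state T is stopped (for example by ctrl-z or SIGSTOP), while a parent in S or D is alive but has not yet called wait() on the child, so the next step is to see what that parent is blocked on.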