
SphinxTrain+PBS: norm never converge

ben
2008-07-12
2012-09-22
  • ben

    ben - 2008-07-12

    Hi there,

    I have a small PBS cluster set up (4 nodes), and I'm having trouble running SphinxTrain on it. It never finishes step 02 if I set NPART to anything other than 1. I first tried my own training data (about 18 hours of speech), and then tried an4; step 02 converges in neither case. One reason I can think of is that the cluster nodes are not symmetric: they all have different CPUs and RAM, though they all run the same copy of Fedora 8 via netboot. Could this asymmetric hardware be the reason step 02 never converges?

    btw: I believe the cluster setup is OK, since I was able to finish step 40 on my own training data on the cluster. Of course, I had to fix some code in PBS.pm.

    thanks in advance,

    Ben

    • David Huggins-Daines

      By "never converges", what exactly do you mean? Do the likelihoods never decrease, or is it not able to run the "norm" program successfully?

      In order for an iteration to complete, two things need to happen:

      1) All of the directories in bwaccumdir/, which are created by running baum_welch.pl, need to be accessible and complete.
      2) All of the log files from baum_welch.pl, which live in logdir/02.falign_ci_hmm/, need to be accessible and complete.

      Is there a file which ends in ".norm.log" in that log directory? If so, what does it say? It could be that for whatever reason, "qstat" is reporting jobs as being finished before they actually complete, which means that norm.pl is running too soon...
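To make that concrete, here is one way to check part-log completeness by hand before norm runs. This is a sketch only: the log-file naming scheme and NPART=4 are assumptions for illustration, not SphinxTrain's exact layout, and the "finished" files are simulated in a temp directory.

```shell
#!/bin/sh
# Sketch: count how many baum_welch part logs exist and are non-empty
# before letting norm run. File naming and NPART=4 are assumptions.
NPART=4
iter=1
logdir=$(mktemp -d)   # stands in for logdir/02.falign_ci_hmm

# Simulate three finished parts and one part that never wrote its log.
for part in 1 2 3; do
    echo "overall> ... likelihood ..." > "$logdir/baum_welch.$iter-$part.log"
done

done_parts=0
for part in $(seq 1 $NPART); do
    [ -s "$logdir/baum_welch.$iter-$part.log" ] && done_parts=$((done_parts + 1))
done
echo "$done_parts of $NPART parts complete"
rm -rf "$logdir"
```

If the count is short even though qstat says the jobs are done, that points at the "qstat reports finished too soon" scenario.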

      • ben

        ben - 2008-07-12

        I think I've found something. In my cluster config, I assumed all jobs would be submitted by the head node. From the scripts, it seems the second and later iterations might be submitted from the compute nodes. Is that right? Anyway, I'll install the PBS client on those nodes and give it a try.

        Ben

        • David Huggins-Daines

          Yeah, that's correct, I bet that's why it's not working. You need to be able to launch jobs from the client nodes.

          Arguably this isn't really a good idea, and we should revise the scripts so that there's one master process that handles all iterations. But for the time being that's how it works...
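One quick sanity check before re-running: confirm that the PBS client tools are actually on the PATH of each compute node. A sketch (meant to be run on every node, e.g. via ssh):

```shell
# Sketch: report whether each PBS client tool is installed and on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: missing"
    fi
}

for cmd in qsub qstat qdel; do
    check_cmd "$cmd"
done
```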

          • ben

            ben - 2008-07-13

            Thanks, David, and yes, that was one of the reasons. I enabled the nodes in hosts.equiv.

            But I also had to hack PBS.pm. The problem was with jobs submitted from the compute nodes: because the compute nodes are netbooted, their hostnames are all "localhost". The network_path function in PBS.pm converts an absolute path to something like /net/localhost/... I guess I could have just symlinked /net/PBS_Server to /net/localhost to make it work. However, since my setup is symmetric and all the nodes mount the training directory at exactly the same location as the head node, I just commented out the call to network_path. Now everything seems to be working.
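For context, what network_path does amounts to roughly the following (a sketch in shell; the real code is Perl, and the path below is made up). With every netbooted node named "localhost", the rewritten path points back at the node itself instead of the file server:

```shell
# Sketch of PBS.pm's /net/<hostname> path rewriting (approximate).
host=$(hostname)
local_path="/home/ben/an4"            # hypothetical training directory
net_path="/net/$host$local_path"
echo "$net_path"
```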

            Compared to the current /net/$hostname approach to making file paths network-visible, I think having all nodes mount the training directory at the same location as the master node is simpler. First, that is the more natural way for an admin to set things up; second, it avoids maintaining the /net directory on a large cluster (unless that's automated).

            btw: I also had to fix the way qstat is found in PBS.pm. PBS.pm looks up qsub correctly: it first checks whether $self->{pbsbin} is defined. Somehow that check was missed for qstat and qdel, so I basically copied the qsub lookup code to qstat and qdel. (On Fedora 8, qstat and qdel are in /usr/bin, which luckily is in the path.)

            • David Huggins-Daines

              The /net/$hostname thing is an artifact of how our queue here at CMU is set up, actually. It would be better to be able to configure this somehow.

              Could you send me your changes to PBS.pm?

              • ben

                ben - 2008-07-16

                Sorry for the delay, David. Too much homework from another course ;(

                Anyway, the changes are below. I didn't include the /net/$hostname changes, as those really depend on the setup. The changes below are for qstat and qdel.

                Ben

                --- PBS.pm      2007-12-24 01:04:31.000000000 -0500
                +++ PBS.pm.new  2008-07-16 01:13:02.000000000 -0400
                @@ -182,7 +182,13 @@
                 sub query_job {
                     my ($self, $jobid) = @_;
                 
                -    my $qstatbin = catfile($self->{pbsbin}, 'qstat');
                +    my ($qstatbin);
                +    if (defined($self->{pbsbin})) {
                +        $qstatbin = catfile($self->{pbsbin}, 'qstat');
                +    }
                +    else {
                +        $qstatbin = "qstat";
                +    }
                     my $pid = open QSTAT, "-|";
                     die "Failed to fork: $!" unless defined $pid;
                     if ($pid) {
                @@ -220,7 +226,13 @@
                 sub cancel_job {
                     my ($self, $jobid) = @_;
                 
                -    my $qdelbin = catfile($self->{pbsbin}, 'qdel');
                +    my ($qdelbin);
                +    if (defined($self->{pbsbin})) {
                +        $qdelbin = catfile($self->{pbsbin}, 'qdel');
                +    }
                +    else {
                +        $qdelbin = "qdel";
                +    }
                     system($qdelbin, $jobid) == 0;
                 }
                 
    • ben

      ben - 2008-07-12

      Thanks, David.

      I checked the log files and the Perl scripts. I think the problem is that the likelihood ratio between iterations never drops below the configured convergence ratio. The bw and norm jobs do get finished by the nodes, but TiedWaitForConvergence in Util.pm never returns.
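As I understand it, the test boils down to comparing the relative likelihood change between iterations against the configured ratio. Roughly (a sketch, not SphinxTrain's exact code; the likelihood values and the $CFG_CONVERGENCE_RATIO stand-in are made up):

```shell
# Sketch of the convergence check (approximate). Values are made up.
prev_lik=-12345.6
curr_lik=-12340.2
ratio=0.04    # stand-in for $CFG_CONVERGENCE_RATIO
verdict=$(awk -v p="$prev_lik" -v c="$curr_lik" -v r="$ratio" 'BEGIN {
    ap = (p < 0) ? -p : p;        # |previous likelihood|
    improvement = (c - p) / ap;   # relative improvement this iteration
    if (improvement < r) print "converged";
    else                 print "keep iterating";
}')
echo "$verdict"
```

If an iteration's accumulators are incomplete (e.g. norm ran too soon), the likelihoods jump around and this ratio may never settle below the threshold.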

      note: this is the an4 setup from the training tutorial; I only changed it to do forced alignment. You can check the .cfg file at the URL posted below.

      Both directories you mentioned are there and accessible by the nodes. You can check my training directory here: http://stellar.servebeer.com/sphinx/an4/.

      Ben

