
Parallel running on clusters

  • Josh Bowden - 2014-09-11

    I have been working with some researchers here who have been using HaMStRad, and I have some suggestions for code changes/additions to the hamstr.pl script so that it can run in parallel on clusters:

    ## Add two new variables:
    my $PBS_ARRAYID;
    my $PBS_ARRAYSIZE;

    ## and extend the usage/help string:
    \n${bold}USAGE:${norm} hamstr.pl -sequence_file=<> -hmmset=<> -taxon=<>  [-pbs_arrayid= -pbs_arraysize=]  -refspec=<> [OPTIONS]

    ## Add to the GetOptions call:
    "pbs_arrayid=i"   => \$PBS_ARRAYID,   # the current index of the array job (1-based)
    "pbs_arraysize=i" => \$PBS_ARRAYSIZE, # how many parts the full job will be split into

    ## And in the user-input checking subroutine add something like the following
    ## (floor() needs "use POSIX qw(floor);" if hamstr.pl does not already import it):
    ## 5) set up cluster job-manager array processing - split the list of HMMs into separate jobs
    if (defined $PBS_ARRAYID) {
        $PBS_ARRAYID = $PBS_ARRAYID - 1;          # convert the 1-based array index to 0-based
        print "set up array job parallelisation:\t";
        if (defined $PBS_ARRAYSIZE) {
            ## size of the chunk of HMMs handled by this array task
            my $hmms_pbsarray_size = floor( @hmms / $PBS_ARRAYSIZE ) + 1;
            my $start_inc = $PBS_ARRAYID * $hmms_pbsarray_size;
            my $stop_inc  = $start_inc + $hmms_pbsarray_size;
            if ($stop_inc > @hmms) {
                $stop_inc = @hmms;                # clip the last chunk to the end of the list
            }
            $stop_inc = $stop_inc - 1;            # convert to an inclusive slice index
            print "start_inc = $start_inc stop_inc = $stop_inc\n";
            @hmms = @hmms[$start_inc..$stop_inc]; # this task only processes its own slice
        }
        else {
            push @log, "Please provide both -pbs_arrayid= (1 based) and -pbs_arraysize=";
            $check = 0;
            print "failed (Please provide both -pbs_arrayid= (1 based) and -pbs_arraysize=)\n";
        }
    }
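
    To see how the index arithmetic above partitions the list of HMMs across array tasks, here is a small standalone sketch; the HMM names and the counts (45 HMMs, 4 tasks) are made up for the example, and it uses POSIX floor() just like the snippet above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw(floor);

    ## hypothetical numbers: 45 HMMs split across 4 array tasks
    my @hmms = map { "hmm$_" } 1 .. 45;
    my $pbs_arraysize = 4;

    ## same chunk-size formula as in the patch above
    my $chunk = floor( @hmms / $pbs_arraysize ) + 1;

    for my $task ( 1 .. $pbs_arraysize ) {
        my $start = ( $task - 1 ) * $chunk;       # 1-based task id converted to 0-based offset
        my $stop  = $start + $chunk;
        $stop = scalar @hmms if $stop > @hmms;    # clip the last chunk
        $stop -= 1;                               # inclusive slice index
        my @slice = @hmms[ $start .. $stop ];
        printf "task %d handles indices %d..%d (%d HMMs)\n",
            $task, $start, $stop, scalar @slice;
    }

    With these made-up numbers, tasks 1-3 each get 12 HMMs and task 4 gets the remaining 9, so the whole list is covered exactly once with no overlap.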
    

    Cluster batch systems such as PBS/Torque can then drive this with inputs like:
    #PBS -t 1-20
    MAXARRAY=20
    $HAMSTRADPATH/hamstr_array.pl <other inputs> -pbs_arrayid=$PBS_ARRAYID -pbs_arraysize=$MAXARRAY

    This splits the job into 20 parts that run in parallel on the cluster.
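
    For reference, a complete Torque submission script along these lines might look like the sketch below. The job name, walltime, and the hamstr inputs (sequence file, hmmset, taxon, refspec) are placeholder values, not taken from a real run; only the array-job handling mirrors the lines above.

    #!/bin/bash
    #PBS -N hamstr_array
    #PBS -t 1-20
    #PBS -l walltime=12:00:00

    ## must match the -t range above
    MAXARRAY=20

    ## run from the directory the job was submitted from
    cd $PBS_O_WORKDIR

    ## sequence file, hmmset, taxon and refspec are placeholders for the real inputs
    $HAMSTRADPATH/hamstr_array.pl -sequence_file=my_ests.fa -hmmset=my_hmms \
        -taxon=MYTAXON -refspec=DROME \
        -pbs_arrayid=$PBS_ARRAYID -pbs_arraysize=$MAXARRAY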

    Thanks,
    Josh Bowden
    CSIRO Scientific Computing.

     
  • Ingo Ebersberger - 2014-11-04

    Hi Josh,

    Thanks a lot for the suggestion. However, I think you are referring to a different program, HaMStRad, described in the publication by Peters et al. that appeared this year in BMC Evol Biol. Peters et al. basically modified an outdated version of HaMStR, and I was not involved in this modification.

    Please note that the current version of HaMStR is fully capable of starting multiple threads to speed up the ortholog search, and it includes quite a number of new features as well as bug fixes compared to the earlier versions. Maybe it is worth a try.
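
    As a hedged illustration only: assuming the current HaMStR exposes its thread count via a -cpu option (this flag name is a guess; check hamstr.pl -h for the actual option and its default), a multi-threaded run might look like:

    ## -cpu is assumed to set the number of threads; verify the flag with hamstr.pl -h
    perl hamstr.pl -sequence_file=my_ests.fa -hmmset=my_hmms -taxon=MYTAXON \
        -refspec=DROME -cpu=8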

    Best,

    Ingo

     

