
Parallel running on clusters

  • Josh Bowden - 2014-09-11

    I have been working with some researchers here who have been using HaMStRad, and I have some suggestions for code changes/additions to the hamstr.pl script so that it can run in parallel on clusters:

    ## Add two new variables:
    my $PBS_ARRAYID;
    my $PBS_ARRAYSIZE;

    ## and extend the usage/help string:
    \n${bold}USAGE:${norm} hamstr.pl -sequence_file=<> -hmmset=<> -taxon=<>  [-pbs_arrayid= -pbs_arraysize=]  -refspec=<> [OPTIONS]

    ## Add to the GetOptions call:
    "pbs_arrayid=i"   => \$PBS_ARRAYID,   # the current index of the array job (1-based)
    "pbs_arraysize=i" => \$PBS_ARRAYSIZE, # how many parts the full job will be split into

    ## And in the user-input checking subroutine add something like the following
    ## (floor() needs "use POSIX qw(floor);" if hamstr.pl does not already import it):
    ## 5) set up cluster job-manager array processing - split the list of HMMs into separate jobs
    if (defined $PBS_ARRAYID) {
        $PBS_ARRAYID = $PBS_ARRAYID - 1;          # convert the 1-based array index to 0-based
        print "set up array job parallelisation:\t";
        if (defined $PBS_ARRAYSIZE) {
            ## size of the chunk of HMMs handled by this array task
            my $hmms_pbsarray_size = floor( @hmms / $PBS_ARRAYSIZE ) + 1;
            my $start_inc = $PBS_ARRAYID * $hmms_pbsarray_size;
            my $stop_inc  = $start_inc + $hmms_pbsarray_size;
            if ($stop_inc > @hmms) {
                $stop_inc = @hmms;                # clip the last chunk to the end of the list
            }
            $stop_inc = $stop_inc - 1;            # convert to an inclusive slice index
            print "start_inc = $start_inc stop_inc = $stop_inc\n";
            @hmms = @hmms[$start_inc..$stop_inc]; # this task only processes its own slice
        }
        else {
            push @log, "Please provide both -pbs_arrayid= (1 based) and -pbs_arraysize=";
            $check = 0;
            print "failed (Please provide both -pbs_arrayid= (1 based) and -pbs_arraysize=)\n";
        }
    }
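
    To see how the index arithmetic above partitions the list of HMMs across array tasks, here is a small standalone sketch; the HMM names and the counts (45 HMMs, 4 tasks) are made up for the example, and it uses POSIX floor() just like the snippet above:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw(floor);

    ## hypothetical numbers: 45 HMMs split across 4 array tasks
    my @hmms = map { "hmm$_" } 1 .. 45;
    my $pbs_arraysize = 4;

    ## same chunk-size formula as in the patch above
    my $chunk = floor( @hmms / $pbs_arraysize ) + 1;

    for my $task ( 1 .. $pbs_arraysize ) {
        my $start = ( $task - 1 ) * $chunk;       # 1-based task id converted to 0-based offset
        my $stop  = $start + $chunk;
        $stop = scalar @hmms if $stop > @hmms;    # clip the last chunk
        $stop -= 1;                               # inclusive slice index
        my @slice = @hmms[ $start .. $stop ];
        printf "task %d handles indices %d..%d (%d HMMs)\n",
            $task, $start, $stop, scalar @slice;
    }

    With these made-up numbers, tasks 1-3 each get 12 HMMs and task 4 gets the remaining 9, so the whole list is covered exactly once with no overlap.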
    

    Cluster batch systems such as PBS/Torque can then drive this with inputs like:
    #PBS -t 1-20
    MAXARRAY=20
    $HAMSTRADPATH/hamstr_array.pl <other inputs> -pbs_arrayid=$PBS_ARRAYID -pbs_arraysize=$MAXARRAY

    This splits the job into 20 parts that run in parallel on the cluster.
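
    For reference, a complete Torque submission script along these lines might look like the sketch below. The job name, walltime, and the hamstr inputs (sequence file, hmmset, taxon, refspec) are placeholder values, not taken from a real run; only the array-job handling mirrors the lines above.

    #!/bin/bash
    #PBS -N hamstr_array
    #PBS -t 1-20
    #PBS -l walltime=12:00:00

    ## must match the -t range above
    MAXARRAY=20

    ## run from the directory the job was submitted from
    cd $PBS_O_WORKDIR

    ## sequence file, hmmset, taxon and refspec are placeholders for the real inputs
    $HAMSTRADPATH/hamstr_array.pl -sequence_file=my_ests.fa -hmmset=my_hmms \
        -taxon=MYTAXON -refspec=DROME \
        -pbs_arrayid=$PBS_ARRAYID -pbs_arraysize=$MAXARRAY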

    Thanks,
    Josh Bowden
    CSIRO Scientific Computing.

     
  • Ingo Ebersberger - 2014-11-04

    Hi Josh,

    Thanks a lot for the suggestion. However, I think you are referring to a different program, HaMStRad, described in the publication by Peters et al. that appeared this year in BMC Evol Biol. Peters et al. basically modified an outdated version of HaMStR, and I was not involved in this modification.

    Please note that the current version of HaMStR is fully capable of starting multiple threads to speed up the ortholog search, and it includes quite a number of new features as well as bug fixes compared to the earlier versions. Maybe it is worth a try.
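
    As a hedged illustration only: assuming the current HaMStR exposes its thread count via a -cpu option (this flag name is a guess; check hamstr.pl -h for the actual option and its default), a multi-threaded run might look like:

    ## -cpu is assumed to set the number of threads; verify the flag with hamstr.pl -h
    perl hamstr.pl -sequence_file=my_ests.fa -hmmset=my_hmms -taxon=MYTAXON \
        -refspec=DROME -cpu=8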

    Best,

    Ingo

     

