CMD when running PBJelly on a cluster

Started by Hilary, 2014-03-17; last post 2014-08-12
  • Hilary

    Hilary - 2014-03-17

    Hello,
    Apologies if this is a stupid question, but I can't work out from the documentation how to set PBJelly to run on a cluster correctly.

    The queuing system on our cluster is SGE and we're not meant to run jobs on the head nodes except scripts which submit jobs to the queue. The submission script looks something like this:


        #!/bin/bash
        #$ -P project.prjb
        #$ -q queue.qb
        #$ -pe shmem 3
        #$ -N "pbjelly_on_original_ref.v1"
        #$ -o pbjelly_on_original_ref.v1.out
        #$ -e pbjelly_on_original_ref.v1.err
        #$ -cwd
        #$ -V

        Jelly.py setup Protocol.original_ref.v1.xml -x "-i"
        Jelly.py mapping Protocol.original_ref.v1.xml -x "-sam -clipping hard"
        Jelly.py support Protocol.original_ref.v1.xml
        Jelly.py extraction Protocol.original_ref.v1.xml
        Jelly.py assembly Protocol.original_ref.v1.xml
        Jelly.py output Protocol.original_ref.v1.xml


    I have a few questions about how to do this properly:

    1. What is the CMD string meant to be in the .xml file, and does this differ depending on which stage you're running?
    2. Does the <nJobs> specified in the .xml file have to match (or be less than?) the number of cores I specify with the "-pe" line?
    3. If I submit this shell script on the head node of the cluster, it will then be sent to another node and executed. However, I can't submit other jobs to SGE from these nodes, so is this parallelisation going to work properly?

    Many thanks,
    Hilary

     
  • Adam English

    Adam English - 2014-03-21

    Below is the conversation Hilary and I had. I'm posting it here in case it can help someone else.

    ADAM ENGLISH:
    Thanks for your question and interest in PBJelly. Hopefully we can make it work for you.

    The way to think about the Protocol.xml when using Jelly.py <stage> Protocol.xml is that Jelly.py builds scripts and submits them to your cluster. So you would run Jelly.py setup Protocol.xml, for example, from your head node. From there, Jelly.py will make <nJobs> scripts to submit to the cluster. Each of these jobs would use --nproc processors (if that option is specified and applies to that stage). ${CMD} will be the path to one of the scripts to be submitted.

    Say you have a script named myScript.sh which you need to submit to your cluster. Since I'm unfamiliar with how submission works on your system, we can imagine the submission command's structure as
    clusterSubmitCmd --stdout stdout.file.txt --stderr stderr.file.txt --JobName "myjob" --cmd myScript.sh

    To turn this into a template so that Jelly.py can fill in the unique identifiers on a per-stage, per-job basis, we replace some of these elements with variables:

    clusterSubmitCmd --stdout ${STDOUT} --stderr ${STDERR} --JobName "${JOBNAME}" --cmd ${CMD}

    So, to get PBJelly working with your cluster, all you need to do is figure out the submission command and its structure to plug into the Protocol.xml element.
    This isn't optimal for 'pipelining' PBJelly's stages (i.e. automatically running mapping once setup finishes) and requires you to run each stage manually. However, it does make it easier to submit each stage with a large number of jobs per step regardless of the cluster environment.
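
    Plugged into the Protocol.xml, that imaginary command would sit in the <cluster> element something like this (just a sketch; clusterSubmitCmd stands in for whatever your scheduler's real submission command is, and the nJobs value is arbitrary):

        <cluster>
            <command>clusterSubmitCmd --stdout ${STDOUT} --stderr ${STDERR} --JobName "${JOBNAME}" --cmd ${CMD}</command>
            <nJobs>4</nJobs>
        </cluster>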

    HILARY MARTIN:
    I'm afraid I still don't get what should be in the script I substitute for ${CMD}. The relevant part of my .xml file so far (intended to run with SGE on our cluster) looks like this:

    <cluster>
        <command>echo '${CMD}' | qsub -N my.pbjelly.job -o my.pbjelly.job.out -e my.pbjelly.job.err -P project.prjb -q queue.qb -pe shmem 3 -cwd -V</command>
        <nJobs>3</nJobs>
    </cluster>
    

    I'm confused because, in the ReadMe file, you've said "CMD - The command one uses to execute on the cluster." I thought "msub" (for MOAB, or "qsub" for SGE in my case) was the command used to execute on the cluster, yet, according to the TemplateProtocol.xml, you then pipe ${CMD} to msub:

        <command notes="For PBS/Moab">echo '${CMD}' | msub -N "${JOBNAME}" -o ${STDOUT} -e ${STDERR} -l nodes=1:ppn=8,mem=48000mb</command>
    

    So what script do I substitute for ${CMD}?

    ADAM ENGLISH:
    I should start by explaining what the PBS/Moab command does. You'll then need to translate this to SGE according to how your qsub works.

    When nJobs is specified (3, in this example), Jelly.py will split all of the tasks into 3 total jobs. Say we have 6 input .fasta files that we need to map; Jelly.py mapping Protocol.xml will do two things:
    1) Create three mapping scripts

    script1.sh:
        blasr input1.fasta reference
        blasr input2.fasta reference
    
    script2.sh:
        blasr input3.fasta reference
        blasr input4.fasta reference
    
    script3.sh:
        blasr input5.fasta reference
        blasr input6.fasta reference
    

    2) Jelly.py will execute the following 3 commands based on what you put in the XML template, each submitting one of the jobs to the cluster - so for Moab/PBS:

          echo "./script1.sh" | msub -N 'map1' ....
          echo "./script2.sh" | msub -N 'map2' ....
          echo "./script3.sh" | msub -N 'map3' ....
    

    You can see that this is the basic structure of what's in the msub template:

    echo '${CMD}' | msub -N "${JOBNAME}" -o ${STDOUT} -e ${STDERR} -l nodes=1:ppn=8,mem=48000mb
    

    From there, we'll have three jobs running, each with its own node, and each of those jobs has two blasr commands to run.
    So, you don't need to substitute anything for ${CMD}; Jelly.py will substitute the script to be executed, which your cluster should be able to handle.
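
    Putting that together for SGE, your template might become something like this (a sketch; the -P/-q/-pe flags are borrowed from your submission script and may need adjusting for your site):

        <command notes="For SGE">echo '${CMD}' | qsub -N "${JOBNAME}" -o ${STDOUT} -e ${STDERR} -P project.prjb -q queue.qb -pe shmem 3 -cwd -V</command>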

    HILARY MARTIN:
    Can I just clarify: do I need to substitute anything for ${JOBNAME}, ${STDOUT}, and ${STDERR}, or does Jelly.py make those up?

    ADAM ENGLISH:
    ${CMD} is the same as ${JOBNAME} and the like. They're template variables that Jelly.py will populate to give each job a unique name and logs.
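
    For example, for the first of the three mapping jobs above, that SGE template might get filled in as something like (names illustrative; Jelly.py picks the actual job names and log paths):

        echo './script1.sh' | qsub -N "mapping_1" -o mapping_1.out -e mapping_1.err -P project.prjb -q queue.qb -pe shmem 3 -cwd -V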

     
  • Justin Peyton

    Justin Peyton - 2014-07-23

    Thank you. This was very helpful, but I have a follow-up question. What is the best way to get a "setup" line into each one of those scripts? Following Adam's example above, how do we change

    script1.sh:
        blasr input1.fasta reference
        blasr input2.fasta reference
    
    script2.sh:
        blasr input3.fasta reference
        blasr input4.fasta reference
    
    script3.sh:
        blasr input5.fasta reference
        blasr input6.fasta reference
    

    to something like this

    script1.sh:
        source $HOME/local/PBSuite_14.6.24/setup.sh
        blasr input1.fasta reference
        blasr input2.fasta reference
    
    script2.sh:
        source $HOME/local/PBSuite_14.6.24/setup.sh
        blasr input3.fasta reference
        blasr input4.fasta reference
    
    script3.sh:
        source $HOME/local/PBSuite_14.6.24/setup.sh
        blasr input5.fasta reference
        blasr input6.fasta reference
    
     
  • Adam English

    Adam English - 2014-08-12

    There is no best way to do what you've described.
    However, you should consider adding

    source $HOME/local/PBSuite_14.6.24/setup.sh
    

    to the ~/.bash_profile of whatever user your SGE uses when you submit jobs (generally this would be your personal .bash_profile; however, I can't be certain for your configuration).

    Then, the correct environment variables will automatically be included.
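
    Alternatively, an untested sketch is to fold the source line into the submission template itself, so each job sets up its environment before running its script:

        <command notes="For SGE, untested sketch">echo 'source $HOME/local/PBSuite_14.6.24/setup.sh && ${CMD}' | qsub -N "${JOBNAME}" -o ${STDOUT} -e ${STDERR} -cwd -V</command>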

     
