File Date Author Commit
 testing 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 Constants.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 DL_Model.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 Dockerfile 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 Dockerfile.dev 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 JGI_Pipeline.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 Plotter_Plasmid.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 README.md 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 Util_Plasmid.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 batchTrain.slr 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 feature_DL_plasmid_predict_CORI.sh 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 feature_DL_plasmid_train_CORI.sh 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 format_predict.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 format_train.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 license.txt 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 mySetup_nersc.source 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 predict_Plasmid.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 read_fasta2_plasmids.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 run_chromsketch.sh 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 run_pentamer.sh 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 run_plasORIsketch.sh 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 run_plassketch.sh 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 run_prodigal.sh 2019-08-07 Bill Andreopoulos [0829f8] Initial commit
 train_Plasmid.py 2019-08-07 Bill Andreopoulos [0829f8] Initial commit

Read Me


DelPlasmid Copyright (c) 2019, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Innovation & Partnerships Office at IPO@lbl.gov referring to "DelPlasmid" (LBNL Ref 2019-037).

NOTICE. This software was developed under funding from the U.S. Department of Energy. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, and perform publicly and display publicly. The U.S. Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.


README file
August 1, 2019
Maintainer: Bill Andreopoulos, wandreopoulos@lbl.gov

DelPlasmid is a tool based on machine learning that separates plasmids from chromosomal sequences. The input sequences are in the form of contigs and could have been produced from any sequencing technology or assembly algorithm. The deep learning model was trained on a corpus of:
1) plasmids from ACLAME, and
2) chromosomal sequences from refseq.microbial, from which plasmid and mitochondrial sequences were removed.


To run DelPlasmid training
Training on Cori is done as follows. First source the environment setup script:
. mySetup_nersc.source

Training command for the entire model:
andreopo@cori21:/global/cscratch1/sd/andreopo/plasmidml_tests> SRC=/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml_clean/classifier/dl
andreopo@cori21:/global/cscratch1/sd/andreopo/plasmidml_tests> $SRC/feature_DL_plasmid_train_CORI.sh $SRC/../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.OUT19 $SRC/../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta ./refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.OUT19
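The four positional arguments in the command above appear to be, in order: the plasmid FASTA, an output location for the plasmid data, the chromosomal FASTA, and an output location for the chromosomal data. That reading is inferred from the file names only and is not confirmed by the script. Under that assumption, a minimal Python sketch for assembling the same command could look like this (`build_train_command` is a hypothetical helper, not part of the repository):

```python
import shlex
from pathlib import Path


def build_train_command(src_dir, plasmid_fasta, plasmid_out, chrom_fasta, chrom_out):
    """Assemble the training command as an argument list.

    The argument order mirrors the README command; the meaning of each
    position is an assumption inferred from the file names.
    """
    script = Path(src_dir) / "feature_DL_plasmid_train_CORI.sh"
    return [str(script), str(plasmid_fasta), str(plasmid_out),
            str(chrom_fasta), str(chrom_out)]


if __name__ == "__main__":
    src = "/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml_clean/classifier/dl"
    data = src + "/../DATA/ACLAME.REFSEQMICROB"
    plasmids = "aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta"
    chroms = "refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta"
    cmd = build_train_command(src, f"{data}/{plasmids}", f"./{plasmids}.OUT19",
                              f"{data}/{chroms}", f"./{chroms}.OUT19")
    # Print the command for review; subprocess.run(cmd, check=True) would launch it.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems with the long, dot-heavy file names.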

Older way, run from inside the source directory:
andreopo@cori12:/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml2/classifier/dl> ./feature_DL_plasmid_train_CORI.sh ../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta ../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.yml ../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta ../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.yml


To run a DelPlasmid prediction example
Predict plasmids in the IMG dataset:
andreopo@cori21:/global/cscratch1/sd/andreopo/plasmidml_tests> $SRC/feature_DL_plasmid_predict_CORI.sh $SRC/../DATA/IMG/genome_list.fasta.MIN1kMAX330k.fasta

Older way, run from inside the source directory:
andreopo@nid00401:/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml2/classifier/dl> ./feature_DL_plasmid_predict_CORI.sh ../DATA/IMG/genome_list.fasta.MIN1kMAX330k.fasta ../DATA/IMG/genome_list.fasta.MIN1kMAX330k.fasta.OUT2

./feature_DL_plasmid_predict_CORI.sh ../DATA/ACLAME.REFSEQMICROB.testseg6/assayer4.plasm_main-scaff-split.yml.fasta.fasta ../DATA/ACLAME.REFSEQMICROB.testseg6/assayer4.plasm_main-scaff-split.yml.fasta.fasta.OUT4


Plasmid Finding GoogleDoc:
https://docs.google.com/document/d/1W7B-O5xKuXbWA_CwHjCl9K0-qKLBLpTnvuxkslcOH30/edit#

LucidChart with the software design:
https://www.lucidchart.com/documents/edit/97c70c55-cdf2-483e-a095-dbf8bed3537c/0

Code:
/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml_clean/classifier/dl
/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml2/classifier/dl
Was cloned from:
git clone git@bitbucket.org:berkeleylab/jgi-ml.git jgi-ml2

Excel sheet with data results: https://docs.google.com/spreadsheets/d/1TDPn9uOAnZOBS95dJzUfVtb6mhteTuRFrFcv-_IZ4z4/edit#gid=1266368803


Testing methodology on smaller datasets:
Test the training script using a subset of the data. Training will still take some time to complete (30 epochs):
cd $CSCRATCH
mkdir plasmidml_tests
cd plasmidml_tests/
SRC=/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml_clean/classifier/dl
$SRC/feature_DL_plasmid_train_CORI.sh $SRC/../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.fasta ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUT $SRC/../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.SUB.fasta ./refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.SUB.OUT
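How the `.SUB.fasta` subset files were produced is not documented here. One simple way to create a comparably small input, assuming a plain-text FASTA file and that keeping the first N records is acceptable, is sketched below. Both function names are hypothetical and not part of the repository.

```python
def take_first_records(fasta_lines, n_records):
    """Yield the lines of the first n_records FASTA records.

    A record starts at a line beginning with '>' and runs until the next one.
    """
    seen = 0
    for line in fasta_lines:
        if line.startswith(">"):
            seen += 1
            if seen > n_records:
                break
        if seen > 0:  # skip any text before the first header
            yield line


def subsample_fasta(in_path, out_path, n_records):
    """Write the first n_records records of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(take_first_records(src, n_records))
```

For example, `subsample_fasta("input.fasta", "input.SUB.fasta", 10)` would write a 10-record file; the file names here are placeholders. Taking the first records is not a random sample, so a shuffled selection may be preferable if the input is sorted.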

Test the prediction script - use a small dataset with 10 contigs:
$SRC/feature_DL_plasmid_predict_CORI.sh $SRC/../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.fasta ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUTPRED
The predictions are written to ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUTPRED/outPR.*/predictions.txt
All predictions should be PLASMID since the test dataset comes from the ACLAME dataset.

The path to the model used and the command line are recorded in: outPR.*/model_path.txt
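The check described above (every contig from the ACLAME subset should be called PLASMID) can be automated. The layout of predictions.txt is not specified in this README, so the sketch below assumes one prediction per non-empty line with the label PLASMID appearing as a whitespace-separated field; adjust the parsing if the real file differs. Both helper names are hypothetical.

```python
from pathlib import Path


def count_plasmid_calls(lines):
    """Return (plasmid_calls, total_calls) for an iterable of prediction lines.

    Assumes one prediction per non-empty line, with the label PLASMID
    appearing as a whitespace-separated field. The real layout may differ.
    """
    total = plasmid = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        total += 1
        if "PLASMID" in fields:
            plasmid += 1
    return plasmid, total


def inspect_run(out_dir):
    """Print a summary for every outPR.* directory under out_dir."""
    for run_dir in sorted(Path(out_dir).glob("outPR.*")):
        pred_file = run_dir / "predictions.txt"
        model_file = run_dir / "model_path.txt"
        if pred_file.is_file():
            with pred_file.open() as handle:
                plasmid, total = count_plasmid_calls(handle)
            print(f"{run_dir.name}: {plasmid}/{total} contigs called PLASMID")
        if model_file.is_file():
            # Shows which model and command line produced this run.
            print(model_file.read_text().strip())
```

Calling `inspect_run("./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUTPRED")` after the test prediction should then report the same count for both numbers if all contigs were called PLASMID.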


Docker:
The dl directory contains a Dockerfile and a Dockerfile.dev,
which can be used to build Docker images of the tool.
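As a rough illustration, the standard `docker build` invocation for each file can be assembled as follows. The sketch assumes it is run from the dl directory with that directory as the build context; the image tags are placeholders chosen for this example and do not come from the repository.

```python
import shlex


def docker_build_command(dockerfile, tag, context="."):
    """Return the docker build argument list for one Dockerfile."""
    return ["docker", "build", "-f", dockerfile, "-t", tag, context]


if __name__ == "__main__":
    # The image tags below are placeholders, not names used by the project.
    for dockerfile, tag in [("Dockerfile", "delplasmid:latest"),
                            ("Dockerfile.dev", "delplasmid:dev")]:
        # Print for review; subprocess.run(cmd, check=True) would run the build.
        print(shlex.join(docker_build_command(dockerfile, tag)))
```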
The latest Docker image can be pulled from:
