delplasmid Code

Brought to you by: billandreo

Tree [0829f8] master /

File	Date	Author	Commit
testing	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
Constants.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
DL_Model.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
Dockerfile	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
Dockerfile.dev	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
JGI_Pipeline.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
Plotter_Plasmid.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
README.md	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
Util_Plasmid.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
batchTrain.slr	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
feature_DL_plasmid_predict_CORI.sh	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
feature_DL_plasmid_train_CORI.sh	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
format_predict.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
format_train.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
license.txt	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
mySetup_nersc.source	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
predict_Plasmid.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
read_fasta2_plasmids.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
run_chromsketch.sh	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
run_pentamer.sh	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
run_plasORIsketch.sh	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
run_plassketch.sh	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
run_prodigal.sh	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit
train_Plasmid.py	2019-08-07	Bill Andreopoulos	[0829f8] Initial commit

Read Me

DelPlasmid Copyright (c) 2019, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Innovation & Partnerships Office at IPO@lbl.gov, referring to "DelPlasmid" (LBNL Ref 2019-037).

NOTICE. This software was developed under funding from the U.S. Department of Energy. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, and perform publicly and display publicly. The U.S. Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

README file
August 1, 2019
Maintainer: Bill Andreopoulos, wandreopoulos@lbl.gov

DelPlasmid is a machine-learning tool that separates plasmid sequences from chromosomal sequences. The input sequences are contigs, which may come from any sequencing technology or assembly algorithm. The deep learning model was trained on a corpus of:
1) plasmids from ACLAME, and
2) chromosomal sequences from refseq.microbial, from which plasmid and mitochondrial sequences were removed.
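
The training and test file names used throughout this README carry the suffix MIN1kMAX330k, which suggests that contigs were filtered to a length range before training. The following is a minimal sketch of such a filter, not the tool's actual preprocessing: the bounds of 1,000 and 330,000 bases, the inclusive comparison, and the function names `read_fasta` and `filter_by_length` are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]],
    min_len: int = 1_000,    # assumed meaning of "MIN1k"
    max_len: int = 330_000,  # assumed meaning of "MAX330k"
) -> Iterator[Tuple[str, str]]:
    """Keep records whose sequence length lies in [min_len, max_len]."""
    for header, seq in records:
        if min_len <= len(seq) <= max_len:
            yield header, seq


if __name__ == "__main__":
    demo = [">short", "ACGT", ">ok", "A" * 600, "C" * 600]
    kept = list(filter_by_length(read_fasta(demo)))
    print([header for header, _ in kept])
```

With the demo input, only the 1,200-base record named `ok` passes the filter. To filter a real file, pass an open file handle to `read_fasta` in place of the list.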

To run delplasmid training
To train on Cori, first source the environment setup file:
. mySetup_nersc.source

Training command for the entire model:
andreopo@cori21:/global/cscratch1/sd/andreopo/plasmidml_tests> SRC=/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml_clean/classifier/dl
andreopo@cori21:/global/cscratch1/sd/andreopo/plasmidml_tests> $SRC/feature_DL_plasmid_train_CORI.sh $SRC/../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.OUT19 $SRC/../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta ./refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.OUT19
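
Judging from the command above, the training script takes four positional arguments: the plasmid FASTA, an output path for the plasmid features, the chromosomal FASTA, and an output path for the chromosomal features. These roles are inferred from the example and are not documented elsewhere in this README. As a sketch under that assumption, a hypothetical helper `build_train_command` could assemble the call from Python:

```python
import shlex
from typing import List


def build_train_command(
    src_dir: str,
    plasmid_fasta: str,
    plasmid_out: str,
    chrom_fasta: str,
    chrom_out: str,
) -> List[str]:
    """Assemble the argument list for feature_DL_plasmid_train_CORI.sh.

    The argument order mirrors the example command in this README;
    the meaning given to each position is an assumption.
    """
    script = f"{src_dir.rstrip('/')}/feature_DL_plasmid_train_CORI.sh"
    return [script, plasmid_fasta, plasmid_out, chrom_fasta, chrom_out]


if __name__ == "__main__":
    cmd = build_train_command(
        "/path/to/classifier/dl",  # placeholder for $SRC
        "plasmids.fasta",
        "./plasmids.OUT",
        "chromosomes.fasta",
        "./chromosomes.OUT",
    )
    # Print the command for inspection instead of running it.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the environment from mySetup_nersc.source is active. `shlex.join` requires Python 3.8 or later.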

Older invocation, run from within the source directory:
andreopo@cori12:/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml2/classifier/dl> ./feature_DL_plasmid_train_CORI.sh ../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta ../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.yml ../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta ../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.yml

To run delplasmid prediction (example)
Predict plasmids in the IMG dataset:
andreopo@cori21:/global/cscratch1/sd/andreopo/plasmidml_tests> $SRC/feature_DL_plasmid_predict_CORI.sh $SRC/../DATA/IMG/genome_list.fasta.MIN1kMAX330k.fasta

Older invocation, run from within the source directory:
andreopo@nid00401:/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml2/classifier/dl> ./feature_DL_plasmid_predict_CORI.sh ../DATA/IMG/genome_list.fasta.MIN1kMAX330k.fasta ../DATA/IMG/genome_list.fasta.MIN1kMAX330k.fasta.OUT2

./feature_DL_plasmid_predict_CORI.sh ../DATA/ACLAME.REFSEQMICROB.testseg6/assayer4.plasm_main-scaff-split.yml.fasta.fasta ../DATA/ACLAME.REFSEQMICROB.testseg6/assayer4.plasm_main-scaff-split.yml.fasta.fasta.OUT4

Plasmid Finding GoogleDoc:
https://docs.google.com/document/d/1W7B-O5xKuXbWA_CwHjCl9K0-qKLBLpTnvuxkslcOH30/edit#

LucidChart with the software design:
https://www.lucidchart.com/documents/edit/97c70c55-cdf2-483e-a095-dbf8bed3537c/0

Code:
/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml_clean/classifier/dl
/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml2/classifier/dl
Was cloned from:
git clone git@bitbucket.org:berkeleylab/jgi-ml.git jgi-ml2

Excel sheet with data results: https://docs.google.com/spreadsheets/d/1TDPn9uOAnZOBS95dJzUfVtb6mhteTuRFrFcv-_IZ4z4/edit#gid=1266368803

Testing methodology on smaller datasets:
Test the training script. Use a subset of the data for training; even so, training runs for 30 epochs and takes some time to complete:
cd $CSCRATCH
mkdir plasmidml_tests
cd plasmidml_tests/
SRC=/global/projectb/sandbox/rqc/andreopo/src/bitbucket/jgi-ml_clean/classifier/dl
$SRC/feature_DL_plasmid_train_CORI.sh $SRC/../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.fasta ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUT $SRC/../DATA/ACLAME.REFSEQMICROB/refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.SUB.fasta ./refseq.bacteria.nonplasmid.nonmito.fasta.subsam40kreads.fasta.MIN1kMAX330k.fasta.SUB.OUT

Test the prediction script on a small dataset of 10 contigs:
$SRC/feature_DL_plasmid_predict_CORI.sh $SRC/../DATA/ACLAME.REFSEQMICROB/aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.fasta ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUTPRED
The predictions are written to the file ./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUTPRED/outPR.*/predictions.txt
All predictions should be PLASMID, since the test dataset is drawn from the ACLAME plasmid dataset.
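
The expectation that every prediction is PLASMID can be checked with a short script. The layout of predictions.txt is not described in this README, so the sketch below assumes that each non-empty line ends with the predicted label as its last whitespace-separated field. The function names `count_labels` and `all_plasmid` are hypothetical; adjust the parsing to the real file layout.

```python
import glob
from collections import Counter
from typing import Iterable


def count_labels(lines: Iterable[str]) -> Counter:
    """Count predicted labels.

    Assumption: the label is the last whitespace-separated field
    of each non-empty line.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[-1].upper()] += 1
    return counts


def all_plasmid(lines: Iterable[str], label: str = "PLASMID") -> bool:
    """Return True if there is at least one prediction and all equal `label`."""
    counts = count_labels(lines)
    return bool(counts) and set(counts) == {label}


if __name__ == "__main__":
    pattern = (
        "./aclame_plasmid_sequences.fasta.MIN1kMAX330k.fasta.SUB.OUTPRED"
        "/outPR.*/predictions.txt"
    )
    for path in glob.glob(pattern):
        with open(path) as handle:
            counts = count_labels(handle)
        # Report the label tally for each prediction run found.
        print(path, dict(counts))
```

Run from the test directory, the script prints a label tally for each outPR.* run; any label other than PLASMID in the tally points to a misclassified contig.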

The path to the model used and the command line are recorded in: outPR.*/model_path.txt

Docker:
A Dockerfile and a Dockerfile.dev are provided under the dl directory.
Either can be used to create Docker images with the tool.
The latest Docker image can be pulled from:
