ETHA Code

Hybrid Illumina/PacBio assembly of Plasmodium falciparum var genes

Brought to you by: edrabek

File	Date	Author	Commit
Makefile	2017-02-17	Elliott Drabek	[f02648] Initial commit
README.txt	2017-02-17	Elliott Drabek	[f02648] Initial commit
cluster.R	2017-02-17	Elliott Drabek	[f02648] Initial commit
combined.exon1.all.tail.fa	2017-02-17	Elliott Drabek	[f02648] Initial commit
count_stop_codons.fancy	2017-02-17	Elliott Drabek	[f02648] Initial commit
cut_sections	2017-02-17	Elliott Drabek	[f02648] Initial commit
delcher.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
delcher.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
delcher.o	2017-02-17	Elliott Drabek	[f02648] Initial commit
drabek.py	2017-02-17	Elliott Drabek	[f02648] Initial commit
etha.exon1	2017-02-17	Elliott Drabek	[f02648] Initial commit
etha.exon2	2017-02-17	Elliott Drabek	[f02648] Initial commit
exceptions.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
exon-ends.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
exon-starts.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
exon1-end-mer.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
exon1-start-mer.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
extract-fasta-bytag-rev.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
extract-fasta-bytag.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
extract-long-seqs	2017-02-17	Elliott Drabek	[f02648] Initial commit
fasta.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
fasta.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
fasta.o	2017-02-17	Elliott Drabek	[f02648] Initial commit
fastalen.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
get-contained-matches.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
get-exon1.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
get-exon2.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
get-uniq-path-seqs.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
group_matches	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer-hash.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer-hash.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer-hash.o	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer-repair	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer-repair.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer-repair.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer-repair.o	2017-02-17	Elliott Drabek	[f02648] Initial commit
kmer_correct	2017-02-17	Elliott Drabek	[f02648] Initial commit
make-unitig-seq.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-trace	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-trace.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-trace.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-trace.o	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-walk	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-walk.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-walk.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
multi-walk.o	2017-02-17	Elliott Drabek	[f02648] Initial commit
muscle_fasta_to_consensus	2017-02-17	Elliott Drabek	[f02648] Initial commit
n50.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
primer-pair-matches	2017-02-17	Elliott Drabek	[f02648] Initial commit
primer-pair-matches.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
primer-pair-matches.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
primer-pair-matches.o	2017-02-17	Elliott Drabek	[f02648] Initial commit
promer_coords_to_nucmer_coords	2017-02-17	Elliott Drabek	[f02648] Initial commit
ref.ex1-splice.71mer	2017-02-17	Elliott Drabek	[f02648] Initial commit
remove_inclusions.composite	2017-02-17	Elliott Drabek	[f02648] Initial commit
rev-comp.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
run_exon_1	2017-02-17	Elliott Drabek	[f02648] Initial commit
show-coords_to_distance_matrix	2017-02-17	Elliott Drabek	[f02648] Initial commit
take_full_exons_only	2017-02-17	Elliott Drabek	[f02648] Initial commit
uni-classify.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
unify_blobs_and_results.composite	2017-02-17	Elliott Drabek	[f02648] Initial commit
union-regions-tail.awk	2017-02-17	Elliott Drabek	[f02648] Initial commit
unitig	2017-02-17	Elliott Drabek	[f02648] Initial commit
unitig.cc	2017-02-17	Elliott Drabek	[f02648] Initial commit
unitig.hh	2017-02-17	Elliott Drabek	[f02648] Initial commit
unitig.o	2017-02-17	Elliott Drabek	[f02648] Initial commit

Read Me

For information about the purposes and methods of ETHA, see Dadra et al. 2016, "Reconstruction of full-length Plasmodium falciparum var exon 1 sequences reveals severe malaria and pregnancy-associated malaria vars in uncomplicated malaria infections in Malian children".

To run ETHA to reconstruct var exon 1 sequences, you will need a set of Illumina reads and a pre-existing whole genome assembly of the same isolate.  You will also need access to three software dependencies:

* Jellyfish (tested with jellyfish-2.0.0beta6.1) http://www.cbcb.umd.edu/software/jellyfish/
* Glimmer (tested with glimmer-3.02) https://ccb.jhu.edu/software/glimmer/
* MUMmer (tested with version 3.06) http://mummer.sourceforge.net/

You will need to make sure that the executables for each of these packages are available in your path. Edit the primary driver script "run_exon_1" so that its PATH variable includes the directories where these executables live on your system.
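
A quick way to confirm that the dependencies are reachable before launching is sketched below. The executable names are assumptions based on the names these packages commonly install, not a list taken from run_exon_1; adjust them to whatever the driver script actually calls, and run the check with the same PATH you assign there.

```python
import shutil

# Assumed executable names; check run_exon_1 for the ones actually invoked.
ASSUMED_TOOLS = ["jellyfish", "glimmer3", "nucmer", "promer", "show-coords"]

def missing_tools(names=ASSUMED_TOOLS):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed executables found.")
```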

Running the pipeline consists of three steps:

1) Running Jellyfish on the Illumina reads to get counts of all observed 71mers. See the Jellyfish documentation for instructions for this step.

2) Setting up the working directory with three input files. These should be copied or symlinked to these exact names:
** asm.seq.fa, the whole genome assembly
** 71.mer_counts, the output of step 1
** exon1.all.tail.fa, which lists the tail ends of known exon 1 sequences. A version is included with this code. Augmenting the included version with sequences likely to be similar to those of the target strain may improve sensitivity.
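
Step 2 can be scripted. The following is a minimal sketch, assuming the three source files already exist somewhere on disk; the function name and its arguments are hypothetical, and only the three target names come from this README.

```python
import os

# Target names are the exact ones this README requires.
REQUIRED_NAMES = ("asm.seq.fa", "71.mer_counts", "exon1.all.tail.fa")

def setup_working_dir(workdir, assembly, mer_counts, exon1_tails):
    """Symlink the three inputs into workdir under the required names."""
    os.makedirs(workdir, exist_ok=True)
    sources = (assembly, mer_counts, exon1_tails)
    links = []
    for source, name in zip(sources, REQUIRED_NAMES):
        if not os.path.isfile(source):
            raise FileNotFoundError(source)
        target = os.path.join(workdir, name)
        # os.symlink raises FileExistsError if the target is already present,
        # so an earlier setup is never silently overwritten.
        os.symlink(os.path.abspath(source), target)
        links.append(target)
    return links
```

For the third argument, the copy shipped with the code is presumably combined.exon1.all.tail.fa from the file listing; that mapping is an assumption, so check it before relying on it.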

3) Running the main driver script:

run_exon_1 $etha $working_directory $lower_kmer_bound $upper_kmer_bound

Here, $etha is the full path of the directory containing the code and this README file, $working_directory is the path of the directory created in step 2, and the kmer bounds are numbers indicating the minimum and maximum numbers of times a 71mer must be seen in the Illumina data to be used. These should be set to reflect the reasonable variation in read depth that characterizes the particular dataset. Note that if you know what value you will be using for the lower bound, you can save storage space by asking Jellyfish to keep only those kmers above that value.
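
To illustrate what the two bounds do, the sketch below keeps only the k-mers whose counts fall between them. It assumes a two-column text layout (k-mer, then count) such as the column output of `jellyfish dump -c`, and it assumes both bounds are inclusive; neither assumption is confirmed by this README. ETHA applies the bounds itself, so this is only a way to explore candidate values, not part of the pipeline.

```python
def filter_kmers(lines, lower, upper):
    """Yield (kmer, count) pairs whose count lies within [lower, upper]."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields[0], int(fields[1])
        if lower <= count <= upper:  # inclusive bounds are an assumption
            yield kmer, count

# Toy input: short k-mers stand in for real 71mers.
example = ["ACGT 3", "CCGA 40", "TTAG 900"]
kept = list(filter_kmers(example, lower=10, upper=200))
print(kept)
```

With these toy values, the k-mer seen 3 times is dropped as a likely sequencing error and the one seen 900 times is dropped as likely repetitive, leaving only the k-mer seen 40 times.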

ETHA will run for some hours, putting all of its intermediate and output files in the same working directory. For most purposes, the files you will be most interested in will be these:

* finish/results.deduplicated.fa, the output of ETHA proper: high-confidence var sequences
* finish/union.fa, the output of ETHA proper, plus var-like sequences identified in the whole genome assembly which are not accounted for in the ETHA output.
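
Once a run finishes, a quick sanity check on either output file is to count the sequences and total bases. The sketch below assumes the files are ordinary FASTA, as the .fa extension suggests; the function names are hypothetical and not part of ETHA.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def summarize(path):
    """Return (number of sequences, total bases) for a FASTA file."""
    with open(path) as handle:
        lengths = [len(seq) for _, seq in read_fasta(handle)]
    return len(lengths), sum(lengths)
```

For example, `summarize("finish/results.deduplicated.fa")` run from the working directory would report how many sequences ETHA reconstructed and their combined length.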

If you run into any difficulty or question that is not addressed here, please email elliott.drabek@gmail.com or jcsilva@som.umaryland.edu
