| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| ATswedishTEcalls | 2014-04-25 | | |
| TE-locate.tar | 2015-03-31 | 10.0 MB | |
| README | 2013-12-02 | 5.6 kB | |
| splitDemoData.04 | 2013-12-02 | 631.5 MB | |
| splitDemoData.03 | 2013-12-02 | 1.0 GB | |
| splitDemoData.02 | 2013-12-02 | 1.0 GB | |
| splitDemoData.01 | 2013-12-02 | 1.0 GB | |
| splitDemoData.00 | 2013-12-02 | 1.0 GB | |
| Totals: 8 Items | | 4.8 GB | 1 |
TE-locate 1.0
is brought to you by:
Alexander Platzer ( alexander.platzer@gmi.oeaw.ac.at )
Quan Long ( quan.long@mssm.edu )
TE-locate is a tool to locate all copies of sequences in a reference sequence
using read-pairs.
1. Prerequisites
1.1 Extract the package
(TE_locate.tar.bz2; the extraction location is referred to below as the
main folder)
1.2 SAM files
Generate SAM files of the read-pairs of your accessions and move them to
one folder, e.g. SAM/ .
! The SAM files must be sorted lexically !
In Linux you can use the 'sort' command for this, e.g.:
sort --temporary-directory=. <sam file> > <sorted sam file>
For the SAM file format, see: http://samtools.sourceforge.net
Aligners producing this format include, e.g.:
bwa ( http://bio-bwa.sourceforge.net/ )
SMALT ( http://www.sanger.ac.uk/resources/software/smalt/ )
segemehl ( www.bioinf.uni-leipzig.de/Software/segemehl/ )
...
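Before running TE-locate it can be useful to verify that a SAM file really
is sorted. The following is a minimal sketch, not part of the package. It
assumes that 'lexically' means plain byte order, which is what 'sort'
produces when run with LC_ALL=C; with another locale the order of 'sort'
can differ. The function names are hypothetical.

```python
from typing import Iterable


def is_lexically_sorted(lines: Iterable[bytes]) -> bool:
    """Return True if no line is smaller than its predecessor (byte order)."""
    previous = None
    for line in lines:
        if previous is not None and line < previous:
            return False
        previous = line
    return True


def sam_file_is_sorted(path: str) -> bool:
    """Check a SAM file line by line without loading it into memory."""
    # Binary mode gives byte-wise comparison, independent of any locale.
    with open(path, "rb") as handle:
        return is_lexically_sorted(handle)
```

Usage: sam_file_is_sorted("SAM/accession1.sam") returns False as soon as
one line is out of order (the file name is only an example).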
1.3 Reference sequences
The reference in FASTA format; it must be the same reference that was used
to align the reads.
1.4 TE annotation
The TE-annotated reference in GFF3 format.
E.g. as in TAIR:
http://www.arabidopsis.org/portals/genAnnotation/gene_structural_annotation/annotation_data.jsp
-> ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/TAIR10_GFF3_genes_transposons.gff
One example line :
Chr1 TAIR10 transposable_element 11897 11976 . + . ID=AT1TE00010;Name=AT1TE00010;Alias=ATCOPIA24
General description of gff:
http://en.wikipedia.org/wiki/General_feature_format
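To show how the example line is structured, here is a small sketch that
splits a GFF3 line into its nine tab-separated columns and reads the
attribute items (ID, Name, Alias) from the last column. It is an
illustration only; the names GffFeature and parse_gff3_line are
hypothetical and not part of TE-locate.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    strand: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Parse one GFF3 feature line (nine tab-separated columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    attributes = {}
    # Column 9 holds 'key=value' items separated by ';'.
    for pair in fields[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return GffFeature(fields[0], fields[1], fields[2],
                      int(fields[3]), int(fields[4]), fields[6], attributes)


example = ("Chr1\tTAIR10\ttransposable_element\t11897\t11976\t.\t+\t.\t"
           "ID=AT1TE00010;Name=AT1TE00010;Alias=ATCOPIA24")
feature = parse_gff3_line(example)
print(feature.attributes["Alias"])  # prints ATCOPIA24
```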
2. Processing
2.1 Change hierarchy
This step is optional but recommended for TEs.
If another hierarchy level should be used, or if the item names differ from
the ones expected (item name in the example line = 'Alias'), a conversion
step must be done first.
The script TE_hierarchy.pl does this: it replaces the items with other
values (the replacements must be provided).
Its usage:
perl TE_hierarchy.pl <annotation file> <conversion file> <item name>
e.g. :
perl TE_hierarchy.pl TAIR/TAIR10_GFF3_transposable_element.gff TAIR/family2superfamily.dat Alias
The annotation file is the original gff file. The conversion file is a file
with two columns; items in the first column are replaced by the item in the
second column. The item name is the gff item name (in the example gff line
this is 'Alias').
The script generates a file with '_HL' (for Hierarchy Level) added before
the file ending.
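The replacement described above can be pictured with the following sketch.
It is not the code of TE_hierarchy.pl, only an illustration under stated
assumptions: the conversion file is taken to be whitespace-separated,
values without an entry in the conversion file are left unchanged, and the
superfamily name 'LTR/Copia' is an invented example value. All function
names are hypothetical.

```python
import os
from typing import Dict


def load_conversion(text: str) -> Dict[str, str]:
    """Read two-column conversion data: column 1 is replaced by column 2."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()  # assumption: whitespace-separated columns
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def replace_item(gff_line: str, mapping: Dict[str, str], item_name: str) -> str:
    """Replace the value of one attribute item (e.g. 'Alias') in a gff line."""
    fields = gff_line.rstrip("\n").split("\t")
    pairs = []
    for pair in fields[8].split(";"):
        key, sep, value = pair.partition("=")
        if key == item_name and value in mapping:
            value = mapping[value]  # unmapped values stay as they are
        pairs.append(key + sep + value)
    fields[8] = ";".join(pairs)
    return "\t".join(fields)


def hl_filename(path: str) -> str:
    """Insert '_HL' before the file ending, as described for the output."""
    root, ext = os.path.splitext(path)
    return root + "_HL" + ext
```

For the example call above, hl_filename gives
TAIR/TAIR10_GFF3_transposable_element_HL.gff, which is the annotation file
name used in the TE-locate example command line.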
2.2 Run TE-locate
The usage of the program:
perl TE_locate.pl <java maximal memory in GB> <SAM folder>
<TE annotation file> <reference fasta> <prefix of output>
<minimal Distance to count> <minimal supporting reads>
<minimal supporting individuals>
in detail:
<java maximal memory in GB>       Java memory limit; if too little memory
                                  is provided, the program stops.
<SAM folder>                      folder containing the SAM files.
<TE annotation file>              gff file with the annotation; can be the
                                  result of 2.1.
<reference fasta>                 reference sequence.
<prefix of output>                used for naming the output files.
<minimal Distance to count>       resolution for the loci; a supporting
                                  read-pair found within this distance is
                                  counted towards the same event.
                                  Should be set clearly higher than the
                                  insert size, e.g. 3x the insert size.
<minimal supporting reads>        only events supported by at least this
                                  number of reads across all accessions
                                  are kept.
<minimal supporting individuals>  only events supported in at least this
                                  number of individuals are kept.
Paths can be relative or absolute.
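The three numeric parameters can be pictured with the sketch below. It is
one possible reading of the descriptions above, not the actual TE-locate
algorithm (which is described in the article). Assumptions: positions are
chained into one event while the gap to the previous position is at most
the minimal distance, 'reads in all accessions' is taken as the sum over
accessions, and an individual counts as supporting if it has at least one
read. All names are hypothetical.

```python
from typing import Dict, List


def suggested_min_distance(insert_size: int, factor: int = 3) -> int:
    """Rule of thumb from the text: e.g. 3x the insert size."""
    return factor * insert_size


def group_positions(positions: List[int], min_distance: int) -> List[List[int]]:
    """Chain sorted positions into groups while gaps stay <= min_distance."""
    groups: List[List[int]] = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= min_distance:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def keep_event(reads_per_individual: Dict[str, int],
               min_reads: int, min_individuals: int) -> bool:
    """Apply both thresholds to the read counts of a single event."""
    total_reads = sum(reads_per_individual.values())
    supporting = sum(1 for n in reads_per_individual.values() if n > 0)
    return total_reads >= min_reads and supporting >= min_individuals
```

With a minimal distance of 1000, the positions 100, 900 and 5000 fall into
two groups: [100, 900] and [5000].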
Naming of the final output files:
<prefix of output>_<minimal Distance to count>_reads<minimal supporting reads>_acc<minimal supporting individuals>.csv
and
<prefix of output>_<minimal Distance to count>_reads<minimal supporting reads>_acc<minimal supporting individuals>.info
If anything fails, the program output should help to find the cause; it can
be redirected to a file with ' > temp.out 2>&1'.
One example command line:
perl TE_locate.pl 9 SAM/ TAIR/TAIR10_GFF3_transposable_element_HL.gff ref/at.fa TE 1000 5 1 > temp.out 2>&1
output files -> TE_1000_reads5_acc1.csv and TE_1000_reads5_acc1.info
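When TE-locate is run for several parameter combinations, the command line
and the expected output file names can be assembled in a script. The sketch
below only mirrors the usage and naming pattern given above; the function
names are hypothetical and the program itself is not started here.

```python
from typing import List, Tuple


def te_locate_command(memory_gb: int, sam_folder: str, annotation: str,
                      reference: str, prefix: str, min_distance: int,
                      min_reads: int, min_individuals: int) -> List[str]:
    """Build the argument list in the order given in the usage line."""
    return ["perl", "TE_locate.pl", str(memory_gb), sam_folder, annotation,
            reference, prefix, str(min_distance), str(min_reads),
            str(min_individuals)]


def output_names(prefix: str, min_distance: int, min_reads: int,
                 min_individuals: int) -> Tuple[str, str]:
    """Return the names of the final .csv and .info files."""
    stem = "%s_%d_reads%d_acc%d" % (prefix, min_distance, min_reads,
                                    min_individuals)
    return stem + ".csv", stem + ".info"


command = te_locate_command(9, "SAM/",
                            "TAIR/TAIR10_GFF3_transposable_element_HL.gff",
                            "ref/at.fa", "TE", 1000, 5, 1)
print(" ".join(command))
print(output_names("TE", 1000, 5, 1))
```

The list can be passed to subprocess.run from the main folder, with stdout
and stderr sent to a log file, which corresponds to ' > temp.out 2>&1'.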
The format of the output, as well as the method itself, is described in the
article (see 3. Reference).
Additional comment: the csv file contains the number of supporting reads per
locus and individual.
3. Reference:
Platzer, A., Nizhynska, V. & Long, Q. TE-Locate: A tool to locate and group transposable element occurrences
using paired-end next-generation sequencing data. Biology 1, 395–410 (2012).
4. Example data
The file DemoData.zip contains an example data set with annotation. If you
extract it into the same folder as the TE-locate package, the example
command lines in this readme should run as given.
Because of file size restrictions, the file is provided in parts as
splitDemoData.00 - splitDemoData.04; you can join them back together with:
cat splitDemoData.* > DemoData2.zip
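On systems without 'cat', the parts can be joined with a few lines of
Python. This is a minimal sketch with the same effect as the command above;
the function name join_parts is hypothetical. The parts are concatenated in
sorted name order, and the output name must not match the pattern itself.

```python
import glob
import shutil


def join_parts(pattern: str, output_path: str) -> int:
    """Concatenate all files matching pattern, in sorted order."""
    parts = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so the 1 GB parts are never held in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

Usage: join_parts("splitDemoData.*", "DemoData2.zip") returns the number of
parts that were joined (5 for the files listed in this readme).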
5. License
http://creativecommons.org/licenses/by/2.5/
http://creativecommons.org/licenses/by/2.5/legalcode