Name               Modified     Size
ATswedishTEcalls   2014-04-25
TE-locate.tar      2015-03-31   10.0 MB
README             2013-12-02   5.6 kB
splitDemoData.04   2013-12-02   631.5 MB
splitDemoData.03   2013-12-02   1.0 GB
splitDemoData.02   2013-12-02   1.0 GB
splitDemoData.01   2013-12-02   1.0 GB
splitDemoData.00   2013-12-02   1.0 GB
Totals: 8 Items                 4.8 GB

TE-locate 1.0

is brought to you by:
        Alexander Platzer ( alexander.platzer@gmi.oeaw.ac.at )
        Quan Long         ( quan.long@mssm.edu )

TE-locate is a tool to locate all copies of sequences in a reference sequence
using read-pairs.



1. Prerequisites

1.1 Extract the package
    (TE_locate.tar.bz2; the extraction location is referred to below as the
    main folder)


1.2 SAM files
    
    Generate SAM files of the read-pairs of your accessions and move them to 
    one folder, e.g. SAM/ .
    ! The SAM files must be sorted lexically !
    In Linux you can use the 'sort' command for this, e.g.:    
    
    sort --temporary-directory=. <sam file> > <sorted sam file>
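
    If there are many accessions, the sort can be applied to a whole folder.
    The following is only a sketch: the function name sort_sam_folder, the
    '.sam' file ending and the use of LC_ALL=C (plain byte-order comparison,
    as one reading of "lexically") are assumptions, not part of TE-locate.

```shell
# Sort every *.sam file of one folder into another folder.
sort_sam_folder() {
    in_dir=$1     # folder with the unsorted SAM files
    out_dir=$2    # folder that receives the sorted SAM files
    mkdir -p "$out_dir"
    for sam in "$in_dir"/*.sam; do
        [ -e "$sam" ] || continue   # nothing matched the pattern
        # LC_ALL=C is an assumption: it makes 'sort' compare plain bytes.
        # Temporary files are written to the current directory.
        LC_ALL=C sort --temporary-directory=. "$sam" \
            > "$out_dir/$(basename "$sam")"
    done
}
```

    Possible call: sort_sam_folder SAM_unsorted SAM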

    For the SAM file format see: http://samtools.sourceforge.net
    Aligners that produce this format include, e.g.:
           bwa ( http://bio-bwa.sourceforge.net/ )
           SMALT ( http://www.sanger.ac.uk/resources/software/smalt/ )
           segemehl ( www.bioinf.uni-leipzig.de/Software/segemehl/ )
           ...

1.3 Reference sequences
    
    The reference in FASTA format; it must be the same reference that was
    used to align the reads.
    

1.4 TE annotation

    The TE-annotated reference in gff3 format.
    E.g. as in TAIR:
    http://www.arabidopsis.org/portals/genAnnotation/gene_structural_annotation/annotation_data.jsp
    -> ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/TAIR10_GFF3_genes_transposons.gff

    One example line :    
    Chr1	TAIR10	transposable_element	11897	11976	.	+	.	ID=AT1TE00010;Name=AT1TE00010;Alias=ATCOPIA24
    
    General description of gff: 
    http://en.wikipedia.org/wiki/General_feature_format
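
    To check which item names (e.g. 'Alias') an annotation file carries, the
    ninth gff column can be inspected. A minimal sketch with awk follows; the
    helper name gff_attribute is hypothetical and not part of TE-locate.

```shell
# Print the value of one attribute (e.g. Alias) from column 9 of gff lines.
gff_attribute() {
    # $1 = attribute name; gff lines are read from standard input
    awk -F '\t' -v key="$1" '
        $0 !~ /^#/ && NF >= 9 {
            n = split($9, pairs, ";")
            for (i = 1; i <= n; i++) {
                eq = index(pairs[i], "=")
                if (eq > 0 && substr(pairs[i], 1, eq - 1) == key)
                    print substr(pairs[i], eq + 1)
            }
        }'
}

# The example line from above, written with explicit tab characters:
printf 'Chr1\tTAIR10\ttransposable_element\t11897\t11976\t.\t+\t.\tID=AT1TE00010;Name=AT1TE00010;Alias=ATCOPIA24\n' |
    gff_attribute Alias     # prints ATCOPIA24
```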
   

2. Processing	
   
2.1 Change hierarchy

   This step is optional, but recommended for TEs.
   If another hierarchy level should be used, or the item names are not the 
   same (item name in the example line = 'Alias'), a conversion step must be
   done first.
   The script TE_hierarchy.pl does this: it replaces the items with
   replacements that you provide.
   Its usage:
   
   perl TE_hierarchy.pl <annotation file> <conversion file> <item name>
   
   e.g. :
   perl TE_hierarchy.pl TAIR/TAIR10_GFF3_transposable_element.gff TAIR/family2superfamily.dat Alias
   
   The annotation file is the original gff file. The conversion file is a file
   with two columns; each item in the first column is replaced by the item in
   the second column. The item name is the gff item name (in the example gff
   line this is 'Alias').
   The script generates a file with '_HL' (for Hierarchy Level) added before
   the file ending.
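
    As an illustration, a conversion file and the expected output name could
    look as sketched below. The tab as column separator, the file name
    family2superfamily_example.dat and the mapping ATCOPIA24 -> LTR/Copia are
    assumptions made for this example; check them against your own data.

```shell
# Illustrative conversion file: first column = item found in the gff,
# second column = its replacement (tab-separated is an assumption).
printf 'ATCOPIA24\tLTR/Copia\n' > family2superfamily_example.dat

# Expected name of the generated file: '_HL' inserted before the ending.
annotation="TAIR/TAIR10_GFF3_transposable_element.gff"
hl_file="${annotation%.*}_HL.${annotation##*.}"
echo "$hl_file"     # prints TAIR/TAIR10_GFF3_transposable_element_HL.gff
```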

   
2.2 Run TE-locate

   The usage of the program:
   perl TE_locate.pl <java maximal memory in GB> <SAM folder> 
            <TE annotation file> <reference fasta> <prefix of output>
            <minimal Distance to count> <minimal supporting reads>
            <minimal supporting individuals>

   in detail:
   
   <java maximal memory in GB>    Java memory limit; 
                                  if too little memory is provided, the
                                  program stops.
   <SAM folder>                   folder of the sam files.
   <TE annotation file>           gff file with the annotation, 
                                  can be the result of 2.1
   <reference fasta>              reference sequence 
   <prefix of output>             for the file naming of the output
   <minimal Distance to count>    resolution for the loci: if a supporting 
                                  read-pair is found within this distance,
                                  it is counted for the same event. 
                                  Should be set clearly higher than the
                                  insert size, e.g. 3x insert size.
   <minimal supporting reads>     only events supported by this number of
                                  reads in all accessions are kept.
   <minimal supporting individuals>   only events supported in this number of
                                      individuals are kept.
   
   Paths can be relative or absolute.
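
    The rule of thumb for <minimal Distance to count> can be written down as
    a small calculation. This is only a sketch: the insert size of 300 bp is
    a hypothetical value, and the remaining arguments are taken from the
    example command line of this readme.

```shell
insert_size=300                      # hypothetical insert size in bp
min_distance=$((3 * insert_size))    # rule of thumb: 3x insert size

# Assemble (but do not run) the TE-locate call with this distance.
cmd="perl TE_locate.pl 9 SAM/ TAIR/TAIR10_GFF3_transposable_element_HL.gff ref/at.fa TE $min_distance 5 1"
echo "$cmd"
```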
   
   Naming of the final output files: 
   <prefix of output>_<minimal Distance to count>_reads<minimal supporting reads>_acc<minimal supporting individuals>.csv
   and
   <prefix of output>_<minimal Distance to count>_reads<minimal supporting reads>_acc<minimal supporting individuals>.info	
   
   If anything fails, the program output should help to find the cause; it
   can be redirected to a file with ' > temp.out 2>&1'.
   
   One example command line:
   perl TE_locate.pl 9 SAM/ TAIR/TAIR10_GFF3_transposable_element_HL.gff ref/at.fa TE 1000 5 1 > temp.out 2>&1
   
   output files -> TE_1000_reads5_acc1.csv and TE_1000_reads5_acc1.info
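
    The naming scheme can also be expressed as a small helper, e.g. to check
    in a script whether both files were written. The function name
    te_locate_outputs is hypothetical and not part of TE-locate.

```shell
# Print the two output file names for a given set of parameters.
te_locate_outputs() {
    # $1 = prefix of output, $2 = minimal distance to count,
    # $3 = minimal supporting reads, $4 = minimal supporting individuals
    base="${1}_${2}_reads${3}_acc${4}"
    printf '%s\n' "$base.csv" "$base.info"
}

te_locate_outputs TE 1000 5 1
# prints:
# TE_1000_reads5_acc1.csv
# TE_1000_reads5_acc1.info
```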
   
   The output format, as well as the method itself, is described in the 
   article (see section 3).
   Additional comment: the csv file contains the number of supporting reads
   per locus and individual.

   
3.  Reference:
   
    Platzer, A., Nizhynska, V. & Long, Q. TE-Locate: A tool to locate and group transposable element occurrences
    using paired-end next-generation sequencing data. Biology 1, 395–410 (2012).


4. Example data

   The file DemoData.zip contains an example set of data and annotation. If
   you extract it into the same folder as the TE-locate package, the example
   command lines in this readme should run.
   Because of file size restrictions, the file is provided as 
   splitDemoData.00 - splitDemoData.04; you can join them back together with:
   cat splitDemoData.* > DemoData.zip
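
    The join relies on the shell expanding 'splitDemoData.*' in sorted order
    (.00 first, .04 last). A minimal sketch as a reusable function follows;
    the name join_parts is hypothetical and the name of the joined archive is
    up to you.

```shell
# Concatenate the split parts of one folder into a single archive.
join_parts() {
    # $1 = folder containing the splitDemoData.NN parts
    # $2 = archive file to write (should not itself match splitDemoData.*)
    cat "$1"/splitDemoData.* > "$2"
}
```

    Possible call: join_parts . DemoData.zip
    Afterwards 'unzip -t DemoData.zip' can be used to test whether the joined
    archive is intact.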


5. License

   http://creativecommons.org/licenses/by/2.5/
   http://creativecommons.org/licenses/by/2.5/legalcode
   
Source: README, updated 2013-12-02