kitchen_drawer / Wiki / Home

elizabeth henaff - 2012-12-26

gff_seq_extract.py

download

dependencies:
Bio: Biopython

This script allows you to extract the sequences that correspond to a gff annotation. This is similar to the BedTools fastaFromBed but with more options: you can control how your sequences are named, select them according to a regular expression and specify flanking regions.

The options are:

-g gff file
-d seqDatabase in fasta format
-f flankingBp : default 0
-t tagFilter : filter by regex the tags in the 9th column
-s seqSet : sequence titles will be written as >seqSet_<line_number> <gff_line>
-o outFile: file to write extracted sequences to
-l if T will sort by decreasing length. default 0
-i if T will include the first sequence twice, once with and once without flanking sequence. default 0
-n use the value of this tag from the 9th column for the title of the sequence
-a if T append the entire gff line to the title
-x if T cut out annotated sequence range and only output flanking sequences appended together</gff_line></line_number>

Last edit: elizabeth henaff 2012-12-26

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

elizabeth henaff - 2012-12-26

gff_smart_merge.py

download

This script merges the annotations in a gff file and outputs a non-redundant set of annotations. There are various options to define what should be merged, and how the merged annotations should be tagged.

The possible cases are Inclusion (B is included in A) or Overlap (A and C overlap)

To remove redundancy the B annotation will be removed, and the A annotation's end coordinate will be extended to that of C, and the C annotation removed

++++++====================A+++++++++++++
+++++++++=====B+++++++++======C+++++++++
options:

-g input gff file

-i Include: if T will tag the containing annotation A with the included B annotation. format of tag is modified by -l (default)
if F will just remove the included B annotation, leaving A unchanged

-m Merge: if T will tag the extended A annotation with the C annotation. format of tag is modified by -l (default)
if F will just extend the end coordinate of the A annotation to that of C

-l Long: if T will tag merges and includes with MERGED=gff_line_of_merged_annotation and INCLUDED=ff_line_of_included_annotation, respectively.
if F will tag only with MERGED=tag or INCLUDED=tag where tag is set by -t

-t name of tag from the 9th column to use if -l option is not set (default use ID )

-b defines the bookend value to merge non-overlapping annotations. ie, annotations separated by b base pairs will be merged. If b=1, adjacent (separated by 0bp) annotations will be merged
default is 0, meaning annotations must truly overlap to be merged

-o Omit Overlaps: if T, only consider includes, and do not merge overlaps.
if F, consider both includes and overlaps (default)

-s Score: if T, will not merge overlaps or includes but rather just report the annotation that has the best score of the two

Last edit: elizabeth henaff 2012-12-26

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

hierarchichal_cluster_seqs.py

dependencies:
pp: Parallel Python
Bio: BioPython

this program takes a multifasta file and performs hierarchical clustering. Clustering is based on a distance matrix of (100 - percent similarity), as calculated with blast2seq -p blastn -F F
the program generates three files:
inputfile.dist, which is a pickled form of the distance matrix
inputfile.desc, which is a pickled list of sequence descriptions, indexed according to their index in the distance matrix
inputfile.clust, the list of clusters calculated. One cluster per line, with the sequence names separated by spaces and sorted by decreasing length
if it is the first time to run the clustering, one only needs to supply the input fasta file
to run the clustering again but with different parameters, one can load a previsously calculated .dist file and .desc file, decreasing computation time
the distance matrix calculation can be parallelized by specifying the number of processors to use

               options: 
               -s sequence file in fasta format

               -l length cutoff, expressed as a percent 0-100. Only sequences that align along at least this fraction of both of their lengths will be considered for clustering. default is 80%

               -p percent similarity cutoff, expressed as a percent 0-100. this percentage similarity will be used to define cutoffs for clusters. default is 80%

               -m precalculated distance matrix file, <input>.dist

               -d precalculated sequence descirption file, <input>.desc

               -c clustering method: 
                   's': Single pairwise linkage 
                   'm': Complete (maximum) pairwise linkage (default) 
                   'c': Centroid linkage 
                   'a': Average pairwise linkage

                -n number of processors to use

                -h print this help

identify copies of a sequence in a genome

download

There are several cases where you might want to identify copies of a given query sequence in a genome, for example:

searching for the homolog of your favorite gene in a different organism
identifying the copies of a transposon family

The obvious approach is to blast your sequence against your database. A common issue, however, is the fact that your target may be broken up into several consecutive hits, if the copy you have identified has diverged from you query by point mutations or insertions. This is due to the fact that BLAST is parametrized to prefer smaller hits with high similarity rather than longer ones with lower similarity.
However, for annotation purposes, one might want for all of those hits to be assembled into one "copy".

This is the reason why I designed this program: I was annotating the transposable elements in the C. melo genome and wanted a tool to assemble fractionated BLAST hits into one copy. (Garcia Mas et. al. 2012)
The second issue, which is particular to repetitive elements, is that if you have a set of queries that are similar enough to each other, the same region in the genome may be annotated as being a copy of several queries. I therefore implemented a system to resolve this redundancy in a logical manner, by choosing to keep the "best" copy and truncating the overlapping ones.

The included bash script will create the necessary directory structure, call BLAST for each of the queries and find all non-redundant copies.

These copies are composed by assembling BLAST HSPs that are :

                - ordered along the query
                - separated by less than the specified allowed gap 
                - correspond to the maximal copy in the sense of query coverage. 
            It is implemented by finding the set of longest non-overlapping paths in the graph whose nodes are the HSPs, connected by an edge if they satisfy the first two criteria
            The main input is the name of the directory in which can be found the BLAST output files, in xml format (blastall -m 7 option ), one output per query.

            options: 
            -g (int) max gap allowed between two HSPs for them to be assembled together default:1000 
            -o (int) max overlap allowed between two HSPs for them to be assembled. BLAST tends to have overlapping HSPs when there could be a continuous one, for reasons of better scores if they are separate. default: 30
            -d sequence database in fasta format, the same one used for blasting. This is only necessary to specify if you also specify the -f option 
            -i (int) bin size. specify this if you want the copies to be annotated as belonging to x bin as far as query coverage 
            -b blast output directory, where are found the blast outputs in xml format, one per query. will try to parse every file in that directory, non-blast files will give an error
            -a annotation type. Will be the second column in the gff file of copy annotations
            -w prefix for output files
            -c [per_q_len|copy_len] specifies basis on which to pick the \"best\" copy when resolving redundancy. The best copy will be kept in its full length and any overlapping ones will be truncated.
            -r if set to T will only perform the step of resolving redundancy. default F. specify this option if you have already run the calculation once, 
                and the program will load the \"pickled\" files in the directory specified in -b to reconstitute the original list of copies, and proceed to resolving redundancy. This is useful because the first step is the most time consuming. 
            -f set to T if you want to output the sequences of the copies in fasta format as well as the annotations. Must also set the -d option. This option significantly increases the runtime. 
            -n set number of CPUs to use.

kitchen_drawer Wiki

a set of tools for biological sequence analysis

Home

Project Admins:

Discussion

gff_seq_extract.py

gff_smart_merge.py

hierarchichal_cluster_seqs.py

identify copies of a sequence in a genome