Process_hits
Version 12.07.10
Written By: Jeremy D. Smith (jds775@msstate.edu)
==========================================================================================
Concepts
==========================================================================================
Takes in RepeatMasker, e-PCR, or NCBI-BLAST output data and uses it to retrieve sequences or subsequences based on given criteria.
==========================================================================================
Background
==========================================================================================
With advances in sequencing technology, ever larger amounts of genomic data are becoming available every day. A large portion of these genomic sequences consists of transposable elements, frequently 50% or more in vertebrates. Transposable elements are known to act as drivers of genomic evolution and diversification and are important genetic markers. Each transposable element family may have thousands of copies within a given genome, so processing the data in a meaningful fashion can take an exorbitant amount of time and effort.
To combat this problem, we developed a set of bioinformatics techniques and programs to streamline the analysis. This includes a Perl script which automates the process of taking BLAST, RepeatMasker, and similar data and extracting the hit sequences from the genome. This script, called Process_hits, uses an object-oriented design to compile all hit locations from a given file, organize them into usable categories, and output them in multiple formats. It handles large amounts of transposon data efficiently. Each of the major sub-functions (hit processing, nucleotide sequence extraction, and hit-object methods) is contained within its own sub-module to allow for greater expandability and to serve as a foundation for future program design.
==========================================================================================
Availability and Requirements
==========================================================================================
Project name: Process_hits
Project home page: https://sourceforge.net/projects/processhits/
Operating systems: Platform independent
Programming language: Perl 5.10.0+
Other requirements: Bioperl 1.61+
License: Academic Free License (AFL)
==========================================================================================
Optional Script Usage
==========================================================================================
file_split.repeatmasker.pl -n -o target_file.txt
Use: target_file.txt is the name or location of the file to be split.
Options: -n prepends a string to all filenames.
Options: -o sends created files into a given directory; defaults to the same directory as the original file.
file_split.blast.pl -n -o target_file.txt
Use: Divides an NCBI-BLAST output file into a series of subfiles based on given criteria.
Use: target_file.txt is the name or location of the file to be split.
Options: -n prepends a string to all filenames.
Options: -o sends created files into a given directory; defaults to the same directory as the original file.
file_split.fasta.pl fastafile
Use: Splits a given fasta file into smaller files
Options: -n splits into files of n numbered sequences
Options: -f splits into n numbered files
Options: -l splits into files of a total of l length
file_analyze.fasta.pl fastafile.fas >outputfile
Use: Gives you basic information about a given fasta file.
file_clean.fasta.pl fastafile outfile
Use: Converts a fasta file to all caps, removes line breaks, and transforms any non-GTCN characters into dashes.
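The cleaning step described above can be sketched in a few lines of Perl. This is only an illustration, not the code in file_clean.fasta.pl: the sub names clean_sequence and clean_fasta are hypothetical, and the sketch assumes that A is kept along with G, T, C, and N.

```perl
use strict;
use warnings;

# Clean one sequence body: join wrapped lines, upper-case, and mask
# unexpected characters. Assumption: A, C, G, T and N are the symbols kept.
sub clean_sequence {
    my ($raw) = @_;
    (my $seq = $raw) =~ s/\s+//g;    # drop line breaks and other whitespace
    $seq = uc $seq;
    $seq =~ tr/ACGTN/-/c;            # every other character becomes a dash
    return $seq;
}

# Clean a whole fasta text, leaving header lines untouched.
sub clean_fasta {
    my ($text) = @_;
    my (@out, $header, $body);
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>/) {
            push @out, $header, clean_sequence($body) if defined $header;
            ($header, $body) = ($line, '');
        }
        elsif (defined $header) {
            $body .= $line;
        }
    }
    push @out, $header, clean_sequence($body) if defined $header;
    return join("\n", @out) . "\n";
}

print clean_fasta(">seq1 demo\nacgtn\nRYxx\n>seq2\nGG TT\n");
```

Each cleaned record ends up as one header line followed by one sequence line.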
file_join.fasta.pl list_of_files outfile
Use: Takes in a line-delimited list of fasta files and joins them into a single file.
file_remove.dups.pl fastafile outfile
Use: Compares the ids of all fasta sequences in the file and removes duplicated entries. Most useful following a file_join.fasta.pl run.
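Removing duplicates by id might look roughly like the sketch below. It is not taken from file_remove.dups.pl. The sub name remove_duplicate_ids is hypothetical, and two assumptions are made: the id is the header text up to the first whitespace, and the first copy of each id is the one kept.

```perl
use strict;
use warnings;

# Return fasta records with unique ids, keeping the first occurrence.
# Assumption: the id is the header text up to the first whitespace.
sub remove_duplicate_ids {
    my ($text) = @_;
    my (%seen, @kept);
    # Split in front of each header line so every chunk is one record.
    for my $record (split /^(?=>)/m, $text) {
        next unless $record =~ /^>(\S+)/;
        push @kept, $record unless $seen{$1}++;
    }
    return join '', @kept;
}

print remove_duplicate_ids(">a first\nAC\n>b\nGG\n>a second\nTT\n");
```

Only ids are compared here, so two records with the same id but different sequences still count as duplicates.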
file_analyze.blast.pl target_file.txt
Use: target_file.txt is the name or location of the file to be analyzed.
Use: Gives a list of the matching queries and the number of matches per query.
dir_comb.fasta.pl directory outfile
Use: directory is the directory of fasta files to be combined.
Use: outfile is the fasta file to write sequences to.
dir_align.fasta.pl directory
Use: directory is the directory of fasta files to be aligned.
Use: Uses muscle to align all fasta format files in the directory. Outputs files as *name.align.fas
file_randomseq.fasta fastafile outfile.fas -n int
Use: fastafile is the initial fasta file.
Use: outfile.fas is the file to output randomly selected sequences to.
Options: -n int: the number of random sequences to select. Defaults to 250 if not assigned.
Use: Randomly selects a number of fasta sequences within a given file and outputs them into a new file.
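A random selection of this kind can be sketched with the core List::Util module. The sub name random_records is hypothetical, and sampling without replacement is an assumption; the script itself may select sequences differently.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Pick up to $n records at random from fasta text, without replacement.
sub random_records {
    my ($text, $n) = @_;
    $n = 250 unless defined $n;    # default stated for the -n option
    my @records = grep { /^>/ } split /^(?=>)/m, $text;
    return @records if $n >= @records;
    return (shuffle @records)[0 .. $n - 1];
}

my $fasta  = join '', map { ">seq$_\nACGT\n" } 1 .. 10;
my @picked = random_records($fasta, 3);
print @picked;
```

When the file holds fewer sequences than requested, the sketch simply returns all of them.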
==========================================================================================
Process_hits Usage
==========================================================================================
# process_hits.pl -input b/r hit.txt genome.fas
or
# process_hits.pl -input -l list.txt genome.fas -random 100
or
# process_hits.pl -input epcr hit.txt genome.fas
where hit.txt is the output file from the Blast, Epcr, or Repeatmasker search and genome.fas is the file searched in fasta format.
Use: Process_hits.pl is the user interface; it takes in data and provides a framework for calling the library modules.
Use: Gene_loc.pm contains the modules for creating Geneloc objects which store information on sequence location within a fasta file.
Use: Gene_process.pm contains the modules required to convert input data into Geneloc objects.
Use: Gene_extract.pm contains the modules required to convert Geneloc objects into fasta sequences and output them in several different ways.
Parameters:
Input Formats: (-i or -input)
Blast = b, B, Blast, or blast
Repeatmasker = r, R, repeatmask, Repeatmask, Repeatmasker, repeat, Repeat, or RepeatMask
epcr = epcr
Usage Options:
-v = Verbose
-log = log
Sequence Options:
-b or -buffer = Extract upstream and/or downstream from a given sequence.
-front = Extracts upstream from a given sequence.
-back = Extracts downstream from a given sequence.
-o int or -overlap int = assigns integer value to the overlap function. Combines overlapping sequence objects into a single object.
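To illustrate the buffer and overlap options, the sketch below widens a hit and merges overlapping hits. It is not the code in Gene_loc.pm or Gene_extract.pm. The sub names, the hash keys (seq_id, start, end), and the reading of the overlap integer as the largest distance between two hits that are still merged are all assumptions. Strand is ignored, so upstream and downstream are taken relative to the forward strand.

```perl
use strict;
use warnings;

# Merge hits on the same sequence whose coordinates overlap or lie within
# $max_gap bases of each other. Each hit is a hash reference with the
# hypothetical keys seq_id, start and end (1-based, inclusive).
sub merge_hits {
    my ($max_gap, @hits) = @_;
    my @sorted = sort {
        $a->{seq_id} cmp $b->{seq_id} or $a->{start} <=> $b->{start}
    } @hits;
    my @merged;
    for my $hit (@sorted) {
        my $last = $merged[-1];
        if ($last
            and $last->{seq_id} eq $hit->{seq_id}
            and $hit->{start} <= $last->{end} + $max_gap)
        {
            # Extend the previous region rather than starting a new one.
            $last->{end} = $hit->{end} if $hit->{end} > $last->{end};
        }
        else {
            push @merged, { %$hit };    # copy so the input is not modified
        }
    }
    return @merged;
}

# Widen a hit by $front bases upstream and $back bases downstream,
# clamped to the boundaries of a sequence of length $seq_length.
sub add_buffer {
    my ($hit, $front, $back, $seq_length) = @_;
    my $start = $hit->{start} - $front;
    my $end   = $hit->{end} + $back;
    $start = 1           if $start < 1;
    $end   = $seq_length if $end > $seq_length;
    return { %$hit, start => $start, end => $end };
}

my @regions = merge_hits(0,
    { seq_id => 'chr1', start => 100, end => 200 },
    { seq_id => 'chr1', start => 150, end => 300 },
);
printf "%s:%d-%d\n", @{$_}{qw(seq_id start end)} for @regions;    # chr1:100-300
```

Sorting by sequence id and start position first means one pass over the hits is enough to merge them.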
Quality Control:
-maxlength int = maximum sequence length.
-minlength int = minimum sequence length.
-gap int = Maximum gap number.
-mis int = Maximum mismatches.
-bit int = Minimum bitscore, Blast only.
-evalue real = Minimum evalue, Blast only. Use in 10e-100 format. Other mathematical expressions may work.
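The quality-control options can be pictured as a chain of threshold checks, sketched below. The sub name passes_filters and the hit keys (length, gaps, mismatches, bitscore, evalue) are hypothetical. The e-value threshold is treated as an upper bound, which is the usual BLAST convention; whether process_hits.pl applies -evalue this way is an assumption.

```perl
use strict;
use warnings;

# Return true if a hit passes every threshold that was supplied.
# Hit keys and option names are hypothetical; unset options are not checked.
sub passes_filters {
    my ($hit, $opt) = @_;
    return 0 if defined $opt->{maxlength} and $hit->{length} > $opt->{maxlength};
    return 0 if defined $opt->{minlength} and $hit->{length} < $opt->{minlength};
    return 0 if defined $opt->{gap} and $hit->{gaps} > $opt->{gap};
    return 0 if defined $opt->{mis} and $hit->{mismatches} > $opt->{mis};
    return 0 if defined $opt->{bit} and $hit->{bitscore} < $opt->{bit};
    # Assumption: the e-value threshold is an upper bound (smaller is better).
    return 0 if defined $opt->{evalue} and $hit->{evalue} > $opt->{evalue};
    return 1;
}

# Perl reads the string '10e-100' as the number 1e-99 in numeric comparisons.
my %opt  = (minlength => 100, mis => 5, evalue => '10e-100');
my @hits = (
    { length => 300, gaps => 0, mismatches => 2, bitscore => 400, evalue => 1e-120 },
    { length => 80,  gaps => 1, mismatches => 0, bitscore => 90,  evalue => 1e-130 },
);
my @kept = grep { passes_filters($_, \%opt) } @hits;
print scalar(@kept), " hit(s) kept\n";    # 1 hit(s) kept
```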
Input / Output Options:
-l or -list file: Take in list of files.
-a or -align: Align all sequences. Default program is MUSCLE.
-R, -random, or -Random int = Align a random selection of sequences.
Split output into multiple files based on:
-splitquery = query (Blast)
-splitname = element name (RM)
-splittype = element family (RM)
-print = Print formatted text file of final Object set.
-qseq file = Retrieve query sequence from library.
-noextract = Performs other functions, but does not extract or align fasta sequences. Typically used in conjunction with -print.