####### Description:

Analyzing poor data takes CPU time, and interpreting the results from poor data takes people time, so it is always important to preprocess the data first.

The script is called “Sequence_cleaner”. The idea is to remove duplicate sequences, remove sequences that are too short (the user defines the minimum length), and remove sequences that have too many unknown nucleotides (N) (the user defines the % of N allowed). In the end, the user can choose whether to write the result to an output file or print it.
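The three filters described above can be sketched in plain Python. This is a minimal illustration, not the code in sequence_cleaner.py: the helper names `parse_fasta` and `clean_sequences`, the choice to keep the header of the first copy of a duplicated sequence, and the case-insensitive comparison are all assumptions made for the example.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_sequences(records, min_length=0, max_percent_n=100):
    """Hypothetical helper: return {sequence: header} for the kept records.

    Drops duplicates, sequences shorter than min_length, and sequences
    whose percentage of 'N' is greater than max_percent_n.
    """
    kept = {}
    for header, seq in records:
        seq = seq.upper()  # assumption: duplicates are compared case-insensitively
        if not seq or len(seq) < min_length:
            continue
        percent_n = 100.0 * seq.count("N") / len(seq)
        if percent_n > max_percent_n:
            continue
        # Assumption: the first header seen for a sequence is the one kept.
        kept.setdefault(seq, header)
    return kept
```

With the defaults (minimum length 0, 100 % of N allowed) only the duplicate check removes anything, and a limit of 0 rejects every sequence that contains at least one N.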

####### Usage:
From the command line, run python sequence_cleaner.py INPUT-(1st) MIN_LENGTH-(2nd) MAX_%N-(3rd). There are 3 basic parameters:

        #1st: your FASTA file
        #2nd: the minimum sequence length (default value 0, which means no sequence is removed for being too short)
        #3rd: the maximum % of N allowed in a sequence (default value 100, which means all sequences with 'N' will be in your output;
              set the value to 0 if you want no sequences with "N" in your output)

        For example: python sequence_cleaner.py Aip_coral.fasta 10 10

FYI: if you leave out the 2nd and the 3rd parameters, the defaults apply and only the duplicate sequences are removed.
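One way the positional parameters and their defaults could be read is sketched below. It is an illustration only: the function name `parse_args`, the placeholder names in the usage message, and the error handling are assumptions, and the real script may read its arguments differently.

```python
def parse_args(argv):
    """Hypothetical helper: return (fasta_path, min_length, max_percent_n).

    argv is expected to look like sys.argv, with the script name first.
    """
    if len(argv) < 2:
        raise SystemExit(
            "usage: python sequence_cleaner.py INPUT [MIN_LENGTH] [MAX_%N]"
        )
    fasta_path = argv[1]
    # Defaults from the README: minimum length 0, 100 % of N allowed.
    min_length = int(argv[2]) if len(argv) > 2 else 0
    max_percent_n = float(argv[3]) if len(argv) > 3 else 100.0
    return fasta_path, min_length, max_percent_n
```

Called with only the FASTA file, this returns a minimum length of 0 and an N limit of 100, which matches the duplicates-only behaviour described above.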


####### Questions, Suggestions or Improvements:

Send an email to genivaldo.gueiros@gmail.com 
Source: README, updated 2014-06-27