| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| README | 2014-06-27 | 1.3 kB | |
| sequence_cleaner.py | 2014-06-27 | 1.9 kB | |
| Totals: 2 Items | | 3.2 kB | 0 |
####### Description:
Analyzing poor data takes CPU time, and interpreting the results from poor data takes people time, so it is always important to preprocess the data.
The script is called Sequence_cleaner. The big idea is to remove duplicate sequences, remove sequences that are too short (the user defines the minimum length), and remove sequences that have too many unknown nucleotides (N) (the user defines the % of N allowed). In the end, the user can choose whether to write the result to a file or print it.
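The three filters described above can be sketched in a few lines of Python. This is only an illustration and not the code in sequence_cleaner.py: the function name clean_sequences, the (header, sequence) tuple layout, the case-insensitive comparison, and the choice to keep a sequence whose % of N is exactly equal to the limit are all assumptions.

```python
def clean_sequences(records, min_length=0, max_percent_n=100.0):
    """Filter (header, sequence) pairs. Hypothetical helper, not the script's own code."""
    seen = set()   # sequences already kept, used to detect duplicates
    kept = []
    for header, seq in records:
        seq = seq.upper()  # assumption: duplicates are compared case-insensitively
        if not seq or len(seq) < min_length:
            continue  # empty or too short
        percent_n = 100.0 * seq.count("N") / len(seq)
        if percent_n > max_percent_n:
            continue  # too many unknown nucleotides (assumption: the limit is inclusive)
        if seq in seen:
            continue  # duplicate of a sequence that was already kept
        seen.add(seq)
        kept.append((header, seq))
    return kept
```

With the default values (0 and 100) neither the length check nor the N check rejects any non-empty sequence, so only duplicates are removed.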
####### Usage:
From the command line, run python sequence_cleaner.py INPUT-(1st) MIN_LENGTH-(2nd) MIN_%-(3rd). There are 3 basic parameters:
#1st: your FASTA file
#2nd: the minimum sequence length (default value 0, which means the minimum length is not checked)
#3rd: the maximum % of N allowed (default value 100, which means all sequences with 'N' will be in your output;
set the value to 0 if you want no sequences with "N" in your output)
For example: python sequence_cleaner.py Aip_coral.fasta 10 10
FYI: if you leave out the 2nd and the 3rd parameters, only the duplicate sequences are removed.
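To make the defaults concrete, here is a minimal sketch of how the three positional parameters could be read. The helper name parse_args and the usage message are hypothetical, and the actual script may handle its arguments differently.

```python
def parse_args(argv):
    """Read the positional parameters; argv excludes the script name. Hypothetical helper."""
    if not argv:
        raise SystemExit("usage: python sequence_cleaner.py INPUT [MIN_LENGTH] [PERCENT_N]")
    fasta_path = argv[0]
    # 2nd parameter: minimum length, default 0 (no length filtering)
    min_length = int(argv[1]) if len(argv) > 1 else 0
    # 3rd parameter: % of N allowed, default 100 (no N filtering)
    max_percent_n = float(argv[2]) if len(argv) > 2 else 100.0
    return fasta_path, min_length, max_percent_n
```

In a script this would be called as parse_args(sys.argv[1:]) after importing sys.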
Questions, Suggestions or Improvements
Send an email to genivaldo.gueiros@gmail.com