findprimer.py (sourceforge.net)
Primer trimming:
To find and remove primer binding sites and upstream sequences, we have written the Python script findprimer.py. Because of vector cloning, our sequences can be in either 3’-5’ or 5’-3’ orientation, and the script takes this into account by reversing sequences where necessary.
The script searches for the forward primer (in our case Lco1490) and reverses the sequences in which the forward primer is not found. The sequences are then searched again, and the forward primer and everything upstream of it are trimmed off. See the attachment for a figure showing how the code works.
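The orient-then-trim step described above can be sketched roughly as follows. This is a minimal illustration in Python 3, not the code of findprimer.py itself: the function names are hypothetical, the motif "ATATTGG" is assumed to be the last 7 bases of the commonly published Lco1490 sequence, "reversing" is assumed to mean reverse-complementing, and matching is assumed to be case-insensitive.

```python
# Hypothetical helpers for illustration; names are not taken from findprimer.py.

# Assumption: last 7 bases of the commonly published Lco1490 primer.
FWD_MOTIF = "ATATTGG"

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_and_trim_forward(seq, motif=FWD_MOTIF):
    """Return (trimmed, was_flipped).

    trimmed is the sequence downstream of the first motif match,
    or None if the motif is found in neither orientation.
    """
    seq = seq.upper()  # assumption: lower-case input is treated like upper-case
    flipped = False
    pos = seq.find(motif)
    if pos == -1:
        # Motif not found: flip the read and search once more.
        seq = reverse_complement(seq)
        flipped = True
        pos = seq.find(motif)
    if pos == -1:
        return None, flipped
    # Drop the motif itself and everything upstream of it.
    return seq[pos + len(motif):], flipped


print(orient_and_trim_forward("CCCATATTGGACGTAC"))  # -> ('ACGTAC', False)
print(orient_and_trim_forward("GTACGTCCAATATGGG"))  # -> ('ACGTAC', True)
```

The second call passes the reverse complement of the first read, so it is flipped before trimming and yields the same result.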
The full length of the primer is 25 bp, and we have found that searching with only the last 7 bp of the primer maximises the chances of finding the primer in the sequence. Reducing the search length further is not desirable, as it increases the chance of randomly matching a site in the middle of the cytochrome oxidase 1 gene.
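The trade-off in search length can be illustrated with a back-of-the-envelope estimate. The sketch below assumes independent, equally frequent bases (which real COI sequence is not) and a fragment of about 658 bp; both are assumptions made here for illustration and are not stated in the original. Under them, the expected number of chance matches is about 0.04 per sequence for a 7 bp motif, about 0.16 for 6 bp and about 0.64 for 5 bp.

```python
def expected_chance_hits(seq_len, motif_len):
    """Expected number of random exact matches of a motif in one sequence.

    Assumes each base is independent and uniformly A/C/G/T, so a given
    start position matches with probability (1/4) ** motif_len.
    """
    positions = max(seq_len - motif_len + 1, 0)
    return positions / 4.0 ** motif_len


# Assumption: approximate fragment length, not taken from the original text.
ASSUMED_FRAGMENT_LEN = 658

for motif_len in (5, 6, 7, 8):
    hits = expected_chance_hits(ASSUMED_FRAGMENT_LEN, motif_len)
    print(motif_len, round(hits, 3))
```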
The 5’-trimmed sequences are then searched for the reverse primer (Hco2198), and the reverse primer and all downstream sequence are trimmed off.
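A minimal sketch of the downstream trim, under stated assumptions: the function name is hypothetical, the motif is passed in as it would read on the forward-oriented sequence, and the first match is taken as the primer site. The motif used in the example is a toy string, not the Hco2198 sequence.

```python
def trim_downstream(seq, rev_motif):
    """Return seq with rev_motif and everything after it removed.

    Returns None when the motif is not present, so callers can count
    or skip sequences that lack the reverse primer.
    """
    pos = seq.upper().find(rev_motif.upper())
    if pos == -1:
        return None
    return seq[:pos]


# Toy motif for illustration only.
print(trim_downstream("ACGTACGGGTTTAA", "GGGTTT"))  # -> ACGTAC
```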
The script prints the number of sequences in the input file, the number already in forward orientation, the number reversed, the number trimmed at the forward primer (primer and upstream sequence removed) and the number trimmed at the reverse primer (primer and downstream sequence removed). This allows the whole process to be monitored in the terminal window.
Three output files are generated: one with all the sequences in forward orientation, one with all the sequences after trimming for the forward primer, and one with all the sequences after trimming for the reverse primer. Each output file is used by the script as the input file for the following step, so that the sequences in the last output file are in forward orientation with both primers trimmed.
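The chaining of the three files, together with the counts reported on the terminal, might look roughly like the sketch below. It is an illustration in Python 3 under several assumptions: the function names and the minimal FASTA reader are hypothetical, "reversing" is taken to mean reverse-complementing, sequences are upper-cased on reading, and the motifs are supplied by the caller. The output file names follow the example session shown further down.

```python
import os

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(path):
    """Return a list of (header, sequence) tuples, sequences upper-cased."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks).upper()))
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def write_fasta(path, records):
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(">%s\n%s\n" % (header, seq))


def run_stages(in_path, fwd_motif, rev_motif, out_dir="."):
    """Run the three steps, each reading the file written by the previous one."""
    all_fwd = os.path.join(out_dir, "out_allfwd.fa")
    trimmed_fwd = os.path.join(out_dir, "out_trimmedfwd.fasta")
    trimmed_rev = os.path.join(out_dir, "out_trimmedrev.fasta")
    counts = {"forward": 0, "reversed": 0}

    # Step 1: put every sequence lacking the forward motif in the other orientation.
    records = read_fasta(in_path)
    counts["total"] = len(records)
    oriented = []
    for header, seq in records:
        if fwd_motif in seq:
            counts["forward"] += 1
        else:
            seq = seq.translate(_COMPLEMENT)[::-1]
            counts["reversed"] += 1
        oriented.append((header, seq))
    write_fasta(all_fwd, oriented)

    # Step 2: remove the forward motif and everything upstream of it.
    kept = []
    for header, seq in read_fasta(all_fwd):
        pos = seq.find(fwd_motif)
        if pos != -1:
            kept.append((header, seq[pos + len(fwd_motif):]))
    counts["trimmed_forward"] = len(kept)
    write_fasta(trimmed_fwd, kept)

    # Step 3: remove the reverse motif and everything downstream of it.
    kept = []
    for header, seq in read_fasta(trimmed_fwd):
        pos = seq.find(rev_motif)
        if pos != -1:
            kept.append((header, seq[:pos]))
    counts["trimmed_reverse"] = len(kept)
    write_fasta(trimmed_rev, kept)
    return counts
```

Because each step re-reads the previous step's file, the intermediate files can also be inspected on their own if one of the counts looks unexpected.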
If the program is given a file which is not in the directory, it raises an IOError stating that the file is either not found or cannot be read.
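One way such a check could be written is sketched below. The function name and the exact message wording are assumptions made for illustration and are not copied from findprimer.py.

```python
def open_input(path):
    """Open the input file, re-raising failures with a clearer message."""
    try:
        return open(path)
    except IOError as err:
        # In Python 3, IOError is an alias of OSError, so this also
        # catches FileNotFoundError and PermissionError.
        raise IOError("file %r was not found or cannot be read (%s)"
                      % (path, err.strerror))
```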
[testfile1_findprimer.fa] : This file contains different scenarios that could occur. The first entry contains the forward primer; the second also contains the primer but is in lower case. The third contains both the forward and the reverse primer. There are also a few entries lacking the primer, as well as an empty record, a sequence consisting only of "N"s, one which is shorter than 10 bp and, finally, an entry containing a protein sequence instead.
When running findprimer.py with this test file we get the following:
Sarahs-MacBook-Pro:Code_Virkki&Bourlat sarahbourlat$ python findprimer.py
which file:testfile1_findprimer.fa
10 sequences in file
copied 2 sequences in forward orientation to file out_allfwd.fa
reversed 8 sequences and appended them to file out.allfwd.fa
trimmed 2 sequences with forward primer and saved them to file out_trimmedfwd.fasta
out of which 1 sequences with reverse primer, saved to file out_trimmedrev.fasta