Name | Modified | Size | Downloads / Week |
---|---|---|---|
mutation_caller_V7.py | 2018-05-23 | 7.0 kB | |
README.txt | 2018-05-14 | 5.2 kB | |
find_start_codon.py | 2018-05-04 | 651 Bytes | |
aa_mutation_caller_V3.py | 2018-05-04 | 3.4 kB | |
combine_tables_V3.py | 2018-05-04 | 3.5 kB | |
translate_V3.py | 2018-05-04 | 1.6 kB | |
mutation_caller_V5.py | 2018-02-12 | 6.6 kB | |
extract_sam_accession_simp_V2.py | 2018-02-12 | 2.5 kB | |
mutation_caller_V4.py | 2018-02-09 | 6.4 kB | |
Totals: 9 Items | | 36.8 kB | 1 |
Table of Contents
---------------------
1. Introduction
2. Requirements
3. Extract reads with high MAPQ
4. Nucleotide mutation calling
5. Find start codon
6. Translate
7. Amino acid differences
8. Combine amino acid changes with nucleotide changes

1 INTRODUCTION
---------------------
This collection of scripts creates a consensus sequence from contigs and finds substitutions, insertions and deletions in a sam file. The position in the reference genome is returned for each mutation. The inserted nucleotides and the substituted nucleotides are also reported. The data is written to csv files. The consensus sequence of the aligned contigs is written to a fasta file.

The scripts were written for experiments with multiple time points but can be used for a single time point. The consensus sequence from the previous time point can be used as the reference for the next time point.

The scripts were written for use in a Linux terminal. Arguments are always monospaced. Scripts are to be run in the order presented below.

2 REQUIREMENTS
---------------------
This script has been tested with Python 3.6.3 and Linux version 2.6.32-696.18.7.el6.x86_64, gcc version 4.4.7 20120313, Red Hat 4.4.7-18. The script uses the Pysam module, which requires at least Python 3.4.X. All modules needed are imported in the script.

Note on file names: Sam file names must fit the following format:
NGSV<sample number>_alignedto_NGSV<second sample number>.sam
For example:
NGSV2_alignedto_NGSV1.sam

3 EXTRACT READS WITH MAPQ >= 20
-------------------------------
Requires Python 2.6. Arguments are the input sam file and the output file name.

python2.6 extract_sam_accession_simp_V2.py input_file.sam output_file_extracted.sam

4 Nucleotide mutation calling
------------------------------
(mutation_caller.py)

SUMMARY: Creates a consensus sequence from the contigs, finds indels, and finds nucleotide mutations.

USAGE: In a Linux terminal.
python3 /mutation_caller_V4.py aligned_contigs.sam path_to_reference/reference.fa /output_directory/

INPUT: 3 arguments.
1. Sam file of aligned contigs.
2. Fasta file containing the reference sequence to which the contigs were aligned.
3. Location of the output directory where the output files will be created.

OUTPUT:
1. Substitutions csv file containing the location of each substitution according to the reference, the reference nucleotide, and the substituted nucleotide.
2. Indel and totals csv file containing the positions of deletions according to the reference, the positions of insertions according to the reference, and the inserted nucleotides. A table of total deletions, insertions, and substitutions is also included.
3. Consensus sequence fasta file containing a sequence made up of the contigs in the order they are aligned to the reference. In regions where there were deletions or where the contigs do not fully cover the genome, the ambiguous base marker "N" is inserted.
(Note: a depth.txt file is also written and is used for the genome coverage calculation, but it contains little other useful information.)

5 Find start codon
------------------
(find_start_codon.py)

SUMMARY: Prints "ATG" and the position of the start codon. Look for the start codon around where you would expect it to be based on the literature. Written for a fasta file with a single sequence.

INPUT: The fasta file of the sequence in which you want to find start codons.

USAGE:
python2.6 find_start_codon.py NGSV2_to_NGSV1_mutseq.fa

6 Translate
------------
(translate.py)

SUMMARY: Translates a DNA sequence to an amino acid sequence. To be used after find_start_codon.py.

INPUT: 3 arguments.
1. Fasta file with a single sequence to translate.
2. Integer position of the start codon.
3. Path where you want to store the translated sequence.

USAGE:
python2.6 translate.py NGSV2_to_NGSV1_mutseq.fa 746 /home/strain/aa_seqs/

7 Amino acid changes
---------------------
(aa_mutation_caller_V2.py)

SUMMARY: Finds differences between two amino acid sequences of equal length. The sequences are assumed to come from the same genome, with substitutions accumulated over time (no frameshifts).

INPUT: 3 arguments.
1. The fasta file of the sequence made from the older sample.
2. Fasta file of the more recent sequence.
3. Path to save the csv file of differences.

USAGE:
python3 aa_mutation_caller.py NGSV1_to_STRAIN_start746_AAseq.fa NGSV2_to_NGSV1_start746_AAseq.fa /home/strain/aa_diffs/

8 Combine amino acid changes with nucleotide change information
----------------------------------------------------------------
(combine_tables_V2.py)

SUMMARY: Determines whether nucleotide SNPs are silent and combines the nucleotide information with the amino acid information.

INPUT: 3 arguments.
1. Amino acid changes file (aa_table).
2. sub_table.csv (output from the nucleotide mutation caller).
3. Output directory.

USAGE:
python3 combine_tables.py NGSV2_to_NGSV1_aa_table.csv NGSV182_to_NGSV180_sub_table.csv /home/strain/full_tables

OUTPUT: full.csv file with SNP nucleotide locations, nucleotides, amino acid changes and positions, genes, and gene type. A second file containing just the SNP positions is written for plotting.
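
EXAMPLE: MAPQ filter (section 3)
The filtering step in section 3 can be sketched in Python 3 as shown here. This is only an illustration and is not the code of extract_sam_accession_simp_V2.py. The function name filter_by_mapq and the decision to keep header lines are assumptions. The sketch relies only on the sam format itself, in which header lines start with "@" and MAPQ is the fifth tab-separated column.

```python
def filter_by_mapq(lines, min_mapq=20):
    """Yield sam header lines and alignment lines with MAPQ >= min_mapq.

    Hypothetical helper for illustration; keeping the header lines is an
    assumption, not documented behaviour of the original script.
    """
    for line in lines:
        if line.startswith("@"):
            # Header line: pass through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # A sam alignment line has at least 11 mandatory columns;
        # MAPQ is column 5 (index 4).
        if len(fields) >= 11 and int(fields[4]) >= min_mapq:
            yield line
```

To filter a file, open the input and output sam files and write each line yielded by filter_by_mapq(input_handle) to the output handle.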
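
EXAMPLE: Amino acid differences (section 7)
The comparison in section 7 can be sketched as a position-by-position walk over two sequences of equal length. This is an illustration and may differ from aa_mutation_caller.py. The function name aa_differences and the 1-based positions in the result are assumptions.

```python
def aa_differences(older, newer):
    """Return (position, old_aa, new_aa) for every differing residue.

    Positions are 1-based (an assumption made for this sketch). Both
    sequences must have the same length, matching the no-frameshift
    assumption stated in section 7.
    """
    if len(older) != len(newer):
        raise ValueError("sequences must be of equal length")
    return [
        (index + 1, old_aa, new_aa)
        for index, (old_aa, new_aa) in enumerate(zip(older, newer))
        if old_aa != new_aa
    ]
```

Each returned tuple corresponds to one row that could be written to a csv file of differences.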
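
EXAMPLE: Silent or non-silent SNP (section 8)
One way to decide whether a SNP is silent, as described in section 8, is to translate the affected codon before and after the substitution and compare the amino acids. This is a sketch under stated assumptions and not the method used by combine_tables.py. It assumes the standard genetic code, a coding sequence that begins at the start codon, and a 0-based SNP index within that sequence. The names CODON_TABLE and classify_snp are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(coding_seq, snp_index, alt_base):
    """Return (ref_aa, alt_aa, is_silent) for one substitution.

    coding_seq starts at the start codon; snp_index is 0-based within it.
    Both conventions are assumptions made for this sketch.
    """
    offset = snp_index % 3              # position of the SNP inside its codon
    codon_start = snp_index - offset    # index of the first base of the codon
    ref_codon = coding_seq[codon_start:codon_start + 3].upper()
    if len(ref_codon) < 3:
        raise ValueError("SNP falls in an incomplete codon")
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return ref_aa, alt_aa, ref_aa == alt_aa
```

For example, in the sequence ATGGCTTAA a change of the third base of the GCT codon to A leaves the amino acid unchanged, while a change of its first base to C does not. Codons that contain the ambiguous base "N" are not in the table and raise a KeyError in this sketch.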