Name | Modified | Size | Downloads / Week |
---|---|---|---|
splitContigsColourGraph.pl | 2012-12-17 | 4.4 kB | |
SeqTabGenerator.pl | 2012-12-17 | 6.8 kB | |
mutateNew.pl | 2012-12-17 | 8.0 kB | |
definedPermutations.pl | 2012-12-17 | 3.8 kB | |
License.txt | 2012-12-17 | 35.1 kB | |
README.txt | 2012-12-17 | 6.3 kB | |
Totals: 6 Items | | 64.5 kB | 0 |
README for the 4 files associated with McComish et al.

1. OVERVIEW

Four Perl scripts accompany the paper; they are described in more detail in section 3.

2. REQUIREMENTS

The scripts require the following to be installed:

2.1 Perl
========
The following additional modules are also required:

    List::Util
    DBI
    Getopt::Long

2.2 Exonerate
=============
Available from http://www.ebi.ac.uk/~guy/exonerate/. The executable 'exonerate' needs to be in your path; otherwise, edit the line in the script that makes the Perl system call so that it points to the executable.

2.3 MySQL
=========
Available from http://dev.mysql.com/downloads/. The user account the scripts use must have all privileges on the database schema being used, so that tables can be created and destroyed.

2.4 The EMBOSS package
======================
Available from http://emboss.sourceforge.net/, with downloads at http://emboss.sourceforge.net/download/. The executable used, 'msbar', needs to be in your path; otherwise, edit the line in the script that makes the Perl system call so that it points to the executable.

2.5 Graphviz
============
Available from http://www.graphviz.org/. The executable 'neato' is the most appropriate for viewing the output from splitContigsColourGraph.pl.

3. FILE DESCRIPTIONS

3.1 SeqTabGenerator.pl
======================
Generates random reads of a given length from a genome stored as a table in a MySQL database, and stores them in a new table in the same database. Takes the following options:

    -database    The name of the MySQL database.
    -username    The username used to access the database.
    -password    The password used to access the database.
    -chromosome  The name of the table containing the genome.
    -seqLength   The length of the reads.
    -iterations  The number of reads to generate.

3.2 mutateNew.pl
================
Simulates sequencing errors in the reads generated by SeqTabGenerator.pl. Adds a column, mutatedSequence, to the table created by SeqTabGenerator.pl.
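
The two steps described so far, sampling reads (3.1) and adding a mutated copy of each read (3.2), can be sketched in Python as below. This is an illustration only, not the code of either Perl script: the function names are hypothetical, the genome is held in memory instead of a MySQL table, and the error model (independent substitutions at a fixed per-base rate) is an assumption, since this README does not describe the model the scripts use.

```python
import random

BASES = "ACGT"

def sample_reads(genome, seq_length, iterations, rng):
    """Draw `iterations` reads of `seq_length` bases from uniformly random start positions."""
    if seq_length > len(genome):
        raise ValueError("read length exceeds genome length")
    last_start = len(genome) - seq_length
    reads = []
    for _ in range(iterations):
        start = rng.randint(0, last_start)  # randint is inclusive at both ends
        reads.append(genome[start:start + seq_length])
    return reads

def mutate(read, error_rate, rng):
    """Replace each base by a different base with probability `error_rate` (assumed model)."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(42)  # fixed seed so the run is repeatable
# Start of the example sequence shown in section 4 of this README.
genome = "AACAACTGCCTCCACCTTATGTATATAGAGCATAAATTTATTACCCCATATTAAGACTAACA"

# One row per read, with the original and the mutated sequence side by side.
rows = [
    {"sequence": read, "mutatedSequence": mutate(read, 0.05, rng)}
    for read in sample_reads(genome, 35, 5, rng)
]
for row in rows:
    print(row["sequence"], row["mutatedSequence"])
```

Here seq_length and iterations play the roles of the -seqLength and -iterations options; the 5% error rate is an arbitrary value chosen for the example.
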
mutateNew.pl takes the same options as SeqTabGenerator.pl, i.e.:

    -database    The name of the MySQL database.
    -username    The username used to access the database.
    -password    The password used to access the database.
    -chromosome  The name of the table containing the genome.
    -seqLength   The length of the reads.
    -iterations  The number of reads to generate.

3.3 definedPermutations.pl
==========================
Extracts 4 million 35-bp reads from the database in a series of different permutations defined in a tab-delimited text file, definedPermutations.txt. Takes the following options:

    -database    The name of the MySQL database.
    -username    The username used to access the database.
    -password    The password used to access the database.
    -seq[A-E]    The names of the five species for which reads are to be extracted.
    -seqLength   The length of the reads.
    -iterations  The number of reads to generate.

3.3.1 definedPermutations.txt

This input file is required by definedPermutations.pl. It is plain text, and each row holds five tab-separated numbers giving the ratios of sequences A, B, C, D, and E. Each row is treated as a separate combination. The file must be in the folder from which the scripts are executed.

3.4 splitContigsColourGraph.pl
==============================
Aligns contigs against a set of reference sequences and splits them into one FASTA file per reference, plus one for the contigs that do not match any of the references. Then converts the Velvet graph file into a DOT file, and colours the nodes according to which reference they align to. Takes the following options:

    -c  The name of the contig file output by Velvet.
    -r  The name of the file containing the reference sequences in FASTA format.
    -g  The name of the LastGraph file output by Velvet.
    -o  The name of the DOT file to be output.

The file containing the reference sequences should follow the NCBI FASTA specification, as Exonerate needs to extract the ID for each sequence.
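
The second step described in 3.4, turning a Velvet graph into a coloured DOT file, can be sketched in Python as below. This is a simplified illustration, not the code of splitContigsColourGraph.pl. It assumes that the colour for each node is already known (for example from the alignment step), reads only the NODE and ARC lines of the LastGraph file, and folds both strands of a node into a single DOT node. The function name, the colour choices and the example lines are all hypothetical.

```python
def lastgraph_to_dot(lastgraph_lines, node_colours, default_colour="grey"):
    """Build DOT text from Velvet LastGraph lines, colouring nodes via `node_colours`."""
    nodes = []
    edges = set()
    for line in lastgraph_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NODE":
            nodes.append(fields[1])
        elif fields[0] == "ARC":
            # A negative id refers to the reverse strand of a node; drop the sign
            # so that both strands map to the same DOT node.
            a, b = fields[1].lstrip("-"), fields[2].lstrip("-")
            edges.add(tuple(sorted((a, b), key=int)))
        # Header, sequence and any other lines are ignored in this sketch.

    out = ["graph contigs {", "  node [style=filled];"]
    for node in nodes:
        colour = node_colours.get(node, default_colour)
        out.append(f'  {node} [fillcolor="{colour}"];')
    for a, b in sorted(edges, key=lambda e: (int(e[0]), int(e[1]))):
        out.append(f"  {a} -- {b};")
    out.append("}")
    return "\n".join(out)

# Made-up, shortened LastGraph content: a header line, three nodes, two arcs.
example = [
    "3\t10\t31\t1",
    "NODE\t1\t4\t40\t40\t0\t0",
    "ACGT",
    "ACGT",
    "NODE\t2\t4\t36\t36\t0\t0",
    "GGCA",
    "TGCC",
    "NODE\t3\t4\t12\t12\t0\t0",
    "TTAA",
    "TTAA",
    "ARC\t1\t2\t5",
    "ARC\t-3\t2\t4",
]

# Nodes 1 and 2 matched a reference; node 3 matched none and gets the default colour.
dot_text = lastgraph_to_dot(example, {"1": "red", "2": "blue"})
print(dot_text)
```

The result is an undirected graph, which suits the 'neato' layout mentioned in 2.5.
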
3.5 Running Order
=================
The three simulation scripts can be run in the following order:

    1) SeqTabGenerator.pl
    2) mutateNew.pl
    3) definedPermutations.pl

Table names are consistent across these three scripts.

4. EXTRA MySQL TABLE

The database should contain a table called 'genomes', which has the following description:

mysql> desc genomes;
+-----------------+---------------+------+-----+---------+----------------+
| Field           | Type          | Null | Key | Default | Extra          |
+-----------------+---------------+------+-----+---------+----------------+
| chr_id          | int(11)       | NO   | PRI | NULL    | auto_increment |
| genomeLength    | int(11)       | YES  |     | NULL    |                |
| fastaChromosome | varchar(25)   | YES  | MUL | NULL    |                |
| chromosome      | varchar(25)   | YES  |     | NULL    |                |
| speciesType     | varchar(25)   | YES  |     | NULL    |                |
| direction       | enum('F','R') | YES  |     | NULL    |                |
| sequence        | mediumtext    | YES  |     | NULL    |                |
+-----------------+---------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

This table is the basis for all the above-mentioned scripts. It contains your genomes of interest, each loaded separately in both the forward and reverse orientation. Having the sequence as 'mediumtext' allows genomes of up to 8Mb, so this will cope with all mitochondria and chloroplasts. Two example rows are shown below:

mysql> select * from genomes limit 2 \G
*************************** 1. row ***************************
         chr_id: 1
   genomeLength: 17804
fastaChromosome: >Rnigromaculata
     chromosome: Rnigromaculata
    speciesType: amphibian
      direction: F
       sequence: AACAACTGCCTCCACCTTATGTATATAGAGCATAAATTTATTACCCCATATTAAGACTAACA................
*************************** 2. row ***************************
         chr_id: 2
   genomeLength: 17804
fastaChromosome: >Rnigromaculata
     chromosome: Rnigromaculata
    speciesType: amphibian
      direction: R
       sequence: TTAAATTTTTAGGAGCTTGTTTTCCAGGAGACCTAGTGATGGGATAAGAAGGACAAAGATAA................
2 rows in set (0.00 sec)
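
As an illustration of how the two rows per genome relate to each other, the Python sketch below builds the forward and reverse rows for one sequence. It assumes that the 'R' row holds the reverse complement of the 'F' row; this README only says "reverse orientation", so check this against your own data. The function names are hypothetical, chr_id is left out because the table fills it in via auto_increment, and loading the rows into MySQL is not shown.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def genome_rows(name, species_type, sequence):
    """Build the 'F' and 'R' rows for one genome, using the column names of 'genomes'."""
    rows = []
    # Assumption: the reverse row is the reverse complement of the forward row.
    for direction, seq in (("F", sequence), ("R", reverse_complement(sequence))):
        rows.append({
            "genomeLength": len(seq),
            "fastaChromosome": ">" + name,
            "chromosome": name,
            "speciesType": species_type,
            "direction": direction,
            "sequence": seq,
        })
    return rows

# First ten bases of the example sequence above, used as a stand-in for a full genome.
example_rows = genome_rows("Rnigromaculata", "amphibian", "AACAACTGCC")
for row in example_rows:
    print(row["direction"], row["genomeLength"], row["sequence"])
```

Both rows carry the same genomeLength, as in the example output above.
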