README for the 4 files associated with McComish et al.
1. OVERVIEW
Four Perl scripts accompany the paper. They are described in more detail in section 3.
2. REQUIREMENTS
The scripts require the following software:
2.1 Perl
========
The following additional modules are also required:
List::Util
DBI
Getopt::Long
2.2 Exonerate
=============
Available from http://www.ebi.ac.uk/~guy/exonerate/. The executable 'exonerate' must be in your
path. If it is not, edit the line containing the Perl system call so that it points to the location
of the executable.
2.3 MySQL
=========
Available from http://dev.mysql.com/downloads/. The user account the scripts use must have all
privileges on the database schema being used, so that tables can be created and destroyed.
2.4 The EMBOSS package
======================
Available from http://emboss.sourceforge.net/, with downloads at
http://emboss.sourceforge.net/download/. The executable used, 'msbar', must be in your path. If it
is not, edit the line containing the Perl system call so that it points to the location of the
executable.
2.5 Graphviz
============
Available from http://www.graphviz.org/. The executable 'neato' is the most appropriate for
viewing the output from splitContigsColourGraph.pl.
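Before running the scripts, it can help to confirm that the executables named in sections 2.2, 2.4
and 2.5 are reachable. The Python sketch below is not part of the accompanying scripts. The helper
name 'missing_tools' is hypothetical, and the check only looks at the path, not at program versions.

```python
import shutil

# Executables named in sections 2.2, 2.4 and 2.5 of this README.
REQUIRED_TOOLS = ["exonerate", "msbar", "neato"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    absent = missing_tools(REQUIRED_TOOLS)
    if absent:
        print("Not found on PATH: " + ", ".join(absent))
    else:
        print("All required executables were found.")
```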
3. FILE DESCRIPTIONS
3.1 SeqTabGenerator.pl
======================
Generates random reads of a given length from a genome stored as a table in a MySQL database, and
stores them in a new table in the same database.
Takes the following options:
-database    The name of the MySQL database.
-username    The username used to access the database.
-password    The password used to access the database.
-chromosome  The name of the table containing the genome.
-seqLength   The length of the reads.
-iterations  The number of reads to generate.
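The idea of drawing random fixed-length reads from a genome can be sketched in Python as follows.
This is an illustration only, not the code in SeqTabGenerator.pl: it works on an in-memory string
rather than a MySQL table, and it assumes uniformly chosen start positions on a linear sequence
(no wrap-around for circular genomes). The function name 'random_reads' and the 'seed' parameter
are hypothetical.

```python
import random


def random_reads(genome, seq_length, iterations, seed=None):
    """Return `iterations` (start, read) pairs sampled uniformly from `genome`."""
    if seq_length <= 0 or seq_length > len(genome):
        raise ValueError("seq_length must be between 1 and the genome length")
    rng = random.Random(seed)
    last_start = len(genome) - seq_length  # last position a full read can start at
    reads = []
    for _ in range(iterations):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append((start, genome[start:start + seq_length]))
    return reads
```

Here 'seq_length' and 'iterations' play the roles of the -seqLength and -iterations options.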
3.2 mutateNew.pl
================
Simulates sequencing errors in the reads generated by SeqTabGenerator.pl. Adds a column,
mutatedSequence, to the table created by SeqTabGenerator.pl.
Takes the same options as SeqTabGenerator.pl, i.e.:
-database    The name of the MySQL database.
-username    The username used to access the database.
-password    The password used to access the database.
-chromosome  The name of the table containing the genome.
-seqLength   The length of the reads.
-iterations  The number of reads to generate.
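As a rough illustration of what simulating sequencing errors means, the Python sketch below applies
random point substitutions to a read. This README does not state which error model mutateNew.pl
uses, so the substitution-only model, the 'error_rate' parameter, and the function name
'mutate_read' are all assumptions made for this example.

```python
import random

BASES = "ACGT"


def mutate_read(read, error_rate, seed=None):
    """Return a copy of `read` with each base substituted with probability `error_rate`."""
    rng = random.Random(seed)
    mutated = []
    for base in read:
        if rng.random() < error_rate:
            # Pick one of the other three bases so the position really changes.
            choices = [b for b in BASES if b != base.upper()]
            mutated.append(rng.choice(choices))
        else:
            mutated.append(base)
    return "".join(mutated)
```

The result has the same length as the input, which matches the idea of storing it alongside the
original read in a mutatedSequence column.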
3.3 definedPermutations.pl
==========================
Extracts 4 million 35-bp reads from the database in a series of different permutations defined in
a tab-delimited text file definedPermutations.txt.
Takes the following options:
-database    The name of the MySQL database.
-username    The username used to access the database.
-password    The password used to access the database.
-seq[A-E]    The names of the five species for which reads are to be extracted.
-seqLength   The length of the reads.
-iterations  The number of reads to generate.
3.3.1 definedPermutations.txt
This input file is required by definedPermutations.pl. Each row holds five tab-separated numbers in
plain text, giving the ratios of sequences A, B, C, D, and E. Each row is treated as a separate
combination. The file must be in the folder from which the scripts are executed.
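The file format described above can be read with a few lines of Python. The sketch below also shows
one way a row of ratios could be turned into read counts. How definedPermutations.pl itself converts
ratios into counts is not stated here, so the proportional split, and the function names
'parse_permutations' and 'split_reads', are assumptions.

```python
def parse_permutations(text):
    """Parse tab-separated rows of five ratios (A, B, C, D, E)."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(fields)}")
        rows.append(tuple(float(field) for field in fields))
    return rows


def split_reads(ratios, total_reads):
    """Split `total_reads` between species in proportion to `ratios`."""
    weight = sum(ratios)
    if weight <= 0:
        raise ValueError("ratios must sum to a positive number")
    exact = [total_reads * ratio / weight for ratio in ratios]
    counts = [int(value) for value in exact]
    # Hand out the reads lost to rounding down, largest fractional part first.
    leftover = total_reads - sum(counts)
    by_remainder = sorted(range(len(ratios)),
                          key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts
```

For example, a row of 4, 2, 1, 1, 0 applied to 4 million reads gives 2,000,000 reads for A,
1,000,000 for B, 500,000 each for C and D, and none for E.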
3.4 splitContigsColourGraph.pl
==============================
Aligns contigs against a set of reference sequences and splits them into one FASTA file per
reference, plus one for the contigs that do not match any reference. It then converts the Velvet
graph file into a DOT file and colours the nodes according to the reference they align to.
Takes the following options:
-c  The name of the contig file output by Velvet.
-r  The name of the file containing the reference sequences in FASTA format.
-g  The name of the LastGraph file output by Velvet.
-o  The name of the DOT file to be output.
The file containing the reference sequences should follow the NCBI FASTA specification, as
Exonerate needs to extract the ID for each sequence.
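To illustrate the final step, the Python sketch below writes a DOT graph whose nodes are coloured by
the reference they were assigned to. It is not the code in splitContigsColourGraph.pl: the node
assignments and arcs are passed in directly rather than parsed from the Velvet and Exonerate output,
the graph is written as undirected, and the function name 'contigs_to_dot' and the colour names are
hypothetical.

```python
def contigs_to_dot(node_reference, arcs, colours, unmatched_colour="grey"):
    """Build DOT text for a graph whose nodes are coloured by reference.

    node_reference maps node ID -> reference ID (or None if unmatched),
    arcs is a list of (start, end) node ID pairs,
    colours maps reference ID -> Graphviz colour name.
    """
    lines = ["graph contigs {", "  node [style=filled];"]
    for node in sorted(node_reference):
        reference = node_reference[node]
        colour = colours.get(reference, unmatched_colour)
        lines.append(f'  "{node}" [fillcolor="{colour}"];')
    for start, end in arcs:
        lines.append(f'  "{start}" -- "{end}";')
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Text produced this way can be saved to a file and rendered with, for example,
'neato -Tpng graph.dot -o graph.png'.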
3.5 Running Order
=================
The three simulation scripts can be run in the following order:
1) SeqTabGenerator.pl
2) mutateNew.pl
3) definedPermutations.pl
The table names used by these three scripts are consistent with one another.
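The running order above could be driven from a small Python wrapper such as the sketch below. It
only uses the options listed in sections 3.1 to 3.3. The spelling of the five species options as
-seqA to -seqE is inferred from '-seq[A-E]', the argument values are placeholders, and 'perl' is
assumed to be on the path. Whether the first two scripts need to be run once per genome is not
covered here.

```python
import subprocess


def build_commands(database, username, password, chromosome,
                   seq_length, iterations, species):
    """Return the three command lines, in running order, as argument lists."""
    if len(species) != 5:
        raise ValueError("definedPermutations.pl expects five species names")
    common = ["-database", database, "-username", username, "-password", password]
    sizes = ["-seqLength", str(seq_length), "-iterations", str(iterations)]
    per_genome = common + ["-chromosome", chromosome] + sizes
    species_flags = []
    for letter, name in zip("ABCDE", species):
        species_flags += [f"-seq{letter}", name]  # assumed spelling: -seqA ... -seqE
    return [
        ["perl", "SeqTabGenerator.pl"] + per_genome,
        ["perl", "mutateNew.pl"] + per_genome,
        ["perl", "definedPermutations.pl"] + common + species_flags + sizes,
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```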
4. EXTRA MySQL TABLE
In the created database, there should be a table called 'genomes', which has the following
description:
mysql> desc genomes;
+-----------------+---------------+------+-----+---------+----------------+
| Field           | Type          | Null | Key | Default | Extra          |
+-----------------+---------------+------+-----+---------+----------------+
| chr_id          | int(11)       | NO   | PRI | NULL    | auto_increment |
| genomeLength    | int(11)       | YES  |     | NULL    |                |
| fastaChromosome | varchar(25)   | YES  | MUL | NULL    |                |
| chromosome      | varchar(25)   | YES  |     | NULL    |                |
| speciesType     | varchar(25)   | YES  |     | NULL    |                |
| direction       | enum('F','R') | YES  |     | NULL    |                |
| sequence        | mediumtext    | YES  |     | NULL    |                |
+-----------------+---------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)
This table is the basis for the scripts described above. It contains your genomes of interest, each
loaded separately in both the forward and reverse orientation. Storing the sequence as 'mediumtext'
allows genomes of up to 16 MB (16,777,215 bytes), which is enough for mitochondrial and chloroplast
genomes.
An example of a couple of rows is shown below:
mysql> select * from genomes limit 2 \G
*************************** 1. row ***************************
         chr_id: 1
   genomeLength: 17804
fastaChromosome: >Rnigromaculata
     chromosome: Rnigromaculata
    speciesType: amphibian
      direction: F
       sequence: AACAACTGCCTCCACCTTATGTATATAGAGCATAAATTTATTACCCCATATTAAGACTAACA................
*************************** 2. row ***************************
         chr_id: 2
   genomeLength: 17804
fastaChromosome: >Rnigromaculata
     chromosome: Rnigromaculata
    speciesType: amphibian
      direction: R
       sequence: TTAAATTTTTAGGAGCTTGTTTTCCAGGAGACCTAGTGATGGGATAAGAAGGACAAAGATAA................
2 rows in set (0.00 sec)
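Pairs of rows like the two shown above could be prepared with a short Python helper such as the
sketch below. It assumes that direction 'R' holds the reverse complement of the forward sequence,
which this README does not state explicitly. The function names 'reverse_complement' and
'genome_rows' are hypothetical, and chr_id is left out because MySQL fills it in through
auto_increment.

```python
# Translation table for complementing DNA bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def genome_rows(name, species_type, sequence):
    """Return the forward and reverse rows for one genome as dictionaries."""
    rows = []
    for direction, seq in (("F", sequence), ("R", reverse_complement(sequence))):
        rows.append({
            "genomeLength": len(seq),
            "fastaChromosome": ">" + name,
            "chromosome": name,
            "speciesType": species_type,
            "direction": direction,
            "sequence": seq,
        })
    return rows
```

Each dictionary maps column names of the 'genomes' table to values and could be passed to an INSERT
statement through whichever MySQL client library is in use.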