README for the 4 files associated with McComish et al.


1. OVERVIEW

Four Perl scripts accompany the paper.  They are described in more detail in 
section 3.


2. REQUIREMENTS

To use these scripts the following prerequisites are required:

2.1 Perl 
========

The following modules are also required:
	List::Util
	DBI
	Getopt::Long


2.2 Exonerate
=============

Available from http://www.ebi.ac.uk/~guy/exonerate/.  The executable 'exonerate' needs to be in 
your path; otherwise, modify the line in the script that makes the Perl system call so that it 
points to the executable's location.


2.3 MySQL
=========

Available from http://dev.mysql.com/downloads/.  The user account the scripts use must have all 
privileges on the database schema being used, so that tables can be created and destroyed.


2.4 The EMBOSS package
======================

Available from http://emboss.sourceforge.net/, with downloads at 
http://emboss.sourceforge.net/download/.  The executable used, 'msbar', needs to be in your 
path; otherwise, modify the line in the script that makes the Perl system call so that it points 
to the executable's location.


2.5 Graphviz
============

Available from http://www.graphviz.org/.  The executable 'neato' is the most appropriate for 
viewing the output from splitContigsColourGraph.pl.
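
Before running the scripts, it can help to confirm that the external executables named in 
sections 2.2, 2.4 and 2.5 are on your path.  The following is a minimal Python sketch and is not 
part of the distributed scripts; the function name missing_tools is hypothetical.

```python
import shutil

# Executables named in sections 2.2, 2.4 and 2.5 of this README.
REQUIRED_TOOLS = ["exonerate", "msbar", "neato"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```

If a tool is reported as missing, either add its directory to your path or edit the 
corresponding system call in the Perl script, as described above.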


3. FILE DESCRIPTIONS

3.1 SeqTabGenerator.pl
======================

Generates random reads of a given length from a genome stored as a table in a MySQL database, and 
stores them in a new table in the same database.

Takes the following options:
	-database	The name of the MySQL database.
	-username	The username used to access the database.
	-password	The password used to access the database.
	-chromosome	The name of the table containing the genome.
	-seqLength	The length of the reads.
	-iterations	The number of reads to generate.
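
The idea of drawing random reads can be sketched in a few lines of Python.  This is an 
illustration only, not the code in SeqTabGenerator.pl: it works on a string rather than a MySQL 
table, and it assumes that start positions are chosen uniformly and that reads never run past 
the end of the genome, neither of which is stated in this README.  The name random_reads and 
the seed parameter are hypothetical.

```python
import random


def random_reads(genome, seq_length, iterations, seed=None):
    """Return `iterations` reads of `seq_length` bases drawn from `genome`.

    Assumption: start positions are uniform and reads never run off the end.
    """
    if seq_length > len(genome):
        raise ValueError("seq_length is longer than the genome")
    rng = random.Random(seed)
    last_start = len(genome) - seq_length
    reads = []
    for _ in range(iterations):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(genome[start:start + seq_length])
    return reads
```

The arguments seq_length and iterations play the same roles as the -seqLength and -iterations 
options listed above.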
	
	
3.2 mutateNew.pl
================

Simulates sequencing errors in the reads generated by SeqTabGenerator.pl. Adds a column, 
mutatedSequence, to the table created by SeqTabGenerator.pl.

Takes the same options as SeqTabGenerator.pl, i.e.:
	-database	The name of the MySQL database.
	-username	The username used to access the database.
	-password	The password used to access the database.
	-chromosome	The name of the table containing the genome.
	-seqLength	The length of the reads.
	-iterations	The number of reads to generate.
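
This README does not describe the error model, so the following Python sketch only illustrates 
the general idea of simulated sequencing errors.  It uses simple base substitutions at a 
hypothetical per-base rate and is not the method used by mutateNew.pl.  The names 
add_substitution_errors and error_rate are assumptions made for the example.

```python
import random

BASES = "ACGT"


def add_substitution_errors(read, error_rate, seed=None):
    """Replace each base with a different base with probability `error_rate`."""
    rng = random.Random(seed)
    mutated = []
    for base in read:
        if rng.random() < error_rate:
            # Pick any base other than the original one.
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)
```

The result has the same length as the input read, which is what a separate column such as 
mutatedSequence would hold alongside the original read.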


3.3 definedPermutations.pl
==========================

Extracts 4 million 35-bp reads from the database in a series of different permutations defined in 
a tab-delimited text file definedPermutations.txt.

Takes the following options:
	-database	The name of the MySQL database.
	-username	The username used to access the database.
	-password	The password used to access the database.
	-seq[A-E]	The names of the five species for which reads are to be extracted.
	-seqLength	The length of the reads.
	-iterations	The number of reads to generate.
	
3.3.1 definedPermutations.txt

This input file is required by definedPermutations.pl.  Each row contains five tab-separated 
numbers in plain text, giving the ratios of sequences A, B, C, D, and E, and is treated as a 
separate combination of ratios.  The file should be in the folder from which the scripts are 
executed.
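
As an illustration of the format, the Python sketch below parses two hypothetical rows and 
shares a total number of reads between the five species in proportion to each row.  How 
definedPermutations.pl itself turns ratios into read counts is not stated here, so the 
proportional split, the example ratios, and the function names are all assumptions.

```python
def parse_permutations(text):
    """Parse tab-separated rows of five ratios (A, B, C, D, E)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ratios = [float(field) for field in line.split("\t")]
        if len(ratios) != 5:
            raise ValueError("expected 5 tab-separated numbers: %r" % line)
        rows.append(ratios)
    return rows


def reads_per_species(ratios, total_reads):
    """Share `total_reads` between the species in proportion to `ratios`."""
    total_ratio = sum(ratios)
    return [round(total_reads * r / total_ratio) for r in ratios]


# Hypothetical file contents: equal ratios, then a 4:2:2:1:1 mixture.
example = "1\t1\t1\t1\t1\n4\t2\t2\t1\t1\n"
for ratios in parse_permutations(example):
    print(reads_per_species(ratios, 4000000))
```

With 4 million reads in total, the first row gives 800000 reads per species and the second 
gives 1600000, 800000, 800000, 400000 and 400000.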


3.4 splitContigsColourGraph.pl
==============================

Aligns contigs against a set of reference sequences and splits them into a FASTA file for each 
reference, and one for the contigs that don't match any of the references. Then converts the Velvet 
graph file into a DOT file, and colours the nodes according to which reference they align to.
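
The conversion to a coloured DOT file can be pictured with the Python sketch below.  It is not 
the code in splitContigsColourGraph.pl.  It assumes the LastGraph layout described in the Velvet 
manual, in which node records start with NODE followed by the node ID and arc records start with 
ARC followed by the two node IDs, and it assumes that the mapping from node to colour has 
already been worked out from the alignments.  The function name, the graph name and the default 
colour are hypothetical.

```python
def lastgraph_to_dot(lastgraph_text, node_colours, default_colour="grey"):
    """Build an undirected DOT graph from NODE and ARC lines.

    `node_colours` maps node ID (int) to a Graphviz colour name.
    """
    nodes, edges = [], set()
    for line in lastgraph_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NODE":
            nodes.append(int(fields[1]))
        elif fields[0] == "ARC":
            # Negative IDs denote the reverse strand of a node; fold them together.
            a, b = abs(int(fields[1])), abs(int(fields[2]))
            edges.add((min(a, b), max(a, b)))
    lines = ["graph contigs {"]
    for node in nodes:
        colour = node_colours.get(node, default_colour)
        lines.append('    %d [style=filled, fillcolor="%s"];' % (node, colour))
    for a, b in sorted(edges):
        lines.append("    %d -- %d;" % (a, b))
    lines.append("}")
    return "\n".join(lines)
```

Nodes that are absent from node_colours receive the default colour, which corresponds to contigs 
that do not match any reference.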

Takes the following options:
	-c	The name of the contig file output by Velvet.
	-r	The name of the file containing the reference sequences in FASTA format.
	-g	The name of the LastGraph file output by Velvet.
	-o	The name of the DOT file to be output.

The file containing the reference sequences should follow the NCBI FASTA specification, as 
Exonerate needs to extract the ID for each sequence.
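
To check which IDs a reference file will yield, a short Python sketch such as the following can 
be used.  It assumes, as is conventional for FASTA, that the ID is the first whitespace-delimited 
word after the '>' on each header line; the function name fasta_ids is hypothetical.

```python
def fasta_ids(fasta_text):
    """Return the ID of each record: the first word after '>' on a header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
    return ids
```

An empty string in the result marks a header with no ID, and repeated values mark duplicate IDs; 
either is worth fixing before running the script.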


3.5 Running Order
=================

The three simulation scripts can be run in the following order:

1) SeqTabGenerator.pl
2) mutateNew.pl
3) definedPermutations.pl

The table names used by these three scripts are consistent with one another.
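
The running order can be written down as three command lines.  The Python sketch below only 
builds the argument lists and does not run anything.  The option names are those listed in 
section 3; the reading of -seq[A-E] as the five options -seqA to -seqE, the use of a single 
-chromosome value, and the function name build_commands are assumptions.

```python
def build_commands(database, username, password, chromosome,
                   seq_length, iterations, species):
    """Return the three command lines, in running order, as argument lists."""
    if len(species) != 5:
        raise ValueError("five species names are required (-seqA to -seqE)")
    db_opts = ["-database", database, "-username", username,
               "-password", password]
    read_opts = ["-seqLength", str(seq_length), "-iterations", str(iterations)]
    shared = db_opts + ["-chromosome", chromosome] + read_opts
    commands = [
        ["perl", "SeqTabGenerator.pl"] + shared,
        ["perl", "mutateNew.pl"] + shared,
    ]
    species_opts = []
    for letter, name in zip("ABCDE", species):
        species_opts += ["-seq" + letter, name]
    commands.append(["perl", "definedPermutations.pl"] + db_opts
                    + species_opts + read_opts)
    return commands
```

Each list can be passed to subprocess.run(command, check=True) so that a failure in one step 
stops the later ones.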


4. EXTRA MySQL TABLE

In the created database, there should be a table called 'genomes', which has the following 
description:

mysql> desc genomes;
+-----------------+---------------+------+-----+---------+----------------+
| Field           | Type          | Null | Key | Default | Extra          |
+-----------------+---------------+------+-----+---------+----------------+
| chr_id          | int(11)       | NO   | PRI | NULL    | auto_increment |
| genomeLength    | int(11)       | YES  |     | NULL    |                |
| fastaChromosome | varchar(25)   | YES  | MUL | NULL    |                |
| chromosome      | varchar(25)   | YES  |     | NULL    |                |
| speciesType     | varchar(25)   | YES  |     | NULL    |                |
| direction       | enum('F','R') | YES  |     | NULL    |                |
| sequence        | mediumtext    | YES  |     | NULL    |                |
+-----------------+---------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

This table is the basis for all the scripts mentioned above, and will contain your genomes of 
interest, each loaded separately in both the forward and reverse orientation.  Storing the 
sequence as 'mediumtext' allows genomes of up to 16,777,215 bytes (about 16 Mb in a single-byte 
character set), so this will cope with all mitochondria and chloroplasts.

An example of a couple of rows is shown below:

mysql> select * from genomes limit 2 \G
*************************** 1. row ***************************
         chr_id: 1
   genomeLength: 17804
fastaChromosome: >Rnigromaculata
     chromosome: Rnigromaculata
    speciesType: amphibian
      direction: F
       sequence: AACAACTGCCTCCACCTTATGTATATAGAGCATAAATTTATTACCCCATATTAAGACTAACA................
       
*************************** 2. row ***************************
         chr_id: 2
   genomeLength: 17804
fastaChromosome: >Rnigromaculata
     chromosome: Rnigromaculata
    speciesType: amphibian
      direction: R
       sequence: TTAAATTTTTAGGAGCTTGTTTTCCAGGAGACCTAGTGATGGGATAAGAAGGACAAAGATAA................

2 rows in set (0.00 sec)
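
The pair of rows for one genome can be prepared with a short Python sketch like the one below.  
It assumes that the 'R' row holds the reverse complement of the 'F' sequence, which the 
truncated example above does not confirm, and it leaves chr_id to the auto_increment column.  
The function names are hypothetical; the dictionary keys are the column names of the genomes 
table.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (other symbols kept)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def genome_rows(name, species_type, sequence):
    """Build the forward ('F') and reverse ('R') rows for one genome."""
    rows = []
    for direction, seq in (("F", sequence), ("R", reverse_complement(sequence))):
        rows.append({
            "genomeLength": len(seq),
            "fastaChromosome": ">" + name,
            "chromosome": name,
            "speciesType": species_type,
            "direction": direction,
            "sequence": seq,
        })
    return rows
```

The resulting dictionaries can then be inserted into the genomes table with whichever MySQL 
client you prefer.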

Source: README.txt, updated 2012-12-17