Simulate High-Throughput Sequencing Data Code
Status: Beta
Brought to you by:
kbodi1
File | Date | Author | Commit |
---|---|---|---|
LICENSE.txt | 2009-03-06 | kbodi1 | [r3] Added license file, GPLv3 |
README.txt | 2009-03-06 | kbodi1 | [r5] Added README file. |
simhtsd.pl | 2009-03-06 | kbodi1 | [r4] Fixed bug in detecting required options |
Simulate High-Throughput Sequencing Data

./simhtsd.pl

Required options are: either -c or -n (desired coverage or number of reads to output), and -o (output file).

Note that every option requires a parameter. So, if you want to enable the error function, you have to run the program with "-e 1".

1) Output

Note that the program will create two output files: file_1 and file_2. If you are doing paired end reads, the paired reads will go in file_2. If you are doing single reads, file_2 will just be empty. You can then move / rename / shuffle the files as necessary.

2) Supplied reference genome

The last arguments provided to the program should be a list of files that are your reference genomes. These can be in any format that BioPerl's SeqIO library can read. I have been using GenBank format, but I'm sure FastA will work too.

3) Error function

The program will add some error to your sequences if you run it with "-e 1". It will increase the error linearly per base based on the starting error rate and the incremental rate per position.

4) Paired-End

The program will fill the file_2 file with paired-end reads if you run the program with "-p 1". Options for paired-end reads include insert size (-l) and standard deviation of the insert size (-s).

5) 454 Data

This will generate longer reads and ignore all other options except for (-c) and (-n). It attempts to generate a distribution of read lengths that matches 454's sequencer (mean length ~ 400, mode ~ 500), similar to the graph here: http://www.454.com/products-solutions/system-features.asp
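
The linear error model described in section 3 (Error function) can be illustrated with a short sketch. This is not the code from simhtsd.pl, which is written in Perl; it is a minimal Python illustration under stated assumptions. The names error_rate and add_errors, the 0-based position, the cap at 1.0, and the choice to model an error as a substitution to a different base are all assumptions made for the example, and the rates used are arbitrary.

```python
import random

BASES = "ACGT"


def error_rate(position, start_rate, increment):
    """Per-base error probability at a 0-based read position.

    Assumption: the rate grows as start_rate + position * increment
    and is capped at 1.0 so it stays a valid probability.
    """
    return min(1.0, start_rate + position * increment)


def add_errors(read, start_rate, increment, rng=None):
    """Return a copy of `read` with errors drawn from the linear model."""
    if rng is None:
        rng = random.Random()
    out = []
    for i, base in enumerate(read):
        if rng.random() < error_rate(i, start_rate, increment):
            # Assumption: an error is a substitution to a different base.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical rates: 1% at the first base, rising 0.2% per position.
    rng = random.Random(42)
    print(add_errors("ACGT" * 9, start_rate=0.01, increment=0.002, rng=rng))
```

With these hypothetical rates, the last base of a 36-base read (position 35) would be miscalled with probability of about 0.08, compared with 0.01 for the first base, which mimics the quality drop toward the end of short reads.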