CSReadGen Wiki

RNA-Seq read simulator that offers a wide range of parameter options.

Brought to you by: john-archer

quickstart

1. Obtaining CSReadGen

1.1 A zip file (CSReadGen.zip) containing CSReadGen.jar, the license, sample data and a quick start guide can be downloaded from the Files tab of the SourceForge project page: https://sourceforge.net/projects/csreadgen/.

1.2 CSReadGen has been tested on Ubuntu 20.04, Windows 10 and macOS High Sierra, but it can be used on any operating system that has Java Runtime Environment (JRE) 8 or higher installed. To find out which version of Java is installed, open a terminal window and type java -version. If an update is required, the latest JRE can be obtained from the Oracle website: https://www.oracle.com/java/technologies/javase-downloads.html

1.3 Extract the contents of the zip file and place the jar in the desired directory. Make sure that the permissions on this file allow it to be executed. To do this, right click the file and use the properties tab, or chmod the file from a terminal (sudo chmod +x CSReadGen.jar).
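The Java version check in step 1.2 can also be scripted. The following is a minimal sketch, not part of CSReadGen: java_major_version and meets_requirement are hypothetical helper names, and the sketch only assumes the usual version line printed by java -version (for example java version "1.8.0_281" or openjdk version "11.0.11").

```python
import re


def java_major_version(version_output: str) -> int:
    """Return the major Java version found in the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Releases up to Java 8 report themselves as 1.x (for example 1.8.0_281).
    return second if first == 1 else first


def meets_requirement(version_output: str, minimum: int = 8) -> bool:
    """True when the reported version is at least the JRE version CSReadGen needs."""
    return java_major_version(version_output) >= minimum
```

Note that java -version writes its report to standard error rather than standard output, so a script that captures the text needs to read that stream.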

2. Running CSReadGen

2.1 Although there are many parameters for CSReadGen, the basic command to generate one million read pairs is:

java -jar CSReadGen.jar -ref_set path-to-ref-set -out_dir path-to-out-dir

The various parameter options can be used to generate reads more specific to the user's requirements. For example:

(i) To increase the number of read pairs created use the -no_of_pairs parameter:

java -jar CSReadGen.jar -ref_set path-to-ref-set -out_dir path-to-out-dir -no_of_pairs 5000000

(ii) To generate multiple replicates use the -reps parameter e.g.:

java -jar CSReadGen.jar -ref_set path-to-ref-set -out_dir path-to-out-dir -no_of_pairs 5000000 -reps 5

(iii) To label each of these replicates with a condition indicator use the -ctag parameter e.g.:

java -jar CSReadGen.jar -ref_set path-to-ref-set -out_dir path-to-out-dir -no_of_pairs 5000000 -reps 5 -ctag condA

(iv) To select 1000 random transcripts for over expression across the conditions use:

java -jar CSReadGen.jar -ref_set path-to-ref-set -out_dir path-to-out-dir -no_of_pairs 5000000 -reps 5 -ctag condA -rnd_ovr_exp 1000

Each of these examples can be made more specific through the use of additional parameters, such as those that specify the level of over expression required (-ovr_exp_lfactor and -ovr_exp_hfactor) and those that specify the level of general background variation in read counts between replicates (-vlow and -vhi). All parameters are described below:

-no_of_pairs: Indicates the number of read pairs to be simulated. The number of pairs created will be very slightly below the value specified. This is because read coverage is initially distributed evenly across all reference sequences within the reference set, after which any user-specified levels of over expression and variation are accounted for. The number of pairs required for each template within the reference set is then normalized based on the value of the -no_of_pairs parameter and rounded down to the nearest integer. Default value: 1000000, maximum value: 100000000, minimum value: 10000.
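The effect of rounding down can be illustrated with a few lines of arithmetic. This is a minimal sketch of the normalization described for -no_of_pairs, not the CSReadGen source: allocate_pairs is a hypothetical helper, and the equal weights stand in for whatever per-template values the even coverage step produces.

```python
import math


def allocate_pairs(weights, no_of_pairs):
    """Scale per-template weights to no_of_pairs, rounding each template down."""
    total = sum(weights)
    return [math.floor(no_of_pairs * w / total) for w in weights]


# Three templates with equal weight (an assumption made for illustration).
counts = allocate_pairs([1.0, 1.0, 1.0], 10000)
shortfall = 10000 - sum(counts)
```

Here each template receives 3333 pairs, so 9999 pairs are created in total. The shortfall can never reach the number of templates, because each template loses less than one pair to rounding.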

-reps: Specifies the number of replicates to be created. Reads in replicate datasets share overall identity, as they are generated from the same reference set, but count values differ within: (i) the range of background variation (as specified by -vlow and -vhi), (ii) differentially expressed transcripts (if specified by either -rnd_ovr_exp or -usr_ovr_exp) and (iii) mismatch and indel error rates (if specified by -err_mis, -err_del and/or -err_ins). Default value: 1, maximum value: 100, minimum value: 1.

-ctag: A piece of text that will be added as a label to the read files, e.g. a condition name. If replicates are specified, the replicate number is added automatically; in this case the ctag appears in addition to the replicate number.

-gen_div: The proportion of sites within the sequences of the reference set that will be selected for random variation. For example, if the parameter is 0.1, 10% of the sites within the reference sequences will be selected for random nucleotide alteration, after which the altered sequences will be used for read simulation. Default value: 0.0, maximum value: 1.0, minimum value: 0.0.
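As an illustration of what -gen_div does to a single reference sequence, here is a minimal sketch. It is not the CSReadGen implementation: diversify is a hypothetical helper, and it assumes that the selected sites are chosen without repetition and that each is always changed to a different nucleotide.

```python
import random


def diversify(sequence, gen_div, rng):
    """Alter a gen_div proportion of sites, each to a different random nucleotide."""
    n_sites = round(len(sequence) * gen_div)
    sites = rng.sample(range(len(sequence)), n_sites)
    bases = list(sequence)
    for i in sites:
        # Assumption: the replacement always differs from the original base.
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases)


reference = "ACGT" * 25  # a 100-site toy reference
altered = diversify(reference, 0.1, random.Random(1))
changed = sum(a != b for a, b in zip(reference, altered))
```

With a value of 0.1 and a 100-site sequence, 10 sites differ between reference and altered, and the length is unchanged.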

-vlow: Defines the lower limit on the background variation applied to the otherwise uniform read coverage across reference sequences. For a given reference, the required number of reads is initially calculated so as to provide even coverage relative to all other reference sequences. Then, for each reference within the reference set, the number of reads required is increased by an amount that falls randomly between this value and the corresponding upper bound (-vhi). Once this has been done for all sequences within the reference set, the numbers of reads required are re-normalized in accordance with the user-specified number of read pairs (-no_of_pairs). Default value: 0.1, maximum value: 1.0, minimum value: 0.0.

-vhi: Defines the upper limit on the background variation in read counts, as described for -vlow. Default value: 0.2, maximum value: 1.0, minimum value: 0.0.
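The two-step process described for -vlow and -vhi (increase each count, then re-normalize) might look like the sketch below. This is an illustration under stated assumptions rather than the actual code: apply_background_variation is a hypothetical helper, and it assumes that the increase is a proportion of the even coverage count drawn uniformly between the two bounds.

```python
import math
import random


def apply_background_variation(base_counts, vlow, vhi, no_of_pairs, rng):
    """Increase each count by a random proportion in [vlow, vhi], then re-normalize."""
    if vlow > vhi:
        raise ValueError("vlow must not exceed vhi")
    # Assumption: the increase is proportional to the even coverage count.
    varied = [c * (1.0 + rng.uniform(vlow, vhi)) for c in base_counts]
    total = sum(varied)
    # Re-normalize to the requested number of pairs, rounding each value down.
    return [math.floor(no_of_pairs * v / total) for v in varied]


counts = apply_background_variation([100] * 4, 0.1, 0.2, 400, random.Random(0))
```

Because every reference is increased before re-normalization, the counts end up scattered around the even coverage value rather than all rising above it.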

-ref_set: Specifies the path to the file containing the set of reference sequences from which reads will be simulated. This file must be in fasta or fasta.gz format; which of the two is used is indicated by the -gz parameter, where the default is false.

-out_dir: Specifies the path to the output directory.

-rd_ln: Specifies the length of the simulated reads. If the optional -err_ins and -err_del parameters are used to introduce indel errors into reads, then some minor variation around this length will occur. Default value: 100, maximum value: 500, minimum value: 50.

-insert_sz: Specifies the size of the region that is randomly selected from a reference sequence during the generation of each read pair. If this region is shorter than the combined length of the paired reads, then the reads will overlap each other. Default value: 101, maximum value: 1000, minimum value: 100.
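The overlap described for -insert_sz follows from simple arithmetic. A minimal sketch, assuming that the two reads are taken from opposite ends of the selected region and that the region is at least one read long (neither is stated above); pair_overlap is a hypothetical helper:

```python
def pair_overlap(rd_ln, insert_sz):
    """Number of sites covered by both reads of a pair."""
    if insert_sz < rd_ln:
        raise ValueError("this sketch only covers regions of at least one read length")
    # The reads overlap only when their combined length exceeds the region size.
    return max(0, 2 * rd_ln - insert_sz)


default_overlap = pair_overlap(100, 101)   # default -rd_ln and -insert_sz
no_overlap = pair_overlap(100, 300)
```

Under these assumptions the default settings give reads that overlap at 99 of their 100 sites, so a larger -insert_sz is needed if non-overlapping pairs are wanted.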

-min_tln: Sequences below this length within the reference set will be ignored and no reads will be simulated from them. Default value: 101, maximum value: 50000, minimum value: 50.

-max_tln: Sequences above this length within the reference set will be ignored and no reads will be simulated from them. Default value: 10000, maximum value: 50000, minimum value: 50.
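Taken together, -min_tln and -max_tln act as a length filter on the reference set. The sketch below shows the idea using the default bounds; usable_templates is a hypothetical helper, and sequences exactly at either bound are kept because only those below or above the bounds are described as ignored.

```python
def usable_templates(records, min_tln=101, max_tln=10000):
    """Keep (title, sequence) records whose length lies within the two bounds."""
    return [(t, s) for t, s in records if min_tln <= len(s) <= max_tln]


records = [("short", "A" * 100), ("ok", "A" * 101), ("long", "A" * 10001)]
kept = [title for title, _ in usable_templates(records)]
```

Only the 101-site record survives the default filter in this toy example.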

-gz: Specifies whether or not the input reference sequences in fasta format are compressed. Default value: false. Note: all reads are written to .gz files regardless of this parameter.
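For readers who want to inspect a reference set before simulating from it, the following sketch reads either a plain or a gzip-compressed fasta file, mirroring the choice made with -gz. It is independent of CSReadGen; open_reference and parse_fasta are hypothetical helper names.

```python
import gzip


def open_reference(path, gz=False):
    """Open a reference file as text, decompressing it when gz is true."""
    return gzip.open(path, "rt") if gz else open(path, "r")


def parse_fasta(lines):
    """Yield (title, sequence) pairs from an iterable of fasta lines."""
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)
```

A typical use would be to pass the handle returned by open_reference(path, gz=True) to parse_fasta and count the records or check their lengths.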

-err_mis: The probability, at each site within each read, that a mismatch error will occur. If a mismatch occurs, the original nucleotide will be replaced with one of the other three, chosen at random. Default value: 0.00, maximum value: 1.00, minimum value: 0.00.

-err_ins: The probability, at each site within each read, that an insertion error will occur. If an insertion error occurs, a random nucleotide will be inserted at the site. Default value: 0.00, maximum value: 1.00, minimum value: 0.00.

-err_del: The probability, at each site within each read, that a deletion error will occur. If a deletion error occurs, the nucleotide at that site will be removed. Default value: 0.00, maximum value: 1.00, minimum value: 0.00.
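The three error parameters can be pictured as independent per-site probabilities applied to each read. The sketch below is an assumption-laden illustration, not the CSReadGen code: add_errors is a hypothetical helper, and the order in which the three error types are tested, as well as placing an inserted nucleotide after the current site, are choices made here for simplicity.

```python
import random


def add_errors(read, err_mis, err_ins, err_del, rng):
    """Apply per-site deletion, mismatch and insertion errors to one read."""
    out = []
    for base in read:
        if rng.random() < err_del:
            continue  # deletion: the nucleotide at this site is removed
        if rng.random() < err_mis:
            # mismatch: replace with one of the other three nucleotides
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
        if rng.random() < err_ins:
            # insertion: add a random nucleotide (assumed to follow the site)
            out.append(rng.choice("ACGT"))
    return "".join(out)


read = "ACGT" * 25
noisy = add_errors(read, 0.01, 0.005, 0.005, random.Random(42))
```

Insertions lengthen a read and deletions shorten it, which is why read lengths vary slightly around -rd_ln once indel errors are enabled.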

-rnd_ovr_exp: The number of reference sequences randomly selected for over expression. Once selected, this set will be consistent across replicates (if any). The titles of the selected transcripts will be written to the file "over_expressed_ttls.txt". Default value: 0, maximum value: 10000, minimum value: 0.

-usr_ovr_exp: Specifies the path to a file that contains the titles of the reference sequences that are to be over expressed. By creating and specifying this file, the user can simulate over expression on predefined sets of reference sequences.

-ovr_exp_lfactor: Within each replicate, the number of reads created for reference sequences that have been selected for over expression will be increased, following the even coverage calculation, by a factor falling between this value and the upper limit specified by -ovr_exp_hfactor. For example, if 70 reads were required for a selected reference sequence to provide even coverage relative to other reference sequences, then this value would be increased to 70 + (70 * 2) = 210 if 2 was the factor selected for a particular replicate. These values do not include random background variation, which is added in accordance with the -vlow and -vhi parameters. Default value: 1, maximum value: 50, minimum value: 1.

-ovr_exp_hfactor: This specifies the upper bound described for -ovr_exp_lfactor.
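The over expression arithmetic described for -ovr_exp_lfactor can be written out as follows. This is a minimal sketch: over_expressed_count and draw_factor are hypothetical helpers, and drawing the factor uniformly (and as a real number rather than an integer) is an assumption, since only the two bounds are documented.

```python
import random


def over_expressed_count(even_count, factor):
    """Reads required once a factor has been applied to the even coverage count."""
    return even_count + even_count * factor


def draw_factor(lfactor, hfactor, rng):
    """Pick a factor between the two bounds (assumed uniform) for one replicate."""
    return rng.uniform(lfactor, hfactor)


example = over_expressed_count(70, 2)
factor = draw_factor(1, 5, random.Random(0))
```

With an even coverage count of 70 and a factor of 2 this gives 210, matching the worked example in the parameter description.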

3. Sample Data

Within the downloaded zip archive there is a file called Serinus_canaria_cdna_ln_300_to_5000_release100.fasta.gz. This contains all the cDNA reference transcripts with lengths between 300 and 5000. These were obtained from Ensembl (https://www.ensembl.org/info/data/ftp/index.html) and are discussed further in our manuscript (available shortly). This data set can be used to simulate reads and experiment with the various parameters described above. For example:

java -jar CSReadGen.jar -ref_set /PATH/Serinus_canaria_cdna_ln_300_to_5000_release100.fasta.gz -out_dir /PATH/test/ -reps 2 -gz true
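When many runs with different settings are needed, the command line can be assembled programmatically. The sketch below is a hypothetical wrapper, not something shipped with CSReadGen: build_command and LIMITS are illustrative names, while the flag names and the minimum and maximum values are those listed in section 2.

```python
# Documented (minimum, maximum) values for a few integer parameters.
LIMITS = {
    "no_of_pairs": (10000, 100000000),
    "reps": (1, 100),
    "rd_ln": (50, 500),
    "insert_sz": (100, 1000),
}


def build_command(jar, ref_set, out_dir, **options):
    """Return the argument list for one CSReadGen run, checking known ranges."""
    cmd = ["java", "-jar", jar, "-ref_set", ref_set, "-out_dir", out_dir]
    for name, value in options.items():
        if name in LIMITS:
            low, high = LIMITS[name]
            if not low <= value <= high:
                raise ValueError(f"-{name} must be between {low} and {high}")
        # Booleans are written as true/false, as in the -gz example.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        cmd += [f"-{name}", text]
    return cmd


cmd = build_command("CSReadGen.jar", "/PATH/refs.fasta.gz", "/PATH/test/",
                    reps=2, gz=True)
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the Python standard library; the path /PATH/refs.fasta.gz is a placeholder.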

4. Obtaining Source Code

Alternatively, the code can be downloaded from the Code tab, imported into an IDE such as Netbeans, and recompiled as desired. The steps below are for the Netbeans IDE, but other IDEs will have a similar process. Note: this is neither the recommended nor the required path for obtaining the working software, and is only needed if there is a specific requirement to edit the code. The steps are:

4.1 On the Code tab of the project, obtain the read-only svn checkout link (svn://svn.code.sf.net/p/CSReadGen/code/). There are three options: (i) SSH, (ii) HTTPS and (iii) RO. RO is the read-only option and does not require a password later.

4.2 Open Netbeans and, under the Team menu, select the Subversion sub-menu and then Checkout. This will open a small window with some fields to fill in.

4.3 In the field labelled Repository URL, enter the RO svn checkout link obtained in step 4.1. The username and password can be left blank. Click next.

4.4 Use the browse button to browse the project Repository Folders and select the core folder. This contains all the code. Once OK is pressed, select the local folder that you want to download the code to, e.g. testFolder.

4.5 Click finish. All the code files and subfolders within the core folder will then be placed into the selected location.

4.6 These can be used to set up a new project within Netbeans, after which you can begin to edit and recompile the code. The easiest way to do this is to create a new project from scratch and then paste the core folder into the source directory of the new project.
