quick start

John Archer

1. Obtaining ChimSim

1.1 A zip file (chimsim.zip) containing ChimSim.jar, the license, sample data and this quick start can be downloaded from the Files tab of the SourceForge project page: https://sourceforge.net/projects/chimsim/.

1.2 ChimSim has been tested on Ubuntu 20.04, Windows 10 and macOS High Sierra, but it should be usable on any operating system with Java Runtime Environment (JRE) 8.0 or higher installed. To find out which version of Java is installed, open a terminal window and type java -version. If an update is required, the latest JREs can be obtained from the Oracle website: https://www.oracle.com/java/technologies/javase-downloads.html

1.3 Extract the contents of the zip file and place the jar in the desired directory. Make sure that permissions are set on this file so that it can be executed. To do this, either right click the file and use the properties tab, or change its mode from a terminal with chmod +x ChimSim.jar (prefix the command with sudo only if the file is owned by another user).
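
Steps 1.2 and 1.3 can also be scripted. The sketch below is one possible way to do this in Python and is not part of ChimSim. The function names parse_java_major and extract_jar are hypothetical, and the code assumes that java -version reports a quoted version string such as "1.8.0_281" or "11.0.2" and that the archive contains a file named ChimSim.jar somewhere in its tree.

```python
import os
import re
import stat
import zipfile


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None.

    Assumes a quoted version string such as "1.8.0_281" (Java 8) or "11.0.2".
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Releases up to Java 8 are reported as 1.x, later ones as x.y.z.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def extract_jar(zip_path, dest_dir, jar_name="ChimSim.jar"):
    """Extract the archive into dest_dir and mark the jar as executable.

    Returns the path of the extracted jar. The jar is located by file name
    because its position inside the archive is an assumption here.
    """
    with zipfile.ZipFile(zip_path) as archive:
        members = [m for m in archive.namelist()
                   if os.path.basename(m) == jar_name]
        if not members:
            raise FileNotFoundError(jar_name + " not found in " + zip_path)
        archive.extractall(dest_dir)
    jar_path = os.path.normpath(os.path.join(dest_dir, members[0]))
    # Equivalent of chmod +x: add the execute bits to the current mode.
    mode = os.stat(jar_path).st_mode
    os.chmod(jar_path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return jar_path
```

Note that java -version writes its report to standard error rather than standard output, so a caller that runs the command from Python should read the error stream before passing the text to parse_java_major. A result of 8 or higher satisfies the requirement in step 1.2.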

2. Running ChimSim

2.1 The basic command to generate a transcript reference set containing the default proportion of chimeras (0.1, i.e. 10%) is:

java -jar ChimSim.jar -ref_set path-to-ref-set.fasta.gz -gz true -out_dir path-to-out-dir

Output will be placed in fasta format within the specified directory. Parameter options are described below:

-ref_set: Specifies the path to the file containing the set of reference sequences from which reads will be simulated. This file must be in fasta or fasta.gz format; which of the two is used is indicated by the -gz parameter (default is false).

-out_dir: Specifies the path to the output directory.

-min_tln: Sequences below this length within the reference set will not be included. Default value: 101, maximum value: 50000, minimum value: 50.

-max_tln: Sequences above this length within the reference set will not be included. Default value: 10000, maximum value: 50000, minimum value: 50.

-gz: Specifies whether or not the input reference sequences in fasta format are compressed. Default value: false.

-chim: This specifies the proportion of sequences within the input reference set that will become chimeric within the output. These sequences will be randomly selected and chimerism will be introduced using one of the three categories previously described. The default value of 0.10 means that by default 10% of the output will be chimeric. Default value: 0.10, maximum value: 1.00, minimum value: 0.00.

-wgen_div: For sequences selected to be chimeric according to category (ii), in that they contain random windows of increased variation, this is the threshold that defines the proportion of sites that will contain a random mutation. Default value: 0.10, maximum value: 1.00, minimum value: 0.00.

-max_wins: This defines the maximum number of windows that will be used for chimeric categories (ii) and (iii). A random number of windows will be selected between one and the value specified by this parameter. This number of windows will be spread evenly across the selected template. Default value: 5, maximum value: 10, minimum value: 1.

-win_ln: This defines the size of each window. Default value: 200, maximum value: 500, minimum value: 50.

-min_ext: For reference sequences selected to be chimeric through over-extension, i.e. category (i), this parameter defines the minimum over-extension length that will occur. To create the over-extended region, a different reference sequence is selected at random and a region of length between the selected sequence length and the minimum extension value is randomly defined and used for over-extension. Default value: 100, maximum value: 5000, minimum value: 50.

-tag: A text tag that is appended to the title of the output files.

-gen_div: If this parameter is specified, the user can generate templates in which this proportion of sites is divergent from the original. Selected sites will contain a random mutation. Default value: 0.00, maximum value: 1.00, minimum value: 0.00.
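
When ChimSim is called from a script, it can help to check option values against the ranges listed above before the jar is launched. The following is a minimal sketch of such a wrapper in Python and is not part of ChimSim. The names RANGES and build_command are hypothetical, and the sketch assumes that every option is passed as a flag followed by its value, as in the basic command in 2.1.

```python
# Documented (minimum, maximum) values for the numeric options.
RANGES = {
    "min_tln": (50, 50000),
    "max_tln": (50, 50000),
    "chim": (0.0, 1.0),
    "wgen_div": (0.0, 1.0),
    "max_wins": (1, 10),
    "win_ln": (50, 500),
    "min_ext": (50, 5000),
    "gen_div": (0.0, 1.0),
}


def build_command(ref_set, out_dir, gz=False, tag=None,
                  jar="ChimSim.jar", **options):
    """Return the ChimSim command line as a list of arguments.

    Numeric options are checked against the documented ranges; an unknown
    option name or an out-of-range value raises ValueError.
    """
    command = ["java", "-jar", jar, "-ref_set", ref_set,
               "-gz", "true" if gz else "false", "-out_dir", out_dir]
    for name, value in options.items():
        if name not in RANGES:
            raise ValueError("unknown option: -" + name)
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError("-%s must be between %s and %s"
                             % (name, low, high))
        command += ["-" + name, str(value)]
    if tag is not None:
        command += ["-tag", tag]
    return command
```

For example, build_command("path-to-ref-set.fasta.gz", "path-to-out-dir", gz=True) reproduces the basic command from 2.1 as a list, and adding chim=0.25 appends -chim 0.25 to it. The returned list can be passed to subprocess.run(command, check=True) to launch the jar.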

3. Sample Data

Within the downloaded zip archive there is a file called Serinus_canaria_cdna_ln_300_to_5000_release100.fasta.gz. This contains all the cDNA reference transcripts between 300 and 5000 in length. These were obtained from Ensembl (https://www.ensembl.org/info/data/ftp/index.html). This data can be used to generate chimeric reference sets and to verify that the software is running.
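
One way to inspect the sample file, or any fasta output, is to tabulate the sequence lengths it contains. The sketch below does this in Python and is not part of ChimSim. The function name fasta_lengths is hypothetical, and the code assumes that a file name ending in .gz indicates gzip compression and that sequence headers within a file are unique.

```python
import gzip


def fasta_lengths(path):
    """Return a dict mapping each fasta header (without '>') to its length."""
    # Assumption: a .gz suffix means the file is gzip compressed.
    opener = gzip.open if path.endswith(".gz") else open
    lengths = {}
    header = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                lengths[header] = 0
            elif header is not None:
                # Sequence lines may be wrapped, so lengths are accumulated.
                lengths[header] += len(line)
    return lengths
```

Calling fasta_lengths on the sample file and then taking len(), min() and max() over the returned values gives the number of transcripts and the shortest and longest lengths, which can be compared against the 300 to 5000 range stated above.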

4. Obtaining Source Code

Alternatively the code can be downloaded from the Code tab, imported into an IDE, such as Netbeans, and recompiled as desired. The steps below are for the Netbeans IDE, but others will have a similar process. Note: this is not the recommended (nor required) path for obtaining the working software, unless there is a specific requirement to edit the code. Steps to do this are:

4.1 On the Code tab of the project, obtain the read-only svn checkout link (svn://svn.code.sf.net/p/chimsim/code/). There are three options: (i) SSH, (ii) HTTPS and (iii) RO. The read-only option is RO and does not require a password later.

4.2 Open Netbeans and, under the Team menu, select the Subversion submenu and then Checkout. This will open a small window with some fields to fill in.

4.3 In the field labelled Repository URL, place the RO svn checkout link obtained in step 4.1. The username and password can be left blank. Click Next.

4.4 Use the browse button to browse the project Repository Folders and select the core folder. This contains all the code. Once OK is pressed, select the local folder to which you want to download the code, e.g. testFolder.

4.5 Click Finish. All the code files and subfolders within the core folder will then be placed into the selected location.

4.6 These can be used to set up a new project within Netbeans, and you can begin to edit and recompile the code. The easiest way to do this is to create a new project from scratch and then paste the core folder into the source directory of the new project.
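
For readers who do not use Netbeans, the checkout in steps 4.1 to 4.5 may also be possible from a script. The sketch below is in Python and is not part of ChimSim. It assumes that a command-line Subversion client (svn) is installed and on the path, the names RO_URL, checkout_command and checkout are hypothetical, and it checks out the whole repository rather than only the core folder because the location of that folder within the repository is not stated here.

```python
import subprocess

# Read-only checkout link from step 4.1.
RO_URL = "svn://svn.code.sf.net/p/chimsim/code/"


def checkout_command(dest_dir, url=RO_URL):
    """Return the svn command that checks the repository out into dest_dir."""
    return ["svn", "checkout", url, dest_dir]


def checkout(dest_dir, url=RO_URL):
    """Run the checkout; raises CalledProcessError if svn reports a failure."""
    subprocess.run(checkout_command(dest_dir, url), check=True)
```

Calling checkout("testFolder") would place the working copy in testFolder, after which the core folder can be copied into a new IDE project as described in step 4.6.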

