SOAPfuse Wiki

a tool for identifying fusion transcripts from paired-end RNA-Seq data

Brought to you by: jwl890427

Construct_SOAPfuse_database_V1.24-1.26

Note: This guide applies to SOAPfuse v1.26 and earlier versions. The v1.27 procedure is described on a separate page.

Since v1.24, SOAPfuse has included a script that lets users construct the SOAPfuse database themselves. It relies only on a few public database files that are easy to download.

The script is named SOAPfuse-S00-Generate_SOAPfuse_database.pl; you can find it in the directory:
/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X/source/

Public database files are updated frequently, so we supply this script to let users refresh the SOAPfuse database files promptly.

Run it without arguments to see the help information:

    $ perl SOAPfuse-S00-Generate_SOAPfuse_database.pl
    Usage:
         perl SOAPfuse-S00-Generate_SOAPfuse_database.pl <Options>

    Options:

    [public database files]
       -wg   [s]  human whole genome, fasta format. <required>
       -gtf  [s]  gtf file downloaded from Ensembl website. <required>
                   if gz format (postfix), this script will ungzip it; or just link use linux command 'ln -sf'.
                   Please make sure that the gtf file used is corresponding to the whole_genome file.
                   Such as,
                      huamn_gtf (release_52) is for NCBI36 (hg18);
                      huamn_gtf (release_59/61/64/68/69) is for GRCh37 (hg19).
       -cbd  [s]  the cytoBand database file. <required>
                   Download this file from UCSC:
                     For hg18:  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz
                     For hg19:  http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
       -gf   [s]  the complete HGNC Gene Family dataset from http://www.genenames.org, text format. <required>
                   This file can be downloaded from
                   www.genenames.org -> Downloads -> Complete HGNCGene Family dataset.

    [directory]
       -sd   [s]  SOAPfuse software package unpacked directory. <required>
                   with tail as '/xxxxx/SOAPfuse-vx.xx/'.
       -dd   [s]  directory where you want to store these database files. <required>

    Author:
          Wenlong Jia (jiawenlong@genomics.org.cn)
          V1.03
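
Putting the options above together, a full invocation might look like the following sketch. Every path and file name here is a hypothetical placeholder, and the `generate_db` wrapper and `DRY_RUN` variable are illustrative helpers rather than part of SOAPfuse; only the option names and the script location come from the help text.

```shell
# Placeholder paths: every value below is hypothetical, substitute your own.
SOAPFUSE_DIR="/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X"
GENOME_FA="/path/to/hg19.fa"
GTF="/path/to/ensembl.gtf.gz"
# As downloaded from UCSC; the help does not say whether it must be unzipped first.
CYTOBAND="/path/to/cytoBand.txt.gz"
GENE_FAMILY="/path/to/hgnc_gene_family.txt"
DB_DIR="/path/to/SOAPfuse_database"

generate_db() {
    # With DRY_RUN set to a non-empty value, the command is printed instead of run.
    ${DRY_RUN:+echo} perl "$SOAPFUSE_DIR/source/SOAPfuse-S00-Generate_SOAPfuse_database.pl" \
        -wg  "$GENOME_FA" \
        -gtf "$GTF" \
        -cbd "$CYTOBAND" \
        -gf  "$GENE_FAMILY" \
        -sd  "$SOAPFUSE_DIR/" \
        -dd  "$DB_DIR"
}

# Print the command for review; call generate_db without DRY_RUN to run it.
DRY_RUN=1 generate_db
```

Note that `-sd` is given with a trailing slash, following the "with tail as" hint in the help text.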

As the help shows, four public files are required:

human whole genome FASTA file (a single text file containing the sequences of all chromosomes);
GTF file from the Ensembl website;
cytoBand database file from UCSC;
HGNC Gene Family dataset from its official website.
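
The two cytoBand URLs given in the help text differ only in the genome build name, so the right one can be derived from the build you use. A minimal sketch follows; the function name `cytoband_url` is hypothetical, and only hg18 and hg19 are covered because those are the only builds the help text mentions.

```shell
# Print the UCSC cytoBand URL for a genome build (hg18 or hg19 only).
cytoband_url() {
    case "$1" in
        hg18|hg19)
            echo "http://hgdownload.cse.ucsc.edu/goldenPath/$1/database/cytoBand.txt.gz"
            ;;
        *)
            echo "unsupported build: $1" >&2
            return 1
            ;;
    esac
}

# Example download (requires network access and wget):
# wget "$(cytoband_url hg19)"
```

Pick the same build as your whole genome FASTA and GTF files, since the help text asks for the GTF to correspond to the genome.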

NOTE:

The reference segment names in the human whole genome FASTA file must be prefixed with 'chr'.
Please make sure your human whole genome FASTA file contains all of the segment names listed below.
- [autosome]:
  chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10,
  chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22,
- [allosome and mitochondrial genome]:
  chrM, chrX, chrY
We suggest keeping the rarer segments in your human whole genome FASTA file, such as
chr17_ctg5_hap1, chr4_ctg9_hap1 and chr6_apd_hap1.
SOAPfuse will not detect fusion genes on these segments, but keeping them allows a more comprehensive alignment against the whole genome. If you discard them, SOAPfuse will still run successfully, so the choice is yours.
The requirements for database files changed slightly in v1.25. Please use the script shipped with SOAPfuse v1.25 to rebuild your database in one step. Do not use database files from v1.24 or earlier with v1.25; otherwise, errors may occur.
You may find that some gene names or transcript names are prefixed with strings like "SOAPfuse2SOAPfuse"; this is a special operation that SOAPfuse applies to the information from the GTF file. It is explained on a separate wiki page.
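
Before building the database, it can help to confirm that the FASTA file really contains every required segment name. The sketch below is one way to do that; the function name `missing_segments` is hypothetical, and it assumes each header line starts with '>' followed by the segment name, optionally followed by whitespace and a description.

```shell
# Print each required segment name (chr1..chr22, chrX, chrY, chrM)
# that has no matching '>' header line in the given FASTA file.
missing_segments() {
    fasta="$1"
    for name in $(seq 1 22) X Y M; do
        # Match '>chrN' at line start, followed by whitespace or end of line,
        # so that '>chr1' does not match '>chr10'.
        if ! grep -q -E "^>chr${name}([[:space:]]|\$)" "$fasta"; then
            echo "chr${name}"
        fi
    done
}

# Usage (path is a placeholder): empty output means nothing is missing.
# missing_segments /path/to/hg19.fa
```

Extra segments such as chr6_apd_hap1 are ignored by this check, so it works whether or not you keep them.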

Maximum memory and CPU time of this script:
For Ensembl release 69, the maximum memory is about 4.5 GB and the CPU time is about 7.5 hours.
PS. Most of the CPU time is spent on the BLAST alignment of homologous genes.

Wenlong Jia
04-03-2013