
Construct_SOAPfuse_database

Since v1.24, SOAPfuse has shipped a script that lets users construct the SOAPfuse database themselves. It relies only on a few public database files that are easy to download from the Internet.
Note: This guide applies to SOAPfuse v1.27; SOAPfuse v1.26 and earlier versions are covered by a separate guide.

The script is named SOAPfuse-S00-Generate_SOAPfuse_database.pl. You can find it in the directory:
/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X/source/

Public files are updated frequently, so we supply this script to let users keep the SOAPfuse database files up to date.

Before you run this script, please set the Perl library path as described in this post.

    $ perl SOAPfuse-S00-Generate_SOAPfuse_database.pl -h
    $
     Usage:
      perl SOAPfuse-S00-Generate_SOAPfuse_database.pl <Options>

     Options:

     [public database]
      -wg   [s]  whole genome, fasta format. <required>
                 NOTE: The chromosome names in this fasta file must have prefix string "chr",
                       e.g., for chromosome 1, it should be "chr1", not "1".
      -gtf  [s]  gtf file downloaded from Ensembl website. <required>
                 NOTE: Please make sure that the gtf file used is corresponding to the version of whole_genome reference:
                       NCBI36 (hg18): release_52.
                       GRCh37 (hg19): release_59/61/64/68/69/75.
                       GRCh38 (hg38): release_76/77/78/80/81/82.
      -cbd  [s]  the cytoBand database file. <required>
                 NOTE: Download this file from UCSC:
                       For hg18:  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz
                       For hg19:  http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
                       For hg38:  http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz
      -gf   [s]  the complete HGNC Gene Family dataset from http://www.genenames.org, text format. <required>
                 NOTE: At the official website, go to 'Download' page, find 'Complete dataset download links'.
                       right click the 'TXT' icon of 'HGNC Gene Family dataset', and select 'save as'.
                       You can firstly try this. Currently (2016-01-08), the direct download-link is:
                         'http://www.genenames.org/cgi-bin/genefamilies/download-all/tsv'

     [supplementary data]
      -rft  [s]  tab-delimited list of refseg symbols relationship. <required>
                 NOTE: Generaly, the ref_seg symbols in GTP file (normally, the first column) is different from
                       that in reference file (the '-wg' parameter). So, you should specify the corresponding relationship
                       of refseg symbols between GTF file and reference file. It should be noted that only the refseg_symbol(s)
                       listed in this file will be processed by SOAPfuse, others (in gtf file) will be ignored.
                       Format:       refseg_symbol_in_gtf  \t  refseg_symbol_in_reference
                       Instance:     10 \t chr10
     -stc   [s]  the standard start codon sequence, forced to upper cases, can be used in multiple times. [ATG]
                 NOTE: Although, to our knowledge, the start condon sequence ATG is extensive in all species,
                       we still provide this para for user to state other sequences. Actually, some genes have
                       different sequence, such as, MYC-001 with CTG, TEAD4-001 with TTG, and some genes in
                       mitochondria and chloroplast. So, if you want to save these genes as 'protein-coding',
                       but not 'protein-coding-with-abnormal-start_codon', please state their special start_codon
                       sequences to take into account, like ' -stc CTG -stc TTG '.
     -sor   [s]  select the data source of gtf, can be used in multiple times. Default to accept all sources.
                 NOTE: From v75, ensembl provides the source_info stored by 'gene_source' tag.
                       SOAPfuse will accept the transcript whose gene_source tag contains your input.
                       For instance, once you input 'havana', SOAPfuse will accept both 'havana' and 'ensembl_havana'.
                                     if you input 'ensembl', it means both 'ensembl' and 'ensembl_havana' will be accepted.

     [directory]
     -sd    [s]  SOAPfuse software package unpacked folder. <required>
                 NOTE: it tails as '/xxxxx/SOAPfuse-vx.xx/'.
     -dd    [s]  folder where you want to store these database files. <required>

     -h        Display this help info.

     Author:
      Wenlong Jia (wenlongkxm@gmail.com)
      V1.05 at 2016-01-08
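
The '-wg' note above requires every chromosome name in the fasta file to start with "chr". A quick pre-check could look like the sketch below. The function name check_chr_prefix and the example path are hypothetical, and the check is not part of SOAPfuse itself.

```shell
# Report FASTA header names that lack the "chr" prefix.
# Returns 0 when every header starts with ">chr", 1 otherwise.
check_chr_prefix() {
    fasta="$1"
    # Keep only header lines, then keep those NOT starting with ">chr".
    bad=$(grep '^>' "$fasta" | grep -v '^>chr' || true)
    if [ -n "$bad" ]; then
        printf 'Headers without "chr" prefix:\n%s\n' "$bad" >&2
        return 1
    fi
    return 0
}

# Usage (hypothetical path):
# check_chr_prefix /path/to/whole_genome.fa
```

If the function reports any header, rename those sequences before passing the file to '-wg'.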

As the help message shows, only four public files are needed:

  1. the human whole genome fasta file (a single text file that contains the sequences of all chromosomes);
  2. the gtf file from the Ensembl website;
  3. the cytoBand database file from UCSC;
  4. the HGNC Gene Family dataset from its official website.
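
With the public files downloaded and a refseg relationship file prepared for '-rft', a run that sets every required option could look like the sketch below. All paths and file names are placeholders (assumptions), and the function name build_soapfuse_db is hypothetical. Only the option names come from the help message. Whether cytoBand.txt.gz must be decompressed first is not stated in the help message, so check that for your version.

```shell
# Placeholder paths (assumptions): replace every value with your own locations.
SOAPFUSE_DIR="/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X/"
DB_DIR="/path/to/soapfuse_db"
GENOME_FA="/path/to/hg19.fa"
GTF="/path/to/Homo_sapiens.GRCh37.75.gtf"
CYTOBAND="/path/to/cytoBand.txt.gz"
GENE_FAMILY="/path/to/HGNC_gene_family.txt"
RFT="/path/to/refseg_relationship.list"

# Dry run by default: RUN=echo only prints the command line.
# Set RUN to an empty string in the environment to really execute it.
RUN="${RUN-echo}"

build_soapfuse_db() {
    $RUN perl "${SOAPFUSE_DIR}source/SOAPfuse-S00-Generate_SOAPfuse_database.pl" \
        -wg  "$GENOME_FA" \
        -gtf "$GTF" \
        -cbd "$CYTOBAND" \
        -gf  "$GENE_FAMILY" \
        -rft "$RFT" \
        -sd  "$SOAPFUSE_DIR" \
        -dd  "$DB_DIR"
}

build_soapfuse_db
```

The dry-run default makes it easy to inspect the final command line before a long database build starts.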

Tips:

  1. Generally, the reference segment names in the human whole genome fasta file are prefixed with the string 'chr'.
    Here, we supply one example that shows what the refseg symbols relationship file for the '-rft' option should look like.
  2. We suggest keeping the rare segments in your human whole genome fasta file, such as:
    chr17_ctg5_hap1, chr4_ctg9_hap1, chr6_apd_hap1 and so on.
    SOAPfuse will not detect fusion genes from these segments, but keeping them allows a comprehensive alignment against the whole genome. If you discard them, SOAPfuse will still run successfully, so the choice is yours.
  3. Click here to learn about the new PSL file format for SOAPfuse v1.27.
  4. Click here to learn about repeated gene_names and transcript_names.
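
As an illustration of the '-rft' file described in the help message, the sketch below writes one "name in gtf, TAB, name in reference" pair per line for the primary human chromosomes. It assumes an Ensembl gtf (names 1 to 22, X, Y, MT) and a UCSC-style reference (chr1 to chr22, chrX, chrY, chrM), and it assumes a plain tab with no surrounding spaces is expected. The output file name is hypothetical.

```shell
# Hypothetical output file name; pass this path to the '-rft' option.
rft_file="refseg_relationship.list"

# Start from an empty file.
: > "$rft_file"

# Autosomes and sex chromosomes: "N" in the gtf maps to "chrN" in the reference.
for c in $(seq 1 22) X Y; do
    printf '%s\tchr%s\n' "$c" "$c" >> "$rft_file"
done

# The mitochondrial genome is "MT" in Ensembl but "chrM" in UCSC-style references.
printf 'MT\tchrM\n' >> "$rft_file"
```

Only the segments listed in this file are processed by SOAPfuse, so add further lines if you want other segments from the gtf file to be included.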

Wenlong Jia
01-19-2016

