Note: This introduction applies to SOAPfuse v1.26 and earlier versions. For v1.27, see the separate introduction for that version.
Since v1.24, SOAPfuse has included a script that lets users build the SOAPfuse database themselves. It relies only on a few public database files that can be downloaded easily from the Internet.
The script is named SOAPfuse-S00-Generate_SOAPfuse_database.pl. You can find it in the directory:
/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X/source/
Public database files are updated frequently, so we supply this script to let users refresh the SOAPfuse database files promptly.
Run it without arguments to see the help information:
$ perl SOAPfuse-S00-Generate_SOAPfuse_database.pl

Usage: perl SOAPfuse-S00-Generate_SOAPfuse_database.pl <Options>

Options:

    [public database files]

      -wg   [s]  human whole genome, fasta format. <required>
      -gtf  [s]  gtf file downloaded from Ensembl website. <required>
                 if gz format (postfix), this script will ungzip it; or just link use linux command 'ln -sf'.
                 Please make sure that the gtf file used is corresponding to the whole_genome file.
                 Such as, huamn_gtf (release_52) is for NCBI36 (hg18);
                          huamn_gtf (release_59/61/64/68/69) is for GRCh37 (hg19).
      -cbd  [s]  the cytoBand database file. <required>
                 Download this file from UCSC:
                 For hg18: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz
                 For hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
      -gf   [s]  the complete HGNC Gene Family dataset from http://www.genenames.org, text format. <required>
                 This file can be downloaded from www.genenames.org -> Downloads -> Complete HGNCGene Family dataset.

    [directory]

      -sd   [s]  SOAPfuse software package unpacked directory. <required>
                 with tail as '/xxxxx/SOAPfuse-vx.xx/'.
      -dd   [s]  directory where you want to store these database files. <required>

Author: Wenlong Jia (jiawenlong@genomics.org.cn)
V1.03
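The help text ties each Ensembl GTF release to a genome assembly, and the cytoBand file must match that assembly too. As a rough sketch of that pairing, the shell helpers below pick the cytoBand URL for a given release. The function names assembly_for_release and cytoband_url are hypothetical and not part of SOAPfuse; only the release/assembly pairs and URLs quoted in the help text are covered.

```shell
#!/bin/sh
# Hypothetical helper (not part of SOAPfuse): map an Ensembl GTF release
# number to the UCSC assembly name, using only the pairs from the help text.
assembly_for_release() {
    case "$1" in
        52)             echo "hg18" ;;   # NCBI36
        59|61|64|68|69) echo "hg19" ;;   # GRCh37
        *)              echo "unknown release: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: build the cytoBand URL quoted in the help text
# for the assembly that matches the given release.
cytoband_url() {
    assembly=$(assembly_for_release "$1") || return 1
    echo "http://hgdownload.cse.ucsc.edu/goldenPath/${assembly}/database/cytoBand.txt.gz"
}

# Print the URL to use with an Ensembl release 69 GTF file.
cytoband_url 69
```

Releases that the help text does not mention are rejected instead of guessed, so check the assembly of any newer GTF file yourself before downloading.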
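As the help shows, the script only needs the public files listed under [public database files]: the human whole genome (fasta), the Ensembl gtf file, the UCSC cytoBand file, and the HGNC Gene Family dataset.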
NOTE:
Maximum memory and CPU time of this script:
For Ensembl release 69, the maximum memory is 4.5 GB and the CPU time is about 7.5 hours.
PS. Most of the CPU time is spent on the BLAST alignment of homologous genes.
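To show how the six required options fit together, here is a minimal dry-run sketch that only prints the command line for review. Every path and file name in it is a placeholder assumption, and the function name build_db_command is hypothetical; substitute your own locations before running anything.

```shell
#!/bin/sh
# All values below are placeholders (assumptions); replace them with your own.
SOAPFUSE_DIR="/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X"
WHOLE_GENOME="/path/to/hg19.fa"                 # -wg: human whole genome, fasta
GTF_FILE="/path/to/ensembl_release.gtf.gz"      # -gtf: must match the genome build
CYTOBAND="/path/to/cytoBand.txt.gz"             # -cbd: from UCSC, same assembly
GENE_FAMILY="/path/to/HGNC_gene_family.txt"     # -gf: HGNC Gene Family dataset
DB_DIR="/path/to/soapfuse_database"             # -dd: output directory

# Hypothetical helper: assemble the full command line as one string.
# -sd gets a trailing slash, as the help text describes.
build_db_command() {
    printf 'perl %s/source/SOAPfuse-S00-Generate_SOAPfuse_database.pl -wg %s -gtf %s -cbd %s -gf %s -sd %s/ -dd %s\n' \
        "$SOAPFUSE_DIR" "$WHOLE_GENOME" "$GTF_FILE" "$CYTOBAND" \
        "$GENE_FAMILY" "$SOAPFUSE_DIR" "$DB_DIR"
}

# Dry run: print the command so it can be checked before execution.
build_db_command
```

Given the run time noted above, it may be convenient to paste the printed command into a job script or run it in the background instead of an interactive session.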
Wenlong Jia
04-03-2013