SOAPfuse Wiki
a tool for identifying fusion transcripts from paired-end RNA-Seq data
Brought to you by:
jwl890427
Since v1.24, SOAPfuse has included a script that lets users construct the SOAPfuse database themselves. It relies only on a few public database files that are easy to download from the Internet.
Note: This introduction applies to SOAPfuse v1.27; a separate page covers SOAPfuse v1.26 and earlier versions.
The script is named SOAPfuse-S00-Generate_SOAPfuse_database.pl; you can find it in the directory:
/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X/source/
Public database files are updated frequently, so we supply this script to let users keep the SOAPfuse database files up to date.
Before you run this script, please set the Perl library path as described in the corresponding post.
    $ perl SOAPfuse-S00-Generate_SOAPfuse_database.pl -h

    Usage: perl SOAPfuse-S00-Generate_SOAPfuse_database.pl <Options>

    Options:

      [public database]

        -wg   [s]  whole genome, fasta format. <required>
                   NOTE: The chromosome names in this fasta file must have prefix string "chr",
                         e.g., for chromosome 1, it should be "chr1", not "1".
        -gtf  [s]  gtf file downloaded from Ensembl website. <required>
                   NOTE: Please make sure that the gtf file used is corresponding to the version
                         of whole_genome reference:
                           NCBI36 (hg18): release_52.
                           GRCh37 (hg19): release_59/61/64/68/69/75.
                           GRCh38 (hg38): release_76/77/78/80/81/82.
        -cbd  [s]  the cytoBand database file. <required>
                   NOTE: Download this file from UCSC:
                           For hg18: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz
                           For hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
                           For hg38: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz
        -gf   [s]  the complete HGNC Gene Family dataset from http://www.genenames.org, text format. <required>
                   NOTE: At the official website, go to 'Download' page, find 'Complete dataset
                         download links'. right click the 'TXT' icon of 'HGNC Gene Family dataset',
                         and select 'save as'. You can firstly try this.
                         Currently (2016-01-08), the direct download-link is:
                         'http://www.genenames.org/cgi-bin/genefamilies/download-all/tsv'

      [supplementary data]

        -rft  [s]  tab-delimited list of refseg symbols relationship. <required>
                   NOTE: Generaly, the ref_seg symbols in GTP file (normally, the first column)
                         is different from that in reference file (the '-wg' parameter). So, you
                         should specify the corresponding relationship of refseg symbols between
                         GTF file and reference file. It should be noted that only the
                         refseg_symbol(s) listed in this file will be processed by SOAPfuse,
                         others (in gtf file) will be ignored.
                         Format:   refseg_symbol_in_gtf \t refseg_symbol_in_reference
                         Instance: 10 \t chr10
        -stc  [s]  the standard start codon sequence, forced to upper cases, can be used in
                   multiple times. [ATG]
                   NOTE: Although, to our knowledge, the start condon sequence ATG is extensive
                         in all species, we still provide this para for user to state other
                         sequences. Actually, some genes have different sequence, such as,
                         MYC-001 with CTG, TEAD4-001 with TTG, and some genes in mitochondria
                         and chloroplast. So, if you want to save these genes as 'protein-coding',
                         but not 'protein-coding-with-abnormal-start_codon', please state their
                         special start_codon sequences to take into account, like
                         ' -stc CTG -stc TTG '.
        -sor  [s]  select the ‘data source’ of gtf, can be used in multiple times.
                   Default to accept all sources.
                   NOTE: From v75, ensembl provides the source_info stored by 'gene_source' tag.
                         SOAPfuse will accept the transcript whose gene_source tag contains your
                         input. For instance, once you input 'havana', SOAPfuse will accept both
                         'havana' and 'ensembl_havana'. if you input 'ensembl', it means both
                         'ensembl' and 'ensembl_havana' will be accepted.

      [directory]

        -sd   [s]  SOAPfuse software package unpacked folder. <required>
                   NOTE: it tails as '/xxxxx/SOAPfuse-vx.xx/'.
        -dd   [s]  folder where you want to store these database files. <required>

        -h         Display this help info.

    Author: Wenlong Jia (wenlongkxm@gmail.com)
    V1.05 at 2016-01-08
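Two requirements in the help text are easy to trip over: the '-wg' FASTA must use "chr"-prefixed names, and the '-rft' file must map each GTF sequence name to its name in the reference. The sketch below shows one way to prepare both. The function names and file names are hypothetical and are not part of SOAPfuse. It assumes an Ensembl-style human GTF (1..22, X, Y, MT) and a UCSC-style reference (chr1..chr22, chrX, chrY, chrM); adjust the list if your files differ.

```shell
#!/bin/sh
# Sketch only: helper names and file names are hypothetical, not part of SOAPfuse.

# Print a "name_in_gtf <TAB> name_in_reference" mapping for human chromosomes.
# Assumes an Ensembl-style GTF (1..22, X, Y, MT) and a UCSC-style FASTA
# (chr1..chr22, chrX, chrY, chrM).
make_refseg_list() {
    i=1
    while [ "$i" -le 22 ]; do
        printf '%s\tchr%s\n' "$i" "$i"
        i=$((i + 1))
    done
    printf 'X\tchrX\n'
    printf 'Y\tchrY\n'
    printf 'MT\tchrM\n'
}

# List FASTA header lines whose sequence name lacks the "chr" prefix.
# Prints nothing when every header complies.
list_non_chr_headers() {
    grep '^>' "$1" | grep -v '^>chr' || true
}
```

For example, `make_refseg_list > refseg.list` writes a mapping that could be passed with '-rft', and `list_non_chr_headers genome.fa` should print nothing for a reference that meets the '-wg' requirement. Remember that, per the help text, sequences not listed in the '-rft' file are ignored.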
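As the help output shows, only a few public files are needed: the whole-genome FASTA file (-wg), the Ensembl GTF file (-gtf), the UCSC cytoBand file (-cbd), and the HGNC Gene Family dataset (-gf).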
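To make the required options concrete, here is a minimal sketch that assembles a full command line from the options marked <required> in the help text. Every path and file name is a hypothetical placeholder, and the command is only printed rather than executed so that it can be reviewed first. The help text does not say whether the cytoBand file must be decompressed, so the placeholder name makes no claim either way.

```shell
#!/bin/sh
# Sketch only: every path below is a hypothetical placeholder to replace with your own.
# The option names come from the help text; nothing else about the script is assumed.
SOAPFUSE_DIR="/PATH_WHERE_YOU_PUT_THE_PACKAGE/SOAPfuse-vX.X"
PUBLIC_DIR="/path/to/public_files"     # hypothetical folder holding the downloaded files
DB_DIR="/path/to/SOAPfuse_database"    # hypothetical folder for the generated database

# Print the database-building command instead of running it.
print_db_command() {
    printf 'perl %s/source/SOAPfuse-S00-Generate_SOAPfuse_database.pl' "$SOAPFUSE_DIR"
    printf ' -wg %s/genome.fa' "$PUBLIC_DIR"
    printf ' -gtf %s/annotation.gtf' "$PUBLIC_DIR"
    printf ' -cbd %s/cytoBand_file' "$PUBLIC_DIR"
    printf ' -gf %s/gene_family.txt' "$PUBLIC_DIR"
    printf ' -rft %s/refseg.list' "$PUBLIC_DIR"
    printf ' -sd %s/ -dd %s/\n' "$SOAPFUSE_DIR" "$DB_DIR"
}

print_db_command
```

Once the printed command looks right, it can be copied into the terminal and run. Optional settings such as '-stc' or '-sor' can be appended in the same way.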
Tips:
Wenlong Jia
01-19-2016