SOAPfuse Wiki

a tool for identifying fusion transcripts from paired-end RNA-Seq data

Brought to you by: jwl890427

Preparation

Attachments

RNA-Seq.directory.structure.for.SOAPfuse.jpg (13234 bytes)

Prepare sample list

Prepare a list file for your samples in the four-column format below.

    1              2                        3           4
    [sample_ID]    [sequence_library_ID]    [run_ID]    [read_length]

For example, sample A has RNA-Seq data from three runs: Run-a and Run-b are from the same library (Lib-a, insert size 300 nt), and Run-c is from a different library (Lib-b, insert size 170 nt). The reads of Run-a and Run-c are PE90 (paired-end, 90 nt), while those of Run-b are PE100. The list file looks like this:

    A   Lib-a   Run-a   90
    A   Lib-a   Run-b   100
    A   Lib-b   Run-c   90

Note:
* Each line describes one run.
If you have N runs for one sample, write N lines: one run, one line.
* Prepare one list per sample if you want to analyze samples in parallel.
SOAPfuse takes one sample list per invocation, so for N samples, prepare N list files and run SOAPfuse N times to analyze all samples in parallel.
* Insert size is not required.
The insert size provided by the user is often inaccurate, so it is not part of the sample list.
Instead, SOAPfuse estimates the actual insert size itself within the pipeline.
* Different read lengths are allowed.
a. If the RNA-Seq data of one sample come from several runs with different read lengths,
   SOAPfuse distinguishes them so that its calculations remain accurate.
b. If the read lengths of the /1 end and the /2 end differ within one run (uncommon), give both
   lengths separated by a slash. For example, sample A has another run (Run-d from Lib-a) in
   which the /1 end is 80 nt and the /2 end is 90 nt. The sample list can then be written like this:

    A   Lib-a   Run-a   90
    A   Lib-a   Run-b   100
    A   Lib-b   Run-c   90
    A   Lib-a   Run-d   80/90

Equivalently, you can give both read lengths explicitly for every run:

    A   Lib-a   Run-a   90/90
    A   Lib-a   Run-b   100/100
    A   Lib-b   Run-c   90/90
    A   Lib-a   Run-d   80/90
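The two read-length notations above ('90' and '80/90') can be normalized when reading a sample list. The following is a minimal sketch, not SOAPfuse's own parser: the function names `parse_read_length` and `parse_sample_list` are hypothetical, and it assumes the four columns are separated by whitespace, which the format description does not state explicitly.

```python
def parse_read_length(field):
    """Return (length_of_end_1, length_of_end_2) for '90' or '80/90'."""
    parts = field.split("/")
    if len(parts) == 1:
        # Single value: both ends have the same length.
        n = int(parts[0])
        return (n, n)
    if len(parts) == 2:
        return (int(parts[0]), int(parts[1]))
    raise ValueError(f"unexpected read_length field: {field!r}")


def parse_sample_list(text):
    """Parse sample list text into one dict per run (one run per line)."""
    runs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        sample_id, lib_id, run_id, read_length = fields
        runs.append({
            "sample_id": sample_id,
            "lib_id": lib_id,
            "run_id": run_id,
            "read_length": parse_read_length(read_length),
        })
    return runs


example = """\
A   Lib-a   Run-a   90
A   Lib-a   Run-b   100
A   Lib-b   Run-c   90
A   Lib-a   Run-d   80/90
"""
runs = parse_sample_list(example)
for run in runs:
    print(run["run_id"], run["read_length"])
```

With this normalization, both ways of writing the list shown above yield the same pair of lengths for each run.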

* Sample list for somatic mode.

For example, suppose you are studying kidney cancer, and the tumor sample (K101-T, with Run-n RNA-Seq data [PE90] from library Lib-n) and the control sample (K101-N, with Run-m RNA-Seq data [PE100] from library Lib-m) come from the same patient (patient ID K101). To detect somatic fusion transcripts with SOAPfuse in somatic mode, write the information of both K101-T and K101-N in one sample list, like this:

    K101-T   Lib-n   Run-n   90
    K101-N   Lib-m   Run-m   100

SOAPfuse distinguishes the tumor sample from the control sample by the postfix of the sample ID. State these postfixes via the parameter 'PA_all_postfix_of_tissue' in the config file, for example '-T' for the tumor sample and '-N' for the control sample. You must also set the config parameter 'PA_all_somatic_mode' to 'yes' to enable somatic mode.
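The postfix convention can be illustrated with a short sketch that groups sample IDs by patient. This only demonstrates the naming scheme; it is not SOAPfuse's actual matching logic, and the function name `pair_by_postfix` and its arguments are hypothetical. The exact value syntax of 'PA_all_postfix_of_tissue' is not shown here either.

```python
from collections import defaultdict


def pair_by_postfix(sample_ids, tumor_postfix="-T", control_postfix="-N"):
    """Group sample IDs by patient ID using the tissue postfix of each ID."""
    pairs = defaultdict(dict)
    for sample_id in sample_ids:
        if sample_id.endswith(tumor_postfix):
            patient_id = sample_id[:-len(tumor_postfix)]
            pairs[patient_id]["tumor"] = sample_id
        elif sample_id.endswith(control_postfix):
            patient_id = sample_id[:-len(control_postfix)]
            pairs[patient_id]["control"] = sample_id
        else:
            raise ValueError(f"sample ID without a known postfix: {sample_id!r}")
    return dict(pairs)


pairs = pair_by_postfix(["K101-T", "K101-N"])
print(pairs)
```

A sample ID that ends in neither postfix raises an error in this sketch, which is a convenient way to catch naming mistakes before a run.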

Prepare RNA-Seq data

The RNA-Seq data files (fastq or fasta format is required) must be stored in a specific directory structure that follows the sample list file described above.

Follow these five requirements to construct the directories that store the RNA-Seq data:
- a. A master directory stores all RNA-Seq data files in its sub-directories.
  We call it 'WHOLE_SEQ-DATA_DIR'.
- b. Use the sample_ID to name the sub-directories of WHOLE_SEQ-DATA_DIR.
  We call them 'SAMPLE_DIR'.
- c. Use the sequence_library_ID to name the sub-directories of SAMPLE_DIR.
  We call them 'LIB_DIR'.
- d. RNA-Seq data (fastq/fasta) files are stored in LIB_DIR with their Run_ID as the file prefix.
  Because SOAPfuse deals with paired-end reads, the prefix must also carry the serial
  number of the read end, as in 'Run_ID_1' and 'Run_ID_2'. Note that '_1' and '_2' are the required format, and they must be separated from the PostFix by one dot '.'.
- e. There are no requirements for the PostFix of the RNA-Seq data files.
  Generally, we use 'fq.gz' (fastq) or 'fa.gz' (fasta). Read files are always stored in
  compressed format (gz). In any case, the PostFix must be stated via the parameter 'PA_all_fq_postfix'
  in the config file.
For example, the RNA-Seq data files (fastq) of sample A from the sample list example above are stored as illustrated in the attached image RNA-Seq.directory.structure.for.SOAPfuse.jpg.
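The layout implied by requirements a to e can also be sketched in text by deriving the expected file paths from the sample list. This is an illustration under stated assumptions: the helper name `expected_read_files` is hypothetical, the PostFix 'fq.gz' is just the common choice mentioned above, and 'WHOLE_SEQ-DATA_DIR' stands in for your real master directory.

```python
from pathlib import PurePosixPath


def expected_read_files(root, sample_id, lib_id, run_id, postfix="fq.gz"):
    """Return the two paired-end file paths for one run.

    Layout: <root>/<sample_ID>/<sequence_library_ID>/<run_ID>_<1|2>.<PostFix>
    """
    lib_dir = PurePosixPath(root) / sample_id / lib_id
    return [str(lib_dir / f"{run_id}_{end}.{postfix}") for end in (1, 2)]


# Runs of sample A from the sample list example (sample, library, run).
sample_a_runs = [
    ("A", "Lib-a", "Run-a"),
    ("A", "Lib-a", "Run-b"),
    ("A", "Lib-b", "Run-c"),
]

paths = []
for sample_id, lib_id, run_id in sample_a_runs:
    paths.extend(expected_read_files("WHOLE_SEQ-DATA_DIR", sample_id, lib_id, run_id))

for path in paths:
    print(path)
```

Checking that each of these paths exists before launching the pipeline is a cheap way to catch a misnamed directory or a missing '_1'/'_2' file.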
