Menu

Get started

Serghei Mangul

For a quick start, a toy example with 2508 unmapped reads is distributed with the ROP package. Please note that selected reads are randomly selected from a normal skin (SRR1146076) RNA-Seq sample and might not represent the typical reads of RNA-Seq experiment. The reads are provided for demonstration purposes and can be accessed under ROP directory. Instruction for the analysis of the full RNA-Seq sample (SRR1146076) are provided here.

/rop/example/unmappedExample.fastq

Please make sure that the basic unix commands (wget, python, perl) are available on the cluster. Please install ROP first. The instructions how to install ROP are provided here

The first operation consists of navigating to ROP directory and creating a subdirectory for storing the training data.

cd rop
mkdir tutorial
cd tutorial
mkdir data

Now, download the mapped and unmapped reads from RNA-Seq

wget https://googledrive.com/host/0B_NUyiE86yDwaUxoVjhlSjN5SkE/skinExample.tar
tar -xvf skinExample.tar

ROP is an intensive pipeline requiring substantial amount of computations resources. Thus we don't recommend to run ROP from login nodes. Please check the policy of you cluster, from where to run the ROP pipeline. For hoffman2 (UCLA cluster) read the policy [here] (http://ccn.ucla.edu/wiki/index.php/Hoffman2:Interactive_Sessions).

ROP requires two mandatory command line arguments, i.e. (1) the unmapped reads and (2) the directory to save the results of ROP.

usage: python rop.py [-h] [--qsub] [--qsubArray] [--b] [--skipLowq] [--skipQC]
                     [--circRNA] [--immune] [--gzip] [--quiet] [--dev]
                     [--license]
                     unmappedReads dir

To test ROP for the small training data use the following command under the ROP directory, where results will be saved to example/ropOut/ directory

python rop.py example/unmappedExample.fastq example/ropOut/

You expect the following output of the ROP pipeline:

*********************************************
ROP (version 1.0.1) is a computational protocol aimed to discover the source of all reads, originated from complex RNA molecules, recombinant antibodies and microbial communities. Written by Serghei Mangul (smangul@ucla.edu) and Harry Taegyun Yang (harry2416@gmail.com), University of California, Los Angeles (UCLA). (c) 2016. Released under the terms of the General Public License version 3.0 (GPLv3)

For more details see:
https://sergheimangul.wordpress.com/rop/
*********************************************
Processing 2510 unmapped reads

1. Quality Control...
--filtered 2193 low quality reads
--filtered 2 low complexity reads (e.g. ACACACAC...)
--filtered 22 rRNA reads
In toto : 2217 reads failed QC and are filtered out
2. Remaping to human references...
--identified 6 lost human reads from unmapped reads. Among those: 4 reads with 0 mistmathes; 2 reads with 1 mistmath; 0 reads with 2 mistmathes
***Note: Complete list of lost human reads is available from sam files: /u/home/s/serghei/code2/rop/example/test12/lostHumanReads/unmappedExample_genome.sam,/u/home/s/serghei/code2/rop/example/test12/lostHumanReads/unmappedExample_transcriptome.sam
3. Maping to repeat sequences...
-Identify 1 lost repeat sequences from unmapped reads
***Note : Repeat sequences classification into classes (e.g. LINE) and families (e.g. Alu) will be available in next release
4. Non-co-linear RNA profiling
***Note : Trans-spicing and gene fusions  are currently not supported, but will be in the next release.
--identified 2 reads from circRNA
***Note: circRNAs detected by CIRI are available here: unmappedExample_circRNA.csv
5a. B lymphocytes profiling...
--identified 1 reads mapped to immunoglobulin heavy (IGH) locus
--identified 0 reads mapped to immunoglobulin kappa (IGK) locus 
--identified 1 reads mapped to immunoglobulin lambda (IGL) locus
5b. T lymphocytes profiling...
--identified 0 reads mapped to T cell receptor alpha (TCRA) locus
--identified 2 reads mapped to T cell receptor beta (TCRB) locus
--identified 1 reads mapped to T cell receptor delta (TCRD) locus
--identified 0 reads mapped to T cell receptor gamma locus (TCRG) locus
In toto : 5 reads mapped to antibody repertoire loci
***Note : Combinatorial diversity of the antibody repertoire (recombinations of the of VJ gene segments)  will be available in the next release.
6.  Microbiome profiling...
--identified 0 reads mapped bacterial genomes
--identified 0 reads mapped viral genomes
--identified 4 reads mapped ameoba genomes
--identified 1 reads mapped crypto genomes
--identified 0 reads mapped giardia genomes
--identified 0 reads mapped microsporidia genomes
--identified 0 reads mapped piroplasma genomes
--identified 1 reads mapped plasmo genomes
--identified 1 reads mapped toxo genomes
--identified 0 reads mapped trich genomes
--identified 0 reads mapped tritryp genomes
In toto : 7 reads mapped to microbial genomes
Summary:   The ROP protocol is able to account for 2238 reads
***Unaccounted reads (not explained by ROP) are saved to /u/home/s/serghei/code2/rop/example/test12/unmappedExample_unaccountedReads.fasta

The ropOut directory now contains the output of ROP. The structure of the output is explained here. For example it contains /antibodyProfile/ directory with reads spanning antigen receptor gene rearrangement in the variable domain being identified by IgBLAST.

Genomic profile of RNA-Seq

To get the genomic profile of the mapped reads use gprofile.py. To get the genomic profile of the toy bam file (reads from chr22) use the following command:

python gprofile.py example/mappedReads_chr22.bam /example/mappedReads_chr22.csv

The output of the module is number of reads assigned to each genomic category saved into the /example/mappedReads_chr22.csv

sampleName,nTotalMapped,nJunction,nCDS,nUTR3,nUTR5,nUTR_,nIntron,nIntergenic,nDeep,nMT,nMultiMapped
mappedReads,397134,129580,101541,96210,7457,22473,19420,3084,649,0,16720

You can use /example/mappedReads_chr22.csv to create pie chart. The pie chart corresponding to /example/mappedReads_chr22.csv is presented bellow:

Genomic profile of toy .bam file

Read more about Genomic Profile of RNA-Seq here.

Profile of repeat elements

python rprofile.py example/mappedReads_chr22.bam example/mappedReads_chr22_repeatProfile

The ROP provided three levels of repeat profile, i.e class level (e.g SINE, LINE); family level (e.g. Alu, L1); gene level (e.g. L1P4c). The files contain relative proportions of repeat categories based on the number of reads from the category. More details about the repeat classes used by ROP are here.

Those the output files created for each level

Class level of classification (mappedReads_chr22_repeatProfilerepeatClass.csv):

sample,LINE?,LTR,Satellite,Retroposon,DNA,SINE?,RNA,DNA?,RC,LINE,SINE,LTR?
mappedReads_chr22,0,2445,3,6,568,0,0,0,0,2588,5356,0

The classes with the ? are provided by RepeatMasker. For example MER129 is classified as LTR?. You may merge LTR? with LTR or ignore those.

Those are the repeat classes with sufficient number of reads supporting the class. The repeat profile is presented as a table:

sample  LTR DNA LINE    SINE
mappedReads_chr22   2445    568 2588    5356

Alternatively you can visualize the repeat profile as a pie chart

Family level of classification (mappedReads_chr22_repeatProfilerepeatFamily.csv) is presented bellow. Each family is represented in the following format Class_Family allowing to retrieve the particular class the family belongs to. For example LINE____L1 corresponds to family element L1 from LINE class.

sample,SINE____Alu,DNA____TcMar-Tigger,LTR?____LTR,DNA____PiggyBac,DNA____hAT,Retroposon____SVA_E,LTR____ERV1,LINE____L1,LINE____L2,DNA____MuDR,DNA____TcMar,LINE?____Penelope,LTR____ERVL-MaLR,DNA____DNA,LTR____ERV,DNA____hAT-Charlie,RC____Helitron,DNA____hAT-Blackjack,SINE____MIR,Satellite____centr,DNA____Merlin,LTR____LTR,DNA____hAT-Tip100,RNA____RNA,SINE____Deu,Retroposon____SVA_D,LTR____Gypsy,Retroposon____SVA_F,Retroposon____SVA_A,LINE____CR1,Retroposon____SVA_C,Retroposon____SVA_B,DNA____TcMar-Mariner,LINE____RTE,LINE____RTE-BovB,DNA____TcMar-Tc2,LTR____ERVK,Satellite____acro,Satellite____telo,LTR____ERVL,Satellite____Satellite,SINE?____SINE,DNA?____DNA,SINE____SINE,LINE____Dong-R4
mappedReads_chr22,4502,123,0,1,16,0,859,1476,1009,0,2,0,943,2,0,349,0,4,849,2,0,0,50,0,1,2,3,0,3,93,0,1,21,10,0,0,151,0,0,489,1,0,0,4,0

Those are the repeat families with sufficient number of reads supporting the family. The repeat profile is presented as a table:

sample  LTR____ERV1 LTR____ERVL-MaLR    LTR____ERVK DNA____hAT-Charlie  DNA____TcMar-Tigger DNA____hAT-Tip100   LINE____L1  LINE____L2  SINE____Alu SINE____MIR
mappedReads_chr22   859 943 151 349 123 50  1476    1009    4502    849

Alternatively you can visualize the repeat profile as a pie chart

ROP also provides gene level resolution (mappedReads_chr22_repeatProfilerepeatGene.csv) , where the number of reads assigned to each repeat instance is reported. Each instance is represented in the following format Family__Class__Instance allowing to retrieve the particular class and family the repeat instance belongs to. For example L1____LINE____L1MD corresponds to element L1MD from L1 gamily of LINE class.


Related

Wiki: Home
Wiki: _Sidebar

MongoDB Logo MongoDB