Read Me
First, to use jip, export the following environment variable:
export LD_LIBRARY_PATH=/software/so/el6.3/PythonPackages-2.7.6/lib:$LD_LIBRARY_PATH
Example commands (run any of the jip scripts with -h for a further description of its options):
RUNNING JITTERBUG:
~/jitterbug/jitterbug-code/jip_scripts/jitterbug.jip --nsorted sample.nsorted.bam.bam --psorted sample.psorted.bam.bam -t TE_annot.gff3 -l sample -o sample -d 2 -n Name -c 4 -b 50000000 -q 15
this will output the following files:
sample.d2.TE_insertions_paired_clusters.gff3 // predicted insertions in gff3 format, with the tags in the 9th column describing the characteristics of the prediction
sample.d2.TE_insertions_paired_clusters.supporting_clusters.table // table describing the insertions, clusters and reads that correspond to the above gff. Here there is more detailed information, including the sequences of the reads.
// this table is useful if you want to extract and assemble the TE mate reads to design primers to verify the insertions
sample.filter_config.txt // config file which can be used for filtering, generated based on the characteristics of the sequencing library and with "reasonable defaults". Described further below.
sample.read_stats.txt // file describing the fragment length and its sdev, and the read length and its sdev, for the library, as evaluated on the first million properly mapped read pairs
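To work with the predictions programmatically, the tags in the 9th column can be split into a dictionary. The following is a minimal sketch, assuming the 9th column follows the usual GFF3 `key=value;key=value` layout. The function names and the tag names in the example record (`tag_a`, `tag_b`), as well as its source and type fields, are hypothetical placeholders, not the tags Jitterbug actually writes.

```python
def parse_gff3_attributes(column9):
    """Turn a 'key=value;key=value' string into a dict of strings."""
    tags = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, sep, value = field.partition("=")
        tags[key.strip()] = value.strip() if sep else ""
    return tags


def read_insertions(lines):
    """Yield (seqid, start, end, tags) for each feature line of a GFF3 file."""
    for line in lines:
        # Skip blank lines and GFF3 comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        yield cols[0], int(cols[3]), int(cols[4]), parse_gff3_attributes(cols[8])


# Hypothetical record: the tag names are placeholders for illustration only.
example = "chr1\tjitterbug\tTE_insertion\t1000\t1050\t.\t.\t.\ttag_a=4;tag_b=LINE_L1"
for seqid, start, end, tags in read_insertions([example]):
    print(seqid, start, end, tags)
```

In practice you would pass an open file handle for sample.d2.TE_insertions_paired_clusters.gff3 instead of the one-element list.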
FILTERING RESULTS:
~/jitterbug/jitterbug-code/jip_scripts/filter.jip -g sample.d2.TE_insertions_paired_clusters.gff3 --config sample.filter_config.txt
this takes:
-g gff3 formatted file of insertion predictions
-c config file of the format:
cluster_size 2 108 // min and max cluster size. reasonable defaults are 2 - 5*coverage
span 2 275 // min and max span (max distance between start points of two reads in a cluster). Span of 0 means reads are stacked. reasonable defaults are 5 - fragment length
int_size 92 464 // min and max interval size. reasonable defaults are fragment_length - 2*(fragment_length) - read_length
softclipped 2 108 // min and max number of softclipped reads. If you have low coverage (less than 20), don't set this, as you cannot expect to have softclipped reads even in correct predictions.
// otherwise, reasonable default is same as cluster_size
pick_consistent 0 -1 // whether to pick consistent inserted TE or not. values are [min,max) indices of tokens to consider, if you split the TE names by "_"
// 0,-1 will take whole string (python-style indexing: -1 is last element of list)
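The config format and the pick_consistent token selection described above can be sketched as follows. This is only an illustration under stated assumptions: that each config line is `name min max` separated by whitespace, that anything after `//` is a comment, and that a max of -1 means "up to and including the last token" (so that 0,-1 keeps the whole name). The function names and the TE name `LINE_L1_HS` are hypothetical and are not taken from the Jitterbug code.

```python
def parse_filter_config(text):
    """Map each parameter name to its (min, max) pair of integers."""
    config = {}
    for raw in text.splitlines():
        # Assumption: '//' starts a comment, as in the annotated listing above.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        name, low, high = line.split()
        config[name] = (int(low), int(high))
    return config


def select_name_tokens(te_name, low, high):
    """Keep tokens [low, high) of a TE name split on '_'.

    Assumption: high == -1 means 'through the last token'.
    """
    tokens = te_name.split("_")
    chosen = tokens[low:] if high == -1 else tokens[low:high]
    return "_".join(chosen)


example_config = """
cluster_size 2 108
span 2 275
int_size 92 464
softclipped 2 108
pick_consistent 0 -1
"""
config = parse_filter_config(example_config)
low, high = config["pick_consistent"]
print(config)
print(select_name_tokens("LINE_L1_HS", low, high))
```

With a narrower range such as 0,2, the same hypothetical name would be compared on its first two tokens only, which is one way to group TE names that differ only in their suffix.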
the output is a gff3 formatted file, with the same basename as the input gff file and suffixed with the values of the parameters specified in the config file
for example, the above call and conf file generate a file of the name:
sample.d2.TE_insertions_paired_clusters.clust2_108.span2_275.int92_464.soft2_108.cons(0,-1).gff3
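If you need to predict the name of the filtered file in a script, the naming pattern can be reconstructed from the example above. The sketch below is inferred from that single file name, so the suffix order and format are an assumption rather than a guarantee of what filter.jip does; the function name `filtered_output_name` is hypothetical.

```python
def filtered_output_name(gff_path, config):
    """Build the filtered file name from the input gff name and config values.

    `config` maps each parameter name to a (min, max) tuple of integers.
    """
    base = gff_path
    # Drop the input extension so the suffixes are appended to the basename.
    for ext in (".gff3", ".gff"):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    parts = [
        "clust%d_%d" % config["cluster_size"],
        "span%d_%d" % config["span"],
        "int%d_%d" % config["int_size"],
        "soft%d_%d" % config["softclipped"],
        "cons(%d,%d)" % config["pick_consistent"],
    ]
    return ".".join([base] + parts) + ".gff3"


example_values = {
    "cluster_size": (2, 108),
    "span": (2, 275),
    "int_size": (92, 464),
    "softclipped": (2, 108),
    "pick_consistent": (0, -1),
}
print(filtered_output_name("sample.d2.TE_insertions_paired_clusters.gff3", example_values))
```

With the values from the config listing, this reproduces the file name quoted above.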
COMPARING TUMOR/NORMAL PAIR:
here you supply the gff files for raw and filtered results for the ND (normal) and the TD (tumor) samples, as well as the position-sorted bam file for the ND sample
~/jitterbug/jitterbug-code/jip_scripts/process_ND_TD.jip --T CLL_043TD.TE_insertions_paired_clusters.gff --N CLL_043.TE_insertions_paired_clusters.gff --TF CLL_043TD.TE_insertions_paired_clusters.gff_i100_I900_p2_P500_s5_S500_c2_C500_f00.gff3 --NF CLL_043.TE_insertions_paired_clusters.gff_i100_I900_p2_P500_s5_S500_c2_C500_f00.gff3 -b CLL_043.psorted.bam -l testCLL -s CLL_043.read_stats.txt
the outputs are:
- pdf venn diagrams of the intersections of:
    - ND and TD both filtered (Nf and Tf)
    - ND and TD both unfiltered (N and T)
    - TD filtered and ND unfiltered (Tf and N)
- a gff file annotating the insertions present in Tf but not in N
- a table that consists of the same gff annotations followed by a few columns describing the sequence context (+/- 100 bp) surrounding the insertion interval (note: not the softclipped position): counts of repetitive reads, discordant reads, etc. (the header line of the file explains these columns)
  the last column is a flag: PUTATIVE-SOM if less than 50% of the reads in that interval are not discordant (meaning it might be a somatic insertion in the tumor), and NON-SOM otherwise
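The "present in Tf but not in N" comparison can be pictured as an interval set difference. The sketch below is only an illustration: it assumes that two insertions count as the same event when their intervals lie on the same sequence and overlap by at least one base, which may not be the criterion process_ND_TD.jip actually uses. The function names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (seqid, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def tumor_only(tumor_filtered, normal_raw):
    """Return tumor insertions with no overlapping insertion in the normal set."""
    return [
        t for t in tumor_filtered
        if not any(overlaps(t, n) for n in normal_raw)
    ]


# Hypothetical coordinates, for illustration only.
tf = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
n = [("chr1", 150, 250), ("chr3", 100, 200)]
print(tumor_only(tf, n))
```

Comparing the filtered tumor set against the unfiltered normal set is the conservative choice described above: an insertion is only kept as tumor-specific if the normal sample shows no prediction there at all, even a low-confidence one.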
#######################################################
These steps are put together in two pipelines, one that combines running jitterbug and filtering:
jitterbug_filter_pipeline.jip
and one that combines running predictions on tumor and normal sets, filtering them and comparing them:
ND_TD_jitterbug_filter_compare_pipeline.jip
#######################################################