CUSCOquality Wiki

Brought to you by: filwierz, fschwarz, rokofler

FlankDesign

Authors:

Introduction

This walkthrough shows how to obtain flanking sequences of certain regions that can be used by CUSCO.

Requirements

For this walkthrough you need to install the following

bwa (alignment alogrithm)
samtools

Preparatory work

The approach requires a reference genome with annotations of your regions of interest.
In this walkthrough, we use the reference genome of A. thaliana TAIR10 (genbank assembly accession: GCA_000001735.1) and annotations of 10 KEEs (DOI: https://doi.org/10.1016/j.molcel.2014.07.009) in bed format.
Both input file can be found here: https://sourceforge.net/projects/cuscoquality/files/Walkthrough/FlankDesign/

Note that names of the reference fasta and annotations (e.g. piRNA cluster IDs) are required not to contain dashes since "-" is reserved as field separator in the step "Converting the sam file...".

Obtaining annotations of flanks

Based on piRNA cluster annotations, we create a file that contains the information on the flanks using the script flankbeder.sh.
For each piRNAc cluster, it writes entries of 5 1kb-flanks at both ends.

bash flankbeder.sh KEEs.bed .

Obtaining the sequences of flanks using samtools

bash flankparser.sh IDs flanks.bed TAIR10.fasta flank-fasta

Aligning flanks back to the reference genome using bwa

bwa index TAIR10.fasta
bwa bwasw TAIR10.fasta resources/flanks.fasta > resources/TAIR10.sam

Converting the sam file to a custom format and filtering for unique flanks (min. mq 5)

cat resources/TAIR10.sam|grep -v '^@'|awk '$5>4'|awk '{print $1,$2,$3,$4,$5}'|awk -F'[+-]' '{print $1,$2,$3,$4,$5,$6,$0}'|awk '{print $1,$2,$3,$4,$5,$6,$7}'|grep '\-l\|\+r' > resources/TAIR10.mod

Validating flanks using flank_validation.py

python flank_validation.py --bed KEEs.bed --modsam resources/TAIR10.mod --inner 100 --outer 5000 > resources/TAIR10.validated.tmp

Generating the cluster definition file for cusco

cat resources/TAIR10.validated.tmp|grep -v '<class'|awk '{print $1,$0}'|sort -u -k1,1|awk '{print $2,$14,$10,$11,$8,$9,$9+1000,"+",$13,$12,$12+1000,"+"}' > resources/cluster-definition-file

Now, we use the files resources/flanks.fasta and resources/cluster-definition-file to run perform the cusco. The files should be identical to those provided here: https://sourceforge.net/projects/cuscoquality/files/Walkthrough/FlankDesign/output/
md5sum:
flank.fasta 0052aa4cc2012b389a5004918065890b
cluster-definition-file 31100240013fa20b1be37f0367ae617c

Wiki: Home
Wiki: Manual