This walkthrough shows how to obtain flanking sequences of certain regions that can be used by CUSCO.
For this walkthrough you need to install the following
The approach requires a reference genome with annotations of your regions of interest.
In this walkthrough, we use the reference genome of A. thaliana TAIR10 (genbank assembly accession: GCA_000001735.1) and annotations of 10 KEEs (DOI: https://doi.org/10.1016/j.molcel.2014.07.009) in bed format.
Both input file can be found here: https://sourceforge.net/projects/cuscoquality/files/Walkthrough/FlankDesign/
Note that names of the reference fasta and annotations (e.g. piRNA cluster IDs) are required not to contain dashes since "-" is reserved as field separator in the step "Converting the sam file...".
Based on piRNA cluster annotations, we create a file that contains the information on the flanks using the script flankbeder.sh.
For each piRNAc cluster, it writes entries of 5 1kb-flanks at both ends.
bash flankbeder.sh KEEs.bed .
bash flankparser.sh IDs flanks.bed TAIR10.fasta flank-fasta
bwa index TAIR10.fasta
bwa bwasw TAIR10.fasta resources/flanks.fasta > resources/TAIR10.sam
cat resources/TAIR10.sam|grep -v '^@'|awk '$5>4'|awk '{print $1,$2,$3,$4,$5}'|awk -F'[+-]' '{print $1,$2,$3,$4,$5,$6,$0}'|awk '{print $1,$2,$3,$4,$5,$6,$7}'|grep '\-l\|\+r' > resources/TAIR10.mod
python flank_validation.py --bed KEEs.bed --modsam resources/TAIR10.mod --inner 100 --outer 5000 > resources/TAIR10.validated.tmp
cat resources/TAIR10.validated.tmp|grep -v '<class'|awk '{print $1,$0}'|sort -u -k1,1|awk '{print $2,$14,$10,$11,$8,$9,$9+1000,"+",$13,$12,$12+1000,"+"}' > resources/cluster-definition-file
Now, we use the files resources/flanks.fasta and resources/cluster-definition-file to run perform the cusco. The files should be identical to those provided here: https://sourceforge.net/projects/cuscoquality/files/Walkthrough/FlankDesign/output/
md5sum:
flank.fasta 0052aa4cc2012b389a5004918065890b
cluster-definition-file 31100240013fa20b1be37f0367ae617c