Here, we run the easyfig pipeline for cluster 38C on assemblies of chrosome arm 2L of three * D. melanogaster * strains (Iso-1, Canton-S and A4).
The pipeline is based on a modified version of easyfig:
Sullivan MJ, Petty NK, Beatson SA. Easyfig: a genome comparison visualizer. Bioinformatics. 2011 Apr 1;27(7):1009-10. doi: 10.1093/bioinformatics/btr039. Epub 2011 Jan 28. PMID: 21278367; PMCID: PMC3065679.
The input file for this walkthrough can be downloaded here: https://sourceforge.net/projects/manna/files/pirnaclustercomparison/walkthrough/
Further required resource files can downloaded here: https://sourceforge.net/projects/manna/files/pirnaclustercomparison/resources/
We need to set a few directories and move files so scripts can work.
We make a several new directories in your working directory in this way:
mkdir fasta
mkdir cluster_bed
mkdir cluster-fasta
mkdir cluster-rm
mkdir cluster-fasta/polyN-added
mkdir adding_lengths
mkdir blast
mkdir genbank
mkdir easyfig
mkdir easyfig/merged
mkdir resources
Then, we move the input files in the desired directory:
mv Iso1.fasta A4.fasta Canton-S.fasta fasta
mv Iso1_cluster.bed A4_cluster.bed Canton-S_cluster.bed cluster_bed
mv shared_clusterID.txt .
mv TE_dmel.fasta TE_dmel_3-names new_dmel_132cons_hier_4letcod resources
This script extracts the sequences of piRNA clusters using samtools. Note that the 6th column of bed files is required to account for reverse complement assembled sequences.
cd fasta
for i in *fasta;do n=${i%.fasta};bash pirnaclustercomparison/cluster_parser.sh ../shared_clusterID.txt ../cluster_bed/${n}_cluster.bed $i ../cluster-fasta;done
We need to annotate TEs in each piRNA cluster sequence and process the annotations
cd ../cluster-fasta
for i in *fasta;do RepeatMasker -frag 10000000 -pa 20 -no_is -s -nolow -dir ../cluster-rm -lib ../resources/TE_dmel.fasta $i;done
cd ../cluster-rm
for i in *.fasta.out;do n=${i%.fasta.out};cat $i|awk 'NR>3'|awk '{print $5,$6,$7,$10,$1,$9,$2}'|awk '{gsub("C","-",$6)} OFS=" "'|sort -k 4|join -a 1 -1 4 -2 1 - ../resources/TE_dmel_3-names|awk '{print $2,$3,$4,$8,$5,$6,$7}' > ${n}.rm;done
This script uses Blast to obtain sequence similarities between clusters.
cd ../cluster-fasta
for i in *fasta;do makeblastdb -in $i -dbtype nucl;done
bash pirnaclustercomparison/cluster_blaster.sh ../blast ../shared_clusterID.txt Iso1 Canton-S A4
This script adds polyN to the end of the sequence to account for difference in cluster length between assemblies which is important for equal scaling during visualization.
First, we need to obtain a two column with the longest variant of each cluster (separated by a space: cluster name; longest length in base pairs)
cd ../cluster_bed
for i in *bed; do n=${i%_cluster.bed}; cat $i|awk '{print $4, $3-$2}'|sort -k 1 > ../adding_lengths/${n}.lengths.sort;done
cd ../adding_lengths
join -1 1 -2 1 Iso1.lengths.sort Canton-S.lengths.sort|awk '{if($2>=$3) print $1,$2;else print $1,$3}'|join -1 1 -2 1 - A4.lengths.sort|awk '{if($2>=$3) print $1,$2;else print $1,$3}' > longest.cluster
Next, we can run add_polyN.sh which uses Picard for formatting
cd ../cluster-fasta
bash pirnaclustercomparison/add_polyN.sh ../adding_lengths/longest.cluster polyN-added
This script uses genbankmaker.py to store the cluster sequence and corresponding TE annotations in genbank format required for easyfig.
cd polyN-added
bash pirnaclustercomparison/genbanker.sh pirnaclustercomparison/genbankmaker.py ../../shared_clusterID.txt Iso1 Canton-S A4
This script uses easyfig_Mod.py to draw pairwise comparisons of assemblies.
cd ../..
bash pirnaclustercomparison/easyfigurer_pairwise.sh pirnaclustercomparison/Easyfig_Mod.py easyfig shared_clusterID.txt Iso1 Canton-S A4
This script merges the pairwise comparisons into a combined figure
cd easyfig
bash pirnaclustercomparison/merger_easyfig.sh pirnaclustercomparison/svg_merger.py ../shared_clusterID.txt Iso1 Canton-S A4
The resulting figure can be found in easyfig/merged.
It shows the comparison of cluster 38C between Iso-1, Canton-S and A4.
We can see that most of the cluster is similar between the three strains.