Menu

piRNAclusterComparison-walkthrough

Filip Wierzbicki
Attachments
5.svg (75594 bytes)

Introduction

Here, we run the easyfig pipeline for cluster 38C on assemblies of chrosome arm 2L of three * D. melanogaster * strains (Iso-1, Canton-S and A4).

The pipeline is based on a modified version of easyfig:
Sullivan MJ, Petty NK, Beatson SA. Easyfig: a genome comparison visualizer. Bioinformatics. 2011 Apr 1;27(7):1009-10. doi: 10.1093/bioinformatics/btr039. Epub 2011 Jan 28. PMID: 21278367; PMCID: PMC3065679.

Requirements

  • python3 and python2.7
  • Samtools to extract and process sequence files
  • Blast to obtain sequence similarities
  • Picard to process sequence files
  • Repeatmasker to annotate TEs

Required input files

The input file for this walkthrough can be downloaded here: https://sourceforge.net/projects/manna/files/pirnaclustercomparison/walkthrough/

  • genome assembly in fasta format with the suffix '.fasta'
  • piRNA cluster annotations in bed format. For more information, please visit https://sourceforge.net/projects/cuscoquality/ .
  • piRNA cluster names in a 1-column file one per line that want to be compared (e.g. with a unix join function on the 4th column of the bed files)

Further required resource files can downloaded here: https://sourceforge.net/projects/manna/files/pirnaclustercomparison/resources/

  • TE_dmel.fasta ( D. melanogaster TE consensus sequences)
  • TE_dmel_3-names (alternative TE names and abbreviations)
  • new_dmel_132cons_hier_4letcod (for information on the order (TIR, LTR, non-LTR) of TE families)

Pipeline

Preparatory work

We need to set a few directories and move files so scripts can work.

We make a several new directories in your working directory in this way:

mkdir fasta
mkdir cluster_bed
mkdir cluster-fasta
mkdir cluster-rm
mkdir cluster-fasta/polyN-added
mkdir adding_lengths
mkdir blast
mkdir genbank
mkdir easyfig
mkdir easyfig/merged
mkdir resources

Then, we move the input files in the desired directory:

mv Iso1.fasta A4.fasta Canton-S.fasta fasta
mv Iso1_cluster.bed A4_cluster.bed Canton-S_cluster.bed cluster_bed
mv shared_clusterID.txt .
mv TE_dmel.fasta TE_dmel_3-names new_dmel_132cons_hier_4letcod resources

cluster_parser.sh

This script extracts the sequences of piRNA clusters using samtools. Note that the 6th column of bed files is required to account for reverse complement assembled sequences.

cd fasta
for i in *fasta;do n=${i%.fasta};bash pirnaclustercomparison/cluster_parser.sh ../shared_clusterID.txt ../cluster_bed/${n}_cluster.bed $i ../cluster-fasta;done

Repeatmasker

We need to annotate TEs in each piRNA cluster sequence and process the annotations

cd ../cluster-fasta
for i in *fasta;do RepeatMasker -frag 10000000 -pa 20 -no_is -s -nolow -dir ../cluster-rm -lib ../resources/TE_dmel.fasta $i;done
cd ../cluster-rm
for i in *.fasta.out;do n=${i%.fasta.out};cat $i|awk 'NR>3'|awk '{print $5,$6,$7,$10,$1,$9,$2}'|awk '{gsub("C","-",$6)} OFS=" "'|sort -k 4|join -a 1 -1 4 -2 1 - ../resources/TE_dmel_3-names|awk '{print $2,$3,$4,$8,$5,$6,$7}' > ${n}.rm;done

cluster_blaster.sh

This script uses Blast to obtain sequence similarities between clusters.

cd ../cluster-fasta
for i in *fasta;do makeblastdb -in $i -dbtype nucl;done
bash pirnaclustercomparison/cluster_blaster.sh ../blast ../shared_clusterID.txt Iso1 Canton-S A4

add_polyN.sh

This script adds polyN to the end of the sequence to account for difference in cluster length between assemblies which is important for equal scaling during visualization.

obtaining the longest variants

First, we need to obtain a two column with the longest variant of each cluster (separated by a space: cluster name; longest length in base pairs)

cd ../cluster_bed
for i in *bed; do n=${i%_cluster.bed}; cat $i|awk '{print $4, $3-$2}'|sort -k 1 > ../adding_lengths/${n}.lengths.sort;done
cd ../adding_lengths
join -1 1 -2 1  Iso1.lengths.sort Canton-S.lengths.sort|awk '{if($2>=$3) print $1,$2;else print $1,$3}'|join -1 1 -2 1 - A4.lengths.sort|awk '{if($2>=$3) print $1,$2;else print $1,$3}' > longest.cluster

adding polyNs

Next, we can run add_polyN.sh which uses Picard for formatting

cd ../cluster-fasta
bash pirnaclustercomparison/add_polyN.sh ../adding_lengths/longest.cluster polyN-added

genbanker.sh

This script uses genbankmaker.py to store the cluster sequence and corresponding TE annotations in genbank format required for easyfig.

cd polyN-added
bash pirnaclustercomparison/genbanker.sh pirnaclustercomparison/genbankmaker.py ../../shared_clusterID.txt Iso1 Canton-S A4

easyfigurer_pairwise.sh

This script uses easyfig_Mod.py to draw pairwise comparisons of assemblies.

cd ../..
bash pirnaclustercomparison/easyfigurer_pairwise.sh pirnaclustercomparison/Easyfig_Mod.py easyfig shared_clusterID.txt Iso1 Canton-S A4

merger_easyfig.sh

This script merges the pairwise comparisons into a combined figure

cd easyfig
bash pirnaclustercomparison/merger_easyfig.sh pirnaclustercomparison/svg_merger.py ../shared_clusterID.txt Iso1 Canton-S A4

Results

The resulting figure can be found in easyfig/merged.
It shows the comparison of cluster 38C between Iso-1, Canton-S and A4.
We can see that most of the cluster is similar between the three strains.


Related

Wiki: Home

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.