Using the coverage and the number softclipped reads from the alignment of long reads to the assembly, we can assess the reliablility of the assembled content of piRNA clusters.
In addition to the CUSCO-quality scripts you need the following
We use minimap2 to align long reads the an assembly. For this walkthrough, we provide an example file of 100X Oxford Nanopore reads that align to 3 piRNA clusters and 3 BUSCO annotations ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/reads.fastq.gz ) and a genome assembly of the D. melanogaster strain Canton-S ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.fasta.gz ).
minimap2 -ax map-ont -t 20 genome.fasta.gz reads.fastq.gz > mapped_reads.sam
The mapping output will be sorted and converted into bam format.
samtools sort mapped_reads.sam -o mapped_reads.sort.bam
After aligning reads to the genome and converting the alignment from a sam to a sorted bam file, we can obtain coverages for piRNA clusters and reference genes using a script from our CUSCOquality package. In addition to the required input data of the alignment and both annotations files, we employ the parameter for a minimum mapping quality of 15 (--min-mq) to remove most multimappers and a minimum read length of 1000 (--min-len) to remove shorter reads.
samtools view mapped_reads.sort.bam|python cuscoquality/cluster-coverage-median.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster coverage_ml1k_mq15.cluster --output-reference coverage_ml1k_mq15.busco > coverage_ml1k_mq15.summary
After running this script, we obtain 3 outputs.
The summary output provides the coverage quality which is calculated by a division between the median standard deviations of coverages in reference genes and standard deviation of coverage in piRNA clusters. Hence, a high value indicates stable coverages of piRNA clusters compared to reference genes. Furthermore, it outputs the standard deviations of coverages for piRNA clusters and reference genes, the median standard deviation of coverage in reference genes, and the average coverage.
Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for coverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of coverage quality which is calculated by a division between the median standard deviation of reference genes and the standard deviation of a piRNA cluster.
Coverage quality: 0.2162388909609717
Note that CQ = median. var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
Standard deviation of the coverage in piRNA clusters 27.23007883066921
Standard deviation of the coverage in reference regions 11.695878245182904
Median standard deviation of coverage in reference regions 5.888202047123744
Average coverage cluster 91.1847304128054
Average coverage reference 100.39635784020722
Cluster overview
c clu.ID std.dev.cov av.cov. c.score CQ
c 12 29.620228724025186 88.75250532471752 0 0.1987898912592724
c 1 13.213894228892284 98.95333414543266 1000 0.4456068699452084
c 2 10.906112510619122 105.32487650343643 1000 0.5398992575393374
Reference overview
r ref.ID std.dev.cov av.cov.
r EOG0915000C 10.094996158497661 99.0874915209863
r EOG09150009 5.888202047123744 91.99781319651794
r EOG09150006 5.4546381902835375 117.92787998867817
The 2 other outputs are tab-delimited files of coverages in every position of the annotations of piRNA clusters and reference genes.
First 5 lines of such an output for piRNA clusters is shown below.
1 2R_RaGOO 7069609 88
1 2R_RaGOO 7069610 88
1 2R_RaGOO 7069611 88
1 2R_RaGOO 7069612 88
1 2R_RaGOO 7069613 88
Columns:
1. feature ID
2. chromosome/contig
3. position
4. coverage
Using the same input data as for the coverage, we can obtain the number of softclipped bases in piRNA clusters and reference genes using another script from our CUSCOquality package. As for the coverage, we apply a minium mapping quality of 15 and a minimum read length of 1000.
samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage-median.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
The summary output provides soft-clip quality, which is calculated by a division between the median of softclipped bases in reference genes and the average of softclipped bases in piRNA clusters. Hence, a high value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes, and the median of softclipped bases in reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2). For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 5. column of soft-clip quality which is calculated by a division between the median of softclipped bases in reference genes and average of softclipped bases in a piRNA cluster.
Soft-clip quality = 0.930445592423241
Note that ScQ = median.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
Average soft-clip coverage in piRNA clusters 5.132141252457175
Average soft-clip coverage in reference regions 4.78370990237099
Median soft-clip coverage of reference regions 4.77517820804227
Cluster overview
c clu.ID av.cov. c.score ScQ
c 2 6.534095790378007 1000 0.7308093363237974
c 1 5.377679927770931 1000 0.8879625176988918
c 12 5.02316252120862 0 0.9506318355977296
Reference overview
r ref.ID av.cov
r EOG09150009 4.854072921485344
r EOG0915000C 4.77517820804227
r EOG09150006 4.7295782621001985
In the same format as for the coverages, the two further outputs provide information on the number of softclipped bases for each position in piRNA clusters and reference genes. The first 5 lines of the piRNA cluster output is shown below.
1 2R_RaGOO 7069609 3
1 2R_RaGOO 7069610 3
1 2R_RaGOO 7069611 3
1 2R_RaGOO 7069612 3
1 2R_RaGOO 7069613 3
Columns:
1. feature ID
2. chromosome/contig
3. position
4. coverage of softclipped bases
We use the the coverage and softclip outputs for reference genes (from argument '--output-reference') as inputs to calculate the quantiles with a script from our CUSCOquality package. We need to run this python script separate for coverage and softclips.
python cuscoquality/quantiles.py --coverage coverage_ml1k_mq15.busco > coverage.qt
python cuscoquality/quantiles.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.qt
This obtains the follow outputs, which will be used to visualize the quantiles.
coverage.qt
c 0.1% 128
c 1% 126
c 5% 121
c 95% 84
c 99% 79
c 99.9% 78
softclip.qt
sc 0.1% 14
sc 1% 12
sc 5% 9
sc 95% 1
sc 99% 0
sc 99.9% 0
To plot the coverages and softclipped bases in piRNA clusters or reference genes we can employ an R-script from our CUSCOquality package.
The script requires the following 6 arguments:
1. lower limit for the quantile range in percent: You can choose between 0.1, 1 or 5, which will correspond to 0.1-99.9%, 1-99% or 5-95% quantile range, respectively.
2. output from the quantiles.py for coverages
3. output from the quantiles.py for softclips
4. output from find-polyN.py
5. output from cluster-coverage.py for piRNA clusters or reference genes
6. output from cluster-softclipcoverage.py for piRNA clusters or reference genes
7. name of feature
R --vanilla --args 1 coverage.qt softclips.qt genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R
This will make a plot for coverage and softclips for a particular piRNA cluster or reference gene in PostScript format.
Cluster 1, also known as 42AB. Here, assembled without gaps.

An example of a Busco annotation.

Cluster 12 as an example for a piRNA cluster with assembly gaps.
