CUSCOquality Wiki

Brought to you by: filwierz, fschwarz, rokofler

piRNAclusterAlign

Authors:

Attachments

EOG09150006.png (89608 bytes)

cl1.png (101577 bytes)

cl12.png (108472 bytes)

Introduction

Using the coverage and the number softclipped reads from the alignment of long reads to the assembly, we can assess the reliablility of the assembled content of piRNA clusters.

Requirements

In addition to the CUSCO-quality scripts you need the following

python3
Long-read alignment software (e.g. minimap2) to align long reads that were used to assemble the genome.
Samtools to sort and convert the alignment file from sam to bam.
Annotations of piRNA clusters and reference genes (in bed format): You can obtain the piRNA cluster annotations using CUSCO. Annotations of reference genome can be obtained by running BUSCO on the genome assembly or from alignments. Here, we provide example annotation files ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3cluster.bed & https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3busco.bed).
Information of poly-N tracts in genome assemblies: You need to obtain the gap information by running the script 'find-polyN.py' form our CUSCOquality package using the genome assembly. Here, we provide the gap information for the genome assembly that will be used in this walkthrough ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.polyN ).
R with ggplot2

Preparatory work

Mapping of long-reads

We use minimap2 to align long reads the an assembly. For this walkthrough, we provide an example file of 100X Oxford Nanopore reads that align to 3 piRNA clusters and 3 BUSCO annotations ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/reads.fastq.gz ) and a genome assembly of the D. melanogaster strain Canton-S ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.fasta.gz ).

minimap2 -ax map-ont -t 20 genome.fasta.gz reads.fastq.gz > mapped_reads.sam

sorting

The mapping output will be sorted and converted into bam format.

samtools sort mapped_reads.sam -o mapped_reads.sort.bam

Coverage

After aligning reads to the genome and converting the alignment from a sam to a sorted bam file, we can obtain coverages for piRNA clusters and reference genes using a script from our CUSCOquality package. In addition to the required input data of the alignment and both annotations files, we employ the parameter for a minimum mapping quality of 15 (--min-mq) to remove most multimappers and a minimum read length of 1000 (--min-len) to remove shorter reads.

samtools view mapped_reads.sort.bam|python cuscoquality/cluster-coverage-median.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster coverage_ml1k_mq15.cluster --output-reference coverage_ml1k_mq15.busco > coverage_ml1k_mq15.summary

After running this script, we obtain 3 outputs.

The summary output provides the coverage quality which is calculated by a division between the median standard deviations of coverages in reference genes and standard deviation of coverage in piRNA clusters. Hence, a high value indicates stable coverages of piRNA clusters compared to reference genes. Furthermore, it outputs the standard deviations of coverages for piRNA clusters and reference genes, the median standard deviation of coverage in reference genes, and the average coverage.
Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for coverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of coverage quality which is calculated by a division between the median standard deviation of reference genes and the standard deviation of a piRNA cluster.

Coverage quality: 0.2162388909609717
Note that CQ = median. var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
Standard deviation of the coverage in piRNA clusters 27.23007883066921
Standard deviation of the coverage in reference regions 11.695878245182904
Median standard deviation of coverage in reference regions 5.888202047123744
Average coverage cluster 91.1847304128054
Average coverage reference 100.39635784020722

 Cluster overview
c   clu.ID  std.dev.cov av.cov. c.score CQ
c   12  29.620228724025186  88.75250532471752   0   0.1987898912592724
c   1   13.213894228892284  98.95333414543266   1000    0.4456068699452084
c   2   10.906112510619122  105.32487650343643  1000    0.5398992575393374

 Reference overview
r   ref.ID  std.dev.cov av.cov.
r   EOG0915000C 10.094996158497661  99.0874915209863
r   EOG09150009 5.888202047123744   91.99781319651794
r   EOG09150006 5.4546381902835375  117.92787998867817

The 2 other outputs are tab-delimited files of coverages in every position of the annotations of piRNA clusters and reference genes.
First 5 lines of such an output for piRNA clusters is shown below.

1       2R_RaGOO        7069609 88
1       2R_RaGOO        7069610 88
1       2R_RaGOO        7069611 88
1       2R_RaGOO        7069612 88
1       2R_RaGOO        7069613 88

Columns:
1. feature ID
2. chromosome/contig
3. position
4. coverage

Softclipped reads

Using the same input data as for the coverage, we can obtain the number of softclipped bases in piRNA clusters and reference genes using another script from our CUSCOquality package. As for the coverage, we apply a minium mapping quality of 15 and a minimum read length of 1000.

samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage-median.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary

The summary output provides soft-clip quality, which is calculated by a division between the median of softclipped bases in reference genes and the average of softclipped bases in piRNA clusters. Hence, a high value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes, and the median of softclipped bases in reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2). For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 5. column of soft-clip quality which is calculated by a division between the median of softclipped bases in reference genes and average of softclipped bases in a piRNA cluster.

Soft-clip quality = 0.930445592423241
Note that ScQ = median.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
Average soft-clip coverage in piRNA clusters 5.132141252457175
Average soft-clip coverage in reference regions 4.78370990237099
Median soft-clip coverage of reference regions 4.77517820804227

 Cluster overview
c   clu.ID  av.cov. c.score ScQ
c   2   6.534095790378007   1000    0.7308093363237974
c   1   5.377679927770931   1000    0.8879625176988918
c   12  5.02316252120862    0   0.9506318355977296

 Reference overview
r   ref.ID  av.cov
r   EOG09150009 4.854072921485344
r   EOG0915000C 4.77517820804227
r   EOG09150006 4.7295782621001985

In the same format as for the coverages, the two further outputs provide information on the number of softclipped bases for each position in piRNA clusters and reference genes. The first 5 lines of the piRNA cluster output is shown below.

1       2R_RaGOO        7069609 3
1       2R_RaGOO        7069610 3
1       2R_RaGOO        7069611 3
1       2R_RaGOO        7069612 3
1       2R_RaGOO        7069613 3

Columns:
1. feature ID
2. chromosome/contig
3. position
4. coverage of softclipped bases

Quantiles

We use the the coverage and softclip outputs for reference genes (from argument '--output-reference') as inputs to calculate the quantiles with a script from our CUSCOquality package. We need to run this python script separate for coverage and softclips.

python cuscoquality/quantiles.py --coverage coverage_ml1k_mq15.busco > coverage.qt
python cuscoquality/quantiles.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.qt

This obtains the follow outputs, which will be used to visualize the quantiles.
coverage.qt

c       0.1%    128
c       1%      126
c       5%      121
c       95%     84
c       99%     79
c       99.9%   78

softclip.qt

sc      0.1%    14
sc      1%      12
sc      5%      9
sc      95%     1
sc      99%     0
sc      99.9%   0

Visualize

To plot the coverages and softclipped bases in piRNA clusters or reference genes we can employ an R-script from our CUSCOquality package.
The script requires the following 6 arguments:
1. lower limit for the quantile range in percent: You can choose between 0.1, 1 or 5, which will correspond to 0.1-99.9%, 1-99% or 5-95% quantile range, respectively.
2. output from the quantiles.py for coverages
3. output from the quantiles.py for softclips
4. output from find-polyN.py
5. output from cluster-coverage.py for piRNA clusters or reference genes
6. output from cluster-softclipcoverage.py for piRNA clusters or reference genes
7. name of feature

R --vanilla --args 1 coverage.qt softclips.qt genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R

This will make a plot for coverage and softclips for a particular piRNA cluster or reference gene in PostScript format.
Cluster 1, also known as 42AB. Here, assembled without gaps.

An example of a Busco annotation.

Cluster 12 as an example for a piRNA cluster with assembly gaps.