Recent changes to piRNAclusterAlign

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Thu, 10 Dec 2020 09:34:07 -0000

--- v27
+++ v28
@@ -117,16 +117,16 @@
 4. coverage of softclipped bases

-#Confidence intervals
-We use the the coverage and softclip outputs for reference genes  (from argument '--output-reference') as inputs to calculate the confidence intervals with a script from our CUSCOquality package. We need to run this python script separate for coverage and softclips.
+#Quantiles
+We use the coverage and softclip outputs for reference genes (from the argument '--output-reference') as inputs to calculate the quantiles with a script from our CUSCOquality package. The Python script must be run separately for coverage and softclips.

 ~~~
-python cuscoquality/confidence-interval.py --coverage coverage_ml1k_mq15.busco > coverage.CI
-python cuscoquality/confidence-interval.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.CI
+python cuscoquality/quantiles.py --coverage coverage_ml1k_mq15.busco > coverage.qt
+python cuscoquality/quantiles.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.qt
 ~~~

-This obtains the follow outputs, which will be used to visualize the confidence intervals.
-coverage.CI:
+This produces the following outputs, which will be used to visualize the quantiles.
+coverage.qt
 ~~~
 c       0.1%    128
 c       1%      126
@@ -135,7 +135,7 @@
 c       99%     79
 c       99.9%   78
 ~~~
-softclip.CI
+softclips.qt
 ~~~
 sc      0.1%    14
 sc      1%      12
@@ -148,15 +148,15 @@
 #Visualize
 To plot the coverages and softclipped bases in piRNA clusters or reference genes we can employ an R-script from our CUSCOquality package.
 The script requires the following 7 arguments:
-1. lower limit for the confidence interval in percent: You can choose between 0.1, 1 or 5, which will correspond to 0.1-99.9%, 1-99% or 5-95% confidence intervals, respectively.
-2. output from the confindence-interval.py for coverages
-3. output from the confindence-interval.py for softclips
+1. lower limit of the quantile range in percent: choose 0.1, 1 or 5, corresponding to the 0.1-99.9%, 1-99% or 5-95% quantile range, respectively.
+2. output from quantiles.py for coverages
+3. output from quantiles.py for softclips
 4. output from find-polyN.py 
 5. output from cluster-coverage.py for piRNA clusters or reference genes
 6. output from cluster-softclipcoverage.py  for piRNA clusters or reference genes
 7. name of feature 
 ~~~
-R --vanilla --args 1 coverage.CI softclips.CI genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R
+R --vanilla --args 1 coverage.qt softclips.qt genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R
 ~~~

 This will make a plot for coverage and softclips for a particular piRNA cluster or reference gene in PostScript format.
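
How the first argument selects a pair of values from the two `.qt` files can be sketched in Python. This is a hypothetical helper, not part of the CUSCOquality package: it assumes each `.qt` line holds three whitespace-separated fields (tag, percent label, value), as in the excerpts above, and the function names `parse_quantiles` and `quantile_range` are invented for illustration.

```python
from decimal import Decimal


def parse_quantiles(text):
    """Map percent labels such as '1%' to their values (assumed 3-field lines)."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _tag, label, value = fields
        table[label] = float(value)
    return table


def quantile_range(table, lower_limit):
    """Return the values for lower_limit% and (100 - lower_limit)%.

    lower_limit is passed as a string ('0.1', '1' or '5') so that Decimal
    arithmetic yields exact labels such as '99.9%'.
    """
    lower = Decimal(lower_limit)
    upper = Decimal(100) - lower
    return table[f"{lower}%"], table[f"{upper}%"]


# Lines copied from the coverage.qt excerpt shown in this walkthrough.
example = """c 0.1% 128
c 1% 126
c 99% 79
c 99.9% 78"""

table = parse_quantiles(example)
print(quantile_range(table, "1"))
```

With a lower limit of 1, the pair taken from the excerpt is the value labelled 1% and the value labelled 99%.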

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Thu, 27 Aug 2020 15:50:56 -0000

--- v26
+++ v27
@@ -77,7 +77,7 @@
 #Softclipped reads
 Using the same input data as for the coverage, we can obtain the number of softclipped bases in piRNA clusters and reference genes using another script from our CUSCOquality package. As for the coverage, we apply a minimum mapping quality of 15 and a minimum read length of 1000.
 ~~~
-samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
+samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage-median.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
 ~~~

 The summary output provides the soft-clip quality, which is the median of softclipped bases in reference genes divided by the average of softclipped bases in piRNA clusters. Hence, a high value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. The output also contains the average softclipped bases for piRNA clusters and reference genes, and the median of softclipped bases in reference genes. Below the summary, tab-delimited overviews show the average softclipped bases (column 3) for individual piRNA clusters or reference genes (column 2). For piRNA clusters, column 4 contains the gap status, where 0 indicates gaps and 1000 indicates no gaps, and column 5 contains the soft-clip quality, calculated as the median of softclipped bases in reference genes divided by the average of softclipped bases in that piRNA cluster.
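
The soft-clip quality described above can be checked by hand with a few lines of Python. This is a minimal sketch, not the code of `cluster-softclipcoverage-median.py`: the function name `softclip_quality` is hypothetical, and the numbers are copied from the example summary output of this walkthrough.

```python
from statistics import median

# Average soft-clip coverage per reference gene (from the reference overview).
reference_averages = {
    "EOG09150009": 4.854072921485344,
    "EOG0915000C": 4.77517820804227,
    "EOG09150006": 4.7295782621001985,
}

# Average soft-clip coverage per piRNA cluster (from the cluster overview).
cluster_averages = {
    "2": 6.534095790378007,
    "1": 5.377679927770931,
    "12": 5.02316252120862,
}


def softclip_quality(reference_values, cluster_average):
    """ScQ = median of the reference averages / average of the cluster."""
    return median(reference_values) / cluster_average


# Per-cluster values (column 5 of the cluster overview).
per_cluster = {
    cluster_id: softclip_quality(reference_averages.values(), average)
    for cluster_id, average in cluster_averages.items()
}

# Overall value, using the reported average over all piRNA clusters.
overall = softclip_quality(reference_averages.values(), 5.132141252457175)

for cluster_id, value in per_cluster.items():
    print(cluster_id, value)
print("overall", overall)
```

Because the median rather than the mean of the reference genes is used, a single reference gene with unusually many softclipped bases has little influence on the score.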

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Thu, 27 Aug 2020 15:41:05 -0000

--- v25
+++ v26
@@ -29,32 +29,33 @@
 #Coverage
 After aligning reads to the genome and converting the alignment from a sam to a sorted bam file, we can obtain coverages for piRNA clusters and reference genes using a script from our CUSCOquality package. In addition to the required input data of the alignment and both annotations files, we employ the parameter for a minimum mapping quality of 15 (--min-mq) to remove most multimappers and a minimum read length of 1000 (--min-len) to remove shorter reads.
 ~~~
-samtools view mapped_reads.sort.bam|python cuscoquality/cluster-coverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster coverage_ml1k_mq15.cluster --output-reference coverage_ml1k_mq15.busco > coverage_ml1k_mq15.summary
+samtools view mapped_reads.sort.bam|python cuscoquality/cluster-coverage-median.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster coverage_ml1k_mq15.cluster --output-reference coverage_ml1k_mq15.busco > coverage_ml1k_mq15.summary
 ~~~
 After running this script, we obtain 3 outputs.

-The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in reference genes and piRNA clusters. Hence, a high value indicates stable coverages of piRNA clusters compared to reference genes. Furthermore, it outputs the average coverage and the  average standard deviations of coverages for piRNA clusters and reference genes. 
-Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for coverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of coverage quality which is calculated by a division between the variance of reference genes and the variance of a piRNA cluster.
+The summary output provides the coverage quality, which is the median standard deviation of coverage in reference genes divided by the standard deviation of coverage in piRNA clusters. Hence, a high value indicates stable coverage of piRNA clusters compared to reference genes. The output also contains the standard deviations of coverage for piRNA clusters and reference genes, the median standard deviation of coverage in reference genes, and the average coverages.
+Below the summary, tab-delimited overviews list the standard deviation (column 3) and average (column 4) of coverage for individual piRNA clusters and reference genes, with the corresponding feature names in column 2. For piRNA clusters, column 5 contains the gap status, where 0 indicates gaps and 1000 indicates no gaps, and column 6 contains the coverage quality, calculated as the median standard deviation of reference genes divided by the standard deviation of that piRNA cluster.

 ~~~
-Coverage quality: 0.1844878968558935
-Note that CQ = var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
+Coverage quality: 0.2162388909609717
+Note that CQ = median. var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
 Standard deviation of the coverage in piRNA clusters 27.23007883066921
 Standard deviation of the coverage in reference regions 11.695878245182904
+Median standard deviation of coverage in reference regions 5.888202047123744
 Average coverage cluster 91.1847304128054
 Average coverage reference 100.39635784020722

  Cluster overview
-c       clu.ID  std.dev.cov     av.cov. c.score CQ
-c       12      29.620228724025186      88.75250532471752       0       0.15591534558796424
-c       1       13.213894228892284      98.95333414543266       1000    0.7834369033528322
-c       2       10.906112510619122      105.32487650343643      1000    1.1500738529908696
+c       clu.ID  std.dev.cov     av.cov. c.score CQ
+c       12      29.620228724025186      88.75250532471752       0       0.1987898912592724
+c       1       13.213894228892284      98.95333414543266       1000    0.4456068699452084
+c       2       10.906112510619122      105.32487650343643      1000    0.5398992575393374

  Reference overview
-r       ref.ID  std.dev.cov     av.cov.
-r       EOG0915000C     10.094996158497661      99.0874915209863
-r       EOG09150009     5.888202047123744       91.99781319651794
-r       EOG09150006     5.4546381902835375      117.92787998867817
+r       ref.ID  std.dev.cov     av.cov.
+r       EOG0915000C     10.094996158497661      99.0874915209863
+r       EOG09150009     5.888202047123744       91.99781319651794
+r       EOG09150006     5.4546381902835375      117.92787998867817
 ~~~

 The 2 other outputs are tab-delimited files of coverages in every position of the annotations of piRNA clusters and reference genes. 
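
The coverage quality values in the summary can be reproduced from the overview columns. The sketch below is an illustration under stated assumptions, not the implementation of `cluster-coverage-median.py`: the function name `coverage_quality` is hypothetical, and all numbers are copied from the example output of revision v26.

```python
from statistics import median

# Standard deviation of coverage per reference gene (reference overview, column 3).
reference_sd = [10.094996158497661, 5.888202047123744, 5.4546381902835375]

# Standard deviation of coverage per piRNA cluster (cluster overview, column 3).
cluster_sd = {
    "12": 29.620228724025186,
    "1": 13.213894228892284,
    "2": 10.906112510619122,
}


def coverage_quality(reference_sds, cluster_sd_value):
    """CQ = median standard deviation of the references / cluster standard deviation."""
    return median(reference_sds) / cluster_sd_value


# Per-cluster values (column 6 of the cluster overview).
per_cluster = {cid: coverage_quality(reference_sd, sd) for cid, sd in cluster_sd.items()}

# Overall value, using the reported standard deviation over all piRNA clusters.
overall = coverage_quality(reference_sd, 27.23007883066921)

for cid, value in per_cluster.items():
    print(cid, value)
print("overall", overall)
```

Cluster 12, the only cluster with gaps in this example, also has the lowest coverage quality of the three.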
@@ -79,25 +80,26 @@
 samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
 ~~~

-The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in reference genes and piRNA clusters. Hence, a low value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2).  For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 5. column of soft-clip quality which is calculated by a division between the average softclipped bases in reference genes and a piRNA cluster. 
+The summary output provides the soft-clip quality, which is the median of softclipped bases in reference genes divided by the average of softclipped bases in piRNA clusters. Hence, a high value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. The output also contains the average softclipped bases for piRNA clusters and reference genes, and the median of softclipped bases in reference genes. Below the summary, tab-delimited overviews show the average softclipped bases (column 3) for individual piRNA clusters or reference genes (column 2). For piRNA clusters, column 4 contains the gap status, where 0 indicates gaps and 1000 indicates no gaps, and column 5 contains the soft-clip quality, calculated as the median of softclipped bases in reference genes divided by the average of softclipped bases in that piRNA cluster.

 ~~~
-Soft-clip quality = 0.9321079968484184
-Note that ScQ = av.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
+Soft-clip quality = 0.930445592423241
+Note that ScQ = median.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
 Average soft-clip coverage in piRNA clusters 5.132141252457175
 Average soft-clip coverage in reference regions 4.78370990237099
+Median soft-clip coverage of reference regions 4.77517820804227

  Cluster overview
-c       clu.ID  av.cov. c.score ScQ
-c       2       6.534095790378007       1000    0.7321150555238869
-c       1       5.377679927770931       1000    0.889549018651591
-c       12      5.02316252120862        0       0.9523303062907836
+c       clu.ID  av.cov. c.score ScQ
+c       2       6.534095790378007       1000    0.7308093363237974
+c       1       5.377679927770931       1000    0.8879625176988918
+c       12      5.02316252120862        0       0.9506318355977296

  Reference overview
-r       ref.ID  av.cov
-r       EOG09150009     4.854072921485344
-r       EOG0915000C     4.77517820804227
-r       EOG09150006     4.7295782621001985
+r       ref.ID  av.cov
+r       EOG09150009     4.854072921485344
+r       EOG0915000C     4.77517820804227
+r       EOG09150006     4.7295782621001985
 ~~~

 In the same format as for the coverages, the two further outputs provide information on the number of softclipped bases for each position in piRNA clusters and reference genes. The first 5 lines of the piRNA cluster output are shown below.

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Tue, 04 Aug 2020 11:48:20 -0000

--- v24
+++ v25
@@ -33,7 +33,7 @@
 ~~~
 After running this script, we obtain 3 outputs.

-The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in reference and piRNA clusters. Hence, a high value indicates stable coverages of piRNA clusters compared to reference genes. Furthermore, it outputs the average coverage and the  average standard deviations of coverages for piRNA clusters and reference genes. 
+The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in reference genes and piRNA clusters. Hence, a high value indicates stable coverages of piRNA clusters compared to reference genes. Furthermore, it outputs the average coverage and the  average standard deviations of coverages for piRNA clusters and reference genes. 
 Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for coverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of coverage quality which is calculated by a division between the variance of reference genes and the variance of a piRNA cluster.

 ~~~
@@ -79,7 +79,7 @@
 samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
 ~~~

-The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in reference genes and piRNA clusters. Hence, a low value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2).  For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of soft-clip quality which is calculated by a division between the average softclipped bases in reference genes and a piRNA cluster. 
+The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in reference genes and piRNA clusters. Hence, a low value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2).  For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 5. column of soft-clip quality which is calculated by a division between the average softclipped bases in reference genes and a piRNA cluster. 

 ~~~
 Soft-clip quality = 0.9321079968484184

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Tue, 04 Aug 2020 11:39:36 -0000

--- v23
+++ v24
@@ -33,22 +33,25 @@
 ~~~
 After running this script, we obtain 3 outputs.

-The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in piRNA clusters and reference genes. Furthermore, it outputs the average coverage and the  average standard deviations of coverages for piRNA clusters and reference genes. 
-Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for colverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps.
+The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in reference and piRNA clusters. Hence, a high value indicates stable coverages of piRNA clusters compared to reference genes. Furthermore, it outputs the average coverage and the  average standard deviations of coverages for piRNA clusters and reference genes. 
+Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for coverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of coverage quality which is calculated by a division between the variance of reference genes and the variance of a piRNA cluster.

 ~~~
-Coverage quality: 5.420409777781337
+Coverage quality: 0.1844878968558935
+Note that CQ = var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
 Standard deviation of the coverage in piRNA clusters 27.23007883066921
 Standard deviation of the coverage in reference regions 11.695878245182904
 Average coverage cluster 91.1847304128054
 Average coverage reference 100.39635784020722

  Cluster overview
-c       12      29.620228724025186      88.75250532471752       0
-c       1       13.213894228892284      98.95333414543266       1000
-c       2       10.906112510619122      105.32487650343643      1000
+c       clu.ID  std.dev.cov     av.cov. c.score CQ
+c       12      29.620228724025186      88.75250532471752       0       0.15591534558796424
+c       1       13.213894228892284      98.95333414543266       1000    0.7834369033528322
+c       2       10.906112510619122      105.32487650343643      1000    1.1500738529908696

  Reference overview
+r       ref.ID  std.dev.cov     av.cov.
 r       EOG0915000C     10.094996158497661      99.0874915209863
 r       EOG09150009     5.888202047123744       91.99781319651794
 r       EOG09150006     5.4546381902835375      117.92787998867817
@@ -76,19 +79,22 @@
 samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
 ~~~

-The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in piRNA clusters and reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2).  For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps.
+The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in reference genes and piRNA clusters. Hence, a low value indicates a low number of softclipped bases in piRNA clusters compared to reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2).  For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of soft-clip quality which is calculated by a division between the average softclipped bases in reference genes and a piRNA cluster. 

 ~~~
-Soft-clip quality: 1.0728370568444145
+Soft-clip quality = 0.9321079968484184
+Note that ScQ = av.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
 Average soft-clip coverage in piRNA clusters 5.132141252457175
 Average soft-clip coverage in reference regions 4.78370990237099

  Cluster overview
-c       2       6.534095790378007       1000
-c       1       5.377679927770931       1000
-c       12      5.02316252120862        0
+c       clu.ID  av.cov. c.score ScQ
+c       2       6.534095790378007       1000    0.7321150555238869
+c       1       5.377679927770931       1000    0.889549018651591
+c       12      5.02316252120862        0       0.9523303062907836

  Reference overview
+r       ref.ID  av.cov
 r       EOG09150009     4.854072921485344
 r       EOG0915000C     4.77517820804227
 r       EOG09150006     4.7295782621001985

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Mon, 08 Jun 2020 12:16:17 -0000

--- v22
+++ v23
@@ -3,16 +3,19 @@
 Using the coverage and the number of softclipped reads from the alignment of long reads to the assembly, we can assess the reliability of the assembled content of piRNA clusters.

 ##Requirements
-* R
+
+In addition to the CUSCOquality scripts, you need the following:
+
 * python3
 * Long-read alignment software (e.g. minimap2) to align long reads that were used to assemble the genome.
 *  Samtools to sort and convert the alignment file from sam to bam. 
-* Annotations of piRNA clusters and reference genes (in bed format): You can obtain the piRNA cluster annotations using CUSCO. Annotations of reference genome can be obtained by running Busco on the genome assembly or from alignments. Here, we provide example annotation files ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3cluster.bed & https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3busco.bed).
+* Annotations of piRNA clusters and reference genes (in bed format): You can obtain the piRNA cluster annotations using CUSCO. Annotations of reference genes can be obtained by running BUSCO on the genome assembly or from alignments. Here, we provide example annotation files ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3cluster.bed & https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3busco.bed).
 * Information on poly-N tracts in genome assemblies: You need to obtain the gap information by running the script 'find-polyN.py' from our CUSCOquality package on the genome assembly. Here, we provide the gap information for the genome assembly that will be used in this walkthrough ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.polyN ).
+* R with ggplot2

 #Preparatory work
 ##Mapping of long-reads
-We use minimap2 to align long reads the an assembly.  For this walkthrough, we provide an example file of 100X Oxford Nanopore reads that align to 3 piRNA clusters and 3 Busco annotations ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/reads.fastq.gz ) and a genome assembly of the *D. melanogaster* strain Canton-S ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.fasta.gz ).
+We use minimap2 to align long reads to an assembly.  For this walkthrough, we provide an example file of 100X Oxford Nanopore reads that align to 3 piRNA clusters and 3 BUSCO annotations ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/reads.fastq.gz ) and a genome assembly of the *D. melanogaster* strain Canton-S ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.fasta.gz ).
 ~~~
 minimap2 -ax map-ont -t 20 genome.fasta.gz reads.fastq.gz > mapped_reads.sam
 ~~~

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Mon, 08 Jun 2020 11:11:06 -0000

--- v21
+++ v22
@@ -1,10 +1,8 @@
-**page in progress**
-
-**Introduction**
+#Introduction

 Using the coverage and the number of softclipped reads from the alignment of long reads to the assembly, we can assess the reliability of the assembled content of piRNA clusters.

-**Requirements**
+##Requirements
 * R
 * python3
 * Long-read alignment software (e.g. minimap2) to align long reads that were used to assemble the genome.
@@ -12,20 +10,20 @@
 * Annotations of piRNA clusters and reference genes (in bed format): You can obtain the piRNA cluster annotations using CUSCO. Annotations of reference genome can be obtained by running Busco on the genome assembly or from alignments. Here, we provide example annotation files ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3cluster.bed & https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/3busco.bed).
 * Information on poly-N tracts in genome assemblies: You need to obtain the gap information by running the script 'find-polyN.py' from our CUSCOquality package on the genome assembly. Here, we provide the gap information for the genome assembly that will be used in this walkthrough ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.polyN ).

-**Preparatory work**
-**Mapping of long-reads**
+#Preparatory work
+##Mapping of long-reads
 We use minimap2 to align long reads to an assembly.  For this walkthrough, we provide an example file of 100X Oxford Nanopore reads that align to 3 piRNA clusters and 3 Busco annotations ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/reads.fastq.gz ) and a genome assembly of the *D. melanogaster* strain Canton-S ( https://sourceforge.net/projects/cuscoquality/files/Walkthrough/piRNAclusterAlignments/genome.fasta.gz ).
 ~~~
 minimap2 -ax map-ont -t 20 genome.fasta.gz reads.fastq.gz > mapped_reads.sam
 ~~~

-**sorting**
+##Sorting
 The mapping output will be sorted and converted into bam format.
 ~~~
 samtools sort mapped_reads.sam -o mapped_reads.sort.bam
 ~~~

-**Coverage**
+#Coverage
 After aligning reads to the genome and converting the alignment from a sam to a sorted bam file, we can obtain coverages for piRNA clusters and reference genes using a script from our CUSCOquality package. In addition to the required input data of the alignment and both annotations files, we employ the parameter for a minimum mapping quality of 15 (--min-mq) to remove most multimappers and a minimum read length of 1000 (--min-len) to remove shorter reads.
 ~~~
 samtools view mapped_reads.sort.bam|python cuscoquality/cluster-coverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster coverage_ml1k_mq15.cluster --output-reference coverage_ml1k_mq15.busco > coverage_ml1k_mq15.summary
@@ -69,10 +67,10 @@
 3. position
 4. coverage

-**Softclipped reads**
+#Softclipped reads
 Using the same input data as for the coverage, we can obtain the number of softclipped bases in piRNA clusters and reference genes using another script from our CUSCOquality package. As for the coverage, we apply a minimum mapping quality of 15 and a minimum read length of 1000.
 ~~~
-samtools view mapped_reads.sort.bam|python /Volumes/Temp3/filip/programs/roberts_scripts/cuscoquality-code/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
+samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
 ~~~

 The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in piRNA clusters and reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2).  For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps.
@@ -108,12 +106,12 @@
 4. coverage of softclipped bases


-**Confidence intervals**
+#Confidence intervals
 We use the the coverage and softclip outputs for reference genes  (from argument '--output-reference') as inputs to calculate the confidence intervals with a script from our CUSCOquality package. We need to run this python script separate for coverage and softclips.

 ~~~
-python /Volumes/Temp3/filip/programs/roberts_scripts/cuscoquality-code/confidence-interval.py --coverage coverage_ml1k_mq15.busco > coverage.CI
-python /Volumes/Temp3/filip/programs/roberts_scripts/cuscoquality-code/confidence-interval.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.CI
+python cuscoquality/confidence-interval.py --coverage coverage_ml1k_mq15.busco > coverage.CI
+python cuscoquality/confidence-interval.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.CI
 ~~~

 This obtains the follow outputs, which will be used to visualize the confidence intervals.
@@ -136,7 +134,7 @@
 sc      99.9%   0
 ~~~

-**Visualize**
+#Visualize
 To plot the coverages and softclipped bases in piRNA clusters or reference genes we can employ an R-script from our CUSCOquality package.
 The script requires the following 6 arguments:
 1. lower limit for the confidence interval in percent: You can choose between 0.1, 1 or 5, which will correspond to 0.1-99.9%, 1-99% or 5-95% confidence intervals, respectively.

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Fri, 05 Jun 2020 14:59:36 -0000

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Fri, 05 Jun 2020 14:55:00 -0000

--- v19
+++ v20
@@ -152,8 +152,8 @@

 This will make a plot for coverage and softclips for a particular piRNA cluster or reference gene in PostScript format.
 Cluster 1, also known as 42AB. Here, assembled without gaps.
-[[img src=coverage.png]]
+[[img src=cl1.png]]
 An example of a Busco annotation.
-[[img src=busco_coverage.png]]
+[[img src=EOG09150006.png]]
 Cluster 12 as an example for a piRNA cluster with assembly gaps.
-[[img src=coverage_gap.png]]
+[[img src=cl12.png]]

piRNAclusterAlign modified by Filip Wierzbicki

Filip Wierzbicki — Fri, 05 Jun 2020 14:08:44 -0000

--- v18
+++ v19
@@ -33,7 +33,7 @@
 After running this script, we obtain 3 outputs.

 The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in piRNA clusters and reference genes. Furthermore, it outputs the average coverage and the  average standard deviations of coverages for piRNA clusters and reference genes. 
-Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) of coverages for individual piRNA clusters and reference genes, with the corresponding feature names in column 2.
+Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) of coverages for individual piRNA clusters and reference genes, with the corresponding feature names in column 2. For piRNA clusters, a fifth column contains the gap status of the cluster, where 0 indicates gaps and 1000 indicates no gaps.

 ~~~
 Coverage quality: 5.420409777781337
@@ -43,9 +43,9 @@
 Average coverage reference 100.39635784020722

  Cluster overview
-c       12      29.620228724025186      88.75250532471752
-c       1       13.213894228892284      98.95333414543266
-c       2       10.906112510619122      105.32487650343643
+c       12      29.620228724025186      88.75250532471752       0
+c       1       13.213894228892284      98.95333414543266       1000
+c       2       10.906112510619122      105.32487650343643      1000

  Reference overview
 r       EOG0915000C     10.094996158497661      99.0874915209863
@@ -56,18 +56,18 @@
 The 2 other outputs are tab-delimited files of coverages in every position of the annotations of piRNA clusters and reference genes. 
 The first 5 lines of such an output for piRNA clusters are shown below.
 ~~~
-1       1000    2R_RaGOO        7069609 88
-1       1000    2R_RaGOO        7069610 88
-1       1000    2R_RaGOO        7069611 88
-1       1000    2R_RaGOO        7069612 88
-1       1000    2R_RaGOO        7069613 88
+1       2R_RaGOO        7069609 88
+1       2R_RaGOO        7069610 88
+1       2R_RaGOO        7069611 88
+1       2R_RaGOO        7069612 88
+1       2R_RaGOO        7069613 88
+
 ~~~
 Columns:
 1.  feature ID
-2.  gap score 
-3. chromosome/contig
-4. position
-5. coverage
+2. chromosome/contig
+3. position
+4. coverage

 **Softclipped reads**
 Using the same input data as for the coverage, we can obtain the number of softclipped bases in piRNA clusters and reference genes using another script from our CUSCOquality package. As for the coverage, we apply a minimum mapping quality of 15 and a minimum read length of 1000.
@@ -75,7 +75,7 @@
 samtools view mapped_reads.sort.bam|python /Volumes/Temp3/filip/programs/roberts_scripts/cuscoquality-code/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
 ~~~

-The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in piRNA clusters and reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, the overviews show the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2).
+The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in piRNA clusters and reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, the overviews show the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2). For piRNA clusters, a fourth column contains the gap status of the cluster, where 0 indicates gaps and 1000 indicates no gaps.

 ~~~
 Soft-clip quality: 1.0728370568444145
@@ -83,9 +83,9 @@
 Average soft-clip coverage in reference regions 4.78370990237099

  Cluster overview
-c       2       6.534095790378007
-c       1       5.377679927770931
-c       12      5.02316252120862
+c       2       6.534095790378007       1000
+c       1       5.377679927770931       1000
+c       12      5.02316252120862        0

  Reference overview
 r       EOG09150009     4.854072921485344
@@ -95,18 +95,18 @@

 In the same format as for the coverages, the two further outputs provide information on the number of softclipped bases for each position in piRNA clusters and reference genes. The first 5 lines of the piRNA cluster output are shown below.
 ~~~
-1       1000    2R_RaGOO        7069609 3
-1       1000    2R_RaGOO        7069610 3
-1       1000    2R_RaGOO        7069611 3
-1       1000    2R_RaGOO        7069612 3
-1       1000    2R_RaGOO        7069613 3
+1       2R_RaGOO        7069609 3
+1       2R_RaGOO        7069610 3
+1       2R_RaGOO        7069611 3
+1       2R_RaGOO        7069612 3
+1       2R_RaGOO        7069613 3
 ~~~
 Columns:
 1.  feature ID
-2.  gap score 
-3. chromosome/contig
-4. position
-5. coverage
+2. chromosome/contig
+3. position
+4. coverage of softclipped bases
+

 **Confidence intervals**
 We use the coverage and softclip outputs for reference genes (from argument '--output-reference') as inputs to calculate the confidence intervals with a script from our CUSCOquality package. This Python script needs to be run separately for coverage and softclips.
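
To give an idea of what this step computes, the sketch below derives percentile cut-offs from a per-position file in the 4-column format listed above (feature ID, chromosome/contig, position, coverage). It is not the `confidence-interval.py` script from CUSCOquality, and that script may work differently. The function names `read_coverages` and `upper_tail_quantiles` are hypothetical, and the default percentages are an assumption based on the 0.1, 1 and 5 limits used for plotting. The sketch also assumes that the value reported for p% is the one reached or exceeded by roughly p% of all positions, so that larger percentages map to smaller values.

```python
def read_coverages(lines):
    """Collect the last column (coverage) from 4-column, whitespace-delimited lines."""
    values = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        values.append(float(fields[-1]))
    return values


def upper_tail_quantiles(values, percents=(0.1, 1, 5, 95, 99, 99.9)):
    """Return {percent: value}, where about `percent` % of values lie above `value`.

    Assumed convention (not taken from the original script): values are ranked
    from highest to lowest and the entry at rank n * percent / 100 is reported.
    """
    if not values:
        raise ValueError("no coverage values")
    ranked = sorted(values, reverse=True)
    n = len(ranked)
    result = {}
    for p in percents:
        index = min(n - 1, int(n * p / 100))  # clamp to the last position
        result[p] = ranked[index]
    return result


if __name__ == "__main__":
    import sys

    # Usage (file name taken from the commands above):
    #   python sketch.py coverage_ml1k_mq15.busco
    with open(sys.argv[1]) as handle:
        coverages = read_coverages(handle)
    for percent, value in upper_tail_quantiles(coverages).items():
        print("c\t%g%%\t%g" % (percent, value))
```

Run once on the coverage file and once on the softclip file, this would give two small tables with the same layout as the outputs of the real script, although the exact numbers may differ.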