Recent changes to Manual

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Tue, 20 Apr 2021 12:51:14 -0000

--- v48
+++ v49
@@ -163,11 +163,13 @@
 ~~~

 **parameters**
+
 * $1: bed file of regions of interest
 * $2: output directory

 **output**
 The script generates 4 output files in the output directory with the following names:
+
 * left_flanks.bed: annotations of left flanks
 * right_flanks.bed: annotations of right flanks
 * flanks.bed (left and right annotations concatenated) 
@@ -176,12 +178,13 @@
 ##flankparser.sh
 This script obtains the flanking sequences in fasta format from the reference genome assembly using samtools.

-**example call*
+**example call**
 ~~~
 bash flankparser.sh IDs flanks.bed genome.fasta flank-fasta
 ~~~

 **parameters**
+
 * $1: a file containing the flank IDs generated by the script 'flankbeder.sh'
 * $2: annotations of flanks generated by the script 'flankbeder.sh'
 * $3: reference genome assembly
@@ -200,6 +203,7 @@
 ~~~

 **parameters**
+
 * --bed: required bed file of the annotations of regions of interest in the reference genome assembly
 * --modsam: required alignment  (see [FlankDesign])
 * --inner: integer for (left-most) position-tolerance of corresponding flank alignment inside the region of interest

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Tue, 20 Apr 2021 12:48:17 -0000

--- v47
+++ v48
@@ -181,7 +181,7 @@
 bash flankparser.sh IDs flanks.bed genome.fasta flank-fasta
 ~~~

-**parameters*
+**parameters**
 * $1: a file containing the flank IDs generated by the script 'flankbeder.sh'
 * $2: annotations of flanks generated by the script 'flankbeder.sh'
 * $3: reference genome assembly

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Tue, 20 Apr 2021 12:44:33 -0000

--- v46
+++ v47
@@ -151,6 +151,63 @@

 Finally, we distinguish between a contig-CUSCO (poly-N tracts, i.e. gaps, between the two flanking sequences are not tolerated) and a scaffold-CUSCO (poly-N tracts, i.e. gaps, are tolerated between the two flanking sequences).

+
+# Flank design
+
+## flankbeder.sh
+This script is the first step in designing flanking sequences for your regions of interest. It uses the annotations of your regions in bed format to generate an annotation file of the corresponding flanks.
+
+**example call**
+~~~
+bash flankbeder.sh regions-of-interest.bed .
+~~~
+
+**parameters**
+* $1: bed file of regions of interest
+* $2: output directory
+
+**output**
+The script generates 4 output files in the output directory with the following names:
+* left_flanks.bed: annotations of left flanks
+* right_flanks.bed: annotations of right flanks
+* flanks.bed (left and right annotations concatenated) 
+* IDs: a file containing the flank IDs 
+
+##flankparser.sh
+This script obtains the flanking sequences in fasta format from the reference genome assembly using samtools.
+
+**example call*
+~~~
+bash flankparser.sh IDs flanks.bed genome.fasta flank-fasta
+~~~
+
+**parameters*
+* $1: a file containing the flank IDs generated by the script 'flankbeder.sh'
+* $2: annotations of flanks generated by the script 'flankbeder.sh'
+* $3: reference genome assembly
+* $4: output directory of flanking sequences
+
+**output**
+The script writes each flanking sequence to a separate fasta file in the provided output directory.
+Finally, it generates a directory named 'resources' where all flanking sequences are written into a single file named 'flanks.fasta'.
+
+##flank_validation.py
+This script uses a modified sam file of flanks aligned back to the reference genome assembly. For more details and the required commands, see [FlankDesign].
+
+**example call**
+~~~
+python flank_validation.py --bed regions-of-interest.bed --modsam resources/sam.mod --inner 100 --outer 5000 > resources/validated.tmp
+~~~
+
+**parameters**
+* --bed: required bed file of the annotations of regions of interest in the reference genome assembly
+* --modsam: required alignment  (see [FlankDesign])
+* --inner: integer for (left-most) position-tolerance of corresponding flank alignment inside the region of interest
+* --outer: integer for (left-most) position-tolerance of corresponding flank alignment outside of the region of interest
+
+**output**
+The output file is a temporary file of validated regions. It requires further processing to obtain the final cluster definition file (see [FlankDesign]).
+
 # piRNA cluster alignments

 ## Obtaining a bed file from BUSCO
@@ -434,4 +491,3 @@
 * col6: observed abundance of SNPs for the TE family
 * col7: observed abundance of internal deletions (ID) for the TE family

-
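
The flank annotation step that this revision documents for flankbeder.sh can be illustrated with a minimal Python sketch. It is not the script's actual implementation: the flank length of 5000 bp, the function name `make_flanks`, the `_left`/`_right` ID suffixes and the four-column input are all assumptions made purely for illustration.

```python
def make_flanks(bed_lines, flank_len=5000):
    """Return (left, right) flank annotations for each region of interest.

    flank_len and the ID naming scheme are hypothetical; they are not
    taken from flankbeder.sh.
    """
    left, right = [], []
    for line in bed_lines:
        if not line.strip():
            continue  # skip empty lines
        chrom, start, end, name = line.split()[:4]
        start, end = int(start), int(end)
        # The left flank ends where the region starts; clamp at position 0.
        left.append((chrom, max(0, start - flank_len), start, name + "_left"))
        # The right flank starts where the region ends. This sketch does not
        # know the chromosome length, so the end is not clamped.
        right.append((chrom, end, end + flank_len, name + "_right"))
    return left, right


# Hypothetical region of interest (chromosome, start, end, name).
regions = ["2R\t7069608\t7348719\tcluster1"]
left, right = make_flanks(regions)
for flank in left + right:
    print("\t".join(str(field) for field in flank))
```

Concatenating the two lists corresponds to the role the manual gives flanks.bed (left and right annotations together).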

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Thu, 10 Dec 2020 09:39:05 -0000

--- v45
+++ v46
@@ -329,7 +329,7 @@

 *Positions of arguments need to be in the following order:*

-* integer of the lower limit of the confidence interval: 0.1, 1 or 5
+* integer of the lower limit of the quantile range: 0.1, 1 or 5
 * string of the path to the quantiles file for the coverage
 * string of the path to the quantiles file for the softclips
 * string of the path to the polyN file

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Thu, 10 Dec 2020 09:38:18 -0000

--- v44
+++ v45
@@ -283,13 +283,13 @@
 ##cluster-softclipcoverage-median.py
 This script is very similar to the script cluster-softclipcoverage.py above. It uses the same parameters as cluster-softclipcoverage.py. The only difference is that it uses the median value of reference genes to calculate the quality metric.

-##confidence-interval.py
-This script obtains the confidence intervals using the output for the reference annotations from the two scripts 'cluster-coverage.py' and 'cluster-softclipcoverage.py'.
+##quantiles.py
+This script obtains the quantiles using the output for the reference annotations from the two scripts 'cluster-coverage.py' and 'cluster-softclipcoverage.py'.

 **example calls**
 ~~~
-python cuscoquality/confidence-interval.py --coverage coverage_ml1k_mq15.busco > coverage.CI
-python cuscoquality/confidence-interval.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.CI
+python cuscoquality/quantiles.py --coverage coverage_ml1k_mq15.busco > coverage.qt
+python cuscoquality/quantiles.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.qt
 ~~~

 **parameters**
@@ -322,7 +322,7 @@

 **example call*
 ~~~
-R --vanilla --args 1 coverage.CI softclips.CI genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R
+R --vanilla --args 1 coverage.qt softclips.qt genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R
 ~~~

 **parameters**
@@ -330,8 +330,8 @@
 *Positions of arguments need to be in the following order:*

 * integer of the lower limit of the confidence interval: 0.1, 1 or 5
-* string of the path to the confidence interval file for the coverage
-* string of the path to the confidence interval file for the softclips
+* string of the path to the quantiles file for the coverage
+* string of the path to the quantiles file for the softclips
 * string of the path to the polyN file 
 * string of the path to the coverage of piRNA clusters or reference
 * string of the path to the softclips of piRNA clusters or reference

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Tue, 10 Nov 2020 17:34:27 -0000

--- v43
+++ v44
@@ -132,9 +132,9 @@

 example output bed (--output-cb):
 ~~~
-2R_RaGOO        7069608 7348719 1       1000
-X_RaGOO 21887321        21924568        2       1000
-3L_RaGOO        27875492        28983531        12      0
+2R_RaGOO        7069608 7348719 1       1000   False
+X_RaGOO 21887321        21924568        2       1000   False
+3L_RaGOO        27875492        28983531        12      0  False
 ~~~
 Columns are separated by tabs:

@@ -143,6 +143,7 @@
 * col3: end position
 * col4: piRNA cluster ID
 * col5: gap status (1000=gapless; 0=gapped)
+* col6: whether the cluster is assembled as the reverse complement relative to the original annotation (True/False)

 The first section provides an overview of the piRNA clusters used for the analysis. Most important is the number of clusters with an error. An error would, for example, occur if the fasta-ID of a sequence flanking a piRNA cluster does not match the ID provided in the cluster definition file.
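
As an illustration of the six-column bed output (--output-cb) described in this revision, the following minimal Python sketch parses one tab-separated line into named fields. The class and function names are hypothetical, and treating col1 and col2 as chromosome and start position is an assumption based on the example rows, since this hunk only lists col3 to col6.

```python
from typing import NamedTuple


class ClusterRecord(NamedTuple):
    chrom: str                # col1 (assumed: chromosome/contig)
    start: int                # col2 (assumed: start position)
    end: int                  # col3: end position
    cluster_id: str           # col4: piRNA cluster ID
    gapless: bool             # col5: 1000=gapless, 0=gapped
    reverse_complement: bool  # col6: True/False


def parse_cluster_bed_line(line):
    """Parse one tab-separated line of the cluster bed output."""
    chrom, start, end, cluster_id, gap_status, revcomp = line.rstrip("\n").split("\t")
    return ClusterRecord(
        chrom,
        int(start),
        int(end),
        cluster_id,
        gap_status == "1000",
        revcomp == "True",
    )


record = parse_cluster_bed_line("3L_RaGOO\t27875492\t28983531\t12\t0\tFalse")
print(record)
```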

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Thu, 27 Aug 2020 15:47:51 -0000

--- v42
+++ v43
@@ -219,6 +219,9 @@
 * col3: position
 * col4: coverage

+##cluster-coverage-median.py
+This script is very similar to the script cluster-coverage.py above. It uses the same parameters as cluster-coverage.py. The only difference is that it uses the median value of standard deviations in reference genes to calculate the quality metric. 
+
 ##cluster-softclipcoverage.py
 This script obtains the number of softclipped bases from mapping results with a bed file of piRNA clusters (cusco output) and a minimal bed file with a mandatory fourth column containing the names of reference annotations.

@@ -275,6 +278,9 @@
 * col2: chromosome/contig
 * col3: position
 * col4: coverage
+
+##cluster-softclipcoverage-median.py
+This script is very similar to the script cluster-softclipcoverage.py above. It uses the same parameters as cluster-softclipcoverage.py. The only difference is that it uses the median value of reference genes to calculate the quality metric. 

 ##confidence-interval.py
 This script obtains the confidence intervals using the output for the reference annotations from the two scripts 'cluster-coverage.py' and 'cluster-softclipcoverage.py'.

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Tue, 04 Aug 2020 11:47:48 -0000

--- v41
+++ v42
@@ -181,24 +181,27 @@

 example output of the summary on coverage:
 ~~~
-Coverage quality: 5.420409777781337
+Coverage quality: 0.1844878968558935
+Note that CQ = var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
 Standard deviation of the coverage in piRNA clusters 27.23007883066921
 Standard deviation of the coverage in reference regions 11.695878245182904
 Average coverage cluster 91.1847304128054
 Average coverage reference 100.39635784020722

  Cluster overview
-c       12      29.620228724025186      88.75250532471752       0
-c       1       13.213894228892284      98.95333414543266       1000
-c       2       10.906112510619122      105.32487650343643      1000
+c       clu.ID  std.dev.cov     av.cov. c.score CQ
+c       12      29.620228724025186      88.75250532471752       0       0.15591534558796424
+c       1       13.213894228892284      98.95333414543266       1000    0.7834369033528322
+c       2       10.906112510619122      105.32487650343643      1000    1.1500738529908696

  Reference overview
+r       ref.ID  std.dev.cov     av.cov.
 r       EOG0915000C     10.094996158497661      99.0874915209863
 r       EOG09150009     5.888202047123744       91.99781319651794
 r       EOG09150006     5.4546381902835375      117.92787998867817
 ~~~
-The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in piRNA clusters and reference genes. Furthermore, it outputs the average coverage and the average standard deviations of coverages for piRNA clusters and reference genes.
-Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for colverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps.
+The summary output provides the coverage quality, which is calculated by dividing the average variance of the coverage in reference genes by the average variance of the coverage in piRNA clusters. Furthermore, it outputs the average coverage and the average standard deviation of the coverage for piRNA clusters and reference genes.
+Below, the output contains tab-delimited overviews of the standard deviations (column 3) and averages (column 4) of the coverage for individual piRNA clusters and reference genes, with the corresponding feature names in column 2. For piRNA clusters, a fifth column contains the gap status of the piRNA cluster (0 indicates gaps, 1000 indicates no gaps), and a sixth column contains the coverage quality, calculated by dividing the variance of the reference genes by the variance of the piRNA cluster.

 example output for coverage:
 ~~~
@@ -238,21 +241,24 @@

 example output of the summary on softclips:
 ~~~
-Soft-clip quality: 1.0728370568444145
+Soft-clip quality = 0.9321079968484184
+Note that ScQ = av.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
 Average soft-clip coverage in piRNA clusters 5.132141252457175
 Average soft-clip coverage in reference regions 4.78370990237099

  Cluster overview
-c       2       6.534095790378007       1000
-c       1       5.377679927770931       1000
-c       12      5.02316252120862        0
+c       clu.ID  av.cov. c.score ScQ
+c       2       6.534095790378007       1000    0.7321150555238869
+c       1       5.377679927770931       1000    0.889549018651591
+c       12      5.02316252120862        0       0.9523303062907836

  Reference overview
+r       ref.ID  av.cov
 r       EOG09150009     4.854072921485344
 r       EOG0915000C     4.77517820804227
 r       EOG09150006     4.7295782621001985
 ~~~
-The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in piRNA clusters and reference genes. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2). For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps.
+The summary output provides the soft-clip quality, which is calculated by dividing the average number of softclipped bases in reference genes by the average in piRNA clusters. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, the overviews show the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2). For piRNA clusters, a fourth column contains the gap status of the piRNA cluster (0 indicates gaps, 1000 indicates no gaps), and a fifth column contains the soft-clip quality, calculated by dividing the average softclipped bases in reference genes by the average in the piRNA cluster.
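
The two ratios from the example outputs (CQ = var.ref / var.cluster and ScQ = av.cov.ref / av.cov.cluster) can be reproduced with a short Python sketch. The function names are hypothetical, and taking the variance as the square of the printed standard deviation is an assumption; the input numbers are copied from the example outputs above.

```python
import math


def coverage_quality(std_ref, std_cluster):
    """CQ = var.ref / var.cluster, with variance taken as std. dev. squared."""
    return std_ref ** 2 / std_cluster ** 2


def softclip_quality(av_ref, av_cluster):
    """ScQ = av.cov.ref / av.cov.cluster."""
    return av_ref / av_cluster


# Values copied from the example summary on coverage.
std_ref = 11.695878245182904       # reference regions
std_clusters = 27.23007883066921   # all piRNA clusters
std_cluster_12 = 29.620228724025186

# Values copied from the example summary on softclips.
av_ref = 4.78370990237099
av_clusters = 5.132141252457175
av_cluster_2 = 6.534095790378007

overall_cq = coverage_quality(std_ref, std_clusters)
cluster_12_cq = coverage_quality(std_ref, std_cluster_12)
overall_scq = softclip_quality(av_ref, av_clusters)
cluster_2_scq = softclip_quality(av_ref, av_cluster_2)

# Compare against the values reported in the example outputs.
print(math.isclose(overall_cq, 0.1844878968558935, rel_tol=1e-6))
print(math.isclose(cluster_12_cq, 0.15591534558796424, rel_tol=1e-6))
print(math.isclose(overall_scq, 0.9321079968484184, rel_tol=1e-6))
print(math.isclose(cluster_2_scq, 0.7321150555238869, rel_tol=1e-6))
```

A value near 1 means the cluster behaves like the reference genes, and values well below 1 indicate higher coverage variance or more softclipped bases in the cluster than in the reference.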

 example output of softclips:
 ~~~

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Mon, 27 Jul 2020 09:49:33 -0000

--- v40
+++ v41
@@ -151,6 +151,14 @@
 Finally, we distinguish between a contig-CUSCO (poly-N tracts, i.e. gaps, between the two flanking sequences are not tolerated) and a scaffold-CUSCO (poly-N tracts, i.e. gaps, are tolerated between the two flanking sequences).

 # piRNA cluster alignments
+
+## Obtaining a bed file from BUSCO
+While cusco.py can output a bed file of piRNA cluster annotations, we also need a bed file of reference annotations (e.g. genes). We can build it from the tsv file in the BUSCO output with the following Unix command.
+
+**making bed from BUSCO tsv**
+~~~
+cat full_table_output.tsv|awk 'NR>5'|awk '$2=="Complete"'|awk '{print $3"\t"$4-1"\t"$5-1"\t"$1}' > completeBuscos.bed
+~~~
 ##cluster-coverage.py
 This script obtains the coverage from mapping results with a bed file of piRNA clusters (cusco output) and a minimal bed file with a mandatory fourth column containing the names of reference annotations.
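
For readers who prefer Python, the awk pipeline added in this revision for making a bed file from the BUSCO tsv can be mirrored by the following minimal sketch. The function name and the sample rows are hypothetical. Like the awk command, it skips the first five lines, keeps rows whose second column is "Complete", and subtracts 1 from both the start and the end coordinate.

```python
def busco_tsv_to_bed(lines):
    """Convert BUSCO full-table lines into bed rows for complete BUSCOs."""
    bed_rows = []
    for line_number, line in enumerate(lines, start=1):
        if line_number <= 5:  # same as awk 'NR>5'
            continue
        fields = line.split()
        # Same filter as awk '$2=="Complete"'; rows with too few columns
        # (e.g. missing BUSCOs) are skipped as well.
        if len(fields) < 5 or fields[1] != "Complete":
            continue
        busco_id, contig = fields[0], fields[2]
        start, end = int(fields[3]) - 1, int(fields[4]) - 1
        bed_rows.append(f"{contig}\t{start}\t{end}\t{busco_id}")
    return bed_rows


# Hypothetical input: five header lines followed by two data rows.
sample = [
    "# header 1",
    "# header 2",
    "# header 3",
    "# header 4",
    "# header 5",
    "EOG09150009\tComplete\t2R_RaGOO\t1001\t2000\t500.1\t1000",
    "EOG0915000A\tMissing",
]
bed_rows = busco_tsv_to_bed(sample)
print("\n".join(bed_rows))
```

With a real table, the lines could be read from full_table_output.tsv and the result written to completeBuscos.bed, matching the file names used in the awk command.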

Manual modified by Filip Wierzbicki

Filip Wierzbicki — Mon, 08 Jun 2020 12:46:49 -0000

--- v39
+++ v40
@@ -74,7 +74,7 @@

 ~~~~~~
 :::bash
-python cuscoquality/cusco.py --pic pirnaclusters.txt --polyn polyn.bed --sam align-longreads.sam
+python cuscoquality/cusco.py --pic pirnaclusters.txt --polyn polyn.bed --sam align-longreads.sam --output-cb cluster.bed
 ~~~~~~