Recent changes to CUSCO

CUSCO modified by Filip Wierzbicki

Filip Wierzbicki — Mon, 08 Jun 2020 12:50:54 -0000

--- v26
+++ v27
@@ -150,9 +150,6 @@
 Interestingly, about 15% of the clusters were assembled due to a scaffolding step (Hi-C or reference based), as can be seen from the difference between the contig-CUSCO and the scaffold-CUSCO. These clusters assembled by scaffolding contain gaps of unknown size and are thus only of limited use for a genomic analysis of piRNA clusters.

-
-
-
 Next we compute the CUSCO of the short read based assembly

 ~~~~~~
@@ -219,3 +216,30 @@

 **Note** that filtering for gaps of unknown size (100 'N' characters) had no effect on the CUSCO of the long read based assembly but it affected the CUSCO of the short read based assembly. In particular it elevated the contig-CUSCO from 4.71 to 5.88. Hence all N characters in the ABySS assembly are either sequencing errors or gaps of known size.

+
+## Obtaining piRNA cluster annotations in a bed file
+
+Using the optional parameter '--output-cb' for the output path, we can write a bed file of assembled piRNA clusters.
+~~~
+python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-longreads.bed --sam align-longreads.sam --output-cb cluster-longreads.bed
+~~~
+
+**output**
+
+example output:
+~~~
+2R_RaGOO        7069608 7348719 1       1000
+X_RaGOO 21889324        21926571        2       1000
+2L_RaGOO        20346270        20397257        5       1000
+3L_RaGOO        23862896        23921609        6       1000
+X_RaGOO 21996274        23500472        8_9     0
+~~~
+
+Columns are separated by tabs:
+
+* col1: chromosome/scaffold/contig
+* col2: start position
+* col3: end position
+* col4: piRNA cluster ID
+* col5: gap status (1000 = gapless; 0 = gapped)
+
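The five-column layout described above is straightforward to post-process. Below is a minimal Python sketch, assuming whitespace-separated fields as in the example output. `ClusterRecord` and `read_cluster_bed` are hypothetical helper names and are not part of the CUSCOquality package.

```python
from typing import Iterable, List, NamedTuple


class ClusterRecord(NamedTuple):
    """One line of the five-column cluster bed file (hypothetical helper type)."""
    seqid: str       # col1: chromosome/scaffold/contig
    start: int       # col2: start position
    end: int         # col3: end position
    cluster_id: str  # col4: piRNA cluster ID
    gapless: bool    # col5: 1000 = gapless, 0 = gapped


def read_cluster_bed(lines: Iterable[str]) -> List[ClusterRecord]:
    """Parse cluster bed lines; blank lines are skipped."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seqid, start, end, cluster_id, status = line.split()
        records.append(
            ClusterRecord(seqid, int(start), int(end), cluster_id, status == "1000")
        )
    return records


# Three lines taken from the example output above
example = (
    "2R_RaGOO\t7069608\t7348719\t1\t1000\n"
    "X_RaGOO\t21889324\t21926571\t2\t1000\n"
    "X_RaGOO\t21996274\t23500472\t8_9\t0\n"
)
records = read_cluster_bed(example.splitlines())
gapless_ids = [r.cluster_id for r in records if r.gapless]
print(gapless_ids)
```

To read a real file, the same function can be given an open file handle, for example `read_cluster_bed(open("cluster-longreads.bed"))`.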

CUSCO modified by Robert Kofler

Robert Kofler — Mon, 23 Mar 2020 15:25:40 -0000

CUSCO modified by Robert Kofler

Robert Kofler — Mon, 23 Mar 2020 15:25:06 -0000

--- v24
+++ v25
@@ -7,15 +7,15 @@

 * bwa (alignment algorithm)
 * Python3
-* CUSCOquality scripts (either download and unpack the provided zipped file or use zip subversion, which is the recommended approach)
+* CUSCOquality scripts (use subversion to download the script; go to the Code tab, and follow the instructions)

 # Preparatory work

 ## Identify pairs of sequences flanking piRNA clusters
 Identify pairs of sequences flanking piRNA clusters and store the sequences in a fasta file.
-Ideally these sequences should align unambiguously in a genome.
+Ideally these sequences should align unambiguously to a genome.
 We provide an example file for *D.melanogaster* https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/neighboring_flanks.fasta
-This file for example contains the entries "Pld" and "jing" which flank a piRNA cluster on chromosome 2R:
+For example, this file contains the entries "Pld" and "jing" which flank a piRNA cluster on chromosome 2R:

 ~~~~~~
 >Pld
@@ -40,7 +40,7 @@

 We also need a piRNA cluster definition file, which contains for each cluster the position in the reference, the length and the names of the two flanking sequences.
 We provide an example file: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/pic_flanks_mainChr
-**Note** that the names in this piRNA cluster definition file need to exactly match the names in the fasta shown above.
+**Note** that the IDs of the flanking sequences in this piRNA cluster definition file need to exactly match the IDs in the fasta-file shown above.

 ~~~~~
 1 arm_2R 2144349 2386719 Pld 2142350 2142451 + jing 2390544 2391193 +
@@ -56,11 +56,11 @@
 * col2: reference chromosome of piRNA cluster
 * col3: start position of piRNA cluster with respect to the reference chromosome
 * col4: end position of piRNA cluster with respect to the reference chromosome
-* col5: name of the flanking sequence to the left of the piRNA cluster 
+* col5: ID of the flanking sequence to the left of the piRNA cluster 
 * col6: start position of the left flank
 * col7: end position of the left flank
 * col8: strand of the left flank
-* col9: name of the flanking sequence to the right of the piRNA cluster
+* col9: ID of the flanking sequence to the right of the piRNA cluster
 * col10: start position of the right flank
 * col11: end position of the right flank 
 * col12: strand of the right flank
@@ -80,7 +80,7 @@
 * sequences flanking piRNA clusters: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/neighboring_flanks.fasta
 * piRNA cluster definition file: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/pic_flanks_mainChr

-Download these four files into a separate folder
+Download these four files into a folder.

 ## Align the flanking sequences to your assembly

@@ -101,7 +101,8 @@

 ## Identify poly-N tracts in assembly

-We need to find the position of poly-N tracts in the assemblies. In the CUSCOquality package we included a script for this task.
+We need to find the position of poly-N tracts in the assemblies. 
+In the CUSCOquality package we included a script for this task.

 ~~~~~
 python cuscoquality/find-polyN.py --fasta CantonS_abyss_kmer96-contigs.fasta > polyn-shortreads.bed
@@ -146,7 +147,7 @@
 scaffold-CUSCO 95.29  (81/85)
 ~~~~~
 **Note** that the long read based assembly has very high CUSCO values, hence a large fraction of the piRNA clusters is contiguously assembled.
-Interestingly about 10% of the clusters were assembled due to a scaffolding step (Hi-C or reference based) as can be seen from the difference between the contig-CUSCO and the scaffold-CUSCO. These clusters assembled by scaffolding contain gaps of unknown size and are thus solely of limited use for a genomic analysis of piRNA clusters. 
+Interestingly, about 15% of the clusters were assembled due to a scaffolding step (Hi-C or reference based), as can be seen from the difference between the contig-CUSCO and the scaffold-CUSCO. These clusters assembled by scaffolding contain gaps of unknown size and are thus only of limited use for a genomic analysis of piRNA clusters.



@@ -174,15 +175,15 @@
 ~~~~~

 **Note** that the CUSCO values of short-read based assemblies are very small.
-**Note** Interestingly some piRNA clusters of the short read based assembly contain poly-N sequences and thus the contig- and the scaffold-CUSCO are not identical. This may be due to very short N-tracts, possibly even of size 1nt (hence a single N).  As suche small N-tracts may represent sequencing errors rather than gaps of unknown size the poly-N file may be filtered for poly-N sequences of a certain size before computing the CUSCO (see next section).
-
-# Filtering the poly-N file for gaps of unknown size 
+**Note** Interestingly, some piRNA clusters of the short read based assembly contain poly-N sequences and thus the contig- and the scaffold-CUSCO are not identical. This may be due to very short N-tracts, possibly even of size 1 nt (hence a single N). As such small N-tracts may represent sequencing errors rather than gaps of unknown size, the poly-N file may be filtered for tracts having a minimum size or a size of exactly 100 before computing the CUSCO (see next section).
+
+## Filtering the poly-N file for gaps of unknown size 
 Some users may want to consider solely gaps of size 100 bp, i.e. 100 'N' characters, because 100 Ns are commonly used to denote gaps of unknown size.
-This can be easily achieved with following command
-
-~~~~~
-cat polyn-longreads.bed|awk '$5==100' > polyn-100-lr.bed
-cat polyn-shortreads.bed|awk '$5==100' > polyn-100-sr.bed
+This can be easily achieved with the following commands:
+
+~~~~~
+cat polyn-longreads.bed | awk '$5==100' > polyn-100-lr.bed
+cat polyn-shortreads.bed | awk '$5==100' > polyn-100-sr.bed
 ~~~~~
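For readers without awk, the same filter can be sketched in Python. This assumes the five-column poly-N bed layout written by find-polyN.py, with the tract length in col5. `filter_polyn` is a hypothetical helper, and the last example line (a 1 nt tract) is invented for illustration.

```python
from typing import Iterable, List


def filter_polyn(lines: Iterable[str], size: int = 100) -> List[str]:
    """Keep bed lines whose fifth column equals `size` (roughly like awk '$5==100')."""
    kept = []
    for line in lines:
        fields = line.split()
        # skip blank or truncated lines and lines with a non-integer col5
        if len(fields) < 5 or not fields[4].isdigit():
            continue
        if int(fields[4]) == size:
            kept.append(line.rstrip("\n"))
    return kept


polyn = [
    "Chr0_RaGOO\t118273\t118373\tpoly-n\t100",
    "Chr0_RaGOO\t210349\t210449\tpoly-n\t100",
    "Chr0_RaGOO\t300000\t300001\tpoly-n\t1",  # invented 1 nt tract, not from the data
]
kept = filter_polyn(polyn)
print(len(kept))
```

Passing an open file handle, for example `filter_polyn(open("polyn-shortreads.bed"))`, applies the filter to a real poly-N file. A different `size` can be chosen if an assembler marks gaps of unknown size with another number of Ns.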

CUSCO modified by Florian Schwarz

Florian Schwarz — Thu, 19 Mar 2020 15:20:21 -0000

--- v23
+++ v24
@@ -125,7 +125,7 @@



-## compute the CUSCO values
+## Compute the CUSCO values

 We first compute the CUSCO for the long read based assembly

CUSCO modified by Robert Kofler

Robert Kofler — Thu, 19 Mar 2020 15:18:33 -0000

--- v22
+++ v23
@@ -108,10 +108,19 @@
 python cuscoquality/find-polyN.py --fasta subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta > polyn-longreads.bed
 ~~~~~

-an example output
-~~~~~~
-
-~~~~~~
+an example output (.bed)
+~~~~~~
+Chr0_RaGOO 118273  118373  poly-n  100
+Chr0_RaGOO 210349  210449  poly-n  100
+Chr0_RaGOO 275577  275677  poly-n  100
+...
+~~~~~~
+
+* col1: the reference chromosome
+* col2: start position of the poly-N tract
+* col3: end position of the poly-N tract
+* col4: name of the reported feature 
+* col5: length of the poly-N tract (usually 100 is used for gaps of unknown size)

CUSCO modified by Florian Schwarz

Florian Schwarz — Thu, 19 Mar 2020 15:12:32 -0000

--- v21
+++ v22
@@ -137,7 +137,7 @@
 scaffold-CUSCO 95.29  (81/85)
 ~~~~~
 **Note** that the long read based assembly has very high CUSCO values, hence a large fraction of the piRNA clusters is contiguously assembled.
-Interestingly about 10% of the clusters were assembled due to a scaffolding step (Hi-C or reference based) as can be seen from the differnce between the contig-CUSCO and the scaffold-CUSCO. These clusters assembled by scaffolding contain gaps of unknown size and are thus solely of limited use for a genomic analyis of piRNA clusters. 
+Interestingly, about 10% of the clusters were assembled due to a scaffolding step (Hi-C or reference based), as can be seen from the difference between the contig-CUSCO and the scaffold-CUSCO. These clusters assembled by scaffolding contain gaps of unknown size and are thus only of limited use for a genomic analysis of piRNA clusters.

CUSCO modified by Florian Schwarz

Florian Schwarz — Thu, 19 Mar 2020 15:06:53 -0000

--- v20
+++ v21
@@ -40,7 +40,7 @@

 We also need a piRNA cluster definition file, which contains for each cluster the position in the reference, the length and the names of the two flanking sequences.
 We provide an example file: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/pic_flanks_mainChr
-**Note** that the names in this piRNA cluster defintion file need to exactly match the names in the fasta shown above.
+**Note** that the names in this piRNA cluster definition file need to exactly match the names in the fasta shown above.

 ~~~~~
 1 arm_2R 2144349 2386719 Pld 2142350 2142451 + jing 2390544 2391193 +

CUSCO modified by Florian Schwarz

Florian Schwarz — Thu, 19 Mar 2020 15:03:26 -0000

--- v19
+++ v20
@@ -1,6 +1,6 @@
 [TOC]
 # Introduction
-CUSCO estimates the fraction of contigously assembled piRNA clusters. For estimating the CUSCO we need to first identify sequences flanking piRNA clusters.
+CUSCO estimates the fraction of contiguously assembled piRNA clusters. For estimating the CUSCO we need to first identify sequences flanking piRNA clusters.
 Hence we need an annotation of piRNA clusters and a reference genome for estimating the CUSCO.
 ## Requirements
 For this walkthrough you need to install the following

CUSCO modified by Robert Kofler

Robert Kofler — Thu, 19 Mar 2020 14:19:34 -0000

--- v18
+++ v19
@@ -13,7 +13,7 @@

 ## Identify pairs of sequences flanking piRNA clusters
 Identify pairs of sequences flanking piRNA clusters and store the sequences in a fasta file.
-Ideally these sequences should align unambiguously in a genome and have a  length between 100-1000bp.
+Ideally these sequences should align unambiguously in a genome.
 We provide an example file for *D.melanogaster* https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/neighboring_flanks.fasta
 This file for example contains the entries "Pld" and "jing" which flank a piRNA cluster on chromosome 2R:

@@ -47,6 +47,7 @@
 2 arm_X 21392175 21431907 Cyp6t1 21388576 21390165 - 2_right 21432585 21433046 +
 5 arm_2L 20148259 20227581 5_left 20147735 20148026 + 5_right 20227609 20228228 +
 6 arm_3L 23273964 23314199 nAChRalpha4 23271316 23271562 + alpha-Cat 23318742 23318923 -
+...
 ~~~~~

 Columns are separated by tabs:
@@ -90,12 +91,13 @@
 bwa index subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta
 ~~~~~~

-Next we align the flanking sequences to each assembly
+Next we align the flanking sequences to each assembly using, for example, 'bwa bwasw' or 'bwa mem'.

 ~~~~~~
 bwa bwasw CantonS_abyss_kmer96-contigs.fasta neighboring_flanks.fasta > align-shortreads.sam
 bwa bwasw subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta  neighboring_flanks.fasta > align-longreads.sam
 ~~~~~~
+

 ## Identify poly-N tracts in assembly

CUSCO modified by Robert Kofler

Robert Kofler — Thu, 19 Mar 2020 13:26:30 -0000

--- v17
+++ v18
@@ -205,3 +205,5 @@
 ~~~~~

+**Note** that filtering for gaps of unknown size (100 'N' characters) had no effect on the CUSCO of the long read based assembly but it affected the CUSCO of the short read based assembly. In particular it elevated the contig-CUSCO from 4.71 to 5.88. Hence all N characters in the ABySS assembly are either sequencing errors or gaps of known size.
+
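The CUSCO values quoted on this page are percentages of contiguously assembled clusters; for example, 95.29 corresponds to 81 of 85 clusters. Below is a small sketch of that arithmetic. `cusco_percent` is a hypothetical helper, and the counts of 4 and 5 out of 85 for the contig-CUSCO values 4.71 and 5.88 are inferred from the percentages, not stated in the source.

```python
def cusco_percent(n_assembled: int, n_total: int) -> float:
    """Percentage of clusters counted as contiguously assembled."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_assembled / n_total


# scaffold-CUSCO of the long read based assembly: 81 of 85 clusters
scaffold_lr = f"{cusco_percent(81, 85):.2f}"
# contig-CUSCO of the short read based assembly before and after filtering
# (counts inferred from the reported percentages, see the lead-in)
contig_sr_before = f"{cusco_percent(4, 85):.2f}"
contig_sr_after = f"{cusco_percent(5, 85):.2f}"
print(scaffold_lr, contig_sr_before, contig_sr_after)  # 95.29 4.71 5.88
```

Under this reading, filtering the poly-N file changed the status of a single cluster in the short read based assembly.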