CUSCO estimates the fraction of contiguously assembled piRNA clusters. For estimating the CUSCO we need to first identify sequences flanking piRNA clusters.
Hence we need an annotation of piRNA clusters and a reference genome for estimating the CUSCO.
For this walkthrough you need to install the following
Identify pairs of sequences flanking piRNA cluster and store the sequences in a fasta file.
Ideally these sequences should align unambiguously to the a genome.
We provide an example file for D.melanogaster https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/neighboring_flanks.fasta
For example, this file contains the entries "Pld" and "jing" which flank a piRNA cluster on chromosome 2R:
>Pld
GGTTACTTGGTCGACCTGCCATTGGAATTTCTGAACAAGGAGGTTCTCACGCCGCCTGGA
ACTAGTAAGGAGGGCCTAATCCCTACCTCTGTATGGACATAG
>jing
ATGTCAGTCACTCAGCCAAAGGATACGGCATTAAAGACCAAGGAGTCGGCAGCTGAAGTA
GCAGCCCCTCTGGCCCCACTCTCTGTCAAAACAGCAGGAGCAACGGGACGGAAAACTTTA
ACCTCCAGCGCGGCTTTGTCACTGTTTGATCAGCTGAAAAATAGCGTCAATACAAACAGT
TTGACAATTGGCGCCGGCGTCGGCAACAACAGTAGCCCAGAAGCCACACCACCTATAACA
GCACCAGCCAGTACAACCACCGCCTCCCCGATACTTACTCCCAAAAGCCCACCACCCACA
CCACCCATCAACAAAAGCCCGTCTCTGTCCTCCAATATTGAACTGAAGCCCCCGGCCAAA
CCCGCTCGGCCCTTTACCTCGCCCAGCACTATTGGAATCGTGCAGGGTACCAAACGGGGG
GTGGGCGGAGTTTTTGGCGGGTTTGGAGCCCAAGCCGCCAAATTGGATATAAACGCATTG
CGATCTCAGCTCTACCAGGGCGCCAGGAAGACTGCAACAACATCTCGAGTGGGAGTCGGC
AAAACAACAACGGTGGCAACAGCCCCAAGAACAACTGGAGGAGAACGTGGCACAAAGGGG
TCCAAGAAGAGATGTCTAGATCGGTACGACTCGTCGGAGTCATCAGACAG
We also need a piRNA cluster definition file, which contains for each cluster the position in the reference, the length and the names of the two flanking sequences.
We provide an example file: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/pic_flanks_mainChr
Note that the IDs of the flanking sequences in this piRNA cluster definition file need to exactly match the IDs in the fasta-file shown above.
1 arm_2R 2144349 2386719 Pld 2142350 2142451 + jing 2390544 2391193 +
2 arm_X 21392175 21431907 Cyp6t1 21388576 21390165 - 2_right 21432585 21433046 +
5 arm_2L 20148259 20227581 5_left 20147735 20148026 + 5_right 20227609 20228228 +
6 arm_3L 23273964 23314199 nAChRalpha4 23271316 23271562 + alpha-Cat 23318742 23318923 -
...
Columns are separated by tabs:
Note we provide the flanking sequences and the piRNA cluster definition file for D. melanogaster. We invite the community to submit these two files for other organsims. We will host them on our webpage under: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/
For this walkthrough we will estimate the quality of two assemblies:
Furthermore we need the sequences flanking piRNA clusters and the piRNA cluster definition file mentioned above:
Download these four files into a folder.
For aligning reads with bwa we first need to generate an index for the two assemblies
bwa index CantonS_abyss_kmer96-contigs.fasta
bwa index subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta
Next we align the flanking sequences to each assembly using for example 'bwa bwasw' or 'bwa mem'
bwa bwasw CantonS_abyss_kmer96-contigs.fasta neighboring_flanks.fasta > align-shortreads.sam
bwa bwasw subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta neighboring_flanks.fasta > align-longreads.sam
We need to find the position of poly-N tracts in the assemblies.
In the CUSCOquality package we included a script for this task.
python cuscoquality/find-polyN.py --fasta CantonS_abyss_kmer96-contigs.fasta > polyn-shortreads.bed
python cuscoquality/find-polyN.py --fasta subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta > polyn-longreads.bed
an example output (.bed)
Chr0_RaGOO 118273 118373 poly-n 100
Chr0_RaGOO 210349 210449 poly-n 100
Chr0_RaGOO 275577 275677 poly-n 100
...
We first compute the CUSCO for the long read based assembly
:::bash
python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-longreads.bed --sam align-longreads.sam
and obtain
:::bash
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85
CUSCO
contig-CUSCO 78.82 (67/85)
scaffold-CUSCO 95.29 (81/85)
Note that the long read based assembly has very high CUSCO values, hence a large fraction of the piRNA clusters is contiguously assembled.
Interestingly about 15% of the clusters were assembled due to a scaffolding step (Hi-C or reference based) as can be seen from the difference between the contig-CUSCO and the scaffold-CUSCO. These clusters assembled by scaffolding contain gaps of unknown size and are thus solely of limited use for a genomic analysis of piRNA clusters.
Next we compute the CUSCO of the short read based assembly
:::bash
python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-shortreads.bed --sam align-shortreads.sam
and obtain
:::bash
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85
CUSCO
contig-CUSCO 4.71 (4/85)
scaffold-CUSCO 5.88 (5/85)
Note that the CUSCO values of short-read based assemblies are very small.
Note Interestingly some piRNA clusters of the short read based assembly contain poly-N sequences and thus the contig- and the scaffold-CUSCO are not identical. This may be due to very short N-tracts, possibly even of size 1nt (hence a single N). As suche small N-tracts may represent sequencing errors rather than gaps of unknown size the poly-N file may be filtered for tracts having a minimum size or a size of exactly 100 before computing the CUSCO (see next section).
Some users may want to consider solely gaps of size 100bp, i.e 100 'N' characters, because 100 Ns are commonly used to denote gaps of unknown size.
This can be easily achieved with the following command
cat polyn-longreads.bed | awk '$5==100' > polyn-100-lr.bed
cat polyn-shortreads.bed | awk '$5==100' > polyn-100-sr.bed
Now we may recompute the CUSCO for the long read based assembly
python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-100-lr.bed --sam align-longreads.sam
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85
CUSCO
contig-CUSCO 78.82 (67/85)
scaffold-CUSCO 95.29 (81/85)
and the short read based assembly
python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-100-sr.bed --sam align-shortreads.sam
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85
CUSCO
contig-CUSCO 5.88 (5/85)
scaffold-CUSCO 5.88 (5/85)
Note that filtering for gaps of unknown size (100 'N' characters) had no effect on the CUSCO of the long read based assembly but it affected the CUSCO of the short read based assembly. In particular it elevated the contig-CUSCO from 4.71 to 5.88. Hence all N characters in the ABySS assembly are either sequencing errors or gaps of known size
Using the optional parameter '--output-cb' for the output path, we can write a bed file of assembled piRNA clusters.
python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-longreads.bed --sam align-longreads.sam --output-cb cluster-longreads.bed
output
example output:
2R_RaGOO 7069608 7348719 1 1000
X_RaGOO 21889324 21926571 2 1000
2L_RaGOO 20346270 20397257 5 1000
3L_RaGOO 23862896 23921609 6 1000
X_RaGOO 21996274 23500472 8_9 0
Columns are separated by tabs: