CUSCOquality Wiki

Brought to you by: filwierz, fschwarz, rokofler

CUSCO

Introduction
- Requirements
Preparatoy work
- Identify pairs of sequences flanking piRNA clusters
- Create a piRNA cluster definition file
Estimating the CUSCO

Introduction

CUSCO estimates the fraction of contiguously assembled piRNA clusters. For estimating the CUSCO we need to first identify sequences flanking piRNA clusters.
Hence we need an annotation of piRNA clusters and a reference genome for estimating the CUSCO.

Requirements

For this walkthrough you need to install the following

bwa (alignment alogrithm)
Python3
CUSCOquality scripts (use subversion to download the script; go to the Code tab, and follow the instructions)

Preparatoy work

Identify pairs of sequences flanking piRNA clusters

Identify pairs of sequences flanking piRNA cluster and store the sequences in a fasta file.
Ideally these sequences should align unambiguously to the a genome.
We provide an example file for D.melanogaster https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/neighboring_flanks.fasta
For example, this file contains the entries "Pld" and "jing" which flank a piRNA cluster on chromosome 2R:

>Pld
GGTTACTTGGTCGACCTGCCATTGGAATTTCTGAACAAGGAGGTTCTCACGCCGCCTGGA
ACTAGTAAGGAGGGCCTAATCCCTACCTCTGTATGGACATAG
>jing
ATGTCAGTCACTCAGCCAAAGGATACGGCATTAAAGACCAAGGAGTCGGCAGCTGAAGTA
GCAGCCCCTCTGGCCCCACTCTCTGTCAAAACAGCAGGAGCAACGGGACGGAAAACTTTA
ACCTCCAGCGCGGCTTTGTCACTGTTTGATCAGCTGAAAAATAGCGTCAATACAAACAGT
TTGACAATTGGCGCCGGCGTCGGCAACAACAGTAGCCCAGAAGCCACACCACCTATAACA
GCACCAGCCAGTACAACCACCGCCTCCCCGATACTTACTCCCAAAAGCCCACCACCCACA
CCACCCATCAACAAAAGCCCGTCTCTGTCCTCCAATATTGAACTGAAGCCCCCGGCCAAA
CCCGCTCGGCCCTTTACCTCGCCCAGCACTATTGGAATCGTGCAGGGTACCAAACGGGGG
GTGGGCGGAGTTTTTGGCGGGTTTGGAGCCCAAGCCGCCAAATTGGATATAAACGCATTG
CGATCTCAGCTCTACCAGGGCGCCAGGAAGACTGCAACAACATCTCGAGTGGGAGTCGGC
AAAACAACAACGGTGGCAACAGCCCCAAGAACAACTGGAGGAGAACGTGGCACAAAGGGG
TCCAAGAAGAGATGTCTAGATCGGTACGACTCGTCGGAGTCATCAGACAG

Create a piRNA cluster definition file

We also need a piRNA cluster definition file, which contains for each cluster the position in the reference, the length and the names of the two flanking sequences.
We provide an example file: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/pic_flanks_mainChr
Note that the IDs of the flanking sequences in this piRNA cluster definition file need to exactly match the IDs in the fasta-file shown above.

1 arm_2R 2144349 2386719 Pld 2142350 2142451 + jing 2390544 2391193 +
2 arm_X 21392175 21431907 Cyp6t1 21388576 21390165 - 2_right 21432585 21433046 +
5 arm_2L 20148259 20227581 5_left 20147735 20148026 + 5_right 20227609 20228228 +
6 arm_3L 23273964 23314199 nAChRalpha4 23271316 23271562 + alpha-Cat 23318742 23318923 -
...

Columns are separated by tabs:

col1: number of piRNA cluster
col2: reference chromosome of piRNA cluster
col3: start position of piRNA cluster with respect to the reference chromosome
col4: end position of piRNA cluster with respect to the reference chromosome
col5: ID of the flanking sequence to the left of the piRNA cluster
col6: start position of the left flank
col7: end position of the left flank
col8: strand of the left flank
col9: ID of the flanking sequence to the right of the piRNA cluster
col10: start position of the right flank
col11: end position of the right flank
col12: strand of the right flank

Note we provide the flanking sequences and the piRNA cluster definition file for D. melanogaster. We invite the community to submit these two files for other organsims. We will host them on our webpage under: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/

Estimating the CUSCO

Obtain the necessary files

For this walkthrough we will estimate the quality of two assemblies:
Furthermore we need the sequences flanking piRNA clusters and the piRNA cluster definition file mentioned above:

an assembly based on short reads: https://sourceforge.net/projects/cuscoquality/files/Walkthrough/Cusco/CantonS_abyss_kmer96-contigs.fasta
an assembly based on long-reads and scaffolding with Hi-C: https://sourceforge.net/projects/cuscoquality/files/Walkthrough/Cusco/subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta
sequences flanking piRNA clusters: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/neighboring_flanks.fasta
piRNA cluster definition file: https://sourceforge.net/projects/cuscoquality/files/CUSCO-data/Dmelanogaster/pic_flanks_mainChr

Download these four files into a folder.

Align the flanking sequences to your assembly

For aligning reads with bwa we first need to generate an index for the two assemblies

bwa index CantonS_abyss_kmer96-contigs.fasta 
bwa index subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta

Next we align the flanking sequences to each assembly using for example 'bwa bwasw' or 'bwa mem'

bwa bwasw CantonS_abyss_kmer96-contigs.fasta neighboring_flanks.fasta > align-shortreads.sam
bwa bwasw subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta  neighboring_flanks.fasta > align-longreads.sam

Identify poly-N tracts in assembly

We need to find the position of poly-N tracts in the assemblies.
In the CUSCOquality package we included a script for this task.

python cuscoquality/find-polyN.py --fasta CantonS_abyss_kmer96-contigs.fasta > polyn-shortreads.bed
python cuscoquality/find-polyN.py --fasta subsample_CantonS_readlength_100x_r2_p2_salsa_curated_100Ns_ragoo.fasta > polyn-longreads.bed

an example output (.bed)

Chr0_RaGOO  118273  118373  poly-n  100
Chr0_RaGOO  210349  210449  poly-n  100
Chr0_RaGOO  275577  275677  poly-n  100
...

col1: the reference chromosome
col2: start position of the poly-N tract
col3: end position of the poly-N tract
col4: name of the reported feature
col5: length of the poly-N tract (usually 100 is used for gaps of unknown size)

Compute the CUSCO values

We first compute the CUSCO for the long read based assembly

:::bash
python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-longreads.bed --sam align-longreads.sam

and obtain

:::bash
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85

CUSCO
contig-CUSCO 78.82 (67/85)
scaffold-CUSCO 95.29  (81/85)

Note that the long read based assembly has very high CUSCO values, hence a large fraction of the piRNA clusters is contiguously assembled.
Interestingly about 15% of the clusters were assembled due to a scaffolding step (Hi-C or reference based) as can be seen from the difference between the contig-CUSCO and the scaffold-CUSCO. These clusters assembled by scaffolding contain gaps of unknown size and are thus solely of limited use for a genomic analysis of piRNA clusters.

Next we compute the CUSCO of the short read based assembly

:::bash
python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-shortreads.bed --sam align-shortreads.sam

and obtain

:::bash
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85

CUSCO
contig-CUSCO 4.71 (4/85)
scaffold-CUSCO 5.88  (5/85)

Note that the CUSCO values of short-read based assemblies are very small.
Note Interestingly some piRNA clusters of the short read based assembly contain poly-N sequences and thus the contig- and the scaffold-CUSCO are not identical. This may be due to very short N-tracts, possibly even of size 1nt (hence a single N). As suche small N-tracts may represent sequencing errors rather than gaps of unknown size the poly-N file may be filtered for tracts having a minimum size or a size of exactly 100 before computing the CUSCO (see next section).

Filtering the poly-N file for gaps of unknown size

Some users may want to consider solely gaps of size 100bp, i.e 100 'N' characters, because 100 Ns are commonly used to denote gaps of unknown size.
This can be easily achieved with the following command

cat polyn-longreads.bed | awk '$5==100' > polyn-100-lr.bed
cat polyn-shortreads.bed | awk '$5==100' > polyn-100-sr.bed

Now we may recompute the CUSCO for the long read based assembly

python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-100-lr.bed --sam align-longreads.sam  
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85

CUSCO
contig-CUSCO 78.82 (67/85)
scaffold-CUSCO 95.29  (81/85)

and the short read based assembly

python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-100-sr.bed --sam align-shortreads.sam
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85

CUSCO
contig-CUSCO 5.88 (5/85)
scaffold-CUSCO 5.88  (5/85)

Note that filtering for gaps of unknown size (100 'N' characters) had no effect on the CUSCO of the long read based assembly but it affected the CUSCO of the short read based assembly. In particular it elevated the contig-CUSCO from 4.71 to 5.88. Hence all N characters in the ABySS assembly are either sequencing errors or gaps of known size

Obtaining piRNA cluster annotations in a bed file

Using the optional parameter '--output-cb' for the output path, we can write a bed file of assembled piRNA clusters.

python cuscoquality/cusco.py --pic pic_flanks_mainChr --polyn polyn-longreads.bed --sam align-longreads.sam --output-cb cluster-longreads.bed

output

example output:

2R_RaGOO        7069608 7348719 1       1000
X_RaGOO 21889324        21926571        2       1000
2L_RaGOO        20346270        20397257        5       1000
3L_RaGOO        23862896        23921609        6       1000
X_RaGOO 21996274        23500472        8_9     0