CUSCOquality Wiki

Brought to you by: filwierz, fschwarz, rokofler

Manual

Introduction

In this section we provide an overview of all scripts necessary for computing the TE quality metrics and the parameters of these scripts.
For examples on how to use these scripts please check out the walkthrough https://sourceforge.net/p/cuscoquality/wiki/Home/#walkthrough

Installation

Installation is recommended by using subversion.

Go to a folder where you would like to install the tool and type the command provided at the code-tab https://sourceforge.net/p/cuscoquality/code

For example:

svn checkout https://svn.code.sf.net/p/cuscoquality/code/ cuscoquality

Requirements

In addition to the CUSCO-quality scripts you need the following

bwa (or any other alignment algorithm http://bio-bwa.sourceforge.net/)
Python3
DeviaTE (easy installation via conda https://github.com/W-L/deviaTE)
numpy and scipy (best installed via anaconda https://www.anaconda.com/distribution/)
minimap2
samtools
R with ggplot2
reference gene annotations (e.g. from BUSCO)

CUSCO

find-polyN.py

This script scans a fasta file for poly-N sequences.
The output is a bed format. The locations of all poly-N sequences are provided (eg. even for single N-characters). These poly-N sequences may than be filtered for a certain size or minium size (see walkthrough [CUSCO]).
Poly-N sequences are required to distinguish between contig-CUSCO and scaffold-CUSCO.

Example call

:::bash
python cuscoquality/find-polyN.py --fasta myassembly.fasta > polyN.bed

Parameters

--fasta: required string; path to a fasta file which that may contain one or several fasta entries (e.g.,one entry for each contig/scaffold). Poly-N stretches will be identified in these sequences.

Output

An example output:

:::bash
Chr0_RaGOO  118273  118373  poly-n  100
Chr0_RaGOO  210349  210449  poly-n  100
Chr0_RaGOO  275577  275677  poly-n  100
...

col1: the chromosome
col2: start position of the poly-N sequence
col3: end position of the poly-N sequence
col4: id of the reported feature, i.e. "poly-n"
col5: length of the poly-N tract

cusco.py

This script computes the CUSCO values, one of our four novel quality metrics. It computes both the contig-CUSCO and the scaffold-CUSCO. If positions of poly-N tracts are not provided it assumes that no 'N' characters occur in an assembly, and hence the contig-CUSCO and the scaffold-CUSCO will be identical.

** example call**

:::bash
python cuscoquality/cusco.py --pic pirnaclusters.txt --polyn polyn.bed --sam align-longreads.sam --output-cb cluster.bed

** parameters **

--sam: required string; path to a sam file containing the mapping results of the sequences flanking the piRNA clusters
--pic: required string: path to a piRNA cluster definition file, which contains the positions of known piRNA clusters and the IDs and positions of sequences flanking the clusters
--poly-n: optional string: path to a file containing the positions of poly-N sequences in the assembly; important for distinguishing between contig- and scaffold-CUSCO; Note if this file is not provided the algorithm assumes that no poly-N sequences are present in the assembly.
--lendev: optional float; maximum allowed length deviation between the observed and the expected piRNA clusters; e.g., --lendev 0.2 means that piRNA clusters only account as correctly assembled if the length of the assembled cluster is within 20% of the length of the reference cluster.
--output-cb: optional string; path to write a bed file of piRNA clusters

input --pic

example input:

:::bash
1 arm_2R 2144349 2386719 Pld 2142350 2142451 + jing 2390544 2391193 +
2 arm_X 21392175 21431907 Cyp6t1 21388576 21390165 - 2_right 21432585 21433046 +
5 arm_2L 20148259 20227581 5_left 20147735 20148026 + 5_right 20227609 20228228 +
6 arm_3L 23273964 23314199 nAChRalpha4 23271316 23271562 + alpha-Cat 23318742 23318923 -
...

Columns are separated by tabs:

col1: number of piRNA cluster
col2: reference chromosome of piRNA cluster
col3: start position of piRNA cluster with respect to the reference chromosome
col4: end position of piRNA cluster with respect to the reference chromosome
col5: ID of the flanking sequence to the left of the piRNA cluster
col6: start position of the left flank
col7: end position of the left flank
col8: strand of the left flank
col9: ID of the flanking sequence to the right of the piRNA cluster
col10: start position of the right flank
col11: end position of the right flank
col12: strand of the right flank

output

example output:

:::bash
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85

CUSCO
contig-CUSCO 4.71 (4/85)
scaffold-CUSCO 5.88  (5/85)

example output bed (--output-cb):

2R_RaGOO        7069608 7348719 1       1000    False
X_RaGOO 21887321        21924568        2       1000    False
3L_RaGOO        27875492        28983531        12      0   False

Columns are separated by tabs:

col1: chromosome/scaffold/contig
col2: start position
col3: end position
col4: piRNA cluster ID
col5: gap status(1000=gapless; 0=gapped)
col6: if reverse complement assembled compared to original annotation (True/False)

The first section provides an overview of the piRNA clusters used for the analyis. Most important is the number of clusters with an error. An error would, for example, occur if the fasta-ID of a sequence flanking a piRNA cluster does not match the ID provided in the cluster definition file.

The final CUSCO section reports the contig-CUSCO and the scaffold-CUSCO (see manuscript). Briefly CUSCO estimates the fraction of contiguously assembled piRNA clusters. A cluster is assumed to be correctly assembled if both sequences flanking a piRNA cluster map to the same contig/scaffold. If the --lendev parameter the size of the cluster is also considered, i.e. the size of the assembled cluster must match the size of the reference cluster to some extent (e.g. --lendev 0.2 allows for 20% size difference).

Finally we distinguish between a contig-CUSCO (poly-N tracts, i.e gaps, between the two flanking sequences are not tolerated) and a scaffold-CUSCO (poly-N tracts, i.e. gaps, are tolerated between the two flanking sequences).

Flank design

flankbeder.sh

This script is the first step during designing flanking sequences for your regions of interest. It uses the annotations of your regions in bed format to generate a annotation file of the corresponding flanks.

example call

bash flankbeder.sh regions-of-interest.bed .

parameters

$1: bed file of regions of interest
$2: output directory

output
The script generates 4 output files in the output directory with the following names:

left_flanks.bed: annotations of left flanks
right_flanks.bed: annotations of right flanks
flanks.bed (left and right annotations concatenated)
IDs: a file containing the flank IDs

flankparser.sh

This script obtains the flanking sequences in fasta format from the reference genome assembly using samtools.

example call

bash flankparser.sh IDs flanks.bed genome.fasta flank-fasta

parameters

$1: a file containing the flank IDs generated by the script 'flankbeder.sh'
$2: annotations of flanks generated by the script 'flankbeder.sh'
$3: referenge genome assembly
$4: output directory of flanking sequences

output
The script writes flanking sequences separately into fasta files into the provided output directory.
In the end it generates a directory named 'resources' were all flanking sequences are written into a single file named 'flanks.fasta'.

flank_validation.py

This script uses a modfied sam file of flanks aligned back to the reference genome assembly. For more details and required commands (see [FlankDesign]).

example call

python flank_validation.py --bed regions-of-interest.bed --modsam resources/sam.mod --inner 100 --outer 5000 > resources/validated.tmp

parameters

--bed: required bed file of the annotations of regions of interest in the reference genome assembly
--modsam: required alignment (see [FlankDesign])
--inner: integer for (left-most) position-tolerance of corresponding flank alignment inside the region of interest
--outer: integer for (left-most) position-tolerance of corresponding flank alignment outside of the region of interest

output
The output file is a temporary file of validated regions. It requires further processing to obtain the final cluster definition file (see [FlankDesign]).

piRNA cluster alignments

Obtaining a bed file from BUSCO

While the cusco.py can output a bed file of piRNA cluster annotations, we need to make another bed file of reference annotations (e.g. genes). We can use the tsv file from the BUSCO output to make the bed file with following unix command.

making bed from BUSCO tsv

cat full_table_output.tsv|awk 'NR>5'|awk '$2=="Complete"'|awk '{print $3"\t"$4-1"\t"$5-1"\t"$1}' > completeBuscos.bed

cluster-coverage.py

This script obtains the coverage from mapping results with a bed file of piRNA clusters (cusco output) and a minimum bed file with a mandatory 4. column with names of reference annotations.

example call

samtools view mapped_reads.sort.bam|python cuscoquality/cluster-coverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster coverage_ml1k_mq15.cluster --output-reference coverage_ml1k_mq15.busco > coverage_ml1k_mq15.summary

parameters

--sam: required string; path to a headerless sam file containing the mapping results
--cluster: required string; path to a bed of containing piRNA cluster annotations from CUSCO
--reference: required string; path to a bed of containing reference annotations with a 4. column with reference names
--min-mq: optional integer; a minimum mapping quality of reads
--min-len: optional integer; a minimum length of reads
--output-cluster: optional string; a path to write the output for piRNA clusters
--output-reference: optional string; a path to write the output for reference

output

example output of the summary on coverage:

Coverage quality: 0.1844878968558935
Note that CQ = var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
Standard deviation of the coverage in piRNA clusters 27.23007883066921
Standard deviation of the coverage in reference regions 11.695878245182904
Average coverage cluster 91.1847304128054
Average coverage reference 100.39635784020722

 Cluster overview
c       clu.ID  std.dev.cov     av.cov. c.score CQ
c       12      29.620228724025186      88.75250532471752       0       0.15591534558796424
c       1       13.213894228892284      98.95333414543266       1000    0.7834369033528322
c       2       10.906112510619122      105.32487650343643      1000    1.1500738529908696

 Reference overview
r       ref.ID  std.dev.cov     av.cov.
r       EOG0915000C     10.094996158497661      99.0874915209863
r       EOG09150009     5.888202047123744       91.99781319651794
r       EOG09150006     5.4546381902835375      117.92787998867817

The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in reference genes and piRNA clusters. Furthermore, it outputs the average coverage and the average standard deviations of coverages for piRNA clusters and reference genes.
Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for colverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of coverage quality which is calculated by a division between the variance of reference genes and the variance of a piRNA cluster.

example output for coverage:

1       2R_RaGOO        7069609 88
1       2R_RaGOO        7069610 88
1       2R_RaGOO        7069611 88
1       2R_RaGOO        7069612 88
1       2R_RaGOO        7069613 88
...

Columns are separated by tabs:

col1: feature ID
col2: chromosome/contig
col3: position
col4: coverage

cluster-coverage-median.py

This script is very similar to the script cluster-coverage.py above. It uses the same parameters as cluster-coverage.py. The only difference is that it uses the median value of standard deviations in reference genes to calculate the quality metric.

cluster-softclipcoverage.py

This script obtains the number of softclipped bases from mapping results with a bed file of piRNA clusters (cusco output) and a minimum bed file with a mandatory 4. column with names of reference annotations.

example call

samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary

parameters

--sam: required string; path to a headerless sam file containing the mapping results
--cluster: required string; path to a bed of containing piRNA cluster annotations from cusco
--reference: required string; path to a bed of containing reference annotations with a 4. column with reference names
--min-mq: optional integer; a minimum mapping quality of reads
--min-len: optional integer; a minimum length of reads
--output-cluster: optional string; a path to write the output for piRNA clusters
--output-reference: optional string; a path to write the output for reference

output

example output of the summary on softclips:

Soft-clip quality = 0.9321079968484184
Note that ScQ = av.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
Average soft-clip coverage in piRNA clusters 5.132141252457175
Average soft-clip coverage in reference regions 4.78370990237099

 Cluster overview
c       clu.ID  av.cov. c.score ScQ
c       2       6.534095790378007       1000    0.7321150555238869
c       1       5.377679927770931       1000    0.889549018651591
c       12      5.02316252120862        0       0.9523303062907836

 Reference overview
r       ref.ID  av.cov
r       EOG09150009     4.854072921485344
r       EOG0915000C     4.77517820804227
r       EOG09150006     4.7295782621001985

The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in reference genes and piRNA clusters. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2). For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 5. column of soft-clip quality which is calculated by a division between the average softclipped bases in reference genes and a piRNA cluster.

example output of softclips:

1       2R_RaGOO        7069609 3
1       2R_RaGOO        7069610 3
1       2R_RaGOO        7069611 3
1       2R_RaGOO        7069612 3
1       2R_RaGOO        7069613 3
...

Columns are separated by tabs:

col1: feature ID
col2: chromosome/contig
col3: position
col4: coverage

cluster-softclipcoverage-median.py

This script is very similar to the script cluster-softclipcoverage.py above. It uses the same parameters as cluster-softclipcoverage.py. The only difference is that it uses the median value of reference genes to calculate the quality metric.

quantiles.py

This script obtains the quantiles using the output for the reference annotations from the two scripts 'cluster-coverage.py' and 'cluster-softclipcoverage.py'.

example calls

python cuscoquality/quantiles.py --coverage coverage_ml1k_mq15.busco > coverage.qt
python cuscoquality/quantiles.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.qt

parameters

--coverage: required string; path to the coverage or softclips of the reference
--sc-coverage: optional argument; only required if softclips are used

output

example output:

c       0.1%    128
c       1%      126
c       5%      121
c       95%     84
c       99%     79
c       99.9%   78

Columns are separated by tabs:

col1: identifier of underlying data (c=coverage; sc=softclips)
col2: interval limit
col3: value

visualize.R

This script obtains a final plot for a piRNA cluster or reference annotation of interst.

*example call

R --vanilla --args 1 coverage.qt softclips.qt genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R

parameters

Positions of arguments need to be in the following order:

integer of the lower limit of the quantile range: 0.1, 1 or 5
string of the path to the quantiles file for the coverage
string of the path to the quantiles file for the softclips
string of the path to the polyN file
string of the path to the coverage of piRNA clusters or reference
string of the path to the softclips of piRNA clusters or reference
name of piRNA cluster or reference annotation of interest

output

The script writes the output in PostScript format using the path string to the softclips and appending '.ps'.

TE landscape metrics

artificial-reads-for-assembly.py

This script generates artificial reads for an assembly.
For each genomic position one read of a given size will be generated, this ensures an uniform coverage along the chromosome (coverage = read length). These artificial reads are then used to estimate the observed TE landscape for an assembly.

** example call**

:::bash
python cuscoquality/artificial-reads-for-assembly.py --assembly myassembly.fasta --read-length 125 > reads-for-assembly.fastq

** parameters**

--assembly: required string; path to a fasta file of the assembly for which artificial reads should be generated
--read-length: required integer; the length of the reads; single-end reads will be generated

output

A fastq file - for example:

:::bash
@1
ATTTGCCAAATATGGCCAAAAAATTTAATTTCCATTTTTGAACACAGTTTGATTGGAAATTTTATTACGAGCTCAGTGAGATATGACATTACATATTGAGACAATTATTTTTTTATGTTGTGGCA
+1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@2
TTTGCCAAATATGGCCAAAAAATTTAATTTCCATTTTTGAACACAGTTTGATTGGAAATTTTATTACGAGCTCAGTGAGATATGACATTACATATTGAGACAATTATTTTTTTATGTTGTGGCAA
+2
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

te-landscape-metrics.py

This script computes the TE landscape metrics, three of our novel quality metrics.
It estimates whether the TE landscape is accurately captured by an assembly, i.e. i) whether the TE abundance is accurately represented, ii) whether all SNPs within TEs are present in an assembly, and iii) whether all internal deletions of a TE are present in an assembly.

example call

:::bash
python cuscoquality/te-landscape-metrics.py --observed observed.raw --expected expected.raw --min-count 2 --min-freq 0.02 --min-cov 0.05

** parameters**

--observed: required string; path to a DeviaTE file containing for each TE family the abundance, SNPs and internal deletions; represents the observed TE landscape, ie. the abundance and diversity of TEs found in an assembly
--expected: required string; path to a DeviaTE file containing for each TE family the abundance, SNPs and internal deletions; represents the expected TE landscape, derived from Illumina raw reads
--min-count: optional integer; default = 2; the minimum allele counts for calling SNPs or IDs; necessary to avoid sequencing errors
--min-freq: optional float; default = 0.02; the minimum allele frequency for calling SNPs or IDs; necessary to avoid sequencing errors
--min-cov: optional float; default = 0.05; the coverage fluctuates along TE sequences, especially for old and degraded TE families. Sometimes a SNP is solely supported by a few reads in regions with a low coverage. To avoid using SNPs/IDs from such unreliable regions, this parameter allows to restrict SNP and ID calls to regions with a suitable coverage. E.g. --min-cov 0.05 means that only regions with a coverage of 5% of the average coverage of a particular TE family are considered. Coverage differences between expected and observed values can cause biases if --min-cov is 0.
--print-raw-data: optional flag; switches to optional output, i.e. the raw data used for computing the correlation between expected and observed values are reported

output landscape metrics

Default output is our three TE landscape metrics, i.e. the slope of the regression between the expected and the observed TE landscape. We provide the slope for the TE abundance, the number of SNPs within a TE family and the number of IDs within a TE family.
For example, a slope of 1 means that the assembly perfectly captures the TE abundance, a slope <1 means that TEs are underrepresented in an assembly and a slope >1 means that the TE abundance is overestimated in an assembly.

Example output:

:::bash
TE landscape metrics

TE abundance - slope 1.0063
SNP count  - slope 1.0297
ID count   - slope 1.0672

output raw data

this output is obtained the optional flag --print-raw-data was provided

:::bash
DMRER2DM    8349.978057024851   340 8   8771.11808385376    372 5
RT1C    2716.1252999699536  2079    30  2788.888785939409   2130    41
DIVER2  5113.954950839485   1125    33  5641.890622162796   1174    33
DMPOGOR11   3071.2622164669892  23  13  2533.667445009262   37  14
STALKER3    442.37612212273575  56  1   587.144605053864    65  0
...

col1: the TE family
col2: expected abundance of the TE family
col3: expected abundance of SNPs for the TE family
col4: expected abundance of internal deletions (ID) for the TE family
col5: observed abundance of the TE family
col6: observed abundance of SNPs for the TE family
col7: observed abundance of internal deletions (ID) for the TE family

Wiki: CUSCO
Wiki: FlankDesign
Wiki: Home

CUSCOquality Wiki

Manual

Introduction

Installation

Requirements

CUSCO

find-polyN.py

cusco.py

Flank design

flankbeder.sh

flankparser.sh

flank_validation.py

piRNA cluster alignments

Obtaining a bed file from BUSCO

cluster-coverage.py

cluster-coverage-median.py

cluster-softclipcoverage.py

cluster-softclipcoverage-median.py

quantiles.py

visualize.R

TE landscape metrics

artificial-reads-for-assembly.py

te-landscape-metrics.py

Related