In this section we provide an overview of all scripts necessary for computing the TE quality metrics and the parameters of these scripts.
For examples on how to use these scripts please check out the walkthrough https://sourceforge.net/p/cuscoquality/wiki/Home/#walkthrough
Installation is recommended by using subversion.
Go to a folder where you would like to install the tool and type the command provided at the code-tab https://sourceforge.net/p/cuscoquality/code
For example:
svn checkout https://svn.code.sf.net/p/cuscoquality/code/ cuscoquality
In addition to the CUSCO-quality scripts you need the following
This script scans a fasta file for poly-N sequences.
The output is a bed format. The locations of all poly-N sequences are provided (eg. even for single N-characters). These poly-N sequences may than be filtered for a certain size or minium size (see walkthrough [CUSCO]).
Poly-N sequences are required to distinguish between contig-CUSCO and scaffold-CUSCO.
Example call
:::bash
python cuscoquality/find-polyN.py --fasta myassembly.fasta > polyN.bed
Parameters
Output
An example output:
:::bash
Chr0_RaGOO 118273 118373 poly-n 100
Chr0_RaGOO 210349 210449 poly-n 100
Chr0_RaGOO 275577 275677 poly-n 100
...
This script computes the CUSCO values, one of our four novel quality metrics. It computes both the contig-CUSCO and the scaffold-CUSCO. If positions of poly-N tracts are not provided it assumes that no 'N' characters occur in an assembly, and hence the contig-CUSCO and the scaffold-CUSCO will be identical.
** example call**
:::bash
python cuscoquality/cusco.py --pic pirnaclusters.txt --polyn polyn.bed --sam align-longreads.sam --output-cb cluster.bed
** parameters **
input --pic
example input:
:::bash
1 arm_2R 2144349 2386719 Pld 2142350 2142451 + jing 2390544 2391193 +
2 arm_X 21392175 21431907 Cyp6t1 21388576 21390165 - 2_right 21432585 21433046 +
5 arm_2L 20148259 20227581 5_left 20147735 20148026 + 5_right 20227609 20228228 +
6 arm_3L 23273964 23314199 nAChRalpha4 23271316 23271562 + alpha-Cat 23318742 23318923 -
...
Columns are separated by tabs:
output
example output:
:::bash
General info:
Tested cluster 85
Cluster with error 0 (i.e. a flanking sequence was not mapped)
Valid cluster 85
CUSCO
contig-CUSCO 4.71 (4/85)
scaffold-CUSCO 5.88 (5/85)
example output bed (--output-cb):
2R_RaGOO 7069608 7348719 1 1000 False
X_RaGOO 21887321 21924568 2 1000 False
3L_RaGOO 27875492 28983531 12 0 False
Columns are separated by tabs:
The first section provides an overview of the piRNA clusters used for the analyis. Most important is the number of clusters with an error. An error would, for example, occur if the fasta-ID of a sequence flanking a piRNA cluster does not match the ID provided in the cluster definition file.
The final CUSCO section reports the contig-CUSCO and the scaffold-CUSCO (see manuscript). Briefly CUSCO estimates the fraction of contiguously assembled piRNA clusters. A cluster is assumed to be correctly assembled if both sequences flanking a piRNA cluster map to the same contig/scaffold. If the --lendev parameter the size of the cluster is also considered, i.e. the size of the assembled cluster must match the size of the reference cluster to some extent (e.g. --lendev 0.2 allows for 20% size difference).
Finally we distinguish between a contig-CUSCO (poly-N tracts, i.e gaps, between the two flanking sequences are not tolerated) and a scaffold-CUSCO (poly-N tracts, i.e. gaps, are tolerated between the two flanking sequences).
This script is the first step during designing flanking sequences for your regions of interest. It uses the annotations of your regions in bed format to generate a annotation file of the corresponding flanks.
example call
bash flankbeder.sh regions-of-interest.bed .
parameters
output
The script generates 4 output files in the output directory with the following names:
This script obtains the flanking sequences in fasta format from the reference genome assembly using samtools.
example call
bash flankparser.sh IDs flanks.bed genome.fasta flank-fasta
parameters
output
The script writes flanking sequences separately into fasta files into the provided output directory.
In the end it generates a directory named 'resources' were all flanking sequences are written into a single file named 'flanks.fasta'.
This script uses a modfied sam file of flanks aligned back to the reference genome assembly. For more details and required commands (see [FlankDesign]).
example call
python flank_validation.py --bed regions-of-interest.bed --modsam resources/sam.mod --inner 100 --outer 5000 > resources/validated.tmp
parameters
output
The output file is a temporary file of validated regions. It requires further processing to obtain the final cluster definition file (see [FlankDesign]).
While the cusco.py can output a bed file of piRNA cluster annotations, we need to make another bed file of reference annotations (e.g. genes). We can use the tsv file from the BUSCO output to make the bed file with following unix command.
making bed from BUSCO tsv
cat full_table_output.tsv|awk 'NR>5'|awk '$2=="Complete"'|awk '{print $3"\t"$4-1"\t"$5-1"\t"$1}' > completeBuscos.bed
This script obtains the coverage from mapping results with a bed file of piRNA clusters (cusco output) and a minimum bed file with a mandatory 4. column with names of reference annotations.
example call
samtools view mapped_reads.sort.bam|python cuscoquality/cluster-coverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster coverage_ml1k_mq15.cluster --output-reference coverage_ml1k_mq15.busco > coverage_ml1k_mq15.summary
parameters
output
example output of the summary on coverage:
Coverage quality: 0.1844878968558935
Note that CQ = var.ref / var.cluster (1=good, 0=bad, >1=exceptional)
Standard deviation of the coverage in piRNA clusters 27.23007883066921
Standard deviation of the coverage in reference regions 11.695878245182904
Average coverage cluster 91.1847304128054
Average coverage reference 100.39635784020722
Cluster overview
c clu.ID std.dev.cov av.cov. c.score CQ
c 12 29.620228724025186 88.75250532471752 0 0.15591534558796424
c 1 13.213894228892284 98.95333414543266 1000 0.7834369033528322
c 2 10.906112510619122 105.32487650343643 1000 1.1500738529908696
Reference overview
r ref.ID std.dev.cov av.cov.
r EOG0915000C 10.094996158497661 99.0874915209863
r EOG09150009 5.888202047123744 91.99781319651794
r EOG09150006 5.4546381902835375 117.92787998867817
The summary output provides the coverage quality which is calculated by a division between the average variances of coverages in reference genes and piRNA clusters. Furthermore, it outputs the average coverage and the average standard deviations of coverages for piRNA clusters and reference genes.
Below, the output contains tab-delimited overviews of standard deviations (column 3) and averages (column 4) for colverages for individual piRNA clusters and reference genes with the corresponding feature names in column 2. For piRNA clusters, a 5. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 6. column of coverage quality which is calculated by a division between the variance of reference genes and the variance of a piRNA cluster.
example output for coverage:
1 2R_RaGOO 7069609 88
1 2R_RaGOO 7069610 88
1 2R_RaGOO 7069611 88
1 2R_RaGOO 7069612 88
1 2R_RaGOO 7069613 88
...
Columns are separated by tabs:
This script is very similar to the script cluster-coverage.py above. It uses the same parameters as cluster-coverage.py. The only difference is that it uses the median value of standard deviations in reference genes to calculate the quality metric.
This script obtains the number of softclipped bases from mapping results with a bed file of piRNA clusters (cusco output) and a minimum bed file with a mandatory 4. column with names of reference annotations.
example call
samtools view mapped_reads.sort.bam|python cuscoquality/cluster-softclipcoverage.py --sam - --cluster 3cluster.bed --reference 3busco.bed --min-mq 15 --min-len 1000 --output-cluster softclips_ml1k_mq15.cluster --output-reference softclips_ml1k_mq15.busco > softclips_ml1k_mq15.summary
parameters
output
example output of the summary on softclips:
Soft-clip quality = 0.9321079968484184
Note that ScQ = av.cov.ref / av.cov.cluster (1=good, 0=bad, >1=exceptional)
Average soft-clip coverage in piRNA clusters 5.132141252457175
Average soft-clip coverage in reference regions 4.78370990237099
Cluster overview
c clu.ID av.cov. c.score ScQ
c 2 6.534095790378007 1000 0.7321150555238869
c 1 5.377679927770931 1000 0.889549018651591
c 12 5.02316252120862 0 0.9523303062907836
Reference overview
r ref.ID av.cov
r EOG09150009 4.854072921485344
r EOG0915000C 4.77517820804227
r EOG09150006 4.7295782621001985
The summary output provides soft-clip quality, which is calculated by a division between the average softclipped bases in reference genes and piRNA clusters. Furthermore, this output contains the average softclipped bases for piRNA clusters and reference genes. Below, overviews shows the average softclipped bases in column 3 of the tab-delimited format for individual piRNA clusters or reference genes (column 2). For piRNA clusters, a 4. column contains the gap status of a piRNA cluster, where 0 indicates gaps while 1000 indicates no gaps, and a 5. column of soft-clip quality which is calculated by a division between the average softclipped bases in reference genes and a piRNA cluster.
example output of softclips:
1 2R_RaGOO 7069609 3
1 2R_RaGOO 7069610 3
1 2R_RaGOO 7069611 3
1 2R_RaGOO 7069612 3
1 2R_RaGOO 7069613 3
...
Columns are separated by tabs:
This script is very similar to the script cluster-softclipcoverage.py above. It uses the same parameters as cluster-softclipcoverage.py. The only difference is that it uses the median value of reference genes to calculate the quality metric.
This script obtains the quantiles using the output for the reference annotations from the two scripts 'cluster-coverage.py' and 'cluster-softclipcoverage.py'.
example calls
python cuscoquality/quantiles.py --coverage coverage_ml1k_mq15.busco > coverage.qt
python cuscoquality/quantiles.py --coverage softclips_ml1k_mq15.busco --sc-coverage > softclips.qt
parameters
output
example output:
c 0.1% 128
c 1% 126
c 5% 121
c 95% 84
c 99% 79
c 99.9% 78
Columns are separated by tabs:
This script obtains a final plot for a piRNA cluster or reference annotation of interst.
*example call
R --vanilla --args 1 coverage.qt softclips.qt genome.polyN coverage_ml1k_mq15.cluster softclips_ml1k_mq15.cluster 1 < cuscoquality/visualize.R
parameters
Positions of arguments need to be in the following order:
output
The script writes the output in PostScript format using the path string to the softclips and appending '.ps'.
This script generates artificial reads for an assembly.
For each genomic position one read of a given size will be generated, this ensures an uniform coverage along the chromosome (coverage = read length). These artificial reads are then used to estimate the observed TE landscape for an assembly.
** example call**
:::bash
python cuscoquality/artificial-reads-for-assembly.py --assembly myassembly.fasta --read-length 125 > reads-for-assembly.fastq
** parameters**
output
A fastq file - for example:
:::bash
@1
ATTTGCCAAATATGGCCAAAAAATTTAATTTCCATTTTTGAACACAGTTTGATTGGAAATTTTATTACGAGCTCAGTGAGATATGACATTACATATTGAGACAATTATTTTTTTATGTTGTGGCA
+1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@2
TTTGCCAAATATGGCCAAAAAATTTAATTTCCATTTTTGAACACAGTTTGATTGGAAATTTTATTACGAGCTCAGTGAGATATGACATTACATATTGAGACAATTATTTTTTTATGTTGTGGCAA
+2
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
This script computes the TE landscape metrics, three of our novel quality metrics.
It estimates whether the TE landscape is accurately captured by an assembly, i.e. i) whether the TE abundance is accurately represented, ii) whether all SNPs within TEs are present in an assembly, and iii) whether all internal deletions of a TE are present in an assembly.
example call
:::bash
python cuscoquality/te-landscape-metrics.py --observed observed.raw --expected expected.raw --min-count 2 --min-freq 0.02 --min-cov 0.05
** parameters**
output landscape metrics
Default output is our three TE landscape metrics, i.e. the slope of the regression between the expected and the observed TE landscape. We provide the slope for the TE abundance, the number of SNPs within a TE family and the number of IDs within a TE family.
For example, a slope of 1 means that the assembly perfectly captures the TE abundance, a slope <1 means that TEs are underrepresented in an assembly and a slope >1 means that the TE abundance is overestimated in an assembly.
Example output:
:::bash
TE landscape metrics
TE abundance - slope 1.0063
SNP count - slope 1.0297
ID count - slope 1.0672
output raw data
this output is obtained the optional flag --print-raw-data was provided
:::bash
DMRER2DM 8349.978057024851 340 8 8771.11808385376 372 5
RT1C 2716.1252999699536 2079 30 2788.888785939409 2130 41
DIVER2 5113.954950839485 1125 33 5641.890622162796 1174 33
DMPOGOR11 3071.2622164669892 23 13 2533.667445009262 37 14
STALKER3 442.37612212273575 56 1 587.144605053864 65 0
...