I'd like to troubleshoot false negative events. That is, I'm trying to figure out why a certain known fusion in my sequences was not supported. Is there a way for defuse to spit out raw support for all potential fusions? For example, can I see - for a particular gene/genes:
the amount of discordant reads support for all sites
of chimeric reads by site
general alignment coverage by position
etc.?
Or are there other ways to diagnose false negatives?
Thanks!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Given a gene pair, the easiest way to identify the number and alignment location of discordant reads between the genes is to search the clusters file. Unfortunately the file contains alignments to cdna and the genome, so you can't do something as simple as a grep. Ive attached a perl script to help, the usage should be reasonably intuitive, for example:
will print all clusters with alignments between genes ENSG00000141376 and ENSG00000108510 (plus or minus 1000nt) where Homo_sapiens.GRCh37.69.gtf is the gtf file in the defuse dataset.
The format is tab delimited with fields:
cluster id
cluster end (0 or 1)
read id
read end (0 or 1)
reference name (cDNA or chromosome)
strand
alignment start
alignment end
If you suspect the fusion might produce a very low number of discordant reads (because of low expression or breakpoint near the start or end of the fusion gene) you might want to rerun the clustering with a lower threshold on number of reads. If you dont want to rerun defuse, just rerun the clustering:
grep clustermatepairs log/defuse.log
will produce a command for each chromosome pair. Rerun clustermatepairs with -m 1 for all clusters of size 1 or greater and the results will appear in a clusters*.tmp file.
An additional note, there is substantial heuristic filtering at the alignment stage in alignjob.pl. For instance, read pairs for which both ends align within the same ensembl gene are filtered. This has caused problems previously when erroneous genes in the ensembl database that are annotated to originate from large regions of a chromosome have resulted in filtering real gene fusion evidence, such as TMPRSS2-ERG getting filtered in ensembl version 67.
If you are interested in trying defuse without this filtering, unfortunately the only option is quite manual, you will have to edit alignjob.pl such that filtered read id filenames are not pushed to @concordant_readids, and rerun defuse from scratch.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'd like to troubleshoot false negative events. That is, I'm trying to figure out why a certain known fusion in my sequences was not supported. Is there a way for defuse to spit out raw support for all potential fusions? For example, can I see - for a particular gene/genes:
of chimeric reads by site
Or are there other ways to diagnose false negatives?
Thanks!
Given a gene pair, the easiest way to identify the number and alignment location of discordant reads between the genes is to search the clusters file. Unfortunately the file contains alignments to cdna and the genome, so you can't do something as simple as a grep. Ive attached a perl script to help, the usage should be reasonably intuitive, for example:
cat clusters | perl select_genepair_clusters.pl Homo_sapiens.GRCh37.69.gtf ENSG00000141376 ENSG00000108510 1000
will print all clusters with alignments between genes ENSG00000141376 and ENSG00000108510 (plus or minus 1000nt) where Homo_sapiens.GRCh37.69.gtf is the gtf file in the defuse dataset.
The format is tab delimited with fields:
cluster id
cluster end (0 or 1)
read id
read end (0 or 1)
reference name (cDNA or chromosome)
strand
alignment start
alignment end
If you suspect the fusion might produce a very low number of discordant reads (because of low expression or breakpoint near the start or end of the fusion gene) you might want to rerun the clustering with a lower threshold on number of reads. If you dont want to rerun defuse, just rerun the clustering:
grep clustermatepairs log/defuse.log
will produce a command for each chromosome pair. Rerun clustermatepairs with -m 1 for all clusters of size 1 or greater and the results will appear in a clusters*.tmp file.
An additional note, there is substantial heuristic filtering at the alignment stage in alignjob.pl. For instance, read pairs for which both ends align within the same ensembl gene are filtered. This has caused problems previously when erroneous genes in the ensembl database that are annotated to originate from large regions of a chromosome have resulted in filtering real gene fusion evidence, such as TMPRSS2-ERG getting filtered in ensembl version 67.
If you are interested in trying defuse without this filtering, unfortunately the only option is quite manual, you will have to edit alignjob.pl such that filtered read id filenames are not pushed to @concordant_readids, and rerun defuse from scratch.