EvidentialGene Blog

Evidence Directed Gene Construction for Eukaryotes

Brought to you by: dongilbert


Do DNA and cytometric measures agree on genome sizes?

Find a detailed answer to this question in the Gnodes#4 draft document, now at bioRxiv.
See also http://eugenes.org/EvidentialGene/other/gnodes/gnodesdoc/
Your comments are welcome. If you know of anyone interested in new experiments to measure genomes with cytometry and DNA, please let them know of this work, and/or let me know. It is a subject worth pursuing to improve genome assemblies and their contents, such as duplicated genes. -- Don Gilbert

Posted by 2025-05-19

Gnodes#2 long and short of it, early draft

Find here an early draft of a long, and still messy, document on measuring genomes with DNA reads using Gnodes. If any of you read this, or parts thereof, I welcome any quick comments or criticisms, including what you find puzzling or unreadable.

Gnodes#2 document pdf, 2024 June-Sep.
Measuring DNA contents of animal and plant genomes with Gnodes, the long and short of it. ...

Posted by 2024-08-05

Gnodes/Genome Depth Estimator 3rd paper

Dear genome/gene biologists and informaticians,

A Gnodes#3 draft document is here, not quite finished, but main contents are in. I value any comments you may have on this, which I hope will be understandable and useful to a range of genome biologists and informaticians.
http://arthropods.eugenes.org/EvidentialGene/other/gnodes/gnodesdoc/
gnodes_doc3d.pdf
"Measure of major contents in animal and plant genomes, using Gnodes, finds under-assemblies of model plant, Daphnia, fire ant and others."...

Posted by 2023-10-24

Notes on evigene 2023 july update

The current release of the EvidentialGene tr2aacds.pl pipeline for RNA assembly now auto-detects stranded RNA (including long-read RNA). The version "tr2aacds4_22a.pl" is replaced by tr2aacds4.pl, and tr2aacds.pl is a symbolic link to it.

Find updated script code at
http://arthropods.eugenes.org/EvidentialGene/other/evigene_old/
or https://sourceforge.net/projects/evidentialgene/files/
as evigene23jul15.tar...

Posted by 2023-07-20

EvidentialGene update 2023-july-15

evigene23jul15.tar update is now available here

EvidentialGene update 2023-jul-15 notes

Mostly updates to the Gnodes portion (DNA measurements of genomes), including corrections with validation of long-read DNA measures.
See Gnodes documents (2nd, 3rd in progress) at
http://eugenes.org/EvidentialGene/other/gnodes/gnodesdoc/
http://eugenes.org/EvidentialGene/other/gnodes/gnodesdoc/gnodes23daphnmodels_doc/ ...

Posted by 2023-07-15

evigene22may07 update

evigene22may07.tar update is now available here.
This update is mostly to the Gnodes portion of EvidentialGene, with various small corrections. Gnodes is a measuring tool for both chromosome assemblies and the genes therein, and RNA-assembled genes. One of its most useful aspects is a measure of gene duplications that is fully evidence-based, resting upon gDNA coverage depth analyses for unique, conserved, 1-copy genes, and then for all the other genes and the remainder of the genome (transposons, repeats, etc). ...
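
As a rough illustration of the depth-based idea described above (not Gnodes' actual code), a gene's copy number can be approximated as its gDNA read depth divided by the median depth of unique, conserved 1-copy genes. The function name and the sample depth values here are hypothetical.

```python
from statistics import median

def estimate_copy_number(gene_depth, ucg_depths):
    """Approximate copies of a gene as its read depth relative to the
    median depth of unique conserved (1-copy) genes. Illustrative only."""
    standard = median(ucg_depths)
    if standard <= 0:
        raise ValueError("unique-gene depth must be positive")
    return gene_depth / standard

# Hypothetical depths: 1-copy genes are covered at about 50x
ucg = [48.0, 50.0, 52.0]
copies = estimate_copy_number(150.0, ucg)
```

With these made-up numbers, a gene covered at 150x against a 50x single-copy standard would be read as about three copies.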

Posted by 2022-05-09

Evigene 22jan14 update

evigene22jan14.tar update is now available here
This public release of EvidentialGene software includes what I believe is a stable public release for Gnodes, with a paper in draft form soon to be available. Gnodes is a measuring tool for both chromosome assemblies and the genes therein, and RNA-assembled genes. One of its most useful aspects is a measure of gene duplications that is fully evidence-based, resting upon gDNA coverage depth analyses for unique, conserved, 1-copy genes, and then for all the other genes and the remainder of the genome (transposons, repeats, etc). ...

Posted by 2022-01-15

Gnodes/Genome Depth Estimator

Gnodes is a Genome Depth Estimator for animal and plant genomes. It calculates genome sizes based on DNA coverage of assemblies, using as its standard the depth of unique conserved genes. It is now a component of the EvidentialGene package. The tool appears to work correctly: it recovers observed flow cytometry measures of genome size fairly well in tests with plants and insects. ...
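
The underlying arithmetic can be sketched as follows, assuming the common coverage relation that genome size is roughly the total sequenced DNA bases divided by the single-copy depth. This is a simplified illustration with hypothetical names and numbers, not the Gnodes implementation.

```python
def estimate_genome_size(total_read_bases, unique_gene_depth):
    """Genome size (bases) ~ total DNA read bases / read depth over
    unique conserved genes. A simplified coverage relation."""
    if unique_gene_depth <= 0:
        raise ValueError("depth must be positive")
    return total_read_bases / unique_gene_depth

# Hypothetical example: 10 Gb of reads, 50x depth on 1-copy genes
size_mb = estimate_genome_size(10_000_000_000, 50.0) / 1_000_000
```

Under these assumed inputs the estimate is 200 Mb, which one would then compare against a flow cytometry measure for the same species.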

Posted by 2021-02-24

EvidentialGene software major update v4

There is a major EvidentialGene software update available now, in evigene20mar15.tar
at sourceforge.net/projects/evidentialgene/files/ and http://arthropods.eugenes.org/EvidentialGene/other/evigene_old/
It includes updates described in this paper, and newer ones.
A brief summary of updates is listed in evigene20mar15_updates.txt. The major new script versions are
tr2aacds4.pl and evgpipe_sra2genes4v.pl, with updates and additions of associated pipeline components.
tr2ncrna.pl : new pipeline to select non-coding RNA subset of input transcripts...

Posted by 2020-03-18

Evigene transcript class summary, update v4

Evigene's tr2aacds pipeline summarizes results in a table, name.trclass.sum.txt. The major classes are main, alt, noclass, partitioned as to okay (keep) and drop sets. The okay set is the useful, non-redundant one.

Transcript classification table key for Evigene tr2aacds,
main = primary transcript w/ alternates;
alt = alternate transcript, with main identified;
noclass = primary with no alternates;
Class modifiers:
"hi" = high identity (>= 98% CDS identity);
"hi1" (very hi identity) and "a2" (protein identity) are subclasses
"mid" and "mfrag" are lower identity, more often paralogs than alternates.
"nc" = has non-coding quality (not absolute)
"part" = fragment alternates
"perfdupl" = coding sequence is perfect duplicate of another
"perffrag" = coding sequence is perfect fragment of another
"smallorf" = coding sequence smaller than cut-off (default 100aa).
"okay" and "drop" are partitions of the classes to keep and ignore. ...

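The key above can be held in a small lookup, sketched here under the assumption that a class label is a base class directly followed by an optional modifier (for example "althi"). That label layout and the function name are illustrative assumptions, not taken from the Evigene code; the descriptions are copied from the key.

```python
# Descriptions come from the key above; the parsing rule is an assumption.
BASE_CLASSES = {
    "main": "primary transcript w/ alternates",
    "alt": "alternate transcript, with main identified",
    "noclass": "primary with no alternates",
}
MODIFIERS = {
    "hi": "high identity (>= 98% CDS identity)",
    "mid": "lower identity, more often paralog than alternate",
    "mfrag": "lower identity, more often paralog than alternate",
    "nc": "has non-coding quality (not absolute)",
    "part": "fragment alternate",
    "perfdupl": "coding sequence is perfect duplicate of another",
    "perffrag": "coding sequence is perfect fragment of another",
    "smallorf": "coding sequence smaller than cut-off (default 100aa)",
}

def describe_class(label):
    """Return (base description, modifier description) for a class label."""
    for base, text in BASE_CLASSES.items():
        if label.startswith(base):
            return text, MODIFIERS.get(label[len(base):], "")
    return None, MODIFIERS.get(label, "")

example = describe_class("althi")
```
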
Posted by 2020-03-18

Reference protein annotation with namegenes

EvidentialGene uses reference protein homologies of your transcript assembly both for annotations and for selecting a best gene set. The basic method for this is in the SRA2Genes omni-pipeline, as STEP8_refblastgenes. The following script is the template of STEP8_refblastgenes, and can be used to generate your transcript_assembly names table for further use with Evigene.

There are 3 output files:
$refnam-$qname.blastp (blastp output),
$refnam-$qname.btall (condensed blastp table), and
$qname-$refname.names (a table of names and blast-scores, one row per your transcript IDs)
This last names table has enough information for further use by Evigene. ...

Posted by 2020-03-05

Longest protein, longest transcript or most expression? paper

See this paper for comparison of Evigene and related methods:
Gilbert, DG. (2019). Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes?
bioRxiv 829184; doi: https://doi.org/10.1101/829184

Abstract
Methods of transcript assembly and reduction filters are compared for recovery
of reference gene sets of human, pig and plant, including longest
coding-sequence with EvidentialGene, longest transcript with CD-HIT, and most
RNA-seq with TransRate. EvidentialGene methods are the most accurate in
recovering reference genes, and maintain accuracy for alternate transcripts and
paralogs. In comparison, filtering large over-assemblies by longest RNA
measures, and most RNA-seq expression measures, discards a large portion of
accurate models, especially alternates and paralogs. Accuracy of protein
calculations is compared, with errors found in popular methods, as is accuracy
of transcript assemblers. Gene reconstruction accuracy depends upon the
underlying measurements, where protein criteria, including homology among
species, have the strength of evolutionary biology that other criteria lack.
EvidentialGene provides a gene reconstruction algorithm that is consistent with
genome biology. ...

Posted by 2019-11-03

Preserve best homology transcripts in tr2aacds

Q: What do you think of adding back BUSCO-scored mRNA missing from EviGene assembly?

A:
The most useful way to preserve transcripts with good homology is to add a homology score table as input to a new run of tr2aacds, using the same full transcript input set, with -ablast table of transcript IDs and homology scores.

Create homology score table from BUSCO output "full_table_XXX.tsv"
You should create a large table with all the transcript IDs you tested with BUSCO (i.e. evigene set + extras). BLAST and other scores in the same format can likewise be used. The highest-scored transcript per reference ID will be preserved. Table format is rows of
"transcriptID <tab> referenceID <tab> score"...

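A minimal sketch of building such a table from BUSCO output follows. It assumes a tab-separated full table whose columns are BUSCO ID, status, sequence ID, score (the column positions are an assumption; check the header of your own full_table file and adjust). The function name and sample rows are hypothetical.

```python
import csv
import io

def busco_to_score_rows(lines, seq_col=2, score_col=3):
    """Yield (transcriptID, referenceID, score) from BUSCO full-table lines.
    Column positions are assumptions; adjust them for your BUSCO version."""
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip comment/header and blank lines
        if len(fields) <= max(seq_col, score_col):
            continue  # e.g. 'Missing' entries carry no sequence or score
        yield fields[seq_col], fields[0], fields[score_col]

# Hypothetical input, standing in for a full_table_XXX.tsv file
sample = io.StringIO(
    "# Busco id\tStatus\tSequence\tScore\tLength\n"
    "EOG001\tComplete\ttr0001\t512.3\t300\n"
    "EOG002\tMissing\n"
)
rows = ["\t".join(r) for r in busco_to_score_rows(sample)]
```

Each string in `rows` is one "transcriptID <tab> referenceID <tab> score" line, ready to be written to the table file.
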
Posted by 2019-06-22

SRA2Genes Test Drive

Here is an update in-progress, a replacement for the 'tr2aacds' pipeline component of EvidentialGene. SRA2Genes does more than tr2aacds, which is included as part of a full gene set reconstruction pipeline:
http://eugenes.org/EvidentialGene/other/sra2genes_testdrive/

If you are interested, I would like to hear from you how this SRA2Genes Test Drive works for you. It contains gene data sets, meant to see if you can run it properly, and may take about an hour of your time, not including program run times. It will run for up to a few hours on a minimal computer, e.g. on my Mac laptop, in a Linux virtual machine, using 2 virtual CPUs. ...

Posted by 2019-05-13

Proteins from transcripts, ORF methods compared

There are several ways to compute proteins from mRNA transcripts, with differing results. Two classes are (a) basic codon lookup scan of RNA sequence for longest open reading frames (ORFs), and (b) prediction from statistical models of coding sequences, including coding/non-coding metrics, intron/exon models, sequence signals. One should be aware of what a protein translator is doing: in particular, predictors miss or mis-model from 20% to 35% of standard reference proteins, while codon lookup calculations recover nearly all reference proteins. ...
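
To make class (a) concrete, here is a minimal sketch of a codon lookup scan for the longest ORF. It is a simplification for illustration, not Evigene's ORF caller: it scans only the three forward frames and reports only complete ORFs (ATG through a stop codon), whereas real tools also handle partial ORFs and the reverse strand.

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return the longest complete ORF (ATG..stop) in the 3 forward frames."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start > len(best):
                    best = seq[start:i + 3]
                start = None  # look for the next ORF in this frame
    return best

orf = longest_orf("CCATGAAATTTTAGGG")
```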

Posted by 2019-01-29

evigene19jan01 software release updates

evigene19jan01 software release updates, for EvidentialGene, "Evidence Directed Gene Construction for Eukaryotes".
Find evigene19jan01.tar at
http://arthropods.eugenes.org/EvidentialGene/other/evigene_old/
https://sourceforge.net/projects/evidentialgene/files/

new:
genes/evgpubsetsum.pm : component in SRA2Genes
genes/pubset2submit.pl : component in SRA2Genes
genes/trclass2pubset.pl : component in SRA2Genes
genes/ncbigff2evg.pl : misc tool
prot/tr2aacds_aaconsensus_item.txt : READ This, for tr2aacds2c.pl
prot/tr2aacds2c.pl
rnaseq/asmrna_dupfilter3c.pl...

Posted by 2019-01-01

Doc on Genes of the Pig reconstructed with EvidentialGene

See this pre-review doc:
Genes of the Pig, Sus scrofa, reconstructed with EvidentialGene. BioRxiv doi: 10.1101/412130
Your comments and critiques are welcome.

Posted by 2018-09-10

How To Install/Use Evigene doc

There is a new README for EvidentialGene, to help get folks started: evigene/docs/EvidentialGene_howto.txt (arthropods.eugenes.org)

Howto: EvidentialGene Gene Set Reconstruction Software
ABOUT
About Evigene-R : genes assembled from RNA pieces
About evigene/scripts/evgpipe_sra2genes.pl
About evigene/scripts/prot/tr2aacds.pl
About Evigene-G : traditional genes modeled-on-genome
About Evigene-N : non-coding gene reconstruction
About Evigene-H : gene reconstruction with hybrid of methods
HOW TO GET SOFTWARE
WHO USES IT?
IS IT ANY GOOD?
HOW TO INSTALL
TEST DRIVE
Please first try this test case with small input data (TAIR10 mRNA) and tr2aacds outputs,
BASIC USAGE of Evigene-R
STEP 1. get RNA-Seq data
STEP 4. run assemblers of RNA-seq, with kmer size options, other opts
STEP 5. trformat.pl, post process assembly sets
STEP 7. tr2aacds, reduce over-assembly to draft gene set
STEP 10. evgmrna2tsa, produce public gene sequences ...

Posted by 2018-09-10

Pig genes reconstructed with SRA2Genes

A gene set reconstruction for the pig (a model organism in RefSeq) is done with SRA2Genes pipeline of EvidentialGene. It is as good as or a bit better than the NCBI RefSeq gene set for pig, and a bit better than Ensembl's pig genes. A PacBio long-RNA gene set for pig is not nearly as complete as the Illumina RNA-seq used for Evigene. NCBI and Ensembl use chromosome gene models, along with the very extensive set of expressed pig gene pieces, to build RefSeq for pig. Evigene uses only genes assembled from RNA-seq pieces here. ...

Posted by 2018-08-16

Zebrafish gene set built with SRA2Genes pipeline

Here is a gene set built with the new sra2genes pipeline of EvidentialGene, for the zebrafish model organism, Danio rerio,
http://arthropods.eugenes.org/EvidentialGene/vertebrates/zebrafish/zebrafish17evigene/

De-novo reconstruction of zebrafish genes from 3 RNA-seq sources, without use of chromosomes or other species genes, is accurate. Comparison to other zebrafish gene sets, of NCBI and Ensembl, indicates the Evigene methods are more accurate. ...

Posted by 2018-04-30

gene-transcript ID table from evgmrna2tsa

Q: How do I create a gene-locus x transcript ID linking table?
A: Use this script evgmrna2tsa2.pl, which is my recommended follow-on to tr2aacds.
$evigene/scripts/evgmrna2tsa2.pl -onlypubset -idprefix Aspecies1EVm -class aspecies.trclass

One of its results is publicset/aspecies.pubids, a table of gene-locus primary and alternate classifications with uniform new IDs, derived from the classification info in aspecies.trclass.

   Public_mRNA_ID   originalID   PublicGeneID  AltNum ...
   Gene1Tr1         tr0001       Gene1          1
   Gene1Tr2         tr0002       Gene1          2
   Gene2Tr1         tr0003       Gene2          1
   Gene3Tr1         tr0004       Gene3          1
   Gene3Tr2         tr0005       Gene3          2
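
A table like the one above can be turned into a gene-to-transcript lookup with a few lines of Python. This sketch assumes whitespace-separated columns in the order shown (public mRNA ID, original ID, public gene ID) and skips a header or comment line; the function name is hypothetical.

```python
from collections import defaultdict

def group_by_gene(lines):
    """Map PublicGeneID -> list of (Public_mRNA_ID, originalID), assuming
    whitespace-separated columns in the order shown in the table above."""
    genes = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith(("#", "Public_mRNA_ID")):
            continue  # skip header, comments, blank lines
        pub_id, orig_id, gene_id = fields[:3]
        genes[gene_id].append((pub_id, orig_id))
    return dict(genes)

table = """\
Public_mRNA_ID originalID PublicGeneID AltNum
Gene1Tr1 tr0001 Gene1 1
Gene1Tr2 tr0002 Gene1 2
Gene2Tr1 tr0003 Gene2 1
"""
links = group_by_gene(table.splitlines())
```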

evgmrna2tsa has several more options for publishing gene sets to the NCBI-TSA archive, but for your needs and those of many others, this '-onlypubset' choice is enough. You may want to test first with the test drive, small data set arath_TAIR10, as here:...

Posted by 2018-03-01

EvidentialGene sra2genes DRAFT version 2017.12

evgpipe_sra2genes.pl is a new omnibus pipeline for processing SRA RNA-seq data into annotated public gene sets, in an automated way. It steps through the methods for fetching RNA from NCBI-SRA, preprocessing it, and running several assemblers with several kmer steps. From this over-assembly set, a non-redundant gene set is produced with tr2aacds. Following this, genes are annotated with protein orthology, screened for vectors, and processed into a public release gene set, including names and annotations. A file set suited to submission to NCBI-TSA is produced. ...

Posted by 2017-12-15

Evigene versus Transdecoder for proteins from transcripts

EvidentialGene computes ORFs (proteins and their coding sequences) of transcripts to classify them accurately. It appears to do a better job of this than TransDecoder, the now popular package of Brian Haas' that evolved from PASA and Trinity components. Evigene's methods are drawn in part from these same ORF computations.

ORF computation is fairly straightforward; differences among methods should be primarily at the edges, for complicated, unusual cases. I've recently looked at results from TransDecoder versus Evigene, and I don't think TransDecoder is giving you improvements; it may well be reducing the number of best orthology proteins when using its Predict variant. The initial TransDecoder.LongOrfs gives way too many results to be useful without the sort of filtering that Evigene does. ...

Posted by 2017-11-17

Error of longest-transcript-filters like cd-hit-est

Error of using cd-hit-est longest transcript filter for gene assembly reduction versus using CDS quality filter as by EvidentialGene. Examples from Arabidopsis, Mouse, Mosquito, PacBio RNA, Illumina with Velvet, Trinity.
Don Gilbert, 2017 June

A common mistake among gene/transcript assembly projects is mis-use of longest-transcript-is-best approaches to data reduction. Over-assemblies are often reduced by methods that choose the longest transcripts. Those longest transcripts tend to have more assembly mistakes, thus making them longer than true gene transcripts. See full details at Evigene docs. ...

Posted by 2017-10-27

Why more proteins than transcripts, i.e. UTR-ORFs

Q: Why are there more protein and CDS sequences than transcripts in the output of tr2aacds?

A:
Evigene methods handle a common RNA assembly problem: joined genes.
RNA assemblers can and do join 2 or more genes in one transcript (also called fusion, which has a biological counterpart, or chimera, which can be more of a mashup of genes). Evigene methods look for more than one protein (ORF) in transcripts that have a long UTR left over from the longest ORF. So you end up with more proteins and CDS than you have input transcripts, due to these extra "utrorf" things. The tr2aacds output sequences, cds and aa, have "utrorf" appended to the ID for these cases, but your input transcript is not changed. ...
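
To reconcile the counts, protein or CDS IDs can be mapped back to their input transcripts by dropping the "utrorf" tag. The sketch below assumes the tag is a plain suffix on the transcript ID, as described above; the exact ID layout, the function name and the sample IDs are assumptions for illustration.

```python
def source_transcript_id(seq_id, tag="utrorf"):
    """Return the input transcript ID for a protein/CDS ID, dropping a
    trailing 'utrorf' tag if present (exact ID layout is an assumption)."""
    return seq_id[:-len(tag)] if seq_id.endswith(tag) else seq_id

# Hypothetical IDs: three protein records derived from two transcripts
ids = ["tr0001", "tr0001utrorf", "tr0002"]
transcripts = sorted({source_transcript_id(i) for i in ids})
```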

Posted by 2017-10-06
