EvidentialGene Blog

Evidence Directed Gene Construction for Eukaryotes

Brought to you by: dongilbert

trformat usage

Q: How is trformat.pl used?

A:
evigene/scripts/rnaseq/trformat.pl
is used BEFORE tr2aacds to regularize the IDs in FASTA output from Velvet, SOAP, and Trinity: it ensures unique IDs and adds prefixes for parameter sets. Its main purpose is to ensure unique IDs for the input transcripts, because assemblers run with varying options (kmer, ..) produce subsets that share the same IDs.

trformat.pl guesses options such as kmer and assembler type from the file names, then ensures that the combined transcript set has unique IDs. It also reformats IDs so that those from different assemblers look similar. The supported assembler types are velvet, soaptr, trinity, idba, and pacbio (as of the 2017 version).

When I run multi-kmer assemblers, each kmer output goes in its own subfolder ("vel_k32", "vel_k61", ..), and the IDs of the combined assembly then look like "velk32g001t1", "velk61g001t1", ...
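To make the ID collision problem concrete, here is a minimal shell sketch of per-folder ID prefixing. It is an illustration only and is NOT how trformat.pl is implemented; the function names prefix_fasta_ids and tag_from_dir are hypothetical, and the prefix is assumed to be plain alphanumeric text.

```shell
# Illustration only: NOT trformat.pl's actual logic.
# Hypothetical helpers showing why a per-assembly prefix makes IDs unique.

# tag_from_dir DIR : turn a folder name like "vel_k32" into a tag "velk32"
tag_from_dir() {
  basename "$1" | tr -d '_'
}

# prefix_fasta_ids PREFIX FILE : prepend PREFIX to every FASTA header ID.
# Assumes PREFIX is alphanumeric (no characters special to sed).
prefix_fasta_ids() {
  prefix=$1
  fasta=$2
  sed "s/^>/>${prefix}/" "$fasta"
}
```

With these helpers, a header ">g001t1" in vel_k32/transcripts.fa and the same header in vel_k61/transcripts.fa would become ">velk32g001t1" and ">velk61g001t1", so the two sets can be combined without ID clashes.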

Here is a script that runs trformat.pl on a set of Velvet and SOAP multi-kmer assemblies:
http://arthropods.eugenes.org/EvidentialGene/arthropods/mosquito/evg_scripts/trformgroup.sh
The calls in it look like this:

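  nap=MySppIDPrefix        # ID prefix for your species
  subd=assemblies_folder   # folder holding the per-kmer assembly subfolders
  # velvet assemblies
  $evigene/scripts/rnaseq/trformat.pl -pre $nap -out $subd.tr -log -in $subd/vel*/transcripts.fa
  # soap assemblies
  $evigene/scripts/rnaseq/trformat.pl -pre $nap -out $subd.tr -log -in $subd/sod*/so*.scafSeq.gz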

trformat.pl is not essential to using tr2aacds, which checks for duplicate IDs, but you should somehow ensure your input transcript set has unique transcript IDs.
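
If you skip trformat.pl, one way to check uniqueness yourself is to list any repeated IDs before running tr2aacds. The following is a minimal sketch, not part of EvidentialGene; the function name dup_fasta_ids is hypothetical, and it assumes uncompressed FASTA files where the ID is the first word after ">".

```shell
# dup_fasta_ids FILE... : print each FASTA ID that occurs more than once
# across the given files. Prints nothing when all IDs are unique.
# Assumes uncompressed FASTA; the ID is the first word after ">".
dup_fasta_ids() {
  grep -h '^>' "$@" |            # keep header lines only, without file names
    awk '{print substr($1, 2)}' | # first word of the header, minus the ">"
    sort |
    uniq -d                       # report only IDs seen more than once
}
```

For example, "dup_fasta_ids assemblies_folder/*/transcripts.fa" (using the folder name from the calls above) prints one line per clashing ID; empty output means the combined set is safe to pass on.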


Posted 2017-10-06
