I am going on with the analysis of my Triticum turgidum spp. durum landrace (a tetraploid species). In the past days I ran VCFAnnotate on a population.vcf file generated within NGSEP with SingleSampleVariantDetector (please notice that the input name is misleading as I am actually analyzing a unique landrace genome , not a population, against the reference 'Svevo' cv. in databank).
So I am using a .gff file that I downloaded from EBI-ENSEMBL: http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/gff3/triticum_turgidum as reference.
When issuing the following command line:
a huge (437.8 MB) output file named "population_2_annotated.vcf" is generated and on the terminal a very rapid and long series of messages are issued ending with these lines: [...]
WARNING: No genes found in transcriptome file. Check that gene features have "gene" as the exact feature type
Oct 20, 2021 12:10:56 AM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
WARNING: No transcripts found in transcriptome file. Check that transcript features have "mRNA" as the exact feature type
Oct 20, 2021 12:28:23 AM ngsep.vcf.VCFFunctionalAnnotator run
INFO: Process finished
Surprisingly, when looking at the generated file (see the attached sample file with a few hundred lines), I cannot find any reference to SNP, INDEL or whichever variation affecting a CDS, mRNA or gene , while every variation is classified as "TA=intergenic_variant".
That really puzzles me..
The issue seems to be related to the gff or maybe to the correspondance between the fasta and the gff. We have a command called "TranscriptomeAnalyzer", which is useful to determine if a gff3 is well formed according to the specification and if it is consistent with a reference genome. If everything goes fine, this command actually gives you the catalog of CDNA, CDS and protein sequences, and some useful statistics. Please run this command and send me the log file to see why NGSEP is not accepting the gff file because at first sight it looks fine to me.
Best regards
Jorge
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2021-10-21
Hello Jorge,
I ran the TranscriptomeAnalyzer command in the following format:
which generated the attached file
During the execution it issued a long series of warning messages on the terminal of which I attach the last few lines:
[...]
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497609535 497609735 . + . ID=transcript:TRITD0Uv1G147000.1;Parent=gene:TRITD0Uv1G147000;biotype=protein_coding;transcript_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497609535 497609735 . + . Parent=transcript:TRITD0Uv1G147000.1;Name=TRITD0Uv1G147000.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G147000.1-E1;rank=1. Unrecognized sequence name. Un
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497609535 497609735 . + 0 ID=CDS:TRITD0Uv1G147000.1;Parent=transcript:TRITD0Uv1G147000.1;protein_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
WARNING: No genes found in transcriptome file. Check that gene features have "gene" as the exact feature type
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
WARNING: No transcripts found in transcriptome file. Check that transcript features have "mRNA" as the exact feature type
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.TranscriptomeAnalyzer run
INFO: Process finished
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2021-10-21
Sorry Jorge,
the message left without the attached output file of the TranscriptomeAnalyzer command that is attached here. The file is full of zeroes as if no element had been recognized. I suspect a possible problem with the coding of chromosome names in the .fasta file but I am not sure.
In case it were such a problem, can I fix it by editing (in which way?) the corresponding column in the input .vcf file?
Based on the last lines of the log, it looks like the issue is that the chromosome names are not consistent between the reference genome and the gff. To double check this, on one side run samtools faidx on the fasta. The first two columns of the resulting fai file should have the chromosome names and lengths of the sequences in the fasta file. You can contrast this with the chromosome names and lengths of the header of the gff file. If headers are not present, you can check the chromosome names of the "gene", "mRNA" and "CDS" features.
Since your VCF is probably consistent with the fasta, you need to either modify the gff to make it consistent or look for a different annotation gff with consistent chromosome names. In the first option you need to double check that the coordinates of the features are actually consistent. Otherwise, you will have bad annotations and protein predictions.
Let me know how things go
Jorge
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2021-10-25
Hello Jorge,
I ran again the VCFAnnotate command using a .fasta reference file with chromosome filenames consistent with the .gff3 file which I edited as well entirely as for the chromosome codes.
Here is the command issued:
Nevertheless, I got no output file this time but a long series of terminal messages ending with the following lines:
[...]
WARNING: Can not load genomic feature at line: Un CNR_Italy gene 497393756 497394025 . + . ID=gene:TRITD0Uv1G146970;biotype=protein_coding;gene_id=TRITD0Uv1G146970;logic_name=gff3_hc_genes. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497393756 497394025 . + . ID=transcript:TRITD0Uv1G146970.1;Parent=gene:TRITD0Uv1G146970;biotype=protein_coding;transcript_id=TRITD0Uv1G146970.1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497393756 497394025 . + . Parent=transcript:TRITD0Uv1G146970.1;Name=TRITD0Uv1G146970.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G146970.1-E1;rank=1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497393756 497394025 . + 0 ID=CDS:TRITD0Uv1G146970.1;Parent=transcript:TRITD0Uv1G146970.1;protein_id=TRITD0Uv1G146970.1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un CNR_Italy gene 497609535 497609735 . + . ID=gene:TRITD0Uv1G147000;biotype=protein_coding;gene_id=TRITD0Uv1G147000;logic_name=gff3_hc_genes. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497609535 497609735 . + . ID=transcript:TRITD0Uv1G147000.1;Parent=gene:TRITD0Uv1G147000;biotype=protein_coding;transcript_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497609535 497609735 . + . Parent=transcript:TRITD0Uv1G147000.1;Name=TRITD0Uv1G147000.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G147000.1-E1;rank=1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497609535 497609735 . + 0 ID=CDS:TRITD0Uv1G147000.1;Parent=transcript:TRITD0Uv1G147000.1;protein_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at ngsep.NGSEPcore.main(NGSEPcore.java:66)
Caused by: java.lang.IllegalArgumentException: Empty input segments: [] for transcript: transcript:TRITD4Bv1G073250.6
at ngsep.transcriptome.Transcript.setTranscriptSegments(Transcript.java:82)
at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:219)
at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:116)
at ngsep.vcf.VCFFunctionalAnnotator.loadMap(VCFFunctionalAnnotator.java:258)
at ngsep.vcf.VCFFunctionalAnnotator.run(VCFFunctionalAnnotator.java:216)
at ngsep.vcf.VCFFunctionalAnnotator.main(VCFFunctionalAnnotator.java:210)
... 5 more
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Please share with me the fai file. From the part of the log that you send me, I would like to know if you actually have a chromosome called "Un". Please also use the symbol ">&" followed by a new file name to save the output to a log and share with me the full log. The Exception happens because the gff has a mRNA feature without children (CDS entries). To double check that please run "grep TRITD4Bv1G073250" on the gff file and share the answer.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2021-10-28
Hi Jorge,
I looked at the chromosome headers in the .fasta file and the output is as follows:
ENALT934111LT934111.1 Triticum turgidum subsp. durum genome assembly, chromosome: 1A
ENALT934112LT934112.1 Triticum turgidum subsp. durum genome assembly, chromosome: 1B
ENALT934113LT934113.1 Triticum turgidum subsp. durum genome assembly, chromosome: 2A
ENALT934114LT934114.1 Triticum turgidum subsp. durum genome assembly, chromosome: 2B
ENALT934115LT934115.1 Triticum turgidum subsp. durum genome assembly, chromosome: 3A
ENALT934116LT934116.1 Triticum turgidum subsp. durum genome assembly, chromosome: 3B
ENALT934117LT934117.1 Triticum turgidum subsp. durum genome assembly, chromosome: 4A
ENALT934118LT934118.1 Triticum turgidum subsp. durum genome assembly, chromosome: 4B
ENALT934119LT934119.1 Triticum turgidum subsp. durum genome assembly, chromosome: 5A
ENALT934120LT934120.1 Triticum turgidum subsp. durum genome assembly, chromosome: 5B
ENALT934121LT934121.1 Triticum turgidum subsp. durum genome assembly, chromosome: 6A
ENALT934122LT934122.1 Triticum turgidum subsp. durum genome assembly, chromosome: 6B
ENALT934123LT934123.1 Triticum turgidum subsp. durum genome assembly, chromosome: 7A
ENALT934124LT934124.1 Triticum turgidum subsp. durum genome assembly, chromosome: 7B
which confirms that there is no chromosome called "Un".
In addition I ran again VCFAnnotate and outputing the results onto the attached .log file.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2021-10-28
Hi Jorge,
sorry, the output of the .fai file is as follows (I had edited the .fasta reference file to change the long chromosome codes into the short ones to make them compatible with the .gff file):
About the fasta, now it is clear that the issue was that the chromosome names were not coordinated between the fasta and the gff3. Looking back at the header of the gff, the chromosome lengths coincide, which is a good sign. Now the problem is that you have to adjust or redo the VCF file because the chromosome names in the VCF are probably inconsistent with those in the fasta file. Before that, please make sure that you get good statistics from the Transcriptome Analyzer regardless of the lost annotation in the "Un" chromosome.
About the command, because the gff is compressed with gzip, please use zcat:
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2021-10-29
Hello Jorge,
I had already edited the VCF file to make the chromosome codes consistent with those in the Fasta and GFF files. Nevertheless, now no annotated VCF file is created by VCFAnnotate. Very strangely to me, VCFAnnotate had generated a huge output file when the chromosome codes were not consistent, in my first attempt (at the beginning of this forum thread).
That is very puzzling..
I'll run the Transcriptome Analyzer anyway and let you know.
Thanks.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I checked the output of the command and the issue is that NGSEP can not process the last 7 lines, corresponding to the transcript "TRITD4Bv1G073250.6" because no CDS are annotated for that transcript. The description says that the transcript is a "nontranslating_CDS". I honestly have not seen this before. We will need to modify the software to handle properly such types of entries. In the mean time, please try generating a version of the gff without these type of transcripts. You can uncompress the file with unzip and then edit the file with a text editor to remove these lines and other lines with similar trnscript annotations. You can grep "nontranslating_CDS" to identify more instances. Please make sure first that the transcriptome analyzer runs fine on the edited gff before trying to annotate the VCF.
Let me know how things go.
Jorge
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2021-11-03
Hi Jorge,
I'll take care of that change but I need to stress that the main problem with VCFAnnotate is that it does not generate any output although apparently the chromosome codes are consistent between the various files (FASTA, VCF and GFF3). The problem with a missing transcript is really the least important one.
Filippo
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I understand that. The issue is that the current version of the software finishes the process with an error trying to load the gff and then it does not even start processing the VCF file. While we fix this and release a new version, I need you to please make sure first that the software can read the gff using the transcriptome analyzer command. Before this happens it is not worth to even try to annotate the vcf because the command will keep failing before even reading the VCF file and the process will not generate any output. Once you get statistics from the transcriptome analyzer, you can try the annotation of the VCF.
Let me know how things go.
Jorge
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Dear Jorge,
I am going on with the analysis of my Triticum turgidum spp. durum landrace (a tetraploid species). In the past days I ran VCFAnnotate on a population.vcf file generated within NGSEP with SingleSampleVariantDetector (please notice that the input name is misleading as I am actually analyzing a unique landrace genome , not a population, against the reference 'Svevo' cv. in databank).
So I am using a .gff file that I downloaded from EBI-ENSEMBL: http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/gff3/triticum_turgidum as reference.
When issuing the following command line:
java -jar NGSEPcore.jar VCFAnnotate \
-i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population.vcf \
-t /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
-r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1.fasta \
-o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_annotated.vcf
a huge (437.8 MB) output file named "population_2_annotated.vcf" is generated and on the terminal a very rapid and long series of messages are issued ending with these lines:
[...]
WARNING: No genes found in transcriptome file. Check that gene features have "gene" as the exact feature type
Oct 20, 2021 12:10:56 AM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
WARNING: No transcripts found in transcriptome file. Check that transcript features have "mRNA" as the exact feature type
Oct 20, 2021 12:28:23 AM ngsep.vcf.VCFFunctionalAnnotator run
INFO: Process finished
Surprisingly, when looking at the generated file (see the attached sample file with a few hundred lines), I cannot find any reference to SNP, INDEL or whichever variation affecting a CDS, mRNA or gene , while every variation is classified as "TA=intergenic_variant".
That really puzzles me..
Thank you again for your concern.
Filippo Geuna
Hi Fillipo
The issue seems to be related to the gff or maybe to the correspondance between the fasta and the gff. We have a command called "TranscriptomeAnalyzer", which is useful to determine if a gff3 is well formed according to the specification and if it is consistent with a reference genome. If everything goes fine, this command actually gives you the catalog of CDNA, CDS and protein sequences, and some useful statistics. Please run this command and send me the log file to see why NGSEP is not accepting the gff file because at first sight it looks fine to me.
Best regards
Jorge
Hello Jorge,
I ran the TranscriptomeAnalyzer command in the following format:
java -jar NGSEPcore.jar TranscriptomeAnalyzer \
-i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
-o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1_vs_gff3.TranscriptomeAnalyzer \
-r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1.fasta
which generated the attached file
During the execution it issued a long series of warning messages on the terminal of which I attach the last few lines:
[...]
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497609535 497609735 . + . ID=transcript:TRITD0Uv1G147000.1;Parent=gene:TRITD0Uv1G147000;biotype=protein_coding;transcript_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497609535 497609735 . + . Parent=transcript:TRITD0Uv1G147000.1;Name=TRITD0Uv1G147000.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G147000.1-E1;rank=1. Unrecognized sequence name. Un
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497609535 497609735 . + 0 ID=CDS:TRITD0Uv1G147000.1;Parent=transcript:TRITD0Uv1G147000.1;protein_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
WARNING: No genes found in transcriptome file. Check that gene features have "gene" as the exact feature type
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
WARNING: No transcripts found in transcriptome file. Check that transcript features have "mRNA" as the exact feature type
Oct 21, 2021 10:20:20 PM ngsep.transcriptome.TranscriptomeAnalyzer run
INFO: Process finished
Sorry Jorge,
the message left without the attached output file of the TranscriptomeAnalyzer command that is attached here. The file is full of zeroes as if no element had been recognized. I suspect a possible problem with the coding of chromosome names in the .fasta file but I am not sure.
In case it were such a problem, can I fix it by editing (in which way?) the corresponding column in the input .vcf file?
Thanks
Filippo
Hi Filippo
Based on the last lines of the log, it looks like the issue is that the chromosome names are not consistent between the reference genome and the gff. To double check this, on one side run samtools faidx on the fasta. The first two columns of the resulting fai file should have the chromosome names and lengths of the sequences in the fasta file. You can contrast this with the chromosome names and lengths of the header of the gff file. If headers are not present, you can check the chromosome names of the "gene", "mRNA" and "CDS" features.
Since your VCF is probably consistent with the fasta, you need to either modify the gff to make it consistent or look for a different annotation gff with consistent chromosome names. In the first option you need to double check that the coordinates of the features are actually consistent. Otherwise, you will have bad annotations and protein predictions.
Let me know how things go
Jorge
Hello Jorge,
I ran again the VCFAnnotate command using a .fasta reference file with chromosome filenames consistent with the .gff3 file which I edited as well entirely as for the chromosome codes.
Here is the command issued:
java -jar NGSEPcore.jar VCFAnnotate \
-i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes.vcf \
-t /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
-r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1_TR_VI.fasta \
-o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes_annotated.vcf
Nevertheless, I got no output file this time but a long series of terminal messages ending with the following lines:
[...]
WARNING: Can not load genomic feature at line: Un CNR_Italy gene 497393756 497394025 . + . ID=gene:TRITD0Uv1G146970;biotype=protein_coding;gene_id=TRITD0Uv1G146970;logic_name=gff3_hc_genes. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497393756 497394025 . + . ID=transcript:TRITD0Uv1G146970.1;Parent=gene:TRITD0Uv1G146970;biotype=protein_coding;transcript_id=TRITD0Uv1G146970.1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497393756 497394025 . + . Parent=transcript:TRITD0Uv1G146970.1;Name=TRITD0Uv1G146970.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G146970.1-E1;rank=1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497393756 497394025 . + 0 ID=CDS:TRITD0Uv1G146970.1;Parent=transcript:TRITD0Uv1G146970.1;protein_id=TRITD0Uv1G146970.1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un CNR_Italy gene 497609535 497609735 . + . ID=gene:TRITD0Uv1G147000;biotype=protein_coding;gene_id=TRITD0Uv1G147000;logic_name=gff3_hc_genes. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497609535 497609735 . + . ID=transcript:TRITD0Uv1G147000.1;Parent=gene:TRITD0Uv1G147000;biotype=protein_coding;transcript_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497609535 497609735 . + . Parent=transcript:TRITD0Uv1G147000.1;Name=TRITD0Uv1G147000.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G147000.1-E1;rank=1. Unrecognized sequence name. Un
Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497609535 497609735 . + 0 ID=CDS:TRITD0Uv1G147000.1;Parent=transcript:TRITD0Uv1G147000.1;protein_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at ngsep.NGSEPcore.main(NGSEPcore.java:66)
Caused by: java.lang.IllegalArgumentException: Empty input segments: [] for transcript: transcript:TRITD4Bv1G073250.6
at ngsep.transcriptome.Transcript.setTranscriptSegments(Transcript.java:82)
at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:219)
at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:116)
at ngsep.vcf.VCFFunctionalAnnotator.loadMap(VCFFunctionalAnnotator.java:258)
at ngsep.vcf.VCFFunctionalAnnotator.run(VCFFunctionalAnnotator.java:216)
at ngsep.vcf.VCFFunctionalAnnotator.main(VCFFunctionalAnnotator.java:210)
... 5 more
Hi Filippo
Please share with me the fai file. From the part of the log that you send me, I would like to know if you actually have a chromosome called "Un". Please also use the symbol ">&" followed by a new file name to save the output to a log and share with me the full log. The Exception happens because the gff has a mRNA feature without children (CDS entries). To double check that please run "grep TRITD4Bv1G073250" on the gff file and share the answer.
Hi Jorge,
I looked at the chromosome headers in the .fasta file and the output is as follows:
which confirms that there is no chromosome called "Un".
java -jar NGSEPcore.jar VCFAnnotate \
-i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes.vcf \
-t /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
-r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1_TR_VI.fasta \
-o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes_annotated.vcf \
==> Curiously, no .vcf file has been generated as output!!
grep TRITD4Bv1G073250 /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz
did not return any result.
Hi Jorge,
sorry, the output of the .fai file is as follows (I had edited the .fasta reference file to change the long chromosome codes into the short ones to make them compatible with the .gff file):
1A 585266722 67 60 61
1B 681112512 595021302 60 61
2A 775448786 1287485757 60 61
2B 790338525 2075858757 60 61
3A 746673839 2879369658 60 61
3B 836514780 3638488128 60 61
4A 736872137 4488944888 60 61
4B 676292951 5238098295 60 61
5A 669155517 5925662863 60 61
5B 701372996 6605971039 60 61
6A 615672275 7319033652 60 61
6B 698614761 7944967199 60 61
7A 728031845 8655225607 60 61
7B 722970987 9395391384 60 61
Hi Filippo
About the fasta, now it is clear that the issue was that the chromosome names were not coordinated between the fasta and the gff3. Looking back at the header of the gff, the chromosome lengths coincide, which is a good sign. Now the problem is that you have to adjust or redo the VCF file because the chromosome names in the VCF are probably inconsistent with those in the fasta file. Before that, please make sure that you get good statistics from the Transcriptome Analyzer regardless of the lost annotation in the "Un" chromosome.
About the command, because the gff is compressed with gzip, please use zcat:
zcat /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz | grep TRITD4Bv1G073250
Let me know how things go
Hello Jorge,
I had already edited the VCF file to make the chromosome codes consistent with those in the Fasta and GFF files. Nevertheless, now no annotated VCF file is created by VCFAnnotate. Very strangely to me, VCFAnnotate had generated a huge output file when the chromosome codes were not consistent, in my first attempt (at the beginning of this forum thread).
That is very puzzling..
I'll run the Transcriptome Analyzer anyway and let you know.
Thanks.
The output of the command:
zcat /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz | grep TRITD4Bv1G073250
is included in the attached file.
Filippo
Hi Filippo
I checked the output of the command and the issue is that NGSEP can not process the last 7 lines, corresponding to the transcript "TRITD4Bv1G073250.6" because no CDS are annotated for that transcript. The description says that the transcript is a "nontranslating_CDS". I honestly have not seen this before. We will need to modify the software to handle properly such types of entries. In the mean time, please try generating a version of the gff without these type of transcripts. You can uncompress the file with unzip and then edit the file with a text editor to remove these lines and other lines with similar trnscript annotations. You can grep "nontranslating_CDS" to identify more instances. Please make sure first that the transcriptome analyzer runs fine on the edited gff before trying to annotate the VCF.
Let me know how things go.
Jorge
Hi Jorge,
I'll take care of that change but I need to stress that the main problem with VCFAnnotate is that it does not generate any output although apparently the chromosome codes are consistent between the various files (FASTA, VCF and GFF3). The problem with a missing transcript is really the least important one.
Filippo
Hi Filippo
I understand that. The issue is that the current version of the software finishes the process with an error trying to load the gff and then it does not even start processing the VCF file. While we fix this and release a new version, I need you to please make sure first that the software can read the gff using the transcriptome analyzer command. Before this happens it is not worth to even try to annotate the vcf because the command will keep failing before even reading the VCF file and the process will not generate any output. Once you get statistics from the transcriptome analyzer, you can try the annotation of the VCF.
Let me know how things go.
Jorge