Menu

VCFAnnotate issue

Anonymous
2021-10-20
2021-11-03
  • Anonymous

    Anonymous - 2021-10-20

    Dear Jorge,

    I am going on with the analysis of my Triticum turgidum spp. durum landrace (a tetraploid species). In the past days I ran VCFAnnotate on a population.vcf file generated within NGSEP with SingleSampleVariantDetector (please notice that the input name is misleading as I am actually analyzing a unique landrace genome , not a population, against the reference 'Svevo' cv. in databank).
    So I am using a .gff file that I downloaded from EBI-ENSEMBL: http://ftp.ebi.ac.uk/ensemblgenomes/pub/release-51/plants/gff3/triticum_turgidum as reference.
    When issuing the following command line:

    java -jar NGSEPcore.jar VCFAnnotate \
    -i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population.vcf \
    -t /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
    -r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1.fasta \
    -o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_annotated.vcf

    a huge (437.8 MB) output file named "population_2_annotated.vcf" is generated and on the terminal a very rapid and long series of messages are issued ending with these lines:
    [...]
    WARNING: No genes found in transcriptome file. Check that gene features have "gene" as the exact feature type
    Oct 20, 2021 12:10:56 AM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
    WARNING: No transcripts found in transcriptome file. Check that transcript features have "mRNA" as the exact feature type
    Oct 20, 2021 12:28:23 AM ngsep.vcf.VCFFunctionalAnnotator run
    INFO: Process finished

    Surprisingly, when looking at the generated file (see the attached sample file with a few hundred lines), I cannot find any reference to SNP, INDEL or whichever variation affecting a CDS, mRNA or gene , while every variation is classified as "TA=intergenic_variant".
    That really puzzles me..

    Thank you again for your concern.

    Filippo Geuna

     
  • Jorge Duitama

    Jorge Duitama - 2021-10-21

    Hi Fillipo

    The issue seems to be related to the gff or maybe to the correspondance between the fasta and the gff. We have a command called "TranscriptomeAnalyzer", which is useful to determine if a gff3 is well formed according to the specification and if it is consistent with a reference genome. If everything goes fine, this command actually gives you the catalog of CDNA, CDS and protein sequences, and some useful statistics. Please run this command and send me the log file to see why NGSEP is not accepting the gff file because at first sight it looks fine to me.

    Best regards

    Jorge

     
  • Anonymous

    Anonymous - 2021-10-21

    Hello Jorge,

    I ran the TranscriptomeAnalyzer command in the following format:

    java -jar NGSEPcore.jar TranscriptomeAnalyzer \
    -i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
    -o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1_vs_gff3.TranscriptomeAnalyzer \
    -r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1.fasta

    which generated the attached file
    During the execution it issued a long series of warning messages on the terminal of which I attach the last few lines:

    [...]
    WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497609535 497609735 . + . ID=transcript:TRITD0Uv1G147000.1;Parent=gene:TRITD0Uv1G147000;biotype=protein_coding;transcript_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
    Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
    WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497609535 497609735 . + . Parent=transcript:TRITD0Uv1G147000.1;Name=TRITD0Uv1G147000.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G147000.1-E1;rank=1. Unrecognized sequence name. Un
    Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
    WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497609535 497609735 . + 0 ID=CDS:TRITD0Uv1G147000.1;Parent=transcript:TRITD0Uv1G147000.1;protein_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
    Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
    WARNING: No genes found in transcriptome file. Check that gene features have "gene" as the exact feature type
    Oct 21, 2021 10:20:20 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadMap
    WARNING: No transcripts found in transcriptome file. Check that transcript features have "mRNA" as the exact feature type
    Oct 21, 2021 10:20:20 PM ngsep.transcriptome.TranscriptomeAnalyzer run
    INFO: Process finished

     
  • Anonymous

    Anonymous - 2021-10-21

    Sorry Jorge,
    the message left without the attached output file of the TranscriptomeAnalyzer command that is attached here. The file is full of zeroes as if no element had been recognized. I suspect a possible problem with the coding of chromosome names in the .fasta file but I am not sure.
    In case it were such a problem, can I fix it by editing (in which way?) the corresponding column in the input .vcf file?

    Thanks

    Filippo

     
  • Jorge Duitama

    Jorge Duitama - 2021-10-22

    Hi Filippo

    Based on the last lines of the log, it looks like the issue is that the chromosome names are not consistent between the reference genome and the gff. To double check this, on one side run samtools faidx on the fasta. The first two columns of the resulting fai file should have the chromosome names and lengths of the sequences in the fasta file. You can contrast this with the chromosome names and lengths of the header of the gff file. If headers are not present, you can check the chromosome names of the "gene", "mRNA" and "CDS" features.

    Since your VCF is probably consistent with the fasta, you need to either modify the gff to make it consistent or look for a different annotation gff with consistent chromosome names. In the first option you need to double check that the coordinates of the features are actually consistent. Otherwise, you will have bad annotations and protein predictions.

    Let me know how things go

    Jorge

     
    • Anonymous

      Anonymous - 2021-10-25

      Hello Jorge,

      I ran again the VCFAnnotate command using a .fasta reference file with chromosome filenames consistent with the .gff3 file which I edited as well entirely as for the chromosome codes.
      Here is the command issued:

      java -jar NGSEPcore.jar VCFAnnotate \
      -i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes.vcf \
      -t /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
      -r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1_TR_VI.fasta \
      -o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes_annotated.vcf

      Nevertheless, I got no output file this time but a long series of terminal messages ending with the following lines:

      [...]
      WARNING: Can not load genomic feature at line: Un CNR_Italy gene 497393756 497394025 . + . ID=gene:TRITD0Uv1G146970;biotype=protein_coding;gene_id=TRITD0Uv1G146970;logic_name=gff3_hc_genes. Unrecognized sequence name. Un
      Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
      WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497393756 497394025 . + . ID=transcript:TRITD0Uv1G146970.1;Parent=gene:TRITD0Uv1G146970;biotype=protein_coding;transcript_id=TRITD0Uv1G146970.1. Unrecognized sequence name. Un
      Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
      WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497393756 497394025 . + . Parent=transcript:TRITD0Uv1G146970.1;Name=TRITD0Uv1G146970.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G146970.1-E1;rank=1. Unrecognized sequence name. Un
      Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
      WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497393756 497394025 . + 0 ID=CDS:TRITD0Uv1G146970.1;Parent=transcript:TRITD0Uv1G146970.1;protein_id=TRITD0Uv1G146970.1. Unrecognized sequence name. Un
      Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
      WARNING: Can not load genomic feature at line: Un CNR_Italy gene 497609535 497609735 . + . ID=gene:TRITD0Uv1G147000;biotype=protein_coding;gene_id=TRITD0Uv1G147000;logic_name=gff3_hc_genes. Unrecognized sequence name. Un
      Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
      WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 mRNA 497609535 497609735 . + . ID=transcript:TRITD0Uv1G147000.1;Parent=gene:TRITD0Uv1G147000;biotype=protein_coding;transcript_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
      Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
      WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 exon 497609535 497609735 . + . Parent=transcript:TRITD0Uv1G147000.1;Name=TRITD0Uv1G147000.1-E1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=TRITD0Uv1G147000.1-E1;rank=1. Unrecognized sequence name. Un
      Oct 25, 2021 6:06:51 PM ngsep.transcriptome.io.GFF3TranscriptomeHandler loadFeatureLine
      WARNING: Can not load genomic feature at line: Un PGSB_Jan2017 CDS 497609535 497609735 . + 0 ID=CDS:TRITD0Uv1G147000.1;Parent=transcript:TRITD0Uv1G147000.1;protein_id=TRITD0Uv1G147000.1. Unrecognized sequence name. Un
      Exception in thread "main" java.lang.reflect.InvocationTargetException
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.base/java.lang.reflect.Method.invoke(Method.java:567)
      at ngsep.NGSEPcore.main(NGSEPcore.java:66)
      Caused by: java.lang.IllegalArgumentException: Empty input segments: [] for transcript: transcript:TRITD4Bv1G073250.6
      at ngsep.transcriptome.Transcript.setTranscriptSegments(Transcript.java:82)
      at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:219)
      at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:116)
      at ngsep.vcf.VCFFunctionalAnnotator.loadMap(VCFFunctionalAnnotator.java:258)
      at ngsep.vcf.VCFFunctionalAnnotator.run(VCFFunctionalAnnotator.java:216)
      at ngsep.vcf.VCFFunctionalAnnotator.main(VCFFunctionalAnnotator.java:210)
      ... 5 more

       
  • Jorge Duitama

    Jorge Duitama - 2021-10-26

    Hi Filippo

    Please share with me the fai file. From the part of the log that you send me, I would like to know if you actually have a chromosome called "Un". Please also use the symbol ">&" followed by a new file name to save the output to a log and share with me the full log. The Exception happens because the gff has a mRNA feature without children (CDS entries). To double check that please run "grep TRITD4Bv1G073250" on the gff file and share the answer.

     
    • Anonymous

      Anonymous - 2021-10-28

      Hi Jorge,

      I looked at the chromosome headers in the .fasta file and the output is as follows:

      ENALT934111LT934111.1 Triticum turgidum subsp. durum genome assembly, chromosome: 1A
      ENALT934112LT934112.1 Triticum turgidum subsp. durum genome assembly, chromosome: 1B
      ENALT934113LT934113.1 Triticum turgidum subsp. durum genome assembly, chromosome: 2A
      ENALT934114LT934114.1 Triticum turgidum subsp. durum genome assembly, chromosome: 2B
      ENALT934115LT934115.1 Triticum turgidum subsp. durum genome assembly, chromosome: 3A
      ENALT934116LT934116.1 Triticum turgidum subsp. durum genome assembly, chromosome: 3B
      ENALT934117LT934117.1 Triticum turgidum subsp. durum genome assembly, chromosome: 4A
      ENALT934118LT934118.1 Triticum turgidum subsp. durum genome assembly, chromosome: 4B
      ENALT934119LT934119.1 Triticum turgidum subsp. durum genome assembly, chromosome: 5A
      ENALT934120LT934120.1 Triticum turgidum subsp. durum genome assembly, chromosome: 5B
      ENALT934121LT934121.1 Triticum turgidum subsp. durum genome assembly, chromosome: 6A
      ENALT934122LT934122.1 Triticum turgidum subsp. durum genome assembly, chromosome: 6B
      ENALT934123LT934123.1 Triticum turgidum subsp. durum genome assembly, chromosome: 7A
      ENALT934124LT934124.1 Triticum turgidum subsp. durum genome assembly, chromosome: 7B


      which confirms that there is no chromosome called "Un".

      • In addition I ran again VCFAnnotate and outputing the results onto the attached .log file.

      java -jar NGSEPcore.jar VCFAnnotate \
      -i /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes.vcf \
      -t /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz \
      -r /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/GCA_900231445.1_TR_VI.fasta \
      -o /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes_annotated.vcf \

      & /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/variants/population_2_chromosome_codes_annotated.log

      ==> Curiously, no .vcf file has been generated as output!!

      • The command:

      grep TRITD4Bv1G073250 /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz

      did not return any result.

       
    • Anonymous

      Anonymous - 2021-10-28

      Hi Jorge,

      sorry, the output of the .fai file is as follows (I had edited the .fasta reference file to change the long chromosome codes into the short ones to make them compatible with the .gff file):

      1A 585266722 67 60 61
      1B 681112512 595021302 60 61
      2A 775448786 1287485757 60 61
      2B 790338525 2075858757 60 61
      3A 746673839 2879369658 60 61
      3B 836514780 3638488128 60 61
      4A 736872137 4488944888 60 61
      4B 676292951 5238098295 60 61
      5A 669155517 5925662863 60 61
      5B 701372996 6605971039 60 61
      6A 615672275 7319033652 60 61
      6B 698614761 7944967199 60 61
      7A 728031845 8655225607 60 61
      7B 722970987 9395391384 60 61

       
  • Jorge Duitama

    Jorge Duitama - 2021-10-29

    Hi Filippo

    About the fasta, now it is clear that the issue was that the chromosome names were not coordinated between the fasta and the gff3. Looking back at the header of the gff, the chromosome lengths coincide, which is a good sign. Now the problem is that you have to adjust or redo the VCF file because the chromosome names in the VCF are probably inconsistent with those in the fasta file. Before that, please make sure that you get good statistics from the Transcriptome Analyzer regardless of the lost annotation in the "Un" chromosome.

    About the command, because the gff is compressed with gzip, please use zcat:

    zcat /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz | grep TRITD4Bv1G073250

    Let me know how things go

     
    • Anonymous

      Anonymous - 2021-10-29

      Hello Jorge,

      I had already edited the VCF file to make the chromosome codes consistent with those in the Fasta and GFF files. Nevertheless, now no annotated VCF file is created by VCFAnnotate. Very strangely to me, VCFAnnotate had generated a huge output file when the chromosome codes were not consistent, in my first attempt (at the beginning of this forum thread).
      That is very puzzling..
      I'll run the Transcriptome Analyzer anyway and let you know.
      Thanks.

       
    • Anonymous

      Anonymous - 2021-10-29

      The output of the command:

      zcat /home/users/filippo.geuna/genomica/Data_Frumento_duro_Tateo_copy/Triticum_turgidum.Svevo.v1.51.gff3.gz | grep TRITD4Bv1G073250

      is included in the attached file.

      Filippo

       
      • Jorge Duitama

        Jorge Duitama - 2021-11-01

        Hi Filippo

        I checked the output of the command and the issue is that NGSEP can not process the last 7 lines, corresponding to the transcript "TRITD4Bv1G073250.6" because no CDS are annotated for that transcript. The description says that the transcript is a "nontranslating_CDS". I honestly have not seen this before. We will need to modify the software to handle properly such types of entries. In the mean time, please try generating a version of the gff without these type of transcripts. You can uncompress the file with unzip and then edit the file with a text editor to remove these lines and other lines with similar trnscript annotations. You can grep "nontranslating_CDS" to identify more instances. Please make sure first that the transcriptome analyzer runs fine on the edited gff before trying to annotate the VCF.

        Let me know how things go.

        Jorge

         
        • Anonymous

          Anonymous - 2021-11-03

          Hi Jorge,

          I'll take care of that change but I need to stress that the main problem with VCFAnnotate is that it does not generate any output although apparently the chromosome codes are consistent between the various files (FASTA, VCF and GFF3). The problem with a missing transcript is really the least important one.
          Filippo

           
          • Jorge Duitama

            Jorge Duitama - 2021-11-03

            Hi Filippo

            I understand that. The issue is that the current version of the software finishes the process with an error trying to load the gff and then it does not even start processing the VCF file. While we fix this and release a new version, I need you to please make sure first that the software can read the gff using the transcriptome analyzer command. Before this happens it is not worth to even try to annotate the vcf because the command will keep failing before even reading the VCF file and the process will not generate any output. Once you get statistics from the transcriptome analyzer, you can try the annotation of the VCF.

            Let me know how things go.

            Jorge

             

Anonymous
Anonymous

Add attachments
Cancel





Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.