Annotate

Anonymous
2018-07-12
2020-03-03
  • Anonymous

    Anonymous - 2018-07-12

    Hi,

    I have a question regarding the Annotate command. Can the Annotate command accept a gzipped gff3 file as input?

     
  • Jorge Duitama

    Jorge Duitama - 2018-07-15

    Hi

    Not at the moment, but it should not be difficult to add this feature in the next version. For now, we normally use uncompressed gff3 files because they are relatively small, unless they include the complete genomic sequence.
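
    As a workaround in the meantime, a gzipped gff3 file can be decompressed before it is passed to the command. A minimal Python sketch using only the standard library; the function name decompress_gff3 is hypothetical and is not part of NGSEP:

```python
import gzip
import shutil
from pathlib import Path


def decompress_gff3(gz_path: str) -> Path:
    """Write an uncompressed copy of a gzipped GFF3 file next to the original.

    Hypothetical helper for illustration only; it is not an NGSEP function.
    """
    src = Path(gz_path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got: {src.name}")
    dst = src.with_suffix("")  # e.g. annotation.gff3.gz -> annotation.gff3
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy in chunks so that large annotation files are not held in memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

    The returned path points to the uncompressed copy, which can then be used wherever the uncompressed gff3 file is expected.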

    Let me know if you have further questions about the annotation process or about other functionalities.

    Jorge

     
  • Anonymous

    Anonymous - 2020-02-14

    Hi
    I am using the NGSEP version that is currently on github and following the tutorial.txt instructions, but the Annotate() function and all the Stats functions are not being recognized by NGSEP. Are there any updates that I need to know about? Thanks

     
  • Jorge Duitama

    Jorge Duitama - 2020-02-14

    Yes. We are about to make the first release of the major version 4. One of the major changes for this version is a large standardization of command and option names across all functionalities. With the exception of commands that take multiple input files of the same type (such as MergeVariants), all commands will receive their main input file through the -i option and use the -o option to specify the output file (or prefix or directory); all other inputs will be received through options. The former command "Annotate" is now called "VCFAnnotate" and the former command "SummaryStats" is now called "VCFSummaryStats". If you build the new jar (NGSEPcore_4.0.0.jar), you can already run the program without commands to see the new names (including new functionalities) and you can run a command without parameters to see the new usage.

    We are currently finishing documentation tasks and, unfortunately, the training materials were not yet up to date with the latest changes. I just pushed the new version of the training materials to github, so people cloning the repository can already see how things will operate from now on. We will run a full testing round to make sure that all functionalities work as expected in the new version. In the meantime, feel free to run the github version and let us know if you find any further issues (that would actually help a lot).

    The good news related to the start of this post is that in the new version both fasta reference genomes and annotation gff files can be gzip-compressed.

    Let me know if you have any further questions or issues running NGSEP.

    Jorge

     
  • Anonymous

    Anonymous - 2020-02-19

    Thanks Jorge! I am running into errors when calling VCFAnnotate:

    INFO: Loading genome from: /data/ngsep_tutorial/reference/GRCh38_latest_genomic.fna
    Feb 19, 2020 11:32:11 AM ngsep.main.OptionValuesDecoder loadGenome
    INFO: Loaded genome with: 639 sequences. Total length: 3272089205 from file: /data/ngsep_tutorial/reference/GRCh38_latest_genomic.fna
    Feb 19, 2020 11:32:11 AM ngsep.vcf.VCFFunctionalAnnotator logParameters
    INFO: Input file: HLA_Project.vcf
    GFF transcriptome file: GRCh38_latest_genomic.gff
    Loaded reference genome from: /data/ngsep_tutorial/reference/GRCh38_latest_genomic.fna
    Output file: HLA_ann.vcf
    Upstream offset: 1000
    Downstream offset: 300
    Splice donor offset: 2
    Splice acceptor offset: 2
    Splice region intron offset: 10
    Splice region exon offset: 2

    Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at ngsep.NGSEPcore.main(NGSEPcore.java:66)
    Caused by: java.lang.IllegalArgumentException: Empty input segments: [] for transcript: rna-NM_001199281.1-2
    at ngsep.transcriptome.Transcript.setTranscriptSegments(Transcript.java:81)
    at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:219)
    at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:116)
    at ngsep.vcf.VCFFunctionalAnnotator.loadMap(VCFFunctionalAnnotator.java:258)
    at ngsep.vcf.VCFFunctionalAnnotator.run(VCFFunctionalAnnotator.java:216)
    at ngsep.vcf.VCFFunctionalAnnotator.main(VCFFunctionalAnnotator.java:210)
    ... 5 more

    And also with the distanceMatrix function:
    
    Feb 19, 2020 11:36:11 AM ngsep.vcf.VCFDistanceMatrixCalculator logParameters
    INFO: Input file: HLA_Project
    Output file: HLA_Project_matrix
    Distance from genotype calls ignoring local copy number (GT format field)
    Writing full matrix format
    Samples ploidy: 2

    Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at ngsep.NGSEPcore.main(NGSEPcore.java:66)
    Caused by: java.io.IOException: VCF file does not have line with sample ids
    at ngsep.vcf.VCFFileHeader.loadSampleIds(VCFFileHeader.java:140)
    at ngsep.vcf.VCFFileReader.init(VCFFileReader.java:139)
    at ngsep.vcf.VCFFileReader.<init>(VCFFileReader.java:74)
    at ngsep.vcf.VCFDistanceMatrixCalculator.generateMatrix(VCFDistanceMatrixCalculator.java:154)
    at ngsep.vcf.VCFDistanceMatrixCalculator.run(VCFDistanceMatrixCalculator.java:118)
    at ngsep.vcf.VCFDistanceMatrixCalculator.main(VCFDistanceMatrixCalculator.java:108)
    ... 5 more
    Feb 19, 2020 11:36:11 AM ngsep.clustering.NeighborJoining run
    INFO: Loading matrix from file HLA_Project.vcf
    Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at ngsep.NGSEPcore.main(NGSEPcore.java:66)
    Caused by: java.io.IOException: Number format error reading number of samples
    at ngsep.clustering.DistanceMatrix.loadFromFile(DistanceMatrix.java:48)
    at ngsep.clustering.DistanceMatrix.<init>(DistanceMatrix.java:27)
    at ngsep.clustering.NeighborJoining.run(NeighborJoining.java:81)
    at ngsep.clustering.NeighborJoining.main(NeighborJoining.java:74)
    ... 5 more
    Caused by: java.lang.NumberFormatException: For input string: "##fileformat=VCFv4.2"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:569)
    at java.lang.Integer.parseInt(Integer.java:615)
    at ngsep.clustering.DistanceMatrix.loadFromFile(DistanceMatrix.java:46)
    ... 8 more

    The VCF file was generated by NGSEP after the bowtie2 mapping. Please let me know what I can do to fix these errors. Thanks.

     
  • Jorge Duitama

    Jorge Duitama - 2020-02-19

    Hi

    As promised before, I just made the release of version 4.0.0. We actually fixed a few final bugs in the last few days, so my first suggestion would be to download and try again with the released version.

    Regarding the error with VCFAnnotate, please double check in your gff3 file whether the annotation of the transcript "rna-NM_001199281.1-2" is consistent with the gff3 format specification. You can use the command TranscriptomeAnalyzer to check the consistency of gff3 files and, if everything goes right, obtain some useful statistics.
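
    For a quick manual look before running the checker, a minimal Python sketch that lists transcripts with no child exon or CDS features. This rests on assumptions: that "Empty input segments" means the transcript has no such children, and that the feature types listed below are the relevant ones. It is not the check that NGSEP performs, and the function names are hypothetical.

```python
from typing import Iterable, List

# Assumption: feature types treated as transcripts and as transcript segments.
TRANSCRIPT_TYPES = {"mRNA", "transcript"}
SEGMENT_TYPES = {"exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def parse_attributes(column9: str) -> dict:
    """Split the GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcripts_without_segments(lines: Iterable[str]) -> List[str]:
    """Return IDs of transcript features that no segment names as its Parent."""
    transcripts = []                # transcript IDs in file order
    parents_with_segments = set()   # every ID referenced by a segment's Parent
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue                # not a feature line (e.g. embedded FASTA)
        attrs = parse_attributes(cols[8])
        if cols[2] in TRANSCRIPT_TYPES and "ID" in attrs:
            transcripts.append(attrs["ID"])
        elif cols[2] in SEGMENT_TYPES:
            # A segment may list several parents separated by commas.
            parents_with_segments.update(attrs.get("Parent", "").split(","))
    return [t for t in transcripts if t not in parents_with_segments]
```

    Calling transcripts_without_segments(open("GRCh38_latest_genomic.gff")) would show whether "rna-NM_001199281.1-2" is among the transcripts without segments.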

    Regarding the distance matrix function, the log seems to indicate that you are also calling the "NeighborJoining" command with the VCF file as input. This command should instead receive the output of the "VCFDistanceMatrixCalculator" command (the distance matrix).
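
    A minimal Python sketch for telling the two kinds of file apart from their first line. It assumes that a distance matrix file starts with the number of samples, which is only inferred from the "Number format error reading number of samples" message in the stack trace; the function name is hypothetical.

```python
def describe_first_line(first_line: str) -> str:
    """Classify a file by its first line: 'vcf', 'matrix' or 'unknown'.

    Assumption: a distance matrix file begins with an integer (the number
    of samples), as suggested by the stack trace in this thread.
    """
    text = first_line.strip()
    if text.startswith("##fileformat=VCF"):
        return "vcf"        # VCF header, not valid input for NeighborJoining
    try:
        int(text)
    except ValueError:
        return "unknown"
    return "matrix"
```

    For example, describe_first_line("##fileformat=VCFv4.2") returns "vcf", which matches the input string reported in the NumberFormatException.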

    Let me know how things go

    Jorge

     
  • Anonymous

    Anonymous - 2020-02-20

    Thanks Jorge! I think everything is working great right now. I'll let you know if I come across anything else.

     
  • Anonymous

    Anonymous - 2020-02-28

    Hello Jorge,
    I used VCFConverter to get a STRUCTURE-format file in order to run STRUCTURE for further analysis, and I am getting this error:
    WARNING! Probable error in the input file.
    Individual 88, locus 52; encountered the following data
    "AC-B4_S18" when expecting an integer

    My STRUCTURE file has the names of my samples in the first column, and "AC-B4_S18" is one of them. Is there a way to fix that? I am guessing that it is important to have the sample names in there.

     
  • Jorge Duitama

    Jorge Duitama - 2020-03-01

    Hi

    From the error text, this looks like an error produced by STRUCTURE. The problem is probably not in the sample names column; it looks like the text "AC-B4_S18" somehow also appears somewhere in the middle of the file (line 88 or 89 would be my first guess). You can grep the text "AC-B4_S18" and see where it appears. If it is misplaced, double check your scripts, because it is not likely that this error would be produced by the NGSEP converter.

    Let me know how things go.

    Jorge

     
  • Anonymous

    Anonymous - 2020-03-02

    Jorge,
    This error is being output by the STRUCTURE software. I am just wondering whether the file I get when I convert from vcf with the -structure option of NGSEP looks right. It looks something like this:
    AC-G10_S79_L001 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 -9 -9 -9 -9 -9 -9 -
    AC-G11_S80_L001 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 -9 -9 -9 -9 -9 -9 -
    AC-G1_S78_L001 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 -9 -9 -9 -9 -9 -9 -9
    with all the samples names on the first column.

    By the way, when I grep "AC-B4_S18" it appears in only one place, and with the full sample name ("AC-B4_S18_L001") instead of just the first half.

     
  • Jorge Duitama

    Jorge Duitama - 2020-03-03

    Hi

    The file format that you shared looks fine. In that case, my next guess would be an issue with character encoding, particularly with line breaks. If you are moving files from linux to windows or vice versa, you need to adjust the character encoding to make sure that line breaks are properly handled. From windows to linux, use the dos2unix command. In windows I think you can open the file in notepad and save it to change the character encoding. According to the error, the line with the issue is line 88, so another experiment you can make is to take the first 80 lines and see if they load fine (of course adjusting the number of individuals).
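
    Both guesses can be checked with a small script. The following is a minimal Python sketch, assuming every row has the layout of the excerpt above (a sample name followed by whitespace-separated integers); the function name check_structure_rows is hypothetical and is part of neither NGSEP nor STRUCTURE.

```python
import re
from typing import List

INTEGER = re.compile(r"-?\d+")  # genotype codes such as 0, 1 or -9


def check_structure_rows(raw: bytes) -> List[str]:
    """Report line-ending and row-shape problems in a STRUCTURE-style file.

    Assumption: each row is a sample name followed by integer genotype codes.
    """
    problems = []
    if b"\r" in raw:
        problems.append("file contains carriage returns (non-Unix line endings)")
    expected = None  # field count of the first non-empty row
    text = raw.decode("utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if expected is None:
            expected = len(fields)
        elif len(fields) != expected:
            problems.append(f"line {number}: {len(fields)} fields, expected {expected}")
        # Every field after the sample name should be an integer.
        bad = [f for f in fields[1:] if not INTEGER.fullmatch(f)]
        if bad:
            problems.append(f"line {number}: non-integer value {bad[0]!r}")
    return problems
```

    The file should be read in binary mode, for example check_structure_rows(open(path, "rb").read()), so that line endings reach the function unchanged. An empty list means that none of these problems were found.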

    Let me know how things go.

    Jorge

     
