Not at this moment, but it should not be difficult to add this feature for the next version. For now, we normally use uncompressed gff3 files because they are relatively small, unless they include the complete genomic sequence.
Let me know if you have further questions about the annotation process or about other functionalities.
Jorge
Anonymous - 2020-02-14
Hi
I am using the NGSEP version that is currently on github and following the tutorial.txt instructions, but the Annotate function and all the Stats functions are not being recognized by NGSEP. Are there any updates that I need to know about? Thanks.
Yes. We are about to make the first release of major version 4. One of the major changes in this version is a broad standardization of command and option names across all functionalities. With the exception of commands that take multiple input files of the same type (such as MergeVariants), all commands will receive their main input file through the -i option and use the -o option to specify the output file (or prefix or directory); all other inputs will be received through options. The former command "Annotate" is now called "VCFAnnotate" and the former command "SummaryStats" is now called "VCFSummaryStats". If you build the new jar (NGSEPcore_4.0.0.jar), you can already run the program without a command to see the new names (including new functionalities), and you can run a command without parameters to see its new usage.
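As a minimal sketch of the naming convention described above, the following Python helper maps the two old command names mentioned in this post to their new names and assembles an argument list with -i and -o. The helper names (RENAMED_COMMANDS, build_command) are hypothetical, and any other options a command needs are deliberately left to the caller, since only -i and -o are described here; running the command without parameters shows its actual usage.

```python
# Old-to-new command names mentioned in this post only; other renames are not listed.
RENAMED_COMMANDS = {
    "Annotate": "VCFAnnotate",
    "SummaryStats": "VCFSummaryStats",
}


def build_command(command, input_path, output_path,
                  jar="NGSEPcore_4.0.0.jar", extra_options=()):
    """Return an argument list that follows the -i / -o convention.

    extra_options is passed through unchanged; this sketch does not know
    which additional options each command requires.
    """
    name = RENAMED_COMMANDS.get(command, command)
    return ["java", "-jar", jar, name,
            "-i", input_path, "-o", output_path, *extra_options]


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(" ".join(build_command("Annotate", "HLA_Project.vcf", "HLA_ann.vcf")))
```

The list can be handed to subprocess.run once the remaining options for the chosen command have been added.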
We are currently finishing documentation tasks and, unfortunately, the training materials were not yet up to date with the latest changes. I just pushed the new version of the training materials to github, so people cloning the repository can already see how things will operate from now on. We will run a full round of testing to make sure that all functionalities work as expected in the new version. In the meantime, feel free to run the github version and let us know if you find any further issues (that would actually help a lot).
The good news related to the start of this thread is that in the new version both fasta reference genomes and gff annotation files can be gz compressed.
Let me know if you have any further questions or issues running NGSEP.
Jorge
Anonymous - 2020-02-19
Thanks Jorge! I am running into errors when calling VCFAnnotate:
INFO: Loading genome from: /data/ngsep_tutorial/reference/GRCh38_latest_genomic.fna
Feb 19, 2020 11:32:11 AM ngsep.main.OptionValuesDecoder loadGenome
INFO: Loaded genome with: 639 sequences. Total length: 3272089205 from file: /data/ngsep_tutorial/reference/GRCh38_latest_genomic.fna
Feb 19, 2020 11:32:11 AM ngsep.vcf.VCFFunctionalAnnotator logParameters
INFO: Input file: HLA_Project.vcf
GFF transcriptome file: GRCh38_latest_genomic.gff
Loaded reference genome from: /data/ngsep_tutorial/reference/GRCh38_latest_genomic.fna
Output file: HLA_ann.vcf
Upstream offset: 1000
Downstream offset: 300
Splice donor offset: 2
Splice acceptor offset: 2
Splice region intron offset: 10
Splice region exon offset: 2
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at ngsep.NGSEPcore.main(NGSEPcore.java:66)
Caused by: java.lang.IllegalArgumentException: Empty input segments: [] for transcript: rna-NM_001199281.1-2
at ngsep.transcriptome.Transcript.setTranscriptSegments(Transcript.java:81)
at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:219)
at ngsep.transcriptome.io.GFF3TranscriptomeHandler.loadMap(GFF3TranscriptomeHandler.java:116)
at ngsep.vcf.VCFFunctionalAnnotator.loadMap(VCFFunctionalAnnotator.java:258)
at ngsep.vcf.VCFFunctionalAnnotator.run(VCFFunctionalAnnotator.java:216)
at ngsep.vcf.VCFFunctionalAnnotator.main(VCFFunctionalAnnotator.java:210)
... 5 more
And also with the distanceMatrix function:
Feb 19, 2020 11:36:11 AM ngsep.vcf.VCFDistanceMatrixCalculator logParameters
INFO: Input file: HLA_Project
Output file: HLA_Project_matrix
Distance from genotype calls ignoring local copy number (GT format field)
Writing full matrix format
Samples ploidy: 2
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at ngsep.NGSEPcore.main(NGSEPcore.java:66)
Caused by: java.io.IOException: VCF file does not have line with sample ids
at ngsep.vcf.VCFFileHeader.loadSampleIds(VCFFileHeader.java:140)
at ngsep.vcf.VCFFileReader.init(VCFFileReader.java:139)
at ngsep.vcf.VCFFileReader.<init>(VCFFileReader.java:74)
at ngsep.vcf.VCFDistanceMatrixCalculator.generateMatrix(VCFDistanceMatrixCalculator.java:154)
at ngsep.vcf.VCFDistanceMatrixCalculator.run(VCFDistanceMatrixCalculator.java:118)
at ngsep.vcf.VCFDistanceMatrixCalculator.main(VCFDistanceMatrixCalculator.java:108)
... 5 more
Feb 19, 2020 11:36:11 AM ngsep.clustering.NeighborJoining run
INFO: Loading matrix from file HLA_Project.vcf
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at ngsep.NGSEPcore.main(NGSEPcore.java:66)
Caused by: java.io.IOException: Number format error reading number of samples
at ngsep.clustering.DistanceMatrix.loadFromFile(DistanceMatrix.java:48)
at ngsep.clustering.DistanceMatrix.<init>(DistanceMatrix.java:27)
at ngsep.clustering.NeighborJoining.run(NeighborJoining.java:81)
at ngsep.clustering.NeighborJoining.main(NeighborJoining.java:74)
... 5 more
Caused by: java.lang.NumberFormatException: For input string: "##fileformat=VCFv4.2"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at ngsep.clustering.DistanceMatrix.loadFromFile(DistanceMatrix.java:46)
... 8 more
The VCF file was generated by NGSEP after the bowtie2 mapping. Please let me know what I can do to fix these errors. Thanks.
As promised before, I just made the release of version 4.0.0. We actually fixed a few final bugs in the last few days, so my first suggestion would be to download and try again with the released version.
Regarding the error with VCFAnnotate, please double check in your gff3 file whether the annotation of the transcript "rna-NM_001199281.1-2" is consistent with the gff3 format specification. You can use the TranscriptomeAnalyzer command to check the consistency of gff3 files and, if everything goes right, obtain some useful statistics.
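Independently of TranscriptomeAnalyzer, a quick way to look for records like the one in the error message is to list transcript features that have no child segments in the gff3 file. The Python sketch below does this using only the generic gff3 layout (nine tab-separated columns, ID and Parent attributes in column 9). Which feature types NGSEP treats as transcripts and as segments is an assumption here (TRANSCRIPT_TYPES and SEGMENT_TYPES are hypothetical choices), so an empty result does not guarantee that the file loads.

```python
from collections import defaultdict

# Assumed feature types; the types NGSEP actually uses may differ.
TRANSCRIPT_TYPES = {"mRNA", "transcript"}
SEGMENT_TYPES = {"exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def parse_attributes(column9):
    """Parse the gff3 attributes column (key=value pairs separated by ';')."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def transcripts_without_segments(lines):
    """Return IDs of transcript features with no exon/CDS/UTR children."""
    transcript_ids = []
    child_count = defaultdict(int)
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; stop parsing features
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        attributes = parse_attributes(fields[8])
        if fields[2] in TRANSCRIPT_TYPES and "ID" in attributes:
            transcript_ids.append(attributes["ID"])
        elif fields[2] in SEGMENT_TYPES:
            # A segment may list several parents separated by commas.
            for parent in attributes.get("Parent", "").split(","):
                if parent:
                    child_count[parent] += 1
    return [tid for tid in transcript_ids if child_count[tid] == 0]


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for transcript_id in transcripts_without_segments(handle):
                print(transcript_id)
```

Saved under any name, the script takes the path to the gff3 file as its only argument and prints one transcript ID per line.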
Regarding the distance matrix function, the log indicates that you are also calling the "NeighborJoining" command with the VCF file as input. This command should instead receive the output of the "VCFDistanceMatrixCalculator" command (the distance matrix).
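The stack trace shows that the matrix loader tries to parse the first line of its input as the number of samples and fails on "##fileformat=VCFv4.2". A small Python sketch that tells the two kinds of file apart before calling NeighborJoining could look like this. The function names are hypothetical, and the assumption that a valid matrix file starts with an integer line is inferred only from that error message.

```python
def classify_first_line(first_line):
    """Classify a file by its first line: 'vcf', 'distance matrix' or 'unknown'."""
    text = first_line.strip()
    if text.startswith("##fileformat=VCF"):
        return "vcf"
    try:
        int(text)  # assumed: a matrix file starts with the number of samples
    except ValueError:
        return "unknown"
    return "distance matrix"


def classify_file(path):
    """Read only the first line of the file and classify it."""
    with open(path) as handle:
        return classify_first_line(handle.readline())


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        print(path, classify_file(path), sep="\t")
```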
Let me know how things go.
Jorge
Anonymous - 2020-02-20
Thanks Jorge! I think everything is working great right now. I'll let you know if I come across anything else.
Anonymous - 2020-02-28
Hello Jorge,
I used VCFConverter to get a structure format file in order to run STRUCTURE for further analysis, and I am getting this error:
WARNING! Probable error in the input file.
Individual 88, locus 52; encountered the following data
"AC-B4_S18" when expecting an integer
My structure file has the names of my samples in the first column, and "AC-B4_S18" is one of them. Is there a way to fix that? I am guessing that it is important to have the sample names in there.
From the error text, this looks like an error produced by STRUCTURE. The problem is probably not on the sample names line; rather, it looks like the text "AC-B4_S18" somehow also appears somewhere in the middle of the file (line 88 or 89 would be my first guess). You can grep for the text "AC-B4_S18" and see where it appears. If it is misplaced, double check your scripts, because it is unlikely that this error would be produced by the NGSEP converter.
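If grep only reports the matching lines, a short Python sketch like the following can also report the line number and the column (whitespace-separated field) in which the text appears, which makes a misplaced sample name easier to spot. The function name is hypothetical and the file is assumed to be whitespace-delimited.

```python
def find_occurrences(lines, text):
    """Return (line number, field number, field) for every field containing text.

    Line and field numbers start at 1; fields are split on whitespace.
    """
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for field_number, field in enumerate(line.split(), start=1):
            if text in field:
                hits.append((line_number, field_number, field))
    return hits


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        # Usage: script.py <structure file> <text to search>
        with open(sys.argv[1]) as handle:
            for hit in find_occurrences(handle, sys.argv[2]):
                print(*hit, sep="\t")
```

A hit in field 1 is a sample name in its expected place; a hit in any other field points to a misplaced name.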
Let me know how things go.
Jorge
Anonymous - 2020-03-02
Jorge,
This error is being output by the STRUCTURE software. I am just wondering whether my output file looks right when I convert from vcf to -structure using NGSEP. It looks something like this:
AC-G10_S79_L001 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 -9 -9 -9 -9 -9 -9 -
AC-G11_S80_L001 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 -9 -9 -9 -9 -9 -9 -
AC-G1_S78_L001 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 -9 -9 -9 -9 -9 -9 -9
with all the samples names on the first column.
By the way, when I grep "AC-B4_S18", it shows up in just one place, as part of the full name ("AC-B4_S18_L001") instead of just the first half.
The file format that you shared looks fine. In that case, my next guess would be an issue with character encoding, particularly with line breaks. If you are moving files from linux to windows or vice versa, you need to convert the line endings so that line breaks are handled properly. From windows to linux, use the dos2unix command. In windows, I think you can open the file in notepad and save it to change the character encoding. According to the error, the line with the issue is line 88, so another experiment you can try is to take the first 80 lines and see if they load fine (of course adjusting the number of individuals).
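Both checks suggested above can be sketched in a few lines of Python: counting Windows-style line endings, converting them, and keeping only the first lines of the file for a smaller test run. The function names are hypothetical. The field-count summary is an extra diagnostic not mentioned above, added on the assumption that every row of the file should have the same number of whitespace-separated fields.

```python
from collections import Counter


def inspect_line_endings(data):
    """Summarize line endings and per-line field counts of a file read as bytes."""
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf
    field_counts = Counter(
        len(line.split()) for line in data.splitlines() if line.strip()
    )
    return {"crlf": crlf, "lone_cr": lone_cr, "field_counts": dict(field_counts)}


def to_unix_line_endings(data):
    """Convert CRLF line endings to LF (similar in effect to dos2unix)."""
    return data.replace(b"\r\n", b"\n")


def first_lines(data, count):
    """Keep only the first `count` lines, preserving their line endings."""
    return b"".join(data.splitlines(keepends=True)[:count])


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as handle:
            print(inspect_line_endings(handle.read()))
```

A nonzero "crlf" count on linux, or more than one entry in "field_counts", would be worth resolving before running STRUCTURE again.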
Let me know how things go.
Jorge
Hi,
I have a question regarding the Annotate command. Can the Annotate command accept a gzipped gff3 file as input?