Hi,
I am trying to run the SingleSampleVariantsDetector on a single chromosome, chr22. My BAM file contains only that chromosome, and my FASTA file also contains only that chromosome. It gives me this error
which seems to mean that I need to specify which chromosome I am working with, but I cannot find a flag for that among the available options.
That was the code I ran; the variables refer to the file paths.
Can you help with that, please?
Thanks,
Dear Mahmoud
Thanks for your interest in NGSEP. This error occurs when the reference genome given to the variants detector is different from the one used to map the reads. If you only want variants in chr22, you can filter the BAM file. However, please keep the same reference file because, depending on how you filter the BAM file, the header could still list all the chromosomes, which makes the software fail. You can also use the option "-querySeq" of the SingleSampleVariantsDetector to call variants only on chromosome 22. Filtering the BAM file beforehand is a bit quicker, but in either case please use the complete reference genome.
Let me know how things go.
Thanks for the reply, Jorge. The thing is, the "-querySeq" flag takes a string, not a file, so I guess it won't be viable to use on the terminal for the whole chromosome. Also, using the whole reference file didn't work: the tool reported that a chromosome was missing from the reference file. So I am not sure what I should do here...
Hi Mahmoud
I do not understand this issue. In this case you still need to provide the reference FASTA file with the -r option. With the option -querySeq you tell the software that you only want to process one sequence ("chr22" in your case).
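To make the point above concrete: -querySeq takes the name of a sequence as a plain string, while -r still points at the complete reference file. The sketch below only prints the command (a dry run) so that it can be inspected first. The jar name and file paths are hypothetical placeholders, the function name is invented for this illustration, and the remaining options are copied from the command used elsewhere in this thread.

```shell
# Hypothetical placeholders; replace with your own files.
NGSEP="NGSEPcore.jar"          # path to the NGSEP jar (assumed name)
BAM="NA12877_chr22.bam"        # alignments to analyze
REF="full_reference.fa"        # the complete reference used for mapping

# Print the variant calling command for one sequence (dry run).
# $1 = sequence name, exactly as written in the BAM header and the FASTA.
# Remove the leading "echo" to actually run the command.
ngsep_call_one_sequence() {
  echo java -jar "$NGSEP" SingleSampleVariantsDetector \
    -i "$BAM" -r "$REF" -o "${1}_NGSEP" -sampleId NA12877 -querySeq "$1"
}

ngsep_call_one_sequence chr22
```

The sequence name passed to the function must match the name used in both files character for character.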
So now there is another update: I have got another reference genome that supposedly has all the decoys included. I tried to run it using this command:
java -jar $NGSEP SingleSampleVariantsDetector -i /home/ionadmin/bassyouni/source_bam/NA12877_chr22.bam -r /home/ionadmin/bassyouni/GRCh38_full_analysis_set_plus_decoy_hla.fa -o chr22_NGSEP -sampleId NA12877
It failed with another error of the same type.
So obviously it is telling me that the sequence KN707606.1 is not in the reference file, but I checked it myself by grepping for it with
grep "KN707606.1" GRCh38_full_analysis_set_plus_decoy_hla.fa
and it was there.
So what do you think might be wrong?
Hi Mahmoud
The issue is still the same: the sequence is named "chrUn_KN707606v1_decoy" in the FASTA file but "KN707606.1" in the BAM header. Ideally, you should provide the exact reference sequence that was used to generate the BAM file. However, some human genetics projects do not follow a clear standard for their reference and do not make it available alongside their BAM files, which is a pain for many people.
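One likely reason the grep check looked fine: in a regular expression an unescaped "." matches any character, so the pattern KN707606.1 also matches the text KN707606v1 inside chrUn_KN707606v1_decoy. A stricter check compares whole sequence names. The sketch below assumes that the sequence name is the first whitespace-delimited token after ">" on each FASTA header line; the function names fasta_names and has_sequence are made up for this illustration.

```shell
# Print the sequence name of every FASTA record: the first token after ">".
fasta_names() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Succeed only if the FASTA file has a record whose name is exactly $2.
# -F: treat the pattern as a fixed string (the dot is literal)
# -x: the whole line must match, not just a part of it
has_sequence() {
  fasta_names "$1" | grep -Fxq -- "$2"
}

# Usage (file name taken from the thread):
#   has_sequence GRCh38_full_analysis_set_plus_decoy_hla.fa KN707606.1 \
#     && echo "found" || echo "not a sequence name in this file"
```

Matching on the name alone avoids hits in the rest of the header line, where some references may repeat an accession as free text.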
If you do not have access to the exact reference genome, an alternative is to use samtools reheader to generate a new BAM file whose header contains only the chromosomes that you want to process. You may also need samtools view to generate a BAM file from a SAM file. If you manage to do so, you can go back to using chromosome 22 alone as the reference. Use samtools faidx to get a small index file listing the names of the sequences in the FASTA file, and make sure that they correspond exactly to those in the header of the BAM file. Double-check both names and lengths.
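The name and length check can be scripted. The sketch below works on plain text only: a SAM header (for example from samtools view -H) and a .fai index (written by samtools faidx), whose first two columns are the sequence name and its length. All function names and file names are hypothetical, and the sketch assumes tab-separated input as in both formats.

```shell
# Inputs are plain text files, produced for example with:
#   samtools view -H sample.bam > header.sam
#   samtools faidx reference.fa        # writes reference.fa.fai

# Print "name<TAB>length" for every @SQ line of a SAM header.
header_seqs() {
  awk -F'\t' '$1 == "@SQ" {
    name = ""; len = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)
      if ($i ~ /^LN:/) len = substr($i, 4)
    }
    print name "\t" len
  }' "$1"
}

# Print the header sequences (name and length) that have no exact match
# in the .fai index. Empty output means every header entry was found.
# $1 = SAM header, $2 = .fai index
missing_from_reference() {
  header_seqs "$1" | awk -F'\t' -v fai="$2" '
    BEGIN {
      # Load name and length (columns 1 and 2) of every indexed sequence.
      while ((getline line < fai) > 0) {
        split(line, f, "\t")
        ref[f[1] "\t" f[2]] = 1
      }
    }
    { key = $1 "\t" $2; if (!(key in ref)) print }
  '
}

# Keep all non-@SQ header lines, plus the @SQ line of a single sequence.
# $1 = SAM header, $2 = sequence name to keep (e.g. chr22)
keep_one_sequence() {
  awk -F'\t' -v keep="SN:$2" '
    $1 != "@SQ" { print; next }
    { for (i = 2; i <= NF; i++) if ($i == keep) { print; next } }
  ' "$1"
}
```

For example, missing_from_reference header.sam reference.fa.fai lists every header sequence that the variants detector would not find in the reference, and keep_one_sequence header.sam chr22 prints a reduced header that could serve as input for the samtools steps described above.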
Let me know how things go.
Alright, I will do that and get back to you if anything changes. Thank you so much for helping! Much appreciated.
On Mon, 10 Jan 2022 at 11:23 PM Jorge Duitama jduitama@users.sourceforge.net wrote:
Thanks @jduitama, I have found the reference genome that they used and it went through perfectly. Thanks again for helping, much appreciated!