Dear NGSEP team,
I have just experienced a problem with the MultisampleVariantsDetector module.
The analysis of 95 sorted bam samples with header seemed to run fine and finished with the message "INFO: Multisample Variants Detector Completed". The generated VCF file contained only the header lines, but no data lines with information about the SNP markers.
The command line I used was
java -jar NGSEPcore4.3.2.jar MultisampleVariantsDetector -r Theobromacacaocriollochr.v2.0.fna -maxAlnsPerStartPos 1000 .sorted.bam*.
Also attached is a log file and VCF output.
Any help would be greatly appreciated.
Xavier
Last edit: Xavier Argout 2023-09-05
Dear Xavier
Thanks for your interest in NGSEP. The asterisk in the command should go before sorted.bam (it should be *sorted.bam), but I guess that is just a typo in your message. Based on the log, everything looks fine. Please share some information on how you generated the BAM files. Make sure that the reference used in MultisampleVariantsDetector is exactly the same reference used to align the reads. If possible, please send me the result of the following command:
samtools view -h CCKM23-041.sorted.bam | head -n 1000
Best regards
Dear Jorge,
thank you for your prompt answer.
Yes it was a typo in my message, sorry.
Mapping was done with Bowtie2 (in paired mode) and the BAM files were generated with samtools, using this command line:
samtools view -u CCKM23-041.header.sam | samtools sort -o CCKM23-041.sorted.bam
Attached is the result of your command.
Thanks again for your help.
Xavier
Dear Xavier
The BAM file looks fine. The only thing I see is that the reference file used for mapping is named Theobroma_cacao_criollo_chr.v2.0.fna, while the file used for variant calling is named Theobromacacaocriollochr.v2.0.fna. Please use samtools faidx to verify that the two files contain exactly the same genome (including chromosome names), as follows:
samtools faidx Theobroma_cacao_criollo_chr.v2.0.fna
samtools faidx Theobromacacaocriollochr.v2.0.fna
If they are the same, please share the sequence of chr1 with me so that I can run a small internal test. You can do that with faidx as well:
samtools faidx Theobromacacaocriollochr.v2.0.fna chr1 > chr1.fa
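A minimal sketch of the comparison, assuming samtools faidx has already written a .fai index next to each FASTA file. The function name same_sequences is only an illustration. It compares the first two columns of each index (sequence name and length), so it would not catch differences in the bases themselves.

```shell
# Compare sequence names and lengths (columns 1 and 2) of two .fai indexes.
# Prints "same" if both indexes list identical names and lengths in the
# same order; otherwise prints "different" and returns a non-zero status.
same_sequences() {
    fai_a=$1
    fai_b=$2
    names_a=$(cut -f1,2 "$fai_a")
    names_b=$(cut -f1,2 "$fai_b")
    if [ "$names_a" = "$names_b" ]; then
        echo "same"
    else
        echo "different"
        return 1
    fi
}
```

With the file names from this thread, the call would be: same_sequences Theobroma_cacao_criollo_chr.v2.0.fna.fai Theobromacacaocriollochr.v2.0.fna.fai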
Dear Jorge,
it is the same reference file I used for mapping and for NGSEP.
Please find attached chr1.fa
Last edit: Xavier Argout 2023-09-05
Dear Xavier
I looked more closely at the commands used to sort the alignments and ran a few tests, and it seems that the issue happened at the sorting step. It looks like the samtools view command is not preserving the header written by bowtie2, and samtools sort could then be removing the RG tag from the alignments. The net effect is that the alignments in your sorted BAM are missing a tag like this:
RG:Z:CCKM23-041
This tag is required by MultisampleVariantsDetector (and, as far as I remember, by GATK as well) to know which read group each alignment belongs to. I ran a small test adding this tag to a few alignments and the SNPs started showing up. To reproduce this, you can download the attached modified version of your alignments and run a command like this:
java -Xmx4g -jar /path/to/NGSEPcore_4.3.2.jar MultisampleVariantsDetector -r chr1.fa -o testMultiSample.vcf -maxAlnsPerStartPos 1000 firstReads.sam
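To check whether a sorted BAM file has the missing-tag problem described above, a rough sketch like the following may help. It reads SAM text on standard input and counts the alignment records that carry no RG:Z tag. The function name count_missing_rg is hypothetical, and the check only looks at the optional fields of each record (column 12 onwards in the SAM format); it does not validate the @RG header lines.

```shell
# Read SAM text on stdin and print how many alignment records
# have no RG:Z: tag among their optional fields.
count_missing_rg() {
    awk -F '\t' '
        /^@/ { next }                        # skip header lines
        {
            has_rg = 0
            for (i = 12; i <= NF; i++)       # optional tags start at column 12
                if ($i ~ /^RG:Z:/) has_rg = 1
            if (!has_rg) missing++
        }
        END { print missing + 0 }            # print 0 when nothing is missing
    '
}
```

For example: samtools view -h CCKM23-041.sorted.bam | head -n 1000 | count_missing_rg. Any result other than 0 means that some of the inspected alignments have no read group tag.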
If possible, I think it is simpler to use Picard SortSam (https://broadinstitute.github.io/picard/) to sort the alignments. Picard can read the SAM files from bowtie2 directly and can generate a BAM index within the same command. You can see our script runMappingBowtie in the training directory for further details.
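As a rough illustration of that workflow (not a copy of the runMappingBowtie script), the sketch below prints a mapping command and a sorting command for one sample without running them. It assumes that the read group is set at mapping time with the bowtie2 options --rg-id and --rg, and that Picard is called with its classic KEY=value argument style; newer Picard releases may prefer a different argument syntax. The function name, the picard.jar path, the index prefix and the FASTQ names are all placeholders.

```shell
# Print (do not run) a mapping command and a sorting command for one sample.
# Arguments: sample name, bowtie2 index prefix, FASTQ file 1, FASTQ file 2.
# All file names and the picard.jar location are placeholders.
print_mapping_commands() {
    sample=$1
    index=$2
    fq1=$3
    fq2=$4
    # --rg-id writes an @RG header line and attaches RG:Z:<id> to every alignment.
    echo "bowtie2 --rg-id $sample --rg SM:$sample -x $index -1 $fq1 -2 $fq2 -S $sample.sam"
    # SortSam sorts by coordinate and writes the BAM index in the same step.
    echo "java -jar picard.jar SortSam I=$sample.sam O=$sample.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true"
}
```

Calling print_mapping_commands CCKM23-041 cacao_index reads_1.fastq reads_2.fastq prints the two commands, which can be reviewed and adapted before running them.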
Let me know how things go
Jorge
Dear Jorge,
thank you for your advice. I re-sorted the Bowtie2 SAM files with Picard tools and MultisampleVariantsDetector now works!!!
Problem solved!
Thank you,
Xavier