I have used the NGSEP3.3 plugin for years, and am converting to NGSEP4windows.
I have sorted, indexed BAM files from samtools. I go to that folder in NGSEP4, right-click on the folder name, and choose Multiple variant detector. That opens a window to Select alignment files, but there are no files in the table and the Select All button does nothing. (screenshot attached)
Hi
Thanks for your interest in NGSEP. The reason for this behavior is probably that your BAM files do not have RG tags in the header; the multisample variants detector needs those tags to distribute alignments among samples. If possible, please run samtools view -H on any of the BAM files to double-check whether this is the case.
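A quick way to check this programmatically (a minimal sketch in plain Python, operating on the captured text of the header rather than on the BAM itself):

```python
# Minimal check, assuming the header text printed by `samtools view -H`
# has been captured as a string: the multisample detector needs @RG
# lines (with SM sample names) to assign alignments to samples.

def has_read_groups(header_text: str) -> bool:
    """Return True if any header line is an @RG record."""
    return any(line.startswith("@RG") for line in header_text.splitlines())

# A header with only @HD/@SQ/@PG lines (no @RG) will make the file
# invisible to the detector:
header = "@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:NC_052532.1\tLN:196449156\n"
print(has_read_groups(header))  # False
```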
I generated sorted, indexed BAM files using samtools 0.1.19, and the Multiple variant detector failed to recognize any of the 96 files. I then did a samtools view -H and I don't see any RG anywhere:
Output:
@HD VN:1.0 SO:coordinate
@SQ SN:NC_052532.1 LN:196449156
@SQ SN:NW_024095932.1 LN:42181
@SQ SN:NW_024095933.1 LN:76880
@SQ SN:NW_024095934.1 LN:38338
@SQ SN:NW_024095935.1 LN:50781
@SQ SN:NC_052533.1 LN:149539284
@SQ SN:NC_052534.1 LN:110642502
@SQ SN:NC_052535.1 LN:90861225
@SQ SN:NW_024095936.1 LN:24491
@SQ SN:NW_024095937.1 LN:23670
@SQ SN:NC_052536.1 LN:59506338
@SQ SN:NC_052537.1 LN:36220557
@SQ SN:NC_052538.1 LN:36382834
@SQ SN:NC_052539.1 LN:29578256
@SQ SN:NC_052540.1 LN:23733309
@SQ SN:NC_052541.1 LN:20453248
@SQ SN:NC_052542.1 LN:19638187
@SQ SN:NC_052543.1 LN:20119077
@SQ SN:NC_052544.1 LN:17905061
@SQ SN:NC_052545.1 LN:15331188
@SQ SN:NC_052546.1 LN:12703657
@SQ SN:NC_052547.1 LN:2706039
@SQ SN:NW_024095938.1 LN:64959
@SQ SN:NW_024095939.1 LN:236755
@SQ SN:NW_024095940.1 LN:24076
@SQ SN:NW_024095941.1 LN:27394
@SQ SN:NW_024095942.1 LN:76531
@SQ SN:NC_052548.1 LN:11092391
@SQ SN:NC_052549.1 LN:11623896
@SQ SN:NC_052550.1 LN:10455293
@SQ SN:NC_052551.1 LN:14265659
@SQ SN:NC_052552.1 LN:6970754
@SQ SN:NC_052553.1 LN:4686657
@SQ SN:NC_052554.1 LN:6253421
@SQ SN:NC_052555.1 LN:6478339
@SQ SN:NC_052556.1 LN:3067737
@SQ SN:NC_052557.1 LN:5349051
@SQ SN:NC_052558.1 LN:5228753
@SQ SN:NC_052559.1 LN:5437364
@SQ SN:NC_052560.1 LN:726478
@SQ SN:NW_024095943.1 LN:39239
@SQ SN:NW_024095944.1 LN:52620
@SQ SN:NC_052561.1 LN:755666
@SQ SN:NC_052562.1 LN:2457334
@SQ SN:NW_024095945.1 LN:89875
@SQ SN:NW_024095946.1 LN:52746
@SQ SN:NW_024095947.1 LN:105159
@SQ SN:NC_052563.1 LN:125424
@SQ SN:NW_024095948.1 LN:99000
@SQ SN:NW_024095949.1 LN:72078
@SQ SN:NW_024095950.1 LN:54963
@SQ SN:NW_024095951.1 LN:43198
@SQ SN:NW_024095952.1 LN:105149
@SQ SN:NW_024095953.1 LN:70221
@SQ SN:NC_052564.1 LN:3839931
@SQ SN:NW_024095954.1 LN:62499
@SQ SN:NW_024095955.1 LN:52169
@SQ SN:NC_052565.1 LN:3469343
@SQ SN:NC_052566.1 LN:554126
@SQ SN:NW_024095956.1 LN:259507
@SQ SN:NW_024095957.1 LN:52716
@SQ SN:NW_024095958.1 LN:31960
@SQ SN:NW_024095959.1 LN:83545
@SQ SN:NW_024095960.1 LN:44029
@SQ SN:NC_052567.1 LN:358375
@SQ SN:NW_024095961.1 LN:244274
@SQ SN:NW_024095962.1 LN:135931
@SQ SN:NW_024095963.1 LN:39716
@SQ SN:NW_024095964.1 LN:39622
@SQ SN:NW_024095965.1 LN:28486
@SQ SN:NW_024095966.1 LN:90962
@SQ SN:NC_052568.1 LN:157853
@SQ SN:NW_024095967.1 LN:183934
@SQ SN:NW_024095968.1 LN:56170
@SQ SN:NW_024095969.1 LN:147375
@SQ SN:NW_024095970.1 LN:123103
@SQ SN:NW_024095971.1 LN:56291
@SQ SN:NW_024095972.1 LN:61860
@SQ SN:NC_052569.1 LN:667312
@SQ SN:NC_052570.1 LN:177356
@SQ SN:NW_024095973.1 LN:60241
@SQ SN:NW_024095974.1 LN:122554
@SQ SN:NC_052571.1 LN:9109940
@SQ SN:NW_024095975.1 LN:103662
@SQ SN:NW_024095976.1 LN:96352
@SQ SN:NW_024095977.1 LN:179921
@SQ SN:NW_024095978.1 LN:63726
@SQ SN:NW_024095979.1 LN:102867
@SQ SN:NW_024095980.1 LN:75790
@SQ SN:NW_024095981.1 LN:59700
@SQ SN:NW_024095982.1 LN:454552
@SQ SN:NW_024095983.1 LN:184108
@SQ SN:NW_024095984.1 LN:107179
@SQ SN:NW_024095985.1 LN:70535
@SQ SN:NW_024095986.1 LN:83855
@SQ SN:NW_024095987.1 LN:62949
@SQ SN:NW_024095988.1 LN:72665
@SQ SN:NW_024095989.1 LN:68472
@SQ SN:NW_024095990.1 LN:43858
@SQ SN:NW_024095991.1 LN:65185
@SQ SN:NW_024095992.1 LN:121826
@SQ SN:NW_024095993.1 LN:157377
@SQ SN:NC_052572.1 LN:86044486
@SQ SN:NW_024095994.1 LN:139882
@SQ SN:NW_024095995.1 LN:88251
@SQ SN:NW_024095996.1 LN:447004
@SQ SN:NW_024095997.1 LN:279492
@SQ SN:NW_024095998.1 LN:172381
@SQ SN:NW_024095999.1 LN:219221
@SQ SN:NW_024096000.1 LN:48159
@SQ SN:NW_024096001.1 LN:142519
@SQ SN:NW_024096002.1 LN:56807
@SQ SN:NW_024096003.1 LN:56774
@SQ SN:NW_024096004.1 LN:121294
@SQ SN:NW_024096005.1 LN:145889
@SQ SN:NW_024096006.1 LN:46453
@SQ SN:NW_024096007.1 LN:243023
@SQ SN:NW_024096008.1 LN:76213
@SQ SN:NW_024096009.1 LN:102453
@SQ SN:NW_024096010.1 LN:51278
@SQ SN:NW_024096011.1 LN:239848
@SQ SN:NW_024096012.1 LN:132137
@SQ SN:NW_024096013.1 LN:82910
@SQ SN:NW_024096014.1 LN:45339
@SQ SN:NW_024096015.1 LN:37135
@SQ SN:NW_024096016.1 LN:28217
@SQ SN:NW_024096017.1 LN:27687
@SQ SN:NW_024096018.1 LN:92785
@SQ SN:NW_024096019.1 LN:43833
@SQ SN:NW_024096020.1 LN:37288
@SQ SN:NW_024096021.1 LN:36282
@SQ SN:NW_024096022.1 LN:35597
@SQ SN:NW_024096023.1 LN:71264
@SQ SN:NW_024096024.1 LN:35936
@SQ SN:NW_024096025.1 LN:75927
@SQ SN:NW_024096026.1 LN:54073
@SQ SN:NW_024096027.1 LN:77924
@SQ SN:NW_024096028.1 LN:91947
@SQ SN:NW_024096029.1 LN:99652
@SQ SN:NW_024096030.1 LN:128647
@SQ SN:NW_024096031.1 LN:73711
@SQ SN:NW_024096032.1 LN:50063
@SQ SN:NW_024096033.1 LN:95042
@SQ SN:NW_024096034.1 LN:94594
@SQ SN:NW_024096035.1 LN:77453
@SQ SN:NW_024096036.1 LN:73086
@SQ SN:NW_024096037.1 LN:66990
@SQ SN:NW_024096038.1 LN:64784
@SQ SN:NW_024096039.1 LN:61262
@SQ SN:NW_024096040.1 LN:60993
@SQ SN:NW_024096041.1 LN:60148
@SQ SN:NW_024096042.1 LN:58765
@SQ SN:NW_024096043.1 LN:57327
@SQ SN:NW_024096044.1 LN:57283
@SQ SN:NW_024096045.1 LN:55004
@SQ SN:NW_024096046.1 LN:51700
@SQ SN:NW_024096047.1 LN:49878
@SQ SN:NW_024096048.1 LN:49896
@SQ SN:NW_024096049.1 LN:49236
@SQ SN:NW_024096050.1 LN:48223
@SQ SN:NW_024096051.1 LN:48023
@SQ SN:NW_024096052.1 LN:47344
@SQ SN:NW_024096053.1 LN:47336
@SQ SN:NW_024096054.1 LN:45670
@SQ SN:NW_024096055.1 LN:44737
@SQ SN:NW_024096056.1 LN:44416
@SQ SN:NW_024096057.1 LN:44091
@SQ SN:NW_024096058.1 LN:43804
@SQ SN:NW_024096059.1 LN:43447
@SQ SN:NW_024096060.1 LN:41869
@SQ SN:NW_024096061.1 LN:41187
@SQ SN:NW_024096062.1 LN:40905
@SQ SN:NW_024096063.1 LN:40485
@SQ SN:NW_024096064.1 LN:38749
@SQ SN:NW_024096065.1 LN:38267
@SQ SN:NW_024096066.1 LN:38098
@SQ SN:NW_024096067.1 LN:37156
@SQ SN:NW_024096068.1 LN:36049
@SQ SN:NW_024096069.1 LN:36169
@SQ SN:NW_024096070.1 LN:34455
@SQ SN:NW_024096071.1 LN:32761
@SQ SN:NW_024096072.1 LN:32129
@SQ SN:NW_024096073.1 LN:29996
@SQ SN:NW_024096074.1 LN:29845
@SQ SN:NW_024096075.1 LN:29612
@SQ SN:NW_024096076.1 LN:27008
@SQ SN:NW_024096077.1 LN:26690
@SQ SN:NW_024096078.1 LN:26680
@SQ SN:NW_024096079.1 LN:25600
@SQ SN:NW_024096080.1 LN:25356
@SQ SN:NW_024096081.1 LN:24872
@SQ SN:NW_024096082.1 LN:24165
@SQ SN:NW_024096083.1 LN:23651
@SQ SN:NW_024096084.1 LN:22200
@SQ SN:NW_024096085.1 LN:20719
@SQ SN:NW_024096086.1 LN:20625
@SQ SN:NW_024096087.1 LN:20391
@SQ SN:NW_024096088.1 LN:20277
@SQ SN:NW_024096089.1 LN:17176
@SQ SN:NW_024096090.1 LN:16695
@SQ SN:NW_024096091.1 LN:15631
@SQ SN:NW_024096092.1 LN:13796
@SQ SN:NW_024096093.1 LN:9587
@SQ SN:NW_024096094.1 LN:9004
@SQ SN:NW_024096095.1 LN:5394
@SQ SN:NW_024096096.1 LN:4183
@SQ SN:NW_024096097.1 LN:3676
@SQ SN:NW_024096098.1 LN:3291
@SQ SN:NW_024096099.1 LN:2948
@SQ SN:NW_024096100.1 LN:2415
@SQ SN:NW_024096101.1 LN:2044
@SQ SN:NW_024096102.1 LN:1713
@SQ SN:NW_024096103.1 LN:1437
@SQ SN:NC_053523.1 LN:16784
@PG ID:bowtie2 PN:bowtie2 VN:2.4.4 CL:"/home/dougrhoads/miniconda3/bin/bowtie2-align-l --wrapper basic-0 -x iGRCg7b -p 8 --time --passthrough -1 fq/A02-775-AGCTAGCT_R1.fastq -2 fq/A02-775-AGCTAGCT_R2.fastq"
Dear Douglas
That is exactly the issue. Please find attached an updated version of the script that we used to include in NGSEP3 for mapping reads with bowtie2, including the options to properly register the needed RG tags. Although ideally we would like people to start using our read aligner, we also want to remain interoperable with sorted BAM files generated with bowtie2 and bwa. This script is now available in the training folder and will be included in future versions.
Let me know how things go.
The issue I have with the internal read aligner in NGSEP4 is that it is far slower than the Bowtie2 I am using in Win10/WSL2-Ubuntu. I have a set of 96 read libraries from different individuals at 10-15x coverage on a 1 Gb genome. With Bowtie2 that gets done in about 4 days. I tried aligning one set of reads in NGSEP4 with threads set to 8 (like I do for Bowtie2) and ended up killing it after 24 hours, so I don't know how to speed that up. My next set of DNAs will be 96 samples at 20x, so I am probably going to go to distributed processing on our server. I will try the script you sent and let you know.
Hi
Thanks for reporting this issue. You are right that the current implementation is slower than bowtie2. Based on our current benchmark experiments, you can reduce runtime without losing accuracy by increasing the k-mer length to 21. We are also working on options to further reduce runtime.
I was able to go back and re-run bowtie2 on three of the samples to add the RG data:
bowtie2 -x iGRCg7b -1 fq/A01-420-CGTACGTA_R1.fastq -2 fq/A01-420-CGTACGTA_R2.fastq --rg-id 420 --rg SM:420 -S sam/420.sam --no-unal -p 8 --time
then ran samtools to convert to BAM, then sorted and indexed the BAM.
Those BAM files were then available for the Multiple variant detector. But I did not see an option to run the variant detector separately. I used to run them separately, then merge, then filter, so this seems to detect and merge at the same time. In NGSEP3.3, for 48 samples it would take 18 hours to variant detect the 48 BAM files, 18 hours to merge, and 18-20 hrs to filter. I ran the Multiple Variant Detector on 3 BAM files and it took 20 hours, so I am concerned about how long it would take for 48. I guess I will have to go back and try it to find out. Thanks for the guidance, and great software.
Hi Douglas
Thanks for your feedback. This is very valuable for us to keep improving the software. Right now you can run the per-sample process sample by sample, right-clicking on each BAM file. We still do not have an option in the new interface to run per-sample variants discovery in parallel. For this process you would also need to genotype the samples after merging variants and then merge the VCF files, to obtain homozygous reference genotype calls. About the time issue, I guess that the 48 samples will take longer (hopefully less than 3 days) and you may need a bit of extra memory, but you will get your VCF file ready for filtering.
Let me know if you have further issues running the software.
I went back and reran the process adding RG-ids for all the samples. Typically we have 2 sets of 48 samples (24 case and 24 control). In the past, I then used NGSEP to convert to .vcf, then merge them, and then filter. With NGSEP3.3, this latest set of 48 took 14 hours to convert to VCF, then about 36 hours to merge, and the merged file was 26.22 GB. I have been running the Multiple Variant Detector in NGSEP4 since Oct 11 17:00, so elapsed time is ~4.5 days; the merged VCF is 81 GB and the progress bar has hardly moved. I don't know what the problem is. I had previously run a test on 3 files and that took about 6-8 hours. This is a 1 Gb chicken genome with 8-10x coverage, and 16M SNPs.
Hi Douglas
Sorry for the delay. If possible, please run tail on the VCF file to see in which chromosome it is running right now. Please also share the log to see the progress of the process.
Based on your description it seems like you used to run the per-sample pipeline, but you could be missing the genotyping step. The per-sample approach can be executed with the new interface, but it is a bit more complicated. You need to right-click on each BAM file and call the variants detector. Then you need to merge the VCFs to produce the catalog of variants. Then you need to go over each BAM file to run the genotyping process, and finally you need to merge again the VCF files from this second round. This process takes a longer total runtime, but since the discovery and genotyping steps can be parallelized, it could take less wall-clock time if you have enough processors. The process is described in the command-line tutorial and in the manual of the graphical interface.
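As a rough sketch of how the per-sample discovery runs could be launched in parallel from a script instead of the GUI. The module name "SingleSampleVariantsDetector" and the reference file name "iGRCg7b.fa" are assumptions here; run `java -jar NGSEPcore.jar` to list the exact module names in your version. The code only builds and prints the commands so you can review them before executing anything:

```python
# Sketch only: build one NGSEP command line per BAM file and dispatch
# them with a process pool. Nothing is executed; swap the print in
# main() for subprocess.run(cmd, check=True) inside a worker function
# to actually run the jobs.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def build_command(bam: Path) -> list:
    """Build the java command line for one BAM file (not executed here)."""
    out_prefix = bam.with_suffix("")  # e.g. 420.sort.bam -> 420.sort
    return ["java", "-Xmx8g", "-jar", "NGSEPcore.jar",
            "SingleSampleVariantsDetector",   # assumed module name
            "-i", str(bam),
            "-r", "iGRCg7b.fa",               # assumed reference file name
            "-o", str(out_prefix)]

def main():
    bams = sorted(Path("bam").glob("*.sort.bam"))
    with ProcessPoolExecutor(max_workers=8) as pool:
        for cmd in pool.map(build_command, bams):
            print(" ".join(cmd))

if __name__ == "__main__":
    main()
```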
Let me know how things go
The current log file is attached. Looks like it is currently at Chr 4: 28,700,000, so in 9 days of running it is about halfway done with the genome, but that is only one set of 48. What I used to do was convert all of the BAM to VCF in one batch operation, which was one click in NGSEP3.3. There are 96 BAM files. Then I would merge in sets of 48, then filter for biallelic SNPs with heterozygosity 0.1-0.5, and then extract the MAF for Case and Control using SNPtest.
Hi Douglas
Thanks. Looking at the log, the process will take a lot of time, so I think it is better to stop it and try some quicker options. Looking at your parameters, I see that you reduced the variant QS to zero. This explains why the VCF is so large this time and may explain in part the very long runtime. Please keep this parameter at 40 to ensure that at least one sample calls a variant with such a genotype quality score, and keep only variants worth analyzing. Please also reduce the maximum number of alignments starting at the same position to 2. This will reduce runtime and may also help if you have any PCR amplification artifacts. I also see that you increased the prior heterozygosity to 0.01. Unless you have a reason to believe that your mice are highly heterozygous, it is better to keep the default for this parameter. Finally, you can reduce the maximum base quality score to 30.
To go quicker, you can run the multisample variants detector in batches and then merge the resulting VCF files, but please do not separate the samples as case/control, because you may miss some interesting variants if the alternative allele is not present in all the groups. A random partition should do better.
Let me know how things go
Not mice, these are chickens, and they are the product of a cross. I am only interested in SNPs that have a MAF > 0.1, because we are mapping QTLs with major effect, not looking for rare alleles. When I merge, I merge 24 case and 24 control.
My apologies for the confusion. In that case, I think you can run even four groups of 12 samples in parallel with the parameters that I mentioned earlier and then merge the VCF files. You can use the filtering facility to keep SNPs genotyped in at least 40 individuals with MAF above 0.1. That should still take some time, but it should be more reasonable.
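For reference, the filtering rule just described (at least 40 individuals genotyped, MAF above 0.1) can be sketched on the GT fields of a single biallelic VCF record like this. This is a simplified illustration, not NGSEP's own implementation:

```python
# Sketch: apply the population filters discussed above to the genotype
# (GT) fields of one biallelic record. Missing genotypes ("./.") do not
# count toward the genotyped total or the allele frequencies.

def passes_filter(genotypes, min_genotyped=40, min_maf=0.1):
    """genotypes: list of GT strings such as '0/1', '1|1', './.'"""
    allele_counts = {"0": 0, "1": 0}
    genotyped = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing genotype
        genotyped += 1
        for a in alleles:
            allele_counts[a] += 1
    if genotyped < min_genotyped:
        return False
    total = sum(allele_counts.values())
    maf = min(allele_counts.values()) / total
    return maf > min_maf
```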
Let me know how things go
I ran the Multiple Variant Detector on 12 BAM files and it took about 11 hours using the settings you suggested. I also increased the available RAM from 7 to 64 GB. Not certain how memory hungry this can be, but I have 128 available. If I start 3 sets of 12 at about the same time, will they each use a different core? I have 12 available.
Hi Douglas
Great, that is good news. I do not think the process should take 64 GB, but it may take more than 7 GB. If you have 128 GB available, set the maximum to about 120 GB.
Different MultisampleVariantsDetector processes should run in parallel, so you could have the 4 VCF files in about 12 hours. Then you need to merge the VCF files.
Let me know how things go
Update: I upped the RAM to -Xmx64G, started three sets of 12 samples, and in 9 hours two of the sets had finished; now at 10 hours the last set of 12 is just past 80% of the genome. So adding more RAM really helped.
Hi Douglas
I am glad to hear this. The memory consumed by the process is directly related to the read depth of the samples, either global or local. You could run further experiments to adjust the amount of memory needed to process batches of different numbers of samples.
To merge the VCFs obtained from the multisample variants detector you can use the function "Variant Files Merge" over the folder where the VCF files are located. After you select the VCFs, please make sure to use the button "Merge Genotype Calls" to retain the genotype calls for each sample.
Finished the subgroup variant detection, then merged the VCFs retaining genotype calls, then started the Variant Filter step and immediately got a progress bar that almost went to the end; the log file has a bunch of repeating SEVERE errors. There were 48 BAM files that were detected in subgroups of 12 (6 each case and control), then the 4 subgroup files were merged into group1.vcf.
Oct 24, 2021 6:37:46 AM ngsep.vcf.VCFFilter logParameters
INFO: Input file: G:\IBV2021\group1\group1.vcf
Output file: G:\IBV2021\group1\group1_filter.vcf
Genotype filters
Minimum genotype quality: 20
Minimum read depth: 4
Variant context filters
Population data filters
Minimum samples genotyped: 20
Keep only biallelic SNVs
Minimum minor allele frequency (MAF): 0.05
Oct 24, 2021 6:37:53 AM ngsep.vcf.VCFFileReader loadVCFRecord
SEVERE: Can not load genomic variant at NC_052534.1:15747365. Number of genotyped samples does not coincide with number of samples in the header
Oct 24, 2021 6:37:53 AM ngsep.vcf.VCFFileReader loadVCFRecord
SEVERE: Can not load genomic variant at NC_052532.1:8180343. Number of genotyped samples does not coincide with number of samples in the header
Oct 24, 2021 6:37:53 AM ngsep.vcf.VCFFileReader loadVCFRecord
SEVERE: Can not load genomic variant at NC_052532.1:8180430. Number of genotyped samples does not coincide with number of samples in the header
Oct 24, 2021 6:37:53 AM ngsep.vcf.VCFFileReader loadVCFRecord
SEVERE: Can not load genomic variant at NC_052532.1:8180473. Number of genotyped samples does not coincide with number of samples in the header
I get the same errors in the log file when I run Summary Stats on this VCF. Neither the Filter nor Summary Stats seem to be producing anything in their output.
Seems that there is no genotype data in the merged vcf??
There is genotype data in the subgroup VCFs, but it is all gone after I merge the four subgroups using Merge Genotype Calls, so I am going to try merging them with just Merge Variants.
The genotype data all seems to go away whichever way I merge the subgroup VCFs. I did the merge with variant calls and there is no genotype data in the merged VCF. Seems I have to go back and do the Multiple Variants Detector on all 48 BAM files at the same time if I want the MAF data preserved.
Hi Douglas
The problem definitely happens during the merging process. If possible, please share with me a snapshot of the Merge variants screen before clicking the button, and let me know which button you are clicking. Please also send me the log of the process. In the meantime, it is possible that with the improvement in memory the multisample variants detector will process the 48 samples faster.
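As a quick standalone diagnostic for the SEVERE error above (sample count in a record not matching the header), the merged VCF can be scanned by comparing the per-sample columns of each record against the samples declared in the #CHROM line. A minimal sketch:

```python
# Sketch: flag VCF records whose number of per-sample columns differs
# from the number of samples declared in the #CHROM header line
# (mirrors the "Number of genotyped samples does not coincide" error).

def find_mismatched_records(vcf_lines):
    """Yield (CHROM, POS) for records with an unexpected sample count."""
    n_samples = None
    for line in vcf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            n_samples = len(fields) - 9  # 9 fixed columns: CHROM..FORMAT
        elif not line.startswith("#") and n_samples is not None:
            if len(fields) - 9 != n_samples:
                yield fields[0], fields[1]
```

Running this over the merged group1.vcf would show whether the mismatches are confined to a few positions or affect the whole file.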
I started the Multiple Variants Detector on all 48 bam.sort files and have been monitoring the logs. I am running with 64 GB RAM and the settings you recommended. Estimated completion is in 60 hours. I attached NGSEPScreenShot.jpg for the Merge Variants Files call, and tried both buttons (Merge variants and Merge genotype calls). In both cases there were no genotypes in the merged VCF, based on VCF summary statistics. I attached the logs for Merge variants (group1mergevariants_Merge.log) and Merge genotype calls (group1_Merge.log). I may need to include the logs in separate posts; I tried to attach 3 files and something crashed.
I can't seem to upload the log files. For Merge variants (group1mergevariants_Merge.log), the upload choked on the 491 MB file.
Merge genotype calls (group1_Merge.log) is 234 MB, and the upload choked with this message:
This site can't be reached. sourceforge.net unexpectedly closed the connection.
Try:
Checking the connection
Checking the proxy and the firewall
Running Windows Network Diagnostics
ERR_CONNECTION_CLOSED