How NGSEP works: some questions

2017-02-21
  • Eric González

    Eric González - 2017-02-21

    Dear all,

    I want to use NGSEP and I have the following questions:

    1.- I already have BAM files processed as the GATK team recommends in
    their best practices. Do you think it is a good idea to use these files
    directly in NGSEP to call SNPs? My alignments were done using bwa-mem.

    2.- Why did you choose bowtie2 over bwa-mem?

    3.- Can I run the SNP calling on two different sets of individuals
    sequenced with two different GBS approaches (SE and PE) separately and
    then merge the data?

    4.- How much memory and how many CPUs are required to carry out all the
    processes, from deconvolution and mapping to SNP calling and filtering,
    for around 100 samples?

    5.- It is still not clear to me how the SNPs are called. I understand
    that Hardy-Weinberg equilibrium is not taken into account, which is good
    for data such as mapping populations (the data I have), but how are the
    SNPs called? Is there an advantage to calling the SNPs together, as in
    GATK joint calling, or are the SNPs called independently for each
    individual?

    Best Wishes,

    --
    Eric

     
  • Jorge Duitama

    Jorge Duitama - 2017-02-23

    Hi Eric

    First of all, many thanks for your interest in NGSEP. Here are the answers to your questions:

    1. You can run NGSEP on BAM files produced by bwa-mem, after running either some or all of the stages of the GATK best practices. In the NAR paper we used the BAM files produced directly by bwa to make the comparisons as fair as possible. The only important issue to take into account is that you need to use the option "-ignoreXS" to ignore the "XS" optional field while processing the alignments. We use this field when processing bowtie2 alignments because it helps us clearly separate which reads are uniquely mapped and which are mapped to multiple locations.
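    For a bwa-mem BAM that would look roughly like the sketch below. Only the "-ignoreXS" flag comes from the discussion above; the FindVariants command name, argument order, and file names are assumptions, so please check the usage message of your NGSEP version before copying this.

    ```shell
    # Hypothetical sketch: variant discovery on a bwa-mem BAM with NGSEP.
    # "-ignoreXS" tells NGSEP to skip the XS optional field, which bwa-mem
    # fills with a different meaning than bowtie2 does.
    java -jar NGSEPcore.jar FindVariants \
        -ignoreXS \
        reference.fa sample1_bwa.bam sample1
    ```
    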

    2. The main reason we prefer bowtie2 in our analyses is that it can report as many alignments as we want for each read, whereas bwa forces picking a single random alignment when there are multiple valid alignments, at least as of the last time we checked the bwa options last year. In any case, bowtie2 behaves by default in almost the same way as bwa. We discussed in the NAR paper the consequences for read-depth analysis of different choices of the number of alignments to consider. Again, NGSEP runs fine on alignments produced by either tool, or in fact on any well-formed BAM file.
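    For example, bowtie2's "-k" option controls how many valid alignments are reported per read (a generic sketch; the index and file names here are placeholders, not from the thread):

    ```shell
    # Report up to 3 valid alignments per read instead of only one;
    # ref_index is a bowtie2 index built with bowtie2-build.
    bowtie2 -k 3 -x ref_index -U sample1.fastq.gz -S sample1.sam
    ```
    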

    3. Yes. You can discover SNPs sample by sample using different parameters for single-end and paired-end data, run the "MergeVariants" command on all VCFs to get a complete catalog of variants, then genotype again sample by sample, and finally run the "MergeVCF" command to obtain the complete catalog of variants genotyped in all your samples. For example, we often merge samples with GBS and WGS data. The only caveat is that if you made the single-end GBS and the paired-end GBS libraries with different enzymes, you may have a very small intersection between the variants that you can discover from the two datasets. Unfortunately this is independent of the SNP calling pipeline you choose.
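    The four steps above could be scripted roughly as follows. "MergeVariants" and "MergeVCF" are the commands named above; the FindVariants command name, the "-knownVariants" flag, the sequence-names file, and all file names are assumptions for illustration, so verify them against your NGSEP version's usage message.

    ```shell
    # 1) Discover variants sample by sample (run SE and PE sets separately,
    #    each with its own parameters).
    for BAM in sample*.bam; do
        java -jar NGSEPcore.jar FindVariants reference.fa "$BAM" "${BAM%.bam}"
    done

    # 2) Merge the per-sample VCFs into one catalog of variant sites.
    #    seqnames.txt lists the reference sequence names (assumed format).
    java -jar NGSEPcore.jar MergeVariants seqnames.txt catalog.vcf sample*.vcf

    # 3) Genotype every sample again at all sites in the catalog
    #    (flag name for passing known variants is assumed).
    for BAM in sample*.bam; do
        java -jar NGSEPcore.jar FindVariants -knownVariants catalog.vcf \
            reference.fa "$BAM" "${BAM%.bam}_gt"
    done

    # 4) Merge the genotyped VCFs into the final population VCF.
    java -jar NGSEPcore.jar MergeVCF seqnames.txt sample*_gt.vcf > population.vcf
    ```
    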

    4. I cannot give you exact numbers because that heavily depends on the genome size and on the average read depth per sample. However, I can tell you that we have been able to process GBS data from populations of more than 100 samples in 2 to 3 days using only a laptop with 4 cores and 8 GB of RAM. Of course, with more processors and memory the total processing time can be reduced to a few hours.

    5. The initial SNP calling algorithm implemented in NGSEP is a more or less standard Bayesian algorithm explained in this publication (http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-S2-S6). We have made several improvements over the years, of which the two most important are our own implementation of indel realignment, which does not produce extra BAM files, and the option to include annotated short tandem repeats, which prevents false SNP calls due to misalignments within these regions. We have some preliminary data showing that our small indel calls are now very good, so we are running the formal benchmarks against GATK, samtools, FreeBayes and the other variant calling tools. Because we do not assume any particular population model, SNPs are discovered sample by sample, and we then have a two-stage merging process to first obtain the population variants and later assemble the database of genotypes for those variants.
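    As a rough sketch of the Bayesian idea (the exact model and priors are in the linked paper; the notation here is mine), each sample's genotype at a site is chosen by the posterior computed from that sample's own reads:

    ```latex
    % D = observed read bases at a site, G = candidate genotype.
    % P(D | G) is a product over reads, weighting each base call by its
    % sequencing error probability; the genotype maximizing the posterior
    % is reported for that sample.
    P(G \mid D) = \frac{P(D \mid G)\, P(G)}{\sum_{G'} P(D \mid G')\, P(G')}
    ```

    Because this posterior involves only one sample's reads, no cross-sample population model (such as Hardy-Weinberg equilibrium) is assumed, which matches the two-stage merging described above.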

    Please let me know if you have any further questions on any of these topics. It would also be great for us if you shared your experience using NGSEP.

    Best regards

    Jorge

     
