Re: [Useq-users] Important USeq_8.7.0 update for bisulfite data analysis


Good point Sue.

See the novoalign docs too.  The -u setting varies by species.  Note, it is worthwhile piloting which settings are appropriate, e.g. -b2 or -b4.  The additional compute time is considerable with -b4 and isn't needed for standard Illumina libraries run on the HiSeq or MiSeq.  It's these amplicon/PCR-derived libraries that need special processing.

-cheers, D

On Nov 20, 2013, at 11:33 AM, Sue Hammoud <Sue...@hc...<mailto:Sue...@hc...>>
 wrote:

Hi Everyone,

Also, when aligning MiSeq data make sure you use -u12 in addition to -b4.

Best

Sue

From: David Nix <Dav...@hc...<mailto:Dav...@hc...>>
Date: Wednesday, November 20, 2013 11:30 AM
To: Sue Hammoud <sue...@hc...<mailto:sue...@hc...>>, Magdalena Potok <Mag...@hc...<mailto:Mag...@hc...>>, Brad Cairns <Bra...@hc...<mailto:Bra...@hc...>>, Bushra Gorsi <bg...@ge...<mailto:bg...@ge...>>, Somaye Dehghanizadeh <Som...@hc...<mailto:Som...@hc...>>
Cc: "bio...@ut...<mailto:bio...@ut...>" <bio...@ut...<mailto:bio...@ut...>>, USeq <use...@li...<mailto:use...@li...>>
Subject: Important USeq_8.7.0 update for bisulfite data analysis

Hello Folks,

I've posted a new USeq release that contains a critical patch for bisulfite sequencing analysis derived from amplicon-based library preps.  These need to be aligned with the -b 4 setting to maximize the number of aligned reads and then processed through the new NovoalignBisulfiteParser 8.7.0 (see http://useq.sourceforge.net/usageBisSeq.html).  The old NBP won't correctly parse these new alignment types.  I have also corrected an issue with the merging of paired overlapping alignments.  Some of these were being skipped, and overlapping paired reads were being double counted.  Note the latter fix changes the counts but not the percents of methylation.  See below for an illustration on an Hg19 sperm dataset.

So moving forward, pay very close attention to amplicon -b4 aligned and parsed datasets.  Visually inspect the base fraction methylation and the converted and non-converted point datasets alongside the duplicate-filtered bam alignment tracks in IGB.  Note any discrepancies.  This is a novel, complex datatype that warrants close scrutiny.

-cheers, David

#### New NovoalignBisulfiteParser Stats
Filtering statistics for 174477814 alignments:
5621028 Failed mapping quality score (13.0)
2192225 Failed alignment score (300.0)
1331036 Aligned to phiX or adapters
0       Failed vendor QC
55078179        Are unmapped

110255346       Passed filters (63.2%)
88083944        Total non-converted Cs sequenced
1504140872      Total converted Cs sequenced
0.055   Fraction non converted C's.

0.998   Fraction bp passing quality (13)

2164389480      BPs overlapping paired sequence
10644938422     BPs paired sequence
0.203   Fraction overlapping bps from paired reads.

#### Old NovoalignBisulfiteParser Stats
Filtering statistics for 174477814 alignments:
5621028 Failed mapping quality score (13.0)
2192225 Failed alignment score (300.0)
1331036 Aligned to phiX or adapters
0       Failed vendor QC
55078179        Are unmapped

110255346       Passed filters (63.2%)
94996528        Total non-converted Cs sequenced
1640576210      Total converted Cs sequenced
0.055   Fraction non converted C's.

0.998   Fraction bp passing quality (13)

2164389480      BPs overlapping paired sequence
10644938422     BPs paired sequence
0.203   Fraction overlapping bps from paired reads.
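
To make the "counts change, percents don't" point concrete, here is a minimal Python sketch that recomputes the summary fractions from the two NovoalignBisulfiteParser reports above. The numbers are copied from those reports; the dictionary keys, the helper name, and the assumption that the fraction is non-converted Cs over all Cs sequenced are mine and are not taken from the USeq code.

```python
# Counts copied from the new and old NovoalignBisulfiteParser summaries.
new_nbp = {"non_converted": 88083944, "converted": 1504140872}
old_nbp = {"non_converted": 94996528, "converted": 1640576210}


def fraction_non_converted(counts):
    """Non-converted Cs over all Cs sequenced (assumed definition)."""
    total = counts["non_converted"] + counts["converted"]
    return counts["non_converted"] / total


# The filter tallies are identical in both summaries.
total_alignments = 174477814
failed = [5621028, 2192225, 1331036, 0, 55078179]
passed = total_alignments - sum(failed)

for label, counts in (("new", new_nbp), ("old", old_nbp)):
    frac = round(fraction_non_converted(counts), 3)
    print(label, counts["non_converted"], frac)
print("passed filters:", passed, f"({passed / total_alignments:.1%})")
```

Both versions round to 0.055 even though the new parser reports fewer converted and non-converted Cs, and the five filter tallies account exactly for the difference between total and passing alignments.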

#### New NBP run through BisStat
Using Lambda data to set the expected fraction non-converted Cs to 0.00123 (16257/(16257+13221619))
Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable.
        0.978   (12631645/12919931)     mCG/mC
        0.007   (89908/12919931)        mCHG/mC
        0.015   (198378/12919931)       mCHH/mC

Stats based on cumulative sums of all read sequences, no thresholds:
        0.968   (85257586/88067661)     mCG/mC
        0.010   (875639/88067661)       mCHG/mC
        0.022   (1934436/88067661)      mCHH/mC

        0.059   (88067661/1490918281)   mC/C
        3.809   (85257586/22383883)     mCG/CG
        0.002   (875639/459687208)      mCHG/CHG
        0.002   (1934436/1008847190)    mCHH/CHH

        0.056   (88067661/1578985942)   mC/(C+mC)
        0.792   (85257586/107641469)    mCG/(CG+mCG)
        0.002   (875639/460562847)      mCHG/(CHG+mCHG)
        0.002   (1934436/1010781626)    mCHH/(CHH+mCHH)

#### Old NBP run through BisStat
Using Lambda data to set the expected fraction non-converted Cs to 0.00120 (15986/(15986+13272652))
Stats based on aligned genomic contexts that meet a minimum FDR threshold of 20.0. WARNING: datasets must be subsampled to the same bp aligned for these stats to be cross dataset comparable.
        0.966   (14109977/14599772)     mCG/mC
        0.010   (147396/14599772)       mCHG/mC
        0.023   (342399/14599772)       mCHH/mC

Stats based on cumulative sums of all read sequences, no thresholds:
        0.969   (92009703/94980515)     mCG/mC
        0.010   (935216/94980515)       mCHG/mC
        0.021   (2035596/94980515)      mCHH/mC

        0.058   (94980515/1627302558)   mC/C
        3.842   (92009703/23946641)     mCG/CG
        0.002   (935216/497817018)      mCHG/CHG
        0.002   (2035596/1105538899)    mCHH/CHH

        0.055   (94980515/1722283073)   mC/(C+mC)
        0.793   (92009703/115956344)    mCG/(CG+mCG)
        0.002   (935216/498752234)      mCHG/(CHG+mCHG)
        0.002   (2035596/1107574495)    mCHH/(CHH+mCHH)
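
For anyone comparing the two BisStat reports by hand, the following Python sketch recomputes a few of the ratios from the counts printed above. It is only an illustration: the variable names are hypothetical, and reading "CG" as the converted (unmethylated) CG count is my inference from the fact that CG + mCG matches the printed denominators.

```python
# Cumulative-sum counts copied from the BisStat reports (no thresholds).
bisstat = {
    "new": {"mCG": 85257586, "CG": 22383883, "mC": 88067661},
    "old": {"mCG": 92009703, "CG": 23946641, "mC": 94980515},
}

ratios = {}
for label, c in bisstat.items():
    ratios[label] = {
        # Share of all methylated Cs that sit in a CG context.
        "mCG/mC": round(c["mCG"] / c["mC"], 3),
        # Methylated over converted CG observations; can exceed 1.
        "mCG/CG": round(c["mCG"] / c["CG"], 3),
        # Fraction of CG observations that are methylated.
        "mCG/(CG+mCG)": round(c["mCG"] / (c["CG"] + c["mCG"]), 3),
    }
    print(label, ratios[label])

# Lambda control: expected fraction of non-converted Cs in the new run.
lambda_non_converted, lambda_converted = 16257, 13221619
lambda_fraction = lambda_non_converted / (lambda_non_converted + lambda_converted)
print("lambda non-converted fraction:", round(lambda_fraction, 5))
```

The CG methylation fraction moves only from 0.793 to 0.792 between the old and new parser, while the underlying counts drop by several million. This also shows why mCG/CG is near 3.8 rather than below 1: it is a ratio of methylated to converted observations, not a fraction of the total.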