From: David N. <Dav...@hc...> - 2012-07-23 22:38:43
Hello Alex,

Good question! No, I'm pretty sure it is not a bug. There was a change back in December in how the log2 ratio is calculated, to attempt to control for variance outliers. The DESeq method is too severe in this regard, so we typically don't use its variance-filtered p-values anymore and have attempted a compromise by modifying how the log2 ratio is calculated. (DESeq will throw out variance outliers no matter where they occur, so if your treatment replica counts for a particular gene look like 20, 30, 43 and your control replicas like 1400, 1300, 8000, the gene will be tossed.)

So how does one estimate the fold difference in expression between two genes? The simple answer is to correct each replica for total counts, sum them together, add one to avoid dividing by zero, and calculate a log2((sumT+1)/(sumC+1)) ratio. This is a simple, straightforward approach and is comparable to your FPKM and the old Log2 approach. The problem is that it exposes you to situations where one of the replicas is quite different than the others in a particular condition. Its extreme value can drive the entire summary statistic.

Another approach is to calculate the median (we use a close variant called the pseudo median) of the total-count-corrected replica counts for each condition, then calculate a Log2((pseT+1)/(pseC+1)) ratio. This works well for datasets with 4 or more replicas in each condition. OK with 3. But what to do with 2 (or 3)? This is tricky. The pseudoMedian method, with few replicas, is like the old sum method: it can be unduly influenced by a variance outlier.

So yet another approach is to correct for total counts, calculate all single-replica paired log2 ratios, and report the smallest log2 ratio. Conservative, but not as severe as the DESeq variance-filtered p-value approach. You can use the -p flag in ODRSS to force the pseudoMedian approach; see the help menu for ODRSS.

So does that solve the problem? No!
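The three log2 ratio strategies described above can be compared side by side with a small Python sketch. This is only an illustration and not USeq's actual code: the function names are hypothetical, the pseudo median is assumed to be the Hodges-Lehmann estimator (the median of all pairwise averages), which may differ from USeq's variant, and "smallest log2 ratio" is assumed to mean the ratio closest to zero across every treatment/control replica pair.

```python
from itertools import combinations_with_replacement, product
from math import log2
from statistics import median


def scale_to_library(counts, library_sizes, target=1_000_000):
    """Correct each replica's count for its total count (hypothetical helper).

    Rescales every count to what it would be if the replica had `target` reads.
    """
    return [c * target / n for c, n in zip(counts, library_sizes)]


def log2_ratio_sum(treatment, control):
    """Sum the corrected replica counts, add one, and take the log2 ratio."""
    return log2((sum(treatment) + 1) / (sum(control) + 1))


def pseudo_median(values):
    """Hodges-Lehmann pseudo median: median of all pairwise averages.

    Assumption: USeq's pseudo median may be computed differently.
    """
    averages = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(averages)


def log2_ratio_pseudo_median(treatment, control):
    """Log2 ratio of the per-condition pseudo medians, each plus one."""
    return log2((pseudo_median(treatment) + 1) / (pseudo_median(control) + 1))


def log2_ratio_smallest_pair(treatment, control):
    """Log2 ratio closest to zero over all treatment/control replica pairs.

    Assumption: "smallest" is read as smallest in magnitude.
    """
    ratios = [log2((t + 1) / (c + 1)) for t, c in product(treatment, control)]
    return min(ratios, key=abs)


if __name__ == "__main__":
    # Replica counts from the example above, treated as already corrected.
    treatment = [20, 30, 43]
    control = [1400, 1300, 8000]
    print("sum:           ", log2_ratio_sum(treatment, control))
    print("pseudo median: ", log2_ratio_pseudo_median(treatment, control))
    print("smallest pair: ", log2_ratio_smallest_pair(treatment, control))
```

With those example counts, the control outlier of 8000 pulls the sum-based ratio furthest from zero, the pseudo median ratio is somewhat less extreme, and the smallest-pair ratio is the most conservative of the three.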
Yet another option is to make use of DESeq's variance-adjusted count values in place of the total-count-adjusted counts. This, in theory, should be a better estimation of the expression difference between two genes and is the approach taken in the latest USeq app, called DefinedRegionDifferentialSeq. (This is a merge of ODRSS and MultipleConditionRNASeq, ~3x faster than ODRSS, with a more sensitive alternative splice estimation. ODRSS and MultipleConditionRNASeq are going to be deprecated.)

Remember, you've got the FPKM measurements for all the genes, so you can always calculate the straightforward log2((tFPKM+1)/(cFPKM+1)) in Excel; just watch out for variance outliers.

Does that help?

-cheers,
David

From: Alexander Williams <ale...@gl...>
Date: Mon, 23 Jul 2012 14:15:19 -0700
To: David Nix <dav...@hc...>
Cc: Alisha Holloway <ali...@gl...>, Benoit Bruneau <bbr...@gl...>, Paul Delgado <pde...@gl...>
Subject: Possible bug in "log2ratio" column in Useq 8.3 -- it is very different from 8.0, and I can't make sense of the values

Hi David,

1) In USeq 8.0.x, a column named "Log2((sumT+1)/(sumC+1))" was always very similar to log2(tFPKM / cFPKM), and seemed intuitively obvious.

2) In USeq 8.3.x, this column is now "Log2Ratio," which is totally different and doesn't appear to correspond in an obvious fashion to tFPKM / cFPKM (although it is still correlated with it).

3) The USeq docs that I've been able to find still refer to the old name, and just say "Log2((sumT+1)/(sumC+1)) - normalized log2 ratio." (http://useq.sourceforge.net/outputFileTypeDescriptions.html)

4) This is the result of a run of OverdispersedRegionScanSeqs.

5) Do you know if this is expected behavior, or if this seems like a bug?

I've attached 3 images showing the old USeq output, the new USeq output, and the correlation of the new output to my calculated-in-Excel output (=LOG(tFPKM/cFPKM, 2)).
Alex