From: Alec W. <al...@br...> - 2013-06-11 17:16:00
Hi Thuy,

The code is written the way it is to make it faster for the default regexp. If you want to use the default regexp, why pass it on the command line rather than using the default value?

Could you provide a concrete example of the problem you are experiencing? That is, the regexp you want to use, and some examples of read names that you expect to match the regexp but do not, or vice versa?

Thanks,
Alec

On Jun 10, 2013, at 6:05 PM, Thuy Linh Chu <thu...@ya...> wrote:

> There seem to be differences in how MarkDuplicates handles the default readId regex. According to the documentation, the default regex is:
>
> READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*
>
> However, if your readIds do not match this pattern, you will get different optical duplicate results when passing in the default regex vs. not passing one.
>
> I looked at the source code, and it seems that if you don’t specify a regex, it assumes the default, and the code splits the readIds instead of matching them, so it ends up using values in fixed locations. These are incorrect values for tile/X/Y and will produce a wrong optical duplicate count. If I pass in a regex identical to the default, it goes through a different code path, which performs a pattern match. The result is no match, which produces 0 optical duplicates.
>
> I'd suggest changing the default (no regex provided) code to perform the same pattern matching to prevent this bug.
>
> While working with this I found another bug in how MarkDuplicates handles tiles, which I will file in another email.
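
For reference, the difference between the two code paths described in the quoted message can be illustrated with a small sketch. This is not the Picard source: the function names, the fixed field positions used by the split path (third, fourth and fifth colon-separated fields), the whole-name match, and both read names are assumptions made only for illustration.

```python
import re

# Default pattern as quoted from the documentation in the message above.
DEFAULT_REGEX = re.compile(r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*")


def by_regex(read_name):
    """Hypothetical 'regex passed explicitly' path: match the whole name."""
    m = DEFAULT_REGEX.fullmatch(read_name)
    if m is None:
        return None  # no match, so no tile/x/y is available
    return tuple(int(g) for g in m.groups())


def by_split(read_name):
    """Hypothetical 'no regex given' fast path: split on ':' and read
    fixed positions (assumed here to be fields 2, 3 and 4)."""
    fields = read_name.split(":")
    return int(fields[2]), int(fields[3]), int(fields[4])


# A made-up name that fits the default pattern: both paths agree.
ok_name = "MACHINE1:5:12:1000:2000"
print(by_regex(ok_name), by_split(ok_name))

# A made-up name with an extra leading field: the regex path reports no
# match, while the split path still returns numbers from the fixed slots.
odd_name = "RUN7:12:3:1101:1234:5678"
print(by_regex(odd_name), by_split(odd_name))
```

Under these assumptions, the second name yields no match on the regex path but a plausible-looking tile/x/y triple on the split path, which is the kind of divergence the quoted message reports.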