Re: [Samtools-devel] MarkDuplicates default READ_NAME_REGEX bug

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

OK, I added that to the usage message.

Thanks, Alec

On Aug 27, 2013, at 9:28 AM, Olof Karlberg <olo...@me...> wrote:

> Hi Alec,
> 
> Thanks, that should hopefully save at least some frustration and confusion.
> 
> You are right about Matcher.matches. Maybe that should also be clarified in the docs because it is not obvious in the current version that the pattern is required to match the entire region as it just states that it should contain three capture groups.
> 
> Olof
> 
> On 2013-08-27 2:30 PM, Alec Wysoker wrote:
>> Hi Olof,
>> 
>> I made the change so that the usage message explains the optimization used instead of the regex.  
>> 
>> I believe that it is not the case that the actual regex would apply for the new format read name.  Matcher.matches attempts to match the entire region against the pattern.
>> 
>> -Alec
>> 
>> On Aug 26, 2013, at 3:38 PM, Olof Karlberg <olo...@me...> wrote:
>> 
>>> Hi,
>>> 
>>> This bug was supposedly fixed by adding a warning if the default regexp does not match the read ID. However, I believe the current solution needs rethinking as it will fail to catch what I think is a very common situation.
>>> 
>>> At some point (my guess is from CASAVA 1.8) two extra fields were added to the read name from Illumina. These are instrument name/id and run number and are prepended to the old read name.
>>> 
>>> To illustrate the problem with an example:
>>> 
>>> Old format
>>> H0HVEADXX:1:2109:16286:10354
>>> 
>>> New format
>>> HISEQ:106:H0HVEADXX:1:2109:16286:10354
>>> 
>>> Default regexp
>>> [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*
>>> uses three last matches in parentheses as tile, X, Y
>>> 
>>> Heuristic approach in the code
>>> Split on ':', use index 2-4 for tile, X, Y
>>> 
>>> The default regexp would still work if it really was applied as it will match correctly at the end of the ID and is not anchored to the beginning of the string. The heuristic though, will not work as it will return the wrong fields (H0HVEADXX:1:2109 instead of 2109:16286:10354), meaning that all duplicates present on the same tile are called as optical duplicates. 
>>> 
>>> The code that should trigger a warning will not be executed as we still get > 4 elements from the split with the new format:
>>> 
>>>        if (READ_NAME_REGEX == DEFAULT_READ_NAME_REGEX) {
>>>             final int fields = StringUtil.split(readName, tmpLocationFields, ':');
>>>             if (fields < 5) {
>>>                 if (!warnedAboutRegexNotMatching) {
>>>                     LOG.warn(String.format("Default READ_NAME_REGEX '%s' did not match read name '%s'.  " +
>>>                             "You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates.  " +
>>>                             "Note that this message will not be emitted again even if other read names do not match the regex.",
>>>                             READ_NAME_REGEX, readName));
>>>                     warnedAboutRegexNotMatching = true;
>>>                 }
>>>                 return false;
>>>             }
>>> 
>>> 
>>> I have two suggestions on how to fix this. The primary is to add something about this in the docs, like "The default is to not use a regexp, but a hard coded split on ':' and extract elements 3-5 (2-4, zero based) as tile, X and Y respectively." 
>>> 
>>> Secondly, as Thuy suggested before, the first time the actual regexp should be tested against the name to see if it matches, otherwise give a warning. The current test is not sufficient. Compiling the regexp once and matching the first read ID should not cause any major slowdown.
>>> 
>>> Additionally, it would be nice if it was possible to add either an option to change which fields are extracted from the split on ':', or if some auto-detect function could be added to discriminate between 5 and 7 field IDs. This way, the speed up gained by the hard coded approach could be acquired also for reads with the longer read name format.
>>> 
>>> The fact that the default pattern correctly matches the new read ID format, while the code really does something completely different from what is stated in the docs, causes some major confusion here.
>>> 
>>> 
>>> Olof
>>> 
>>>> 
>>>> Hi Thuy,
>>>> 
>>>> I've added limited validation and warning for the default regex.
>>>> 
>>>> -Alec
>>>> 
>>>> On Jun 11, 2013, at 4:34 PM, Thuy Linh Chu <thuylinh_chu@...> wrote:
>>>> 
>>>> > 
>>>> > 
>>>> > Thanks for adding the warning.  However, I would suggest performing the check on default (non-specified) regex as well.  The check needs to be done only once and should not affect overall performance.  This will help us catch the problem of unmatched read ID format earlier/easier.  We would have not caught this problem if we didn't have manually created test sets to performs checks at various levels of aggregation.  This is not the normal case for most actual flow cells sequencing.  Thanks.
>>>> > 
>>>> > Thuy  
>>>> > 
>>>> > 
>>>> > ----- Original Message -----
>>>> > From: Alec Wysoker <alecw@...>
>>>> > To: Thuy Linh Chu <thuylinh_chu@...>
>>>> > Cc: samtools-devel@...
>>>> > Sent: Tuesday, June 11, 2013 12:44 PM
>>>> > Subject: Re: [Samtools-devel] MarkDuplicates default READ_NAME_REGEX bug
>>>> > 
>>>> > Hi Thuy,
>>>> > 
>>>> > I've added code so that if a non-default READ_NAME_REGEX does not match a read name, it will emit a single warning.
>>>> > 
>>>> > -Alec
>>>> > 
>>>> > On Jun 10, 2013, at 6:05 PM, Thuy Linh Chu <thuylinh_chu@...> wrote:
>>>> > 
>>>> >> 
>>>> >> There seems to be differences in how MarkDuplicates handles default readId regex.  According to documentation, the default regex is:
>>>> >> 
>>>> >> READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*
>>>> >> 
>>>> >> However if your readIds do not match this pattern, you will get different results for optical duplicate to be different when passing in default regex vs. no passing.  
>>>> >> 
>>>> >> I looked at the source code and it seems ifyou don’t specify a regex, it assumes a default and this code does a split of the readIds instead of a match and ends up using values in fixed locations.  These values are incorrect values for tile/X/Y and will produce wrong optical duplicates count.  And if I pass in a regex the same as the default, it goes through a different code path.  This code path performs a pattern match.  The result is a no match and produces 0 optical duplicates.
>>>> >> 
>>>> >> I'd suggest changing the default (no regex provided) code to perform the same pattern matching to prevent this bug.
>>>> >> 
>>>> >> Working with this I found another bug in how MarkDuplicates handle tiles which I will file in another email.
>>>> >> 
>>>> >> 
>>>> >> 
>>>> >> ------------------------------------------------------------------------------
>>>> >> This SF.net email is sponsored by Windows:
>>>> >> 
>>>> >> Build for Windows Store.
>>>> >> 
>>>> >> http://p.sf.net/sfu/windows-dev2dev
>>>> >> _______________________________________________
>>>> >> Samtools-devel mailing list
>>>> >> Samtools-devel@...
>>>> >> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>> 
>>>> 
>>>> 
>>> 
>>> ------------------------------------------------------------------------------
>>> Introducing Performance Central, a new site from SourceForge and 
>>> AppDynamics. Performance Central is your source for news, insights, 
>>> analysis and resources for efficient Application Performance Management. 
>>> Visit us today!
>>> http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk_______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>> 
>