From: SourceForge.net <no...@so...> - 2013-03-05 20:00:43
|
Bugs item #3606683, was opened at 2013-03-03 09:34 Message generated for change (Comment added) made by tgl You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3606683&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 43. Regexp Group: current: 8.6.0 Status: Open Resolution: None Priority: 5 Private: No Submitted By: Tom Lane (tgl) Assigned to: Don Porter (dgp) Summary: fix for 3604074 does not work Initial Comment: I couldn't find any way to reopen or comment on the already-closed bug 3604074, so here is a new bug report to say that the fix for that one doesn't work: if you build with the asserts in Spencer's code enabled, you'll find that the Tcl regression tests draw an assertion failure, because fixempties() fails to eliminate all the EMPTY arcs. The proximate cause of that is a small oversight in the patch: the first sub-loop in fixempties supposes that it doesn't need to do the "nexta = a->outchain" dance, but that's wrong because unempty is still capable of deleting the arc immediately, if the arc is a "vacuous loop". However, even with that repaired I don't find the fix implemented for 3604074 to be very trustworthy. The patch will cope if expansion of a later out-arc for the current state tries to re-add an empty out-arc *to that state* that was already processed. However, it doesn't detect the case where we re-add an empty out-arc to a previously processed state. Nor is it obvious that a removed empty out-arc can't be regenerated in a later pass (which after all was the failure mode exhibited in 3604074). I couldn't find a test case exhibiting an infinite loop, but I don't have a good feeling about that. I am also suspicious that the new code is outright wrong, because it seems to leave out some arguably-necessary arcs when I compare its final NFA state to the rewritten version attached. I don't have a test case to prove that either, but I'm suspicious that the timing of elimination of empty arcs might not be quite right now. After some thought and experimentation I devised a new algorithm that clearly terminates, because it doesn't have any "repeat until done" flavor to it. As a bonus, it seems to be faster than the old one, and it's clearer that it creates all the required replacement arcs. A few miscellaneous notes: * It was alleged in the comments to 3604074 that Postgres had fixed this bug, but actually the reason it doesn't show up for us is only that in http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=173e29aa5deefd9e71c183583ba37805c8102a72 we had gotten rid of the expansion of "a*" to "a+|", and it's the empty alternative that's tickling this bug. If you try select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)'; it still fails in released Postgres versions. * The reason I found out that your patch is broken is that as Postgres has things set up, the assert()s in Spencer's code are enabled in standard development builds. This evidently isn't the case in Tcl, but I suggest you might want to rethink that. * While fooling with the fixempties algorithm, I found that it was very easy to make CVE-2007-4772 (the infinite loop in pullback() for "($|^)*") come back, if the output NFA was even slightly nonoptimal. So I fear that that bug is not really fixed either: it seems to be fairly easy to make pullback() create new states as fast as it can process them. pullback looks vaguely similar to what fixempties is doing, so I wonder whether it could be rewritten in a similar style to this patch. I've not tried it though. ---------------------------------------------------------------------- Comment By: Tom Lane (tgl) Date: 2013-03-05 12:00 Message: > it's not clear to me what worst case performance possibilities might be lurking in the nested loops containing the emptyreachable() call FWIW, this version is faster than the old code in my tests (the special cases with "doomed" states were required to make it so, else I'd not have bothered with those). I was mostly testing with regexes like "(((((a+|)+|)+|)+|)+|)", and it's possible that I somehow optimized for these cases at the expense of other ones, but there's nothing in the code that seems like it ought to be very specific to any particular regexp. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2013-03-05 08:36 Message: I'm satisfied that if the new routines have a correctness problem, I'm not able to find it. It's also clear that the new routines do not suffer from the infinite loop hazard found in the former version. There might well be opportunities for improvement remaining, and it's not clear to me what worst case performance possibilities might be lurking in the nested loops containing the emptyreachable() call, but I think such concerns can be set aside until demonstrations of trouble can be found. So, I think accepting this patch is the right move. (possible addition of hasanemptyout() pending). I also found, as tgl reported, that Tcl Bug 1810038 was quite easy to resurrect when making mods to the details of how these graphs are optimized. I think revising the pushfwd() and pullback() routines so that their infinite loop hazards are also removed is well worth doing. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2013-03-05 08:25 Message: The recursion in emptyreachable() gives me brief concern, but it's no worse than the recursion already present in other traversal routines. ---------------------------------------------------------------------- Comment By: Tom Lane (tgl) Date: 2013-03-05 08:02 Message: Yeah, that's a bit handwavy, and I don't fully understand the effects either. But one of the things I tried was removing that logic altogether and just always pulling arcs in the same direction, and that definitely produced inferior results (for one thing, it broke pullback()). So there's some method in Henry's madness there, even if I didn't explain it quite right. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2013-03-05 07:59 Message: It may not matter much, but I do not understand the selection argument made in the comment in the replaceempty() routine regarding larger fans creating greater likelihood of making states deleteable. ---------------------------------------------------------------------- Comment By: Tom Lane (tgl) Date: 2013-03-05 07:43 Message: The flag field is only nonzero for the special "pre" and "post" states that represent the NFA's starting and ending points. It's not valid to remove either one (at least not without creating a replacement), so the tests are there to ensure we don't try to do so. The previous coding had an assert() that seemed to imply that the pre and post states could never have empty arcs attached at all, but I think that's wrong --- at least, if I remember the testing I was doing over the weekend, this code can migrate empties to be attached directly to the pre state, and I don't see a reason why that's not ok. They'll be gone anyway when fixempties is done. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2013-03-05 07:38 Message: I see that the "flag" field of the state struct plays a role in fixempties() that it did not before. Was this just another way that fixempties() was broken? Or is the new preservation of states with nonzero flag values a misstep? Are there test cases to demonstrate the correctness either way? ---------------------------------------------------------------------- Comment By: Tom Lane (tgl) Date: 2013-03-05 07:37 Message: > A hasanemptyout(s) routine that didn't uselessly count them all to report existence would be an improvement. Yeah, I considered that but didn't feel it was worth the trouble. But if you want to add one, no objection here. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2013-03-05 07:31 Message: The (nonemptyouts(s2) > 0) test in fixempties seems mildly wasteful. A hasanemptyout(s) routine that didn't uselessly count them all to report existence would be an improvement. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2013-03-04 08:12 Message: I wrote a long comment, then via a network glitch, lost it. Summarized, it was "Yup." Will review the patch. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3606683&group_id=10894 |