#174 Fragment recognition too rudimentary (parsing)


LXR uses a very simple parser to isolate homogeneous fragment of source file to ease recognition of keywords and variables. It tries to wipe out strings, comments and include constructs, so that what is left is composed only of operators and variables.

In some circumstances, the parser leaves strings before seeing the end of the string, e.g. in "A string with \" inside". The parser thinks "A string with \" is the string, the inside may be a variable and a new string is then opened out of sync. The cause is a simplistic parse based on pattern-matching with opening and closing delimiters for context. There is a need for an other kind of objects which maintain the parser in the same state when detected. What happens is data represented by the regexp associated to these objects is "swallowed" and nothing happens because they cannot be classified as opening or closing delimiters (provided it is possible, i.e. they do not appear also as these delimiters).

For that, 'spec' of generic.conf is modified to have 4 parameters insted of 3:

'spec' => [ context_name, open_pattern, close_pattern, stay_pattern, ... ]

(which, by the way, means a huge rewrite of generic.conf)

In the case of C strings, we can now have (extract only):
'string', '"', '"', '\\\\"',

But it is not enough; we must take care that this stay_pattern is considered BEFORE end_pattern because the latter is a prefix of the former and the expected match leng of stay_pattern is longer than that of end_pattern. Unhappily, there is no way to guarantee that all patterns of a 'spec' can be sorted such that they'll match from the most specific to the less specific.

The proposed patch tries to solve that, but there will always be situations where it will fail. It is a consequence of pattern-matching parsing. Using a finite state automaton only to find candidate identifiers is not really worth it.

Other bug loosely related:
When the line is split according to all patterns of a spec, it must be determined if one element is an opening delimiter. The answer is positive if there is an exact match, that is the pattern extends from the start ^ to the end $. These anchors are added to the ends of the merged regexp containing all the patterns as a sequence of alternatives. It means that only the first delimiter is constrained to start at the beginning of the string and the last delimiter to end at the end of the string. This leads to false detection of short delimiters in the middle of strings.

This is also addressed in the proposed patch.

!-37 110307 (after line 37 in sub init, add:)
!my @stay; # Fragment maintaining current context
!-49 110307 (after line 49, add:)
! @stay = ();
!-58,58 110307 (replace line 58 with:)
! while (@_ = splice(@blksep, 0, 4)) {
!-61 110307 (after line 61, add:)
! push(@stay, $_[3]);
!-74 110307 (after line 74, add:)
! foreach (@stay) {
! next if $_ eq '';
! $split = "$_|" . $split;
! }
------------------- end of patch -------------

in sub init & nextfrag
nextfrag sometimes matches erroneously delimiters because ^ and $ surround
the merged patterns of all alternatives. ^ and $ should be put arround each
alternative to ensure that the delimiter will match as a wholle and not as
a part of the fragment.
!-68,68 110307 init
! $open .= "^($_)\$|";
!-138,138 110307 nextfrag
! if ($frags[0] =~ /$open/) {
!-150,150 110307 nextfrag
! if (defined($frag) && (@_ = $frag =~ /$open/)) {
------------------------ end of patch ---------------------------


  • Andre-Littoz

    • priority: 5 --> 7
    • status: open --> open-works-for-me
  • Andre-Littoz

    • priority: 7 --> 8
  • Andre-Littoz

    Modified genereic.conf to work with the patch

  • Andre-Littoz

    • assigned_to: nobody --> ajlittoz
    • status: open-works-for-me --> closed-fixed
  • Andre-Littoz

    The correction proposed above does not cover all cases.
    The fix in CVS is more state-oriented. The "open" delimiter is used to determine the next state. In this state, the associated "end" delimiter (and only this one) must be met to leave and revert to the initial state. To circumvent any ambiguity arising from the language grammar (e.g. " and \" in C), every state may have a "stay" pattern which acts as a no-op, preventing a change of state. "atom" is modified to mean "stay" pattern for initial state (utterly needed in Perl).
    With this scheme, it becomes possible to sort the 'spec's by order of expected match length to eliminate all risks of mismatch because only the "open" delimiters are used to transition from initial state to q specific one. Then, in this secondary state, only the "end " pattern is used, which can be cleverly designed to match by order of expected length without ambiguity.

    Fix contains modification to generic.conf but NOT ALL languages have been reviewed. The 'spec's may not be correct everywhere. Please check at least php, Java and VB (where I commented out 'atom' because I felt it was no longer needed). I made the right corrections in C, C++ and Perl.
    The changes to generic.conf will be documented in the documentation to come.

    IMHO, the parser is now theoretically sane. However, since the code is larger, there may be a performance penalty. Help is welcome to tune the code and test the new parser on hard cases.

    Due to changes in generic.conf, it is safer to reindex the source trees to avoid compatibility issues.

  • Andre-Littoz

    This has now been fixed in CVS.

    If you can install the new version and check that it solves your
    problem, then it would be very useful.

    Thanks for reporting this defect and helping to make LXR better.