C++ CPD treats escapes as significant in strings
While looking into bug 1814737, I noticed that the C++ tokenizer treats differences in escape sequences as significant in strings, even though those differences may not change the actual text generated.
In particular it would treat this string:
"abc"
as different from this string:
"a\
b\
c"
when they are really identical.
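Similarly it would treat these as different strings:
"a\007z" vs. "a\x07z"
(The character after the hex escape must not itself be a hex digit: a \x escape consumes every hex digit that follows, so "a\x07b" would be "a" followed by the single character 0x7B.)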
It would be more robust if CPD worked out the text each string actually generates and compared the tokens on that basis.
A similar issue can occur with concatenated string literals as in
"ab" vs. "a" "b"
which generate the same thing.
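To make the three cases concrete, here is a small sketch that can be compiled to confirm each pair really is the same string. The function names are made up for illustration. The escape example uses a trailing z rather than b, because a \x escape swallows every hex digit after it; the last function shows that pitfall, assuming an ASCII execution character set.

```cpp
#include <cassert>
#include <string>

// Line continuation: each backslash-newline pair is deleted before the
// literal is tokenized, so both spellings yield "abc".
bool continuation_is_insignificant() {
    return std::string("abc") == std::string("a\
b\
c");
}

// Octal \007 and hex \x07 both name the character with value 7.
bool escape_spelling_is_insignificant() {
    return std::string("a\007z") == std::string("a\x07z");
}

// Adjacent string literals are concatenated by the compiler.
bool concatenation_is_insignificant() {
    return std::string("ab") == std::string("a" "b");
}

// Pitfall: \x07b is one escape with value 0x7B, which is '{' in ASCII.
bool hex_escape_is_greedy() {
    return std::string("a\x07b") == std::string("a{");
}

// Runs every check above; aborts if any pair differs.
void check_equivalences() {
    assert(continuation_is_insignificant());
    assert(escape_spelling_is_insignificant());
    assert(concatenation_is_insignificant());
    assert(hex_escape_is_greedy());
}
```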
Logged In: YES
user_id=5159
Originator: NO
Hm, interesting... I'm not sure if I entirely agree, but I see where you're coming from there. I guess I'm used to thinking of a tokenizer as being able to produce exactly what was entered as input...
Yours,
Tom
Logged In: YES
user_id=130378
Originator: YES
I don't think it really matters what the job of a tokenizer is. The point is what the job of CPD is.
The job of CPD is to find snippets of code that do not differ in a significant way. CPD ignores whitespace between tokens because that whitespace is insignificant to the code produced. Similarly, line continuation inside a string is insignificant, as is the choice of octal vs. hex escape codes and the concatenation of adjacent string literals.
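As a rough sketch of what comparing on the generated text could look like, the function below reduces a run of adjacent narrow string literals to the value it generates. It is written in C++ only for illustration: `literal_value` and `simple_escape` are hypothetical names, not part of CPD, and the sketch assumes plain "..." literals with no prefixes, raw strings, universal character names, or comments between the literals.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: maps the character after a backslash to its value.
char simple_escape(char c) {
    switch (c) {
        case 'n': return '\n';
        case 't': return '\t';
        case 'r': return '\r';
        case 'a': return '\a';
        case 'b': return '\b';
        case 'f': return '\f';
        case 'v': return '\v';
        default:  return c;  // covers \\, \", \' and \?
    }
}

// Hypothetical helper: returns the text generated by one or more adjacent
// string literals, e.g. the source text  "a" "b"  yields  ab.
std::string literal_value(const std::string& source) {
    // Step 1: splice lines by deleting every backslash-newline pair.
    std::string text;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n') {
            ++i;  // skip the newline as well
        } else {
            text += source[i];
        }
    }

    // Step 2: decode each literal and append it to a single value, which
    // also takes care of concatenating adjacent literals.
    std::string value;
    std::size_t i = 0;
    while (i < text.size()) {
        if (std::isspace(static_cast<unsigned char>(text[i]))) { ++i; continue; }
        if (text[i] != '"') throw std::invalid_argument("expected a string literal");
        ++i;  // opening quote
        while (i < text.size() && text[i] != '"') {
            if (text[i] != '\\') { value += text[i++]; continue; }
            ++i;  // skip the backslash
            if (i >= text.size()) throw std::invalid_argument("dangling backslash");
            if (text[i] >= '0' && text[i] <= '7') {
                // Octal escape: at most three octal digits.
                unsigned code = 0;
                for (int n = 0; n < 3 && i < text.size()
                                && text[i] >= '0' && text[i] <= '7'; ++n) {
                    code = code * 8 + static_cast<unsigned>(text[i++] - '0');
                }
                value += static_cast<char>(code);
            } else if (text[i] == 'x') {
                // Hex escape: consumes every hex digit that follows.
                // Out-of-range or empty escapes are not diagnosed here.
                ++i;
                unsigned code = 0;
                while (i < text.size()
                       && std::isxdigit(static_cast<unsigned char>(text[i]))) {
                    const unsigned char d = static_cast<unsigned char>(text[i++]);
                    code = code * 16 + static_cast<unsigned>(
                        std::isdigit(d) ? d - '0' : std::tolower(d) - 'a' + 10);
                }
                value += static_cast<char>(code);
            } else {
                value += simple_escape(text[i++]);
            }
        }
        if (i >= text.size()) throw std::invalid_argument("unterminated literal");
        ++i;  // closing quote
    }
    return value;
}
```

With something along these lines, the source texts "ab" and "a" "b" reduce to the same value, as do "a\007z" and "a\x07z", so a duplicate detector comparing the reduced values would treat each pair as one token. Whether that belongs in the tokenizer or in a later comparison step is a separate design question.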