Thread: [Flex-help] Generated lexer to ignore start conditions in Visual C++

Hi everyone,

For a project I'm using the following lexer definition (still a work in
progress):

----- CODE -----

%{
        #include <xmlrpc_parser.tab.hh>
%}

%option reentrant
%option bison-bridge
%option bison-locations
%option noyywrap
%option yylineno
%option prefix="xml"
%option never-interactive
%option nounistd
%option stack

%{
        char* MyStrnDup(const char* src, size_t len)
        {
                char* pRet = new char[len + 1]; /* room for the terminating NUL */
                memcpy(pRet, src, len);
                pRet[len] = '\0';
                return pRet;
        }
%}

%x STRINGDATA

%%
[ \t\f\n\r]
"<?xml version=\"1.0\"?>"               return XML_HEAD;
"<methodCall>"                  return XML_BEGIN_CALL;
"</methodCall>"                 return XML_END_CALL;
"<methodResponse>"              return XML_BEGIN_RESPONSE;
"</methodResponse>"             return XML_END_RESPONSE;
[0-9]+                          { yylval->itype = strtoul(yytext, 0, 10); return NUMBER; }
[0-9]+"."[0-9]*                 return FLOAT;
"<methodName>"                  return XML_BEGIN_METHOD_NAME;
"</methodName>"                 return XML_END_METHOD_NAME;
"<params>"                      return XML_BEGIN_PARAM_LIST;
"</params>"                     return XML_END_PARAM_LIST;
"<param>"                               return XML_BEGIN_PARAMETER;
"</param>"                      return XML_END_PARAMETER;
"<value>"                               return XML_BEGIN_VALUE;
"</value>"                      return XML_END_VALUE;
"<array>"                               return XML_BEGIN_ARRAY;
"</array>"                      return XML_END_ARRAY;
"<data>"                                return XML_BEGIN_ARRAY_DATA;
"</data>"                               return XML_END_ARRAY_DATA;
"<base64>"                      { BEGIN STRINGDATA; return XML_BEGIN_BASE64; }
"/base64>"                      return XML_END_BASE64;
"<boolean>"                     return XML_BEGIN_BOOLEAN;
"</boolean>"                    return XML_END_BOOLEAN;
"<dateTime.iso8601>"            return XML_BEGIN_DATETIME;
"</dateTime.iso8601>"           return XML_END_DATETIME;
"<double>"                      return XML_BEGIN_DOUBLE;
"</double>"                     return XML_END_DOUBLE;
"<int>"|"<i4>"                  return XML_BEGIN_INT;
"</int>"|"</i4>"                        return XML_END_INT;
"<string>"                      { BEGIN STRINGDATA; return XML_BEGIN_STRING; }
<STRINGDATA>[<]                 { BEGIN INITIAL; yylval->strtype = MyStrnDup(yytext, yyleng - 1); return FREETEXT; }
<STRINGDATA>.                   yymore();
"/string>"              return XML_END_STRING;
"<struct>"                      return XML_BEGIN_STRUCT;
"</struct>"                     return XML_END_STRUCT;
"<nil/>"                                return XML_NIL;
"<member>"                      return XML_BEGIN_MEMBER;
"</member>"                     return XML_END_MEMBER;
"<name>"                                return XML_BEGIN_NAME;
"</name>"                               return XML_END_NAME;
"<fault>"                               return XML_BEGIN_FAULT;
"</fault>"                      return XML_END_FAULT;
"faultCode"                     return XML_FAULT_CODE;
"faultString"                   return XML_FAULT_STRING;
[a-zA-Z]+"."[a-zA-Z]+           return SCOPED_NAME;
[a-zA-Z]+                       { yylval->strtype = MyStrnDup(yytext, yyleng); return WORD; }

----- CODE -----

Using the generated lex.xml.c on Linux yields the expected results. Compiling 
the same source with Visual C++ 2005, however, gives a surprising result.

When parsing the following text (line numbers added for later reference):

1: <?xml version="1.0"?>
2: <methodResponse>
3:  <params>
4:    <param>
5:        <value>
6:               <struct>
7:                 <member>
8:                   <name>foo</name>
9:                   <value><i4>1</i4></value>
10:                 </member>
11:                 <member>
12:                   <name>bar</name>
13:                   <value><i4>2</i4></value>
14:                 </member>
15:                 <member>
16:                   <name>otto</name>
17:                   <value><string>huhu!</string></value>
18:                 </member>
19:               </struct>
20:        </value>
21:    </param>
22:  </params>
23: </methodResponse>

I then studied the parser's debug trace:

----- DEBUG OUTPUT -----
Starting parse
Entering state 0
Reading a token: Next token is token XML_HEAD (1.0-1.0: )
Shifting token XML_HEAD (1.0-1.0: )
Entering state 1
Reading a token: Next token is token XML_BEGIN_RESPONSE (1.0-1.0: )
Shifting token XML_BEGIN_RESPONSE (1.0-1.0: )
Entering state 4
Reading a token: Next token is token XML_BEGIN_PARAM_LIST (1.0-1.0: )
Shifting token XML_BEGIN_PARAM_LIST (1.0-1.0: )
Entering state 10
Reading a token: Next token is token XML_BEGIN_PARAMETER (1.0-1.0: )
Shifting token XML_BEGIN_PARAMETER (1.0-1.0: )
Entering state 19
Reading a token: Next token is token XML_BEGIN_VALUE (1.0-1.0: )
Shifting token XML_BEGIN_VALUE (1.0-1.0: )
Entering state 26
Reading a token: Next token is token XML_BEGIN_STRUCT (1.0-1.0: )
Shifting token XML_BEGIN_STRUCT (1.0-1.0: )
Entering state 40
Reducing stack by rule 27 (line 88):
   $1 = token XML_BEGIN_STRUCT (1.0-1.0: )
-> $$ = nterm xml_begin_struct (1.0-1.0: )
Stack now 0 1 4 10 19 26
Entering state 49
Reading a token: Next token is token XML_BEGIN_MEMBER (1.0-1.0: )
Shifting token XML_BEGIN_MEMBER (1.0-1.0: )
Entering state 64
Reading a token: Next token is token XML_BEGIN_NAME (1.0-1.0: )
Shifting token XML_BEGIN_NAME (1.0-1.0: )
Entering state 77
Reading a token: Next token is token FREETEXT (1.0-1.0: )
Parse error: syntax error, unexpected FREETEXT, expecting WORD
----- DEBUG OUTPUT -----

The text is correctly scanned until the lexer reaches line 8. It correctly 
returns XML_BEGIN_NAME for the opening tag. But the following content is 
returned as FREETEXT although WORD was expected by the parser.

This is strange, because FREETEXT is returned only under the STRINGDATA 
condition, which in turn is only begun after encountering a <string> or 
<base64> tag.
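
To spell out what I expect from the two start conditions, here is a small 
hand-written C++ model. It is not flex output, only a sketch: Cond, ModelScan 
and the TAG(...)/WORD(...)/FREETEXT(...) labels are made up for illustration, 
and all tags are lumped into one generic TAG token instead of the real 
XML_BEGIN_*/XML_END_* tokens.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the INITIAL and STRINGDATA start conditions.
enum class Cond { Initial, StringData };

// Simplified model of the intended scanning behaviour (not generated by flex).
std::vector<std::string> ModelScan(const std::string& in)
{
    std::vector<std::string> out;
    Cond cond = Cond::Initial;
    std::size_t i = 0;

    while (i < in.size()) {
        if (cond == Cond::StringData) {
            // Exclusive condition: only the <STRINGDATA> rules are active,
            // so everything up to the next '<' becomes one FREETEXT token.
            std::size_t lt = in.find('<', i);
            if (lt == std::string::npos) break;      // unterminated text
            out.push_back("FREETEXT(" + in.substr(i, lt - i) + ")");
            i = lt + 1;                              // '<' is consumed, as in the [<] rule
            cond = Cond::Initial;                    // BEGIN INITIAL
            continue;
        }

        unsigned char c = static_cast<unsigned char>(in[i]);
        if (std::isspace(c)) { ++i; continue; }      // whitespace is skipped

        if (c == '<' || c == '/') {
            // A tag; a leading '/' occurs when the '<' was eaten in STRINGDATA.
            std::size_t gt = in.find('>', i);
            if (gt == std::string::npos) break;
            std::string tag = in.substr(i, gt - i + 1);
            out.push_back("TAG(" + tag + ")");
            if (tag == "<string>" || tag == "<base64>")
                cond = Cond::StringData;             // BEGIN STRINGDATA
            i = gt + 1;
            continue;
        }

        if (std::isalpha(c)) {
            // In INITIAL, a run of letters is a WORD, never FREETEXT.
            std::size_t start = i;
            while (i < in.size() && std::isalpha(static_cast<unsigned char>(in[i]))) ++i;
            out.push_back("WORD(" + in.substr(start, i - start) + ")");
            continue;
        }

        ++i;                                         // anything else is ignored in this model
    }
    return out;
}
```

In this model, ModelScan("<name>foo</name>") gives a tag, a WORD and a tag, 
whereas ModelScan("<string>huhu!</string>") gives a tag, a FREETEXT and a tag. 
FREETEXT directly after <name> should therefore be impossible.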

Does that mean that the generated code, when compiled with Visual C++, ignores 
start conditions? If so, are there any suggestions for workarounds?

Best regards,
- Christian Loth
