[Flex-help] yy_scan_buffer: partial token at the end of a buffer (short read)

Hello all,

I'm working on a small project that needs to parse a simple scripting 
language, and I use bison and flex. Because of the nature of the 
application, I have the following constraints:

1. I have to do all the reads myself and cannot let yylex() do them
2. because of short reads, my input often arrives through multiple 
consecutive reads, which split it at arbitrary points
3. I have some multiple character long tokens, like ID which is 
[a-zA-Z0-9_.-]+
4. I have multiple concurrent parsers
5. my grammar has explicit end-of-script token, so I don't need to rely 
on detecting end-of-file or end-of-stream

Because of 4., I use a reentrant parser/lexer, and because of 1., I chose 
to do push-parsing: I append two \0s to the end of my read buffer, then 
call yy_scan_buffer(), yylex() and yypush_parse() in a loop. All works 
fine until (because of 2. and 3.) a buffer boundary cuts a multi-char 
token in half. In this case yylex() first returns the partial token as 
if it were a complete, valid token, and then returns 0 on hitting the 
end of the buffer. When the next buffer is available, the same loop 
starts tokenizing it, and the second half of the token is interpreted as 
a new token. This is obviously not desired, as it randomly cuts valid 
IDs into 2 (most probably invalid) IDs depending on buffer sizes and I/O 
events.

Question: I think the simplest solution would be to tell flex that 
end-of-buffer doesn't mean end-of-token, and that the next buffer fed in 
may be required to finish the current token. Of course this also means 
that if there is no next buffer, I'll end up with a token stuck in the 
state machine because it couldn't decide whether it was a partial one - 
but because of 5., this is no problem for me. Unfortunately I couldn't 
find a way to do this. Is there one?

If that is not possible, an alternative, much less preferred, solution
would be to modify my parse loop to look ahead one token and always save 
where the last token started. Then if I detect an end-of-buffer, I just 
don't pass the last valid-looking token on to bison but store its string 
instead, and I start the next loop by inserting this string at the front 
of the new buffer. Would this have any side effects or risks?
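
To make the carry-over idea concrete, here is a minimal sketch in C. It 
is only an illustration under assumptions, not tested against flex: the 
names struct carry and carry_feed are hypothetical, it handles only the 
ID character class from 3., and instead of looking ahead one token it 
simply holds back the trailing run of ID characters of each read. The 
returned buffer ends in two \0s, as yy_scan_buffer() requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Characters of the ID token from the post: [a-zA-Z0-9_.-] */
static int is_id_char(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_' || c == '.' || c == '-';
}

/* Hypothetical holder for the held-back tail of the previous read.
 * Initialize with { NULL, 0 }. */
struct carry {
    char  *data;
    size_t len;
};

/* Hypothetical helper: joins the previous carry with the new chunk, then
 * holds back the trailing run of ID characters as the new carry, because
 * that run may be the first half of a token split by a short read.
 * Returns a malloc'ed buffer with *scan_len scannable bytes followed by
 * two NUL bytes, or NULL on allocation failure (carry left unchanged). */
char *carry_feed(struct carry *c, const char *chunk, size_t n,
                 size_t *scan_len)
{
    assert(c != NULL && scan_len != NULL);

    size_t total = c->len + n;
    char *buf = malloc(total + 2);
    if (buf == NULL)
        return NULL;
    if (c->len > 0)
        memcpy(buf, c->data, c->len);
    if (n > 0)
        memcpy(buf + c->len, chunk, n);

    /* Walk back over the trailing ID characters. */
    size_t cut = total;
    while (cut > 0 && is_id_char((unsigned char)buf[cut - 1]))
        cut--;

    /* Save the possibly incomplete tail for the next call. */
    size_t tail = total - cut;
    char *next = NULL;
    if (tail > 0) {
        next = malloc(tail);
        if (next == NULL) {
            free(buf);
            return NULL;
        }
        memcpy(next, buf + cut, tail);
    }
    free(c->data);
    c->data = next;
    c->len = tail;

    /* Terminate the scannable part with the two NULs flex expects. */
    buf[cut] = '\0';
    buf[cut + 1] = '\0';
    *scan_len = cut;
    return buf;
}
```

The caller would pass the result to yy_scan_buffer() with a size of 
*scan_len + 2 and free it after deleting the flex buffer. One visible 
risk: a read that ends in ID characters is never fully scanned until 
more input arrives, so the end-of-script token from 5. must not itself 
end in an ID character, or must be followed by something that is not 
one.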

Maybe the above alternative could be done in the grammar by allowing any 
multi-char token to be concatenated from multiple tokens separated by 
explicit end-of-buffer tokens; in this case the question is how to get 
flex to emit an explicit end-of-buffer token.

Thank you in advance,

Tibor Palinkas
