[Flex-help] yy_scan_buffer: partial token at the end of a buffer (short read)
From: <pl...@ig...> - 2012-08-24 07:38:50
Hello all,

I'm working on a small project that needs to parse a simple scripting language, and I use bison and flex. Because of the nature of the application, I have the following constraints:

1. I have to do all the reads myself and cannot let yylex() do them.
2. Because of short reads, my input often arrives across several consecutive reads, which split it at arbitrary points.
3. I have some multi-character tokens, such as ID, which is [a-zA-Z0-9_.-]+.
4. I have multiple concurrent parsers.
5. My grammar has an explicit end-of-script token, so I don't need to rely on detecting end-of-file or end-of-stream.

Because of 4, I use a reentrant parser and lexer, and because of 1, I chose push parsing: I append two \0s to the end of my read buffer, then call yy_scan_buffer(), yylex() and yypush_parse() in a loop.

All works fine until (because of 2 and 3) a buffer boundary cuts a multi-character token in half. In this case yylex() returns 0 when it hits the end of the buffer, but before that it also returns the partial token as a valid token. When the next buffer is available, the same loop starts tokenizing it, and the second half of the token is interpreted as a new token. This is obviously not desired, as it randomly cuts valid IDs into two (most probably invalid) IDs depending on buffer sizes and I/O events.

Question: I think the simplest solution would be to tell flex that end-of-buffer doesn't mean end-of-token, and that the next buffer fed in may be required to finish the current token. Of course this also means that if there is no next buffer, I'll end up with a token stuck in the state machine because it couldn't decide whether it was a partial one, but because of 5 this is no problem for me. Unfortunately I couldn't find a way to do this; is there one?

If that is not possible, an alternative, much less preferred, solution would be to modify my parse loop to look ahead one token and always save where the last token started.
Then, if I detect an end-of-buffer, I don't pass the last valid-looking token on to bison; instead I store its text and start the next loop by inserting that string at the front of the new buffer. Would this have any side effects or risks?

Maybe the above alternative could be done in the grammar, by allowing any multi-character token to be concatenated from multiple tokens separated by explicit end-of-buffer tokens; in this case the question is how to get flex to emit an explicit end-of-buffer token.

Thank you in advance,
Tibor Palinkas
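
For illustration, the carry-over alternative described above might be sketched as follows. This is a simplified variant, not a definitive implementation: instead of looking ahead one token in the yylex() loop, it trims the trailing run of ID characters off each chunk before the chunk is scanned and prepends that tail to the next read. The names is_id_char, safe_prefix_len, struct carry and feed_chunk are hypothetical helpers, the 256-byte carry limit is an arbitrary assumption, and only tokens built from the ID character class are handled. Other multi-character tokens (strings, multi-character operators) would need their own rule, and an end-of-script token that itself consists of ID characters would be held back until a delimiter follows it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: characters allowed in an ID, i.e. [a-zA-Z0-9_.-]. */
static int is_id_char(unsigned char c)
{
    return isalnum(c) || c == '_' || c == '.' || c == '-';
}

/* Number of leading bytes that are safe to scan now. Whatever follows is a
 * run of ID characters touching the end of the data, so it may be the first
 * half of a token that continues in the next read. */
static size_t safe_prefix_len(const char *data, size_t len)
{
    size_t n = len;
    while (n > 0 && is_id_char((unsigned char)data[n - 1]))
        n--;
    return n;
}

/* Hypothetical per-parser state: the tail held back from the previous chunk. */
struct carry {
    char buf[256];   /* assumed upper bound on a held-back tail */
    size_t len;
};

/* Builds "carry + chunk" in out, holds back the trailing partial token in
 * the carry, and terminates the scannable part with two NUL bytes as
 * yy_scan_buffer() requires. Returns the number of scannable bytes (not
 * counting the two NULs), or (size_t)-1 if a buffer is too small. */
static size_t feed_chunk(struct carry *c, const char *chunk, size_t chunk_len,
                         char *out, size_t out_cap)
{
    size_t total, safe, tail;

    assert(c != NULL && chunk != NULL && out != NULL);

    total = c->len + chunk_len;
    if (total + 2 > out_cap)
        return (size_t)-1;

    memcpy(out, c->buf, c->len);
    memcpy(out + c->len, chunk, chunk_len);

    safe = safe_prefix_len(out, total);
    tail = total - safe;
    if (tail > sizeof c->buf)
        return (size_t)-1;

    memcpy(c->buf, out + safe, tail);   /* save the possible partial token */
    c->len = tail;

    out[safe] = '\0';                   /* the two end-of-buffer NULs */
    out[safe + 1] = '\0';
    return safe;
}
```

In the parse loop, the returned length plus 2 would be the size passed to yy_scan_buffer() together with out and the scanner handle. The existing yylex() and yypush_parse() loop would stay unchanged, since every buffer it sees now ends on a non-ID character.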