Re: [Flex-help] EOF handling in flex scanners is mysterious to me
flex is a tool for generating scanners
From: Will E. <wes...@gm...> - 2014-10-14 12:33:32
Yeah, that's definitely odd. CC'ing flex-help to get it logged there.

First thing, though: do you notice this with later versions of flex? I
expect the answer is "Yes", but let's find out whether you've tried that yet.

On Tuesday, 14 October 2014, 8:26 am -0400, "Eric S. Raymond" <es...@th...> wrote:
> My apologies for sending this directly to you, but my attempt to
> subscribe to flex-help seems to have blackholed.
>
> There seems to be something either buggy or very badly documented (I'm hoping
> the latter) about the way flex 2.5.35 generates scanners with the following
> options:
>
>     %option reentrant bison-bridge
>     %option warn nodefault
>     %option pointer
>     %option noyywrap noyyget_extra noyyget_leng noyyset_lineno
>     %option noyyget_out noyyset_out noyyget_lval noyyset_lval
>     %option noyyget_lloc noyyset_lloc noyyget_debug noyyset_debug
>
> My situation is this. I maintain cvs-fast-export, which uses a
> Bison/Flex parser (using Bison 3.0.2) to digest CVS master files.
> High speed is extremely important in this application, as the data
> sets (legacy CVS repositories) are often extremely large; the parser
> is reentrant so master parsing can be multithreaded for higher
> performance. The code is available at
>
>     https://gitorious.org/cvs-fast-export/cvs-fast-export.git
>
> and is readily tested with
>
>     cvs-fast-export -v tests/basic.repo/basic/README,v
>
> from the repo directory.
>
> I would like to use options like fast, read, batch and
> never-interactive, but am blocked from doing so by behaviors I don't
> understand.
>
> The scanner code contains the following hack, inserted in the preamble at
> some time in the past and recently modified to use yyget_in(yyscanner)
> when I made the scanner re-entrant:
>
>     #define YY_INPUT(buf,result,max_size) { \
>         int c = getc(yyget_in(yyscanner)); \
>         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
>     }
>
> I would like to remove this and use flex's native input handling,
> because this limits the scanner to character-at-a-time input. But when
> I do so the parser hangs forever. Running strace reveals that it is
> repeatedly reading no input,
>
>     read(3, "", 4096) = 0
>     read(3, "", 4096) = 0
>     read(3, "", 4096) = 0
>     read(3, "", 4096) = 0
>     read(3, "", 4096) = 0
>
> having apparently failed to recognize EOF. Setting %option batch
> never-interactive does not change this.
>
> My setup code looks like this:
>
>     yyscan_t scanner;
>     FILE *in;
>     cvs_file *cvs;
>
>     in = fopen(name, "r");
>
>     yylex_init(&scanner);
>     yyset_in(in, scanner);
>     yyparse(scanner, cvs);
>     yylex_destroy(scanner);
>
>     fclose(in);
>
> which if I understand the documentation correctly ought to be sufficient
> to set up the default input machinery. My first question is: what is
> wrong here? Why is the scanner failing to recognize EOF when using the stock
> YY_INPUT?
>
> I tried replacing the custom YY_INPUT with this logically equivalent function:
>
>     #define YY_INPUT(buf,result,max_size) result = custom_input(buf, max_size, yyscanner);
>
>     static ssize_t custom_input(char *buf, size_t result_max, yyscan_t yyscanner)
>     {
>         int c = getc(yyget_in(yyscanner));
>         ssize_t res = (c == EOF) ? YY_NULL : (buf[0] = c, 1);
>         if (yydebug)
>             fprintf(stderr, "custom_input(..., %ld, ...) -> %zd\n", result_max, res);
>         return res;
>     }
>
> That works but still limits throughput because I/O is being done a
> character at a time. When I replace the custom input function with
> this:
>
>     #define YY_INPUT(buf,result,max_size) result = custom_input(buf, max_size, yyscanner);
>
>     static ssize_t custom_input(char *buf, size_t result_max, yyscan_t yyscanner)
>     {
>         ssize_t res = fread(buf, 1, result_max, yyget_in(yyscanner));
>
>         if (feof(yyget_in(yyscanner)) || ferror(yyget_in(yyscanner)))
>             res = YY_NULL;
>         if (yydebug)
>             fprintf(stderr, "custom_input(..., %ld, ...) -> %zd\n", result_max, res);
>         return res;
>     }
>
> the scanner sucks in the entire input file on the first read, raises
> end of input immediately, and *doesn't parse tokens out of the input
> buffer*.
>
>     esr@snark:~/WWW/cvs-fast-export$ cvs-fast-export -v tests/basic.repo/basic/README,v >/dev/null
>     Starting parse
>     Entering state 0
>     Reading a token: custom_input(..., 8192, ...) -> 0
>     Now at end of input.
>     Reducing stack by rule 3 (line 105):
>     -> $$ = nterm headers ()
>     Stack now 0
>     Entering state 9
>     Reducing stack by rule 25 (line 161):
>     -> $$ = nterm revisions ()
>     Stack now 0 9
>     Entering state 21
>     Now at end of input.
>     parse error syntax error at
>
> My second question is: why doesn't the generated scanner parse remaining
> tokens out of the input buffer when it reaches EOF?
>
> The documentation is unhelpful on these points and in general vague about
> how input is acquired and buffered. I have read it very closely but am
> unable to determine whether I am seeing the expected behavior.
> --
> <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
>
> All governments are more or less combinations against the
> people. . .and as rulers have no more virtue than the ruled. . .
> the power of government can only be kept within its constituted
> bounds by the display of a power equal to itself, the collected
> sentiment of the people.
> -- Benjamin Franklin Bache, in a Philadelphia Aurora editorial 1794
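
A note on the second question, as a sketch rather than a confirmed diagnosis: in the fread-based custom_input quoted above, feof() becomes true as soon as fread() runs into end of file, even when that same call delivered data. A file smaller than the buffer is read whole in one call, the EOF indicator is set, and the byte count is then overwritten with YY_NULL (which flex defines as 0). That matches the trace, where a request for 8192 bytes reports 0. The behaviour can be reproduced without flex; read_buggy, read_fixed and sample_file below are hypothetical names for illustration, not part of flex or cvs-fast-export.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: a temporary file holding `text`, rewound to the start. */
static FILE *sample_file(const char *text)
{
    FILE *f = tmpfile();
    assert(f != NULL);
    fwrite(text, 1, strlen(text), f);
    rewind(f);
    return f;
}

/* Same logic as the quoted fread version: any EOF or error indication
 * zeroes the count, even if fread() actually delivered bytes. */
static size_t read_buggy(char *buf, size_t max_size, FILE *in)
{
    size_t n = fread(buf, 1, max_size, in);
    if (feof(in) || ferror(in))
        n = 0;              /* throws away the final (or only) short block */
    return n;
}

/* Report what was actually read. The result is 0 (the value of YY_NULL)
 * only when nothing came back, i.e. at true end of input or on error. */
static size_t read_fixed(char *buf, size_t max_size, FILE *in)
{
    return fread(buf, 1, max_size, in);
}
```

With a 5-byte file and a 64-byte buffer, read_buggy returns 0 on its first call, while read_fixed returns 5 and then 0 on the following call. In the scanner this would presumably be wired in as `result = read_fixed(buf, max_size, yyget_in(yyscanner));` inside YY_INPUT; that wiring is an assumption and has not been tested against cvs-fast-export.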