From: Gilles D. <gr...@sc...> - 2004-12-16 22:41:04

As a followup to the recent thread between Jon, David and Steve, I just
wanted to let you all know that I discovered a bug in the external_parsers
handling of htdig (versions 3.1.6 and 3.2.0b6).

Jon Sorensen reported verbose htdig output like this:
> Content-Type: application/pdf
> Header line:
> returnStatus = 0
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 8192 from document
> Read 907 from document
> Read a total of 361355 bytes
> word: Read@0
> word: 8192@4
> word: from@9
> word: document@13
> word: Read@21
> word: 8192@26
> word: from@30
> word: document@35
> word: Read@43
> word: 8192@47
> word: from@52
> word: document@56

I've seen that before in posts to htdig-general, but couldn't make sense
of it.

Jon also asked:
> I posted a question recently about indexing pdfs with doc2html
> but I can't figure out what the problem is. I believe that the config
> is correct but there may be a problem there. when I dig a number of
> pdfs the files are read but the words indexed are not correct:
> word: Read@0
> word: 8192@4
> word: from@9
> Does anyone know what this indicates?
> From looking at the message archives it seems that others have had
> this problem but there weren't any solutions posted in the messages

It appears that htdig's stdout is being fed back into the parser, which
seemed to defy all logic, until I figured out the cause on a new test
system, which was also having problems indexing PDFs. When I ran the
external converter manually, I got the error:

/usr/local/bin/perl: bad interpreter: No such file or directory

The problem was that the script began with "#!/usr/local/bin/perl",
which worked fine on the older system, but not on the newer one.
That explained why PDF indexing didn't work (htdig couldn't "exec"
the external_parsers script), but not why htdig was eating its own output.

Then I realized what was going on: htdig does a fork() and execv()
to call the script, and if the execv() fails the child process exits,
as it should. But the child process exits using the exit() function,
rather than _exit(), which is a no-no in a child process. The problem
is that the fork() makes a duplicate of everything in the parent
process, including all the parent's I/O buffers. If the child process
calls exit(), it flushes its copy of the parent's stdout buffer, so a
copy of much of the parent's verbose output gets flushed out into the
child's pipe, which the parent reads and parses. The fix is to change
htdig/ExternalParser.cc like this:

--- htdig/ExternalParser.cc.orig 2004-05-28 08:15:14.000000000 -0500
+++ htdig/ExternalParser.cc 2004-12-16 16:37:14.000000000 -0600
@@ -280,7 +280,11 @@ ExternalParser::parse(Retriever &retriev
         // Call External Parser
         execv(parsargs[0], parsargs);
 
-        exit(EXIT_FAILURE);
+        perror("execv");
+        write(STDERR_FILENO, "External parser error: Can't execute ", 37);
+        write(STDERR_FILENO, parsargs[0], strlen(parsargs[0]));
+        write(STDERR_FILENO, "\n", 1);
+        _exit(EXIT_FAILURE);
     }
 
     // Parent Process

Of course, this is only a problem if the external parser/converter script
can't be exec'ed by htdig, so if all is working well, this bug won't be
an issue.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
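
The exit() versus _exit() behaviour described in the message above can be reproduced outside htdig. The following is only a minimal sketch, not htdig code: the function name leaked_after_failed_exec and the path /nonexistent/external-parser are made up for illustration, and it assumes a POSIX system where stdout is buffered (the usual case). The parent leaves some "verbose output" in its stdout buffer, forks, the child points its stdout at a pipe, execv() fails, and the function returns whatever the parent then reads back from that pipe.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo helper (not part of htdig). Returns the bytes the
// parent reads from the child's pipe after the child's execv() has failed.
std::string leaked_after_failed_exec(bool use_stdio_exit)
{
    int fds[2];
    int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;

    // Parent "verbose output". No newline, so it stays in the stdio buffer
    // whether stdout is line-buffered or fully buffered.
    fputs("Read 8192 from document ", stdout);

    pid_t pid = fork();
    assert(pid >= 0);

    if (pid == 0) {
        // Child: send stdout into the pipe, where parser output would go.
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);

        char prog[] = "/nonexistent/external-parser"; // assumed not to exist
        char *args[] = { prog, nullptr };
        execv(args[0], args);                         // fails and returns

        if (use_stdio_exit)
            exit(EXIT_FAILURE);  // flushes the inherited copy of the buffer
        _exit(EXIT_FAILURE);     // skips stdio cleanup, nothing is flushed
    }

    // Parent: read everything the child wrote to the pipe.
    close(fds[1]);
    std::string got;
    char chunk[256];
    ssize_t n;
    while ((n = read(fds[0], chunk, sizeof chunk)) > 0)
        got.append(chunk, static_cast<size_t>(n));
    close(fds[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    fflush(stdout);              // the parent's own copy goes to real stdout
    return got;
}
```

Calling leaked_after_failed_exec(true) should hand back the parent's own "Read 8192 from document" text, which is the same symptom as the bogus "word: Read@0" entries; with false the pipe should come back empty.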
From: David A. <D.J...@so...> - 2004-12-17 11:00:55

Gilles,

Congratulations on getting to the bottom of this; it solves a few mysterious
reports of difficulties with external parsing reported to the mailing list
in the last couple of years.

Could I clarify one point with you? You wrote:

> The problem was that the script began with "#!/usr/local/bin/perl",
> which worked fine on the older system, but not on the newer one.

Was this simply because the Perl binary was in a different location on your
newer system, or because there is a general problem on the newer system with
this method of specifying the executable to run the script (which would be
serious indeed!)?

I think the moral is that users must take great care in correctly
configuring their external parser(s), and must check that they work from the
command line.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Gilles Detillieux" <gr...@sc...>
To: "ht://Dig mailing list" <htd...@li...>
Cc: "Gilles Detillieux" <gr...@sc...>; "Gilbert Detillieux" <ge...@cs...>
Sent: Thursday, December 16, 2004 10:40 PM
Subject: [htdig] external_parsers bug (was Re: [htdig] pdf indexing problems)

> _______________________________________________
> ht://Dig general mailing list: <htd...@li...>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
From: Gilles D. <gr...@sc...> - 2004-12-17 15:47:28

According to David Adams:
> Congratulations on getting to the bottom of this; it solves a few mysterious
> reports of difficulties with external parsing reported to the mailing list
> in the last couple of years.

Yeah, that's what I figured. I'm not on the list anymore because of time
constraints, but I remember a few of these reports coming up before, and
expect there may be more. Now, those of you still on the list will know
what the cause is, and how to respond. I haven't actually tested the patch
I posted yesterday, but until I (or someone) does, and applies it to the
code base, and a new release goes out, I expect the lack of a meaningful
"Can't execute" message will continue to be a problem. I'm sorry I didn't
look more carefully at the code when we switched from popen() to the
pipe(), fork() and execv() we're using now.

> Could I clarify one point with you? You wrote:
>
> > The problem was that the script began with "#!/usr/local/bin/perl",
> > which worked fine on the older system, but not on the newer one.
>
> was this simply because the Perl binary was in a different location on your
> newer system, or because there is a general problem on the newer system with
> this method of specifying the executable to run the script (which would be
> serious indeed!)?

The former, not the latter. Handling of "#!(some path)" in executable
scripts is built right into the kernel of Linux and most UNIX systems,
and is unlikely to go away or break in any current system.

The problem was entirely on our own systems. This was on the Manitoba
Unix User Group's web server (http://www.muug.mb.ca/). On the old system,
there was a symlink to /usr/bin/perl in /usr/local/bin, so the script
worked with the original heading. On the newer system, we've done away
with these extra symlinks, but I forgot to check and change the heading
of the script. (We actually had several perl scripts that needed changing,
but someone else was looking after all of those, and left just the htdig
setup to me.)

> I think the moral is that users must take great care in correctly
> configuring their external parser(s), and must check that they work from the
> command line.

I couldn't agree more. Meaningful error messages can help a lot, but
there's no substitute for great care and manual testing.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
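
The "bad interpreter" failure discussed in this thread could also be caught before running a dig by checking each script's "#!" line. What follows is only a rough sketch of such a check, not something htdig provides: the helper names shebang_interpreter, interpreter_is_executable and script_interpreter_ok are hypothetical, and it assumes a POSIX system with access(). It does not replace running the converter by hand from the command line.

```cpp
#include <fstream>
#include <string>
#include <unistd.h>

// Hypothetical helper: returns the interpreter path named on a "#!" line,
// or an empty string if the line is not a "#!" line.
std::string shebang_interpreter(const std::string &first_line)
{
    if (first_line.compare(0, 2, "#!") != 0)
        return "";
    const char *ws = " \t\r";
    std::size_t start = first_line.find_first_not_of(ws, 2);
    if (start == std::string::npos)
        return "";
    std::size_t end = first_line.find_first_of(ws, start);
    if (end == std::string::npos)
        return first_line.substr(start);
    return first_line.substr(start, end - start);
}

// Hypothetical helper: true if the path exists and this user may execute it.
bool interpreter_is_executable(const std::string &path)
{
    return !path.empty() && access(path.c_str(), X_OK) == 0;
}

// Hypothetical helper: true only if the script can be read, starts with a
// "#!" line, and that line names an interpreter that can be executed.
bool script_interpreter_ok(const std::string &script_path)
{
    std::ifstream in(script_path.c_str());
    std::string first_line;
    if (!in || !std::getline(in, first_line))
        return false;
    return interpreter_is_executable(shebang_interpreter(first_line));
}
```

On a machine without the /usr/local/bin/perl symlink, script_interpreter_ok() would return false for a script whose first line is "#!/usr/local/bin/perl", pointing at the same cause as the shell's "bad interpreter" message.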