#2 "Unknown input at line" on all lines of the xml file

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-02-21
Created: 2014-02-12
Creator: dmitrii347
Private: No

The command "pdftohtml -xml mybook.pdf" typically produces the following mybook.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.20.4">
...
</pdf2xml>

If you then run "pdfreflow mybook.xml", you get "Unknown input at line" for every line of mybook.xml, because of the following code in parse.c:

...
#define PDF2XML "<pdf2xml>"
...
struct array *parse_pdf2xml(FILE *file) {
    ...
    while ((ret = mygets(buf, BUFSIZE, file))) {
        ...
        } else if (!strncmp(cur, PDF2XML, xmllen)) {
            ...
        } else {
            fprintf(stderr, "Unknown input at line %d: %s", lineno, buf);
        }
    }
    ...
}

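If you remove producer="poppler" version="0.20.4" from mybook.xml by hand, everything works fine. The parser appears to compare the line against the literal "<pdf2xml>", so an opening tag that carries attributes never matches. I think the program needs to accept a pdf2xml tag that has attributes.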
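
One possible fix, sketched below, is to match on the tag name and then check the character that follows it, instead of comparing against the full literal "<pdf2xml>". The helper name is_open_tag is hypothetical and does not exist in pdfreflow. The sketch also assumes that the line pointer sits on the "<" of the tag, which the excerpt above does not confirm.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of pdfreflow: returns true when `line`
 * begins with the opening tag <name>, with or without attributes. */
static bool is_open_tag(const char *line, const char *name)
{
    assert(line != NULL && name != NULL);
    size_t len = strlen(name);

    /* Must start with '<' immediately followed by the tag name. */
    if (line[0] != '<' || strncmp(line + 1, name, len) != 0)
        return false;

    /* The character after the name decides whether this is really
     * <name>, <name/>, or <name attr="...">, and not e.g. <namefoo>. */
    unsigned char next = (unsigned char)line[1 + len];
    return next == '>' || next == '/' || isspace(next);
}
```

With such a helper, the branch in parse_pdf2xml could test is_open_tag(cur, "pdf2xml") rather than calling strncmp against PDF2XML, so that both the bare tag and the poppler form with producer and version attributes are accepted. The closing tag </pdf2xml> is deliberately not matched by this check.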
