"Unknown input at line" on all lines of the xml file
The command "pdftohtml -xml mybook.pdf" typically produces the following mybook.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.20.4">
...
</pdf2xml>
If you then run "pdfreflow mybook.xml", you get "Unknown input at line" for every line of mybook.xml because of the following code in parse.c:
...
#define PDF2XML "<pdf2xml>"
...
struct array *parse_pdf2xml(FILE *file) {
    ...
    while ((ret = mygets(buf, BUFSIZE, file))) {
        ...
        } else if (!strncmp(cur, PDF2XML, xmllen)) {
            ...
        } else {
            fprintf(stderr, "Unknown input at line %d: %s", lineno, buf);
        }
    }
    ...
}
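If you remove producer="poppler" version="0.20.4" from mybook.xml by hand, everything works fine. The comparison against the literal "<pdf2xml>" (closing bracket included) apparently fails as soon as attributes follow the tag name. I think the program needs to accept a pdf2xml tag that carries attributes.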
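One possible way to accept attributes is sketched below. This is only an illustration, not the actual pdfreflow fix: the helper name is_pdf2xml_open_tag and the macro PDF2XML_PREFIX are hypothetical, and the sketch assumes that cur points at the first non-blank character of the line, as the excerpt above suggests. It matches the tag name without the closing bracket and then requires either ">" or whitespace right after it, so "<pdf2xml>" and "<pdf2xml producer=...>" both match while a longer tag name does not.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical replacement for the PDF2XML literal: the tag name
 * without the closing '>' so that attributes may follow it. */
#define PDF2XML_PREFIX "<pdf2xml"

/* Hypothetical helper, not part of pdfreflow.
 * Returns 1 if cur starts with a <pdf2xml ...> opening tag,
 * with or without attributes, and 0 otherwise. */
int is_pdf2xml_open_tag(const char *cur) {
    size_t len = strlen(PDF2XML_PREFIX);
    unsigned char next;

    if (strncmp(cur, PDF2XML_PREFIX, len) != 0)
        return 0;

    /* The character after the tag name must end the tag or
     * separate it from its attributes. */
    next = (unsigned char)cur[len];
    return next == '>' || isspace(next);
}
```

In parse_pdf2xml the existing branch could then, under the same assumptions, test is_pdf2xml_open_tag(cur) instead of comparing against PDF2XML with strncmp.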