I've come across a bug... well, not exactly a bug per se, but a non-working feature:
When you try to web-harvest a page that starts with a classic XHTML doctype header such as <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> (any Facebook page, for instance), you get an exception!
According to this page (http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic), the problem seems to be that the parser automatically tries to download the DTD "xhtml1-strict.dtd" from the W3C website because of the !DOCTYPE directive in the page.
I've been unable to find a way to bypass this limitation, but I would definitely like to see it corrected! Otherwise this great tool is useless for a lot of websites, and that would be a shame!
patrick, after 5 years I am facing the same issue. Did you figure out a way to work around this?
Thanks,
Julio.
See here
http://stackoverflow.com/questions/998280/dtd-download-error-while-parsing-xhtml-document-in-xom
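The workaround discussed in that StackOverflow thread boils down to telling the XML parser not to fetch the external DTD at all. I can't tell whether Web-Harvest exposes this setting, but if you can reach the underlying parser, here is a minimal sketch using the JDK's built-in JAXP/Xerces parser (the class name NoDtdParse and the sample markup are my own, for illustration only):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NoDtdParse {
    // Parses an XHTML string without fetching its external DTD;
    // returns the root tag name as a quick sanity check.
    static String rootTagOf(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Xerces-specific feature: skip downloading the external DTD entirely.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd",
            false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Belt and braces: resolve any remaining external entity to an
        // empty stream instead of hitting www.w3.org.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body><p>hello</p></body></html>";
        // Parses cleanly with no network access despite the W3C DOCTYPE.
        System.out.println(rootTagOf(page));
    }
}
```

The load-external-dtd feature URI is recognized by Xerces (which backs the default JDK parser); the EntityResolver line is a fallback that also works on parsers that don't support that feature.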