From: jarpis <ja...@gm...> - 2010-12-27 10:02:28
Hi all.

I'm having a problem using httpclient with XQuery to request HTML documents from outside eXist. If a document triggers httpclient's XHTML mode, its encoding will break. An XHTML doctype or a mere HTML entity is enough to change the mode. If httpclient deems the document to be plain HTML, the encoding is handled as it should be.

Hopefully my examples survive intact, but the problem is easy to reproduce. The example document probably needs to be outside eXist (storing documents with HTML entities in eXist is another problem for me, but the entities will likely be converted to Unicode anyway). My examples are from the most recent eXist, 1.4.0 rev 10440.

Given the following example XQuery:

    import module namespace http='http://exist-db.org/xquery/httpclient';
    http:get(xs:anyURI('http://localhost:8000/testdocument'), true(), <headers/>)

and the following HTML document:

    <html><head></head><body> ääkkösiä</body></html>

eXist will respond:

    <httpclient:response statusCode="200">
      <httpclient:headers>
        <httpclient:header name="Date" value="Mon, 27 Dec 2010 09:24:30 GMT"/>
        <httpclient:header name="Server" value="WSGIServer/0.1 Python/2.6.5"/>
        <httpclient:header name="Content-Type" value="text/html; charset=utf-8"/>
      </httpclient:headers>
      <httpclient:body mimetype="text/html; charset=utf-8" type="xhtml">
        <html>
          <head/>
          <body> ääkkösiä</body>
        </html>
      </httpclient:body>
    </httpclient:response>

However, with a document that has an explicit meta tag for the encoding:

    <html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body> ääkkösiä</body></html>

eXist responds correctly:

    <httpclient:response statusCode="200">
      <httpclient:headers>
        <httpclient:header name="Date" value="Mon, 27 Dec 2010 09:32:12 GMT"/>
        <httpclient:header name="Server" value="WSGIServer/0.1 Python/2.6.5"/>
        <httpclient:header name="Content-Type" value="text/html; charset=utf-8"/>
      </httpclient:headers>
      <httpclient:body mimetype="text/html; charset=utf-8" type="xhtml">
        <html>
          <head>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
          </head>
          <body> ääkkösiä</body>
        </html>
      </httpclient:body>
    </httpclient:response>

For a document with neither an HTML entity nor a META tag, eXist also responds correctly:

    <httpclient:response statusCode="200">
      <httpclient:headers>
        <httpclient:header name="Date" value="Mon, 27 Dec 2010 09:49:20 GMT"/>
        <httpclient:header name="Server" value="WSGIServer/0.1 Python/2.6.5"/>
        <httpclient:header name="Content-Type" value="text/html; charset=utf-8"/>
      </httpclient:headers>
      <httpclient:body mimetype="text/html; charset=utf-8" type="xml">
        <html>
          <head/>
          <body>ääkkösiä</body>
        </html>
      </httpclient:body>
    </httpclient:response>

Now the document is seen as XML.

I already tested with util:html-parse and it does not have this issue, but with it the document first needs to be stored in the eXist database as text, which is not feasible for my project.

I'd like to know whether you can think of a workaround for this, or whether this is actually a bug that can be fixed.

Thank you.

Jussi Arpalahti
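
P.S. For anyone who wants to reproduce the setup: below is a minimal sketch of a server for the test document, assuming Python 3 and its standard wsgiref module. The Server header in the responses shows a WSGIServer, but this is not the exact script behind those responses. The names TEST_DOCUMENT, app and serve are illustrative only, and &nbsp; is an assumed stand-in for the HTML entity that triggers the XHTML mode.

```python
from wsgiref.simple_server import make_server

# Hypothetical test document. The &nbsp; entity is an assumption,
# standing in for whichever HTML entity the real document contains.
TEST_DOCUMENT = "<html><head></head><body>&nbsp;ääkkösiä</body></html>"


def app(environ, start_response):
    """Serve the test document as UTF-8 encoded HTML for every request path."""
    body = TEST_DOCUMENT.encode("utf-8")
    start_response("200 OK", [
        # Same Content-Type header as in the responses quoted above.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="localhost", port=8000):
    """Run the server until interrupted; port 8000 matches the XQuery URL."""
    make_server(host, port, app).serve_forever()
```

Calling serve() starts the server, after which the XQuery above can be pointed at http://localhost:8000/testdocument. Swapping TEST_DOCUMENT for the variant with the meta tag, or for one without any entity, should give the other two cases.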