[Httplib2-discuss] Best way of retrieving a page as a unicode string?

Is there a supported way of getting hold of a page as a Python unicode
string with httplib2? As far as I can tell I need to do this:

import httplib2
h = httplib2.Http()
headers, content = h.request('http://simonwillison.net/', 'GET')
content_type = headers.get('content-type', '')
if 'charset=' in content_type:
    # Split once, then drop any trailing parameters and surrounding quotes.
    charset = content_type.split('charset=', 1)[1].split(';')[0].strip(' "\'')
else:
    charset = 'iso-8859-1'
unicode_content = content.decode(charset)

Even the above doesn't look like it properly solves the problem: I'm
not sure iso-8859-1 is the best default encoding to assume, and I
should probably be catching any unicode decoding exceptions and
falling back on something else if they occur. Shouldn't this be
handled by the library in some way?
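
For illustration, the fallback described above could be sketched roughly as follows. This is only a sketch, not something httplib2 provides: decode_body is a hypothetical helper name, the iso-8859-1 default is carried over from the snippet above, and the charset is parsed with the standard library's email.message.Message rather than by hand. It takes the Content-Type header value and the raw body bytes, so it can be tried without making a request.

```python
from email.message import Message


def decode_body(content_type, body, default="iso-8859-1"):
    """Hypothetical helper: decode raw body bytes using the declared charset.

    Falls back to `default` when no charset is declared, when the declared
    charset is unknown, or when the bytes do not decode with it.
    """
    # Let the standard library parse the header parameters (handles quoting
    # and extra parameters after charset).
    msg = Message()
    msg["Content-Type"] = content_type or "application/octet-stream"
    charset = msg.get_content_charset() or default
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: use the default and
        # substitute any bytes that still cannot be decoded.
        return body.decode(default, errors="replace")
```

With httplib2 this would presumably be called as decode_body(headers.get('content-type', ''), content). Whether iso-8859-1 is the right default remains the open question; it is used here only because every byte sequence decodes with it.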

Cheers,

Simon Willison