Re: [Httplib2-discuss] Best way of retrieving a page as a unicode string?
From: Joe G. <jo...@bi...> - 2007-12-08 22:10:33
On Dec 8, 2007 4:54 PM, Simon Willison <si...@si...> wrote:
> Is there a supported way of getting hold of a page as a Python unicode
> string with httplib2? As far as I can tell I need to do this:
>
> import httplib2
> h = httplib2.Http()
> headers, content = h.request('http://simonwillison.net/', 'GET')
> content_type = headers.get('content-type', '')
> if 'charset=' in content_type:
>     junk, charset = content_type.split('charset=', 1)
> else:
>     charset = 'iso-8859-1'
> unicode_content = content.decode(charset)
Oh, if it were only that simple :) For example, look at the
charset sniffing rules for JSON (RFC 4627) and XML 1.0
<http://www.w3.org/TR/REC-xml/#sec-guessing>. Your
best bet will probably be to use:
<http://chardet.feedparser.org/>
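To sketch how that might fit together — `decode_body` below is just an
illustrative name, not part of httplib2, and chardet is treated as
optional:

```python
try:
    import chardet  # optional: http://chardet.feedparser.org/
except ImportError:
    chardet = None

def decode_body(headers, content):
    """Decode an HTTP response body to unicode, preferring the declared charset."""
    content_type = headers.get('content-type', '')
    if 'charset=' in content_type:
        charset = content_type.split('charset=', 1)[1].split(';')[0].strip()
        try:
            return content.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or lying charset declaration; fall through
    if chardet is not None:
        guess = chardet.detect(content).get('encoding')
        if guess:
            try:
                return content.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
    return content.decode('iso-8859-1')  # HTTP's historical default for text/*
```

Note the ordering: the server's declared charset wins when it works,
statistical detection is only a fallback, and iso-8859-1 is the
last resort rather than the first guess.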
At the very least I should have a link to chardet in the
httplib2 documentation. Not sure if httplib2 should do
more than that.
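For what it's worth, the first layer of the XML 1.0 sniffing rules linked
above can be approximated like this (a simplification: the spec's full
autodetection table also covers UCS-4, EBCDIC, and more BOM variants):

```python
import re

def sniff_xml_encoding(raw):
    """Guess an XML document's encoding from its BOM or declaration (simplified)."""
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'   # UTF-8 byte order mark
    if raw.startswith(b'\xfe\xff') or raw.startswith(b'\xff\xfe'):
        return 'utf-16'  # UTF-16 byte order mark, either endianness
    # Fall back to the encoding pseudo-attribute in the XML declaration.
    m = re.match(rb'<\?xml[^?>]*encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']', raw)
    if m:
        return m.group(1).decode('ascii')
    return 'utf-8'  # the XML default with no BOM and no declaration
```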
-joe
>
> Even the above doesn't look like it would properly solve the problem
> (I'm not sure if that's the best assumption for a default encoding,
> and I should probably be catching any unicode decoding exceptions and
> falling back on something else if they occur). Shouldn't this be
> handled by the library in some way?
>
> Cheers,
>
> Simon Willison
>
> _______________________________________________
> Httplib2-discuss mailing list
> Htt...@li...
> https://lists.sourceforge.net/lists/listinfo/httplib2-discuss
>
--
Joe Gregorio http://bitworking.org