Re: [Httplib2-discuss] Best way of retrieving a page as a unicode string?
From: Joe G. <jo...@bi...> - 2007-12-08 22:10:33
On Dec 8, 2007 4:54 PM, Simon Willison <si...@si...> wrote:
> Is there a supported way of getting hold of a page as a Python unicode
> string with httplib2? As far as I can tell I need to do this:
>
> import httplib2
> h = httplib2.Http()
> headers, content = h.request('http://simonwillison.net/', 'GET')
> content_type = headers.get('content-type', '')
> if 'charset' in content_type:
>     junk, charset = content_type.split('charset=', 1)
> else:
>     charset = 'iso-8859-1'
> unicode_content = content.decode(charset)

Oh, if it were only that simple :)

For example, look at the charset sniffing rules for JSON (RFC 4627) and
XML 1.0 <http://www.w3.org/TR/REC-xml/#sec-guessing>.

Your best bet will probably be to use:

<http://chardet.feedparser.org/>

At the very least I should have a link to chardet in the httplib2
documentation. Not sure if httplib2 should do more than that.

-joe

> Even the above doesn't look like it would properly solve the problem
> (I'm not sure if that's the best assumption for a default encoding,
> and I should probably be catching any unicode decoding exceptions and
> falling back on something else if they occur). Shouldn't this be
> handled by the library in some way?
>
> Cheers,
>
> Simon Willison

-- 
Joe Gregorio    http://bitworking.org
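The RFC 4627 sniffing rules Joe points at can be sketched in a few lines. Since a JSON text begins with two ASCII characters, the pattern of NUL bytes in the first four octets identifies the encoding family. This is a minimal Python sketch; the function name `sniff_json_encoding` is made up for illustration:

```python
def sniff_json_encoding(data):
    """Guess the encoding of a JSON document per RFC 4627, section 3.

    The first two characters of a JSON text are ASCII, so the NUL-byte
    pattern of the first four octets reveals the encoding.
    """
    if len(data) < 4:
        return 'utf-8'
    b = data[:4]
    if b[0] == 0 and b[1] == 0 and b[2] == 0:
        return 'utf-32-be'   # 00 00 00 xx
    if b[0] == 0 and b[2] == 0:
        return 'utf-16-be'   # 00 xx 00 xx
    if b[1] == 0 and b[2] == 0 and b[3] == 0:
        return 'utf-32-le'   # xx 00 00 00
    if b[1] == 0 and b[3] == 0:
        return 'utf-16-le'   # xx 00 xx 00
    return 'utf-8'
```

The XML 1.0 autodetection rules linked above work similarly but also have to account for byte-order marks and encoding declarations, which is part of why a general-purpose library like httplib2 avoids doing this per content type.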
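Putting the two replies together, the header-based approach with a chardet fallback and a decoding-error fallback might look like the sketch below. The helper name `decode_content` is hypothetical, and chardet's `detect()` return value (a dict with an `'encoding'` key) is assumed from its documentation:

```python
# A sketch, not httplib2 API: decode an HTTP response body to unicode.
# Uses the declared charset when present, otherwise lets chardet
# (http://chardet.feedparser.org/) guess, and finally falls back to
# ISO-8859-1 so a bad or unknown charset never raises.
try:
    import chardet  # third-party; treated as optional here
except ImportError:
    chardet = None

def decode_content(headers, content):
    """Decode an HTTP response body (bytes) to a unicode string."""
    content_type = headers.get('content-type', '')
    if 'charset=' in content_type:
        charset = content_type.split('charset=', 1)[1].strip()
    elif chardet is not None:
        # No declared charset: guess from the raw bytes.
        charset = chardet.detect(content).get('encoding') or 'iso-8859-1'
    else:
        charset = 'iso-8859-1'
    try:
        return content.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong guess: degrade gracefully.
        return content.decode('iso-8859-1', 'replace')
```

This addresses Simon's two worries directly: the default-encoding assumption is only a last resort, and decode errors are caught rather than propagated.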