encoding. of course it is encoding...
Convert 'epub' Files to Text
Brought to you by:
brudis
Originally created by: RMHogervorst
Originally owned by: hrbrmstr
It would be very nice if text parsing defaulted to UTF-8, because I have a book where the output doesn't look right (1001 Nights):
Generous Dealing of Yahya Son of KhĂ\u0081Lid with A Man Who Forged A Letter in His Name.
should be
Generous Dealing of Yahya Son of KhÁLid with A Man Who Forged A Letter in His Name.
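The extra character in the garbled title is consistent with UTF-8 bytes being decoded one byte at a time. A minimal base-R sketch of that mechanism follows. It assumes the intended character is Á (U+00C1) and that the wrong encoding is Latin-1; which single-byte encoding is actually involved is not established in this thread.

```r
# Intended text; "\u00c1" is Á. Assumed from the garbled bytes, not read from the epub itself.
intended <- "Kh\u00c1Lid"

# In UTF-8, Á is stored as two bytes: c3 81.
utf8_bytes <- charToRaw(enc2utf8(intended))

# Decoding those bytes as Latin-1 (the assumed wrong encoding) turns
# each byte into its own character, so Á becomes two characters.
garbled <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

nchar(intended)  # 6 characters
nchar(garbled)   # 7 characters
```

The second of the two characters is U+0081, an invisible control character, which matches the `\u0081` visible in the garbled title.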
Originally posted by: RMHogervorst
This is the file that doesn't work (had to zip it, because github doesn't accept epub)
arab.zip (github.com)
Originally posted by: RMHogervorst
I extracted a few parts, and the HTML files within are encoded correctly; that is, there is a charset tag in the
So I guess it could read that tag, or default to UTF-8.
In https://github.com/hrbrmstr/pubcrawl/blob/master/R/clean-text.R#L5:
read_html might need the encoding argument (defaults to "")
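A small sketch of what passing the encoding explicitly could look like, assuming the `xml2` package is installed. The temporary file and its contents are made up for illustration and are not taken from the epub in question.

```r
library(xml2)

# Hypothetical HTML fragment with a non-ASCII character and no charset declaration.
html <- "<html><body><p>Kh\u00e1lid</p></body></html>"

# Write the exact UTF-8 bytes to a temporary file.
tmp <- tempfile(fileext = ".html")
writeBin(charToRaw(enc2utf8(html)), tmp)

# Tell the parser how to decode the file instead of relying on the default of "".
doc <- read_html(tmp, encoding = "UTF-8")
txt <- xml_text(xml_find_first(doc, "//p"))
txt
```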
If I read the HTML file in directly with `rvest::html_text(xml2::read_html("file.html"))`, it already defaults to UTF-8. So perhaps there is implicit recoding when `xslt::xml_xslt` is applied to the data?
Originally posted by: RMHogervorst
Nope, that's not it (`xml2::read_html(doc)` would also always default to UTF-8).
Originally posted by: hrbrmstr
So, the default was UTF-8, but I added a pass-through `encoding` parameter wherever I could, and it still looks as though you're going to have to post-process to handle Latin1 or cp1252 (etc.) encodings. Vis a vis:
In theory, it should have dealt with ^^ properly, since it (honest!) passed it in all the way through and I even do a final `iconv()` to `encoding` on the column.
But, if you do (this text is Latin1 btw):
it works.
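A minimal sketch of this kind of after-the-fact conversion, not the package's own code. It assumes a hypothetical data frame `chapters` whose `title` column holds Latin-1 bytes, and converts that column with base R's `iconv()`.

```r
# "Khálid" as Latin-1 bytes: á is the single byte e1.
latin1_title <- rawToChar(as.raw(c(0x4b, 0x68, 0xe1, 0x6c, 0x69, 0x64)))

# Hypothetical stand-in for the parsed result, one row per chapter.
chapters <- data.frame(title = latin1_title, stringsAsFactors = FALSE)

# Convert the whole column from Latin-1 to UTF-8 after parsing.
chapters$title <- iconv(chapters$title, from = "latin1", to = "UTF-8")

Encoding(chapters$title)  # the result is marked as UTF-8
```

For cp1252 text the same call would take `from = "CP1252"` (the name `"WINDOWS-1252"` is also commonly accepted, depending on the platform's iconv).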
I'll keep this open since I'd like to provide robust support in the long run, but at least the `iconv()` should work ex post facto for the edge cases.
Originally posted by: hrbrmstr
(just saw your extended comments)
Aye, I even pass `encoding` along to it and ensure it's a raw vector when processing, and still no-go. Something (IMO) "weird" is happening, either as a result of `read_html()` or in tibble-land, causing some issues, but `iconv()` will work ex post facto.