
#4 encoding. of course it is encoding...

open
nobody
monitoring (1)
2018-04-13
2018-04-12
Anonymous
No

Originally created by: RMHogervorst
Originally owned by: hrbrmstr

It would be very nice if the text parsing defaulted to UTF-8, because some of the text I get back doesn't look right. For example, in 1001 Nights:

Generous Dealing of Yahya Son of KhÃ\u0081Lid with A Man Who Forged A Letter in His Name.

should be

Generous Dealing of Yahya Son of KhÁLid with A Man Who Forged A Letter in His Name.

Discussion

  • Anonymous

    Anonymous - 2018-04-12

    Originally posted by: RMHogervorst

      The Kingâ\u0080\u0099s Daughter and the Ape
    

    should be

    The King’s Daughter and the Ape.
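
A minimal sketch of how this kind of garbling can arise, assuming the file's bytes are valid UTF-8 but get decoded one byte at a time as Latin-1 somewhere along the way. The variable names are illustrative only and do not come from pubcrawl.

```r
# UTF-8 encoding of the right single quotation mark U+2019
raw_bytes <- as.raw(c(0xE2, 0x80, 0x99))

# Correct reading: declare the three bytes as one UTF-8 character
good <- rawToChar(raw_bytes)
Encoding(good) <- "UTF-8"

# Wrong reading: treat each byte as a Latin-1 character, then store as UTF-8
bad <- iconv(rawToChar(raw_bytes), from = "latin1", to = "UTF-8")

utf8ToInt(good)  # a single code point
utf8ToInt(bad)   # three code points: U+00E2 ("â"), U+0080, U+0099
```

The three code points in `bad` match the `â\u0080\u0099` sequence seen in the garbled title, which is consistent with a UTF-8 to Latin-1 mix-up rather than a damaged file.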
    
     
  • Anonymous

    Anonymous - 2018-04-12

    Originally posted by: RMHogervorst

    This is the file that doesn't work (had to zip it, because github doesn't accept epub)

    arab.zip (github.com)

     
  • Anonymous

    Anonymous - 2018-04-12

    Originally posted by: RMHogervorst

    I extracted a few parts, and the HTML files within are encoded correctly; that is, there is a charset tag in the head:

    <meta charset="utf-8" />
    

    So I guess it could read that tag, or default to utf-8
    In https://github.com/hrbrmstr/pubcrawl/blob/master/R/clean-text.R#L5:

    if (!inherits(doc, "html_document")) doc <- xml2::read_html(doc)
    

    read_html might need the encoding argument (defaults to "").
    If I read the HTML file in directly with rvest::html_text(xml2::read_html("file.html")), it already defaults to UTF-8. So perhaps there is implicit recoding when xslt::xml_xslt is applied to the data?
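
The "read that tag, or default to utf-8" idea could be sketched in base R roughly as follows. `detect_charset()` is a hypothetical helper written for illustration; it is not part of pubcrawl, xml2, or rvest, and it only looks for a simple `charset=` declaration near the start of the file.

```r
# Hypothetical helper: find a charset declaration in raw HTML bytes,
# falling back to a default when none is present.
detect_charset <- function(html_bytes, default = "UTF-8") {
  # The declaration is ASCII, so scanning the first 1024 bytes is enough
  n <- min(length(html_bytes), 1024L)
  head_txt <- rawToChar(html_bytes[seq_len(n)])
  m <- regmatches(
    head_txt,
    regexec("charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)", head_txt,
            ignore.case = TRUE, useBytes = TRUE)
  )[[1]]
  if (length(m) == 2L) m[2] else default
}

html <- '<html><head><meta charset="utf-8" /></head><body>The King\u2019s Daughter</body></html>'
bytes <- charToRaw(enc2utf8(html))

enc <- detect_charset(bytes)
txt <- iconv(rawToChar(bytes), from = enc, to = "UTF-8")
```

In practice the detected value would more likely be handed to the `encoding` argument of `xml2::read_html()` than to `iconv()`; the base R version is used here only to keep the sketch free of package dependencies.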

     
  • Anonymous

    Anonymous - 2018-04-12

    Originally posted by: RMHogervorst

    Nope, that's not it (xml2::read_html(doc) would also always default to UTF-8).

     
  • Anonymous

    Anonymous - 2018-04-12

    Originally posted by: hrbrmstr

    So, the default was UTF-8, but I added a pass-through encoding parameter wherever I could, and it still looks as though you're going to have to post-process to handle Latin1 or cp1252 (etc.) encodings. For example:

    x <- epub_to_text("~/Downloads/b97b.epub", "Latin1")
    
    z <- x$content[1] # just to make it easier to debug in my session
    
    substr(z, 1, 1000) # I added the hard line breaks
    
    [1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated and annotated by Richard F. Burton; illustrated by Albert Letchford\n    Contents\n      Top\n\tEditorâ\u0080\u0099s Note to this Web 
    Edition\n\tDedications to the Original Ten Volumes\n\tThe Translatorâ\u0080\u0099s Foreword.\n\tThe Book of The Thousand Nights and a 
    Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykhâ\u0080\u0099s Story.\n\tThe Second Shaykhâ\u0080\u0099s Story.\n\tThe 
    Third Shaykhâ\u0080\u0099s Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and 
    his Falcon.\n\tThe Tale of the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled 
    Prince.\n\tThe Porter and the Three Ladies of Baghdad.\n\tThe First Kalandarâ\u0080\u0099s Tale.\n\tThe Second Kalandarâ\u0080\u0099s 
    Tale.\n\tThe Tale of the Envier and the Envied.\n\tThe Third Kalandarâ\u0080\u0099s Tale.\n\tThe Eldest Ladyâ\u0080\u0099s 
    Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of Nur Al-Din and his S"
    

    In theory, it should have dealt with ^^ properly, since I (honest!) passed the encoding in all the way through, and I even do a final iconv() to the requested encoding on the column.

    But, if you do (this text is Latin1 btw):

    substr(iconv(z, "", to="Latin1"), 1, 1000)
    
    [1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated 
    and annotated by Richard F. Burton; illustrated by Albert Letchford\n    Contents\n      Top\n\tEditor’s Note to this Web 
    Edition\n\tDedications to the Original Ten Volumes\n\tThe Translator’s Foreword.\n\tThe Book of The Thousand Nights and a 
    Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykh’s Story.\n\tThe Second Shaykh’s Story.\n\tThe Third Shaykh’s 
    Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and his Falcon.\n\tThe Tale of 
    the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled Prince.\n\tThe Porter and the 
    Three Ladies of Baghdad.\n\tThe First Kalandar’s Tale.\n\tThe Second Kalandar’s Tale.\n\tThe Tale of the Envier and the 
    Envied.\n\tThe Third Kalandar’s Tale.\n\tThe Eldest Lady’s Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of 
    Nur Al-Din and his Son.\n\tThe Hunchback"
    

    it works.

    I'll keep this open since I'd like to provide robust support in the long run, but at least the iconv() should work ex post facto for the edge cases.
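
One possible reading of why the iconv() call helps: converting the garbled string to Latin-1 turns each wrong character back into the single byte it came from, and those bytes are then read as UTF-8 again. A minimal sketch of that post-processing step, assuming the input is UTF-8 text that was mis-decoded as Latin-1, is below. `fix_mojibake()` is a hypothetical helper, not a pubcrawl function.

```r
# Hypothetical helper: undo "UTF-8 bytes decoded as Latin-1" garbling.
# Strings that cannot be repaired this way are returned unchanged.
fix_mojibake <- function(x) {
  # Map each character back to one byte; NA if a character is outside Latin-1
  bytes <- iconv(x, from = "UTF-8", to = "latin1", mark = FALSE)
  # Declare the recovered bytes to be UTF-8
  Encoding(bytes) <- "UTF-8"
  # Accept the repair only where it yields valid UTF-8
  ok <- !is.na(bytes) & validUTF8(bytes)
  x[ok] <- bytes[ok]
  x
}

bad   <- "The King\u00e2\u0080\u0099s Daughter and the Ape"
fixed <- fix_mojibake(bad)
```

Unlike `iconv(z, "", to = "Latin1")`, this version does not depend on the session's native encoding, and it leaves genuine Latin-1-range characters such as "é" alone because the round trip would not produce valid UTF-8.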

     
  • Anonymous

    Anonymous - 2018-04-12

    Originally posted by: hrbrmstr

    (just saw your extended comments)

    aye, i even pass encoding along to it and ensure it's a raw vector when processing and still no-go.

    something (IMO) "weird" is happening either as a result of read_html() OR in tibble-land causing some issues but iconv() will work ex post facto.

     
  • Anonymous

    Anonymous - 2018-04-13
     
