#60 parsing stops after first multibyte character

Unassigned
closed
None
2022-04-09
2011-11-11
Anonymous
No

As of 1.5, Simple HTML DOM parses HTML byte by byte and is not multibyte aware. As a quick fix, I added one line to the load function that converts the character encoding from UTF-8 to 'HTML-ENTITIES', which effectively converts the document to a single-byte encoding and allows parsing to continue. However, this assumes your input is UTF-8. I've attached a patch file to show what I did. I believe the correct solution is to modify the parser to be multibyte aware: I recommend converting all input to UTF-8 and modifying the parser to handle UTF-8 according to the Wikipedia article at http://en.wikipedia.org/wiki/UTF-8#Design.

Cheers,
Keith
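
For anyone reading this ticket later, the two approaches described in the report can be sketched roughly as follows. This is an illustration in Python, not the attached patch and not Simple HTML DOM code. The function names (to_numeric_entities, utf8_sequence_length, split_utf8) are made up for this example, and the first function only approximates the 'HTML-ENTITIES' conversion, since it always emits numeric references.

```python
from typing import List


def to_numeric_entities(text: str) -> bytes:
    """Workaround idea: replace every non-ASCII character with a numeric
    character reference so a byte-by-byte parser only ever sees ASCII."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def utf8_sequence_length(lead: int) -> int:
    """Multibyte-aware idea: the lead byte of a UTF-8 sequence tells the
    parser how many bytes belong to the current character."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (0xC0 and 0xC1 would be overlong, so rejected)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped at U+10FFFF
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#04x}")


def split_utf8(data: bytes) -> List[bytes]:
    """Walk a UTF-8 byte string one character at a time, not one byte."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunk = data[i:i + n]
        chunk.decode("utf-8")  # raises if continuation bytes are malformed
        chunks.append(chunk)
        i += n
    return chunks


if __name__ == "__main__":
    print(to_numeric_entities("café"))
    print(split_utf8("aé€".encode("utf-8")))
```

The first function keeps the parser unchanged but, like the quick fix above, depends on knowing the input encoding up front. The second shows the kind of lead-byte check a multibyte-aware parser would need wherever it advances through the input.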

Discussion

  • LogMANOriginal

    LogMANOriginal - 2019-04-18
    • Labels: --> charset
     
  • LogMANOriginal

    LogMANOriginal - 2019-04-19

    Ticket moved from /p/simplehtmldom/bugs/89/

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-09
    • labels: charset -->
    • status: open --> closed
    • assigned_to: LogMANOriginal
    • Group: --> Unassigned
     
