#60 parsing stops after first multibyte character

Milestone: Unassigned

Status: closed

Owner: LogMANOriginal

Labels: None

Updated: 2022-04-09

Created: 2011-11-11

Creator: Anonymous

Private: No

As of 1.5, Simple HTML DOM parses HTML byte by byte and is not multibyte aware. As a quick fix, I added one line to the load function that converts the character encoding from UTF-8 to 'HTML-ENTITIES', which effectively converts the document to a single-byte encoding and allows parsing to continue. However, this assumes your input is UTF-8. I've attached a patch file to show what I did. That said, I believe the correct solution is to make the parser itself multibyte aware. I recommend converting all input to UTF-8 and modifying the parser to handle UTF-8 according to the Wikipedia article at http://en.wikipedia.org/wiki/UTF-8#Design.

Cheers,
Keith
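
For readers without access to the attached patch, here is a minimal sketch of the kind of workaround described above. It is not the patch itself. The helper name `utf8_to_ascii_entities` is hypothetical, and the sketch assumes the input is already valid UTF-8 and that the mbstring extension is loaded. It uses `mb_encode_numericentity` instead of the 'HTML-ENTITIES' encoding name, since mbstring deprecated that name in PHP 8.2; the effect is similar in that every non-ASCII character becomes an ASCII-only entity before a byte-oriented parser sees it.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: rewrite every non-ASCII
// character of a UTF-8 string as a decimal numeric entity so that the result
// contains only single-byte (ASCII) characters.
function utf8_to_ascii_entities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    // 0x80..0x10FFFF covers everything outside ASCII; ASCII markup such as
    // "<", ">" and "&" is left untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $map, 'UTF-8');
}

// Example input: "\u{00E9}" is the two-byte UTF-8 sequence for "é".
$html = "<p>caf\u{00E9}</p><p>second</p>";
$safe = utf8_to_ascii_entities($html);

echo $safe, PHP_EOL; // <p>caf&#233;</p><p>second</p>
```

The converted string could then be handed to the parser in place of the raw input. Text read back from the parsed nodes will contain entities, so a caller would likely need something like `html_entity_decode($text, ENT_QUOTES, 'UTF-8')` to recover the original characters.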

Discussion

Anonymous - 2011-11-11

Patch to allow multibyte input

simple_html_dom.php.patch

LogMANOriginal - 2019-04-18

Labels: --> charset

LogMANOriginal - 2019-04-19

Ticket moved from /p/simplehtmldom/bugs/89/

LogMANOriginal - 2022-04-09

labels: charset -->

status: open --> closed

assigned_to: LogMANOriginal

Group: --> Unassigned
