PHP Simple HTML DOM Parser / Feature Requests / #52 Always decode content values from the DOM tree

Anonymous - 2019-04-24

$html = " foo ";
$html = str_get_html($html);
$html->find("span", 0)->remove();
printf("(%s)\n", $html->find("p", 0)->plaintext);

$html = str_get_html(' <meta name="description" content=" bar ">');
printf("(%s)\n", $html->find("meta[name=description]", 0)->content);

Last edit: Anonymous 2019-04-24
- Anonymous - 2019-04-26
 
 Any news on trimming?
 
- LogMANOriginal - 2019-04-27
 
 I see what you mean. There are actually two parts to it. Your first example is fixed with [4d68ba]. Whitespace is now removed when using ->plaintext but can optionally be retained using ->text(false). Let me know if you find other examples where it doesn't fully work.
 
 Your second example with the attribute is not affected by this change. If you take a look at the document in your browser, you'll notice that it also doesn't remove whitespace because attributes are meant to be handled literally. The only exception is the class attribute, which is already trimmed by the parser.
 
 In these cases I think it's best to simply trim manually.
 
 $html = str_get_html(' <meta name="description" content=" bar ">');
 printf("(%s)\n", trim($html->find("meta[name=description]", 0)->content));
 
 Related
 
 Commit: [4d68ba]
 
 - Anonymous - 2019-04-27
 
 Thanks for the fix!
 
 Does the W3C HTML specification say that whitespace inside quotes is in fact part of the actual value? If so, I sort of concede, but not very happily, I must admit.
 
 Though whitespace inside quotes is no doubt due to sloppiness on the page author's part, in the real world you always want trimmed values to avoid messing up database fields, plain-text terminal output, markup, and other sources that would carry the whitespace with them.
 
 I appreciate you're trying to follow standards, but would you be willing to consider trimming the values anyway, out of sheer pragmatism if nothing else?
 
 - LogMANOriginal - 2019-04-27
 
 Unfortunately the HTML specification is not that easy to comprehend. For a full overview you need to look at each attribute individually, which is too much to ask for (unless you do this professionally). Here is the specification if you are interested: https://www.w3.org/TR/html/
 
 That said, there are lots of resources on the web which go into more details on these topics. Here is one I found on whitespace in attribute values: https://www.impressivewebs.com/leading-trailing-spaces-html-attribute-values/
 
 > I appreciate you're trying to follow standards, but would you be willing to consider trimming the values anyway, out of sheer pragmatism if nothing else?
 
 It's less about following the standard and more about what I see in the browser vs. what the parser returns. If we remove whitespace from attribute values, it will affect CSS selectors (which wouldn't work anymore). This can cause confusion if people simply copy selectors from the CSS files or from the browser debug tools.
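
To illustrate the selector concern, here is a minimal sketch in plain PHP. It does not call the parser at all: `attribute_equals` is a hypothetical helper that only stands in for an exact-match `[attr=value]` comparison, and the selector value is assumed to be copied verbatim from a stylesheet.

```php
<?php
// Hypothetical stand-in for an exact-match attribute selector ([attr=value]).
// This is an assumption for illustration, not code from simple_html_dom.
function attribute_equals(string $domValue, string $selectorValue): bool
{
    return $domValue === $selectorValue;
}

// Value as written in a stylesheet rule such as meta[content=" bar "].
$selectorValue = ' bar ';

// Literal DOM value (whitespace kept): the copied selector still matches.
var_dump(attribute_equals(' bar ', $selectorValue));        // bool(true)

// Trimmed DOM value: the same copied selector no longer matches.
var_dump(attribute_equals(trim(' bar '), $selectorValue));  // bool(false)
```

The mismatch only disappears if the DOM value and the selector value are treated the same way, either both literal or both trimmed.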
 
 Anyway, if you want to try with full trimming on all attributes, simply remove the condition at line 2180. Let me know if this is what you expect.
 
 Oh yeah, before I forget, there is a new commit available which ensures that all block and inline level elements are taken into account when returning plaintext. It also got rid of the $trim option I added before. [ced5ab]
 
 This example now also works. You should compare results from the previous commit to this one.
 
 <?php
 require_once 'simple_html_dom.php';

 $str = 'I am saying  <a href=""> Hello World </a>  to you.';
 $html = str_get_html($str);
 echo $html->find("p", 0)->plaintext . "\n";
 
 (from [bugs:#163])
 
 The previous version produced incorrect results (whitespace around "Hello World").
 
 Related
 
 Bugs: ~~#163~~
 Commit: [ced5ab]
 
 - Anonymous - 2019-04-27
 
 Thanks for a good follow-up. I'll try the latest commit. I actually know one of the main W3C guys, so if necessary I could ask for his opinion on the matter.
 
 Misplaced whitespace typically ends up in HTML because of sloppiness by non-tech people, e.g. a journalist copy-pasting an article title (in a proportional font) into a CMS, not noticing the minuscule leading/trailing whitespace. So you could say that this is where the trim function really belongs: at the origin of the fault.
 
 I doubt any developer with half a brain would ever willingly use spaces in attributes, like the examples in your link. In such cases, the browser should just crash for the sake of humanity.
 
 Last edit: Anonymous 2019-04-27
 
 
 LogMANOriginal - 2019-04-27
 
 I agree, whitespace in attributes is definitely not good practice. However, there are instances where we need to work with it one way or another.
 
 If you know someone with more insight, I'd love to hear his opinion, in particular on whether this was done for the sake of compatibility (because the internet simply grew that way) or because it is actually used in legitimate cases we are not aware of. I couldn't find any information about that during my research.
 
 ~~I still can't wrap my head around why browsers maintain whitespace if it has no meaning. Maybe it's also done for the sake of compatibility.~~
 
 Forget about that. While writing this response I found a resource that is quite reliable: https://www.w3.org/TR/xml/#AVNormalize
 
 This line in particular is important:
 
 > If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.
 
 There are a few pointers in the HTML specification to the specification above, but all of them are related to XHTML. The part about attribute values is a bit vague: https://www.w3.org/TR/html/syntax.html#attribute-values
 
 But I think it's close enough to use as an argument for trimming attribute values.
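
The rule quoted from the XML specification can be sketched in plain PHP. This is only an illustration and not the code from the library: `normalize_attribute_value` is a hypothetical name, and it covers only the quoted step for space (#x20) characters, not the earlier steps of XML attribute-value normalization (such as turning tabs and line breaks into spaces).

```php
<?php
// Hypothetical helper for illustration; not part of simple_html_dom.
function normalize_attribute_value(string $value): string
{
    // Replace each sequence of space (#x20) characters with a single space.
    $collapsed = preg_replace('/\x20+/', ' ', $value);

    // Discard leading and trailing space characters only
    // (tabs and line breaks are deliberately left untouched here).
    return trim($collapsed, ' ');
}

printf("(%s)\n", normalize_attribute_value(' bar '));          // prints "(bar)"
printf("(%s)\n", normalize_attribute_value('  foo   bar  '));  // prints "(foo bar)"
```

Applying the same function to the value inside an attribute selector would keep selectors copied from a stylesheet matching against the normalized DOM values.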
 
 That said, if we trim attribute values in the DOM we should do the same with selectors. That way copying CSS with whitespace will work as if the whitespace never existed in the first place, thus making it compatible with the parser. I'll look into this next week if I find the time.
 
 > I doubt any developer with half a brain would ever willingly use spaces in attributes, like the examples in your link.
 
 A few years ago I had to work with someone who actually did this (and many other stupid things) on purpose. His reasons were simple: "I like it better this way" and "This is just the way I do it". There are simply people who like to think "outside the box" - as they say - and make it harder for everyone, just to feel special.
 
 > In such cases, the browser should just crash for the sake of humanity.
 
 Amen.
 

LogMANOriginal - 2019-04-29

Attribute normalization added via [83c897].
Let me know if this works for you.

Related

Commit: [83c897]


Anonymous - 2019-04-29

Looks like it works! Let me know when you merge things, so I don't have to choose between decoding and trimming :)


LogMANOriginal - 2019-04-29

Sure, it's just a matter of days (fingers crossed). It entirely depends on the outcome of [bugs:#169].

In the meantime you can do this:

git checkout master
git checkout -b __your_branch_name__
git merge TrimWhitespace
git merge EntityDecoding

The last merge will throw a warning because of conflicts. Open 'simple_html_dom.php' and fix the problem between the '<<<<<<<' and '>>>>>>>' marks. For some reason git cannot work out which version of the line to keep (protected function parse() should be protected function parse($trim = false)).

Once you have fixed this, make sure to remove any leftover conflict markers (the '<<<<<<<', '=======' and '>>>>>>>' lines).

Then continue with

git add simple_html_dom.php
git commit

Everything should work fine now.
If it doesn't work, use the attached file :)

Related

Bugs: ~~#169~~

simple_html_dom.php

Anonymous - 2019-04-30

Just post a notice when you're all done merging everything to master.

I'm so happy this project is active. I really thought it was abandonware when I first started using it, as bug reports were years old with no response, etc. For whatever reason things have gotten back on track, I'm grateful, and happy to help.

Actually, parsing HTML is so fundamental, and this library is so user-friendly, that I think it should be an integral part of PHP. (PHP doesn't have anything out of the box for this.)


Anonymous - 2019-05-16

When will everything be merged?


LogMANOriginal - 2019-05-27

Next up is 1.9 with the things that are currently in master. Since 1.10 was used for an earlier release (a few years ago), 2.0 is the next logical step. It will include these changes and more.

I plan to release 1.9 at the end of this week, after which the branches get merged into master.


LogMANOriginal - 2019-05-30

status: pending --> closed

assigned_to: LogMANOriginal

LogMANOriginal - 2019-05-30

Done. Both branches are now merged into master.


Anonymous - 2019-06-06

I'm so happy with these changes. The package is now like a dream to use, because you instantly get the content you want without struggling with manual trimming and decoding every single time.


Anonymous - 2019-06-14

$html = file_get_html("https://www.rogerebert.com/reviews/dark-phoenix-2019");
foreach ($html->find("div[itemprop=reviewBody] > p") as $p) {
    printf("%s\n\n", wordwrap($p->plaintext));
}

I found a case where whitespace is not removed (marked with underscores). Can you fix this?

...and “_X-Men: Apocalypse_,” Simon Kinberg_’s directorial debut...

...Jean Grey, Professor X, Raven (_Jennifer Lawrence_)...

...named Vuk (who takes the body of Jessica Chastain_) is encouraging...
- LogMANOriginal - 2019-06-22
  
  This should be fixed in master as well (works for me now). Please consider opening a new issue if this is still a problem.
  

