Would you consider adding functionality that automatically does html_entity_decode($string, ENT_QUOTES | ENT_HTML5) for all content returned from the library, such as $e->plaintext, $e->getAttribute(), $e->href, etc.?
Today one has to do this manually for every single access, which gets repetitive and tedious.
I don't think I've ever needed not to decode returned data, to put it that way. Actually, I would consider it proper practice to always decode, and then encode when explicitly needed for your own purpose.
Could decoding be the default, with a possible option to disable it if this is necessary for somebody?
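For illustration, the manual step described above can be wrapped in a small helper. This is only a sketch: decode_html() is a hypothetical name and not part of the library, and the input string stands in for whatever ->plaintext or ->getAttribute() would return.

```php
<?php
// Hypothetical helper (not a library function): decode HTML entities in a
// string returned by the parser, using the flags from the request above.
function decode_html(string $value): string
{
    return html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Stand-in for a value such as $e->plaintext.
$raw = 'Tom &amp; Jerry&apos;s &quot;show&quot;';
echo decode_html($raw), PHP_EOL;
```

Re-encoding for output would then be an explicit htmlspecialchars() call at the point where markup is produced.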
Last edit: Anonymous 2019-04-24
Any news on trimming?
I see what you mean. There are actually two parts to it. Your first example is fixed with [4d68ba]. Whitespace is now removed when using ->plaintext but can optionally be retained using ->text(false). Let me know if you find other examples where it doesn't fully work.
Your second example with the attribute is not affected by this change. If you take a look at the document in your browser, you'll notice that it also doesn't remove whitespace because attributes are meant to be handled literally. The only exception is the class attribute, which is already trimmed by the parser.
In these cases I think it's best to simply trim manually.
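A minimal sketch of trimming manually, assuming the attribute values have already been read into a plain array (the array and its contents are made up for illustration):

```php
<?php
// Stand-in for attribute values as a parser might return them (hypothetical data).
$attributes = [
    'href'  => '  https://example.com/page ',
    'title' => "\n  Hello World\t",
];

// Strip leading and trailing whitespace from every value.
$trimmed = array_map('trim', $attributes);

var_dump($trimmed);
```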
Related
Commit: [4d68ba]
Thanks for the fix!
Does the W3C HTML specification say that whitespace inside quotes is in fact part of the actual value? If so, I sort of concede, but not very happily, I must admit.
Though whitespace inside quotes is no doubt due to sloppiness on the page author's part, in the real world you always want trimmed values to avoid messing up database fields, plain-text terminal output, markup, and other sources that would carry the whitespace with them.
I appreciate you're trying to follow standards, but would you be willing to consider trimming the values anyway, out of sheer pragmatism if nothing else?
Unfortunately the HTML specification is not that easy to comprehend. For a full overview you need to look at each attribute individually, which is too much to ask for (unless you do this professionally). Here is the specification if you are interested: https://www.w3.org/TR/html/
That said, there are lots of resources on the web which go into more details on these topics. Here is one I found on whitespace in attribute values: https://www.impressivewebs.com/leading-trailing-spaces-html-attribute-values/
It's less about following the standard and more about what I see in the browser vs. what the parser returns. If we remove whitespace from attribute values, it will affect CSS selectors (which wouldn't work anymore). This can result in confusion if people simply copy selectors from the CSS files or from the browser debug tools.
Anyway, if you want to try with full trimming on all attributes, simply remove the condition at line 2180. Let me know if this is what you expect.
Oh yeah, before I forget, there is a new commit available which ensures that all block and inline level elements are taken into account when returning plaintext. It also got rid of the $trim option I added before. [ced5ab]
This example now also works. You should compare results from the previous commit to this one.
(from [bugs:#163])
The previous version produced incorrect results (whitespace around "Hello World").
Related
Bugs: #163
Commit: [ced5ab]
Thanks for a good follow-up. I'll try the latest commit. I actually know one of the main W3C guys, so if necessary I could ask for his opinion on the matter.
Misplaced whitespace typically ends up in HTML because of sloppiness by non-tech people, e.g. a journalist copy-pasting an article title (in a proportional font) into a CMS, not noticing the minuscule leading/trailing whitespace. So you could say that this is where the trim function really belongs: at the origin of the fault.
I doubt any developer with half a brain would ever willingly use spaces in attributes, like the examples in your link. In such cases, the browser should just crash for the sake of humanity.
Last edit: Anonymous 2019-04-27
I agree, whitespace in attributes is definitely not good practice. However, there are instances where we need to work with it one way or another.
If you know someone with more insight, I'd love to hear his opinion. In particular if this was done for the sake of compatibility (because the internet simply grew that way), or because it is actually used in legit cases we are not aware of. I couldn't come up with any information about that during my research.
I still can't wrap my head around why browsers maintain whitespace if it has no meaning. Maybe it's also done for the sake of compatibility.
Forget about that. While writing this response I found a resource that is quite reliable: https://www.w3.org/TR/xml/#AVNormalize
This line in particular is important:
There are a few pointers in the HTML specification to the specification above, but all of them are related to XHTML. The part about attribute values is a bit vague: https://www.w3.org/TR/html/syntax.html#attribute-values
But I think it's close enough to use as argument for trimming attribute values.
That said, if we trim attribute values in the DOM we should do the same with selectors. That way copying CSS with whitespace will work as if the whitespace never existed in the first place, thus making it compatible with the parser. I'll look into this next week if I find the time.
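To make the idea concrete, here is a rough sketch of normalizing both sides the same way, loosely following what the linked XML section describes for non-CDATA attributes (whitespace characters become spaces, runs of spaces collapse to one, leading and trailing spaces are dropped). normalize_attr() is a hypothetical helper and not the library's implementation; the sample values are made up.

```php
<?php
// Hypothetical helper, not the library's code: normalize a value roughly the
// way the XML specification describes for non-CDATA attributes.
function normalize_attr(string $value): string
{
    // Collapse each run of space, tab, CR or LF into a single space...
    $collapsed = preg_replace('/[\x20\t\r\n]+/', ' ', $value) ?? $value;
    // ...then drop the leading and trailing spaces.
    return trim($collapsed, ' ');
}

$attrValue     = "  main \n  menu ";  // value as written in the HTML
$selectorValue = ' main   menu';       // value copied from a stylesheet

// After normalizing both sides, the selector value matches the attribute again.
var_dump(normalize_attr($attrValue) === normalize_attr($selectorValue));
```

Applying the same function to the DOM value and to the selector value is what keeps copied selectors working after the DOM has been trimmed.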
A few years ago I had to work with someone who actually did this (and many other stupid things) on purpose. His reasons were simple: "I like it better this way" and "This is just the way I do it". There are simply people who like to think "outside the box" - as they say - and make it harder for everyone, just to feel special.
Amen.
Attribute normalization added via [83c897].
Let me know if this works for you.
Related
Commit: [83c897]
Looks like it works! Let me know when you merge things, so I don't have to choose between decoding and trimming :)
Sure, it's just a matter of days (fingers crossed). It entirely depends on the outcome of [bugs:#169].
In the meantime you can do this:
The last merge will throw a warning because of conflicts. Open 'simple_html_dom.php' and fix the problem between the '<<<<<<<' and '>>>>>>>' marks. For some reason git cannot figure out that it should put the first line at the last line (protected function parse() should be protected function parse($trim = false)).
Once you have fixed this, make sure to remove any unnecessary stuff (like the '<<<<<<<', '=======' and '>>>>>>>').
Then continue with
Everything should work fine now.
If it doesn't work, use the attached file :)
Related
Bugs: #169
Just post a notice when you're all done merging everything to master.
I'm so happy this project is active. I really thought it was abandonware when I first started using it, as bug reports were years old with no response, etc. For whatever reason things have gotten back on track, I'm grateful, and happy to help.
Actually, parsing HTML is so fundamental, and this library is so user-friendly, that I think it should be an integral part of PHP. (PHP doesn't have anything out of the box for this.)
When will everything be merged?
Next up is 1.9 with the things that are currently in master. Since 1.10 was used for an earlier release (a few years ago), 2.0 is the next logical step. It will include these changes and more.
I plan to release 1.9 at the end of this week, after which the branches get merged into master.
Done. Both branches are now merged into master.
I'm so happy with these changes. The package is now like a dream to use, because you instantly get the content you want without struggling with manual trimming and decoding every single time.
I found an instance where whitespace is not removed (marked with underscore). Can you fix this?
...and “_X-Men: Apocalypse_,” Simon Kinberg_’s directorial debut...
...Jean Grey, Professor X, Raven (_Jennifer Lawrence_)...
...named Vuk (who takes the body of Jessica Chastain_) is encouraging...
This should be fixed in master as well (works for me now). Please consider opening a new issue if this is still a problem.