Would you consider adding functionality that automatically does html_entity_decode($string, ENT_QUOTES | ENT_HTML5) for all content returned from the library, such as$e->plaintext, $e->getAttribute(), $e->href, etc.?
Today one has to do this manually for every single access which gets repetitive and tedious.
I don't think I've ever needed not to decode returned data, to put it that way. Actually, I would consider it proper practice to always decode, and then encode when explicitly needed for your own purpose.
Could decoding be the default, with a possible option to disable it if this is necessary for somebody?
Actually,
trim()would also be a desired default. I have never needed not to remove surrounding whitespace from an accessed value (unless when assuming/hoping there's never going to be any).I also can't think of any reason where decoding is not desired. That said, it adds some overhead, especially when decoding isn't necessary (i.e. already decoded).
Decoding could be made optional (i.e. with a define) or done while parsing (this is probably the most efficient as it is done only once).
Trim on the other hand should already be done for plaintext (not sure about attributes). These things are expecially delicate as small changes can have big impact and might break existing code. Nevertheless, trim can be applied to some elements while parsing, similar to decoding.
I doubt a change in performance would have any significant real world impact. I think the correct thing is to always decode and trim. Not doing is relying too much on the HTML, which will break in other ways if the HTML changes.
As for breaking code, decoding a string that is already decoded will practically always return the same string. So I think you could actually get away with just changing the code. Unless a lot of people really expect a lot of non-trimmed, non-decoded strings. I would shout pretty loud about it in the change log, though :)
You could make a separate branch for the changes, and I (and others) could test them out. Keep me posted.
Cross reference: https://sourceforge.net/p/simplehtmldom/bugs/29/
I implemented a version for this in a separate branch at [1cefc6]. Please let me know if this works for you.
This feature needs to be scheduled for 2.0 as it breaks existing code. Find an explanation in the commit message for the commit above. Essentially, decoding contents twice is problematic in situations where decoding an entity reveals another (i.e.
&).Regarding trim, it needs to be applied before decoding because the content might contain encoded whitespace which would be stripped if we trim after decoding.
Related
Commit: [1cefc6]
Last edit: LogMANOriginal 2019-04-21
Very exciting! Will try this. I wouldn't worry about edge cases like
&for normal use. The only relevant case seems to be markup that is verbatimely referring to entities.Last edit: Anonymous 2019-04-22
PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.
git checkout EntityDecodingshould work just fine.I had to do
git checkout -b EntityDecodingand thengit pull origin EntityDecoding. Let me know if there's an easier way to pull a branch.I see. You probably had a clone of the repository before the branch was added, right?
In that case
git fetch --allwill update the list of branches which you can view withgit branch -aand checkout withgit checkout <branch-name>.I still had to do
git pull origin EntityDecoding. Maybe this has something to do with.gitconfigand definition of remotes.Your branch currently doesn't track the upstream branch, because you branched it manually. This answer on StackOverflow should fix it: https://stackoverflow.com/a/2286030
This should work for you:
git branch --set-upstream-to=origin/EntityDecoding EntityDecodingThis is just not my day:
error: the requested upstream branch 'origin/EntityDecoding' does not exist. Starting over seemed to work better:On a side note, it would be good if the basename of the URL was
simple_html_domorsimplehtmldom(whichever is the official), rather thanrepository.Last edit: Anonymous 2019-04-22
I'd love to change the basename, but the same name is used for the tab ("Repository"), URL and everything. This is one of the things SF is really bad at.
That said, you can change your folder name to whatever you like when cloning the repository:
git clone git://git.code.sf.net/p/simplehtmldom/repository simplehtmldomFortunately you can also rename the folder after cloning.
Ever thought about migrating the whole thing to Github? Sourceforge feels kind of outdated, though there may be aspects of this I don't know about...
That was my original intention as well. Then John (john_schlick) and I spoke about that and decided to stay on SF because everything is already setup and the project has been on this platform for years. SF also provides much more insight than GitHub on statistics and stuff like that.
That said, I'm still investigating the option to use GitHub as mirror (or vice versa), which was made possible by SF a few month ago. Here is a blog with some details: https://sourceforge.net/blog/introducing-the-new-sourceforge/
I'll look into it once I've finished cleaning up the technical dept.
Trimming doesn't seem to take place. You wrote " it [trimming] needs to be applied before decoding". Does this mean it's not implemented yet?
Last edit: Anonymous 2019-04-22
That is corrct, trimming is not yet implemented.
Found some things that are not decoded as expected:
Results in:
Thanks for the feedback. Looks like attribute values are currently not captured.
Fixed via [a89d71]. Let me know if you find more issues.
Related
Commit: [a89d71]
Looks fine now. Post a notice when you have trimming in place.
I added [48879f] for trimming. This is a separate branch from EntitiyDecoding to not mix up changes (yet). The results are very promising.
Example
Note: The two branches currently don't merge due to conflicts (changing the same code in both branches). Still, if you want to try them both, put them together manually.
Let me know if you find any bugs.
Related
Commit: [48879f]