PHP Simple HTML DOM Parser / Feature Requests / #52 Always decode content values from the DOM tree

Anonymous - 2019-03-21

Actually, trim() would also be a desired default. I have never needed not to remove surrounding whitespace from an accessed value (unless when assuming/hoping there's never going to be any).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

LogMANOriginal - 2019-03-23

I also can't think of any reason where decoding is not desired. That said, it adds some overhead, especially when decoding isn't necessary (i.e. already decoded).

Decoding could be made optional (i.e. with a define) or done while parsing (this is probably the most efficient as it is done only once).

Trim on the other hand should already be done for plaintext (not sure about attributes). These things are expecially delicate as small changes can have big impact and might break existing code. Nevertheless, trim can be applied to some elements while parsing, similar to decoding.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anonymous - 2019-03-23

I doubt a change in performance would have any significant real world impact. I think the correct thing is to always decode and trim. Not doing is relying too much on the HTML, which will break in other ways if the HTML changes.

As for breaking code, decoding a string that is already decoded will practically always return the same string. So I think you could actually get away with just changing the code. Unless a lot of people really expect a lot of non-trimmed, non-decoded strings. I would shout pretty loud about it in the change log, though :)

You could make a separate branch for the changes, and I (and others) could test them out. Keep me posted.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

LogMANOriginal - 2019-04-18

Group: Unassigned --> 1.9
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

LogMANOriginal - 2019-04-19

Cross reference: https://sourceforge.net/p/simplehtmldom/bugs/29/

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

LogMANOriginal - 2019-04-21

status: open --> pending

Group: 1.9 --> 2.0
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

LogMANOriginal - 2019-04-21

I implemented a version for this in a separate branch at [1cefc6]. Please let me know if this works for you.

This feature needs to be scheduled for 2.0 as it breaks existing code. Find an explanation in the commit message for the commit above. Essentially, decoding contents twice is problematic in situations where decoding an entity reveals another (i.e. &amp;).

Regarding trim, it needs to be applied before decoding because the content might contain encoded whitespace which would be stripped if we trim after decoding.

Related

Commit: [1cefc6]

Last edit: LogMANOriginal 2019-04-21

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anonymous - 2019-04-21

Very exciting! Will try this. I wouldn't worry about edge cases like &amp; for normal use. The only relevant case seems to be markup that is verbatimely referring to entities.

Last edit: Anonymous 2019-04-22

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anonymous - 2019-04-21

PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

LogMANOriginal - 2019-04-21

git checkout EntityDecoding should work just fine.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anonymous - 2019-04-22

I had to do git checkout -b EntityDecoding and then git pull origin EntityDecoding. Let me know if there's an easier way to pull a branch.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- LogMANOriginal - 2019-04-22
  
  I see. You probably had a clone of the repository before the branch was added, right?
  
  In that case git fetch --all will update the list of branches which you can view with git branch -a and checkout with git checkout <branch-name>.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Anonymous - 2019-04-22
    
    I still had to do git pull origin EntityDecoding. Maybe this has something to do with .gitconfig and definition of remotes.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - LogMANOriginal - 2019-04-22
      
      Your branch currently doesn't track the upstream branch, because you branched it manually. This answer on StackOverflow should fix it: https://stackoverflow.com/a/2286030
      
      This should work for you:
      git branch --set-upstream-to=origin/EntityDecoding EntityDecoding
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Anonymous - 2019-04-22
        
        This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. Starting over seemed to work better:
        
        $ git clone git://git.code.sf.net/p/simplehtmldom/repository $ cd repository $ git checkout EntityDecoding $ git fetch --all
        
        On a side note, it would be good if the basename of the URL was simple_html_dom or simplehtmldom (whichever is the official), rather than repository.
        
        Last edit: Anonymous 2019-04-22
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        LogMANOriginal - 2019-04-22
        
        I'd love to change the basename, but the same name is used for the tab ("Repository"), URL and everything. This is one of the things SF is really bad at.
        
        That said, you can change your folder name to whatever you like when cloning the repository:
        
        git clone git://git.code.sf.net/p/simplehtmldom/repository simplehtmldom
        
        Fortunately you can also rename the folder after cloning.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Anonymous - 2019-04-22
        
        Ever thought about migrating the whole thing to Github? Sourceforge feels kind of outdated, though there may be aspects of this I don't know about...
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        LogMANOriginal - 2019-04-22
        
        That was my original intention as well. Then John (john_schlick) and I spoke about that and decided to stay on SF because everything is already setup and the project has been on this platform for years. SF also provides much more insight than GitHub on statistics and stuff like that.
        
        That said, I'm still investigating the option to use GitHub as mirror (or vice versa), which was made possible by SF a few month ago. Here is a blog with some details: https://sourceforge.net/blog/introducing-the-new-sourceforge/
        
        I'll look into it once I've finished cleaning up the technical dept.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anonymous - 2019-04-22

Trimming doesn't seem to take place. You wrote " it [trimming] needs to be applied before decoding". Does this mean it's not implemented yet?

Last edit: Anonymous 2019-04-22

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- LogMANOriginal - 2019-04-22
  
  That is corrct, trimming is not yet implemented.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anonymous - 2019-04-22

Found some things that are not decoded as expected:

$html = str_get_html('<meta name="description" content="Häagen-Dazs">'); echo $html->find("meta[name=description]", 0)->content . "\n"; echo $html->find("meta[name=description]", 0)->getAttribute("content") . "\n";

Results in:

Häagen-Dazs Häagen-Dazs
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- LogMANOriginal - 2019-04-22
  
  Thanks for the feedback. Looks like attribute values are currently not captured.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- LogMANOriginal - 2019-04-22
  
  Fixed via [a89d71]. Let me know if you find more issues.
  
  Related
  
  Commit: [a89d71]
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Anonymous - 2019-04-22
    
    Looks fine now. Post a notice when you have trimming in place.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

I added [48879f] for trimming. This is a separate branch from EntitiyDecoding to not mix up changes (yet). The results are very promising.

Example

<?php

require_once 'simple_html_dom.php';

$html = str_get_html(
<<<EOD
<html>
<head>
    <meta charset="UTF-8">
    <meta name ="description" content= "simplehtmldom">
    <meta name = "keywords" content   =   "simple,html,dom">
    <meta      name      =        "author"
        content          =        "John Doe">
    <meta charset="UTF-8">
    < meta name="description" content="simplehtmldom">
    <meta name="keywords" content="simple,html,dom" >
    < meta name="author" content="John Doe" >
    <       meta name="viewport" content="width=device-width, initial-scale=1.0"
    >
<body>
    <div class="article" />
    < div class="article" />
    <div class="article" / >
    < div class="article" / >
    <
    div class="article" /
    >
    <     div class="article" /     >
    <div class="article">
    < div class="level1">
    <div class="level2" >
    < div class="level3" >
    <
    div class="level4"
    >
    <     div class="level5"     >
    <     /div     >
    <
    /div
    >
    < /div >
    </div >
    < /div>
    </ div>
</body>
</html>
EOD
);

echo $html . PHP_EOL;

/**

 * Output
 * 
 * <html><head><meta charset="UTF-8"><meta name="description" content="simplehtmldom"><meta name="keywords" content="simple,html,dom"><meta name="author" content="John Doe"><meta charset="UTF-8"><meta name="description" content="simplehtmldom"><meta name="keywords" content="simple,html,dom"><meta name="author" content="John Doe"><meta name="viewport" content="width=device-width, initial-scale=1.0"><body><div class="article"/><div class="article"/><div class="article"/><div class="article"/><div class="article"/><div class="article"/><div class="article"><div class="level1"><div class="level2"><div class="level3"><div class="level4"><div class="level5"></div></div></div></div></div></div></body></html>
 */

Note: The two branches currently don't merge due to conflicts (changing the same code in both branches). Still, if you want to try them both, put them together manually.

Let me know if you find any bugs.

Commit: [48879f]

Always decode content values from the DOM tree

A php based DOM parser.

Group

Searches

Help

#52 Always decode content values from the DOM tree

Discussion

Related

Related

Related