Menu

#52 Always decode content values from the DOM tree

2.0
closed
None
2025-11-06
2019-03-21
No

Would you consider adding functionality that automatically does html_entity_decode($string, ENT_QUOTES | ENT_HTML5) for all content returned from the library, such as$e->plaintext, $e->getAttribute(), $e->href, etc.?

Today one has to do this manually for every single access which gets repetitive and tedious.

I don't think I've ever needed not to decode returned data, to put it that way. Actually, I would consider it proper practice to always decode, and then encode when explicitly needed for your own purpose.

Could decoding be the default, with a possible option to disable it if this is necessary for somebody?

Discussion

1 2 > >> (Page 1 of 2)
  • Anonymous

    Anonymous - 2019-03-21

    Actually, trim() would also be a desired default. I have never needed not to remove surrounding whitespace from an accessed value (unless when assuming/hoping there's never going to be any).

     
  • LogMANOriginal

    LogMANOriginal - 2019-03-23

    I also can't think of any reason where decoding is not desired. That said, it adds some overhead, especially when decoding isn't necessary (i.e. already decoded).

    Decoding could be made optional (i.e. with a define) or done while parsing (this is probably the most efficient as it is done only once).

    Trim on the other hand should already be done for plaintext (not sure about attributes). These things are expecially delicate as small changes can have big impact and might break existing code. Nevertheless, trim can be applied to some elements while parsing, similar to decoding.

     
  • Anonymous

    Anonymous - 2019-03-23

    I doubt a change in performance would have any significant real world impact. I think the correct thing is to always decode and trim. Not doing is relying too much on the HTML, which will break in other ways if the HTML changes.

    As for breaking code, decoding a string that is already decoded will practically always return the same string. So I think you could actually get away with just changing the code. Unless a lot of people really expect a lot of non-trimmed, non-decoded strings. I would shout pretty loud about it in the change log, though :)

    You could make a separate branch for the changes, and I (and others) could test them out. Keep me posted.

     
  • LogMANOriginal

    LogMANOriginal - 2019-04-18
    • Group: Unassigned --> 1.9
     
  • LogMANOriginal

    LogMANOriginal - 2019-04-21
    • status: open --> pending
    • Group: 1.9 --> 2.0
     
  • LogMANOriginal

    LogMANOriginal - 2019-04-21

    I implemented a version for this in a separate branch at [1cefc6]. Please let me know if this works for you.

    This feature needs to be scheduled for 2.0 as it breaks existing code. Find an explanation in the commit message for the commit above. Essentially, decoding contents twice is problematic in situations where decoding an entity reveals another (i.e. &).

    Regarding trim, it needs to be applied before decoding because the content might contain encoded whitespace which would be stripped if we trim after decoding.

     

    Related

    Commit: [1cefc6]


    Last edit: LogMANOriginal 2019-04-21
  • Anonymous

    Anonymous - 2019-04-21

    Very exciting! Will try this. I wouldn't worry about edge cases like & for normal use. The only relevant case seems to be markup that is verbatimely referring to entities.

     

    Last edit: Anonymous 2019-04-22
  • Anonymous

    Anonymous - 2019-04-21

    PS! Which Git commands do I use to get this branch/commit? A normal pull gets just the master branch, and there is no such commit there.

     
  • LogMANOriginal

    LogMANOriginal - 2019-04-21

    git checkout EntityDecoding should work just fine.

     
  • Anonymous

    Anonymous - 2019-04-22

    I had to do git checkout -b EntityDecoding and then git pull origin EntityDecoding. Let me know if there's an easier way to pull a branch.

     
    • LogMANOriginal

      LogMANOriginal - 2019-04-22

      I see. You probably had a clone of the repository before the branch was added, right?

      In that case git fetch --all will update the list of branches which you can view with git branch -a and checkout with git checkout <branch-name>.

       
      • Anonymous

        Anonymous - 2019-04-22

        I still had to do git pull origin EntityDecoding. Maybe this has something to do with .gitconfig and definition of remotes.

         
        • LogMANOriginal

          LogMANOriginal - 2019-04-22

          Your branch currently doesn't track the upstream branch, because you branched it manually. This answer on StackOverflow should fix it: https://stackoverflow.com/a/2286030

          This should work for you:
          git branch --set-upstream-to=origin/EntityDecoding EntityDecoding

           
          • Anonymous

            Anonymous - 2019-04-22

            This is just not my day: error: the requested upstream branch 'origin/EntityDecoding' does not exist. Starting over seemed to work better:

            $ git clone git://git.code.sf.net/p/simplehtmldom/repository
            $ cd repository
            $ git checkout EntityDecoding
            $ git fetch --all
            

            On a side note, it would be good if the basename of the URL was simple_html_dom or simplehtmldom (whichever is the official), rather than repository.

             

            Last edit: Anonymous 2019-04-22
            • LogMANOriginal

              LogMANOriginal - 2019-04-22

              I'd love to change the basename, but the same name is used for the tab ("Repository"), URL and everything. This is one of the things SF is really bad at.

              That said, you can change your folder name to whatever you like when cloning the repository:

              git clone git://git.code.sf.net/p/simplehtmldom/repository simplehtmldom

              Fortunately you can also rename the folder after cloning.

               
              • Anonymous

                Anonymous - 2019-04-22

                Ever thought about migrating the whole thing to Github? Sourceforge feels kind of outdated, though there may be aspects of this I don't know about...

                 
                • LogMANOriginal

                  LogMANOriginal - 2019-04-22

                  That was my original intention as well. Then John (john_schlick) and I spoke about that and decided to stay on SF because everything is already setup and the project has been on this platform for years. SF also provides much more insight than GitHub on statistics and stuff like that.

                  That said, I'm still investigating the option to use GitHub as mirror (or vice versa), which was made possible by SF a few month ago. Here is a blog with some details: https://sourceforge.net/blog/introducing-the-new-sourceforge/

                  I'll look into it once I've finished cleaning up the technical dept.

                   
  • Anonymous

    Anonymous - 2019-04-22

    Trimming doesn't seem to take place. You wrote " it [trimming] needs to be applied before decoding". Does this mean it's not implemented yet?

     

    Last edit: Anonymous 2019-04-22
    • LogMANOriginal

      LogMANOriginal - 2019-04-22

      That is corrct, trimming is not yet implemented.

       
  • Anonymous

    Anonymous - 2019-04-22

    Found some things that are not decoded as expected:

    $html = str_get_html('<meta name="description" content="H&auml;agen-Dazs">');
    echo $html->find("meta[name=description]", 0)->content . "\n";
    echo $html->find("meta[name=description]", 0)->getAttribute("content") . "\n";
    

    Results in:

    H&auml;agen-Dazs
    H&auml;agen-Dazs
    
     
    • LogMANOriginal

      LogMANOriginal - 2019-04-22

      Thanks for the feedback. Looks like attribute values are currently not captured.

       
    • LogMANOriginal

      LogMANOriginal - 2019-04-22

      Fixed via [a89d71]. Let me know if you find more issues.

       

      Related

      Commit: [a89d71]

      • Anonymous

        Anonymous - 2019-04-22

        Looks fine now. Post a notice when you have trimming in place.

         
  • LogMANOriginal

    LogMANOriginal - 2019-04-23

    I added [48879f] for trimming. This is a separate branch from EntitiyDecoding to not mix up changes (yet). The results are very promising.

    Example

    <?php
    
    require_once 'simple_html_dom.php';
    
    $html = str_get_html(
    <<<EOD
    <html>
    <head>
        <meta charset="UTF-8">
        <meta name ="description" content= "simplehtmldom">
        <meta name = "keywords" content   =   "simple,html,dom">
        <meta      name      =        "author"
            content          =        "John Doe">
        <meta charset="UTF-8">
        < meta name="description" content="simplehtmldom">
        <meta name="keywords" content="simple,html,dom" >
        < meta name="author" content="John Doe" >
        <       meta name="viewport" content="width=device-width, initial-scale=1.0"
        >
    <body>
        <div class="article" />
        < div class="article" />
        <div class="article" / >
        < div class="article" / >
        <
        div class="article" /
        >
        <     div class="article" /     >
        <div class="article">
        < div class="level1">
        <div class="level2" >
        < div class="level3" >
        <
        div class="level4"
        >
        <     div class="level5"     >
        <     /div     >
        <
        /div
        >
        < /div >
        </div >
        < /div>
        </ div>
    </body>
    </html>
    EOD
    );
    
    echo $html . PHP_EOL;
    
    /**
    
     * Output
     * 
     * <html><head><meta charset="UTF-8"><meta name="description" content="simplehtmldom"><meta name="keywords" content="simple,html,dom"><meta name="author" content="John Doe"><meta charset="UTF-8"><meta name="description" content="simplehtmldom"><meta name="keywords" content="simple,html,dom"><meta name="author" content="John Doe"><meta name="viewport" content="width=device-width, initial-scale=1.0"><body><div class="article"/><div class="article"/><div class="article"/><div class="article"/><div class="article"/><div class="article"/><div class="article"><div class="level1"><div class="level2"><div class="level3"><div class="level4"><div class="level5"></div></div></div></div></div></div></body></html>
     */
    

    Note: The two branches currently don't merge due to conflicts (changing the same code in both branches). Still, if you want to try them both, put them together manually.

    Let me know if you find any bugs.

     

    Related

    Commit: [48879f]

1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB