
#52 Always decode content values from the DOM tree

2.0
closed
None
2025-11-06
2019-03-21
No

Would you consider adding functionality that automatically does html_entity_decode($string, ENT_QUOTES | ENT_HTML5) for all content returned from the library, such as $e->plaintext, $e->getAttribute(), $e->href, etc.?

Today one has to do this manually for every single access, which gets repetitive and tedious.

I don't think I've ever needed not to decode returned data, to put it that way. Actually, I would consider it proper practice to always decode, and then encode when explicitly needed for your own purpose.

Could decoding be the default, with a possible option to disable it if this is necessary for somebody?
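
For context, the manual step described above looks roughly like the following. This is only a minimal sketch: decode_value is a hypothetical helper name used for illustration, not something the library provides, and the sample string is made up.

```php
<?php
// Hypothetical helper (not part of the library): decode a returned value
// with the same flags as in the request above.
function decode_value(string $value): string
{
    return html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Decode once on the way out of the DOM ...
$title = decode_value('Tom &amp; Jerry&apos;s &quot;Best Of&quot;');
echo $title, "\n";

// ... and encode again only when the value goes back into markup.
echo htmlspecialchars($title, ENT_QUOTES | ENT_HTML5, 'UTF-8'), "\n";
```

Wrapping every $e->plaintext or $e->href access in a call like this is the repetition the request is about.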

Discussion

<< < 1 2 (Page 2 of 2)
  • Anonymous

    Anonymous - 2019-04-24
    $html = "  <p>  <span>  </span>  foo  </p>  ";
    $html = str_get_html($html);
    $html->find("span", 0)->remove();
    printf("(%s)\n", $html->find("p", 0)->plaintext);
    
    $html = str_get_html('  <meta name="description" content="  bar  ">');
    printf("(%s)\n", $html->find("meta[name=description]", 0)->content);
    
     

    Last edit: Anonymous 2019-04-24
    • Anonymous

      Anonymous - 2019-04-26

      Any news on trimming?

       
    • LogMANOriginal

      LogMANOriginal - 2019-04-27

      I see what you mean. There are actually two parts to it. Your first example is fixed with [4d68ba]. Whitespace is now removed when using ->plaintext but can optionally be retained using ->text(false). Let me know if you find other examples where it doesn't fully work.

      Your second example with the attribute is not affected by this change. If you take a look at the document in your browser, you'll notice that it also doesn't remove whitespace because attributes are meant to be handled literally. The only exception is the class attribute, which is already trimmed by the parser.

      In these cases I think it's best to simply trim manually.

      $html = str_get_html('  <meta name="description" content="  bar  ">');
      printf("(%s)\n", trim($html->find("meta[name=description]", 0)->content));
      
       

      Related

      Commit: [4d68ba]

      • Anonymous

        Anonymous - 2019-04-27

        Thanks for the fix!

        Does the W3C HTML specification say that whitespace inside quotes is in fact part of the actual value? If so, I sort of concede, but not very happily, I must admit.

        Though whitespace inside quotes is no doubt due to sloppiness on the page author's part, in the real world you always want trimmed values to avoid messing up database fields, plain-text terminal output, markup, and other sources that would carry the whitespace with them.

        I appreciate you're trying to follow standards, but would you be willing to consider trimming the values anyway, out of sheer pragmatism if nothing else?

         
        • LogMANOriginal

          LogMANOriginal - 2019-04-27

          Unfortunately the HTML specification is not that easy to comprehend. For a full overview you need to look at each attribute individually, which is too much to ask for (unless you do this professionally). Here is the specification if you are interested: https://www.w3.org/TR/html/

          That said, there are lots of resources on the web which go into more details on these topics. Here is one I found on whitespace in attribute values: https://www.impressivewebs.com/leading-trailing-spaces-html-attribute-values/

          > I appreciate you're trying to follow standards, but would you be willing to consider trimming the values anyway, out of sheer pragmatism if nothing else?

          It's less about following the standard and more about what I see in the browser vs. what the parser returns. If we remove whitespace from attribute values, it will affect CSS selectors that contain that whitespace (they would no longer match). This can cause confusion if people simply copy selectors from the CSS files or from the browser debug tools.

          Anyway, if you want to try with full trimming on all attributes, simply remove the condition at line 2180. Let me know if this is what you expect.

          Oh yeah, before I forget, there is a new commit available which ensures that all block and inline level elements are taken into account when returning plaintext. It also got rid of the $trim option I added before. [ced5ab]

          This example now also works. You should compare results from the previous commit to this one.

          <?php
          
          require_once 'simple_html_dom.php';
          
          $str = '<p>I am saying
          <!-- -->
          <span>
          <a href="">
          Hello World
          </a>
          </span>
          <!-- -->
          to you.</p>';
          
          $html = str_get_html($str);
          
          echo $html->find("p", 0)->plaintext . "\n";
          

          (from [bugs:#163])

          The previous version produced incorrect results (whitespace around "Hello World").

           

          Related

          Bugs: #163
          Commit: [ced5ab]

          • Anonymous

            Anonymous - 2019-04-27

            Thanks for a good follow-up. I'll try the latest commit. I actually know one of the main W3C guys, so if necessary I could ask for his opinion on the matter.

            Misplaced whitespace typically ends up in HTML because of sloppiness by non-tech people, e.g. a journalist copy-pasting an article title (in a proportional font) into a CMS, not noticing the minuscule leading/trailing whitespace. So you could say that this is where the trim function really belongs: at the origin of the fault.

            I doubt any developer with half a brain would ever willingly use spaces in attributes, like the examples in your link. In such cases, the browser should just crash for the sake of humanity.

             

            Last edit: Anonymous 2019-04-27
            • LogMANOriginal

              LogMANOriginal - 2019-04-27

              I agree, whitespace in attributes is definitely not good practice. However, there are instances where we need to work with it one way or another.

              If you know someone with more insight, I'd love to hear his opinion. In particular, whether this was done for the sake of compatibility (because the internet simply grew that way), or because it is actually used in legit cases we are not aware of. I couldn't come up with any information about that during my research.

              I still can't wrap my head around why browsers maintain whitespace if it has no meaning. Maybe it's also done for the sake of compatibility.

              Forget about that. While writing this response I found a resource that is quite reliable: https://www.w3.org/TR/xml/#AVNormalize

              This line in particular is important:

              > If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

              There are a few pointers in the HTML specification to the specification above, but all of them are related to XHTML. The part about attribute values is a bit vague: https://www.w3.org/TR/html/syntax.html#attribute-values

              But I think it's close enough to use as argument for trimming attribute values.
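
              The quoted rule can be sketched in a few lines of PHP. This is only an illustration of that one step, not the code from any commit in this thread; normalize_attribute_value is a hypothetical name, and the earlier steps of the XML algorithm (which turn tabs and line breaks into spaces) are assumed to have run already.

              ```php
              <?php
              // Hypothetical helper illustrating the quoted XML rule for
              // non-CDATA attributes; it only handles space (#x20) characters.
              function normalize_attribute_value(string $value): string
              {
                  // Replace each run of spaces with a single space ...
                  $collapsed = preg_replace('/ {2,}/', ' ', $value);
                  // ... then discard leading and trailing spaces.
                  return trim($collapsed, ' ');
              }

              printf("(%s)\n", normalize_attribute_value('  bar   baz  ')); // prints "(bar baz)"
              ```

              Applied to the content="  bar  " example from earlier in the thread, a helper like this would return "bar".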

              That said, if we trim attribute values in the DOM we should do the same with selectors. That way copying CSS with whitespace will work as if the whitespace never existed in the first place, thus making it compatible with the parser. I'll look into this next week if I find the time.

              > I doubt any developer with half a brain would ever willingly use spaces in attributes, like the examples in your link.

              A few years ago I had to work with someone who actually did this (and many other stupid things) on purpose. His reasons were simple: "I like it better this way" and "This is just the way I do it". There are simply people who like to think "outside the box" - as they say - and make it harder for everyone, just to feel special.

              > In such cases, the browser should just crash for the sake of humanity.

              Amen.

               
  • LogMANOriginal

    LogMANOriginal - 2019-04-29

    Attribute normalization added via [83c897].
    Let me know if this works for you.

     

    Related

    Commit: [83c897]

  • Anonymous

    Anonymous - 2019-04-29

    Looks like it works! Let me know when you merge things, so I don't have to choose between decoding and trimming :)

     
  • LogMANOriginal

    LogMANOriginal - 2019-04-29

    Sure, it's just a matter of days (fingers crossed). It entirely depends on the outcome of [bugs:#169].

    In the meantime you can do this:

    git checkout master
    git checkout -b __your_branch_name__
    git merge TrimWhitespace
    git merge EntityDecoding
    

    The last merge will throw a warning because of conflicts. Open 'simple_html_dom.php' and fix the problem between the '<<<<<<<' and '>>>>>>>' marks. For some reason git cannot figure out to put the first line at the last line (protected function parse() should be protected function parse($trim = false)).

    Once you have fixed this, make sure to remove the leftover conflict markers (the '<<<<<<<', '=======' and '>>>>>>>' lines).

    Then continue with

    git add simple_html_dom.php
    git commit
    

    Everything should work fine now.
    If it doesn't work, use the attached file :)

     

    Related

    Bugs: #169

  • Anonymous

    Anonymous - 2019-04-30

    Just post a notice when you're all done merging everything to master.

    I'm so happy this project is active. I really thought it was abandonware when I first started using it, as bug reports were years old with no response, etc. For whatever reason things have gotten back on track, I'm grateful, and happy to help.

    Actually, parsing HTML is so fundamental, and this library is so user-friendly, that I think it should be an integral part of PHP. (PHP doesn't have anything out of the box for this.)

     
  • Anonymous

    Anonymous - 2019-05-16

    When will everything be merged?

     
  • LogMANOriginal

    LogMANOriginal - 2019-05-27

    Next up is 1.9 with the things that are currently in master. Since 1.10 was used for an earlier release (a few years ago), 2.0 is the next logical step. It will include these changes and more.

    I plan to release 1.9 at the end of this week, after which the branches get merged into master.

     
  • LogMANOriginal

    LogMANOriginal - 2019-05-30
    • status: pending --> closed
    • assigned_to: LogMANOriginal
     
  • LogMANOriginal

    LogMANOriginal - 2019-05-30

    Done. Both branches are now merged into master.

     
  • Anonymous

    Anonymous - 2019-06-06

    I'm so happy with these changes. The package is now like a dream to use, because you instantly get the content you want without struggling with manual trimming and decoding every single time.

     
  • Anonymous

    Anonymous - 2019-06-14
      $html = file_get_html("https://www.rogerebert.com/reviews/dark-phoenix-2019");
      foreach($html->find("div[itemprop=reviewBody] > p") as $p)
        printf("%s\n\n", wordwrap($p->plaintext));
    

    I found some cases where whitespace is not removed (marked with underscores). Can you fix this?

    ...and “_X-Men: Apocalypse_,” Simon Kinberg_’s directorial debut...

    ...Jean Grey, Professor X, Raven (_Jennifer Lawrence_)...

    ...named Vuk (who takes the body of Jessica Chastain_) is encouraging...

     
    • LogMANOriginal

      LogMANOriginal - 2019-06-22

      This should be fixed in master as well (works for me now). Please consider opening a new issue if this is still a problem.

       