#169 Bug: 1.8.1 find not properly working

closed
None
2019-05-30
2019-04-11
Luis Franco
No

Version 1.8.1 doesn't work as expected, for example:

$dom->find('div[class=name]', 0);

The above fails if the div has more than one class. I'm currently working with version 1.7.1, which doesn't have this issue.

Related

Feature Requests: #52

Discussion

  • LogMANOriginal

    LogMANOriginal - 2019-04-15

    Thanks for reporting this issue.
    It could be related to #166, which was fixed recently. Does it still happen with master?

     
  • LogMANOriginal

    LogMANOriginal - 2019-04-22
    • status: open --> pending
     
  • LogMANOriginal

    LogMANOriginal - 2019-04-22

    Does this still happen with master?

     
    • Luis Franco

      Luis Franco - 2019-04-23

      Hi, sorry, I was out on vacation. I'll test and reply back with the results.

       
      • LogMANOriginal

        LogMANOriginal - 2019-04-29

        Any news on this?

         
  • DB1

    DB1 - 2019-05-24

    I can confirm that this still is an issue.

    I've tried 1.8.1, 1.7, and the latest master, and in all cases I can't get the div when it has more than one class.

     
    • Luis Franco

      Luis Franco - 2019-05-24

      Hello, try 1.7.1; that one worked for me. And yes, @LogMANOriginal, it's still buggy using master.

       
      • DB1

        DB1 - 2019-05-25

        There is no 1.7.1 that I can find. There's 1.7 and I've tried it; it doesn't work, unfortunately.

         
      • DB1

        DB1 - 2019-05-25

        It seems that 1.7 does work after all... I guess in my haste while testing I forgot to change the file path.

        Thanks!

         
  • DB1

    DB1 - 2019-05-24

    No longer relevant.

     

    Last edit: DB1 2019-05-25
  • LogMANOriginal

    LogMANOriginal - 2019-05-27

    I think the issue is finally clear to me.
    At first I thought you were searching through HTML like this

    <div class="main" class="article"></div>
    

    but you are talking about multiple classes, not multiple class attributes, so the code probably looks like this (please correct me if I'm wrong)

    <div class="main article"></div>
    

    In this case, the selector 'div[class="main"]' is not what you really want. You want to match elements whose class list contains the specified value, not elements whose class attribute equals it exactly. If that worked in 1.7, it was a bug.

    Generally speaking, for classes you should use 'div[class~="main"]', because it treats the attribute as a whitespace-separated list. Please note that the value itself must be a single class: 'div[class~="main header"]' never matches, because the value contains whitespace.
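
To require several classes at once, each class has to be tested on its own. A minimal plain-PHP sketch of that check follows; hasAllClasses is a hypothetical helper written for illustration and is not part of simple_html_dom.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): returns true if the
// raw class attribute contains every class listed in $wanted.
function hasAllClasses(string $classAttr, array $wanted): bool
{
    // Split the attribute into its whitespace-separated class names.
    $present = preg_split('/\s+/', $classAttr, -1, PREG_SPLIT_NO_EMPTY);

    // Every wanted class must appear among the classes that are present.
    return count(array_diff($wanted, $present)) === 0;
}

var_dump(hasAllClasses('main header section', ['main', 'section'])); // bool(true)
var_dump(hasAllClasses('main header section', ['main header']));     // bool(false)
var_dump(hasAllClasses('mainnot', ['main']));                        // bool(false)
```

The second call returns false because 'main header' is treated as one value containing whitespace, which can never equal a single class name.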

    Find more details here

    Here is some code for testing

    <?php
    
    require_once 'simple_html_dom.php';
    
    $html = str_get_html(<<<EOD
    <body>
    <div class="main header section"></div>
    <div class="mainnot"></div>
    </body>
    EOD
    );
    
    // "=" matches the value **exactly**
    echo 'Match "=":  ';
    echo count($html->find('div[class="main"]'));
    echo PHP_EOL;
    
    // "*=" matches if it **contains** the value
    echo 'Match "*=": ';
    echo count($html->find('div[class*="main"]'));
    echo PHP_EOL;
    // Note that this also matches <div class="mainnot"></div>
    
    // "^=" matches if it **starts** with the value
    echo 'Match "^=": ';
    echo count($html->find('div[class^="main"]'));
    echo PHP_EOL;
    // Note that this also matches <div class="mainnot"></div>
    
    // "~=" matches if the value is one of the whitespace-separated words
    echo 'Match "~=": ';
    echo count($html->find('div[class~="main"]'));
    echo PHP_EOL;
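
For readers without the library at hand, the four operators can be approximated in plain PHP. This is only a sketch of the matching rules described in the CSS specification, not the library's implementation; attrMatches and countMatches are hypothetical names, and the class values are copied from the two divs in the test document above.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): applies one CSS
// attribute operator to a raw attribute value.
function attrMatches(string $actual, string $op, string $wanted): bool
{
    switch ($op) {
        case '=':  // the whole attribute value must be identical
            return $actual === $wanted;
        case '*=': // the value must occur anywhere in the attribute
            return $wanted !== '' && strpos($actual, $wanted) !== false;
        case '^=': // the attribute must start with the value
            return $wanted !== '' && strncmp($actual, $wanted, strlen($wanted)) === 0;
        case '~=': // the value must be one of the whitespace-separated words
            $words = preg_split('/\s+/', $actual, -1, PREG_SPLIT_NO_EMPTY);
            return in_array($wanted, $words, true);
        default:
            throw new InvalidArgumentException("Unsupported operator: $op");
    }
}

// Counts how many of the given class attributes satisfy the operator.
function countMatches(array $classes, string $op, string $wanted): int
{
    return count(array_filter($classes, function ($c) use ($op, $wanted) {
        return attrMatches($c, $op, $wanted);
    }));
}

// Class attributes of the two divs in the test document.
$classes = ['main header section', 'mainnot'];

foreach (['=', '*=', '^=', '~='] as $op) {
    echo str_pad("\"$op\":", 6), countMatches($classes, $op, 'main'), PHP_EOL;
}
```

Under these rules the counts are 0, 2, 2 and 1, so only "~=" selects the first div without also selecting "mainnot".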
    

    Does that work for you?

     
  • Luis Franco

    Luis Franco - 2019-05-27

    Gotcha, I'm testing both master and 7.1. One thing, though: this used to work in previous versions. In my case I had to upgrade the simplehtmldom library because of a PHP bump to 7.3.X.

    Let me get back to you with the results.

     
  • Luis Franco

    Luis Franco - 2019-05-27

    With 1.8.1, as you mentioned, div[class~="main"] is the one that resolves the problems I was having (going from 1.5 and prior versions up to master).

    Also, I can confirm that echo $html->find('div[class="main"]', 0)->innertext; works on 1.7. If that's a bug now, please confirm it, and then I guess we're good to close this case.

    I really appreciate the time you spent checking this out. :)

     
    • LogMANOriginal

      LogMANOriginal - 2019-05-28

      Also, I can confirm that echo $html->find('div[class="main"]', 0)->innertext; works on 1.7. If that's a bug now, please confirm it, and then I guess we're good to close this case.

      It is a bug in version 1.7 and prior. The CSS specification is very clear about it.

      [att=val] Represents an element with the att attribute whose value is exactly "val".
      -- https://www.w3.org/TR/selectors/#attribute-selectors

      I suppose the best way to confirm this is to load a HTML document with CSS styles. Here is one example.

      <head>
      <style>
      div[class=main] { color: white; }
      div[class~=main] { background-color: blue; }
      </style>
      </head>
      <body>
      <div class="main header section">PHP Simple HTML DOM Parser</div>
      <div class="mainnot">Hello, World!</div>
      </body>
      

      As you can see, the first selector uses the original form, which worked in 1.7 and prior; it sets the text color to white. The second selector sets the background color to blue. On my machine only the second rule is applied.

      It looks something like this (only renders on SF)

      PHP Simple HTML DOM Parser
      Hello, World!
       
      • DB1

        DB1 - 2019-05-28

        On this page: https://www.npostart.nl/andere-tijden/VPWON_1247337/episode

        I can do the following in 1.7:

        foreach($html->find('div[id=component-grid-episodes] div[class=npo-grid-asset] .npo-asset-tile-container') as $episode)

        It correctly grabs all episode divs that way.

        But when I change the last part to:
        div[class~=npo-asset-tile-container]
        it only grabs the first div and not all of them.

        What am I doing wrong?

         
        • LogMANOriginal

          LogMANOriginal - 2019-05-28

          That is an actual bug in 1.8!

          Here is some example code which shows selectors that work and one that doesn't:

          <?php
          
          include_once 'simple_html_dom.php';
          
          $html = file_get_html('https://www.npostart.nl/andere-tijden/VPWON_1247337/episode');
          
          // Works: the id value is quoted
          echo count($html->find('div[id="component-grid-episodes"] div[class=npo-grid-asset] .npo-asset-tile-container'));
          
          echo PHP_EOL;
          
          // Works: a space precedes the closing bracket
          echo count($html->find('div[id=component-grid-episodes ] div[class=npo-grid-asset] .npo-asset-tile-container'));
          
          echo PHP_EOL;
          
          // Doesn't work: unquoted id value ending in "s" directly before "]"
          echo count($html->find('div[id=component-grid-episodes] div[class=npo-grid-asset] .npo-asset-tile-container'));
          
          echo PHP_EOL;
          

          The first two examples work and the third one doesn't because the id ends in an 's', which is incorrectly detected as the case-sensitivity specifier (https://www.w3.org/TR/selectors-4/#attribute-case). Quoting the value or putting a space before the closing bracket prevents the misdetection.

          This certainly needs fixing, thanks for reporting it!

           
          • DB1

            DB1 - 2019-05-28

            Wow, I am sure glad it was a bug cause I was slowly getting convinced I was going crazy. :p

            Glad to help! :D

             
            • LogMANOriginal

              LogMANOriginal - 2019-05-30

              This is fixed in master
              [680b45]

               

              Related

              Commit: [680b45]

        • LogMANOriginal

          LogMANOriginal - 2019-05-28

          Changing this line
          https://sourceforge.net/p/simplehtmldom/repository/ci/1.8.1/tree/simple_html_dom.php#l1188

          to

          "/\[@?(!?[\w:-]+)(?:([!*^$|~]?=)[\"']?(.*?)[\"']?)?(?:\s+?([iIsS])?)?\]/is",
          

          should fix it (notice the \s+?([iIsS]) instead of \s*?([iIsS]) at the end)

          I'll push this fix to master later this week.
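
The effect of that one-character change can be checked in isolation with preg_match. The sketch below assumes the 1.8.1 pattern is identical to the one quoted above except for \s*? in place of \s+?, as the comment describes; attrPattern and parseAttr are hypothetical helpers and are not part of simple_html_dom.

```php
<?php
// Builds the attribute-selector pattern quoted above. $gap is the only
// part that differs between the two variants.
function attrPattern(string $gap): string
{
    return '/\[@?(!?[\w:-]+)(?:([!*^$|~]?=)["\']?(.*?)["\']?)?(?:' . $gap . '([iIsS])?)?\]/is';
}

// Returns [value, flag] as captured by the pattern; the flag is '' when
// no case-sensitivity specifier was captured.
function parseAttr(string $pattern, string $selector): array
{
    if (!preg_match($pattern, $selector, $m)) {
        return [null, null];
    }
    return [$m[3] ?? '', $m[4] ?? ''];
}

$old = attrPattern('\s*?'); // assumed 1.8.1 behaviour: whitespace optional
$new = attrPattern('\s+?'); // proposed fix: whitespace required before the flag

$selectors = [
    'div[id="component-grid-episodes"]',
    'div[id=component-grid-episodes ]',
    'div[id=component-grid-episodes]',
];

foreach ($selectors as $selector) {
    [$oldValue, $oldFlag] = parseAttr($old, $selector);
    [$newValue, $newFlag] = parseAttr($new, $selector);
    printf("%-36s old: %s [%s]  new: %s [%s]\n",
        $selector, $oldValue, $oldFlag, $newValue, $newFlag);
}
```

With the old pattern, the third selector yields the value "component-grid-episode" and the flag "s". With the new pattern all three selectors yield "component-grid-episodes" and no flag, while an explicit specifier such as 'div[id=foo i]' is still recognized.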

           
  • Luis Franco

    Luis Franco - 2019-05-28

    Awesome explanation. Now the issue is clear; we'll update our code accordingly.

    🙏Thanks again!!!

     
  • Luis Franco

    Luis Franco - 2019-05-28

    dup text

     

    Last edit: Luis Franco 2019-05-28
  • LogMANOriginal

    LogMANOriginal - 2019-05-30
    • status: pending --> closed
    • assigned_to: LogMANOriginal
     
