Menu

#199 Incorrect handling of <br> tags next to line breaks

closed
None
2022-04-25
2022-04-19
olalav
No
  $text = '<p>***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
<br>********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** *******
<br>*** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********,
<br>** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ******
<br>*** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ********
<br>***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,****
<br>** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. *****
<br>** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** ****
<br>** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******.
<br>*** *** **** ******** ***** ***** *** **** *** ******* **********  **********.
<br>** ** *************** *** ****** **********.*** ******** ** ** *** *******
<br>BR TAGS IN THE MIDDLE LIKE THIS <br> SHOULD ACTUALLY BREAK.
<br> THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.</p>';
  // $text = preg_replace("/\n+<br>/", "\n", $text); // un-comment to workaround bug
  $plain = str_get_html($text)->plaintext;
  echo wordwrap($plain, 80) . "\n";

The <br> tag is not properly interpreted/removed, and the parser leaves CR characters in the plaintext, so the resulting text is a mess, eg.

  1. Lines are not wrapped, and become much longer than 80 chars
  2. Space characters at the start of a line are not removed

I know there is a define("DEFAULT_BR_TEXT", "\r\n"), but changing the setting doesn't really/fully solve the problem. Also, this setting seems like a hack as CRLF conventions should follow the platform (and the rest of the code) by default, eg. LF only on Mac.

A simple way to test compliance is to compare the output plaintext with how the HTML is presented/broken in a browser.

The bug would be considered fixed when the output of the code is a solid block of text as expected.

Discussion

  • olalav

    olalav - 2022-04-19

    The text should say "The BR tag is not...". Evidently this editor interprets HTML tags as-is, and initial postings can't be edited :/

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-19
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -17,7 +17,7 @@
       echo wordwrap($plain, 80) . &#34;\n&#34;;
     ~~~
    
    -The &lt;br&gt; tag is not properly interpreted/removed, and the parser leaves CR characters in the plaintext, so the resulting text is a mess, eg.
    +The `&lt;br&gt;` tag is not properly interpreted/removed, and the parser leaves CR characters in the plaintext, so the resulting text is a mess, eg.
    
     1. Lines are not wrapped, and become much longer than 80 chars
     2. Space characters at the start of a line are not removed
    
    • status: open --> accepted
    • assigned_to: LogMANOriginal
     
  • LogMANOriginal

    LogMANOriginal - 2022-04-19

    Thanks for reporting. I fixed your original message.

    You are right, the current implementation of <br> is wrong. I haven't tested this yet but it should give slightly better results if you define DEFAULT_BR_TEXT like this:

    define("DEFAULT_BR_TEXT", PHP_EOL)

    At the very least, this makes it platform independent.
    That said, there is additional work to do in the parser to handle all cases (like the <br> a case).

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-24

    Please try again with current master. From what I can tell, the output looks right:

    ***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
    ********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** *******
    *** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********,
    ** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ******
    *** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ********
    ***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,****
    ** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. *****
    ** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** ****
    ** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******.
    *** *** **** ******** ***** ***** *** **** *** ******* **********  **********.
    ** ** *************** *** ****** **********.*** ******** ** ** *** *******
    BR TAGS IN THE MIDDLE LIKE THIS
    SHOULD ACTUALLY BREAK.
    THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.
    
     
  • olalav

    olalav - 2022-04-24

    Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely?

      $text = "<p>Hello" . "\n" . "World</p>";
      $plain = str_get_html($text)->plaintext;
      echo "PLAINTEXT:\n" . $plain . "\n\n";
      echo "WORDWRAP:\n" . wordwrap($plain, 80) . "\n";
    

    PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't remove this, because I use it as a reference all the time.

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-24

    Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely?

    The plaintext implementation is completely rewritten but it passes all tests. Your particular case probably isn't covered by any of the tests right now. I'll check this as well.

    At the very least <br> seems to work right.

    PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't remove this, because I use it as a reference all the time.

    The old site is still available here: https://simplehtmldom.sourceforge.io/docs/1.0/

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-24

    [67c0f4e21091a9cc66151610a653724a0acb1b69] fixes the whitespace issue. Let me know if this works for you.

     

    Related

    Commit: [67c0f4]

  • olalav

    olalav - 2022-04-24

    The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line.

    PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-24

    PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support

    Good idea! I'll do that.

    The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line.

    I'm comparing the output of plaintext with what is displayed in the browser and it looks exactly the same. Please note that I have removed wordwrap() to ensure that it aligns with what is displayed in the browser.

    <?php
    
    include_once 'simple_html_dom.php';
    
    $text = '<p>***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
    <br>********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** *******
    <br>*** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********,
    <br>** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ******
    <br>*** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ********
    <br>***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,****
    <br>** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. *****
    <br>** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** ****
    <br>** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******.
    <br>*** *** **** ******** ***** ***** *** **** *** ******* **********  **********.
    <br>** ** *************** *** ****** **********.*** ******** ** ** *** *******
    <br>BR TAGS IN THE MIDDLE LIKE THIS <br> SHOULD ACTUALLY BREAK.
    <br> THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.</p>';
    // $text = preg_replace("/\n+<br>/", "\n", $text); // un-comment to workaround bug
    $plain = str_get_html($text)->plaintext;
    //echo wordwrap($plain, 80) . "\n";
    echo $plain . "\n";
    

    Output

    ***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
    ********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** *******
    *** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********,
    ** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ******
    *** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ********
    ***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,****
    ** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. *****
    ** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** ****
    ** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******.
    *** *** **** ******** ***** ***** *** **** *** ******* ********** **********.
    ** ** *************** *** ****** **********.*** ******** ** ** *** *******
    BR TAGS IN THE MIDDLE LIKE THIS
    SHOULD ACTUALLY BREAK.
    THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.
    

    The following should display the same output rendered by your browser:

    ***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
    ********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** *******
    *** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********,
    ** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ******
    *** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ********
    ***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,****
    ** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. *****
    ** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** ****
    ** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******.
    *** *** **** ******** ***** ***** *** **** *** ******* ********** **********.
    ** ** *************** *** ****** **********.*** ******** ** ** *** *******
    BR TAGS IN THE MIDDLE LIKE THIS
    SHOULD ACTUALLY BREAK.
    THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.

    When I uncomment your workaround, it returns this:

    ***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** ***** ********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** ******* *** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********, ** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ****** *** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ******** ***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,**** ** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. ***** ** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** **** ** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******. *** *** **** ******** ***** ***** *** **** *** ******* ********** **********. ** ** *************** *** ****** **********.*** ******** ** ** *** ******* BR TAGS IN THE MIDDLE LIKE THIS
    SHOULD ACTUALLY BREAK. THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.
    

    With word wrapping it looks like this:

    ***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** ***
    ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
    ********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ********
    ******* *** ******** ********** * ** ******* ****. ******** ** *** *** ****
    ***********, ** *** *** *** * *** *** ********* .*** ***** ******* *** ** ****
    ***.*** ****** *** ** **** *** ******** *** *** *********** *** **** .*** ***
    **** *** ******** ***** ******,********* ** ** *** ** ********* *** *** ** ***
    **** *** *** ****,**** ** *** ****** *** **** ***** * *******. ***** ****** ***
    *** ******* *** ****. ***** ** *** *** *** ***** ** ****** *****. ****** *** ***
    ***** ***** ** ** ******** **** ** *** ********* *** ***** ******** *** ** **
    ***** **** *** *** ******. *** *** **** ******** ***** ***** *** **** ***
    ******* ********** **********. ** ** *************** *** ****** **********.***
    ******** ** ** *** ******* BR TAGS IN THE MIDDLE LIKE THIS
    SHOULD ACTUALLY BREAK. THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE
    REMOVED.
    

    What am I missing?

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-24

    PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support

    Turns out that page is managed by SF. There is no way to change the contents of that page 😔
    I added a "Manual" tab instead.

     
  • olalav

    olalav - 2022-04-24

    Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace().

    diff --git a/HtmlNode.php b/HtmlNode.php
    index 9bc6a1a..9649d37 100644
    --- a/HtmlNode.php
    +++ b/HtmlNode.php
    @@ -351 +351 @@ class HtmlNode
    - $ret = preg_replace('/\s+/', ' ', $ret);
    + $ret = preg_replace('/\s+/u', ' ', $ret);

    And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only the documentation for the functions (isset etc.) is immediately visible, not the more useful "Quick Start", etc. on top (you have to scroll up to see that). Very strange. So I see that all the useful tips are preserved here, and then I obviously don't need the old manual :)

     

    Last edit: olalav 2022-04-24
  • LogMANOriginal

    LogMANOriginal - 2022-04-25

    Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace().

    Good catch. Fixed via [b8d048e46b7f1964c28ea041d39ccb1d05f9a0ed].

    And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only the documentation for the functions (isset etc.) is immediately visible, not the more useful "Quick Start", etc. on top (you have to scroll up to see that). Very strange. So I see that all the useful tips are preserved here, and then I obviously don't need the old manual :)

    Yes that is a problem with MkDocs which is very annoying. I spend hours trying to figure out a way to make this work. For version 2.0 the site structure is changed, which fixes that issue. A preview is online: https://simplehtmldom.sourceforge.io/docs/2.0/

     

    Related

    Commit: [b8d048]

  • LogMANOriginal

    LogMANOriginal - 2022-04-25
    • status: accepted --> closed
     

Log in to post a comment.