$text = '<p>***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
<br>********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** *******
<br>*** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********,
<br>** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ******
<br>*** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ********
<br>***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,****
<br>** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. *****
<br>** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** ****
<br>** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******.
<br>*** *** **** ******** ***** ***** *** **** *** ******* ********** **********.
<br>** ** *************** *** ****** **********.*** ******** ** ** *** *******
<br>BR TAGS IN THE MIDDLE LIKE THIS <br> SHOULD ACTUALLY BREAK.
<br> THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.</p>';
// $text = preg_replace("/\n+<br>/", "\n", $text); // un-comment to workaround bug
$plain = str_get_html($text)->plaintext;
echo wordwrap($plain, 80) . "\n";
The <br> tag is not properly interpreted/removed, and the parser leaves CR characters in the plaintext, so the resulting text is a mess, eg.
I know there is a define("DEFAULT_BR_TEXT", "\r\n"), but changing the setting doesn't really/fully solve the problem. Also, this setting seems like a hack as CRLF conventions should follow the platform (and the rest of the code) by default, eg. LF only on Mac.
A simple way to test compliance is to compare the output plaintext with how the HTML is presented/broken in a browser.
The bug would be considered fixed when the output of the code is a solid block of text as expected.
The text should say "The BR tag is not...". Evidently this editor interprets HTML tags as-is, and initial postings can't be edited :/
Diff:
Thanks for reporting. I fixed your original message.
You are right, the current implementation of
<br>is wrong. I haven't tested this yet but it should give slightly better results if you defineDEFAULT_BR_TEXTlike this:define("DEFAULT_BR_TEXT", PHP_EOL)At the very least, this makes it platform independent.
That said, there is additional work to do in the parser to handle all cases (like the
<br> acase).Please try again with current master. From what I can tell, the output looks right:
Shouldn't plaintext convert newlines to spaces? Did you change this recently? Surely this is a bug/regression, or am I missing something completely?
PS! Where on the SourceForge page is the link to the manual (the one with the clickable tabs with examples, etc.)? I hope you didn't remove this, because I use it as a reference all the time.
The plaintext implementation is completely rewritten but it passes all tests. Your particular case probably isn't covered by any of the tests right now. I'll check this as well.
At the very least
<br>seems to work right.The old site is still available here: https://simplehtmldom.sourceforge.io/docs/1.0/
[67c0f4e21091a9cc66151610a653724a0acb1b69] fixes the whitespace issue. Let me know if this works for you.
Related
Commit: [67c0f4]
The space thing works now, but the BR tag is still not handled well. Try the code in the original post and compare the output when (un)-commenting the commented line.
PS! Would be nice if you could link to the manual from the "Support" section, because it was hard to find. https://sourceforge.net/projects/simplehtmldom/support
Good idea! I'll do that.
I'm comparing the output of plaintext with what is displayed in the browser and it looks exactly the same. Please note that I have removed
wordwrap()to ensure that it aligns with what is displayed in the browser.Output
The following should display the same output rendered by your browser:
***** ** ********,*** ****** ********. ******* **** *** ** ** ******* *** *** ****** ***. *** **** *** ****** ****. *.***. ** **** *** ***** **** *****
********* *** ************ ** *** *** *** *** ** ****. *** ** ****, ******** *******
*** ******** ********** * ** ******* ****. ******** ** *** *** **** ***********,
** *** *** *** * *** *** ********* .*** ***** ******* *** ** **** ***.*** ******
*** ** **** *** ******** *** *** *********** *** **** .*** *** **** *** ********
***** ******,********* ** ** *** ** ********* *** *** ** *** **** *** *** ****,****
** *** ****** *** **** ***** * *******. ***** ****** *** *** ******* *** ****. *****
** *** *** *** ***** ** ****** *****. ****** *** *** ***** ***** ** ** ******** ****
** *** ********* *** ***** ******** *** ** ** ***** **** *** *** ******.
*** *** **** ******** ***** ***** *** **** *** ******* ********** **********.
** ** *************** *** ****** **********.*** ******** ** ** *** *******
BR TAGS IN THE MIDDLE LIKE THIS
SHOULD ACTUALLY BREAK.
THE SPACE CHARACTER AT THE START OF THIS LINE SHOULD BE REMOVED.
When I uncomment your workaround, it returns this:
With word wrapping it looks like this:
What am I missing?
Turns out that page is managed by SF. There is no way to change the contents of that page 😔
I added a "Manual" tab instead.
Looks good now! However, you must set the Unicode flag, or else preg_replace() may return an invalid Unicode string, which may cause the second preg_replace() to return NULL, and a deprecation error for the third preg_replace().
diff --git a/HtmlNode.php b/HtmlNode.php
index 9bc6a1a..9649d37 100644
--- a/HtmlNode.php
+++ b/HtmlNode.php
@@ -351 +351 @@ class HtmlNode
And about the manual: I see now that the navigation sidebar is aligned far down upon page load, so that only the documentation for the functions (isset etc.) is immediately visible, not the more useful "Quick Start", etc. on top (you have to scroll up to see that). Very strange. So I see that all the useful tips are preserved here, and then I obviously don't need the old manual :)
Last edit: Anonymous 2022-04-24
Good catch. Fixed via [b8d048e46b7f1964c28ea041d39ccb1d05f9a0ed].
Yes that is a problem with MkDocs which is very annoying. I spend hours trying to figure out a way to make this work. For version 2.0 the site structure is changed, which fixes that issue. A preview is online: https://simplehtmldom.sourceforge.io/docs/2.0/
Related
Commit: [b8d048]
DELETE THIS BUG REPORT