Anonymous
-
2019-02-04
Seems to work! However, the example below produces several whitespace characters where there should be only one. I don't know if this is a different bug, but it's at least somewhat related.
Thanks for the feedback, I'm glad to hear it works for you.
The whitespace is actually part of the HTML string.
There are multiple parts to it, so bear with me:
1) Newlines are replaced by whitespace before parsing. You can turn that off by setting the parameter $stripRN = false when calling str_get_html.
2) Whitespace is added to the end of span elements in order to prevent multiple span elements running into each other (so <p>This <span>is</span><span>a</span>test</p> results in This is a test). You can turn that off by setting the parameter $defaultSpanText = '' when calling str_get_html.
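To make the two behaviours above concrete, here is a minimal sketch in plain PHP. It does not use the library; `strip_line_breaks()`, `join_fragments()` and the fragment format are hypothetical helpers that only model what `$stripRN` and `$defaultSpanText` are described as doing.

```php
<?php
// Illustrative only: these helpers are hypothetical and are not the library's code.

// Models $stripRN = true: line breaks become spaces before parsing.
function strip_line_breaks(string $html): string
{
    return str_replace(["\r", "\n"], ' ', $html);
}

// Models $defaultSpanText: each fragment is [text, isSpan], and the
// separator is appended after every fragment that came from a <span>.
function join_fragments(array $fragments, string $defaultSpanText = ' '): string
{
    $out = '';
    foreach ($fragments as [$text, $isSpan]) {
        $out .= $text . ($isSpan ? $defaultSpanText : '');
    }
    return $out;
}

// Hand-written fragments for <p>This <span>is</span><span>a</span>test</p>
$fragments = [['This ', false], ['is', true], ['a', true], ['test', false]];

echo join_fragments($fragments), "\n";     // This is a test
echo join_fragments($fragments, ''), "\n"; // This isatest
echo strip_line_breaks("I am saying\nHello World"), "\n"; // I am saying Hello World
```

With the separator set to an empty string, the adjacent spans run into each other, which is the case the default span text is meant to prevent.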
Here is what you get if you turn off the two parameters:

<?php
require_once 'simple_html_dom.php';

$str = '<p>I am saying<!-- --><span><a href="">Hello World</a></span><!-- -->to you.</p>';

// 5th argument: $stripRN = false, 7th argument: $defaultSpanText = ''
$html = str_get_html($str, true, true, DEFAULT_TARGET_CHARSET, false, DEFAULT_BR_TEXT, '');

echo $html->find("p", 0)->plaintext . "\n";
// I am saying
//
//
//
// Hello World
//
//
//
// to you.
This looks better than before, but it still contains too many newlines. Note that this output is the exact representation of your original HTML string without the tags. This becomes obvious when comparing the output to ->innertext.
The newlines in the output are in fact from the original HTML. Only the tags were removed. Of course, the easiest way to fix that is to do preg_replace('!\s+!', ' ', $input);

<?php
require_once 'simple_html_dom.php';

$str = '<p>I am saying<!-- --><span><a href="">Hello World</a></span><!-- -->to you.</p>';

$html = str_get_html($str, true, true, DEFAULT_TARGET_CHARSET, true, DEFAULT_BR_TEXT, '');

echo preg_replace('!\s+!', ' ', $html->find("p", 0)->plaintext) . "\n";
// I am saying Hello World to you.
I'm actually not sure if this can be automated by the parser, because newlines may actually be desired depending on the contents.
Anonymous
-
2019-02-04
I think what I would mostly expect is for the plain text to look like the text as displayed in the browser, in other words, a single space no matter how many whitespace characters are in the source. Possibly, there may be instances where multiple whitespace characters are desired (like you hint at), but I can't think of any at the moment. Of course the replace will work around the problem.
I agree, the output should be as close to the browser as possible. The current implementation, however, is still very limited. I think it's possible to entirely skip elements with no contents, which should get rid of excessive newlines.
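As a rough sketch of that idea (not the parser's actual implementation), the snippet below drops text fragments that contain nothing but whitespace and joins the rest with a single space. `join_visible_text()` and the fragment list are hypothetical and only stand in for the text pieces a parser would collect.

```php
<?php
// Hypothetical helper, not part of simple_html_dom.
function join_visible_text(array $fragments): string
{
    $parts = [];
    foreach ($fragments as $fragment) {
        // Collapse internal whitespace runs, then trim the edges.
        $clean = trim(preg_replace('/\s+/', ' ', $fragment));
        if ($clean !== '') { // skip fragments with no visible content
            $parts[] = $clean;
        }
    }
    return implode(' ', $parts);
}

// Assumed fragments, similar in shape to the output shown earlier in this thread.
$fragments = ["I am saying\n", "\n", "\n", "Hello World", "\n", "\n", "to you."];

echo join_visible_text($fragments), "\n"; // I am saying Hello World to you.
```

One limitation of this sketch: it always puts a space between fragments, so inline elements that touch without whitespace in the source would be separated too. A real implementation would need to look at the element type before deciding.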
966c5e39493eff7dc1eb77e0004bdc0015037b34 fixes various issues related to spaces and line breaks when generating plain text. For all the examples provided here, it produces the correct output. It also properly collapses superfluous spaces and line breaks, so that the output should be much more readable, especially for awkwardly formatted HTML documents.
Similarly, the following produces "World.Hello". Should produce "World. Hello".
Confirmed. Thanks for reporting this issue!
The bug was introduced in [5d9b34], where the final trim() in the function text() on line 658 removes whitespace from all elements because it is called recursively on line 648.
I'm working on a fix...
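The effect of trimming inside a recursive function can be reproduced with a small standalone sketch. This is not the library's text() function; the nested-array node format, the function names and the sample tree are all assumptions made for illustration.

```php
<?php
// Hypothetical node format: a string is a text node,
// an array is the list of an element's children.

// Buggy shape: trim() runs at every recursion level.
function text_trim_everywhere($node): string
{
    if (is_string($node)) {
        return $node;
    }
    $out = '';
    foreach ($node as $child) {
        $out .= text_trim_everywhere($child);
    }
    return trim($out);
}

// Fixed shape: the recursion keeps whitespace intact ...
function text_raw($node): string
{
    if (is_string($node)) {
        return $node;
    }
    $out = '';
    foreach ($node as $child) {
        $out .= text_raw($child);
    }
    return $out;
}

// ... and only the outermost call trims.
function text_trim_once($node): string
{
    return trim(text_raw($node));
}

// Assumed input, roughly <p><span>World. </span><span>Hello</span></p>
$tree = [['World. '], ['Hello']];

echo text_trim_everywhere($tree), "\n"; // World.Hello
echo text_trim_once($tree), "\n";       // World. Hello
```

Because every nested element is trimmed before its text is concatenated, the space that separates the two sentences is lost in the first variant.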
Related
Commit: [5d9b34]
Should be fixed in [28bc69]. Let me know if it works for you.
Related
Commit: [28bc69]
Related
Commit: [966c5e]