PHP Simple HTML DOM Parser / Bugs / #163 Missing whitespace in plaintext property

Anonymous - 2019-02-02

Similarly, the following produces "World.Hello". It should produce "World. Hello".

$file = '<a href="">World. </a>Hello';

LogMANOriginal - 2019-02-03

Confirmed. Thanks for reporting this issue!

The bug was introduced in [5d9b34], where the final trim() in the function text() on line 658 removes whitespace from all elements because it is called recursively on line 648.
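
To illustrate the mechanism, here is a minimal sketch using a toy node structure (nested arrays). It is not the library's actual code, and the function names text_buggy, text_raw and text_fixed are hypothetical. Trimming inside the recursive function strips the trailing space of every child element, while trimming once at the top level preserves it.

```php
<?php
// Toy model: a node is either a string (text) or an array of child nodes.
// This is NOT the simple_html_dom implementation, only an illustration.

// Buggy variant: trim() runs at every level of the recursion,
// so the trailing space inside a child element is lost.
function text_buggy($node): string
{
    if (is_string($node)) {
        return $node;
    }
    $out = '';
    foreach ($node as $child) {
        $out .= text_buggy($child);
    }
    return trim($out);
}

// Collect the text recursively without trimming anything.
function text_raw($node): string
{
    if (is_string($node)) {
        return $node;
    }
    $out = '';
    foreach ($node as $child) {
        $out .= text_raw($child);
    }
    return $out;
}

// Fixed variant: trim only once, on the final result.
function text_fixed($node): string
{
    return trim(text_raw($node));
}

// Rough equivalent of '<a href="">World. </a>Hello'
$tree = [['World. '], 'Hello'];

echo text_buggy($tree), "\n"; // World.Hello
echo text_fixed($tree), "\n"; // World. Hello
```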

I'm working on a fix...

Related

Commit: [5d9b34]


LogMANOriginal - 2019-02-03

Should be fixed in [28bc69]. Let me know if it works for you.

Related

Commit: [28bc69]


LogMANOriginal - 2019-02-03

assigned_to: LogMANOriginal

Anonymous - 2019-02-04

Seems to work! However, the example below produces multiple whitespace characters where there should be only one. I don't know if this is a different bug, but it's at least somewhat related.

$str = 'I am saying  <a href=""> Hello World </a>  to you.';
$html = str_get_html($str);
echo $html->find("p", 0)->plaintext . "\n";

I am saying Hello World to you.

LogMANOriginal - 2019-02-04

Thanks for the feedback, I'm glad to hear it works for you.

The whitespace is actually part of the HTML string.
There are multiple parts to it, so bear with me:

1) Newlines are replaced by whitespace before parsing. You can turn that off by setting the parameter $stripRN = false when calling str_get_html.

2) Whitespace is added to the end of span elements to prevent consecutive span elements from running into each other (so that spans containing "This", "is", "a" and "test" result in "This is a test"). You can turn that off by setting the parameter $defaultSpanText = '' when calling str_get_html.

Here is what you get if you turn off both parameters:

<?php
require_once 'simple_html_dom.php';

$str = 'I am saying  <a href=""> Hello World </a>  to you.';
$html = str_get_html($str, true, true, DEFAULT_TARGET_CHARSET, false, DEFAULT_BR_TEXT, '');
echo $html->find("p", 0)->plaintext . "\n";

// I am saying
//
//
//
// Hello World
//
//
//
// to you.

This looks better than before, but it still contains too many newlines. Note that this output is the exact representation of your original HTML string without the tags. This becomes obvious when comparing the output to ->innertext:

I am saying  <a href=""> Hello World </a>  to you.

The newlines in the output are in fact from the original HTML. Only the tags were removed. Of course, the easiest way to fix that is to do preg_replace('!\s+!', ' ', $input):

<?php
require_once 'simple_html_dom.php';

$str = 'I am saying  <a href=""> Hello World </a>  to you.';
$html = str_get_html($str, true, true, DEFAULT_TARGET_CHARSET, true, DEFAULT_BR_TEXT, '');
echo preg_replace('!\s+!', ' ', $html->find("p", 0)->plaintext) . "\n";

// I am saying Hello World to you.

I'm actually not sure if this can be automated by the parser, because newlines may actually be desired depending on the contents.
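
For reference, the preg_replace workaround can be wrapped in a small standalone helper. This is only a sketch: the function name collapse_whitespace is hypothetical and not part of the parser, and the input string merely imitates a plaintext value that still contains the newlines from the source HTML.

```php
<?php
// Hypothetical helper, not part of simple_html_dom:
// collapse every run of whitespace (spaces, tabs, newlines) into one space.
function collapse_whitespace(string $text): string
{
    $collapsed = preg_replace('!\s+!', ' ', $text);

    // preg_replace() returns null on error; fall back to the input then.
    return trim($collapsed ?? $text);
}

// Imitates a plaintext value with newlines left over from the HTML source.
$plaintext = "I am saying\n\n  Hello World \n\n to you.";

echo collapse_whitespace($plaintext), "\n"; // I am saying Hello World to you.
```

Keep in mind that this also flattens newlines that may be intentional, for example text taken from a pre element.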

Anonymous - 2019-02-04

I think what I would mostly expect is for the plain text to look like the text as displayed in the browser, in other words, a single whitespace character no matter how many are in the source. Possibly, there may be instances where multiple whitespace characters are desired (like you hint at), but I can't think of any at the moment. Of course, the replace will work around the problem.


LogMANOriginal - 2019-02-04

I agree, the output should be as close to the browser as possible. The current implementation, however, is still very limited. I think it's possible to entirely skip elements with no contents, which should get rid of excessive newlines.
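
The idea of skipping elements with no contents could look roughly like the sketch below. It works on a plain array of text fragments rather than on real parser nodes, and the function name join_non_empty is hypothetical. It is a simplification under those assumptions, not what the parser does.

```php
<?php
// Hypothetical sketch, not the parser's implementation:
// drop fragments that contain only whitespace and join the rest.
function join_non_empty(array $fragments): string
{
    $kept = [];
    foreach ($fragments as $fragment) {
        $fragment = trim($fragment);
        if ($fragment !== '') {
            $kept[] = $fragment;
        }
    }

    return implode(' ', $kept);
}

// Text fragments as they might come from consecutive child nodes.
$fragments = ["I am saying\n\n", " Hello World ", "\n\n", "to you."];

echo join_non_empty($fragments), "\n"; // I am saying Hello World to you.
```

One limitation of this simplification: joining with a space would wrongly split words that are partly wrapped in an inline element, so a real implementation would need to track whether whitespace was actually present between fragments.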


LogMANOriginal - 2022-04-09

status: open --> closed

LogMANOriginal - 2022-04-09

966c5e39493eff7dc1eb77e0004bdc0015037b34 fixes various issues related to spaces and line breaks when generating plain text. For all the examples provided here, it produces the correct output. It also properly collapses superfluous spaces and line breaks, so that the output should be much more readable, especially for awkwardly formatted HTML documents.

Related

Commit: [966c5e]


Anonymous - 2025-11-06

Post awaiting moderation.
