#163 Missing whitespace in plaintext property

closed
None
2025-11-06
2019-02-02
No
  require_once 'simple_html_dom.php';

  $file = 'Hello<a href=""> World';
  $html = str_get_html($file);
  echo $html->plaintext . "\n";

Produces "HelloWorld".

Should produce "Hello World".

Related

Feature Requests: #52

Discussion

  • Anonymous

    Anonymous - 2019-02-02

    Similarly, the following produces "World.Hello". Should produce "World. Hello".

    $file = '<a href="">World. </a>Hello';
    
     
  • LogMANOriginal

    LogMANOriginal - 2019-02-03

    Confirmed. Thanks for reporting this issue!

    The bug was introduced in [5d9b34]. The final trim() in the function text() on line 658 removes leading and trailing whitespace from every element, not just the outermost one, because text() calls itself recursively on line 648.
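
    To illustrate the effect, here is a minimal sketch on a toy tree. It is not the library's code: the nested-array tree and the functions text_buggy(), text_inner() and text_fixed() are hypothetical names made up for this example. It only shows why a trim() inside a recursive function strips whitespace at element boundaries, and how trimming once at the top level avoids that.

```php
<?php
// Toy model: a node is either a string (text) or an array of child nodes.

// Trims at every recursion level, so each nested element loses its
// leading and trailing whitespace before it is concatenated.
function text_buggy($node)
{
    if (is_string($node)) {
        return $node;
    }
    $out = '';
    foreach ($node as $child) {
        $out .= text_buggy($child);
    }
    return trim($out);
}

// Concatenates text without trimming anything.
function text_inner($node)
{
    if (is_string($node)) {
        return $node;
    }
    $out = '';
    foreach ($node as $child) {
        $out .= text_inner($child);
    }
    return $out;
}

// Trims exactly once, on the final result.
function text_fixed($node)
{
    return trim(text_inner($node));
}

// Rough equivalents of 'Hello<a href=""> World' and '<a href="">World. </a>Hello'
$first = array('Hello', array(' World'));
$second = array(array('World. '), 'Hello');

echo text_buggy($first) . "\n";  // HelloWorld
echo text_fixed($first) . "\n";  // Hello World
echo text_buggy($second) . "\n"; // World.Hello
echo text_fixed($second) . "\n"; // World. Hello
```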

    I'm working on a fix...

     

    Related

    Commit: [5d9b34]

  • LogMANOriginal

    LogMANOriginal - 2019-02-03

    Should be fixed in [28bc69]. Let me know if it works for you.

     

    Related

    Commit: [28bc69]

  • LogMANOriginal

    LogMANOriginal - 2019-02-03
    • assigned_to: LogMANOriginal
     
  • Anonymous

    Anonymous - 2019-02-04

    Seems to work! However, the example below produces several consecutive spaces where there should be only one. I don't know if this is a different bug, but it's at least somewhat related.

    $str = '<p>I am saying
    <!-- -->
    <span>
    <a href="">
    Hello World
    </a>
    </span>
    <!-- -->
    to you.</p>';
    $html = str_get_html($str);
    echo $html->find("p", 0)->plaintext . "\n";
    
    I am saying    Hello World     to you.
    
     
  • LogMANOriginal

    LogMANOriginal - 2019-02-04

    Thanks for the feedback, I'm glad to hear it works for you.

    The whitespace is actually part of the HTML string.
    There are multiple parts to it, so bear with me:

    1) Newlines are replaced by whitespace before parsing. You can turn that off by setting the parameter $stripRN = false when calling str_get_html.

    2) Whitespace is added to the end of span elements in order to prevent multiple span elements running into each other (so <p>This <span>is</span><span>a</span>test</p> results in This is a test). You can turn that off by setting the parameter $defaultSpanText = '' when calling str_get_html.

    Here is what you get if you turn off both parameters:

    <?php
    
    require_once 'simple_html_dom.php';
    
    $str = '<p>I am saying
    <!-- -->
    <span>
    <a href="">
    Hello World
    </a>
    </span>
    <!-- -->
    to you.</p>';
    
    $html = str_get_html($str, true, true, DEFAULT_TARGET_CHARSET, false, DEFAULT_BR_TEXT, '');
    
    echo $html->find("p", 0)->plaintext . "\n";
    
    // I am saying
    // 
    //
    //
    // Hello World
    // 
    //
    //
    // to you.
    

    This looks better than before, but it still contains too many newlines. Note that this output is the exact representation of your original HTML string without the tags. This becomes obvious when comparing the output to ->innertext:

    I am saying
    <!-- -->
    <span>
    <a href="">
    Hello World
    </a>
    </span>
    <!-- -->
    to you.
    

    The newlines in the output come from the original HTML; only the tags were removed. The easiest fix is to collapse the whitespace with preg_replace('!\s+!', ' ', $input):

    <?php
    
    require_once 'simple_html_dom.php';
    
    $str = '<p>I am saying
    <!-- -->
    <span>
    <a href="">
    Hello World
    </a>
    </span>
    <!-- -->
    to you.</p>';
    
    $html = str_get_html($str, true, true, DEFAULT_TARGET_CHARSET, true, DEFAULT_BR_TEXT, '');
    
    echo preg_replace('!\s+!', ' ', $html->find("p", 0)->plaintext) . "\n";
    
    // I am saying Hello World to you.
    

    I'm not sure this can be automated by the parser, because newlines may be desired depending on the content.
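
    For cases where line breaks should survive, a variant of the workaround can collapse only horizontal whitespace and reduce repeated line breaks to one. This is a minimal sketch, not part of simple_html_dom; the helper name collapse_keep_lines() is hypothetical, and it assumes a single line break is the only vertical spacing worth keeping.

```php
<?php
// Hypothetical helper, not part of simple_html_dom.
function collapse_keep_lines(string $text): string
{
    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);
    // Normalize every line break to "\n" and drop a space on either side of it.
    $text = preg_replace('/ ?\R ?/', "\n", $text);
    // Reduce consecutive line breaks to one.
    $text = preg_replace('/\n{2,}/', "\n", $text);
    // Trim once, on the final result only.
    return trim($text);
}

$plain = "I am saying\n\n\n\nHello World\n\n\n\nto you.";
echo collapse_keep_lines($plain) . "\n";
// I am saying
// Hello World
// to you.
```

    Whether to keep line breaks or flatten everything to single spaces with '!\s+!' depends on how the text is used afterwards.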

     
  • Anonymous

    Anonymous - 2019-02-04

    What I would mostly expect is for the plain text to look like the text as displayed in the browser: a single space no matter how many whitespace characters are in the source. There may be instances where multiple whitespace characters are desired (as you hint at), but I can't think of any at the moment. Of course, the replace will work around the problem.

     
  • LogMANOriginal

    LogMANOriginal - 2019-02-04

    I agree, the output should be as close to the browser as possible. The current implementation, however, is still very limited. I think it's possible to entirely skip elements with no contents, which should get rid of excessive newlines.

     
  • LogMANOriginal

    LogMANOriginal - 2022-04-09
    • status: open --> closed
     
  • LogMANOriginal

    LogMANOriginal - 2022-04-09

    966c5e39493eff7dc1eb77e0004bdc0015037b34 fixes various issues related to spaces and line breaks when generating plain text. For all the examples provided here, it produces the correct output. It also properly collapses superfluous spaces and line breaks, so that the output should be much more readable, especially for awkwardly formatted HTML documents.

     

    Related

    Commit: [966c5e]
