#160 Parsing fails with '<-' + '/' symbols combination in string

closed
None
2018-12-06
2017-02-09
No

Parsing fails when a text node contains '<-' followed later by a '/' character.

Example HTML page:

<html><body>
<div>
  <a>
    <span> ---> Lorem ipsum <--- dolor sit amet / at volutpat </span>
    <span> Lorem ipsum dolor sit amet </span>
    <span class="foo_1">Bar 1</span>
  </a>
  <div class="foo_2">Bar 2</div>
</div>
<span class="foo_3">Bar 3</span>
</body></html>

Test script (example page stored in $data):

$dom = str_get_html($data);
$tmp = $dom->find('div/span', 0);
var_dump($tmp->plaintext);

Result:

string(80) " ---> Lorem ipsum       Lorem ipsum dolor sit amet       Bar 1    </a>   Bar 2  "

Tested on:
simplehtmldom v.1.5 rev 196 & 210
PHP 5.6.17

Discussion

  • Alex Kozlovsky

    Alex Kozlovsky - 2017-02-09

    My quick workaround is to replace the '/' with '|' before parsing:

    $data = preg_replace('#(<-[^<>]*)\/([^<>]*<\s*\/[^\/]+>)#uis', '$1|$2', $data);
    
     

    Last edit: Alex Kozlovsky 2017-02-09
  • LogMANOriginal

    LogMANOriginal - 2018-12-04

    Thanks for reporting this issue. I've added a test to check for this behavior in the future. Please note that a bare "<" in text is invalid according to https://validator.w3.org/#validate_by_input

    It correctly suggests escaping < to &lt;, which solves your issue.
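
The escaping step can be sketched roughly as below, for input that cannot be fixed at the source. This is only an illustration: `escape_stray_lt` is a hypothetical helper, not part of simplehtmldom, and it assumes that any '<' not followed by a letter, '/', '!' or '?' is plain text. It would also rewrite such characters inside script or style blocks, so it is not a general-purpose sanitizer.

```php
<?php
// Hypothetical helper (not part of simplehtmldom).
// Escapes every '<' that cannot begin a tag, closing tag, comment,
// doctype or processing instruction, leaving real markup untouched.
function escape_stray_lt($html)
{
    return preg_replace('#<(?![A-Za-z/!?])#', '&lt;', $html);
}

// The offending line from the example page in this report.
$line = '<span> ---> Lorem ipsum <--- dolor sit amet / at volutpat </span>';
echo escape_stray_lt($line), PHP_EOL;
```

The result can then be passed to str_get_html() in place of the raw $data.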

    That being said, the parser now considers tags starting with "<-" invalid (as does the HTML Specification). These tags are now correctly added as text. Let me know if you experience further issues.

    Fixed via [b1bade]

     

    Related

    Commit: [b1bade]

  • LogMANOriginal

    LogMANOriginal - 2018-12-04
    • status: open --> closed
    • assigned_to: LogMANOriginal
     
