Tracker: Bugs

5 assertWantedText matches javascript source code - ID: 1671539
Last Update: Comment added ( pp11 )

In the web tester, assertWantedText() will match text which appears in JavaScript source code. See the attached zip for an example test case, which should be unzipped into the root of your web server to run.

I would argue that anything inside <script></script> is not "browser visible text" and therefore should not be returned.
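
To make the problem concrete, here is a minimal standalone sketch. It does not use SimpleTest itself: strip_tags_only() and the sample page are hypothetical and only mimic a plain tag-stripping step, so treat it as an illustration rather than the actual parser code.

```php
<?php
// Hypothetical stand-in for a naive tag-stripping step: it removes the tags
// but keeps whatever sat between them, including JavaScript source.
function strip_tags_only($html) {
    $text = preg_replace('|<.*?>|s', '', $html);
    $text = preg_replace('|\s+|', ' ', $text);
    return trim($text);
}

// Made-up page: the word only exists inside the script, never on screen.
$page = '<html><body><script>var msg = "hidden_word";</script><p>Hello</p></body></html>';

// Naive stripping leaves the script source in the "visible" text.
$text = strip_tags_only($page);
var_dump(strpos($text, 'hidden_word') !== false); // bool(true)

// Removing whole <script> elements first keeps only browser-visible text.
$fixed = strip_tags_only(preg_replace('|<script.*?>.*?</script>|s', '', $page));
var_dump($fixed); // "Hello"
```

With the naive version, a text assertion for "hidden_word" would pass even though a browser never displays it.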

I would propose modifying SimpleHtmlSaxParser::normalise( ) to the following:

function normalise($html) {
    $text = preg_replace('|<!--.*?-->|s', '', $html);
    $text = preg_replace('|<script.*?>.*?</script>|s', '', $text);
    $text = preg_replace('|<img.*?alt\s*=\s*"(.*?)".*?>|s', ' \1 ', $text);
    $text = preg_replace('|<img.*?alt\s*=\s*\'(.*?)\'.*?>|s', ' \1 ', $text);
    $text = preg_replace('|<img.*?alt\s*=\s*([a-zA-Z_]+).*?>|s', ' \1 ', $text);
    $text = preg_replace('|<.*?>|s', '', $text);
    $text = SimpleHtmlSaxParser::decodeHtml($text);
    $text = preg_replace('|\s+|', ' ', $text);
    return trim($text);
}

*NOTE*

As well as stripping the contents of <script> tags, I also suggest using the 's' modifier on these preg_replace calls. This makes the '.' in the regexes match newlines as well, and so caters for cases such as:

i) img tags spanning multiple lines
ii) HTML comments spanning multiple lines
iii) stripping out other tags spanning multiple lines
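
The effect of the 's' modifier can be checked in isolation. The following is a small sketch using a made-up HTML fragment with a comment that spans two lines. Only the comment-stripping pattern is exercised here.

```php
<?php
// Made-up fragment: an HTML comment that spans two lines.
$html = "before<!-- a comment\nspanning two lines -->after";

// Without 's', '.' does not match the newline, so the comment is never matched.
$without = preg_replace('|<!--.*?-->|', '', $html);

// With 's', '.' also matches newlines and the whole comment is removed.
$with = preg_replace('|<!--.*?-->|s', '', $html);

var_dump($without === $html); // bool(true): comment left in place
var_dump($with === 'beforeafter'); // bool(true): comment removed
```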

I am using simpletest_1.0.1beta.tar.gz (parser.php rev 1.66).

Best

David Heath


Submitted: David Heath ( dgheath ) - 2007-03-01 01:36:46 PST
Priority: 5
Status: Closed
Resolution: Fixed
Assigned: Perrick Penet
Category: Web tester
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-12-23 09:11:04 PST
Sender: pp11 (Project Admin)


Thank you for the bug report: it's now fixed in the SVN tree.

Yours, Perrick



Date: 2007-03-01 02:05:50 PST
Sender: dgheath


Further to that, the regexps in normalise() should also not rely on the
non-greedy match-all (.*?) to find content inside tags. Even without the 's'
modifier, there are cases where the <img> regexps will strip out too much.
Consider, for example, this HTML:

<img src="foo.png" /><p>some text</p><img src="bar.png" alt="bar" />

In this example "some text" would get stripped out by the img regexps
because the first img tag lacked an alt attribute.
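
This can be reproduced with just the double-quoted alt pattern, applied to the example line above. A minimal sketch, run without the 's' modifier to show that the modifier is not the cause:

```php
<?php
$html = '<img src="foo.png" /><p>some text</p><img src="bar.png" alt="bar" />';

// The lazy .*? starts at the first <img and keeps expanding until it finds
// an alt attribute, which only exists in the second tag.
$out = preg_replace('|<img.*?alt\s*=\s*"(.*?)".*?>|', ' \1 ', $html);

var_dump($out); // " bar ": the paragraph between the two images is gone
```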

To avoid this, most of the .*? should be changed to [^>]*, which gives:

function normalise($html) {
    $text = preg_replace('|<!--.*?-->|s', '', $html);
    $text = preg_replace('|<script[^>]*>.*?</script>|s', '', $text);
    $text = preg_replace('|<img[^>]*alt\s*=\s*"([^>]*)"[^>]*>|s', ' \1 ', $text);
    $text = preg_replace('|<img[^>]*alt\s*=\s*\'([^>]*)\'[^>]*>|s', ' \1 ', $text);
    $text = preg_replace('|<img[^>]*alt\s*=\s*([a-zA-Z_]+)[^>]*>|s', ' \1 ', $text);
    $text = preg_replace('|<[^>]*>|s', '', $text);
    $text = SimpleHtmlSaxParser::decodeHtml($text);
    $text = preg_replace('|\s+|', ' ', $text);
    return trim($text);
}
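
As a quick check, the sketch below applies the double-quoted alt pattern and the final tag-stripping pattern from this version to the earlier example line. It also shows one case that may be worth testing before adopting the change: because the capture group is a greedy [^>]*, an alt attribute followed by another quoted attribute appears to capture too much. The [^"]* variant at the end is only a suggested alternative, not part of the proposal above, and the second <img> tag is made up for illustration.

```php
<?php
$html = '<img src="foo.png" /><p>some text</p><img src="bar.png" alt="bar" />';

// [^>]* cannot cross a tag boundary, so the first <img> no longer swallows
// the paragraph that follows it.
$out = preg_replace('|<img[^>]*alt\s*=\s*"([^>]*)"[^>]*>|s', ' \1 ', $html);
$out = trim(preg_replace('|<[^>]*>|s', '', $out));
var_dump($out); // "some text bar"

// Made-up tag with a second quoted attribute after alt.
$tag = '<img alt="bar" title="x" />';

// Greedy [^>]* inside the quotes runs on to the last quote in the tag.
$greedy = preg_replace('|<img[^>]*alt\s*=\s*"([^>]*)"[^>]*>|s', ' \1 ', $tag);
var_dump($greedy); // ' bar" title="x '

// Suggested alternative (an assumption): stop the capture at the first quote.
$strict = preg_replace('|<img[^>]*alt\s*=\s*"([^"]*)"[^>]*>|s', ' \1 ', $tag);
var_dump($strict); // " bar "
```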



Attached File ( 1 )

Filename              Description
spelling_example.zip  This test unexpectedly passes on current simpletest version

Changes ( 5 )

Field          Old Value                     Date                     By
status_id      Open                          2007-12-23 09:11:04 PST  pp11
resolution_id  None                          2007-12-23 09:11:04 PST  pp11
assigned_to    lastcraft                     2007-12-23 09:11:04 PST  pp11
close_date     -                             2007-12-23 09:11:04 PST  pp11
File Added     218297: spelling_example.zip  2007-03-01 01:36:46 PST  dgheath