If a link is placed within an HTML comment (<!-- ... -->), the crawler should ignore it and not
place it in the URL queue.
Also see this forum post:
The same goes for script tags (<script>...</script>).
Is there any fix for this problem yet, or a small hack to ignore everything between <script> and </script>?
Have a look at this post, where a user posted a fix/hack:
I'm using version 0.81 and searched all files for
"if ($stream_to_memory == true)"
is the fix http://sourceforge.net/p/phpcrawl/discussion/307696/thread/5029c505/ just for an older version ?
Yes, you are right, it must have been from an older version.
Try inserting
$html_source = preg_replace('/(?s)<!--.*?-->/', '', $html_source);
in the file "PHPCrawlerLinkFinder.class.php" at line 135, at the beginning of the method "findLinksInHTMLChunk()".
I didn't test it, but it should work.
But please note the comment in the mentioned forum post (the third one); it explains why this really is just a hack that may fail in some cases.
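To illustrate both the hack and one of the cases where it fails, here is a small standalone sketch (the markup and URLs are made up for illustration, this is not code from phpcrawl itself):

```php
<?php
// The comment-stripping regex suggested above.
$regex = '/(?s)<!--.*?-->/';

// Normal case: the link inside the comment is removed as intended.
$simple = '<a href="/ok">ok</a><!-- <a href="/hidden">x</a> -->';
$stripped_ok = preg_replace($regex, '', $simple);
// $stripped_ok keeps the "/ok" link and no longer contains "/hidden".

// Failure case: the literal string "<!--" inside a script block starts a
// bogus "comment", and the strip then runs all the way to the next real
// "-->", deleting the legitimate link that sits in between.
$tricky = '<script>var s = "<!--";</script><a href="/real">real</a><!-- note -->';
$stripped_bad = preg_replace($regex, '', $tricky);
// $stripped_bad has lost the legitimate "/real" link as well.
```

A proper HTML parser would not have this problem, but for a quick hack on most real-world pages the regex is usually good enough.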
Thx, it works fine for me now :)
I did not need the fix for HTML comments; I needed it for <script> tags. In script blocks that services like Google Analytics or Piwik place in pages, the parser finds invalid links, so I now strip the <script> tags as well.
thx for your quick support :-)
Just noting that adding this regex to a CLI crawling script was triggering occasional PCRE segfaults for me, in case anyone else runs into the same problem.
Strange, this is a pretty straightforward regex. Is anybody else seeing these PCRE segfaults?
(Just asking because a similar regex will be part of the next phpcrawl version regarding this topic.)
Fixed/added in version 0.83; please see and use the new method excludeLinkSearchDocumentSections() for excluding script and HTML-comment sections.
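A usage sketch for the new method, assuming phpcrawl >= 0.83. The constant names below are taken from the phpcrawl class reference as I recall it, and the include path is an assumption; please verify against the documentation:

```php
<?php
// Path to the phpcrawl library is an assumption; adjust to your setup.
require_once('libs/PHPCrawler.class.php');

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // Just print every received page/file URL.
        echo $DocInfo->url . "\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL('http://www.example.com/');

// Exclude <script> sections and HTML comments from the link search,
// so links appearing inside them never enter the URL queue.
$crawler->excludeLinkSearchDocumentSections(
    PHPCrawlerLinkSearchDocumentSections::SCRIPT_SECTIONS |
    PHPCrawlerLinkSearchDocumentSections::HTML_COMMENT_SECTIONS
);

$crawler->go();
```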