Links get found within html-comments and script-tags
Status: Beta
If a link is placed within an html-comment (<!-- ... -->), the crawler should ignore it and not
place it in the URL-queue.
Also see this forum-post:
https://sourceforge.net/projects/phpcrawl/forums/forum/307696/topic/5506460
Anonymous
Same for script-tags too (<script>...</script>)
Is there a fix for this problem yet, or a small hack to ignore everything between <script> and </script>?
Hi!
Look at this post, a user posted a fix/hack:
http://sourceforge.net/p/phpcrawl/discussion/307696/thread/5029c505/
(second post)
I'm using version 0.81 and searched all files for
"if ($stream_to_memory == true)"
Is the fix at http://sourceforge.net/p/phpcrawl/discussion/307696/thread/5029c505/ only for an older version?
Yes, you are right, must have been from an older version.
Try to insert
in file "PHPCrawlerLinkFinder.class.php" at line 135 at the beginning of the method "findLinksInHTMLChunk()"
I didn't test it, but it should work.
But please note the comment in the mentioned forum post (third one); it explains why this really is just a hack that may fail in some cases.
Last edit: Uwe Hunfeld 2014-01-03
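For readers who want a picture of this kind of hack, here is a minimal sketch of stripping script blocks and html-comments from a chunk of HTML before links are searched. It is not the snippet posted in this thread or in the forum post; the function name strip_script_and_comment_sections() and the two patterns are assumptions made for illustration, and the caveat above applies: a regex like this can fail on unusual markup.

```php
<?php
// Hypothetical helper, NOT part of phpcrawl: removes <script> blocks and
// html-comments so that a link finder never sees URLs inside them.
function strip_script_and_comment_sections($html)
{
    $patterns = array(
        '#<script\b[^>]*>.*?</script\s*>#is', // <script ...> ... </script>
        '#<!--.*?-->#s',                      // <!-- ... -->
    );
    $result = preg_replace($patterns, '', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $result === null ? $html : $result;
}

$html = '<a href="/keep">x</a><!-- <a href="/hidden">y</a> -->'
      . '<script>var u = "http://example.com/track.js";</script>';

echo strip_script_and_comment_sections($html), "\n";
// prints: <a href="/keep">x</a>
```

In the hack described above, a call like this would presumably be applied to the HTML chunk at the start of findLinksInHTMLChunk(); the variable name used inside that method is not shown here.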
Thanks, it works fine for me now :)
I didn't need the fix for html-comments; I need it for <script> tags, because in script blocks such as the Google Analytics or Piwik snippets placed in the pages the parser finds invalid links, so I now strip the <script> tags.
Thanks for your quick support :-)
Just to note that adding this regex to a CLI crawling script was triggering occasional PCRE segfaults for me. Just noting it here in case anyone experiences the same problem.
Strange, this is a pretty straightforward regex. Is anybody else seeing these PCRE segfaults?
(Just asking because a similar regex will be part of the next phpcrawl version regarding this topic)
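If the regex does crash for someone, one possible workaround is to strip the sections with plain string searches, so PCRE is not involved at all. The cause of the segfaults reported above is not established in this thread, so this is only a sketch of an alternative, not a confirmed fix. The helper name strip_sections() is hypothetical, and matching on the literal prefix "<script" is a simplifying assumption.

```php
<?php
// Hypothetical helper, NOT part of phpcrawl: removes every section that
// starts with $open and ends with $close (case-insensitive) using plain
// string search instead of a regular expression.
function strip_sections($html, $open, $close)
{
    $out = '';
    $pos = 0;
    while (($start = stripos($html, $open, $pos)) !== false) {
        // Keep the text in front of the section.
        $out .= substr($html, $pos, $start - $pos);

        $end = stripos($html, $close, $start + strlen($open));
        if ($end === false) {
            // Unterminated section: drop the rest of the document.
            return $out;
        }
        $pos = $end + strlen($close);
    }
    // Append whatever follows the last removed section.
    return $out . substr($html, $pos);
}

$html = 'a<SCRIPT src="x.js">var l = "http://example.com/";</script>b<!-- c -->d';
$html = strip_sections($html, '<script', '</script>');
$html = strip_sections($html, '<!--', '-->');

echo $html, "\n";
// prints: abd
```

Dropping everything after an unterminated section is a deliberate choice here: it is safer for link extraction than treating the remainder as normal markup.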
Fixed/Added since version 0.83. Please see and use the new method excludeLinkSearchDocumentSections() for excluding script- and html-comment-sections.
http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_excludeLinkSearchDocumentSections.htm
Thanks!