#25 Links get found within html-comments and script-tags

Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-01-27
Created: 2012-08-08
Creator: Uwe Hunfeld
Private: No

If a link is placed within an HTML comment (<!-- ... -->), the crawler should ignore it and not
place it in the URL queue.

Also see this forum-post:
https://sourceforge.net/projects/phpcrawl/forums/forum/307696/topic/5506460

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-12
    • summary: Links get found within html-comments --> Links get found within html-comments and script-tags
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-12

    Same for script-tags too (<script>...</script>)

     
  • Anonymous

    Anonymous - 2014-01-03

    Is there any fix yet for this problem? Or a small hack to ignore everything between <script> and </script>?

     
  • Anonymous

    Anonymous - 2014-01-03

    Hi!

    Look at this post, a user posted a fix/hack:
    http://sourceforge.net/p/phpcrawl/discussion/307696/thread/5029c505/

    (second post)

     
  • Anonymous

    Anonymous - 2014-01-03

    I'm using version 0.81 and searched all files for
    "if ($stream_to_memory == true)"
    but could not find it.

    Is the fix at http://sourceforge.net/p/phpcrawl/discussion/307696/thread/5029c505/ just for an older version?

     
  • Anonymous

    Anonymous - 2014-01-03

    Yes, you are right, must have been from an older version.

    Try to insert

        $html_source = preg_replace('/(?s)<!--.*?-->/', '', $html_source);
    

    in file "PHPCrawlerLinkFinder.class.php" at line 135 at the beginning of the method "findLinksInHTMLChunk()"

    Didn't test it though, but it should work.

    But please note the comment in the mentioned forum post (third one), it explains why this really is just a hack that may fail in some cases.

     
    Last edit: Uwe Hunfeld 2014-01-03
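
To make the effect of the one-liner above concrete, here is a minimal standalone sketch. Only the regex is taken from the post; the helper name stripHtmlComments() and the sample markup are made up for illustration and are not part of phpcrawl.

```php
<?php
// Hypothetical helper (not part of phpcrawl): removes HTML comments
// from a chunk of HTML source before links are searched in it.
function stripHtmlComments($html_source)
{
    // (?s) lets "." match newlines; ".*?" stops at the first "-->".
    $stripped = preg_replace('/(?s)<!--.*?-->/', '', $html_source);

    // preg_replace() returns null on error; in that case fall back
    // to the unmodified source rather than losing the whole chunk.
    return $stripped === null ? $html_source : $stripped;
}

// Illustrative input: one link hidden in a multi-line comment.
$sample = '<a href="/visible.html">ok</a>' . "\n"
        . '<!-- <a href="/hidden.html">old</a>' . "\n"
        . ' -->' . "\n"
        . '<a href="/also-visible.html">ok</a>';

$cleaned = stripHtmlComments($sample);
echo $cleaned, "\n";
```

Note that the regex treats any "<!--" as the start of a comment, including one that happens to appear inside a JavaScript string, which is one way a purely textual approach like this can misfire.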
  • Anonymous

    Anonymous - 2014-01-04

    Thanks, it works fine for me now :)
    I did not need the fix for HTML comments. I needed it for <script> tags, because the parser finds invalid links in the script tags that Google Analytics or Piwik place in pages, so I now strip the <script> tags.

    thx for your quick support :-)
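
The post above does not show the pattern that was used for <script> tags. A minimal sketch of the same idea might look like the following; the helper name stripScriptBlocks(), the pattern and the sample markup are assumptions for illustration, not code from phpcrawl or from this thread.

```php
<?php
// Hypothetical helper (not part of phpcrawl): drops <script>...</script>
// blocks so URLs inside inline JavaScript are not picked up as links.
function stripScriptBlocks($html_source)
{
    // "i" = case-insensitive, "s" = "." also matches newlines.
    // "<script\b[^>]*>" matches the opening tag with any attributes,
    // ".*?" lazily consumes up to the first closing "</script>".
    $pattern = '#<script\b[^>]*>.*?</script\s*>#is';
    $stripped = preg_replace($pattern, '', $html_source);

    // preg_replace() returns null on error; keep the original then.
    return $stripped === null ? $html_source : $stripped;
}

// Illustrative input: a tracking snippet between two real links.
$sample = '<a href="/page.html">page</a>' . "\n"
        . '<script type="text/javascript">' . "\n"
        . '  var u = "http://tracker.example/" + "piwik.php";' . "\n"
        . '</script>' . "\n"
        . '<a href="/other.html">other</a>';

$cleaned = stripScriptBlocks($sample);
echo $cleaned, "\n";
```

Like the comment-stripping hack, this works on raw text, so a literal "</script>" inside a JavaScript string would end the match early.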

     
  • SpiderBro

    SpiderBro - 2014-11-25

    Adding this regex to a CLI crawling script was triggering occasional PCRE segfaults for me. Noting it here in case anyone experiences the same problem.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-11-25

    Strange, this is a pretty straightforward regex, anybody else having these PCRE-segfaults here?

    (Just asking because a similar regex will be part of the next phpcrawl version regarding this topic)
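
The thread does not establish what caused the segfaults. If the regex engine is the suspect, one way to sidestep it is to strip comments with plain string functions. The sketch below is an untested-in-phpcrawl illustration; the helper name stripHtmlCommentsNoRegex() is hypothetical, and whether it avoids the reported crash is an assumption.

```php
<?php
// Hypothetical helper (not part of phpcrawl): strips HTML comments
// using strpos()/substr() only, without calling into PCRE.
function stripHtmlCommentsNoRegex($html_source)
{
    $result = '';
    $offset = 0;

    while (($start = strpos($html_source, '<!--', $offset)) !== false) {
        // Keep the text between the previous comment and this one.
        $result .= substr($html_source, $offset, $start - $offset);

        // Look for the closing marker after the 4-char opener.
        $end = strpos($html_source, '-->', $start + 4);
        if ($end === false) {
            // Unterminated comment: leave it untouched, which is
            // also what the regex version does (no match).
            return $result . substr($html_source, $start);
        }

        // Continue scanning right after the 3-char closing marker.
        $offset = $end + 3;
    }

    // Append whatever follows the last comment.
    return $result . substr($html_source, $offset);
}

$sample = "a<!-- one -->b<!--\n two \n-->c<!-- open";
$cleaned = stripHtmlCommentsNoRegex($sample);
echo $cleaned, "\n";
```

For this sample the result is the same as with the regex from the earlier post, so the two can be swapped without changing which links are found.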

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-01-27
    • status: open --> closed-fixed
     

