#25 Links get found within html-comments and script-tags

Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-01-27
Created: 2012-08-08
Creator: Uwe Hunfeld
Private: No

If a link is placed within an html-comment (<!-- ... -->), the crawler should ignore it and not
place it in the URL-queue.

Also see this forum-post:
https://sourceforge.net/projects/phpcrawl/forums/forum/307696/topic/5506460

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-12
    • summary: Links get found within html-comments --> Links get found within html-comments and script-tags
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-12

    Same for script-tags too (<script>...</script>)

     
  • Anonymous

    Anonymous - 2014-01-03

    Is there any fix yet for this problem? Or a small hack to ignore everything between <script> and </script>?

     
  • Anonymous

    Anonymous - 2014-01-03

    I'm using version 0.81 and searched all files for
    "if ($stream_to_memory == true)"

    Is the fix at http://sourceforge.net/p/phpcrawl/discussion/307696/thread/5029c505/ just for an older version?

     
  • Anonymous

    Anonymous - 2014-01-03

    Yes, you are right, must have been from an older version.

    Try to insert

        $html_source = preg_replace('/(?s)<!--.*?-->/', '', $html_source);
    

    in the file "PHPCrawlerLinkFinder.class.php" at line 135, at the beginning of the method "findLinksInHTMLChunk()".

    Didn't test it though, but it should work.

    But please note the comment in the mentioned forum post (third one); it explains why this really is just a hack that may fail in some cases.

     

    Last edit: Uwe Hunfeld 2014-01-03
  • Anonymous

    Anonymous - 2014-01-04

    Thx, it works fine for me now :)
    I did not need the fix for html-comments. I need it for <script> tags, because in scripts such as Google Analytics or Piwik that are embedded in the pages, the parser finds invalid links. So I now strip the <script> tags.

    Thx for your quick support :-)
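
The poster does not show the exact change used for <script> tags. A minimal sketch of how the comment regex suggested above could be extended to script blocks might look like the following. The function name strip_scripts_and_comments() is hypothetical and not part of phpcrawl, and like the original hack this is plain pattern matching, not real HTML parsing, so it may fail on unusual markup.

```php
<?php
// Hypothetical helper, not part of phpcrawl: removes <script>...</script>
// blocks and HTML comments from a chunk of HTML before links are extracted.
function strip_scripts_and_comments($html_source)
{
    $patterns = array(
        // Script elements, case-insensitive; "s" lets "." match newlines.
        '/<script\b[^>]*>.*?<\/script\s*>/is',
        // HTML comments, same idea as the regex suggested in this thread.
        '/<!--.*?-->/s',
    );
    $result = preg_replace($patterns, '', $html_source);
    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $result === null ? $html_source : $result;
}

$page = '<a href="/ok">ok</a>'
      . '<script>var u = "http://example.invalid/track";</script>'
      . '<!-- <a href="/hidden">hidden</a> -->';
echo strip_scripts_and_comments($page), "\n"; // prints <a href="/ok">ok</a>
```

Assuming the same insertion point as the earlier suggestion, the call would replace the single preg_replace() line, for example $html_source = strip_scripts_and_comments($html_source);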

     
  • SpiderBro

    SpiderBro - 2014-11-25

    Just to note that adding this regex to a CLI crawling script was triggering occasional PCRE segfaults for me. Just noting it here in case anyone experiences the same problem.

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2014-11-25

    Strange, this is a pretty straightforward regex, anybody else having these PCRE-segfaults here?

    (Just asking because a similar regex will be part of the next phpcrawl version regarding this topic.)
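
The thread does not establish what caused the segfaults. For anyone who runs into them, one possible workaround is to avoid PCRE for this step and cut the blocks out with plain string functions. The sketch below assumes a hypothetical helper strip_between() that is not part of phpcrawl. It matches delimiters literally, so '<script' would also match a tag name that merely starts with "script".

```php
<?php
// Hypothetical helper, not part of phpcrawl: removes every span that starts
// with $open and ends with $close, using string functions only (no PCRE).
function strip_between($html_source, $open, $close)
{
    $out = '';
    $pos = 0;
    while (($start = stripos($html_source, $open, $pos)) !== false) {
        // Keep the text that precedes the opening delimiter.
        $out .= substr($html_source, $pos, $start - $pos);
        $end = stripos($html_source, $close, $start + strlen($open));
        if ($end === false) {
            // Unterminated block: treat the rest of the input as part of it.
            return $out;
        }
        // Continue scanning right after the closing delimiter.
        $pos = $end + strlen($close);
    }
    // Append whatever follows the last removed block.
    return $out . substr($html_source, $pos);
}

$html  = 'a<!-- <a href="/x">x</a> -->b<script>var u = "/y";</script>c';
$clean = strip_between($html, '<!--', '-->');
$clean = strip_between($clean, '<script', '</script>');
echo $clean, "\n"; // prints abc
```

Dropping everything after an unterminated delimiter is a design choice made for this sketch. Returning the remaining text unchanged would be the more conservative alternative if losing links matters more than picking up false ones.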

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2015-01-27
    • status: open --> closed-fixed
     
