#26 this.options[this.selectedIndex].value added to url

Status: closed-duplicate
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-10-12
Created: 2012-09-07
Creator: Anonymous
Private: No

version 0.8

1)
phpcrawl appends this string to the URL: this.options[this.selectedIndex].value
Example:
http://test-site.x.org/tag/reboot/this.options[this.selectedIndex].value

2) Same as above, but a "(" is appended instead, which produces a 404 error.
phpcrawl generates a GET request like this:
GET /page/%28

Discussion

  • Nobody/Anonymous

    This happens when aggressive link search is set to 'true':
    $crawler->enableAggressiveLinkSearch(true);

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-09-11

    Hi!

    This is because of the way phpcrawl searches for links. It also searches for links in JavaScript code, so something like

    <script type="javascript">
    document.location.href = '/lifeatgoogle/' + story['id'] + '.html';
    </script>

    ... will lead to a 404 error (http://www.google.pl/about/jobs/lifeatgoogle/' + story['id'] + '.html).

    But is this really a problem (for you)?

    The approach of phpcrawl is to find as many links as possible, even in JavaScript code and CSS blocks (so that no link gets missed).
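
    To illustrate the effect described above, here is a minimal sketch of a pattern-based link search that does not distinguish markup from script code. The function naiveFindLinkCandidates() and its regular expression are hypothetical and are not phpcrawl's actual implementation; they only show how a string concatenation in JavaScript can end up as a link candidate.

```php
<?php
// Hypothetical, simplified stand-in for an "aggressive" link search.
// This is NOT phpcrawl's real pattern; it only demonstrates the effect.
function naiveFindLinkCandidates(string $source): array
{
    // Capture whatever follows "href =" up to the end of the line,
    // no matter whether it sits in an <a> tag or inside script code.
    preg_match_all('/href\s*=\s*(.+)$/m', $source, $matches);

    $candidates = [];
    foreach ($matches[1] as $raw) {
        // Crude cleanup: strip surrounding whitespace, quotes and semicolons.
        $candidates[] = trim($raw, " \t\r;'\"");
    }
    return $candidates;
}

$html = <<<'HTML'
<script type="javascript">
document.location.href = '/lifeatgoogle/' + story['id'] + '.html';
</script>
HTML;

// The whole JavaScript expression between the outer quotes is returned
// as if it were a single relative URL.
$found = naiveFindLinkCandidates($html);
print_r($found);
```

    Resolving such a candidate against the page URL gives a request for a path that does not exist on the server, hence the 404.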

     
  • Nobody/Anonymous

    OK, I see.
    Is there a way to exclude <script> tags?
    I want to avoid generating 404 errors, as this is an easy way to get banned by servers.
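
    One possible workaround, sketched under assumptions: reject link candidates that look like leftover script code before they are requested. The helper looksLikeScriptResidue() is hypothetical and not part of phpcrawl, and its patterns are heuristics derived only from the examples in this ticket, so they may need tuning and could reject an unusual but valid URL.

```php
<?php
// Hypothetical helper, not part of phpcrawl: decide whether a link
// candidate looks like leftover script code rather than a real URL.
function looksLikeScriptResidue(string $candidate): bool
{
    // Heuristic patterns taken from the examples in this ticket.
    $patterns = [
        '/[\'"\s]/',           // quotes or whitespace, e.g. from string concatenation
        '/(^|\/)this\./',      // a path segment starting with "this."
        '/(^|\/)(\(|%28)$/i',  // a path ending in a lone "(" (raw or percent-encoded)
    ];

    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $candidate) === 1) {
            return true;
        }
    }
    return false;
}

$candidates = [
    'http://test-site.x.org/tag/reboot/this.options[this.selectedIndex].value',
    '/page/%28',
    "/lifeatgoogle/' + story['id'] + '.html",
    'http://test-site.x.org/tag/reboot/',
];

// Keep only the candidates that do not look like script residue.
$clean = array_values(array_filter($candidates, function (string $c): bool {
    return !looksLikeScriptResidue($c);
}));
print_r($clean);
```

    If the phpcrawl version in use provides a URL filter rule that accepts a regular expression (addURLFilterRule() is one such method, but check the documentation for your version), the same patterns could probably be passed to it instead of filtering by hand.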

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-12
    • status: open --> closed-duplicate
     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-12

    "Merged" this bug with bug 3555300 "Links get found within html-comments and script-tags" (which is the same problem)

     

