Discard query strings

  • Anonymous

    Anonymous - 2016-10-04

    Is there a way to discard query strings, so similar URLs are crawled only once? For example, these would all be treated as the same page, and crawled just once:

    somedomain.com/abc?a=b
    somedomain.com/abc
    somedomain.com/abc?x=y


    Last edit: Anonymous 2016-10-05
  • James Shaver

    James Shaver - 2016-10-06

    You could use some regex to ignore them:

    //somedomain.com/abc?a=b
    $crawler->addURLFilterRule("^somedomain.com\/abc\?\=([a-z])$");

  • Anonymous

    Anonymous - 2016-10-10

    Thanks James, but that doesn't work. I'd need a rule for each address, and I'm not sure a URL filter will do the job - pages are missed if they're only ever linked to with query strings. I need query strings discarded, so that given two similar addresses with different query strings, the address is crawled once. I could do this by modifying URLs as they are found - stripping the query string - and then letting the crawler carry on as normal, crawling (or ignoring as already crawled) the stripped address. Any ideas?
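
    Something like this is what I mean by stripping - plain PHP, and it assumes there's a point in the crawl where found URLs can be intercepted, which I haven't found in PHPCrawl yet:

    <?php
    // Strip the query string (and any fragment) from a URL so that
    // variants collapse to a single address. Assumes we can intercept
    // URLs as they are found, which PHPCrawl may not expose directly.
    function stripQueryString($url)
    {
        $url = strtok($url, "?"); // everything before the first '?'
        return strtok($url, "#"); // and drop any fragment too
    }

    echo stripQueryString("somedomain.com/abc?a=b") . "\n"; // somedomain.com/abc
    echo stripQueryString("somedomain.com/abc?x=y") . "\n"; // somedomain.com/abc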


    Last edit: Anonymous 2016-10-10
  • James Shaver

    James Shaver - 2016-10-10

    Using regex covers anything that matches the pattern. There's a little typo in my example above, though, so try this:

    $crawler->addURLFilterRule("^somedomain.com\/abc\?([a-z])\=([a-z])$");

    What this does is look for any URL with a query string after 'abc', matching a single character a-z on each side of the '=':
    somedomain.com/abc (crawled)
    somedomain.com/abc?w=y (ignored)
    somedomain.com/abc?x=y (ignored)
    somedomain.com/abc?w=z (ignored)
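
    If you want to sanity-check a rule outside the crawler, plain preg_match works (delimiters added here because preg_match requires them; whether PHPCrawl's filter rules need them too may depend on your version):

    <?php
    // Standalone check of the filter pattern against sample URLs.
    $pattern = '#^somedomain.com/abc\?([a-z])\=([a-z])$#';
    $urls = array(
        "somedomain.com/abc",     // no query string -> crawled
        "somedomain.com/abc?w=y", // matches -> ignored
        "somedomain.com/abc?x=y", // matches -> ignored
        "somedomain.com/abc?w=z", // matches -> ignored
    );
    foreach ($urls as $url) {
        echo $url . ": " . (preg_match($pattern, $url) ? "ignored" : "crawled") . "\n";
    }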


    Last edit: James Shaver 2016-10-10
  • Anonymous

    Anonymous - 2016-10-11

    Sorry James, I'm not making myself clear. If a site has these URLs:

    somedomain.com/abc?a=b
    somedomain.com/abc?a=c
    somedomain.com/xyz
    somedomain.com/qwe?a=b
    somedomain.com/qwe?a=c

    With the regex you suggest, the /abc page is ignored completely and the /qwe page is crawled twice:

    somedomain.com/abc?a=b (ignored)
    somedomain.com/abc?a=c (ignored)
    somedomain.com/xyz (crawled)
    somedomain.com/qwe?a=b (crawled)
    somedomain.com/qwe?a=c (crawled)

    But I need the query strings ignored, and the pages crawled:

    somedomain.com/abc
    somedomain.com/xyz
    somedomain.com/qwe

    All help gratefully received!


    Last edit: Anonymous 2016-10-11
  • James Shaver

    James Shaver - 2016-10-11

    So you would need to specify each page that can appear with a query string in addition to its base URL:

    $crawler->addURLFilterRule("^somedomain.com\/abc\?([a-z])\=([a-z])$");
    $crawler->addURLFilterRule("^somedomain.com\/xyz\?([a-z])\=([a-z])$");
    $crawler->addURLFilterRule("^somedomain.com\/qwe\?([a-z])\=([a-z])$");

    This crawls each base page but ignores the query-string variants. There may be an easier way, depending on what you have available - maybe a loop through a list of the paths?

    $filters = array('abc', 'qwe', 'xyz');
    foreach ($filters as $filter) {
        $crawler->addURLFilterRule("^somedomain.com\/" . $filter . "\?([a-z])\=([a-z])$");
    }
  • Anonymous

    Anonymous - 2016-10-13

    Thanks James, but that's exactly why a regex filter is not a workable solution for this task. There's no way to know what pages exist prior to a crawl, nor which ones have query strings. I need a method that simply discards query strings, not a filter that removes pages with query strings. I'll let you know if I find such a method.
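
    In the meantime, the closest workaround I can see is deduplicating at processing time: query-string variants still get fetched, but each normalized address is only handled once. This is just a sketch - it assumes PHPCrawl's documented handleDocumentInfo() override point and the url property on the document-info object:

    <?php
    // Workaround sketch: the crawler still fetches each variant, but
    // we only process one per normalized (query-string-free) address.
    class DedupCrawler extends PHPCrawler
    {
        private $seen = array();

        function handleDocumentInfo($DocInfo)
        {
            // Normalize by dropping everything from the first '?' onward
            $normalized = strtok($DocInfo->url, "?");
            if (isset($this->seen[$normalized])) {
                return; // variant of a page we've already processed
            }
            $this->seen[$normalized] = true;

            // ... actual per-page processing goes here ...
            echo "Processing " . $normalized . "\n";
        }
    }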


    Last edit: Anonymous 2016-10-13
  • James Shaver

    James Shaver - 2016-10-16

    I think you underestimate regular expressions :)

    $crawler->addURLFilterRule("^somedomain.com\/[a-z]{1,99}\?([a-z]{1,99})\=([a-z]{0,99})$");

    This does not match:
    somedomain.com/asdf
    somedomain.com/abc
    somedomain.com/xyz
    somedomain.com/qwe

    It DOES match:
    somedomain.com/asdf?a=c
    somedomain.com/abc?b=a
    somedomain.com/xyz?anything=something
    somedomain.com/qwe?id=var

    Where I added "[a-z]{1,99}", it will match any lowercase letter, 1 to 99 times. Depending on what you're expecting, you could make it alphanumeric with [a-z0-9]{1,99}.
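
    And if the keys or values can contain digits, dashes, or several parameters ('?a=b&c=d'), a simpler catch-all is to match the '?' itself. Note that PHPCrawl's documented filter examples wrap patterns in PCRE delimiters, so depending on your version you may need this form:

    // Catch-all: ignore any URL that contains a query string at all.
    // Assumes the delimiter form ("#...#") from PHPCrawl's documented examples.
    $crawler->addURLFilterRule("#\?#");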


    Last edit: James Shaver 2016-10-16
  • Anonymous

    Anonymous - 2016-10-19

    Thanks James, that's very useful. I will use that to filter out any page with a query string where that's required, but it still doesn't do exactly what's needed here. It filters out all pages with query strings, so if a page only ever appears with query strings, that page will be completely filtered out. The requirement in this case is for query strings to be ignored, not pages with query strings. The difference is crucial - for example, if during a crawl only the following two URLs are found:

    somedomain.com/abc?a=c
    somedomain.com/abc?b=a

    If pages with query strings are ignored, nothing will be crawled, as both URLs have query strings. However, if the query strings themselves are ignored, the page is crawled once:

    somedomain.com/abc

    I really appreciate your help on this, but I'm still not convinced the requirements outlined can be satisfied via regex, however powerful it certainly is. The example you give doesn't do so (yet!).


    Last edit: Anonymous 2016-10-19
  • James Shaver

    James Shaver - 2016-11-22

    Sorry for the late reply...

    Actually, it doesn't ignore the pages themselves - only the URLs with query strings. Maybe a better way to explain it:

    It will crawl:
    somedomain.com/asdf
    somedomain.com/abc
    somedomain.com/xyz
    somedomain.com/qwe

    It will not crawl:
    somedomain.com/asdf?a=c
    somedomain.com/abc?b=a
    somedomain.com/xyz?anything=something
    somedomain.com/qwe?id=var

    Notice the base pages are identical; the only URLs ignored are the ones containing query strings.

