Is there a way to discard query strings, so similar URLs are crawled only once? So for example, these can all be treated as the same page, and crawled just once:
somedomain.com/abc?a=b
somedomain.com/abc
somedomain.com/abc?x=y
Last edit: Anonymous 2016-10-05
You could use some regex to ignore them:
//somedomain.com/abc?a=b
$crawler->addURLFilterRule("^somedomain.com\/abc\?\=([a-z])$");
Thanks James, but that doesn't work. I'd need a rule for each address, and I'm not sure a URL filter will do the job - pages are missed if they're only ever found with query strings. I need query strings discarded, so that given two similar addresses with different query strings, the address is crawled once. I could do this by modifying URLs as they are found to strip the query string, then letting the crawler carry on as normal to crawl (or ignore, if already crawled) the stripped address. Any ideas?
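What I have in mind is something like this (plain PHP; the hook into the crawler is hypothetical, as I don't know whether it exposes one):
// Hypothetical: if the crawler offered a way to rewrite each URL as it is
// found, stripping the query string would make duplicates collapse naturally.
function stripQueryString($url)
{
    $url = strtok($url, '#');   // drop any fragment first
    return strtok($url, '?');   // then drop everything from the first '?'
}
echo stripQueryString("http://somedomain.com/abc?a=b"); // http://somedomain.com/abc
echo stripQueryString("http://somedomain.com/abc?x=y"); // http://somedomain.com/abc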
Last edit: Anonymous 2016-10-10
A regex rule covers anything that matches the pattern, so you don't need one per address. There's a little typo in my example above though, so try this:
$crawler->addURLFilterRule("^somedomain.com\/abc\?([a-z])\=([a-z])$");
What this does is look for any URL with a query string after 'abc', matching a single a-z character on each side of the '=':
somedomain.com/abc (crawled)
somedomain.com/abc?w=y (ignored)
somedomain.com/abc?x=y (ignored)
somedomain.com/abc?w=z (ignored)
Last edit: James Shaver 2016-10-10
Sorry James, I'm not making myself clear. If a site has these URLs:
somedomain.com/abc?a=b
somedomain.com/abc?a=c
somedomain.com/xyz
somedomain.com/qwe?a=b
somedomain.com/qwe?a=c
With the regex you suggest, the /abc page is ignored completely and the /qwe page is crawled twice:
somedomain.com/abc?a=b (ignored)
somedomain.com/abc?a=c (ignored)
somedomain.com/xyz (crawled)
somedomain.com/qwe?a=b (crawled)
somedomain.com/qwe?a=c (crawled)
But I need the query strings ignored, and the pages crawled:
somedomain.com/abc
somedomain.com/xyz
somedomain.com/qwe
All help gratefully received!
Last edit: Anonymous 2016-10-11
So you would need a rule for each page that appears with a query string in addition to its base URL:
$crawler->addURLFilterRule("^somedomain.com\/abc\?([a-z])\=([a-z])$");
$crawler->addURLFilterRule("^somedomain.com\/xyz\?([a-z])\=([a-z])$");
$crawler->addURLFilterRule("^somedomain.com\/qwe\?([a-z])\=([a-z])$");
This crawls each base page, but ignores the query strings. There may be an easier way to do it, but that depends on what you have available. Maybe a loop through a list of these URLs?
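For example, something like this (a sketch, assuming you can collect the base paths into an array up front):
$basePaths = array("abc", "xyz", "qwe");
foreach ($basePaths as $path) {
    // One rule per base page, each ignoring that page's query-string variants
    $crawler->addURLFilterRule("^somedomain.com\/" . $path . "\?([a-z])\=([a-z])$");
}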
Thanks James, that's why a regex filter is not a workable solution for this task. There's no way to know what pages exist prior to a crawl, nor which have query strings. I need a method to simply discard query strings, not a filter to remove pages with query strings. I'll let you know if I find such a method. Thanks.
Last edit: Anonymous 2016-10-13
I think you underestimate regular expressions :)
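A single rule can cover every page on the site (a sketch, generalized from the [a-z]{1,99} quantifier explained below):
// Ignores any URL of the form somedomain.com/<path>?<key>=<value>
$crawler->addURLFilterRule("^somedomain.com\/[a-z]{1,99}\?[a-z]{1,99}\=[a-z]{1,99}$");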
This does not match:
somedomain.com/asdf
somedomain.com/abc
somedomain.com/xyz
somedomain.com/qwe
It DOES match:
somedomain.com/asdf?a=c
somedomain.com/abc?b=a
somedomain.com/xyz?anything=something
somedomain.com/qwe?id=var
Where I added "[a-z]{1,99}", it will match any alphabetic character, 1 to 99 times. Depending on what you're expecting, you could make it alphanumeric with [a-z0-9]{1,99}.
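For instance, the alphanumeric version of the rule above would be:
// Same filter, but allowing digits in the path, key and value
$crawler->addURLFilterRule("^somedomain.com\/[a-z0-9]{1,99}\?[a-z0-9]{1,99}\=[a-z0-9]{1,99}$");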
Last edit: James Shaver 2016-10-16
Thanks James, that's very useful. I'll use that to filter out any page with a query string where that's required, but it still doesn't do exactly what's needed here. It filters out all pages with query strings, so if a page only exists with query strings, that page will be filtered out completely. The requirement in this case is for query strings to be ignored, not pages with query strings. The difference is crucial - for example, if during a crawl only the following two URLs are found:
somedomain.com/abc?a=c
somedomain.com/abc?b=a
If pages with query strings are ignored, nothing will be crawled, as both URLs have query strings. However, if query strings are ignored, the page itself will be crawled once:
somedomain.com/abc
I really appreciate your help on this, but I'm still not convinced the requirements outlined can be satisfied via regex, however powerful it certainly is. The example you give doesn't do so (yet!).
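To put the distinction into code (a sketch in plain PHP - the arrays just illustrate the normalize-then-deduplicate idea, not any crawler API):
$found = array(
    "somedomain.com/abc?a=c",
    "somedomain.com/abc?b=a",
);
// Strip each query string, then deduplicate: the page survives exactly once
$normalized = array_unique(array_map(function ($url) {
    return strtok($url, '?');
}, $found));
// $normalized now contains only "somedomain.com/abc"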
Last edit: Anonymous 2016-10-19
Sorry for the late reply...
Actually, it doesn't ignore pages with query strings entirely. Maybe a better way to explain it:
It will crawl:
somedomain.com/asdf
somedomain.com/abc
somedomain.com/xyz
somedomain.com/qwe
It will not crawl:
somedomain.com/asdf?a=c
somedomain.com/abc?b=a
somedomain.com/xyz?anything=something
somedomain.com/qwe?id=var
Notice the root pages are identical, but the pages ignored contain query strings.