Discard query strings

  • Anonymous

    Anonymous - 2016-10-04

    Is there a way to discard query strings, so similar URLs are crawled only once? For example, these would all be treated as the same page, and crawled just once:

    somedomain.com/abc?a=b
    somedomain.com/abc
    somedomain.com/abc?x=y


    Last edit: Anonymous 2016-10-05
  • James Shaver

    James Shaver - 2016-10-06

    You could use some regex to ignore them:

    //somedomain.com/abc?a=b
    $crawler->addURLFilterRule("^somedomain.com\/abc\?\=([a-z])$");

  • Anonymous

    Anonymous - 2016-10-10

    Thanks James, but that doesn't work. I'd need a rule for each address, and I'm not sure a URL filter will do the job - pages are missed if they're only ever linked to with query strings. I need query strings discarded, so that given two similar addresses with different query strings, the address is crawled once. I could do this by modifying URLs as they are found - stripping the query string - and then letting the crawler carry on as normal, crawling (or ignoring as already crawled) the stripped address. Any ideas?
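
    Something like this is what I mean by stripping - plain PHP, and it assumes there's a point in the crawl where found URLs can be intercepted, which I haven't found in PHPCrawl yet:

    <?php
    // Strip the query string (and any fragment) from a URL so that
    // variants collapse to a single address. Assumes we can intercept
    // URLs as they are found, which PHPCrawl may not expose directly.
    function stripQueryString($url)
    {
        $url = strtok($url, "?"); // everything before the first '?'
        return strtok($url, "#"); // and drop any fragment too
    }

    echo stripQueryString("somedomain.com/abc?a=b") . "\n"; // somedomain.com/abc
    echo stripQueryString("somedomain.com/abc?x=y") . "\n"; // somedomain.com/abc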


    Last edit: Anonymous 2016-10-10
  • James Shaver

    James Shaver - 2016-10-10

    Using regex covers anything that matches the pattern. There's a little typo in my example above, though, so try this:

    $crawler->addURLFilterRule("^somedomain.com\/abc\?([a-z])\=([a-z])$");

    What this does is look for any URL with a query string after 'abc', matching a single character a-z on each side of the '=':
    somedomain.com/abc (crawled)
    somedomain.com/abc?w=y (ignored)
    somedomain.com/abc?x=y (ignored)
    somedomain.com/abc?w=z (ignored)
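
    If you want to sanity-check a rule outside the crawler, plain preg_match works (delimiters added here because preg_match requires them; whether PHPCrawl's filter rules need them too may depend on your version):

    <?php
    // Standalone check of the filter pattern against sample URLs.
    $pattern = '#^somedomain.com/abc\?([a-z])\=([a-z])$#';
    $urls = array(
        "somedomain.com/abc",     // no query string -> crawled
        "somedomain.com/abc?w=y", // matches -> ignored
        "somedomain.com/abc?x=y", // matches -> ignored
        "somedomain.com/abc?w=z", // matches -> ignored
    );
    foreach ($urls as $url) {
        echo $url . ": " . (preg_match($pattern, $url) ? "ignored" : "crawled") . "\n";
    }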


    Last edit: James Shaver 2016-10-10
  • Anonymous

    Anonymous - 2016-10-11

    Sorry James, I'm not making myself clear. If a site has these URLs:

    somedomain.com/abc?a=b
    somedomain.com/abc?a=c
    somedomain.com/xyz
    somedomain.com/qwe?a=b
    somedomain.com/qwe?a=c

    With the regex you suggest, the /abc page is ignored completely and the /qwe page is crawled twice:

    somedomain.com/abc?a=b (ignored)
    somedomain.com/abc?a=c (ignored)
    somedomain.com/xyz (crawled)
    somedomain.com/qwe?a=b (crawled)
    somedomain.com/qwe?a=c (crawled)

    But I need the query strings ignored, and the pages crawled:

    somedomain.com/abc
    somedomain.com/xyz
    somedomain.com/qwe

    All help gratefully received!


    Last edit: Anonymous 2016-10-11
  • James Shaver

    James Shaver - 2016-10-11

    So you would need to specify each page that can appear with a query string in addition to its base URL:

    $crawler->addURLFilterRule("^somedomain.com\/abc\?([a-z])\=([a-z])$");
    $crawler->addURLFilterRule("^somedomain.com\/xyz\?([a-z])\=([a-z])$");
    $crawler->addURLFilterRule("^somedomain.com\/qwe\?([a-z])\=([a-z])$");

    This crawls each base page but ignores the query-string variants. There may be an easier way, depending on what you have available - maybe a loop through a list of the paths?

    $filters = array('abc', 'qwe', 'xyz');
    foreach ($filters as $filter) {
        $crawler->addURLFilterRule("^somedomain.com\/" . $filter . "\?([a-z])\=([a-z])$");
    }
  • Anonymous

    Anonymous - 2016-10-13

    Thanks James, but that's exactly why a regex filter is not a workable solution for this task. There's no way to know what pages exist prior to a crawl, nor which ones have query strings. I need a method that simply discards query strings, not a filter that removes pages with query strings. I'll let you know if I find such a method.
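
    In the meantime, the closest workaround I can see is deduplicating at processing time: query-string variants still get fetched, but each normalized address is only handled once. This is just a sketch - it assumes PHPCrawl's documented handleDocumentInfo() override point and the url property on the document-info object:

    <?php
    // Workaround sketch: the crawler still fetches each variant, but
    // we only process one per normalized (query-string-free) address.
    class DedupCrawler extends PHPCrawler
    {
        private $seen = array();

        function handleDocumentInfo($DocInfo)
        {
            // Normalize by dropping everything from the first '?' onward
            $normalized = strtok($DocInfo->url, "?");
            if (isset($this->seen[$normalized])) {
                return; // variant of a page we've already processed
            }
            $this->seen[$normalized] = true;

            // ... actual per-page processing goes here ...
            echo "Processing " . $normalized . "\n";
        }
    }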


    Last edit: Anonymous 2016-10-13
  • James Shaver

    James Shaver - 2016-10-16

    I think you underestimate regular expressions :)

    $crawler->addURLFilterRule("^somedomain.com\/[a-z]{1,99}\?([a-z]{1,99})\=([a-z]{0,99})$");

    This does not match:
    somedomain.com/asdf
    somedomain.com/abc
    somedomain.com/xyz
    somedomain.com/qwe

    It DOES match:
    somedomain.com/asdf?a=c
    somedomain.com/abc?b=a
    somedomain.com/xyz?anything=something
    somedomain.com/qwe?id=var

    Where I added "[a-z]{1,99}", it will match any lowercase letter, 1 to 99 times. Depending on what you're expecting, you could make it alphanumeric with [a-z0-9]{1,99}.
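
    And if the keys or values can contain digits, dashes, or several parameters ('?a=b&c=d'), a simpler catch-all is to match the '?' itself. Note that PHPCrawl's documented filter examples wrap patterns in PCRE delimiters, so depending on your version you may need this form:

    // Catch-all: ignore any URL that contains a query string at all.
    // Assumes the delimiter form ("#...#") from PHPCrawl's documented examples.
    $crawler->addURLFilterRule("#\?#");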


    Last edit: James Shaver 2016-10-16
  • Anonymous

    Anonymous - 2016-10-19

    Thanks James, that's very useful. I will use that to filter out any page with a query string where that's required, but it still doesn't do exactly what's needed here. It filters out all pages with query strings, so if a page only ever appears with query strings, that page will be completely filtered out. The requirement in this case is for query strings to be ignored, not pages with query strings. The difference is crucial - for example, if during a crawl only the following two URLs are found:

    somedomain.com/abc?a=c
    somedomain.com/abc?b=a

    If pages with query strings are ignored, nothing will be crawled, as both URLs have query strings. However, if the query strings themselves are ignored, the page is crawled once:

    somedomain.com/abc

    I really appreciate your help on this, but I'm still not convinced the requirements outlined can be satisfied via regex, however powerful it certainly is. The example you give doesn't do so (yet!).


    Last edit: Anonymous 2016-10-19
  • James Shaver

    James Shaver - 2016-11-22

    Sorry for the late reply...

    Actually, it doesn't ignore the pages themselves - only the URLs with query strings. Maybe a better way to explain it:

    It will crawl:
    somedomain.com/asdf
    somedomain.com/abc
    somedomain.com/xyz
    somedomain.com/qwe

    It will not crawl:
    somedomain.com/asdf?a=c
    somedomain.com/abc?b=a
    somedomain.com/xyz?anything=something
    somedomain.com/qwe?id=var

    Notice the base pages are identical; the only URLs ignored are the ones containing query strings.

