
Problem with # links

Help
2012-10-14
2013-12-07
  • Nobody/Anonymous

    Hi,

    I have a problem with links containing "#", for example: http://www.site.com/hello#seek/me.

    First of all, when I create a follow rule that only checks the address "http://www.site.com/", I don't get any results for URLs with "#", only for ones without that symbol.

    The next problem is that even when the follow rule contains the address pattern (.*)(seek/me) without the #, I still don't get any results.

    Is there any interaction between the # used to open and close the regular expression and the # used in the link? If there is, how can I get the results?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-14

    Hi!

    I don't know if I get you right, but e.g. "http://www.site.com/hello#seek/me" is the exact same page as "http://www.site.com/hello" (the # part is just an anchor). So if the crawler has already visited http://www.site.com/hello, it won't visit that URL again (why would it, it's the same page as I said).

    Is that maybe the reason why you don't get any results with #?

    And if you want to use a # in your follow-rules, you have to escape it (like "#http://www\.site\.com/hello\#seek#").

    Could you maybe post your follow-rules so I can understand your problem better?
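
    As a minimal sketch of that delimiter conflict (plain PHP PCRE, independent of the crawler's API):

        <?php
        // A literal "#" inside a "#...#"-delimited pattern would end the
        // expression too early, so it has to be escaped there.
        $url = "http://www.site.com/hello#seek/me";

        // Escaped "#" inside "#"-delimiters: matches.
        var_dump(preg_match("#http://www\.site\.com/hello\#seek#", $url)); // int(1)

        // Alternative: a different delimiter, then "#" needs no escaping
        // (but "/" does).
        var_dump(preg_match("/http:\/\/www\.site\.com\/hello#seek/", $url)); // int(1)
        ?>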

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-15

    Hi again,

    Completely ignore what I wrote before ;) Now I get it (took a while, sorry).

    OK, the crawler COMPLETELY ignores anchors in URLs/links. If the crawler finds a link with an anchor, only the URL WITHOUT the anchor goes to the URL cache (for the reason I mentioned above).
    So if you add a follow-rule containing such an anchor, the crawler can't find any matching URLs in the cache and stops (see the sketch after this post).

    Now I don't know if that's really the right approach (nobody has complained about this before), and the question is:
    Why do you want the crawler to only follow links containing these anchors?
    Again: what's the difference between the page "http://www.site.com/hi/prometheus/#seek/bestfilm/me" and "http://www.site.com/hi/prometheus/"? Is there a difference?
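
    For illustration, the normalization described above amounts to something like this sketch (a hypothetical helper, not the crawler's actual internal code):

        <?php
        // Strip the fragment ("#...") before a URL goes into the URL
        // cache, so all anchor variants collapse to one cache entry.
        function strip_fragment($url)
        {
            $pos = strpos($url, "#");
            return ($pos === false) ? $url : substr($url, 0, $pos);
        }

        echo strip_fragment("http://www.site.com/hello#seek/me"); // http://www.site.com/hello
        echo "\n";
        echo strip_fragment("http://www.site.com/hello");         // http://www.site.com/hello
        ?>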

     
  • kamil

    kamil - 2012-10-15

    In this case the problem is that the anchor marks the links that are results of your search, and those are mixed in with links that have nothing to do with the query, for example:

    query: best film

    links on page:

    \"…ww.site.com/hi/prometheus#seek/bestfilm/me\"

    \"…ww.site.com/hi/back-to-the-future#seek/bestfilm/me\"

    \"…ww.site.com/hi/matrix\"      (has nothig to do with search)

    \"…ww.site.com/hi/fightback#seek/bestfilm/me\"

    \"…ww.site.com/hi/the-300\"

    It isn't a good design, but I didn't code the page :)
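
    One possible workaround, sketched under the assumption of PHPCrawl 0.8's PHPCrawler class with its handleDocumentInfo() override and $DocInfo->source member: since the anchors never reach the URL cache, scan each received page's HTML yourself for the "#seek/" marker.

        <?php
        // Sketch: report result links (the ones carrying the "#seek/"
        // anchor) by scanning the raw source of every received page.
        class MyCrawler extends PHPCrawler
        {
            function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
            {
                // href values containing the "#seek/" result marker
                preg_match_all('#href="([^"]*\#seek/[^"]*)"#', $DocInfo->source, $matches);
                foreach ($matches[1] as $link)
                {
                    echo "Result link: " . $link . "\n";
                }
            }
        }

        $crawler = new MyCrawler();
        $crawler->setURL("http://www.site.com/hi/");
        $crawler->go();
        ?>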

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-15

    Hey chrustol,

    I'm sorry, but I'm confused ;) Sorry, it's Monday.

    What search do you mean? Can you maybe post the actual page/search/project you are dealing with?
    It's a little difficult (for me) to understand your problem.

     
  • Anonymous

    Anonymous - 2013-12-05

    Hi

    I'm also having the same issue with links on the Google Trader website
    http://www.google.com.gh/local/trader/.

    Is there a way to force the crawler not to filter out the anchor part of the URL?

    Thanks
    Donald

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-12-05

    Hey Donald,

    No, sorry, there is no easy way right now; that's because of the way the crawler works.

    But I'll open a feature request.

    Just to make sure that I understand this right:
    The problem is that the crawler doesn't return all anchor links to the same page in the array of links found, is that right?

    So if there are two links on a page, let's say bla.com/bli.html#1 and bla.com/bli.html#2, then you want them BOTH in the array of found links?

    But they DON'T both have to be followed, right? Following bla.com/bli.html (without the anchor) once is OK, right?

    I'm just asking to understand this.

    Thanks!

     
  • Anonymous

    Anonymous - 2013-12-05

    Hi Uwe

    Thank you for your reply.
    Ideally I would like the crawler to return AND follow all the anchor links as well.

    It's because of the way the Google Trader site is structured.
    A category, e.g. Computers and Software, looks like this: http://www.google.com.gh/local/trader/#!search:c=cat1789266239 and those are the links that need to be crawled.

    Currently the crawler does not pick up these links at all.
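
    One angle that might be worth trying, assuming the site supports Google's (since deprecated) AJAX crawling scheme for "#!" URLs: such a hash-bang URL can be rewritten into its "_escaped_fragment_" form, which is an ordinary URL a crawler can actually request.

        <?php
        // Sketch: rewrite "#!fragment" to "?_escaped_fragment_=fragment"
        // (URL-encoded), per Google's AJAX crawling scheme.
        function hashbang_to_escaped_fragment($url)
        {
            $pos = strpos($url, "#!");
            if ($pos === false) return $url;
            $base     = substr($url, 0, $pos);
            $fragment = substr($url, $pos + 2);
            $sep      = (strpos($base, "?") === false) ? "?" : "&";
            return $base . $sep . "_escaped_fragment_=" . rawurlencode($fragment);
        }

        echo hashbang_to_escaped_fragment(
            "http://www.google.com.gh/local/trader/#!search:c=cat1789266239"
        );
        // http://www.google.com.gh/local/trader/?_escaped_fragment_=search%3Ac%3Dcat1789266239
        ?>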

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-12-06

    OK, now I understand the problem!

    Right now I don't know how to achieve this. How should the crawler know whether an anchor link is a page of its own (like your Google example) or a normal anchor link leading just to a section of a page?

    Simply following all anchor links is a bad solution; the crawler would request one and the same URL lots of times in most cases. (Think of documentation like SELFHTML, for example: it has lots of stuff in one huge page just separated by anchors, so there are hundreds of links like docs.com/reference.html#function1, docs.com/reference.html#function2, ... docs.com/reference.html#function1783, and the crawler would follow ALL of them although it's the exact same page every time.)

    So, any ideas maybe?
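
    The "return all, follow once" behaviour discussed above could look roughly like this (plain PHP, hypothetical variable names):

        <?php
        // Sketch: keep every anchor variant in the list of found links,
        // but deduplicate the request queue on the anchor-stripped URL.
        $found_links   = array(); // everything reported to the caller
        $request_queue = array(); // each page requested only once
        $seen          = array(); // stripped URLs already queued

        $links = array(
            "http://bla.com/bli.html#1",
            "http://bla.com/bli.html#2",
            "http://bla.com/blu.html",
        );

        foreach ($links as $link)
        {
            $found_links[] = $link;         // report with anchor intact
            $stripped = strtok($link, "#"); // drop "#..." for queueing
            if (!isset($seen[$stripped]))
            {
                $seen[$stripped] = true;
                $request_queue[] = $stripped;
            }
        }

        print_r($found_links);   // 3 entries, anchors preserved
        print_r($request_queue); // 2 entries: bli.html and blu.html
        ?>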

     
  • Anonymous

    Anonymous - 2013-12-07

    Hi Uwe

    You are right. It's actually a lot trickier than that. It seems that the site's HTML is loaded using JavaScript, which is fine for browsers but makes web crawling difficult, as the crawler cannot run the JS code before looking for links anyway. So I'm looking at alternatives like Selenium or pjscape.

    Thanks a lot for your help dude.
    The software is great.

     
