
Problem with # links

Help
2012-10-14
2013-12-07
  • Nobody/Anonymous

    Hi,

    I have a problem with links containing "#", for example: http://www.site.com/hello#seek/me.

    First of all, when I create a follow rule that only checks the address "http://www.site.com/", I don't get any results for URLs with "#", only for ones without that symbol.

    The next problem is that even when the follow rule contains the address pattern (.*)(seek/me) without the #, I still don't get any results.

    Is there any interaction between the # used to open and close the regular expression and the # used in the link? If there is, how can I get the results?

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-14

    Hi!

    I don't know if I get you right, but e.g. "http://www.site.com/hello#seek/me" is the exact same page as "http://www.site.com/hello" (the # part is just an anchor). So if the crawler has already visited http://www.site.com/hello, it won't visit that URL again (why would it, it's the same page as I said).

    Is that maybe the reason why you don't get any results with #?

    And if you want to use a # in your follow-rules, you have to escape it (like "#http://www\.site\.com/hello\#seek#").

    Could you maybe post your follow-rules so I can understand your problem better?
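
    As a minimal sketch of that delimiter conflict (plain PHP PCRE, independent of the crawler's API):

        <?php
        // A literal "#" inside a "#...#"-delimited pattern would end the
        // expression too early, so it has to be escaped there.
        $url = "http://www.site.com/hello#seek/me";

        // Escaped "#" inside "#"-delimiters: matches.
        var_dump(preg_match("#http://www\.site\.com/hello\#seek#", $url)); // int(1)

        // Alternative: a different delimiter, then "#" needs no escaping
        // (but "/" does).
        var_dump(preg_match("/http:\/\/www\.site\.com\/hello#seek/", $url)); // int(1)
        ?>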

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-15

    Hi again,

    Completely ignore what I wrote before ;) Now I get it (took a while, sorry).

    OK, the crawler COMPLETELY ignores anchors in URLs/links. If the crawler finds a link with an anchor, only the URL WITHOUT the anchor goes to the URL cache (for the reason I mentioned above).
    So if you add a follow-rule containing such an anchor, the crawler can't find any matching URLs in the cache and stops (see the sketch after this post).

    Now I don't know if that's really the right approach (nobody has complained about this before), and the question is:
    Why do you want the crawler to only follow links containing these anchors?
    Again: what's the difference between the page "http://www.site.com/hi/prometheus/#seek/bestfilm/me" and "http://www.site.com/hi/prometheus/"? Is there a difference?
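
    For illustration, the normalization described above amounts to something like this sketch (a hypothetical helper, not the crawler's actual internal code):

        <?php
        // Strip the fragment ("#...") before a URL goes into the URL
        // cache, so all anchor variants collapse to one cache entry.
        function strip_fragment($url)
        {
            $pos = strpos($url, "#");
            return ($pos === false) ? $url : substr($url, 0, $pos);
        }

        echo strip_fragment("http://www.site.com/hello#seek/me"); // http://www.site.com/hello
        echo "\n";
        echo strip_fragment("http://www.site.com/hello");         // http://www.site.com/hello
        ?>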

     
  • kamil

    kamil - 2012-10-15

    In this case the problem is that the anchor marks the links that are results of your search, and those are mixed in with links that have nothing to do with the query, for example:

    query: best film

    links on page:

    \"…ww.site.com/hi/prometheus#seek/bestfilm/me\"

    \"…ww.site.com/hi/back-to-the-future#seek/bestfilm/me\"

    \"…ww.site.com/hi/matrix\"      (has nothig to do with search)

    \"…ww.site.com/hi/fightback#seek/bestfilm/me\"

    \"…ww.site.com/hi/the-300\"

    It isn't a good design, but I didn't code the page :)
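
    One possible workaround, sketched under the assumption of PHPCrawl 0.8's PHPCrawler class with its handleDocumentInfo() override and $DocInfo->source member: since the anchors never reach the URL cache, scan each received page's HTML yourself for the "#seek/" marker.

        <?php
        // Sketch: report result links (the ones carrying the "#seek/"
        // anchor) by scanning the raw source of every received page.
        class MyCrawler extends PHPCrawler
        {
            function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
            {
                // href values containing the "#seek/" result marker
                preg_match_all('#href="([^"]*\#seek/[^"]*)"#', $DocInfo->source, $matches);
                foreach ($matches[1] as $link)
                {
                    echo "Result link: " . $link . "\n";
                }
            }
        }

        $crawler = new MyCrawler();
        $crawler->setURL("http://www.site.com/hi/");
        $crawler->go();
        ?>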

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2012-10-15

    Hey chrustol,

    I'm sorry, but I'm confused ;) Sorry, it's Monday.

    What search do you mean? Can you maybe post the actual page/search/project you are dealing with?
    It's a little difficult (for me) to understand your problem.

     
  • Anonymous

    Anonymous - 2013-12-05

    Hi

    I'm also having the same issue with links on the Google Trader website
    http://www.google.com.gh/local/trader/.

    Is there a way to force the crawler not to filter out the anchor part of the URL?

    Thanks
    Donald

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-12-05

    Hey Donald,

    No, sorry, there is no easy way right now; that's because of the way the crawler works.

    But I'll open a feature request.

    Just to make sure that I understand this right:
    The problem is that the crawler doesn't return all anchor links to the same page in the array of links found, is that right?

    So if there are two links on a page, let's say bla.com/bli.html#1 and bla.com/bli.html#2, then you want them BOTH in the array of found links?

    But they DON'T both have to be followed, right? Following bla.com/bli.html (without the anchor) once is OK, right?

    I'm just asking to understand this.

    Thanks!

     
  • Anonymous

    Anonymous - 2013-12-05

    Hi Uwe

    Thank you for your reply.
    Ideally I would like the crawler to return AND follow all the anchor links as well.

    It's because of the way the Google Trader site is structured.
    A category, e.g. Computers and Software, looks like this: http://www.google.com.gh/local/trader/#!search:c=cat1789266239 and those are the links that need to be crawled.

    Currently the crawler does not pick up these links at all.
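
    One angle that might be worth trying, assuming the site supports Google's (since deprecated) AJAX crawling scheme for "#!" URLs: such a hash-bang URL can be rewritten into its "_escaped_fragment_" form, which is an ordinary URL a crawler can actually request.

        <?php
        // Sketch: rewrite "#!fragment" to "?_escaped_fragment_=fragment"
        // (URL-encoded), per Google's AJAX crawling scheme.
        function hashbang_to_escaped_fragment($url)
        {
            $pos = strpos($url, "#!");
            if ($pos === false) return $url;
            $base     = substr($url, 0, $pos);
            $fragment = substr($url, $pos + 2);
            $sep      = (strpos($base, "?") === false) ? "?" : "&";
            return $base . $sep . "_escaped_fragment_=" . rawurlencode($fragment);
        }

        echo hashbang_to_escaped_fragment(
            "http://www.google.com.gh/local/trader/#!search:c=cat1789266239"
        );
        // http://www.google.com.gh/local/trader/?_escaped_fragment_=search%3Ac%3Dcat1789266239
        ?>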

     
  • Uwe Hunfeld

    Uwe Hunfeld - 2013-12-06

    OK, now I understand the problem!

    Right now I don't know how to achieve this. How should the crawler know whether an anchor link is a page of its own (like your Google example) or a normal anchor link leading just to a section of a page?

    Simply following all anchor links is a bad solution; the crawler would request one and the same URL lots of times in most cases. (Think of documentation like SELFHTML, for example: it has lots of stuff in one huge page just separated by anchors, so there are hundreds of links like docs.com/reference.html#function1, docs.com/reference.html#function2, ... docs.com/reference.html#function1783, and the crawler would follow ALL of them although it's the exact same page every time.)

    So, any ideas maybe?
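
    The "return all, follow once" behaviour discussed above could look roughly like this (plain PHP, hypothetical variable names):

        <?php
        // Sketch: keep every anchor variant in the list of found links,
        // but deduplicate the request queue on the anchor-stripped URL.
        $found_links   = array(); // everything reported to the caller
        $request_queue = array(); // each page requested only once
        $seen          = array(); // stripped URLs already queued

        $links = array(
            "http://bla.com/bli.html#1",
            "http://bla.com/bli.html#2",
            "http://bla.com/blu.html",
        );

        foreach ($links as $link)
        {
            $found_links[] = $link;         // report with anchor intact
            $stripped = strtok($link, "#"); // drop "#..." for queueing
            if (!isset($seen[$stripped]))
            {
                $seen[$stripped] = true;
                $request_queue[] = $stripped;
            }
        }

        print_r($found_links);   // 3 entries, anchors preserved
        print_r($request_queue); // 2 entries: bli.html and blu.html
        ?>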

     
  • Anonymous

    Anonymous - 2013-12-07

    Hi Uwe

    You are right. It's actually a lot trickier than that. It seems that the site's HTML is loaded using JavaScript, which is fine for browsers but makes web crawling difficult, as the crawler cannot run the JS code before looking for links anyway. So I'm looking at alternatives like Selenium or pjscape.

    Thanks a lot for your help dude.
    The software is great.

     
