
#576 Is there a way to always get absolute URLs?

Backlog
open
nobody
2021-03-06
2020-11-26
Anonymous
No

Originally created by: gonssal

I wanted to know if there's a way to make Ferret always return absolute URLs when they are relative in the source code, like web browsers do.

I'm crawling a site by getting a bunch of href attribute values from different anchors into an array and then iterating that array to load and return the content I need from each of the URLs.

The problem is that some of the URLs are absolute (https://example.com/whatever) and others are relative (/whichever), so when I try to get a DOCUMENT from one of the relative URLs, I get the following error:

Failed to execute the query
failed to retrieve a document /whichever: Get /whichever: unsupported protocol scheme "": DOCUMENT(url) at 11:16: FORurlinurlsLETpropDoc=DOCUMENT(url)RETURN{...} at 10:1

I'd ideally want to run the entire process in a single FQL script, but I couldn't find a way to convert the relative URLs or make them work, so it seems my only option is to first return them to a Go program to be fixed and then run an additional data-gathering query on each of them.
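
For reference, the Go-side workaround described above can be sketched roughly as follows. This is a minimal sketch that assumes only Go's standard `net/url` package; the helper name `absolutize` and the sample URLs are illustrative and not part of Ferret.

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize is a hypothetical helper (not a Ferret API). It resolves every
// href against the URL of the page it was found on, so absolute hrefs pass
// through unchanged and relative ones become absolute.
func absolutize(pageURL string, hrefs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, fmt.Errorf("parse base %q: %w", pageURL, err)
	}
	out := make([]string, 0, len(hrefs))
	for _, href := range hrefs {
		// (*url.URL).Parse parses href in the context of base.
		abs, err := base.Parse(href)
		if err != nil {
			return nil, fmt.Errorf("resolve %q: %w", href, err)
		}
		out = append(out, abs.String())
	}
	return out, nil
}

func main() {
	urls, err := absolutize("https://example.com/some-url", []string{
		"https://example.com/whatever", // already absolute
		"/whichever",                   // relative to the site root
	})
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
	// Prints:
	// https://example.com/whatever
	// https://example.com/whichever
}
```

The resolved list could then be passed to a second query, which is exactly the extra round trip the issue is hoping to avoid.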

Discussion

  • Anonymous

    Anonymous - 2020-11-26

    Originally posted by: ziflex

    If it's relative, why don't you just concat it with a base url?

    doc.url + link.attributes.href
    
     
  • Anonymous

    Anonymous - 2020-11-26

    Originally posted by: gonssal

    If it's relative, why don't you just concat it with a base url?

    doc.url + link.attributes.href

    Because, as I explain in the issue (in the third paragraph specifically), there are both relative and absolute URLs.

     
  • Anonymous

    Anonymous - 2020-11-27

    Originally posted by: ziflex

    You can do something like this:

    LET href = link.attributes.href
    LET url = CONTAINS(href, "http") ? href : doc.url + href
    

    I might add helper functions for URL manipulation in a future release.

     
  • Anonymous

    Anonymous - 2020-11-27

    Originally posted by: gonssal

    I ended up using FIND_FIRST instead, thank you.

    I think it would be really nice to automatically convert all relative paths in href, src, etc. the same way web browsers do: if you hover over a link, the browser always shows the absolute URL it points to. Considering this is a crawling tool, I don't think relative URLs make a lot of sense.

    This is especially true for URI fragments. For example, if I'm on https://example.com/some-url and there's an anchor with href="#marker", with your proposed solution I'd get https://example.com/#marker instead of the correct https://example.com/some-url#marker.
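
To illustrate the browser-style behavior requested here, the sketch below resolves a few kinds of reference against the page URL using the reference-resolution rules implemented by Go's standard `net/url` package. It is only an illustration under that assumption: the function name `resolve` is hypothetical, the `other` and `//cdn.example.com/app.js` inputs are made-up examples, and nothing here reflects how Ferret handles URLs internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper: it resolves href against pageURL
// following standard URL reference resolution, as browsers do for links.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/some-url"
	for _, href := range []string{
		"#marker",                  // fragment only: keeps the page path
		"/whichever",               // root-relative path
		"other",                    // path relative to the current page
		"//cdn.example.com/app.js", // scheme-relative: inherits https
		"https://example.com/whatever",
	} {
		abs, err := resolve(page, href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-30s -> %s\n", href, abs)
	}
}
```

With the page URL above, `#marker` resolves to https://example.com/some-url#marker, which is the result the comment describes as correct.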

     
