Originally created by: gonssal
I wanted to know if there's a way to make Ferret always return absolute URLs when they are relative in the source code, like web browsers do.
I'm crawling a site by getting a bunch of href attribute values from different anchors into an array and then iterating that array to load and return the content I need from each of the URLs.
The problem is that some of the URLs are absolute (https://example.com/whatever) and others are relative (/whichever), so when I try to get a DOCUMENT from one of the relative URLs, I get the following error:
Failed to execute the query
failed to retrieve a document /whichever: Get /whichever: unsupported protocol scheme "": DOCUMENT(url) at 11:16: FORurlinurlsLETpropDoc=DOCUMENT(url)RETURN{...} at 10:1
I'd ideally want to run the entire process in a single FQL script, but I couldn't find a way to convert the relative URLs or make them work, so it seems my only option is to first return them to a Go program to be fixed and then run an additional data-gathering query on each of them.
Originally posted by: ziflex
If it's relative, why don't you just concat it with a base url?
Originally posted by: gonssal
Because, as I explain in the issue (in the third paragraph specifically), there are both relative and absolute URLs.
Originally posted by: ziflex
You can do something like this:
I might add helper functions for URL manipulation in a future release.
Originally posted by: gonssal
I ended up using FIND_FIRST instead, thank you.
I think it would be really nice to automatically convert all relative paths in href, src, etc. in the same way web browsers do: if you hover over a link, it always shows the absolute URL it points to. Considering this is a crawling tool, I don't think relative URLs make a lot of sense.
This is especially true for URI fragments. For example, if I'm on https://example.com/some-url and there's an anchor with href="#marker", with your proposed solution I'd get https://example.com/#marker instead of the correct https://example.com/some-url#marker.