I use the example PHP script to crawl a website, but it gets a 505 on most URLs that contain a space or Swedish characters (åäö). There is no error code in PHPCrawlerDocumentInfo. Here's an example.
Page requested: http://www.lidingo.se/lexportlet/binary/Hantering av ma.pdf?id=%7bFCFE-AB19-0CA2-D14E-DD13-A100%7d (505)
Error_code: 0
Occured:
String:
- header start -
GET /lexportlet/binary/Hantering av ma.pdf?id=%7bFCFE-AB19-0CA2-D14E-DD13-A100%7d HTTP/1.0
HOST: www.lidingo.se
User-Agent: PHPCrawl
Referer: http://www.lidingo.se/toppmeny/stadpolitik/handlingarochprotokoll/sammantraden.4.6339ad9913358eb933f80007.html?meetingId=%7BFCFC-9BE0-0CA2-FC3A-DD13-A0FE%7D&action=showMeeting&sv.url=12.6339ad9913358eb933f800012
Cookie: JSESSIONID=2BD0BAEF434353DDF79F25ED3E94296C
Connection: close
- header end -
My guess would be that the URL should be UTF-8 encoded or something, but shouldn't the fix for bug 3504517 have taken care of that?
Hi!
I will take a look at this soon.
I guess this doesn't have anything to do with bug 3504517 or URL-encoding.
If the URL gets encoded "wrong", the server normally responds with a "404 - not found".
… but maybe it does ;)
I'll let you know what's going wrong there.
Thanks!
Meanwhile, this is the code I was using.
http://pastebin.com/A8FnPHGn
Thanks!
I just figured out the problem with those links: they are partially URL-encoded and partially not.
Like this one: http://www.lidingo.se/lexportlet/binary/Hantering av ma.pdf?id=%7bFCFE-AB19-0CA2-D14E-DD13-A100%7d
Some characters are URL-encoded (%7b) and some aren't (the spaces).
So the crawler can't tell whether it's a URL-encoded link (is it?) and the request fails.
Added this to the buglist: 3530805
Will try to fix that.
Thanks for the report!
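For anyone who runs into the same kind of links: one possible workaround is to normalise them yourself before they are requested, escaping only what is still raw and leaving existing %XX sequences alone. The 505 itself is presumably the server tripping over the raw spaces in the request line. The snippet below is only a rough sketch of that idea. The function name normalize_partially_encoded_url() is made up for this example and is not part of PHPCrawl, and it assumes the link is already UTF-8 (a Latin-1 page would need converting first).

```php
<?php
// Hypothetical helper, NOT part of PHPCrawl: percent-encodes whatever is
// still raw in a partially encoded URL without double-encoding %XX parts.
// Assumption: $url is a UTF-8 byte string.
function normalize_partially_encoded_url(string $url): string
{
    // First alternative: a '%' that does not start a valid %XX escape.
    // Second alternative: any single byte that is neither an unreserved
    // nor a reserved URL character (spaces, bytes of å/ä/ö, braces, ...).
    // There is no /u flag, so multi-byte characters are matched and
    // encoded byte by byte.
    $pattern = '/%(?![0-9A-Fa-f]{2})|[^A-Za-z0-9\-._~:\/?#\[\]@!$&\'()*+,;=%]/';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );
}

$url = 'http://www.lidingo.se/lexportlet/binary/Hantering av ma.pdf'
     . '?id=%7bFCFE-AB19-0CA2-D14E-DD13-A100%7d';

echo normalize_partially_encoded_url($url), "\n";
// http://www.lidingo.se/lexportlet/binary/Hantering%20av%20ma.pdf?id=%7bFCFE-AB19-0CA2-D14E-DD13-A100%7d
```

The spaces become %20 while the existing %7b and %7d stay as they are. Structural characters such as / ? & = are deliberately left untouched, so a link that uses one of them as literal data would still be ambiguous.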