I noticed that URLs containing umlauts (ö, ä, ü and others) in percent-encoded form, such as %FC, were converted to \0x0FC.
These URLs were rejected by our indexer (Solr 3.5); the old Solr 1.4 still accepts them.
It would be helpful if these characters were left untouched and remained URL-encoded.
PS: PHPCrawl is a great tool. Thanks a lot. Heiko
Hey Heiko again,
could you please post an example website containing (working) links with umlauts, so I can work out what's going on there?
Thanks again and best regards,
huni.
Hi Huni,
You can try these documents:
http://www.v.tu-harburg.de/sifa/index.html?category=Gef%E4hrdungsbeurteilung
http://www.tu-harburg.de/ibb/Research/PublicationsZeng/2004%20IdentiCS%20%96%20Identification%20of%20coding%20sequence%20and%20in%20silico%20reconstruction%20of%20the%20metabolic%20network%20directly%20from%20unannotated%20low-coverage%20bacterial%20genome%20sequence.pdf
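For anyone following along, here is a minimal sketch (plain Python, not PHPCrawl code) of what the escapes in these two URLs decode to. The helper name `decode_escapes` and the list of candidate charsets are assumptions made for illustration. It suggests that `%E4` and `%96` are single-byte Latin-1/Windows-1252 escapes and are not valid UTF-8 sequences, which is probably why handling them depends on the charset.

```python
from urllib.parse import unquote_to_bytes


def decode_escapes(fragment: str) -> dict:
    """Decode the percent-escapes in `fragment` under several candidate charsets.

    Returns a dict mapping charset name to the decoded text,
    or to None when the bytes are not valid in that charset.
    """
    raw = unquote_to_bytes(fragment)  # "%E4" becomes the single byte 0xE4
    results = {}
    for charset in ("utf-8", "iso-8859-1", "cp1252"):
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


# Fragments taken from the two example URLs above
umlaut = decode_escapes("Gef%E4hrdungsbeurteilung")
dash = decode_escapes("IdentiCS%20%96%20Identification")

for name, decoded in (("umlaut", umlaut), ("dash", dash)):
    for charset, text in decoded.items():
        print(name, charset, repr(text))
```

The byte 0x96 has no printable meaning in ISO-8859-1 but is the en dash in Windows-1252, so the second URL was most likely produced from a Windows-1252 document.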
Hi Heiko,
I just noticed that PHPCrawl has some problems with links containing umlauts (and other special characters) in general.
HTTP requests for these URLs sometimes don't work as expected (depending on the character encoding of the referring document and other encoding issues).
I will try to fix that.
Thanks!
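
A rough sketch of one way a crawler could handle such links, written in Python purely for illustration. This is not how PHPCrawl does it, and `normalize_link`, its `document_charset` parameter, and the `example.org` URL are hypothetical. The idea is to leave existing percent-escapes untouched and to percent-encode raw non-ASCII characters using the charset of the referring document.

```python
from urllib.parse import quote

# Characters that keep their meaning in a URL and must not be escaped.
# "%" is included so that already-encoded sequences such as %E4 stay as they are.
URL_SAFE = "/:?#[]@!$&'()*+,;=%~"


def normalize_link(link: str, document_charset: str = "utf-8") -> str:
    """Percent-encode raw non-ASCII characters in `link`.

    Characters are encoded with `document_charset` (assumed to be the charset
    of the page the link was found on). Existing %XX escapes are preserved.
    Raises UnicodeEncodeError if a character does not exist in that charset.
    """
    return quote(link, safe=URL_SAFE, encoding=document_charset, errors="strict")


# An already-encoded link (from this thread) passes through unchanged
already = "http://www.v.tu-harburg.de/sifa/index.html?category=Gef%E4hrdungsbeurteilung"
print(normalize_link(already))

# A raw umlaut is encoded according to the referring document's charset
raw = "http://example.org/?category=Gefährdung"  # hypothetical URL
print(normalize_link(raw, "iso-8859-1"))
print(normalize_link(raw, "utf-8"))
```

Two limitations of this sketch: a stray literal `%` that is not part of an escape is passed through as-is, and non-ASCII host names would need IDNA encoding rather than percent-encoding.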