Hi
I'm trying to configure OSS as a search engine for our company file server. The plan is to have an intranet-wide search server for files. I have set up a web server (Apache 2.2) with mod_autoindex on the top-level directory, which is the crawl entry point for OSS. I configured a web crawl, because the file crawl is not suitable for my needs.
I have now run into the following problem:
When crawling a file or directory whose name contains a space, there seems to be a double URL encoding that makes it impossible to index those files and directories.
Example:
File: "/files/test space.txt" (legal filename)
Apache via mod_autoindex: Links to "http://fileserver/files/test%20space.txt" (which is correct)
OSS URL browser/query says: "http://fileserver/files/test%20space.txt" (which looks correct but may be decoded in the interface already and fooling me)
OSS crawler/fetcher calls: "http://fileserver/files/test%2520space.txt" (which is incorrect)
The "%" of "%20" is itself encoded as "%25", producing "%2520". The request then fails with a 404 Not Found, so the file is never fetched or parsed.
It looks like there is one URL-encoding step too many somewhere in the crawl/fetch cycle. On the server, "%2520" decodes back to "/files/test%20space.txt" on the filesystem, which of course does not exist.
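The suspected double encoding can be reproduced outside OSS. The following is a minimal sketch using Python's standard `urllib.parse`, purely for illustration: it is not OSS code, and the variable names are hypothetical. It only shows that percent-encoding an already-encoded path yields exactly the "%2520" seen in the fetcher's request.

```python
from urllib.parse import quote, unquote

# The real file on disk (name taken from the example above).
filename = "/files/test space.txt"

# What mod_autoindex emits: the space is percent-encoded once.
encoded_once = quote(filename)

# Encoding the already-encoded path again escapes the "%" itself as "%25".
encoded_twice = quote(encoded_once)

# A web server decodes the request path exactly once.
resolved_once = unquote(encoded_once)    # the existing file
resolved_twice = unquote(encoded_twice)  # a name that does not exist

print(encoded_once)
print(encoded_twice)
print(resolved_twice)
```

If the crawler stores the link as already encoded and the fetcher encodes it again before the request, the server ends up looking for a file literally named "test%20space.txt", which matches the 404.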
OSS version: 1.4 rc2 r2221
OS: FreeBSD 9.1 64bit
Java: openjdk7 64bit
Webserver: Apache 2.2.24
I don't know where to start looking, so any tips for pinpointing the problem are appreciated.
Thanks in advance
Last edit: Alex Weisel 2013-03-21
Hi Alex,
Thank you for the detailed feedback.
To reproduce the issue, we have configured a directory using mod_autoindex with two files:
http://www.open-search-server.com/ftp/IndexTest/
There is nothing to fix on the Apache side. We are investigating. Stay tuned.