
Problem with whitespace/percent double encoding

Help
2013-03-21
2013-03-22
  • Alex Weisel

    Alex Weisel - 2013-03-21

    Hi
    I'm trying to configure OSS as a search engine for our company file server. The plan is to have an intranet-wide search server for files. I have set up a web server (Apache 2.2) with mod_autoindex on the top-level directory, which is the crawl entry point for OSS. I configured a web crawl, because the file crawl is not suitable for my needs.

    Now I have encountered the following problem:
    When crawling a file or directory whose name contains whitespace, there seems to be a double URL encoding that makes it impossible to index such directories/files.

    Example:
    File: "/files/test space.txt" (legal filename)

    Apache via mod_autoindex: links that to "http://fileserver/files/test%20space.txt" (which is correct)

    OSS URL browser/query says: "http://fileserver/files/test%20space.txt" (which looks correct, but the interface may already be decoding it and fooling me)

    OSS crawler/fetcher calls: "http://fileserver/files/test%2520space.txt" (which is incorrect)
    The % of %20 is encoded to %25, producing %2520. The request then returns a 404 Not Found, so the file is never fetched or parsed.

    It looks like there is one URL encoding step too many somewhere in the crawl/fetch cycle. The %2520 decodes back to "/files/test%20space.txt" on the filesystem, which of course does not exist.
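
The double encoding described above can be reproduced outside OSS. The following is a minimal sketch in Python using only the standard library; it is not the OSS crawler code, and `encode_once` is a hypothetical helper added purely for illustration. It assumes a literal "%" in a file name is never followed by two hex digits, because in that case decoding first would change the name.

```python
from urllib.parse import quote, unquote

# The file path from the example in this post.
path = "/files/test space.txt"

# One encoding pass: the space becomes %20 (what mod_autoindex links to).
encoded_once = quote(path)

# A second pass treats the "%" itself as data and turns it into %25,
# which yields the %2520 seen in the crawler/fetcher request.
encoded_twice = quote(encoded_once)

# The web server decodes only once, so it looks for a file whose name
# literally contains "%20" and answers 404.
decoded_by_server = unquote(encoded_twice)


def encode_once(maybe_encoded_path: str) -> str:
    """Hypothetical helper: encode a path that may already be encoded.

    Decoding before encoding makes the operation idempotent, so calling
    it on an already-encoded link does not produce %25 sequences.
    """
    return quote(unquote(maybe_encoded_path))


print(encoded_once)
print(encoded_twice)
print(decoded_by_server)
print(encode_once(encoded_once))
```

Comparing the printed values against the request line in the Apache access log is one way to check which step adds the extra encoding pass.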

    OSS version: 1.4 rc2 r2221
    OS: FreeBSD 9.1 64bit
    Java: openjdk7 64bit
    Webserver: Apache 2.2.24

    Don't know where to start looking - any tips to pinpoint the problem are appreciated.
    Thanks in advance

     

    Last edit: Alex Weisel 2013-03-21
  • Emmanuel Keller

    Emmanuel Keller - 2013-03-22

    Hi Alex,
    Thank you for the detailed feedback.
    To reproduce the issue, we have set up a directory served by mod_autoindex with two files:
    http://www.open-search-server.com/ftp/IndexTest/
    There is nothing to fix on the Apache side. We are investigating. Stay tuned.

     
